With standard tools, setting up a Hadoop cluster on your own machines still involves a lot of manual labor. This is annoying the first time you have to do it, but it's even worse when a cluster (such as a test system) needs to be set up repeatedly or machines enter and leave the cluster often.
The good news is that the tools Foreman, Puppet and Ambari allow you to automate this process to a very large extent. Here we give a quick explanation of how this is done and how you can set up the infrastructure to automatically provision a Hadoop cluster on bare metal.
This is the second article in a series on the automation of Hadoop cluster provisioning and configuration management. In the first we described how to deploy your own virtual Hadoop cluster. You might want to start with the first article, in particular if you are not familiar with Ambari.
Prerequisites
- A Puppet-enabled Foreman server that is ready to use in your desired infrastructure, including running DNS, TFTP and DHCP servers. (We are currently using Foreman version 1.4.2.) Foreman helps manage servers through their lifecycle, from provisioning and configuration to orchestration and monitoring. With Puppet you can easily automate repetitive tasks and quickly deploy applications. You can find further information about such a Foreman server here and how to install such a server here. (A minimal installer sketch follows after this list.)
- A set of machines, discovered by the Foreman server. These machines need no operating system or anything else preinstalled. Any number of machines is possible and you can add additional machines at any time. Caution: The state of these machines will be lost, including the contents of their hard drives.
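If you still need such a Foreman server, the foreman-installer can set up Foreman with Puppet and a smart proxy for DNS, DHCP and TFTP in one go. This is only a rough sketch: the interface name, the zone and the exact flag names are assumptions for our setup – check foreman-installer --help for your version.

```bash
# Rough sketch: install Foreman plus a smart proxy managing DNS, DHCP and TFTP.
# Interface, zone and flag names are assumptions for our environment;
# verify them with `foreman-installer --help` for your Foreman version.
sudo foreman-installer \
  --foreman-proxy-tftp=true \
  --foreman-proxy-dhcp=true \
  --foreman-proxy-dhcp-interface=eth0 \
  --foreman-proxy-dns=true \
  --foreman-proxy-dns-interface=eth0 \
  --foreman-proxy-dns-zone=local.cloud
```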
Configuring Foreman
Now you can configure the provisioning, i.e. tell Foreman what kind of operating system and services you want on the machines. All steps except the first and second take place in the Foreman user interface.
- Needed Files: First you need to download the files that describe the installation of Ambari server and agent (the Hadoop management and monitoring tool). To do this, log into your Foreman server and do the following.
```bash
sudo mkdir /etc/puppet/environments/ambari_dev
cd /etc/puppet/environments/ambari_dev
sudo curl "http://vzach.de/data/ambari-provisioning.zip" -o "ambari-provisioning.zip"
sudo unzip ambari-provisioning.zip
```
- Puppet Config: Add the following lines to the end of your Puppet configuration file (/etc/puppet/puppet.conf) so that Foreman can find the downloaded files. (A small verification sketch follows after this list.)

```
[ambari_dev]
    modulepath = /etc/puppet/environments/ambari_dev/modules
    config_version =
```
- Operating System: Go to Hosts/Operating systems, click on “New Operating system” and choose the configuration as shown below. This enables the provisioning of CentOS.
- Provisioning Templates: Go to Hosts/Provisioning templates and do the following for “Kickstart default” and “Kickstart default PXELinux”. These templates automate the CentOS installation, including the installation of Puppet.
- Click on the entry, go to the tab “Association” and check the box at “CentOS 6.5”.
- Go back to the operating system you configured and select each template for it.
- Puppet Environment: Go to Configure/Environments and click on “Import from …” (your Puppet master should show up there). Now check the entry “ambari_dev” and click on “Update”. This imports the Puppet files that you downloaded in step one.
- Host Group – Ambari Server: Go to Configure/Host groups, click on “New Host Group” and choose the configuration as shown below. (Some entries depend on your own infrastructure: Puppet CA, Puppet Master, Domain, Subnet, (new) Root Password.)
This combines the settings for a special group of machines. Here we define that every machine in the Ambari server group actually runs an Ambari server and agent. Usually you will only need one server in this group.
- Host Group – Ambari Agent: Create the “ambari_agent” host group like in the previous step (the OS and network configuration stay the same). Every machine other than the one with the Ambari server will be in this group. The location of services from the Hadoop stack will be defined in Ambari itself; therefore every machine other than the Ambari server is provisioned in the same way.
- Default Values: Go to Configure/Puppet classes, click the entry “ambari_server”, choose the tab “Smart Class Parameter” and click on the entry “ownhostname”. Now enter “ambariserver.” followed by the domain of your ambari_server host group (in our case “ambariserver.local.cloud”) and submit the update. Do the same for the class “ambari_agent” with the parameter “serverhostname” and the same value (“ambariserver…”).
By default the only Ambari server you’ll need will be reachable at that name, so you don’t have to configure it every time you add a new machine to the cluster. (It is possible to override this value, though.)
- Smart Values: For the class “ambari_agent” and the parameter “ownhostname”, enter the value <%= @host.fqdn %>. This trick only works when you disable “safemode_render” under Administer/Settings/Provisioning/safemode_render (set it to false). It allows you to automatically parametrize the Puppet files with the hostname of a new machine.
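Before moving on, you can double-check the Puppet side of the first two steps on the Foreman server: the downloaded modules should be in place and the Puppet master should know the new environment. A minimal sketch, assuming a Puppet 3 master; the service name depends on how your master is run.

```bash
# Check that the Ambari provisioning modules were unpacked correctly
ls /etc/puppet/environments/ambari_dev/modules

# List the modules Puppet sees under the new environment's module path
puppet module list --modulepath /etc/puppet/environments/ambari_dev/modules

# Restart the Puppet master so it picks up the new [ambari_dev] section
# (use httpd/apache2 instead if your master runs under Passenger)
sudo service puppetmaster restart
```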
Setting up the machines
Starting the machines now is quite easy:
- Choose one of your discovered hosts and click on “Provision”. Now enter the name ambariserver and choose the host group “ambari_server”. Everything else is filled in automatically. Submit and make sure the chosen machine is now restarting.
You have just started to provision a machine with an Ambari server and an additional Ambari agent on the same node. After this process is done, you could already start to configure your cluster of one machine with Ambari; generally, however, you will want to add more machines.
- Additional machines can be provisioned with the “ambari_agent” host group and names of your choice. You can repeat this step whenever you want to add new machines to an existing cluster. Once the Puppet runs have finished, you can check the Ambari services as sketched below.
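After provisioning and the first Puppet runs have completed, a quick sanity check on the machines themselves shows whether the Ambari services came up. This assumes the packages were installed as intended by the downloaded Puppet modules:

```bash
# On the ambariserver machine: both the server and an agent should be running
sudo ambari-server status
sudo ambari-agent status

# On every ambari_agent machine: only the agent runs here
sudo ambari-agent status
```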
Configure your Hadoop Cluster
As described in our virtual provisioning blog post, you can now go to the Ambari server user interface (port 8080) and continue to install your Hadoop cluster through its graphical interface. Keep in mind that the hostnames now depend on your own configuration. The “manual registration on the hosts” also shouldn’t bother you here; again, we’ve already configured this for you.
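Before clicking through the cluster install wizard, you can also confirm via Ambari’s REST API that the server is reachable and that the agents have registered. A small sketch using the default admin/admin credentials and our example hostname – adjust both to your setup.

```bash
# List the hosts whose Ambari agents have registered with the server
# (default credentials are admin/admin unless you changed them)
curl -u admin:admin http://ambariserver.local.cloud:8080/api/v1/hosts
```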
Conclusion
With the tools shown here, you can automate the provisioning of machines for your Hadoop cluster – enabling you to save effort in operations and to be more daring when trying out new configurations (after all, you can just set up the entire infrastructure in a few hours with very little manual effort). The configuration shown here is even portable to virtual machines and can thus be used to create minimal clusters on developer machines or for automated testing.
The approaches shown here go a long way towards realizing the Infrastructure as Code vision for Hadoop – i.e. the description of the entire Hadoop cluster in configuration files that can be maintained and managed together with the rest of the source code. Configuration files that enable everyone – be they dev or ops – to quickly set up an identical infrastructure automatically.
The only missing component is the configuration of the actual Hadoop services using Ambari – but even this can be automated (we will look at this some other time).
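For the impatient: one way to automate that last step is Ambari’s Blueprints REST API, which lets you describe the service layout and configuration as JSON and submit it instead of clicking through the wizard. This is only a rough sketch – the blueprint name, cluster name and JSON files are placeholders, and your Ambari version must support blueprints.

```bash
# Register a blueprint that describes which services run in which host groups
# (my-blueprint.json is a placeholder for your own blueprint definition)
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d @my-blueprint.json \
  http://ambariserver.local.cloud:8080/api/v1/blueprints/my-blueprint

# Create a cluster from that blueprint by mapping real hosts to the host groups
# (my-cluster.json is a placeholder for your cluster creation template)
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d @my-cluster.json \
  http://ambariserver.local.cloud:8080/api/v1/clusters/mycluster
```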
Authors
Valentin Zacharias and Malte Nottmeyer