The SMACK stack is all the rage these days. Instead of just talking about it, this post guides you through the steps for setting up a simple SMACK stack so you can get hands-on experience with the tools.
In the first step, we will set up a DC/OS-managed Mesos datacenter on AWS. This provides the basic infrastructure on which we will then install Kafka, Cassandra and Spark. Then we are going to deploy simple ingestion and digestion jobs that pump data right across the toolchain.
Step 0: Prerequisites
If you want to try the examples yourself, you’ll need the following:
- Vagrant and VirtualBox
- Ansible
- An AWS account. The AWS resources used in this example exceed the free tier, so regular AWS fees apply. You have been warned!
- An EC2 key pair called dcos-intro
- A Twitter access key
The source code is available at https://github.com/ftrossbach/intro-to-dcos.
Step 1: sMack stands for Mesos
DC/OS is Mesosphere’s distributed operating system on top of Apache Mesos. Version 1.7 has just been released as open source. DC/OS provides AWS CloudFormation templates to set up the datacenter. The following diagram gives a high-level overview of the resources that are created for the example in this blog post:
In the public subnet that is reachable from the internet, we can see a master node and a public Mesos slave. One master is sufficient here as there is no need for high availability (HA); DC/OS also provides templates that provision three masters and thus offer HA capability. The private subnet is only accessible from the public subnet. It contains five Mesos slaves that will run any application that does not need exposure to the outside world. If you want to know more, have a look at my colleague Bernd Zuther’s GitHub repository, where he gives a more detailed overview of DC/OS. He also transformed the CloudFormation template into a Terraform template that is easier to read.
To get the stack running, we need to check out the project from GitHub. We also have to provide AWS credentials in the form of the following environment variables:
- DCOS_INTRO_AWS_REGION
- DCOS_INTRO_AWS_ACCESS_KEY
- DCOS_INTRO_AWS_SECRET_KEY
Calling vagrant up then sets up a virtual Ubuntu machine that serves as our gateway to DC/OS. After the machine is created, the Ansible CloudFormation task takes the template and creates the stack in AWS. This happens synchronously, so it will take a couple of minutes. After Ansible is done, we can take a look at our brand-new instances.
Ansible also prints the URL of the master load balancer:
TASK [dcos-cli : This is the URL of your master ELB] ****************
ok: [default] => {
"msg": "http://dcos-test-ElasticL-***.elb.amazonaws.com"
}
The most important URLs for accessing your datacenter (with <master-elb> standing for the ELB hostname printed above) are:
- DC/OS dashboard: http://<master-elb>/
- The task scheduling framework Marathon: http://<master-elb>/marathon
- Plain Mesos: http://<master-elb>/mesos
Let’s take a look at our brand-new DC/OS dashboard:
As we have not yet deployed any applications, all resources are still up for grabs.
Now that we have Mesos running in the form of DC/OS, we are going to install the remaining SMACK infrastructure components.
Step 2: smaCk stands for Cassandra
Ansible has already installed the DC/OS CLI in our Vagrant box. The DC/OS CLI is a command line interface for managing applications in the datacenter. From within the Vagrant box (use vagrant ssh), the Cassandra Mesos framework can be installed with one simple shell command:
dcos package install cassandra
This will start three Cassandra nodes with some defaults for memory and disk requirements. Configuration options are described at https://dcos.io/docs/1.7/usage/tutorials/cassandra/.
If you look at the DC/OS dashboard, you might see that Cassandra reports as unhealthy. It should turn healthy once all Cassandra nodes are up and running. You can also see that some resources have now been allocated in DC/OS.
Setting up a Cassandra keyspace is unfortunately not yet possible via the DC/OS CLI. We need to log onto one of our master nodes using SSH and the private key of the key pair “dcos-intro”:
ssh core@<mesos_master_ip> -i dcos-intro.pem
We use Docker to start a cqlsh Cassandra command line:
docker run -ti cassandra:2.2.5 cqlsh node-0.cassandra.mesos
Once in cqlsh, we can create the required keyspace and table:
CREATE KEYSPACE IF NOT EXISTS dcos WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'};
CREATE TABLE IF NOT EXISTS dcos.tweets(date timestamp, text text, PRIMARY KEY (date, text));
Yes, this is going to be yet another Twitter-based example.
Step 3: smacK stands for Kafka
As with Cassandra, we are going to use command line tools to set up Kafka.
Issuing the command
dcos package install --package-version=0.9.4.0 kafka
installs the DC/OS Kafka service. We are using version 0.9.4.0 of this service because the most recent version, released with DC/OS 1.7, caused some incompatibilities with the frameworks used in this example.
In contrast to Cassandra, installing the framework does not automatically start any Kafka brokers. We are going to use the DC/OS Kafka command line to add and start them:
dcos kafka broker add 0..2 --port 9092
This command adds three brokers – broker-0, broker-1 and broker-2 – that are supposed to run on port 9092. Further configuration options are available at https://docs.mesosphere.com/usage/services/kafka/.
Adding brokers is not the same as starting them. Issuing
dcos kafka broker start 0..2
does that job.
We also use the CLI to create a Kafka topic, very creatively called “topic”, with ten partitions and a replication factor of three:
dcos kafka topic add topic --partitions 10 --replicas 3
Step 4: Smack stands for Spark
You can probably guess by now how to install Spark, but here it is anyway:
dcos package install spark
This sets up Spark on Mesos in so-called cluster mode. Spark jobs will be submitted to a MesosClusterDispatcher, which distributes them on demand to available cluster resources. There are no dedicated Spark executors!
This completes the installation of the SMACK infrastructure. A look at the Mesos web UI reveals all running tasks.
The example application
By now we have all the infrastructure available to run a SMACK application, so let’s just do that.
The example application streams tweets via Kafka and Spark into Cassandra. It is logically divided into two parts:
- data ingestion
- data digestion
We’ll see more about how these parts are set up in the following sections.
Step 5: smAck stands for Akka – data ingestion
The application reads tweets using the twitter4j library and pushes them to Kafka using Akka Streams and the Reactive Kafka extension. The code is available in the GitHub repository, and we will not go into detail on it here, as it is a straightforward Akka Streams implementation (a minimal sketch follows at the end of this step). Much more interesting is the deployment aspect. The application is packaged as a Docker container and published to a public Docker registry at https://hub.docker.com/r/ftrossbach/dcos-intro/. To deploy the app in our datacenter, we use DC/OS’ scheduler Marathon and the DC/OS CLI. A new deployment in Marathon is defined in JSON. For the ingestion, we use the following configuration:
{
  "id": "ingestion",
  "container": {
    "docker": {
      "forcePullImage": true,
      "image": "ftrossbach/dcos-intro",
      "network": "BRIDGE",
      "privileged": false
    },
    "type": "DOCKER"
  },
  "cpus": 1,
  "mem": 2048,
  "env": {
    "KAFKA_HOSTS": "broker-0.kafka.mesos",
    "KAFKA_PORT": "9092",
    "twitter4j.debug": "true",
    "twitter4j.oauth.accessToken": "****",
    "twitter4j.oauth.accessTokenSecret": "****",
    "twitter4j.oauth.consumerKey": "****",
    "twitter4j.oauth.consumerSecret": "****"
  }
}
In this file, we specify the Docker container we want to deploy, some settings like the number of CPUs and the amount of memory, and most importantly some environment variables for the container.
We trigger the deployment by issuing
dcos marathon app add /vagrant/provisioning/ingestion.json
– you will need to include valid Twitter credentials in that file.
If you want to check the status of the application, you can take a look at Marathon at http://<master-elb>/marathon. The plain Mesos UI at http://<master-elb>/mesos also has this information, and you have access to log files in both UIs.
The Marathon page for the ingestion application is reachable at http://<master-elb>/marathon/ui/#/apps/ingestion
If the application reports as “running”, you’re fine. If it is stuck in a deploy-wait cycle, check the stderr logs.
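To round off the ingestion side, here is a minimal sketch of what such an Akka Streams pipeline could look like. This is not the exact code from the repository: the object name, the buffer size and the bridging of twitter4j’s callback API via Source.queue are illustrative assumptions; only the topic name “topic” and the KAFKA_HOSTS/KAFKA_PORT environment variables come from this post.

import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
import twitter4j.{Status, StatusAdapter, TwitterStreamFactory}

object Ingestion extends App {
  implicit val system = ActorSystem("ingestion")
  implicit val materializer = ActorMaterializer()

  // Kafka connection details come from the environment variables set in the Marathon JSON
  val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers(s"${sys.env("KAFKA_HOSTS")}:${sys.env("KAFKA_PORT")}")

  // A buffered source that feeds tweet texts into a Kafka producer sink
  val queue = Source.queue[Status](bufferSize = 1000, OverflowStrategy.dropHead)
    .map(status => new ProducerRecord[String, String]("topic", status.getText))
    .to(Producer.plainSink(producerSettings))
    .run()

  // Bridge twitter4j's callback API into the stream; twitter4j picks up its
  // credentials from the twitter4j.oauth.* properties
  val twitterStream = new TwitterStreamFactory().getInstance()
  twitterStream.addListener(new StatusAdapter {
    override def onStatus(status: Status): Unit = queue.offer(status)
  })
  twitterStream.sample() // start consuming the public sample stream
}

Producer.plainSink is the Reactive Kafka sink; it applies back pressure towards the queue, which simply drops the oldest tweets if Kafka cannot keep up.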
Step 6: Smack stands for Spark (yet again…) – data digestion
The final step is running the data digestion. With Kafka we have a distributed, high-performance message broker, but we need to do something with the data, as Kafka itself does not offer any analytic capabilities whatsoever. We use a Spark Streaming job that is also included in the example (de.codecentric.dcos_intro.spark.SparkJob). To keep things uncomplicated, this example does not really make use of the true power of Spark but plainly writes the tweets it reads from Kafka into a Cassandra table; the sketch below illustrates the idea.
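As a rough illustration, a minimal version of such a job could look like the following sketch. This is not the repository’s exact implementation: it assumes the Spark 1.x Kafka direct stream API and the Spark Cassandra connector, and the one-second batch interval is an arbitrary choice; the command line arguments match the ones we pass via dcos spark run below.

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkJob {
  def main(args: Array[String]): Unit = {
    // Arguments as passed on submission: topic, Cassandra host and port, Kafka broker list
    val Array(topic, cassandraHost, cassandraPort, kafkaBrokers) = args

    val conf = new SparkConf()
      .setAppName("dcos-intro")
      .set("spark.cassandra.connection.host", cassandraHost)
      .set("spark.cassandra.connection.port", cassandraPort)
    val ssc = new StreamingContext(conf, Seconds(1))

    // Read the raw tweet texts directly from the Kafka brokers
    val tweets = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> kafkaBrokers), Set(topic))

    // Write (date, text) rows into the dcos.tweets table created earlier
    tweets
      .map { case (_, text) => (new java.util.Date(), text) }
      .saveToCassandra("dcos", "tweets", SomeColumns("date", "text"))

    ssc.start()
    ssc.awaitTermination()
  }
}

saveToCassandra comes from the Spark Cassandra connector and writes each micro-batch into the dcos.tweets table we created in step 2.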
To deploy a Spark job using the DC/OS CLI, the jar must be available in a place where DC/OS can access it (see https://dcos.io/docs/1.7/usage/tutorials/spark/). A public file on AWS S3 will do the trick for us. Once uploaded, the job can be triggered by issuing
dcos spark run --submit-args=' \
--class de.codecentric.dcos_intro.spark.SparkJob \
https://s3.eu-central-1.amazonaws.com/dcos-intro/dcos-intro-assembly-1.0.jar \
topic node-0.cassandra.mesos 9042 \
broker-0.kafka.mesos:9092'
The Spark Mesos dispatcher has a web UI that lets us monitor the job status. It is accessible at http://<master-elb>/service/spark.
Once the job is running, we can log into Cassandra using the same Docker command we used to create the keyspace and query the tweets table, for instance with SELECT * FROM dcos.tweets LIMIT 10;
Conclusion and outlook
Congratulations – you have got the SMACK stack up and running and have deployed an application on it.
This post gave you the basic steps to set up the stack on AWS and run simple ingestion and digestion jobs. This is great for exploring the features of DC/OS and how it enables you to run a fast data application, but of course many things were left out – our example application is neither fast data nor big data.
If you want to set up the stack for production, you might also want to:
- check out the brand new DC/OS homepage
- read up on Marathon
- explore service discovery via Mesos DNS
- think about the number of Kafka and Cassandra nodes as well as their memory requirements
- restrict access to DC/OS – with this template, the master nodes are accessible to everyone on the internet
- explore the options of the DC/OS CLI (dcos --help from within the Vagrant box)
- have a look at other official DC/OS packages in the Mesosphere universe
In order to clean up, you can delete the stack in the AWS CloudFormation console. This will destroy all created resources.