Data Processing with SMACK: Spark, Mesos, Akka, Cassandra, and Kafka

Alibaba Cloud

5.00/5 (3 votes)

Jan 25, 2017

CPOL

11 min read

10824

This article introduces the SMACK (Spark, Mesos, Akka, Cassandra, and Kafka) stack and illustrates how you can use it to build scalable data processing platforms

Data Processing with SMACK: Spark, Mesos, Akka, Cassandra, and Kafka

This article introduces the SMACK (Spark, Mesos, Akka, Cassandra, and Kafka) stack and illustrates how you can use it to build scalable data processing platforms. While the SMACK stack is really concise and consists of only several components, it is possible to implement different system designs within it which list not only purely batch or stream processing, but also contain more complex Lambda and Kappa architectures as well.

What is the SMACK stack?

First, let’s talk a little bit about what SMACK is. Here’s a quick rundown of the technologies that are included in it:

Spark - a fast and general engine for distributed large-scale data processing.

Mesos - a cluster resource management system that provides efficient resource isolation and sharing across distributed applications.

Akka - a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.

Cassandra - a distributed highly available database designed to handle large amounts of data across multiple datacenters.
Kafka - a high-throughput, low-latency distributed messaging system/commit log designed for handling real-time data feeds.

Storage layer: Cassandra

Although not in alphabetical order, first let’s start with the C in SMACK. Cassandra is well known for its high-availability and high-throughput characteristics and is able to handle enormous write loads and survive cluster nodes failures. In terms of CAP theorem, Cassandra provides tunable consistency/availability for operations.

What is most interesting here is that when it comes to data processing, Cassandra is linearly scalable (increased loads could be addressed by just adding more nodes to a cluster) and it provides cross-datacenter replication (XDCR) capabilities. Actually XDCR provides not only data replication but a set of use cases, including:

Geo-distributed datacenters handling data specific for the region or located closer to customers.
Data migration across datacenters: recovery after failures or moving data to a new datacenter.
Separate operational and analytics workloads.

However, all these features come for their own price and with Cassandra this price is its data model. This could be thought of just as a nested sorted map distributed across cluster nodes by partition key and entries sorted/grouped by clustering columns. An example is provided below:

To get specific data in some range, the full key must be specified and no range clauses allowed except for the last column in the list. This constraint is introduced to limit multiple scans for different ranges which may produce random access to disks and lower down the performance. This means that the data model should be carefully designed against read queries to limit the amount of reads/scans which leads to lesser flexibility when it comes to support of new queries.

But what if one has some tables that need to be joined somehow with another tables? Let's consider the next case: calculate total views per campaign for a given month for all campaigns.

With a given model, the only way to achieve this goal is to read all campaigns, read all events, sum the properties (with matched campaign IDs) and assign them to campaigns. It can be really challenging to implement such applications because the amount of data stored in Casandra may be huge and exceed the memory capacity. Therefore, such sort of data should be processed in a distributed manner and Spark perfectly fits this type of use cases.

Processing layer: Spark

The main abstraction Spark operates with is RDD (Resilient Distributed Dataset, a distributed collection of elements) and the workflow consists of four main phases:

RDD operations (transformations and actions) form DAG (Direct Acyclic Graph).
DAG is split into stages of tasks which are then submitted to the cluster manager.
Stages combine tasks which don't require shuffling/repartitioning.
Tasks run on workers and results then return to the client.

Here's how one can solve the above problem with Spark and Cassandra:

Interaction with Cassandra is performed via Spark-Cassandra-connector, which makes the whole process easy and straightforward. There's one more interesting option to work with NoSQL stores and that’s SparkSQL, which translates SQL statements into a series of RDD operations.

With several lines of code it's possible to implement naive Lambda design which of course could be much more sophisticated, but this example shows just how easy this can be achieved.

Almost MapReduce: bringing processing closer to data

Spark-Cassandra connector is data-locality aware and reads the data from the closest node in a cluster, thus minimizing the amount of data transferred over the network. To fully facilitate Spark-C* connector data locality awareness, Spark workers should be collocated with Cassandra nodes.

The above image illustrates Spark collocation with Cassandra. It makes sense to separate your operational (or Write-heavy) cluster from one for analytics. Here’s why:

Clusters can be scaled independently.
Data is replicated by Cassandra, with no extra-work needed.
The analytics cluster has different Read/Write load patterns.
The analytics cluster could contain additional data (for example, dictionaries) and processing results.
Spark resources impact is limited to only one cluster.

Let's look at the Spark application deployment options one more time:

As can be seen above, there are three main options available for the cluster resource manager:

Spark Standalone — Spark (as the master node) and workers are installed and executed as standalone applications (which obviously introduces some overhead and supports only static resource allocation per worker).
YARN — Works very good if you already have Hadoop.
Mesos — From the beginning, Mesos was designed for dynamic allocation of cluster resources, not only for running Hadoop applications but for handling heterogeneous workloads.

Mesos architecture

The M in SMACK stands for the Mesos architecture. A Mesos cluster consists of master nodes which are responsible for resource offerings and scheduling, and slave nodes which do the actual heavy lifting in the task execution.

In HA mode with multiple master nodes, ZooKeeper is used for leader election and service discovery. Applications executed on Mesos are called frameworks and utilize APIs to handle resource offers and submit tasks to Mesos. Generally the task execution process consists of the following steps:

Slave nodes provide available resources to the master node.
The master node sends resource offers to frameworks.
The scheduler replies with tasks and resources needed per task.
The master node sends tasks to slave nodes.

Bringing Spark, Mesos and Cassandra together

As said before Spark workers should be collocated with Cassandra nodes to enforce data locality awareness, thus lowering the amount of network traffic and Cassandra cluster load. Here's one of the possible deployment scenarios on how to achieve this with Mesos:

Mesos master nodes and ZooKeepers are collocated.
Mesos slave nodes and Cassandra nodes are collocated to enforce better data locality for Spark.
Spark binaries are deployed to all worker nodes and spark-env.sh is configured with proper master endpoints and executor JAR location.
The Spark executor JAR is uploaded to S3/HDFS.

With provided setup, the Spark job can be submitted to the cluster with simple spark-submit invocation from any worker nodes having Spark binaries installed and assembly JAR containing actual job logic uploaded.

There exist options to run Dockerized Spark, so that there's no need to distribute binaries to every single cluster node.

Scheduled and Long-running Task Execution

Every data processing system sooner or later faces the necessity of running two types of jobs: scheduled/periodic jobs like periodic batch aggregations and long-running ones which are the case for stream processing. The main requirement for both of these types is fault tolerance - jobs must continue running even in case of cluster nodes failures. Mesos ecosystem comes with two outstanding frameworks supporting each of this types of jobs. Marathon is a framework for fault-tolerant execution of long-running tasks supporting HA mode with ZooKeeper, able to run Docker and having a nice REST API. Here's an example of using the shell command to run spark-submit for simple job configuration:

Chronos has the same characteristics as Marathon but is designed for running scheduled jobs and in general it is distributed HA cron supporting graphs of jobs. Here's an example of S3 compaction job configuration which is implemented as a simple bashscript:

There are plenty of frameworks already available or under active development (such as Hadoop, Cassandra, Kafka, Myriad, Storm and Samza) which are targeted to integrate widely used systems with Mesos resource management capabilities.

Ingesting the Data

Up till now we have the storage layer designed, resource management set up and jobs are configured. The only thing which is not there yet is the data to process:

Assuming that incoming data will arrive at high rates, the endpoints which will receive it should meet the following requirements:

Provide high throughput/low latency
Be resilient
Allow easy scalability
Support back pressure

On second thoughts, back pressure is not a must, but it would be nice to have this as an option to handle load spikes. Akka perfectly fits the requirements and basically it was designed to provide this feature set. Here is a quick run-down of the benefits you can expect to get from Akka:

Actor model implementation for JVM
Message-based and asynchronous
Enforcement of the non-shared mutable state
Easily scalable from one process to cluster of machines
Actors form hierarchies with parental supervision
Not only concurrency framework: akka-http, akka-streams, and akka-persistence

Here's a simplified example of three actors which handle JSON HttpRequest, parse it into the domain model case class, and save it to Cassandra:

It looks like only several lines of code are needed to make everything work, but while writing raw data (events) to Cassandra with Akka, the following problems may be caused:

Cassandra is still designed for fast serving but not batch processing, so pre-aggregation of incoming data is needed
Computation time of aggregations/rollups will grow with amount of data
Actors are not suitable for performing aggregation due to stateless design model
Micro-batches could partially solve the problem
Some sort of reliable buffer for raw data is still needed

Kafka acts as a buffer for incoming data

For keeping incoming data with some retention and its further pre-aggregation/processing, some sort of distributed commit log could be used. In this case, consumers will read data in batches, process it and store it into Cassandra in form of pre-aggregates.

Here's an example of publishing JSON data through HTTP to Kafka with akka-http:

Consuming the data: Spark Streaming

While Akka still could be used for consuming stream data from Kafka, having Spark in your ecosystem brings Spark Streaming as an option to solve the following problems:

It supports a variety of data sources
It provides at-least-once semantics
Exactly-once semantics available with Kafka Direct and idempotent storage

Here's an example of consuming event stream from Kinesis with Spark Streaming:

Designing for failure: backups and patching

Usually this is the most boring part of any system, but it's really important to protect data from loss in every possible way when the datacenter is unavailable or analysis is performed on the datacenter breakdowns.

So why not store the data in Kafka/Kinesis?

At the moment of writing this article, Kinesis is the only one solution that can retain data without backups when all processing results have been lost. While Kafka supports a long retention period, cost of hardware ownership should be considered because for example S3 storage is much cheaper than multiple instances running Kafka and S3 SLA are really good.

Apart from having backups, the restoring/patching strategies should be designed upfront and tested so that any problems with data could be quickly fixed. Programmers' mistakes in aggregation job or duplicated data deletion may break the accuracy of the computation results. Therefore, it is very important to have the capability of fixing such errors. One thing to make all these operations easier is to enforce idempotency in the data model so that multiple repetition of the same operations produce the same results (for example, SQL update is an idempotent operation while counter increment is not).

Here is an example of Spark job which reads S3 backup and loads it into Cassandra:

SMACK: The Big Picture

This concludes our broad description of SMACK. To allow you to better visualize the design of a data platform built with SMACK, here’s a visual depiction of the architecture:

In the above article we talked about some of the basic functions of using SMACK. To finish with here is a quick rundown of it’s main advantages:

Concise toolbox for wide variety of data processing scenarios
Battle-tested and widely used software with large support communities
Easy scalability and replication of data while preserving low latencies
Unified cluster management for heterogeneous loads
Single platform for any kind of applications
Implementation platform for different architecture designs (batch, streaming, Lambda, or Kappa)
Really fast time-to-market (for example, for MVP verification)

Thanks for reading. If anyone has experience developing applications using SMACK. Please leave some comments.