Apache Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing the realtime computation. Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use!
Components of a Storm cluster
Apache Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run “MapReduce jobs”, on Storm you run “topologies”. “Jobs” and “topologies” themselves are very different — one key difference is that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).
There are two kinds of nodes in a Storm cluster:
- Master node (Nimbus)
The master node runs a daemon called “Nimbus” that is similar to Hadoop’s “JobTracker”. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
Nimbus service is an Apache Thrift service enabling you to submit the code in any programming language. This way, you can always utilize the language that you are proficient in, without the need of learning a new language to utilize Apache Storm.
Nimbus service relies on Apache ZooKeeper service to monitor the message processing tasks as all the worker nodes update their tasks status in Apache ZooKeeper service.
- Worker nodes (Supervisor)
Each worker node runs a daemon called the “Supervisor”. The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster. Additionally, the Nimbus daemon and Supervisor daemons are fail-fast and stateless. Even though stateless nature has its own disadvantages, it actually helps Storm to process real-time data in the best possible and quickest way.
Storm is not entirely stateless though. It stores its state in Apache ZooKeeper. Since the state is available in Apache ZooKeeper, a failed nimbus can be restarted and made to work from where it left. Usually, service monitoring tools like monit will monitor Nimbus and restart it if there is any failure.
Apache Storm also has an advanced topology called Trident Topology with state maintenance and it also provides a high-level API like Pig.
Components of a Storm cluster
To do realtime computation on Storm, you create what are called “topologies”. A topology is a graph of computation and is implemented as DAG (Directed Acyclic Graph) data structure.
Each node in a topology contains processing logic (bolts), and links between nodes indicate how data should be passed around between nodes (streams).
When a topology is submitted to a Storm cluster, Nimbus service on master node consults the supervisor services on different worker nodes and submits the topology. Each supervisor, creates one or more worker processes, each having its own separate JVM. Each process runs within itself threads which we call Executors.
The thread/executor processes the actual computational tasks: Spout or Bolt.
Running a topology is straightforward. First, you package all your code and dependencies into a single jar. Then, you run a command like the following:
storm jar all-my-code.jar org.apache.storm.MyTopology arg1 arg2
Streams represent the unbounded sequences of tuples(collection of key-value pairs) where a tuple is a unit of data.
A stream of tuples flows from spout to bolt(s) or from bolt(s) to another bolt(s). There is various stream grouping techniques to let you define how the data should flow in topology like Global grouping etc.
Spout is the entry point in a storm topology. It represents the source of data in Storm. Generally, spouts will read tuples from an external source and emit them into the topology. You can write spouts to read data from data sources such as database, distributed file systems, messaging frameworks or message queue as Kafka, from where it gets continuous data, converts the actual data into a stream of tuples & emits them to bolts for actual processing. Spouts run as tasks in worker processes by Executor threads.
Spouts can broadly be classified as following –
- Reliable – These spouts have the capability to replay the tuples (a unit of data in the data stream). This helps applications achieve ‘at least once message processing’ semantic as in case of failures, tuples can be replayed and processed again. Spouts for fetching the data from messaging frameworks are generally reliable as these frameworks provide the mechanism to replay the messages.
- Unreliable – These spouts don’t have the capability to replay the tuples. Once a tuple is emitted, it cannot be replayed irrespective of whether it was processed successfully or not. This type of spouts follows ‘at most once message processing’ semantic.
All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.
Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).
Bolts can also emit more than one stream.
What makes a running topology: worker processes, executors and tasks
Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:
- Worker processes
- Executors (threads)
Here is a simple illustration of their relationships:
A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.
An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).
A task performs the actual data processing — each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true:
#threads ≤ #tasks. By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.
This pretty much sums up the architecture of Apache Storm. Hope it was helpful.