Apache Storm: The Hadoop of Real-Time


Apache Storm is an open source & distributed stream processing computation framework written predominantly in the Clojure programming language. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm allows developers to build powerful applications that are highly responsive and can find trends between topics on twitter, monitoring spikes in payment failures, and so on.

Storm is simple & the best part about Apache Storm is a number of things that can be done with it. It is compatible with multiple languages, is extremely fast for processing through large data sets, is scalable, fault-tolerant, and packed with more amazing features.

History Of Storm

Storm was originally created by Nathan Marz and team at BackType. BackType is a social analytics company. Later, Storm was acquired and open-sourced by Twitter. In a short time, Apache Storm became a standard for the distributed real-time processing system that allows you to process a large amount of data, similar to Hadoop. Apache Storm is written in Java and Clojure. It is continuing to be a leader in real-time analytics.

Nathan open sourced Storm to GitHub on September 19th, 2011 during his talk at Strange Loop, and it quickly became the most watched JVM project on GitHub. Production deployments soon followed, and the Storm development community rapidly expanded.

At the time Storm was introduced, Big Data analytics largely involved batch processing in map-reduce on Apache Hadoop or one of the higher level abstractions like Apache Pig and Cascading.

Before Storm was written, the usual way of processing data in real time was using queues and worker thread approaches. For example, some threads will be continuously writing data to some queues like rabbitMq and some worker threads will be continuously reading data from these queues and processing them. The output might be written again to some other queues and chained as input to some other worker threads to process further.

Such design is possible but obviously very fragile. Much of the time would be spent in maintaining the entire framework, serializing/deserializing messages, dealing with data loss, resolving many other issues rather than doing the actual processing work.

Nathan Marz came up with the nice idea of creating the abstraction to all these in an efficient way in a program where we have to just create SPOUT and BOLT to do necessary processing and submit the job as TOPOLOGY and the framework will take care of everything else.

Some of the really beautiful abstractions he came up with, I feel is :

  1. Streams: The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream’s tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of “default”.
  2. Tracking algorithm: It has a very efficient algo which guaranteed that every message will be processed. It ensured, no matter how much a message is going to process downstream, fixed amount of space (about 20 bytes) would be needed to keep track of the state of every message tuple.

Benefits of using Storm

Storm advantages include:

  • Allows real-time stream processing.
  • Scalability – where throughput rates of even one million 100 byte messages per second per node can be achieved.
  • Low latency – Storm performs data refresh and end-to-end delivery response in seconds or minutes depends upon the problem.
  • Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
  • Easy to operate – standard configurations are suitable for production on day one. Once deployed, Storm is easy to operate.
  • Fault-tolerant: The ability of fault-tolerant is extremely important for storm as it processes massive data all time and should not be interrupted by a minimal failure, such as hardware fail in nodes of the storm cluster. Storm can redeploy tasks when it is necessary.
  • Data guarantee: No data loss is one of the essential requirements for a data processing
    system. The risk of losing data would not be accepted in the use of most fields, especially for those ask for accurate results. Storm makes sure that all the data would be processed as they are designed during their processing in the topology.
  • Ease of use in deploying and operating the system.
  • Support for multiple programming languages.
  • Fraud can be detected the moment it happens and proper measures can be taken to limit the damage.

Similarities among Hadoop and Storm

  • All three are open-source processing frameworks
  • All these frameworks can be used for Business Intelligence and Big Data Analytics
  • Each of these frameworks provides fault tolerance and scalability.
  • Both are distributed.
  • These frameworks are preferred choices for Big Data Developers due to their simple installation methods.
  • Hadoop and Storm have implementation in JVM based programming languages – Java and Clojure respectively.

Apache Storm vs Hadoop

Basically, Hadoop and Storm frameworks are used for analyzing big data. Both of them complement each other and differ in some aspects. Apache Storm does all the operations except persistency, while Hadoop is good at everything but lags in real-time computation. The following table compares the attributes of Storm and Hadoop.

Storm Hadoop
Real-time stream processing Batch processing
Master/Slave architecture with ZooKeeper based coordination. The master node is called as nimbus and slaves are supervisors. Master-slave architecture with/without ZooKeeper based coordination. Master node is job tracker and slave node is task tracker.
Stateless Stateful
Storm topology runs until shutdown by the user or an unexpected unrecoverable failure.

MapReduce jobs are executed in a sequential order and completed eventually.

A Storm streaming process can access tens of thousands messages per second on cluster. Hadoop Distributed File System (HDFS) uses MapReduce framework to process the vast amount of data that takes minutes or hours.

Use Cases of Apache Storm

Twitter

Storm powers Twitter’s publisher analytics product, processing every tweet and click that happens on Twitter to provide analytics for Twitter’s publisher partners. Storm integrates with the rest of Twitter’s infrastructure, including Cassandra, the Kestrel infrastructure, and Mesos. Many other projects are underway using Storm, including projects in the areas of revenue optimization, anti-spam, and content discovery.

Wego

Wego is a travel metasearch engine located in Singapore. Travel related data comes from many sources all over the world with different timing. Storm helps Wego to search real-time data, resolves concurrency issues and find the best match for the end-user.

Yahoo!

Yahoo! is developing a next generation platform that enables the convergence of big-data and low-latency processing. While Hadoop is our primary technology for batch processing, Storm empowers stream/micro-batch processing of user events, content feeds, and application logs.

NaviSite

Navsite is using Apache Storm as part of their server event log monitoring & auditing system. The log messages from thousands of servers are sent to RabbitMQ cluster and Storm is used to compare each message with a set of regular expressions. If there is a match, then the message is sent to a bolt that stores data in MongoDB. At the moment, 5-10k messages per second are being handled, however the existing RabbitMQ + Storm clusters have been tested up to about 50k per second.

I highly recommend reading this excellent post by Nathan Marz to any Developer in which he beautifully explains what was his experience, how he came up with an idea of the storm, what issues he faced and how he took things forward.

http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html

 

Advertisements
This entry was posted in Scala. Bookmark the permalink.

One Response to Apache Storm: The Hadoop of Real-Time

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s