Spark Streaming is one of the most essential parts of the Big Data ecosystem. It is a component of Apache Spark, a framework from the Apache Software Foundation used to process Big Data. Basically, it ingests data from sources like Twitter in real time, processes it using functions and algorithms, and pushes the results out to databases and other storage systems. Spark Streaming extends the core Spark API to enable scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources like Kafka, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions.
Need for Spark Streaming-
- Streaming data is received from data sources (e.g. live logs, system telemetry data, IoT device data) into a data ingestion system such as Apache Kafka or Amazon Kinesis.
- The data is then processed in parallel on a cluster.
- Results are delivered to downstream systems such as HBase, Cassandra, or Kafka.
- A stateful stream processing system is one that must keep updating its state as the stream of data arrives. Latency should be low for such a system, and the state should not be lost even if a node fails (for example, computing the distance covered by a vehicle from a stream of its GPS locations, or counting the occurrences of the word “spark” in a stream of data).
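The word-counting example in the last point can be sketched in a few lines. This is a plain-Python model of the idea, not Spark code: the `update_state` function and the sample `batches` are hypothetical names and data made up for illustration, and the fault-tolerance part (keeping the state safe when a node fails) is not modeled. In the DStream API, the comparable operation is `updateStateByKey`.

```python
def update_state(state, batch, target="spark"):
    """Fold one micro-batch of text lines into the running count of `target`."""
    words = (w.strip('.,!?"').lower() for line in batch for w in line.split())
    return state + sum(1 for w in words if w == target)

# Hypothetical micro-batches of text lines, standing in for a live stream.
batches = [
    ["Spark Streaming extends Spark", "no match here"],
    [],  # an interval with no data leaves the state unchanged
    ["spark again"],
]

state = 0
history = []
for batch in batches:
    state = update_state(state, batch)  # state survives from batch to batch
    history.append(state)

print(history)  # [2, 2, 3]
```

The point of the sketch is that the count is carried forward between batches instead of being recomputed from scratch, which is what makes the processing stateful.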
Spark Streaming Architecture
1. Fast failure and straggler recovery–
In traditional systems, recovering from a node failure is not easy: the failed operator has to be restarted on another node, which must then recompute the lost data. In Spark, the computation is discretized into small tasks, so each task can run anywhere without affecting correctness.
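To see why small, deterministic tasks make recovery simple, here is a minimal plain-Python sketch. It does not use Spark's scheduler; the names `run_task` and `run_on_cluster`, the node names, and the round-robin placement are all assumptions made for illustration.

```python
def run_task(partition):
    """A deterministic task: the result depends only on the input partition."""
    return sum(x * x for x in partition)

def run_on_cluster(partitions, nodes, failed_nodes):
    """Place each task on a node; if that node is down, try the next one."""
    results, placement = [], []
    for i, partition in enumerate(partitions):
        for attempt in range(len(nodes)):
            node = nodes[(i + attempt) % len(nodes)]
            if node in failed_nodes:
                continue  # simulated failure: move on to another node
            results.append(run_task(partition))
            placement.append(node)
            break
        else:
            raise RuntimeError("no healthy node available")
    return results, placement

partitions = [[1, 2], [3, 4], [5, 6]]
nodes = ["node-a", "node-b", "node-c"]

healthy, _ = run_on_cluster(partitions, nodes, failed_nodes=set())
degraded, placement = run_on_cluster(partitions, nodes, failed_nodes={"node-b"})

print(healthy)    # [5, 25, 61]
print(degraded)   # [5, 25, 61]
print(placement)  # ['node-a', 'node-c', 'node-c']
```

Because each task depends only on its own input, the results are identical whether or not `node-b` is available; only the placement changes.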
2. Unification of batch, streaming and interactive analytics-
A DStream is just a series of Spark RDDs, which allows batch and streaming workloads to interoperate seamlessly. We can apply Spark functions to each batch of streaming data. Because the batches are stored in Spark's worker memory, they can be queried interactively on demand.
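The idea that one piece of logic serves both batch and streaming workloads can be sketched as follows. This is a plain-Python model under stated assumptions, not the DStream API: `word_counts`, `discretize`, the 1-second interval, and the sample `events` are hypothetical.

```python
from collections import Counter

def word_counts(records):
    """Batch-style logic: count words in any finite collection of lines."""
    return Counter(word for line in records for word in line.lower().split())

def discretize(events, batch_interval):
    """Group (timestamp, line) events into fixed-length batches.

    Intervals that received no events are simply omitted.
    """
    batches = {}
    for timestamp, line in events:
        batches.setdefault(int(timestamp // batch_interval), []).append(line)
    return [batches[k] for k in sorted(batches)]

# Hypothetical timestamped events: (seconds, text).
events = [(0.2, "spark streaming"), (0.9, "spark"), (1.4, "batch and streaming")]

# Streaming use: apply the function to each 1-second micro-batch.
per_batch = [word_counts(b) for b in discretize(events, batch_interval=1.0)]

# Batch use: apply the very same function to the whole data set.
overall = word_counts(line for _, line in events)

print(per_batch[0]["spark"], per_batch[1]["streaming"])  # 2 1
print(overall["streaming"])  # 2
```

Adding up the per-batch counters gives the same totals as the single batch run, which is the sense in which the two workloads interoperate.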
3. Advanced analytics like machine learning and interactive SQL
We can also integrate Spark Streaming with advanced processing libraries for SQL, machine learning, and graph processing. The RDDs generated by DStreams can be converted into DataFrames and then queried with SQL. Recovery from failures is also faster, because failed tasks are relaunched in parallel on other free nodes.
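The step of turning a batch into a table and querying it with SQL can be illustrated with a small stand-in. The sketch below uses Python's built-in `sqlite3` module in place of Spark SQL, so it only shows the shape of the workflow. The `query_batch` helper, the `events` table, and the sensor readings are hypothetical.

```python
import sqlite3

def query_batch(rows, sql):
    """Load one micro-batch into an in-memory table and run a SQL query on it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (device TEXT, temperature REAL)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

# Hypothetical batch of (device, temperature) readings.
batch = [("sensor-1", 20.0), ("sensor-2", 30.0), ("sensor-1", 24.0)]

result = query_batch(
    batch,
    "SELECT device, AVG(temperature) FROM events GROUP BY device ORDER BY device",
)
print(result)  # [('sensor-1', 22.0), ('sensor-2', 30.0)]
```

In Spark itself, the rough counterpart is converting each batch's RDD into a DataFrame and querying that through Spark SQL.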
The ability to batch data and leverage the Spark engine leads to high throughput. Spark Streaming latencies can be as low as a few hundred milliseconds.
In this blog, we have learned the basics of Spark Streaming. We also got to know the Spark Streaming architecture, why Spark Streaming is needed, and the advantages of Spark Streaming.