Receivers in Apache Spark Streaming

Reading Time: 2 minutes

Receivers are special objects in Spark Streaming whose job is to consume data from external data sources and move it into Spark. The streaming context launches receivers as long-running tasks on the executors.

We build a custom receiver by extending the abstract class Receiver and implementing its two lifecycle methods:

onStart()

This method performs the setup needed to start receiving data: opening connections, creating threads, and so on. onStart() must be non-blocking, so we run the actual data retrieval on a new thread. Otherwise, a blocking onStart() would prevent the receiver from stopping cleanly, even after the streaming context exceeds its timeout.

onStop()

onStop() does the reverse: it cleans up whatever onStart() set up and stops the data consumption.

The data-retrieval thread started from onStart() hands records over to Spark by calling one of two store(…) methods, which move data from the receiver into the Spark context.
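As a sketch of the pattern described above, a minimal custom receiver might look like the following. The class name `SocketLineReceiver` and the host/port source are illustrative choices, not from the original post:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that reads text lines from a TCP socket.
class SocketLineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // onStart() must not block: launch the reading loop on a new thread.
  override def onStart(): Unit = {
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing extra to do here: the reading thread checks isStopped()
  // and the socket is closed in receive()'s finally block.
  override def onStop(): Unit = {}

  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)              // hand each record over to Spark
        line = reader.readLine()
      }
      // Source closed the connection: ask Spark to restart the receiver.
      restart("Trying to connect again")
    } catch {
      case e: java.net.ConnectException =>
        restart(s"Could not connect to $host:$port", e)
      case t: Throwable =>
        restart("Error receiving data", t)
    } finally {
      if (socket != null) socket.close()
    }
  }
}
```

Such a receiver would be plugged in with `ssc.receiverStream(new SocketLineReceiver("localhost", 9999))`, assuming a `StreamingContext` named `ssc`.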

There are two types of receivers in Spark:

  • Reliable Receiver
  • Unreliable Receiver

Reliable Receiver

A reliable receiver sends an acknowledgement to the data source only after the received data has been successfully stored and replicated in Spark storage. It passes data through the store(…) variants that take collection-like objects (an Iterator, ByteBuffer, or ArrayBuffer). These are blocking calls that do not return until Spark confirms the data has been saved; once store(…) returns, the receiver can safely acknowledge the source.
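The reliable pattern can be sketched as below. `fetchBatch` and `ackBatch` are hypothetical placeholders for whatever the actual source's API provides; only the store(…) call is the real Spark Streaming API:

```scala
import scala.collection.mutable.ArrayBuffer

// Inside a reliable receiver's receiving loop (sketch):
def consumeReliably(): Unit = {
  while (!isStopped) {
    val batch: ArrayBuffer[String] = fetchBatch() // pull records (hypothetical)
    // store(ArrayBuffer) blocks until Spark has reliably saved and
    // replicated the block, so it is safe to acknowledge afterwards.
    store(batch)
    ackBatch()                                    // ack the source (hypothetical)
  }
}
```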

Unreliable Receiver

An unreliable receiver sends no acknowledgement back to the source.

It uses the store(…) variant that takes a single object. This call is non-blocking, but it does not send data to Spark immediately; instead, it buffers records in memory and pushes them to Spark as a batch after accumulating some number of items.
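By contrast, the unreliable pattern is a simple loop over single records; `fetchOne` is again a hypothetical stand-in for the source's API:

```scala
// Inside an unreliable receiver's receiving loop (sketch):
// store(single record) returns immediately; Spark buffers the records
// in memory and writes them out in blocks, and no acknowledgement is
// ever sent back to the source.
while (!isStopped) {
  val record: String = fetchOne() // pull one record (hypothetical)
  store(record)                   // non-blocking, no ack
}
```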

Sometimes a receiver can fail; in that case, we can restart it through the restart(…) method. Note that restart(…) does not restart the receiver immediately: it only schedules a restart, which happens asynchronously after a delay.
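A typical place to call restart(…) is the receiving thread's error handling, for example (the message strings here are arbitrary):

```scala
// Fragment from inside a custom receiver's receiving loop (sketch).
// restart() is asynchronous: it schedules onStop() followed by
// onStart() on a separate thread after a configurable delay
// (2 seconds by default); the receiver is not restarted in place.
try {
  // ... receive and store records ...
} catch {
  case e: java.io.IOException =>
    restart("Error receiving data, scheduling a restart", e)
}
```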

Conclusion

This blog covered how data travels from external data sources to the Spark context, with receivers as the main objects that make this process possible: the two types of receivers and the methods we implement on them.

Written by 

Rakhi Pareek is a Software Consultant at Knoldus. She believes in continuous learning with new technologies. Her current practice area is Scala. She loves to maintain a diary to put on her thoughts daily. Her hobby is doodle art.