This blog gives an overview of Apache Beam.
What is Apache Beam?
- Apache Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines.
- The available open-source Beam SDKs make it easy to build a program that defines such a pipeline.
- Apache Flink, Apache Spark, and Google Cloud Dataflow are some of the runners that can execute the program.
Why use Apache Beam?
- Data engineers mostly use Beam to create Extract, Transform, and Load (ETL) tasks and for data integration.
- It is also useful for embarrassingly parallel data processing tasks.
- In such tasks, the problem is decomposed into smaller chunks that are processed independently; in other words, we are running MapReduce-style jobs.
- As a result, Apache Beam achieves parallel processing.
Features of Apache Beam
- Unified – a single programming model for batch and streaming.
- Extensible – write and share new SDKs, IO connectors, and transformation libraries.
- Portable – executes pipelines on multiple execution environments.
- Open Source – community-based development.
Apache Beam SDKs
- The Beam SDKs use the same classes to represent and transform both bounded and unbounded data, which makes it a unified data model.
- Currently, Beam provides language-specific SDKs for Java, Python, and Go.
- A Scala interface is also available via Scio.
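To make this concrete, here is a minimal sketch in the Java SDK showing that a bounded read (a text file) and an unbounded read (a Kafka topic) both produce the same PCollection&lt;String&gt; type. The file path, topic name, and bootstrap servers are hypothetical placeholders, and the Kafka read assumes the beam-sdks-java-io-kafka connector is on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BoundedVsUnbounded {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bounded source: a fixed file yields a bounded PCollection<String>.
    PCollection<String> fromFile =
        p.apply("ReadFile", TextIO.read().from("/tmp/input.txt"));

    // Unbounded source: a Kafka topic yields an unbounded PCollection<String>,
    // represented by exactly the same PCollection class.
    PCollection<String> fromKafka =
        p.apply("ReadKafka", KafkaIO.<String, String>read()
                .withBootstrapServers("localhost:9092")
                .withTopic("events")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())              // KV<key, value> without Kafka metadata
            .apply("Values", Values.create());   // keep only the record values

    p.run().waitUntilFinish();
  }
}
```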
Apache Beam Pipeline Runners
- First of all, we need an appropriate runner for the backend we want to execute on.
- Runners translate the data processing pipeline into an API compatible with the backend of the user's choice.
- This feature is what lets Beam provide a portable programming layer.
- Currently, Beam supports runners including the Direct Runner, Apache Flink, Apache Spark, Apache Samza, and Google Cloud Dataflow. A minimal sketch of selecting a runner follows.
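For illustration, here is a minimal sketch in the Java SDK of how the runner is chosen through pipeline options rather than in the pipeline code itself; the runner name is normally passed on the command line, and the corresponding runner artifact must be on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelection {
  public static void main(String[] args) {
    // Typical command-line arguments:
    //   --runner=DirectRunner     (local testing, the default)
    //   --runner=FlinkRunner      (execute on Apache Flink)
    //   --runner=SparkRunner      (execute on Apache Spark)
    //   --runner=DataflowRunner   (execute on Google Cloud Dataflow)
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();

    // The same pipeline definition runs unchanged on whichever backend was selected.
    Pipeline pipeline = Pipeline.create(options);
    pipeline.run().waitUntilFinish();
  }
}
```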
Apache Beam Programming Guide
To use Beam, first create a driver program that defines the pipeline, including all of its inputs, transforms, and outputs.
It also sets the execution options for the pipeline, usually through command-line arguments.
Moreover, it specifies the Pipeline Runner, which determines the backend the pipeline will run on.
The Beam SDKs provide a number of abstractions that simplify distributed processing.
Common abstractions include:
- Pipeline: A pipeline encapsulates the whole data processing task, from start to finish. In other words, it is a single program that reads input data, transforms that data, and writes output data. Consequently, every Beam program must create a pipeline.
- PCollection: A PCollection represents a distributed dataset that a Beam pipeline operates on. The data can come from a bounded source, such as a file, or from an unbounded source, such as Kafka.
- PTransform: A PTransform represents a data processing operation, or a step, in the pipeline. Every PTransform takes one or more PCollection objects as input, performs an operation on them, and produces zero or more output PCollection objects.
- I/O transforms: Beam ships with a large number of IO library transforms for reading data from and writing data to external systems.
A typical Beam program works as follows (a complete sketch appears after these steps):
- First, create a Pipeline object and set the pipeline options, including the runner.
- Second, create an initial PCollection for the pipeline data.
- Third, apply PTransforms to the input PCollection.
- Then, use an I/O transform to write the final output to an external system.
- Finally, run the pipeline using the designated pipeline runner.
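Putting these steps together, here is a minimal word-count sketch in the Java SDK, modelled on the Beam quickstart; the input and output paths are hypothetical placeholders, and the runner comes from the command-line options as shown earlier.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalWordCount {
  public static void main(String[] args) {
    // 1. Create a Pipeline object and set the pipeline options, including the runner.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // 2. Create an initial PCollection by reading from a bounded source (a text file).
    p.apply("ReadLines", TextIO.read().from("/tmp/input.txt"))

        // 3. Apply PTransforms: split lines into words and count each word.
        .apply("SplitWords", FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply("FilterEmpty", Filter.by((String word) -> !word.isEmpty()))
        .apply("CountWords", Count.perElement())
        .apply("FormatResults", MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> wordCount) ->
                wordCount.getKey() + ": " + wordCount.getValue()))

        // 4. Use an I/O transform to write the final output to an external system.
        .apply("WriteCounts", TextIO.write().to("/tmp/wordcounts"));

    // 5. Run the pipeline with the designated runner.
    p.run().waitUntilFinish();
  }
}
```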
References
https://beam.apache.org/get-started/quickstart-java/
https://beam.apache.org/documentation/programming-guide/