Apache Beam, the Java SDK way

Reading Time: 4 minutes

Apache Beam stands for B(batch) – EAM(strEAM).

Apache Beam is an open source, integrated model for both batch and streaming data-parallel processing pipelines. Using one of the Beam SDK (Java, Python and GO) which are also open source, you create a program that describes the pipeline.
The pipeline is then used by one of Beam-based back-end processing systems, including Apache Flink, Apache Spark, and Google Cloud Dataflow.

Beam is especially useful for embarrassing data processing tasks, where the problem can be broken down into many small amounts of data that can be processed independently and parallely.
You can also use Beam for Extract, Transform, and Load (ETL) functions and pure data integration. These functions are useful for moving data between different storage media and data sources, converting data into the most desirable format, or uploading data to a new system.

Apache Beam.img

Apache Beam SDKs

Beam SDKs provide an integrated editing model that manages and converts data sets of any size, whether input is a limited data set from a batch data source, or an infinite data set from a streaming data source.
Beam SDKs use the same classes to represent both bounded and unbounded data, with the same modifications to work on that data. You can use the Beam SDK of your choice to create a program that defines your data processing pipeline.
Beam currently supports :

  • Java SDK
  • Python SDK
  • GO SDK

Apache Beam Pipeline Runners

Beam Pipeline Runners translates the data processing pipeline that you define into your Beam program into an API that accompanies distributed processing.
When you start your Beam program, you will need to specify a runner which suits your needs when you want to use your pipeline.
Beam currently supports the following runners:

  • Direct Runner
  • Apache Spark Runner
  • Apache Flink Runner
  • Google Cloud Dataflow Runner
  • Apache Flink logo
  • Google Cloud Dataflow logo
  • Apache Nemo Runner
  • Hazelcast Jet logo
  • Twister2 logo
  • Apache Samza Runner
  • Apache Samza logo
  • Twister2 Runner
  • Hazelcast Jet Runner
  • Apache Spark logo

Apache Beam Model

  • PCollection: represents a collection of data, which could be bounded or unbounded in size.
  • PTransform: represents a computation that transforms input PCollections into output PCollections.
  • Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
  • PipelineRunner: specifies where and how the pipeline should execute.

Apache Beam Environment Setup Requirements

  • Java JDK 8, 11 or 17 .
  • Maven
  • An IDE such as Intellij or Eclipse.

First Apache Beam Project using Java SDK

1) Open an IDE (we would use Intellij), and create a new Project
2) Go to POM.xml file and add dependencies for beam-sdk and beam-runner
3) we will now convert a .txt document to a .docx document using Apache Beam
(.txt) file that we will convert to (.docx) file using Apache Beam
4) Create a class named BeamDemoRunner , with the following java code
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

public class BeamDemoRunner {
    public static void main(String[] args) {
        //create a Pipeline object
        Pipeline pipeline=Pipeline.create();
        //creating a PCollection object which represents a distributed data set
        //and reading a text file (.txt)
        PCollection<String> output=pipeline.apply(
                TextIO.read().from("/home/prakhar/BeamPOC/input/sample-text-file-input.txt")
        );
        //converting the text file into a word document (.docx)
        output.apply(
                TextIO.write().to("/home/prakhar/BeamPOC/output/sample-text-file-output.docx")
                        //if wont use withNumShards , as I told, PCollection is a distibuted set
                        //it will output multiple files instead of 1 single file
                        .withNumShards(1)
                        //to generate a file with extension .docx
                        .withSuffix(".docx")
        );
        pipeline.run();
    }
}
5) After running the above program , check the output folder
The output folder contains a (.docx) file that our program generated.
6) Output.docx
the .docx file contain the same text but in different order.

Github link for the above Demo : https://github.com/rastogiprakhar/BeamTutorial

Knoldus Blogs : https://blog.knoldus.com/

References :
Official Apache Beam Github: https://github.com/apache/beam
Official Apache Beam Documentation: https://beam.apache.org/documentation/

knoldus

Written by 

Prakhar is a Software Consultant at Knoldus . He has completed his Masters of Computer Applications from Bharati Vidyapeeth Institute of Computer Applications and Management, Paschim Vihar . He likes problem solving and exploring new technologies .

Leave a Reply