Get Acquainted with the Powerful Apache Beam & a Snippet


The Apache Beam Programming Guide is intended for Beam users who want to use the Beam SDKs to create data processing pipelines. It provides guidance for using the Beam SDK classes to build and test your pipeline. The programming guide is not intended as an exhaustive reference, but as a language-agnostic, high-level guide to programmatically building your Beam pipeline. As the programming guide is filled out, the text will include code samples in multiple languages to help illustrate how to implement Beam concepts in your pipelines.

To use Beam, you first need to create a driver program using the classes in one of the Beam SDKs. Your driver program defines your pipeline, including all of its inputs, transforms, and outputs; it also sets execution options for your pipeline. These include the Pipeline Runner, which in turn determines which back-end your pipeline will run on.

Terms

Pipeline: A pipeline encapsulates your entire data processing task, from start to finish. In other words, it is a single program that reads input data, transforms that data, and writes output data. As a result, all Beam programs must create a pipeline.
PCollection: A PCollection represents a distributed dataset that a Beam pipeline operates on. The data can come from a bounded source, such as a file, or from an unbounded source, such as Kafka.
PTransform: A PTransform represents a data processing operation, or a step, in the pipeline. Every PTransform takes one or more PCollection objects as input, performs an operation on the elements, and produces zero or more output PCollection objects.
IO transforms: Beam ships with a large number of IOs – library transforms that read or write data to external storage systems.
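To make these terms concrete, here is a minimal sketch that touches each of them: a Pipeline, a PCollection read through an IO transform, a PTransform that upper-cases each line, and an IO transform that writes the result. The input and output paths here are placeholder assumptions, not real files.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class TermsExample {
    public static void main(String[] args) {
        // Pipeline: encapsulates the whole data processing task
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

        // PCollection: a distributed dataset, here created by an IO read
        // (the path is a placeholder assumption)
        PCollection<String> lines = p.apply(TextIO.read().from("/path/to/input.txt"));

        // PTransform: one processing step; PCollection in, new PCollection out
        PCollection<String> upper = lines.apply(
            MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));

        // IO transform again, this time writing the results out
        upper.apply(TextIO.write().to("/path/to/output"));

        p.run().waitUntilFinish();
    }
}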

Steps to construct a pipeline:

  • Create a Pipeline object and set the pipeline execution options, including the Pipeline Runner.
  • Create an initial PCollection for pipeline data, either using the IOs to read data from an external storage system, or using a Create transform to build a PCollection from in-memory data.
  • Apply PTransforms to each PCollection. Transforms can change, filter, group, analyze, or otherwise process the elements in a PCollection. A transform creates a new output PCollection without modifying the input collection. A typical pipeline applies subsequent transforms to each new output PCollection in turn until processing is complete. However, note that a pipeline does not have to be a single straight line of transforms applied one after another: think of PCollections as variables and PTransforms as functions applied to those variables; the shape of the pipeline can be an arbitrarily complex processing graph (a branching sketch follows this list).
  • Use IOs to write the final, transformed PCollection(s) to an external sink.
  • Run the pipeline using the designated Pipeline Runner.
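The "arbitrarily complex graph" point is easiest to see in code. In this hypothetical sketch, a single PCollection feeds two independent Filter branches; the class name and filter predicates are illustrative assumptions.

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;

public class BranchingExample {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

        // One input PCollection built from in-memory data...
        PCollection<String> words =
            p.apply(Create.of(Arrays.asList("apple", "banana", "avocado")));

        // ...feeding two independent branches: the pipeline is a graph,
        // not a straight line. Each branch would normally end in a sink.
        PCollection<String> startsWithA =
            words.apply("FilterA", Filter.by((String s) -> s.startsWith("a")));
        PCollection<String> startsWithB =
            words.apply("FilterB", Filter.by((String s) -> s.startsWith("b")));

        p.run().waitUntilFinish();
    }
}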

Data Source: the location from which the pipeline reads its input data.

Output: the path to which the processed data is written.

PipelineOptions options = PipelineOptionsFactory.create();

Pipeline p = Pipeline.create(options);
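In practice you often want to set options, such as the Pipeline Runner, from the command line instead of hardcoding them. Here is a minimal sketch using the standard fromArgs factory; for example, you could pass --runner=DirectRunner when launching the program.

// Sketch: build options from command-line flags, e.g. --runner=DirectRunner
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

Pipeline p = Pipeline.create(options);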

Code snippet to write data into a text file

import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class WriteToFile {
    public static void main(String[] args) {
        // Create the pipeline with default execution options
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        // Build a PCollection from in-memory data and write it out as text
        List<String> list = Arrays.asList("a", "b", "c");
        p.apply(Create.of(list))
         .apply(TextIO.write().to("/home/knoldus/Downloads/rkr/textfile").withSuffix(".txt"));

        // Block until the pipeline finishes
        p.run().waitUntilFinish();
    }
}

Code Snippet Details: This code first creates the pipeline, then builds a PCollection from the elements of the List and writes them out as sharded text files (named like textfile-00000-of-00001.txt); make sure you pass a valid path in the proper format.
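As a counterpart, a similar sketch can read those files back and write out a line count. The glob pattern and the linecount output path are assumptions matching the sharded file names TextIO produces above.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.ToString;

public class ReadFromFile {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
        // The glob matches sharded files such as textfile-00000-of-00001.txt
        p.apply(TextIO.read().from("/home/knoldus/Downloads/rkr/textfile*.txt"))
         .apply(Count.globally())        // count all lines read
         .apply(ToString.elements())     // convert the Long count to a String
         .apply(TextIO.write().to("/home/knoldus/Downloads/rkr/linecount").withSuffix(".txt"));
        p.run().waitUntilFinish();
    }
}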

References:

https://beam.apache.org/documentation/programming-guide/

https://chengzhizhao.medium.com/reading-apache-beam-programming-guide-4-transforms-part-1-12181d5ab563
