Apache Beam is a unified programming model that handles both stream and batch data in the same way. We can create a pipeline with any of the Beam SDKs (Java, Python, or Go), and that pipeline can run on top of any supported execution engine, namely Apache Spark, Apache Flink, Apache Apex, Apache Samza, Apache Gearpump, and Google Cloud Dataflow (with more to join in the future).
With Apache Beam, you can apply the same operations whether it is bounded data from a batch source such as an HDFS file or unbounded data from a streaming source such as Kafka.
How does the Apache Beam driver program work?
First, create a Pipeline object and set the pipeline execution options (which runner to use: Apache Spark, Apache Apex, etc.).
Second, create a PCollection from some external storage or in-memory data. Then apply PTransforms to transform each element in the PCollection and produce an output PCollection.
You can filter, group, analyze, or do any other processing on the data.
Finally, write the final PCollection to some external storage system using the IO libraries. When we run this driver program, it creates a workflow graph out of the pipeline, which then executes as an asynchronous job on the underlying runner.

// Create the pipeline options, then create the pipeline from them
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
// Create a PCollection by reading from a text file
PCollection<String> lines =
    p.apply("TextFile", TextIO.read().from("protocol://inputPath"));
// Apply a transformation; as an illustration, keep only non-empty lines
PCollection<String> output =
    lines.apply(Filter.by((String line) -> !line.isEmpty()));
// Write the final PCollection to some external sink
output.apply(TextIO.write().to("protocol://outputPath"));
// Run the pipeline; it executes as an asynchronous job on the chosen runner
p.run();
Beam Transformations
ParDo
A ParDo transformation applies a processing function to each element in a PCollection to produce zero, one, or more elements in the resulting PCollection.
You can filter, format, compute, extract, and type-convert elements of a PCollection using ParDo.
To apply a ParDo, you create a class extending DoFn that contains a method annotated with @ProcessElement; this method holds the processing logic to apply to each element of the PCollection and emits the result.
PipelineOptions options = PipelineOptionsFactory.create();
// Then create the pipeline using the pipeline options
Pipeline p = Pipeline.create(options);
// Create a PCollection from a text file
PCollection<String> words =
    p.apply("TextFile", TextIO.read().from("protocol://inputPath"));
// The DoFn that computes the length of each element in the input PCollection
static class ComputeWordLengthFn extends DoFn<String, Integer> {
  @ProcessElement
  public void processElement(@Element String word, OutputReceiver<Integer> out) {
    out.output(word.length());
  }
}
// Apply a ParDo to the PCollection "words" to compute the length of each word
PCollection<Integer> wordLengths =
    words.apply(ParDo.of(new ComputeWordLengthFn()));
GroupByKey
GroupByKey groups all values associated with a particular key. Suppose we have data where the key is the month, the value is the name of a person whose birthday falls in that month, and we want to group all the people whose birthdays fall in the same month.
To use GroupByKey on unbounded data, you use windowing or triggers so that the grouping operates on the finite set of data that falls within a particular window. For example, if you define a fixed window size of 3 minutes, all data arriving within that 3-minute span is grouped by key, as in the sketch below.
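A minimal sketch of grouping birthdays by month, reusing the pipeline p from the earlier snippets; the in-memory sample values and variable names are illustrative assumptions, not part of the original example:
// Illustrative month -> name pairs created in memory (sample values are assumptions)
PCollection<KV<String, String>> birthdays = p.apply(
    Create.of(
        KV.of("January", "Asha"),
        KV.of("January", "Rahul"),
        KV.of("March", "Meera")));
// GroupByKey collects, per month, every name that shares that key
PCollection<KV<String, Iterable<String>>> birthdaysByMonth =
    birthdays.apply(GroupByKey.<String, String>create());
// For unbounded input, apply a window before grouping, e.g.
// Window.into(FixedWindows.of(Duration.standardMinutes(3))), so that the
// grouping operates on the elements of each 3-minute window.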
CoGroupByKey
CoGroupByKey joins two or more sets of data that share the same kind of key. For example, you have one file that contains a person’s name as the key with a phone number as the value, and a second file that has the person’s name as the key with an email address as the value.
Joining these two files on the person’s name using CoGroupByKey results in a new PCollection that has the person’s name as the key with the phone number and email address as values. As discussed for GroupByKey, for unbounded data we have to use windowing and triggers to aggregate data with CoGroupByKey. A sketch of such a join follows.
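A minimal sketch of this join, again reusing the pipeline p; the TupleTags, names, and sample values are illustrative assumptions standing in for the two files:
// Illustrative sketch only: join phone numbers and email addresses by person name
final TupleTag<String> phoneTag = new TupleTag<String>() {};
final TupleTag<String> emailTag = new TupleTag<String>() {};
PCollection<KV<String, String>> phones = p.apply("Phones",
    Create.of(KV.of("amit", "111-222-3333"), KV.of("neha", "444-555-6666")));
PCollection<KV<String, String>> emails = p.apply("Emails",
    Create.of(KV.of("amit", "amit@example.com"), KV.of("neha", "neha@example.com")));
// CoGroupByKey produces, for every name, the phones and emails grouped together
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(phoneTag, phones)
        .and(emailTag, emails)
        .apply(CoGroupByKey.create());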
Flatten
Flatten merges a list of PCollections into a single PCollection.
// Flatten takes a PCollectionList and returns a single PCollection
PCollection<String> pc1 = …;
PCollection<String> pc2 = …;
PCollection<String> pc3 = …;
PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2).and(pc3);
PCollection<String> mergedPCollections =
    collections.apply(Flatten.<String>pCollections());
Partition
In contrast to Flatten, Partition splits a single PCollection into multiple smaller PCollections according to a partition function that the user provides.
// Partition takes the desired number of result partitions and a PartitionFn
PCollection<Student> students = …;
// Split students up into 10 partitions, by percentile:
PCollectionList<Student> studentsByPercentile =
    students.apply(Partition.of(10, new PartitionFn<Student>() {
      public int partitionFor(Student student, int numPartitions) {
        return student.getPercentile() * numPartitions / 100;
      }
    }));
Conclusion
No doubt, Apache Beam is the future of parallel processing, and its “write once, run anywhere” approach is making it even more popular in the big data development ecosystem.
Currently, support for and integration with backend execution runners is limited, but in the future more and more frameworks will be integrated, making distributed programming more stable and unified.