Debugging Apache Beam Pipeline

Reading Time: 2 minutes

Overview

Apache Beam is known as one of the widely used frameworks for Stream and Batch processing in a distributed environment and provides some very unique features.

It is an open-source, unified bulk data processing framework that supports data processing through various SDKs that allow the execution of pipelines in different processing engines/runners.

Beam Apache runners :

  • Spark
  • Flink
  • Apex
  • Google Cloud Dataflow
  • DirectRunner. A local runner for tests and debug

Click here to learn in-depth about Apache-beam.


Challenges

One of the challenges with Beam pipelines is to debug and test its remote execution on the various nodes especially with streaming .It become difficult to examine the data that if it matches the expected outputs at various points.


How to Resolve

There are multiple approaches, one can follow to debug/test beam pipelines. Few commonly used methods :

Local Testing/Debugging

We can perform local testing/debugging of pipelines using beam SDK’s in various ways from the lowest to the highest levels. We can do followings –

  1. Debug a pipeline in the IDE(Intellij etc) by using beam-runner-direct-java in a maven project.
  2. Test the individual functions used in the pipeline.
  3. Test an entire Transform as a unit.
  4. Construct a Direct Acyclic Graph(DAG) of transformation via pipeline object.
  5. Perform an end-to-end test for an entire pipeline by creating TestPipelines inside of the JUNIT test to run locally or against a remote pipeline runner.

Remote Testing/Debugging

Beam pipeline local testing is not sufficient for distributed environments and often gets into havoc. These systems become extremely difficult to troubleshoot as scaled-up.

We can troubleshoot the above issue by remote Testing/Debugging .Few commonly used method for Remote debugging as below :

  • By Maintaining a Tracing in the situations of inconsistent results .
  • Creating Unique Tags.
  • Observability into an Apache Beam pipeline
  • By using CloudDebuggerOptions of SDKs.
  • By enabling sampling we can assure that the production workflows are working as expected in a production environment .

References

Written by 

I work as a Vice President, Engineering at Knoldus Inc. I am a Result-driven Techno-Functional Professional with over 19 years of extensive IT experience in Project Management, IT Delivery Operations, Team Management & Leadership. For the majority of my career (15Years), I worked in Newyork, USA with major global Financial Clients like MasterCard, London Stock of Exchange, Deutsche Bank. I am a Sun Certified Enterprise Architect (JAVA) professional with core expertise in Managing and Executing Real-time Trading /Foreign exchange / Capital Market/ Investment Banking projects using technologies under Java/J2EE umbrella.