Apache Beam

Apache Beam: Ways to join PCollections

Reading Time: 4 minutes Joining multiple sets of data into a single entity is a very common task when working with data pipelines. In this blog, we will cover how we can perform join operations between datasets in Apache Beam. There are different ways to join PCollections in Apache Beam: extension-based joins, group-by-key-based joins, and joins using side input. Let's understand these different ways to perform a join with examples. We… Continue Reading
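
The excerpt above only lists the approaches; as a rough sketch of one of them (a group-by-key-based join, not code taken from the post itself), two keyed PCollections can be joined with CoGroupByKey in the Java SDK roughly like this. Names such as CoGroupByKeyJoin, emails, and orders are illustrative assumptions:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class CoGroupByKeyJoin {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Two keyed PCollections to join: user -> email and user -> order id (toy data).
    PCollection<KV<String, String>> emails =
        pipeline.apply("Emails", Create.of(KV.of("alice", "alice@example.com")));
    PCollection<KV<String, String>> orders =
        pipeline.apply("Orders", Create.of(KV.of("alice", "order-42")));

    // Tags identify each input inside the joined CoGbkResult.
    final TupleTag<String> emailTag = new TupleTag<>();
    final TupleTag<String> orderTag = new TupleTag<>();

    PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(emailTag, emails)
            .and(orderTag, orders)
            .apply(CoGroupByKey.create());

    // Flatten the grouped result into readable strings.
    joined.apply(
        ParDo.of(
            new DoFn<KV<String, CoGbkResult>, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                String key = c.element().getKey();
                Iterable<String> userEmails = c.element().getValue().getAll(emailTag);
                Iterable<String> userOrders = c.element().getValue().getAll(orderTag);
                c.output(key + ": emails=" + userEmails + ", orders=" + userOrders);
              }
            }));

    pipeline.run().waitUntilFinish();
  }
}
```

The extension-based approach mentioned in the excerpt (Beam's join-library extension) builds a similar CoGroupByKey under the hood, exposing it as ready-made inner/outer join transforms.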

Apache Beam: Side input Pattern

Reading Time: 3 minutes Apache Beam is a unified programming model for defining both batch and streaming data-parallel processing pipelines. It is a modern way of defining data processing pipelines and offers a rich set of APIs and mechanisms to solve complex use cases. In some use cases, while defining our data pipelines, the requirement is that the pipeline should use some additional inputs. For example, in streaming analytics applications, it… Continue Reading
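
As a minimal sketch of the side input pattern described above (not the post's own example; the threshold value and names such as SideInputExample are assumptions made for illustration), a ParDo in the Java SDK can read a singleton side input alongside its main input like this:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The "additional input": a single threshold value materialized as a side input view.
    PCollectionView<Integer> thresholdView =
        pipeline.apply("Threshold", Create.of(10)).apply(View.asSingleton());

    PCollection<Integer> numbers = pipeline.apply("Numbers", Create.of(3, 12, 7, 25));

    // The main ParDo reads the side input for every element it processes.
    numbers.apply(
        "FilterAboveThreshold",
        ParDo.of(
                new DoFn<Integer, Integer>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    int threshold = c.sideInput(thresholdView);
                    if (c.element() > threshold) {
                      c.output(c.element());
                    }
                  }
                })
            .withSideInputs(thresholdView));

    pipeline.run().waitUntilFinish();
  }
}
```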

Stateful processing with Apache Beam

Reading Time: 6 minutes Overview: Beam lets us process unbounded, out-of-order, global-scale data with portable high-level pipelines. Stateful processing is a new feature of the Beam model that expands its capabilities. With these new features, we can unlock newer use cases and newer efficiencies. Quick Recap: In Beam, a big data processing pipeline is a directed acyclic graph of parallel operations called PTransforms processing data from PCollections. The boxes are PTransforms and the edges… Continue Reading
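
As a hedged illustration of stateful processing (again, not the post's own code; the per-key counter logic and names such as StatefulCounter are assumptions), a stateful DoFn in the Java SDK can keep a running count per key and window using @StateId and ValueState:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class StatefulCounter {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        .apply("Events", Create.of(
            KV.of("user-a", "click"), KV.of("user-a", "click"), KV.of("user-b", "click")))
        .apply(
            "CountPerKey",
            ParDo.of(
                new DoFn<KV<String, String>, KV<String, Long>>() {
                  // State cell scoped to the current key and window: the running count.
                  @StateId("count")
                  private final StateSpec<ValueState<Long>> countSpec =
                      StateSpecs.value(VarLongCoder.of());

                  @ProcessElement
                  public void processElement(
                      ProcessContext c, @StateId("count") ValueState<Long> count) {
                    // Read the previous count (null on first element for this key), increment, write back.
                    Long previous = count.read();
                    long current = (previous == null ? 0L : previous) + 1;
                    count.write(current);
                    c.output(KV.of(c.element().getKey(), current));
                  }
                }));

    pipeline.run().waitUntilFinish();
  }
}
```

Because state is partitioned by key and window, the input to a stateful ParDo must be a PCollection of key-value pairs, as in the toy KV data above.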