apache

An Overview of Apache Beam Features

Reading Time: 3 minutes In this guide, we’ll talk about Apache Beam and discuss its fundamental concepts. We will begin by showing the features and advantages of using Apache Beam, and then we will cover basic concepts and terminology. Ever since the concept of big data was introduced to the programming world, a lot of different technologies and frameworks have emerged. The processing of data can be categorized into Continue Reading

Apache Beam ParDo Transformations

Reading Time: 2 minutes What is a PCollection? A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism. Your pipeline typically creates an initial PCollection by reading data from an external data source, but you can also create a PCollection form of Continue Reading

DataFrame and Datasets: Apache Spark’s Developer-Friendly Structured APIs

Reading Time: 4 minutes This is a two-part blog series: the first part covers the DataFrame API and the second part covers Datasets. Spark 2.x brought structure to Spark by introducing two ideas. The first is to express computations using common patterns found in data analysis, such as filtering, selecting, counting, aggregating, and grouping. The second is to order and structure your data in a Continue Reading

Apache Airflow: DAG Structure and Data Pipeline

Reading Time: 6 minutes What is a DAG in Apache Airflow? In this blog, we will look at the basic structure of a DAG in Apache Airflow and configure our first data pipeline. In Apache Airflow, DAG stands for Directed Acyclic Graph, which means it is a graph with nodes, directed edges, and no cycles. An Apache Airflow DAG is a data pipeline Continue Reading
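
The "nodes, directed edges, and no cycles" definition in the Airflow teaser above can be illustrated with a minimal sketch in plain Python. It does not use Airflow itself; the function name topological_order and the task names extract, transform, and load are hypothetical, chosen only for illustration. The function orders tasks so that every upstream task comes before its downstream tasks, and rejects any graph that contains a cycle.

```python
from collections import deque


def topological_order(edges):
    """Return tasks in a valid run order; raise ValueError if there is a cycle.

    `edges` is a list of (upstream, downstream) pairs. This is an
    illustrative helper (Kahn's algorithm), not part of Airflow.
    """
    nodes = set()
    for upstream, downstream in edges:
        nodes.update((upstream, downstream))

    indegree = {node: 0 for node in nodes}
    children = {node: [] for node in nodes}
    for upstream, downstream in edges:
        children[upstream].append(downstream)
        indegree[downstream] += 1

    # Start with the tasks that have no upstream dependencies.
    ready = deque(sorted(node for node in nodes if indegree[node] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Any task left unordered sits on a cycle, so the graph is not a DAG.
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


# Hypothetical three-step pipeline: extract -> transform -> load.
pipeline = [("extract", "transform"), ("transform", "load")]
print(topological_order(pipeline))  # ['extract', 'transform', 'load']
```

Adding an edge from load back to extract would create a cycle, and the function would raise ValueError instead of returning an order.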

Apache NiFi – The Ingestion Tool

Reading Time: 3 minutes What is Apache NiFi? Apache NiFi is open source software for automating and managing the flow of data between systems, leveraging the concept of Extract, Transform, and Load. Apache NiFi is a powerful and reliable system to process and distribute data. Additionally, Apache NiFi has a web-based user interface for design, control, feedback, and monitoring of dataflows. History of Apache NiFi Based on Continue Reading

Apache Cassandra: CQL Commands

Reading Time: 4 minutes In the previous two blogs of the Apache Cassandra series, we explained the Basics of Apache Cassandra and How Cassandra Reads and Writes. In this blog, we will cover another important topic in Apache Cassandra: CQL commands. So let us name this blog “Apache Cassandra: CQL commands”. We recommend going through the other two blogs of this series before diving Continue Reading

Log4j CVE-2021-45105: All we know is WRONG!!

Reading Time: 3 minutes The Apache security team disclosed a third Log4j2 vulnerability on the night between Dec 17 and 18. This vulnerability is termed CVE-2021-45105. According to the security advisory, version 2.16.0, which fixed the two previous vulnerabilities, is susceptible to a DoS attack caused by a stack overflow in Context Lookups in the configuration file’s layout patterns. What is this CVE about? What can you do Continue Reading

Apache Airflow – A Workflow Manager

Reading Time: 4 minutes As the industry becomes more data driven, we need solutions that can process the large amounts of data involved. A workflow management system provides an infrastructure for the set-up, performance, and monitoring of a defined sequence of tasks, arranged as a workflow application. Workflow management has become such a common need that most Continue Reading

Scoverage Analysis | Scala | SBT

Reading Time: 3 minutes Scoverage: what it is, how to use it, and which build tools it is available for. In this blog, we are going to discuss all of these along with its implementation in SBT. What is scoverage? “scoverage” is a free, Apache-licensed code coverage tool for the Scala language that offers statement and branch coverage. It is available for SBT, Maven, and Gradle. Advantage Continue Reading

Reading Avro files using Apache Flink

Reading Time: 2 minutes In this blog, we will see how to read Avro files using Flink. Before reading the files, let’s get an overview of Flink. There are two types of processing: batch and real-time. Batch Processing: Processing based on the data collected over time. Real-time Processing: Processing based on immediate data for an instant result. Real-time processing is in demand and Apache Flink is the Continue Reading

Using Apache Flink for Kinesis to Kafka Connect

Reading Time: 3 minutes In this blog, we are going to use Kinesis as a source and Kafka as a consumer. Let’s get started. Step 1: Apache Flink provides the Kinesis and Kafka connector dependencies. Let’s add them to our build.sbt: Step 2: The next step is to create a pointer to the environment on which this program runs. Step 3: Setting parallelism of x here will cause all Continue Reading

Writing Java APIs using Apache Atlas Client

Reading Time: 2 minutes In the previous blog, Data Governance using Apache ATLAS, we discussed the advantages and use cases of Apache Atlas as a data governance tool. Continuing from it, we will discuss building our own Java APIs that interact with Apache Atlas, using the Apache Atlas client to create new entities and types in it. How to create new Entities and Types using Continue Reading