Apache Spark

Delta Lake: Schema Enforcement & Evolution

Reading Time: 4 minutes Data is constantly evolving. As business problems and requirements change, the shape and structure of the data change too. When that happens, we want to be in control of how the schema changes. But how can we achieve this? Delta Lake offers ways to control how a schema changes. With Delta Lake, users have Continue Reading

Spark: createDataFrame() vs toDF()

Reading Time: 2 minutes There are two ways to create a DataFrame in Spark: toDF() and createDataFrame(). In this blog we will see how to create a DataFrame with each method and what exactly differs between them. toDF() The toDF() method provides a very concise way to create a DataFrame. It can be applied to a sequence of objects. To access Continue Reading

Cluster vs Client: Execution modes for a Spark application

Reading Time: 3 minutes Whenever we submit a Spark application to the cluster, the Driver (or the Spark App Master) gets started, and the Driver then starts N workers. The Spark driver manages the SparkContext object, shares data, and coordinates with the workers and the cluster manager across the cluster. The cluster manager can be Spark Standalone, Hadoop YARN, or Mesos. Workers will Continue Reading

Spark: Type Safety in Dataset vs DataFrame

Reading Time: 4 minutes With type safety, a programming language prevents type errors. In other words, the compiler validates types at compile time and reports an error when we try to assign a value of the wrong type to a variable. Spark, a unified analytics engine for big data processing, provides two very useful APIs, DataFrame and Dataset, that are easy to use, intuitive, and expressive, which makes Continue Reading

Dynamic Partition Pruning in Spark 3.0

Reading Time: 6 minutes The release of Spark 3.0 brought major improvements that make Spark execute faster, along with many new features. Dynamic partition pruning is one of them. Before diving into what is new in Dynamic Partition Pruning, let us understand what Partition Pruning is. Partition Pruning in Spark In standard database pruning Continue Reading

Apache Spark: Delta Lake as a Solution – Part II

Reading Time: 3 minutes In Part I we covered the features missing from Apache Spark and the issues they cause when running a data lake. Today we will talk about what Delta Lake is and how it solves the problems discussed in Delta Lake as a Solution: Part I. As we all know, Spark is just a processing engine; it doesn’t Continue Reading

Apache Spark: Delta Lake as a Solution – Part I

Reading Time: 3 minutes Today, everyone is talking about Delta Lake. Why? Have you ever tried to find the answer to this question? Whether you have or not, don’t worry: here in Part I we will discuss exactly that and also address the following questions: What features are missing from Apache Spark? What kind of issues does that cause when running a data lake? Answering the above questions will definitely Continue Reading

Streaming from Kafka to PostgreSQL through Spark Structured Streaming

Reading Time: 3 minutes Hello everyone, in this blog we are going to learn how to do structured streaming in Spark with Kafka and PostgreSQL on our local system. We will be doing all this using Scala, so without any further pause, let’s begin. Setting up the necessities first: Dependencies Set up the required dependencies for Scala, Spark, Kafka and PostgreSQL. 2. PostgreSQL setup Let’s start fresh by Continue Reading

Apache Spark: Handle Corrupt/Bad Records

Reading Time: 3 minutes Writing ETL jobs often becomes very expensive when it comes to handling corrupt records, so ETL pipelines need a good way to deal with them. The larger the ETL pipeline, the more complex it becomes to handle bad records along the way. Corrupt data includes: missing information, incomplete information, schema mismatch, and differing formats or data types. Apache Spark: Continue Reading

Amazon EMR

Reading Time: 3 minutes Businesses worldwide are discovering the power of new big data processing and analytics frameworks like Apache Hadoop and Apache Spark, but they are also discovering some of the challenges of operating these technologies in on-premises data lake environments. They may also have concerns about the future of their current distribution vendor. Common problems of on-premises big data environments include a lack of agility, excessive costs, Continue Reading

Apache Spark: Tricks to Increase Job Performance

Reading Time: 2 minutes Apache Spark is quickly being adopted in the real world, and companies like Uber are using it in production. Spark is gaining popularity because it also lets you develop streaming applications and do machine learning, which helps companies get better results in production along with proper analysis. Although companies are using Spark in Continue Reading