Spark: Streaming Datasets

Reading Time: 3 minutes Spark providing us a high-level API – Dataset, which makes it easy to get type safety and securely perform manipulation in a distributed and a local environment without code changes. Also, spark structured streaming, a high-level API for stream processing allows us to stream a particular Dataset which is nothing but a type-safe structured streams. In this blog, we will see how we can create Continue Reading

fetching data from different sources using Spark 2.1

Spark: createDataFrame() vs toDF()

Reading Time: 2 minutes There are two different ways to create a Dataframe in Spark. First, using toDF() and second is using createDataFrame(). In this blog we will see how we can create Dataframe using these two methods and what’s the exact difference between them. toDF() toDF() method provides a very concise way to create a Dataframe. This method can be applied to a sequence of objects. To access Continue Reading

fetching data from different sources using Spark 2.1

Spark: Type Safety in Dataset vs DataFrame

Reading Time: 4 minutes With type safety, programming languages prevents type errors, or we can say that type safety means the compiler will validate type while compiling, and throw an error when we try to assign a wrong type to a variable. Spark, a unified analytics engine for big data processing provides two very useful API’s DataFrame and Dataset that is easy to use, and are intuitive and expressive which makes Continue Reading

Spark: ACID Transaction with Delta Lake

Reading Time: 3 minutes Spark doesn’t provide some of the most essential features of a reliable data processing system such as Atomic APIs and ACID transactions as discussed in the blog Spark: ACID compliant or not. Spark welcomes a solution to the problem by working with Delta Lake. Delta Lake plays an intermediary service between Apache Spark and the storage system. Instead of directly interacting with the storage layer, Continue Reading

Using Vertica with Spark-Kafka: Write using Structured Streaming

Reading Time: 3 minutes In two previous blogs, we explored about Vertica and how it can be connected to Apache Spark. The first blog in this mini series was about reading data from Vertica using Spark and saving that data into Kafka. The next blog explained the reverse flow i.e. reading data from Kafka and writing data to Vertica but in a batch mode. i.e reading data from Kafka Continue Reading

Spark: RDD vs DataFrames

Reading Time: 3 minutes Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.One use of Spark SQL is to execute SQL queries. When running SQL from within another Continue Reading

Having Issue How To Order Streamed Dataframe ?

Reading Time: 3 minutes A few days ago, i have to perform aggregation on streaming dataframe. And the moment, i apply groupBy for aggregation, data gets shuffled. Now the situation arises how to maintain order? Yes, i can use orderBy with streaming dataframe using Spark Structured Streaming, but only in complete mode. There is no way of doing ordering of streaming data in append mode and update mode. I Continue Reading

Difference between RDD , DF and DS in Spark

Reading Time: 3 minutes In this blog I try to cover the difference between RDD, DF and DS. much of you have a little bit confused about RDD, DF and DS. so don’t worry after this blog everything will be clear. With Spark2.0 release, there are 3 types of data abstractions which Spark officially provides now to use: RDD, DataFrame and DataSet. so let’s start some discussion about it. Continue Reading