Knoldus Blogs

Apache Spark’s Developer-Friendly Structured APIs: DataFrame and Datasets

June 15, 2022 | June 16, 2022 | Studio-Scala | Apache Spark, APIs, class, DataFrame, datasets, filter, scala, Set, Spark, transformations

Reading Time: 3 minutes
This is the second part of the blog series on Spark’s structured APIs, DataFrame and Datasets. The first part covered DataFrames, and I recommend reading it first if you are new to Spark. In this blog we’ll cover the Spark Datasets API, so let’s get started. The Datasets API: Datasets are also the combination of two characteristics: typed and untyped … Continue Reading

Basics of Machine Learning and Its Algorithms: You Need to Know

May 10, 2021 | May 10, 2021 | Studio-Scala | Algorithms, Data, datasets, Machine Learning, reinforcement learning, supervised learning, Unsupervised Learning

Reading Time: 6 minutes
Machine Learning and Its Algorithms: Hi folks! Are you intrigued by machine learning and its algorithms? If so, welcome. You have come to the right place. In this blog you will learn about machine learning and its algorithms. By the end of the blog, you will have a basic understanding of this field. Machine Learning: The term is self-explanatory enough that there is going to … Continue Reading

Spark: Type Safety in Dataset vs DataFrame

May 16, 2020 | May 16, 2020 | Apache Spark, Big Data and Fast Data, Studio-Scala | DataFrame, datasets, RDD, Resilient Distributed Dataset

Reading Time: 4 minutes
With type safety, a programming language prevents type errors. In other words, the compiler validates types at compile time and reports an error when we try to assign a value of the wrong type to a variable. Spark, a unified analytics engine for big data processing, provides two very useful APIs, DataFrame and Dataset, that are easy to use, intuitive and expressive, which makes … Continue Reading

Spark: ACID Transaction with Delta Lake

February 5, 2020 | February 5, 2020 | Apache Spark, Big Data and Fast Data, Java, NoSql, Spark, Studio-Scala | ACID, Apache Spark, Big Data, DataFrame, datasets, delta lake, transaction

Reading Time: 3 minutes
Spark doesn’t provide some of the most essential features of a reliable data processing system, such as atomic APIs and ACID transactions, as discussed in the blog “Spark: ACID compliant or not”. Spark addresses this problem by working with Delta Lake. Delta Lake acts as an intermediary service between Apache Spark and the storage system. Instead of directly interacting with the storage layer, … Continue Reading

Spark: ACID compliant or not

January 24, 2020 | March 12, 2021 | Apache Spark, Java, Spark, Studio-Scala | ACID, Apache Spark, Big Data, data science, Database, DataFrame, datasets, transaction, Tutorial

Reading Time: 4 minutes
Spark is not ACID compliant.

Knolx: Structured Streaming in Spark

April 15, 2019 | Functional Programming, Studio-Scala | dataframes, datasets, knolx, RDD, Structured Streaming

Reading Time: < 1 minute
Knoldus organized a session on February 8, 2019. The topic was “Understanding Spark Structured Streaming”. Many people attended and enjoyed the session. In this blog post, I am going to share the slides and video of the session. If you have any questions, please feel free to comment below.

Spark: Introduction to Datasets

March 4, 2019 | Apache Spark, Big Data and Fast Data, Spark, Studio-Scala | Big Data, dataframes, datasets, RDDs, Spark, Structured Streaming

Reading Time: 3 minutes
In my previous blog, “Spark: RDD vs DataFrames”, I discussed the shortcomings of RDDs and how DataFrames overcome them. Now we’ll look at the shortcomings of DataFrames and how the Dataset API can overcome them. DataFrames: A DataFrame is a distributed collection of data, which is organized into named columns. Conceptually, it is equivalent to the relational tables with … Continue Reading

Spark: RDD vs DataFrames

February 26, 2019 | Apache Spark, Big Data and Fast Data, Spark, Studio-Scala | Big Data, DataFrame, datasets, RDDs in Spark, Spark, Spark Streaming, Spark Structured Streaming

Reading Time: 3 minutes
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. One use of Spark SQL is to execute SQL queries. When running SQL from within another … Continue Reading

Difference between RDD, DF and DS in Spark

July 28, 2017 | Apache Spark, Spark, Studio-Scala | DataFrame, datasets, difference between rdd df ds in spark, FD, RDD, Spark

Reading Time: 3 minutes
In this blog I try to cover the differences between RDD, DF and DS. Many of you may be a little confused about RDD, DF and DS, but don’t worry: after this blog everything will be clear. With the Spark 2.0 release, Spark now officially provides three types of data abstractions: RDD, DataFrame and Dataset. So let’s start the discussion. Continue Reading

The Dominant APIs of Spark: Datasets, DataFrames and RDDs

April 16, 2017 | April 17, 2017 | Spark | Apache Spark, dataframes, datasets, performance optimization, RDD, space optimization, spark apis

Reading Time: 4 minutes
While working with Spark, we often come across the three APIs: DataFrames, Datasets and RDDs. In this blog I will discuss the three in terms of use case, performance and optimization. It is essential to keep in mind that seamless transformation is available between the three: DataFrames, Datasets and RDDs. Implicitly, the RDD forms the apex of both DataFrames and Datasets. The inception of … Continue Reading