Spark

How to convert Spark RDD into DataFrame and Dataset

Reading Time: 4 minutes In this blog, we will be talking about Spark RDD, DataFrame, and Datasets, and how we can transform an RDD into DataFrames and Datasets. What is RDD? An RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions. RDDs are so integral to the Continue Reading
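As a taste of what the full post walks through, here is a minimal sketch (not the post's exact code) of converting an RDD to a DataFrame and a Dataset in Scala, assuming a local SparkSession and a made-up `Person` case class:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

object RddConversionSketch extends App {
  val spark = SparkSession.builder()
    .appName("rdd-conversion-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Start from a plain RDD of case-class instances.
  val rdd = spark.sparkContext.parallelize(Seq(Person("Asha", 30), Person("Ravi", 25)))

  // toDF gives an untyped DataFrame; toDS keeps the Person type.
  val df = rdd.toDF()
  val ds = rdd.toDS()

  df.show()
  ds.filter(_.age > 26).show()

  spark.stop()
}
```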

Joins in Spark SQL with examples

Reading Time: 4 minutes Spark SQL Spark SQL is a module in Apache Spark. It allows users to process structured data using a SQL-like syntax. It integrates seamlessly with the Spark ecosystem, including Spark Streaming and MLlib. One of the main benefits of using Spark SQL is that it permits users to integrate SQL queries with the programming language of their choice, such as Scala, Python, or Java. Continue Reading
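For a flavour of what the joins post covers, here is a minimal, hypothetical sketch of an inner join expressed both through the DataFrame API and as a SQL query (the table and column names are illustrative, not from the post):

```scala
import org.apache.spark.sql.SparkSession

object JoinSketch extends App {
  val spark = SparkSession.builder()
    .appName("join-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val employees = Seq((1, "Asha", 10), (2, "Ravi", 20)).toDF("id", "name", "deptId")
  val departments = Seq((10, "Engineering"), (20, "Sales")).toDF("deptId", "deptName")

  // DataFrame API inner join on the shared deptId column.
  employees.join(departments, Seq("deptId"), "inner").show()

  // The same join expressed as a SQL query over temp views.
  employees.createOrReplaceTempView("employees")
  departments.createOrReplaceTempView("departments")
  spark.sql(
    "SELECT e.name, d.deptName FROM employees e JOIN departments d ON e.deptId = d.deptId"
  ).show()

  spark.stop()
}
```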

Apache Spark Best Practices and Performance Tuning

Reading Time: 2 minutes We all know that Apache Spark is a big data processing engine that works on the model of in-memory computation. When we are dealing with extensive data, saving even 1 MB of memory per minute can add up to thousands of dollars per month. Hence it becomes essential to learn the Spark best practices and optimization Continue Reading
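One memory-focused practice in the spirit of the post is caching only what you reuse and releasing it promptly; a minimal sketch under that assumption (the data and storage level are illustrative, not taken from the post):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch extends App {
  val spark = SparkSession.builder()
    .appName("tuning-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val df = (1 to 1000000).toDF("n").filter($"n" % 2 === 0)

  // Persist with a serialized level to trade CPU for a smaller memory footprint.
  df.persist(StorageLevel.MEMORY_ONLY_SER)
  println(df.count())  // first action materializes the cache
  println(df.count())  // second action reads from the cache

  // Release the memory as soon as the data is no longer reused.
  df.unpersist()
  spark.stop()
}
```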

Apache Spark’s Developers Friendly Structured APIs: Dataframe and Datasets

Reading Time: 3 minutes This is the second part of the blog series on Spark's structured APIs: Dataframe & Datasets. In the first part we covered the Dataframe API, and I recommend you read that blog first if you are new to Spark. In this blog we'll cover the Spark Datasets API, so let's get started. The Datasets API Datasets are also the combination of two characteristics: typed and untyped Continue Reading
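To illustrate the typed side of the Datasets API mentioned above, here is a minimal sketch in Scala (the `Employee` case class is a made-up example, not from the post):

```scala
import org.apache.spark.sql.SparkSession

// Made-up record type; a Dataset[Employee] is checked at compile time.
case class Employee(name: String, salary: Double)

object DatasetSketch extends App {
  val spark = SparkSession.builder()
    .appName("dataset-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val ds = Seq(Employee("Asha", 90000), Employee("Ravi", 75000)).toDS()

  // Typed transformations: the compiler knows each row is an Employee.
  ds.filter(e => e.salary > 80000).map(_.name).show()

  // Untyped view of the same data: columns are resolved at runtime.
  ds.toDF().select("name", "salary").show()

  spark.stop()
}
```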

Dataframe and Datasets: Apache Spark’s Developers Friendly Structured APIs

Reading Time: 4 minutes This is a two-part blog series in which we'll first cover the Dataframe API and, in the second part, Datasets. Spark 2.x introduced structure into Spark through two concepts: the first is to express a computation by using common patterns found in data analysis, such as filtering, selecting, counting, aggregating, and grouping; the second is to order and structure your data in a Continue Reading
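As a quick sketch of those common patterns on the DataFrame API (illustrative data and column names, not the post's own example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameSketch extends App {
  val spark = SparkSession.builder()
    .appName("dataframe-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Illustrative data, not from the post.
  val sales = Seq(("books", 12.0), ("books", 8.5), ("toys", 20.0)).toDF("category", "amount")

  // Filtering, selecting, grouping, counting, and aggregating in one chain.
  sales
    .filter($"amount" > 10)
    .groupBy($"category")
    .agg(count("*").as("orders"), sum($"amount").as("revenue"))
    .select("category", "orders", "revenue")
    .show()

  spark.stop()
}
```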

Spark 3.0 – Adaptive Query Execution With Example

Reading Time: 4 minutes Introduction Adaptive Query Execution (AQE) is one of the greatest features of Spark 3.0. It reoptimizes and adjusts query plans based on runtime statistics collected during the execution of the query. Need for AQE With each major release, Spark has introduced new optimization features in order to execute queries better and achieve greater performance. Before Spark 3.0, cost-based optimization used table statistics to determine the Continue Reading
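AQE is configuration-driven; a minimal sketch of turning it on explicitly (it ships disabled in Spark 3.0/3.1 and is enabled by default from Spark 3.2 onward):

```scala
import org.apache.spark.sql.SparkSession

object AqeSketch extends App {
  val spark = SparkSession.builder()
    .appName("aqe-sketch")
    .master("local[*]")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
  import spark.implicits._

  val df = (1 to 100000).toDF("n")

  // With AQE on, the shuffle below can be re-planned at runtime,
  // e.g. by coalescing small post-shuffle partitions.
  df.groupBy(($"n" % 10).as("bucket")).count().show()

  spark.stop()
}
```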

RDD to DataFrame Conversion in Spark

Reading Time: 2 minutes Overview In this tutorial, we'll learn how to convert an RDD to a DataFrame in Spark. We'll look into the details by calling each method with different parameters. Along the way, we'll see some interesting examples that'll help us understand concepts better. RDD and DataFrame in Spark RDD and DataFrame are two major APIs in Spark for holding and processing data. RDD provides us with low-level APIs for processing distributed data. Continue Reading
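Two of the conversion methods such a tutorial typically compares can be sketched as follows (a minimal, assumed example; the data and column names are illustrative):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RddToDfSketch extends App {
  val spark = SparkSession.builder()
    .appName("rdd-to-df-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val rdd = spark.sparkContext.parallelize(Seq(("Asha", 30), ("Ravi", 25)))

  // Method 1: toDF with explicit column names (schema inferred from the tuples).
  val df1 = rdd.toDF("name", "age")

  // Method 2: createDataFrame with an RDD[Row] and an explicit schema.
  val schema = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = false)
  ))
  val df2 = spark.createDataFrame(rdd.map { case (n, a) => Row(n, a) }, schema)

  df1.show()
  df2.printSchema()
  spark.stop()
}
```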

Welcome to the world of Apache Spark

Reading Time: 5 minutes Welcome to another very important & interesting topic of big data: Apache Spark. What is Apache Spark? Spark has been called a “general-purpose distributed data processing engine” for big data and machine learning. It lets you process big data sets faster by splitting the work up into chunks and assigning those chunks across computational resources. Why would you want to use Spark? Spark has some Continue Reading

Abstraction in Java

Reading Time: 2 minutes What is Abstraction? Abstraction is one of the most important features of OOP (Object-Oriented Programming). There are four essential features in OOP: encapsulation, inheritance, abstraction, and polymorphism. Here we will discuss one of these four features, i.e., abstraction. It is the process of showing only the necessary details to the user and hiding all the details of the implementation. In other words, we can say Continue Reading
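The post itself is about Java, but the same idea can be sketched in Scala (this archive's primary language) with an abstract class; the `Shape`/`Circle` names are made up for illustration:

```scala
// Abstraction: callers see only the abstract contract,
// not the implementation details behind it.
abstract class Shape {
  def area: Double  // the necessary detail exposed to the user
}

class Circle(radius: Double) extends Shape {
  // Implementation detail hidden behind the Shape abstraction.
  override def area: Double = math.Pi * radius * radius
}

object AbstractionSketch extends App {
  val shape: Shape = new Circle(2.0)  // program against the abstraction
  println(shape.area)
}
```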

Getting Started with Spark 3

Reading Time: 4 minutes Introduction to Apache Spark Big data processing frameworks like Apache Spark provide an interface for programming data clusters with fault tolerance and data parallelism. Apache Spark is broadly used for the speedy processing of large datasets. Apache Spark is an open-source platform, built by a broad group of software developers from more than 200 companies. Over 1,000 developers have contributed to Apache Spark since 2009. Continue Reading

Comparing Data Streaming Frameworks | Scala

Reading Time: 4 minutes We live in an era of technology where the amount of data is growing exponentially and every bit of data holds value. According to some reports, the number of bytes generated and stored in the world so far has already exceeded the number of stars in the sky. As every bit is useful, it is very important to store data without losing any of it. When Continue Reading

Scala vs Python for Apache Spark: An In-depth Comparison

Reading Time: 5 minutes Imagine the first day of a new Apache Spark project. The project manager looks at the team and asks: which one to choose, Scala or Python? So let's start with “Scala vs Python for Spark”. You may wonder if this is a tricky question. What does the enterprise demand say? Is this like asking iOS or Android? Is there a right or wrong answer? So Continue Reading

Install/Configure Hadoop HDFS, YARN Cluster and integrate Spark with it

Reading Time: 5 minutes In our current scenario, we have a 4-node cluster where one is the master node (HDFS NameNode and YARN ResourceManager) and the other three are slave nodes (HDFS DataNodes and YARN NodeManagers). In this cluster, we have implemented Kerberos, which makes the cluster more secure. The Kerberos services are already running on a different server, which is treated as the KDC server. In all Continue Reading
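As a hypothetical sketch of the integration step such a setup leads to, a Spark application can be pointed at the YARN cluster like this (assumes HADOOP_CONF_DIR points at the cluster's configuration; the HDFS path is illustrative, not from the post):

```scala
import org.apache.spark.sql.SparkSession

object YarnIntegrationSketch extends App {
  // Assumes HADOOP_CONF_DIR / YARN_CONF_DIR point at the cluster's config files;
  // on a kerberized cluster the submitting user also needs a valid ticket
  // (or a keytab supplied via spark.kerberos.keytab / spark.kerberos.principal).
  val spark = SparkSession.builder()
    .appName("yarn-integration-sketch")
    .master("yarn")
    .getOrCreate()

  // Illustrative HDFS path, not from the post.
  val lines = spark.read.textFile("hdfs:///tmp/sample.txt")
  println(lines.count())

  spark.stop()
}
```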