Knoldus Blogs

Using Vertica with Spark-Kafka: Write using Structured Streaming

Reading Time: 3 minutes

In two previous blogs, we explored Vertica and how it can be connected to Apache Spark. The first blog in this mini-series was about reading data from Vertica using Spark and saving that data into Kafka. The next blog explained the reverse flow, i.e. reading data from Kafka and writing it to Vertica, but in batch mode, i.e. reading data from Kafka ... Continue Reading

Using Vertica with Spark-Kafka: Writing

July 2, 2019 | July 16, 2019 | Apache Kafka, Apache Spark, Database, HDFS, Spark, Studio-Scala | Apache Kafka, Apache Spark, kafka, Spark, Spark Kafka vertica, Spark SQL, spark sql kafka, Spark vertica, Vertica, Write to vertica

Reading Time: 4 minutes

In the previous blog of this series, we glanced at the basic definitions of Spark and Vertica. We also walked through code for reading data from Vertica into a Spark DataFrame and saving that data into Kafka. In this blog we will do the reverse flow, i.e. reading the data from Kafka as a DataFrame and writing that DataFrame into ... Continue Reading

Using Vertica with Spark-Kafka: Reading

July 2, 2019 | July 8, 2019 | Apache Kafka, Apache Spark, Big Data and Fast Data, Database, SQL, Studio-Scala | Apache Kafka, Apache Spark, Database, kafka, Spark, Spark SQL, spark sql kafka, Vertica

Reading Time: 4 minutes

We live in a world of big data, where the data involved is huge even for small results. This is the result of a rapid increase in data collection in the modern world. Data at this scale calls for tools that can work on such large chunks of data. I am pretty sure that you guys ... Continue Reading

Do you really need Spark? Think Again!

June 14, 2019 | Apache Spark, Big Data and Fast Data, Functional Programming, ML, AI and Data Engineering, Spark, Studio-Scala, Tech Blogs | Apache Spark, Big Data, Big Data Analytics, HDFS, scala, Spark Streaming, Spark with Scala

Reading Time: 5 minutes

With the massive growth in big data technologies today, it is becoming very important to use the right tool for every process. The process can be anything: data ingestion, data processing, data retrieval, data storage, etc. Today we are going to focus on one of those popular big data technologies, Apache Spark. Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. Spark ... Continue Reading

Spark: Introduction to Datasets

March 4, 2019 | Apache Spark, Big Data and Fast Data, Spark, Studio-Scala | Big Data, dataframes, datasets, RDDs, Spark, Structured Streaming

Reading Time: 3 minutes

In my previous blog, Spark: RDD vs DataFrames, I discussed the shortcomings of RDDs and how DataFrames overcome them. Now we’ll look at the shortcomings of DataFrames and how the Dataset API can overcome them. DataFrames: A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to relational tables with ... Continue Reading

Spark Streaming vs. Structured Streaming

February 28, 2019 | Apache Spark, Big Data and Fast Data, Spark, Streaming, Streaming Solutions, Studio-Scala | Apache Spark, Spark Streaming, Spark Structured Streaming, Streaming, Streaming Spark, Structured Streaming

Reading Time: 6 minutes

Fan of Apache Spark? I am too. The reason is simple: interesting APIs to work with, fast and distributed processing, far less disk I/O overhead than MapReduce, fault tolerance, and much more. With this much, you can do a lot in this world of big data and fast data. From “processing huge chunks of data” to “working on streaming data”, Spark handles it all. In this ... Continue Reading

Spark: RDD vs DataFrames

February 26, 2019 | Apache Spark, Big Data and Fast Data, Spark, Studio-Scala | Big Data, DataFrame, datasets, RDDs in Spark, Spark, Spark Streaming, Spark Structured Streaming

Reading Time: 3 minutes

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. One use of Spark SQL is to execute SQL queries. When running SQL from within another ... Continue Reading

Optimizing Indexing speed in Elasticsearch with Spark

January 27, 2019 | January 27, 2019 | Apache Spark, Spark, Studio-Scala | Apache Spark, Benchmark, elasticsearch, indexing

Reading Time: 4 minutes

Knolx: How Spark does it internally?

January 14, 2019 | Apache Spark, Big Data and Fast Data, Studio-Scala | DAG in Spark, RDDs in Spark, Spark Internals, Spark with Scala, What is Spark?

Reading Time: < 1 minute

Knoldus organized a 30-minute session on Oct 12 at 3:30 PM. The topic was "How Spark does it internally?" Many people joined and enjoyed the session. I am going to share the slides and the video here. Please let me know if you have any questions related to the linked slides. How Spark Does It Internally? from Knoldus Inc. Here’s the video of the ... Continue Reading

Apache Spark 2.4: Adding a little more Spark to your code

December 16, 2018 | Apache Spark, Java, Spark, Streaming, Studio-Scala | Apache Spark, avro, higher order function, scala, spark 2.4

Reading Time: 5 minutes

Continuing with the objective of making Spark faster, easier, and smarter, Apache Spark recently shipped the fifth release in its 2.x line, i.e. Spark 2.4. We were lucky enough to experiment with it early in one of our projects. Today we will try to highlight the major changes in this version that we explored and experienced in our project. In our ... Continue Reading

Tuning a Spark Application

November 14, 2018 | Apache Spark, Big Data and Fast Data, HDFS, Spark, Studio-Scala

Reading Time: 4 minutes

Having trouble optimizing your Spark application? If yes, then this blog will guide you on how to optimize it and which parameters should be tuned so that your Spark application gives the best performance. Spark applications can be bottlenecked by resources such as CPU, memory, and network. We need to tune our memory usage, data structures, how RDDs need to ... Continue Reading

HDFS: A Conceptual View

November 12, 2018 | Apache Spark, Big Data and Fast Data, HDFS, Spark, Studio-Scala | BigData, Hadoop Distributed File System, What is HDFS

Reading Time: 5 minutes

There has been a significant boom in distributed computing over the past few years. Various components communicate with each other over the network in spite of being deployed on different physical machines. A distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it were stored on the local client machine. The DFS makes it convenient to share information ... Continue Reading

Spark: Why should we use SparkSession?

October 10, 2018 | November 16, 2018 | Apache Spark, Spark, Studio-Scala, Tech Blogs

Reading Time: 5 minutes

Spark 2.0 is the next major release of Apache Spark. It brings major changes to the level of abstraction of the Spark API and libraries. For those who want to make use of all the advancements in this release, this blog post discusses SparkSession.

Need of SparkSession