Knoldus Blogs

Apache Spark: Tricks to Increase Job Performance

February 19, 2020 | Analytics, Apache Spark, Big Data and Fast Data, Database, ML, AI and Data Engineering, NoSql, Spark, SQL, Streaming, Streaming Solutions, Studio-Scala | Apache Spark, Big Data, Big Data Analytics, data analysis, data engineering, fast data analytics, Spark Job, Spark Performance, Spark SQL, Spark Streaming

Reading Time: 2 minutes

Apache Spark is quickly being adopted in the real world, and companies such as Uber use it in production. Spark is gaining popularity because it also lets you develop streaming applications and do machine learning, which helps companies get better results in production along with proper analysis. Although companies are using Spark in …

Spark: ACID Transaction with Delta Lake

February 5, 2020 | Apache Spark, Big Data and Fast Data, Java, NoSql, Spark, Studio-Scala | ACID, Apache Spark, Big Data, DataFrame, datasets, delta lake, transaction

Reading Time: 3 minutes

Spark doesn’t provide some of the most essential features of a reliable data processing system, such as atomic APIs and ACID transactions, as discussed in the blog Spark: ACID compliant or not. Spark addresses this problem by working with Delta Lake. Delta Lake acts as an intermediary between Apache Spark and the storage system. Instead of directly interacting with the storage layer, …

Time Travel: Data versioning in Delta Lake

February 2, 2020 | Analytics, Apache Spark, Big Data and Fast Data, Java, Spark, Studio-Scala | Apache Spark, Big Data, Big Data Analytics, BigData, data lake, Data Management, data science, delta lake, Spark, Time Travel

Reading Time: 3 minutes

In today’s Big Data world, we process large amounts of data continuously and store the results in a data lake, so the state of the data lake keeps changing. Sometimes, however, we would like to access a historical version of our data, which requires data versioning. This kind of data management simplifies our data pipeline by making it easy for professionals or organizations to …

Spark: ACID compliant or not

January 24, 2020 (updated March 12, 2021) | Apache Spark, Java, Spark, Studio-Scala | ACID, Apache Spark, Big Data, data science, Database, DataFrame, datasets, transaction, Tutorial

Reading Time: 4 minutes

Spark is not ACID compliant.

Analytics on the edge – How Apache Mesos enabled ships to crunch data

January 2, 2020 (updated October 14, 2020) | Studio-Scala | Apache Mesos, Apache Spark, docker, Zeppelin

Reading Time: 5 minutes

Introduction & the Problem

One of our key customers, a large cruise line, has ships that sail with a few thousand people on board. They are going through a successful digital transformation, which includes managing the full life cycle of a guest on mobile, data science-driven personalization, etc., and we are fortunate to be part of the whole journey. These ships generate varieties of …

Apache Spark: Repartitioning v/s Coalesce

December 30, 2019 | Apache Spark, Big Data and Fast Data, Database, HDFS, ML, AI and Data Engineering, NoSql, Spark, Studio-Scala | Analytics, Apache Spark, Big Data Analytics, data analysis, fast data analytics, partitioning in apache spark, real time analytics, Spark SQL

Reading Time: 3 minutes

Does partitioning help you increase or decrease job performance? Spark splits data into partitions, and computation is done in parallel for each partition. It is very important to understand how data is partitioned and when you need to manually modify the partitioning to run Spark applications efficiently. Now, diving into our main topic, i.e. Repartitioning v/s Coalesce.

What is Coalesce?

The coalesce method reduces the number …
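To make the contrast in the post's title concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's implementation or API: the `coalesce` and `repartition` functions below are hypothetical helpers that treat a dataset as a list of lists. The assumptions it encodes are that coalescing merges whole existing partitions without a full shuffle and cannot increase the partition count, while repartitioning redistributes every row (round-robin here) and can grow or shrink the count.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy coalesce: merge whole neighbouring partitions into n groups.

    Rows are never split out of their original partition, which models
    the absence of a full shuffle. Asking for more partitions than
    currently exist leaves the count unchanged.
    """
    if n >= len(partitions):
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Map old partition index i onto one of the n target groups.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy repartition: a full shuffle that deals every row round-robin
    into n new partitions, so the count can go up or down."""
    rows = [row for part in partitions for row in part]
    out: List[Partition] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


if __name__ == "__main__":
    data = [[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]]  # 4 uneven partitions
    print(coalesce(data, 2))     # [[1, 2, 3, 4], [5, 6, 7, 8, 9, 10]]
    print(repartition(data, 2))  # [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
```

In this model the coalesced result keeps whatever skew the input had, while the repartitioned result is evenly balanced at the cost of moving every row. That is the trade-off to weigh when choosing between the two.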

Kryo Serialization in Spark

December 12, 2019 (updated December 17, 2019) | Apache Spark, Studio-Scala | Apache Spark, scala, Serialization

Reading Time: 4 minutes

Spark provides two types of serialization libraries: Java serialization (the default) and Kryo serialization. For faster serialization and deserialization, Spark itself recommends using Kryo serialization in any network-intensive application. Then why is it not the default?

Why is Kryo not the default in Spark?

The only reason Kryo is not the default is that it requires custom registration. Although Kryo is …

Diving deeper into Delta Lake

October 14, 2019 | Apache Kafka, Apache Spark, Big Data and Fast Data, github, Spark, Streaming, Studio-Scala, Tech Blogs | Apache Spark, Big Data, delta lake, kafka, Kafka Streams, scala, Spark Streaming

Reading Time: 6 minutes

Delta Lake is an open-source storage layer that brings reliability to data lakes. It has numerous reliability features, including ACID transactions, scalable metadata handling, and unified streaming and batch data processing.

Delta Lake To the Rescue

October 7, 2019 | Apache Spark, Big Data and Fast Data, Java, python, Spark, Streaming, Studio-Scala | Apache Spark, batch processing, Big Data, delta lake, Stream Processing

Reading Time: 4 minutes

Welcome back. In our previous blogs, we tried to get some insights about Spark RDDs and also explored some new things in Spark 2.4. You can go through those blogs here:

RDDs – The backbone of Apache Spark
Spark 2.4: Adding a little more Spark to your code

In this blog, we will be discussing something called Delta Lake. But first, let’s …

Big Data Evolution: Migrating on-premise database to Hadoop

July 11, 2019 | Apache Spark, Big Data and Fast Data, HDFS, Studio-Scala, Tableau | Analytics, apache hadoop, Apache Hive, Apache Spark, Big Data, Big Data Analytics, data analysis, Hadoop, Hadoop Distributed File System, HDFS, Hive, MySql, NoSql Database, Spark, Spark with Scala, Tableau

Reading Time: 4 minutes

We are now generating massive volumes of data at an accelerated rate. To meet business needs, address changing market dynamics, and improve decision-making, sophisticated analysis of this data from disparate sources is required. The challenge is how to capture, store, and model these massive pools of data effectively in relational databases. Big data is not a fad. We are just at the beginning …

Using Vertica with Spark-Kafka: Write using Structured Streaming

July 3, 2019 (updated July 16, 2019) | Apache Kafka, Apache Spark, Big Data and Fast Data, Functional Programming, HDFS, Spark, Streaming, Streaming Solutions, Studio-Scala | Apache Kafka, Apache Spark, DataFrame, Kafka Spark, Spark, Spark SQL, spark sql kafka, Spark Structured Streaming, Spark to Vertica, Streaming, Structured Streaming, Vertica, Write to vertica

Reading Time: 3 minutes

In two previous blogs, we explored Vertica and how it can be connected to Apache Spark. The first blog in this mini-series was about reading data from Vertica using Spark and saving that data into Kafka. The next blog explained the reverse flow, i.e. reading data from Kafka and writing it to Vertica, but in batch mode, i.e. reading data from Kafka …

Using Vertica with Spark-Kafka: Writing

July 2, 2019 (updated July 16, 2019) | Apache Kafka, Apache Spark, Database, HDFS, Spark, Studio-Scala | Apache Kafka, Apache Spark, kafka, Spark, Spark Kafka vertica, Spark SQL, spark sql kafka, Spark vertica, Vertica, Write to vertica

Reading Time: 4 minutes

In the previous blog of this series, we glanced over the basic definitions of Spark and Vertica. We also walked through the code for reading data from Vertica into a Spark DataFrame and saving that data into Kafka. In this blog, we will do the reverse flow, i.e. reading the data from Kafka as a DataFrame and writing that DataFrame into …

Using Vertica with Spark-Kafka: Reading

July 2, 2019 (updated July 8, 2019) | Apache Kafka, Apache Spark, Big Data and Fast Data, Database, SQL, Studio-Scala | Apache Kafka, Apache Spark, Database, kafka, Spark, Spark SQL, spark sql kafka, Vertica

Reading Time: 4 minutes

We live in a world of big data, where the volume of data is huge even for small results. This is the result of a rapid increase in data collection in the modern world. This massive volume of data calls for tools that can work on such big chunks of data. I am pretty sure that you guys …