SQL

Amazon EMR

Reading Time: 3 minutes Businesses worldwide are discovering the power of new big data processing and analytics frameworks like Apache Hadoop and Apache Spark, but they are also discovering some of the challenges of operating these technologies in on-premises data lake environments. They may also have concerns about the future of their current distribution vendor. Common problems of on-premises big data environments include a lack of agility, excessive costs, Continue Reading

Apache Spark: Tricks to Increase Job Performance

Reading Time: 2 minutes Apache Spark is quickly adopting the Real-world and most of the companies like Uber are using it in their production. Spark is gaining its popularity in the market as it also provides you with the feature of developing Streaming Applications and doing Machine Learning, which helps companies get better results in their production along with proper analysis using Spark. Although companies are using Spark in Continue Reading

Apache Spark: Read Data from S3 Bucket

Reading Time: < 1 minute Amazon S3 Accessing S3 Bucket through Spark Edit spark-default.conf file You need to add below 3 lines consists of your S3 access key, secret key & file system spark.hadoop.fs.s3a.access.key “s3keys” spark.hadoop.fs.s3a.secret.key “yourkey” spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem

Logical and Physical Plan in Spark

Understanding Spark’s Logical and Physical Plan in layman’s term

Reading Time: 6 minutes This blog pertains to Apache SPARK 2.x, where we will find out how Spark SQL works internally in layman’s terms and try to understand what is Logical and Physical Plan. Also, we will be looking into Catalyst Optimizer. So let’s get started. First, let’s see what Apache Spark is. The official definition of Apache Spark says that “Apache Spark™ is a unified analytics engine for large-scale Continue Reading

Apache Spark

Deep Dive into Apache Spark Transformations and Action

Reading Time: 4 minutes In our previous blog of Apache Spark, we discussed a little about what Transformations & Actions are? Now we will get deeper into the topic and will understand what actually they are & how they play a vital role to work with Apache Spark? What is Spark RDD? Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable fault-tolerant, distributed collection of objects Continue Reading

Tale of Apache Spark

Reading Time: 6 minutes Data is being produced extensively in today’s world and it is going to be generated more rapidly in future. 90% of total data that is produced in the world is produced in last two years only and it is estimated that in 2020 world’s total data would reach 45 ZB and data generated each day would be enough that if we try to store it Continue Reading

RAP: Let’s discuss the architecture and technical details

Reading Time: 4 minutes In the previous blog, we discussed the journey and the features of RAP. In this blog, we will discuss the architecture and technical details. RAP Architecture: We have tried to follow the Domain-Driven Design and Reactive principles in designing RAP architecture. All RAP team members completed all the Reactive Architecture courses launched by Lightbend on cognitive classes to ensure adherence to reactive principles. These courses Continue Reading

Defining your workflow: Why Not Airflow?

Reading Time: 4 minutes What is Apache Airflow? Airflow is a platform to programmatically author, schedule & monitor workflows or data pipelines. These functions achieved with Directed Acyclic Graphs (DAG) of the tasks. It is an open-source and still in the incubator stage. It was initialized in 2014 under the umbrella of Airbnb since then it got an excellent reputation with approximately 800 contributors on GitHub and 13000 stars. Continue Reading

Using Vertica with Spark-Kafka: Reading

Reading Time: 4 minutes We live in a world of Big data where the size of data is so big even for small results. This is the result of an increase in data collection on a rapid scale in the modern world. This massiveness of data brings the requirements of such tools which can work upon such a big chunk of data. I am pretty sure that you guys Continue Reading

KSQL: Streams and Tables

Reading Time: 3 minutes By now you must be familiar with KSQL and how to get started with it. If not, check out the Part1 KSQL: Getting started with Streaming SQL for Apache Kafka of this series. In this blog, we’ll move one step forward to get an understanding of the Dual streaming model to see what abstractions does KSQL use to process the data. All the data that we Continue Reading