Big Data

Big Data Evolution: Migrating on-premise database to Hadoop

Reading Time: 4 minutes We are now generating massive volumes of data at an accelerated rate. To meet business needs, address changing market dynamics as well as improve decision-making, sophisticated analysis of this data from disparate sources is required. The challenge is how to capture, store and model these massive pools of data effectively in relational databases. Big data is not a fad. We are just at the beginning Continue Reading

Do you really need Spark? Think Again!

Reading Time: 5 minutes With the massive amount of increase in big data technologies today, it is becoming very important to use the right tool for every process. The process can be anything like Data ingestion, Data processing, Data retrieval, Data Storage, etc. Today we are going to focus on one of those popular big data technologies i.e., Apache Spark. Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark Continue Reading

Commit Log: A commitment that Cassandra provides.

Reading Time: 5 minutes Welcome back, everyone. I have been working on Cassandra for quite some time now but never actually got to explore its working in depth. We know that its decentralized nature, as well as its ability to handle such a large volume of writes, makes it really commendable. But how does it manage to be efficient? How is it able to achieve what it is so Continue Reading

Spark: Introduction to Datasets

Reading Time: 3 minutes As I have already discussed in my previous blog Spark: RDD vs DataFrames about the shortcomings of RDDs and how DataFrames overcome them. Now we’ll try to have a look at the shortcomings of DataFrames and how Dataset APIs can overcome them. DataFrames:- A DataFrame is a distributed collection of data, which is organized into named columns. Conceptually, it is equivalent to the relational tables with Continue Reading

Spark: RDD vs DataFrames

Reading Time: 3 minutes Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.One use of Spark SQL is to execute SQL queries. When running SQL from within another Continue Reading

Knoldus-Clutch-AI-Big-Data-Top

Knoldus Joins Clutch’s Research of Top AI & Big Data Companies in 2018

Reading Time: 2 minutes The advent of the digital economy is a development that has changed the landscapes of every industry across the world. There is a new key ingredient for success; the best performing businesses are those with the best digital platforms, built to drive performance and bring customer interaction to new heights. At Knoldus, we are a team of developers and innovators dedicated to helping businesses reach Continue Reading

Is Apache Cassandra really the Database you need?

Reading Time: 6 minutes Welcome back, everyone. It has been quite some time since I have been working with Cassandra. To be honest, it is a quite cool database. Its decentralized nature, as well as its ability to handle such a large volume of writes, is really commendable. But as we know nothing is perfect. So is the Cassandra Database. What I mean by this is that you cannot Continue Reading

The curious case of Cassandra Reads

Reading Time: 5 minutes In our previous blog, we discovered how Cassandra handles its write queries. Now it’s time to understand how it ensures all the read requests are fulfilled. Let’s first have an overall view of Cassandra. Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of Continue Reading

Cassandra Writes: A Mystery?

Reading Time: 5 minutes Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a peer to peer database where each node in the cluster constantly communicates with each other to share and receive information (node status, data ranges and so on). There is no Continue Reading

Difference between Apache Hadoop and Apache Spark Mapreduce

Reading Time: 4 minutes The term Big Data has created a lot of hype already in the business world. Hadoop and Spark are both Big Data frameworks – they provide some of the most popular tools used to carry out common Big Data-related tasks. In this blog, we will cover what is the difference between Apache Hadoop and Apache Spark MapReduce. Introduction Spark – It is an open source Continue Reading

fetching data from different sources using Spark 2.1

What’s new in Apache Spark 2.2

Reading Time: 2 minutes Apache recently released a newer version of Spark i.e Apache Spark 2.2. The new version comes with new improvements as well as the addition of new functionalities. The major addition to this release is Structured Streaming. It has been marked as production ready and its experimental tag has been removed. Some of the high-level changes and improvements : Production ready Structured Streaming Expanding SQL functionalities New Continue Reading

Having Issue How To Order Streamed Dataframe ?

Reading Time: 3 minutes A few days ago, i have to perform aggregation on streaming dataframe. And the moment, i apply groupBy for aggregation, data gets shuffled. Now the situation arises how to maintain order? Yes, i can use orderBy with streaming dataframe using Spark Structured Streaming, but only in complete mode. There is no way of doing ordering of streaming data in append mode and update mode. I Continue Reading

Spark Structured Streaming: A Simple Definition

Reading Time: 3 minutes “Structured Streaming”, nowadays we are hearing this term in Apache Spark ecosystem quite a lot, as it is being preached as next big thing in scalable big data world. Although, we all know that Structured Streaming means a stream having structured data in it, but very few of us knows what exactly it is and where we can use it. So, in this blog post Continue Reading

Knoldus Pune Careers - Hiring Freshers

Get a head start on your career at Knoldus. Join us!