Big Data

Cassandra Data Modeling – Primary, Clustering, Partition, and Compound Keys

Reading Time: 5 minutes In this post we are going to discuss the different kinds of keys available in Cassandra. The primary key concept in Cassandra is different from that in relational databases, so it is worth spending time to understand it. Let's take an example and create a student table that has a student_id as its primary key column. 1) primary key  create table person (student_id int primary key, fname Continue Reading
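
As an illustration of how these keys relate (a minimal CQL sketch; the table and column names below are hypothetical, not taken from the post), the first component of a compound primary key becomes the partition key and the remaining components become clustering columns:

```sql
-- Hypothetical table: course_id is the partition key and
-- student_id is a clustering column; together they form a
-- compound primary key.
CREATE TABLE enrollment (
    course_id  int,
    student_id int,
    fname      text,
    PRIMARY KEY (course_id, student_id)
);
-- All rows with the same course_id live in the same partition,
-- sorted by student_id within it.
```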

Spark – IoT: Combining Big Data Analysis with IoT

Reading Time: 3 minutes Welcome back, folks! Time for a new gig! I think the last series, Scala – IOT, was pretty amazing; it got an overwhelming response from you all, which pumped up the idea for this new web series, Spark-IOT. So let's get started. What was the motivation? I have been active in the IoT community here, and I found Continue Reading

Hive-Metastore: A Basic Introduction

Reading Time: 3 minutes As we know, a database is one of the most important and powerful parts of any organisation. It is a collection of schemas, tables, relationships, queries and views; in short, an organized collection of data. But have you ever thought about these questions: How does a database manage all the tables? How does it manage all the relationships? How do we perform all these operations so easily? Is there Continue Reading

Is using Accumulators really worth it? Apache Spark

Reading Time: 2 minutes Before jumping right into the topic, you must know what Accumulators are; for that, you can refer to this blog. Now that we know the what and why of Accumulators, let's jump to the main point. Description: Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. Example: if the node running a partition of a map() operation crashes, Spark will rerun it Continue Reading
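
To make the re-execution caveat concrete, here is a minimal sketch (assuming the Spark 2.x Scala API; the names and data are illustrative) of an accumulator updated inside a map(), which is exactly the situation where a re-run task could apply the update more than once:

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorSketch extends App {
  val spark = SparkSession.builder()
    .appName("accumulator-sketch")
    .master("local[*]")
    .getOrCreate()
  val sc = spark.sparkContext

  // Track records that fail to parse.
  val badRecords = sc.longAccumulator("badRecords")

  val parsed = sc.parallelize(Seq("1", "2", "oops", "4")).map { s =>
    try s.toInt
    catch { case _: NumberFormatException => badRecords.add(1); 0 }
  }

  parsed.count() // the action triggers the map() above

  // Accumulator updates are only guaranteed exactly-once when made
  // inside actions; a re-executed map task may double-count, which
  // is the caveat the post discusses.
  println(s"bad records: ${badRecords.value}")

  spark.stop()
}
```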

Broadcast variables in Spark: how and when to use them?

Reading Time: 2 minutes As the documentation for Spark broadcast variables states, they are immutable shared variables which are cached on each worker node of a Spark cluster. In this blog, we will demonstrate a simple use case of broadcast variables. When to use a broadcast variable? Think of a problem like counting grammar elements for any random English paragraph, document or file. Suppose you have the Map of each word as specific Continue Reading
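
As a minimal sketch of that use case (assuming the Spark 2.x Scala API; the tiny word-to-grammar-element Map below is invented for illustration), each executor fetches the broadcast Map once instead of it being shipped with every task:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch extends App {
  val spark = SparkSession.builder()
    .appName("broadcast-sketch")
    .master("local[*]")
    .getOrCreate()
  val sc = spark.sparkContext

  // A small lookup table of word -> grammar element (illustrative only).
  val grammar = Map("the" -> "article", "run" -> "verb", "quick" -> "adjective")

  // Ship the Map to every worker once, as a read-only broadcast variable.
  val bcGrammar = sc.broadcast(grammar)

  val words = sc.parallelize(Seq("the", "quick", "run", "the"))
  val counts = words
    .map(w => (bcGrammar.value.getOrElse(w, "unknown"), 1))
    .reduceByKey(_ + _)

  counts.collect().foreach(println) // e.g. (article,2), (verb,1), ...

  spark.stop()
}
```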

Aggregating Neighboring vertices with Apache Spark GraphX Library

Reading Time: 2 minutes To understand the problems addressed by “Neighborhood Aggregation”, we can think of queries like: “Who has the maximum number of followers under 20 on Twitter?” In this blog, we will learn how to aggregate properties of neighboring vertices in a graph with Apache Spark's GraphX library. The Spark shell will be enough to understand the code example. So, let us get back to the problem statement. Let Continue Reading
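
A minimal sketch of that kind of query with GraphX's aggregateMessages (the toy graph below is invented; the post's own example may differ), counting each user's followers who are under 20:

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object NeighborhoodSketch extends App {
  val spark = SparkSession.builder()
    .appName("graphx-sketch")
    .master("local[*]")
    .getOrCreate()
  val sc = spark.sparkContext

  // Vertices carry an age; an edge (a, b) means "a follows b".
  val users: RDD[(VertexId, Int)] =
    sc.parallelize(Seq((1L, 18), (2L, 25), (3L, 19), (4L, 40)))
  val follows: RDD[Edge[Int]] =
    sc.parallelize(Seq(Edge(1L, 4L, 1), Edge(2L, 4L, 1), Edge(3L, 2L, 1)))
  val graph = Graph(users, follows)

  // For every vertex, sum a "1" message sent by each follower under 20.
  val youngFollowers: VertexRDD[Int] = graph.aggregateMessages[Int](
    triplet => if (triplet.srcAttr < 20) triplet.sendToDst(1),
    _ + _
  )

  // The vertex with the most followers under 20.
  println(youngFollowers.collect().maxBy(_._2))

  spark.stop()
}
```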

Saving Spark DataFrames on Amazon S3 got Easier!

Reading Time: < 1 minute In our previous blog post, Congregating Spark Files on S3, we explained how we can upload files (saved in a Spark cluster) to Amazon S3. Well, I agree that the method explained in that post was a little complex and hard to apply, and it adds a lot of boilerplate to our code. So, we started working on simplifying it and finding an easier way to provide a wrapper around Spark Continue Reading
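
The post describes a wrapper of its own; as a rough sketch of the underlying idea only (assuming the Hadoop s3a connector with the hadoop-aws module on the classpath, and a hypothetical bucket name, not the post's actual wrapper API), writing a DataFrame straight to S3 looks roughly like:

```scala
import org.apache.spark.sql.SparkSession

object S3SaveSketch extends App {
  val spark = SparkSession.builder()
    .appName("s3-save-sketch")
    .master("local[*]")
    .getOrCreate()

  // Credentials for the s3a connector (placeholders; never hard-code real keys).
  val hadoopConf = spark.sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY>")
  hadoopConf.set("fs.s3a.secret.key", "<AWS_SECRET_KEY>")

  import spark.implicits._
  val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

  // Write directly to a (hypothetical) S3 bucket as Parquet.
  df.write.mode("overwrite").parquet("s3a://my-example-bucket/output/")

  spark.stop()
}
```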

Congregating Spark files on S3

Reading Time: 2 minutes We all know that Apache Spark is a fast and general engine for large-scale data processing, and it is because of its speed that Spark has become one of the most popular frameworks in the world of big data. Working with Spark is a pleasant experience, as it has a simple API for Scala, Java, Python and R. But some tasks in Spark are still tough rows Continue Reading

BlinkDB by Databricks Engineer @ Knoldus

Reading Time: < 1 minute On 24 Nov 2015, Sameer Agarwal, Software Engineer at Databricks, gave us an introduction to BlinkDB at the meetup organized by Knoldus. It was a great session, and we are thankful to Sameer; it was quite inspiring and appreciated by all attendees. BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy Continue Reading

Meetup: An Overview of Spark DataFrames with Scala

Reading Time: < 1 minute Knoldus organized a meetup on Wednesday, 18 Nov 2015, in which an overview of Spark DataFrames with Scala was given. Apache Spark is a distributed compute engine for large-scale data processing, and a wide range of organizations use it to process large datasets. Many Spark and Scala enthusiasts attended this session and got to know why DataFrames are the best fit for building an application in Spark with Scala Continue Reading

Simplifying Sorting with Spark DataFrames

Reading Time: 2 minutes In our previous blog post, Using Spark DataFrames for Word Count, we saw how easy it has become to code in Spark using DataFrames, and how it has made programming in Spark much more logical rather than technical. So, let's continue our quest for simplifying coding in Spark with DataFrames, via sorting. We all know that sorting has always been an inseparable part of analytics, whether it is E-Commerce or Applied Continue Reading
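
As a minimal sketch of DataFrame sorting (assuming the Spark 2.x Scala API; the tiny product dataset below is invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SortingSketch extends App {
  val spark = SparkSession.builder()
    .appName("sorting-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val products = Seq(("laptop", 900), ("mouse", 25), ("monitor", 200))
    .toDF("product", "price")

  // Sort by a single column, descending.
  products.orderBy(desc("price")).show()

  // Sort by several columns with mixed directions.
  products.sort(col("product").asc, col("price").desc).show()

  spark.stop()
}
```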

Introduction to Machine Learning with Spark (Clustering)

Reading Time: 2 minutes In this blog, we will learn how to group similar data objects using the K-means clustering offered by the Spark Machine Learning Library. Prerequisites The code example needs only the Spark shell to execute. What is Clustering? Clustering means grouping data objects into clusters (with no initial class or group defined) on the basis of their similarity, or natural closeness, to each other. The “closeness” Continue Reading
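
A minimal sketch of K-means with the RDD-based MLlib API (the 2-D points below are invented so that the two clusters are easy to see):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansSketch extends App {
  val spark = SparkSession.builder()
    .appName("kmeans-sketch")
    .master("local[*]")
    .getOrCreate()
  val sc = spark.sparkContext

  // Two obvious groups of 2-D points.
  val points = sc.parallelize(Seq(
    Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
    Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
  ))

  // Train with k = 2 clusters and up to 20 iterations.
  val model = KMeans.train(points, 2, 20)

  model.clusterCenters.foreach(println)
  println(model.predict(Vectors.dense(0.5, 0.5))) // cluster of a new point

  spark.stop()
}
```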

Using Spark DataFrames for Word Count

Reading Time: 2 minutes As we all know, the DataFrame API was introduced in Spark 1.3.0, in March 2015. Its goal was to make distributed processing of “Big Data” more intuitive, by organizing distributed collections of data (known as RDDs) into named columns. This enabled both engineers and data scientists to use Apache Spark for distributed processing of “Big Data” with ease. Also, the DataFrame API came with many under-the-hood optimizations Continue Reading
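
A minimal sketch of a DataFrame-based word count (assuming the Spark 2.x Scala API; the sample line is invented, and the post's own code may differ):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WordCountSketch extends App {
  val spark = SparkSession.builder()
    .appName("wordcount-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val lines = Seq("to be or not to be").toDF("line")

  // Split each line into words, one word per row, then count per word.
  val counts = lines
    .select(explode(split(col("line"), " ")).as("word"))
    .groupBy("word")
    .count()

  counts.show() // e.g. (to,2), (be,2), (or,1), (not,1)

  spark.stop()
}
```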