Author: Himanshu Gupta

Upgrade your Spark REST Server with Akka HTTP & Spark 2.0

Reading Time: 2 minutes About a year ago, in one of our blog posts, Spark with Spray Starter Kit, we explained how to create REST services with Spark and Spray. But over the past year there has not been much development on Spray, which suggests that Spray will soon be phased out. So, we decided to upgrade our Spark REST server with Akka HTTP and Spark 2.0. Akka HTTP is a suite of …

Deploy a Spark Application on Cluster

Reading Time: 2 minutes In one of our previous blog posts, Setup a Apache Spark Cluster in your Single Standalone Machine, we showed how to set up a standalone cluster for running Spark applications. But we never discussed how to deploy our Spark applications on that cluster. So, in this blog post, we will see how to deploy our Spark application on a cluster and use it to run our Spark jobs. For …

KnolX: Unit Testing of Spark Applications

Reading Time: < 1 minute Knoldus organized a KnolX session on Wednesday, 13 April 2016. In this session, we explored the different methods of writing unit tests for Spark applications. The session also covers how unit testing of Spark applications is done and the best way to do it. This includes writing unit tests with and without the Spark Testing Base package, which is a Spark …

Boost Factorial Calculation with Spark

Reading Time: 2 minutes We all know that Apache Spark is a fast and general engine for large-scale data processing. It can process data up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. But is that the only task (i.e., MapReduce) for which Spark can be used? The answer is: no. Spark is not only a Big Data processing engine. It is a framework which provides …

Saving Spark DataFrames on Amazon S3 got Easier !!!

Reading Time: < 1 minute In our previous blog post, Congregating Spark Files on S3, we explained how to upload files (saved in a Spark cluster) to Amazon S3. Admittedly, the method explained in that post was a little complex and hard to apply. It also adds a lot of boilerplate to our code. So, we started working on simplifying it and finding an easier way to provide a wrapper around Spark …

Congregating Spark files on S3

Reading Time: 2 minutes We all know that Apache Spark is a fast and general engine for large-scale data processing, and it is because of its speed that Spark has become one of the most popular frameworks in the world of big data. Working with Spark is a pleasant experience, as it has a simple API for Scala, Java, Python and R. But some tasks in Spark are still tough rows …

Fundamentals of eXtreme Programming

Reading Time: < 1 minute The term Extreme Programming (XP) was coined by Kent Beck in the late 1990s. The purpose behind inventing XP was to find a way for small teams to deliver high-quality software and keep up with changing customer requirements. Although XP is more than a decade old now, it is still practiced by many developers around the world. The following slide deck and …

BlinkDB by Databricks Engineer @ Knoldus

Reading Time: < 1 minute On 24 Nov 2015, Sameer Agarwal, Software Engineer at Databricks, gave us an introduction to BlinkDB at the Meetup organized by Knoldus. It was a great session, quite inspiring and appreciated by all attendees, and we are thankful to Sameer. BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy …

Meetup: An Overview of Spark DataFrames with Scala

Reading Time: < 1 minute Knoldus organized a Meetup on Wednesday, 18 Nov 2015, which gave an overview of Spark DataFrames with Scala. Apache Spark is a distributed compute engine for large-scale data processing. A wide range of organizations are using it to process large datasets. Many Spark and Scala enthusiasts attended this session and learned why DataFrames are the best fit for building an application in Spark with Scala …

Simplifying Sorting with Spark DataFrames

Reading Time: 2 minutes In our previous blog post, Using Spark DataFrames for Word Count, we saw how easy it has become to code in Spark using DataFrames. It has also made programming in Spark more logical than technical. So, let's continue our quest to simplify coding in Spark with DataFrames, this time via sorting. We all know that sorting has always been an inseparable part of analytics. Whether it is E-Commerce or Applied …

Using Spark DataFrames for Word Count

Reading Time: 2 minutes As we all know, the DataFrame API was introduced in Spark 1.3.0, in March 2015. Its goal was to make distributed processing of “Big Data” more intuitive by organizing a distributed collection of data (known as an RDD) into named columns. This enabled both engineers and data scientists to use Apache Spark for distributed processing of “Big Data” with ease. The DataFrame API also came with many under-the-hood optimizations …

Spark with Spray Starter Kit

Reading Time: 3 minutes Over the last few months, Spark has gained a lot of momentum in the Big Data world. It has won many competitions and surveys, such as the Daytona Gray Sort 100TB competition, and has become a top-level Apache project, among other milestones. Irrespective of whether it is a product which is a fast/general engine for large-scale data processing, Spark has found its use everywhere. The best part about Spark is that it can …

Gnip using Spark Streaming :- An Apache Spark Utility to pull Tweets from Gnip in realtime

Reading Time: 2 minutes We are all familiar with Gnip, Inc., which provides data from dozens of social media websites via a single API. It is also known as the Grand Central Station for the social media web. One of its popular APIs is PowerTrack, which provides Tweets from Twitter in realtime along with the ability to filter Twitter’s full firehose, giving its customers only what they are interested in. This …