Apache Spark

Introduction to Accumulators: Apache Spark

Reading Time: 2 minutes What's the Problem: Functions like map() and filter() can use variables defined outside them in the driver program, but each task running on the cluster gets a new copy of each variable, and updates to these copies are not propagated back to the driver. The Solution: Spark provides two types of shared variables: 1. Accumulators 2. Broadcast variables. Here we are Continue Reading
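The accumulator half of that solution can be sketched as follows. This is an illustrative sketch, not the post's code: it assumes Spark on the classpath, a local master, and uses the `longAccumulator` API of Spark 2.x+.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("accumulator-demo"))

    // A plain var updated inside foreach would only change per-task copies;
    // an accumulator's updates are merged back into the driver.
    val blankLines = sc.longAccumulator("blank lines")

    val lines = sc.parallelize(Seq("spark", "", "accumulators", ""))
    lines.foreach(line => if (line.isEmpty) blankLines.add(1))

    // The merged value is only reliable on the driver, after an action has run.
    println(s"Blank lines seen across all tasks: ${blankLines.value}")
    sc.stop()
  }
}
```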

Broadcast variables in Spark: how and when to use them?

Reading Time: 2 minutes As the documentation for Spark Broadcast variables states, they are immutable shared variables which are cached on each worker node in a Spark cluster. In this blog, we will demonstrate a simple use case of broadcast variables. When to use a Broadcast variable? Think of a problem such as counting grammar elements for any random English paragraph, document or file. Suppose you have the Map of each word as specific Continue Reading
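A minimal sketch of that word-classification use case (illustrative only: the wordType lookup table, its contents, and the local master are assumptions, not the post's actual data):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("broadcast-demo"))

    // Hypothetical lookup table: word -> grammar element.
    val wordType = Map("quick" -> "adjective", "fox" -> "noun", "jumps" -> "verb")

    // Broadcasting ships one read-only copy per worker
    // instead of one copy per task.
    val wordTypeBc = sc.broadcast(wordType)

    val words = sc.parallelize(Seq("quick", "fox", "jumps", "the"))
    val counts = words
      .map(w => wordTypeBc.value.getOrElse(w, "unknown"))
      .countByValue()

    println(counts)
    sc.stop()
  }
}
```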

UDF overloading in Spark

Reading Time: 2 minutes UDFs are User Defined Functions which are registered with the Hive context to use custom functions in Spark SQL queries. For example, if you want to prepend some string to any other string or column, then you can create the following UDF: def addSymbol(input: String, symbol: String) = { symbol + input } Now, to register the above function in hiveContext, we register the UDF as follows: hiveContext.udf.register("addSymbol", (input: String, symbol: String) => addSymbol(input, symbol)) Now you can use Continue Reading
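Cleaned up, the snippet from this excerpt looks like the following. The function itself is plain Scala; the registration and query lines assume a hiveContext in scope as in the post, so they are shown as comments:

```scala
// Plain Scala function to expose as a Spark SQL UDF.
def addSymbol(input: String, symbol: String): String = symbol + input

// Registration against the Hive context, as in the post
// (requires a running HiveContext, so shown here for context):
//   hiveContext.udf.register("addSymbol",
//     (input: String, symbol: String) => addSymbol(input, symbol))
//   hiveContext.sql("SELECT addSymbol(name, '$') FROM people")

println(addSymbol("100", "$"))  // prints "$100"
```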

Aggregating Neighboring Vertices with the Apache Spark GraphX Library

Reading Time: 2 minutes To understand the problems addressed by “Neighborhood Aggregation”, we can think of queries like: “Who has the maximum number of followers under 20 on Twitter?” In this blog, we will learn how to aggregate properties of neighboring vertices on a graph with Apache Spark’s GraphX library. The Spark shell will be enough to understand the code examples. So, let us get back to the problem statement. Let Continue Reading
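A sketch of what such a query can look like with GraphX's aggregateMessages (the vertex data and edges are made up for illustration; assumes spark-graphx on the classpath and a local master):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object NeighborhoodAggDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("graphx-demo"))

    // Vertices carry (name, age); edges point follower -> followee.
    val users = sc.parallelize(Seq(
      (1L, ("alice", 18)), (2L, ("bob", 25)), (3L, ("carol", 19))))
    val follows = sc.parallelize(Seq(
      Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(2L, 1L, 1)))
    val graph = Graph(users, follows)

    // For each followee, count followers who are under 20.
    val youngFollowers = graph.aggregateMessages[Int](
      ctx => if (ctx.srcAttr._2 < 20) ctx.sendToDst(1), // message per qualifying edge
      _ + _                                             // merge: sum per vertex
    )
    youngFollowers.collect().foreach(println)
    sc.stop()
  }
}
```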

Logging a Spark Application on a Standalone Cluster

Reading Time: < 1 minute Logging is very important for debugging an application, and logging a Spark application on a standalone cluster is a little bit different. We have two components in our Spark application: the Driver and the Executor. Spark uses the log4j logger by default to log the application. So whenever we use Spark on a local machine or in spark-shell, it uses the default log4j.properties from /spark/conf/log4j.properties; by default, Spark's logging rootCategory=INFO, console. But when Continue Reading
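For reference, a custom log4j.properties along the lines of Spark's shipped template might look like this (appender layout and logger names are illustrative):

```properties
# conf/log4j.properties (illustrative)
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Quieten noisy third-party loggers
log4j.logger.org.eclipse.jetty=WARN
```

On a cluster, a common way to point the driver and executors at such a file is via spark-submit, e.g. `--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.properties"` and the corresponding `spark.executor.extraJavaOptions`.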

A sample ML Pipeline for Clustering in Spark

Reading Time: 2 minutes Often a machine learning task contains several steps, such as extracting features out of raw data, creating learning models to train on those features, and running predictions on trained models. With the help of the pipeline API provided by Spark, it is easier to combine and tune multiple ML algorithms into a single workflow. What is in the blog? We will create a sample ML pipeline Continue Reading
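A minimal sketch of such a pipeline, assuming spark-mllib on the classpath and K-Means as the clustering algorithm (the post's actual choice of algorithm and data may differ):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object ClusteringPipelineDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("ml-pipeline-demo").getOrCreate()
    import spark.implicits._

    // Toy raw data: two obvious clusters.
    val data = Seq((1.0, 1.1), (1.2, 0.9), (9.0, 9.2), (8.8, 9.1))
      .toDF("x", "y")

    // Stage 1: turn raw columns into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x", "y")).setOutputCol("features")
    // Stage 2: cluster the feature vectors.
    val kmeans = new KMeans().setK(2).setFeaturesCol("features")

    // Chain the stages into one workflow that can be fit and tuned as a unit.
    val model = new Pipeline().setStages(Array(assembler, kmeans)).fit(data)
    model.transform(data).show()
    spark.stop()
  }
}
```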

Congregating Spark files on S3

Reading Time: 2 minutes We all know that Apache Spark is a fast and general engine for large-scale data processing, and it is because of this speed that Spark was able to become one of the most popular frameworks in the world of big data. Working with Spark is a pleasant experience, as it has a simple API for Scala, Java, Python and R. But some tasks in Spark are still tough rows Continue Reading

Meetup: An Overview of Spark DataFrames with Scala

Reading Time: < 1 minute Knoldus organized a Meetup on Wednesday, 18 Nov 2015. In this Meetup, an overview of Spark DataFrames with Scala was given. Apache Spark is a distributed compute engine for large-scale data processing, and a wide range of organizations are using it to process large datasets. Many Spark and Scala enthusiasts attended this session and got to know why DataFrames are the best fit for building an application in Spark with Scala Continue Reading

Simplifying Sorting with Spark DataFrames

Reading Time: 2 minutes In our previous blog post, Using Spark DataFrames for Word Count, we saw how easy it has become to code in Spark using DataFrames. It has also made programming in Spark much more logical rather than technical. So, let's continue our quest for simplifying coding in Spark with DataFrames via sorting. We all know that sorting has always been an inseparable part of analytics. Whether it is e-commerce or applied Continue Reading
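With DataFrames, sorting reduces to a one-liner. A sketch (assuming a local SparkSession; the sample data is made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc}

object DataFrameSortingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("df-sorting-demo").getOrCreate()
    import spark.implicits._

    val orders = Seq(("laptop", 899.0), ("mouse", 25.0), ("monitor", 199.0))
      .toDF("product", "price")

    // orderBy / sort are equivalent; desc() flips the direction.
    orders.orderBy(desc("price")).show()
    orders.sort(col("price")).show()

    spark.stop()
  }
}
```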

Meetup on “An Overview of Spark DataFrames with Scala”

Reading Time: < 1 minute Knoldus is organizing a one-hour session on 18th Nov 2015 at 6:00 PM. The topic will be An Overview of Spark DataFrames with Scala. All of you are invited to join this session. Address: 30/29, First Floor, Above UCO Bank, Near Rajendra Place Metro Station, New Delhi, India. Please click here for more details.

Using Spark DataFrames for Word Count

Reading Time: 2 minutes As we all know, the DataFrame API was introduced in Spark 1.3.0, in March 2015. Its goal was to make distributed processing of “Big Data” more intuitive by organizing distributed collections of data (known as RDDs) into named columns. This enabled both Engineers and Data Scientists to use Apache Spark for distributed processing of “Big Data” with ease. Also, the DataFrame API came with many under-the-hood optimizations Continue Reading
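A word count over named columns can be sketched like this. Note this uses the modern SparkSession entry point rather than the SQLContext of the post's 1.3 era, and the sample line is made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, lower, split}

object DataFrameWordCountDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("df-word-count").getOrCreate()
    import spark.implicits._

    val lines = Seq("To be or not to be").toDF("line")

    // Split each line into words, one word per row, then group and count.
    val counts = lines
      .select(explode(split(lower(col("line")), "\\s+")).as("word"))
      .groupBy("word")
      .count()

    counts.show()
    spark.stop()
  }
}
```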

Demystifying Asynchronous Actions in Spark

Reading Time: 3 minutes What if we want to execute two actions concurrently on different RDDs? Spark actions are always synchronous: if we perform two actions one after the other, they always execute sequentially. Let's see an example: val rdd = sc.parallelize(List(32, 34, 2, 3, 4, 54, 3), 4) rdd.collect().map{ x => println("Items in the lists:" + x)} val rddCount = sc.parallelize(List(434, 3, 2, 43, Continue Reading
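The asynchronous counterparts, collectAsync and countAsync, return a FutureAction instead of blocking, so both jobs can be submitted concurrently. A sketch using the excerpt's data (assumes a local SparkContext; the Await timeout is an arbitrary choice):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.Await
import scala.concurrent.duration._

object AsyncActionsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("async-demo"))

    val rdd      = sc.parallelize(List(32, 34, 2, 3, 4, 54, 3), 4)
    val rddCount = sc.parallelize(List(434, 3, 2, 43, 3), 4)

    // Both jobs are submitted to the scheduler before either finishes.
    val itemsFuture = rdd.collectAsync()
    val countFuture = rddCount.countAsync()

    // FutureAction extends scala.concurrent.Future, so we can Await on it.
    val items = Await.result(itemsFuture, 1.minute)
    val count = Await.result(countFuture, 1.minute)

    items.foreach(x => println("Items in the lists:" + x))
    println("Count: " + count)
    sc.stop()
  }
}
```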

Scala Days 2015 Pulse – Sentiment Analysis of the Sessions

Reading Time: 2 minutes As you might have read, the Pulse application from Knoldus created a flutter at Scala Days 2015. In this post, let us dig deeper into some of the findings. We are already getting some cool requests to open source the code, which we will do soon; we would just like to strip out the dictionaries that we have used. We collected 5K+ tweets over the period of Continue Reading
