Apache Spark

Logging a Spark application on a standalone cluster

Reading Time: 1 minute Logging is very important for debugging an application, and logging a Spark application on a standalone cluster is a little different. A Spark application has two components – the Driver and the Executor. Spark uses the log4j logger by default, so whenever we run Spark on a local machine or in spark-shell, it picks up the default log4j.properties from /spark/conf/log4j.properties, where the default logging setting is rootCategory=INFO, console. But when Continue Reading
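As a minimal sketch of the file the teaser refers to, a log4j.properties along these lines (placed under SPARK_HOME/conf on each node) changes the root category from INFO to WARN; the exact contents here are an assumption based on Spark's bundled template, not taken from the post:

```properties
# Assumed sketch of SPARK_HOME/conf/log4j.properties
# Log everything to the console at WARN instead of the default INFO
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

On a standalone cluster the same file has to be present on the driver and on every worker, since Driver and Executor each produce their own logs.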

A sample ML Pipeline for Clustering in Spark

Reading Time: 2 minutes Often a machine learning task contains several steps, such as extracting features out of raw data, creating learning models to train on those features, and running predictions on the trained models. With the help of the Pipeline API provided by Spark, it is easier to combine and tune multiple ML algorithms into a single workflow. What is in the blog? We will create a sample ML pipeline Continue Reading
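A minimal sketch of such a pipeline, assuming Spark 1.5+ and made-up sample documents (these names and data are illustrative, not from the post): raw text is tokenized, hashed into term-frequency features, and clustered with KMeans, all chained as one Pipeline.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("pipeline-demo"))
val sqlContext = new SQLContext(sc)

// Assumed sample data: id + raw text.
val docs = sqlContext.createDataFrame(Seq(
  (0L, "spark streaming windows"),
  (1L, "dataframe sql queries"),
  (2L, "spark batch jobs")
)).toDF("id", "text")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val kmeans = new KMeans().setK(2).setFeaturesCol("features")

// The Pipeline chains all stages into a single Estimator.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans))
val model = pipeline.fit(docs)          // runs every stage in order
val clustered = model.transform(docs)   // adds a "prediction" column
```

Fitting the pipeline runs feature extraction and model training as one workflow, which is exactly what makes tuning the whole chain together possible.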

Congregating Spark files on S3

Reading Time: 2 minutes We all know that Apache Spark is a fast and general engine for large-scale data processing, and it is because of this speed that Spark has become one of the most popular frameworks in the world of big data. Working with Spark is a pleasant experience, as it has a simple API for Scala, Java, Python and R. But some tasks in Spark are still tough rows Continue Reading
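One way to congregate Spark output into a single file on S3 is to coalesce the RDD to one partition before saving; the sketch below assumes the old `s3n` filesystem scheme, and the bucket name and credentials are placeholders, not values from the post:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("s3-save"))

// Placeholder credentials for the s3n filesystem (assumed, not from the post).
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val data = sc.parallelize(Seq("a", "b", "c"), 3)
// coalesce(1) congregates all partitions into a single part file,
// instead of one part-file per partition.
val single = data.coalesce(1)
single.saveAsTextFile("s3n://your-bucket/output")
```

Without the coalesce, `saveAsTextFile` writes one `part-xxxxx` file per partition, which is usually not what downstream consumers of an S3 bucket want.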

Meetup: An Overview of Spark DataFrames with Scala

Reading Time: 1 minute Knoldus organized a Meetup on Wednesday, 18 Nov 2015, in which an overview of Spark DataFrames with Scala was given. Apache Spark is a distributed compute engine for large-scale data processing, and a wide range of organizations are using it to process large datasets. Many Spark and Scala enthusiasts attended this session and got to know why DataFrames are the best fit for building an application in Spark with Scala Continue Reading

Simplifying Sorting with Spark DataFrames

Reading Time: 2 minutes In our previous blog post, Using Spark DataFrames for Word Count, we saw how easy it has become to code in Spark using DataFrames. It has also made programming in Spark more logical and less technical. So, let's continue our quest for simplifying coding in Spark with DataFrames, via sorting. We all know that sorting has always been an inseparable part of analytics. Whether it is E-Commerce or Applied Continue Reading
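A minimal sketch of DataFrame sorting (the column names and rows are assumed for illustration): `orderBy` sorts ascending by default, and `$"col".desc` flips the order.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("df-sort"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Assumed sample data.
val people = sc.parallelize(Seq(("Ram", 30), ("Shyam", 25), ("Mohan", 35)))
  .toDF("name", "age")

val byAge = people.orderBy("age")          // ascending by default
val byAgeDesc = people.orderBy($"age".desc) // descending
byAgeDesc.show()
```

Compared to sorting an RDD of tuples with `sortBy`, the DataFrame version states intent directly in terms of named columns.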

MeetUp on “An Overview of Spark DataFrames with Scala”

Reading Time: 1 minute Knoldus is organizing a one-hour session on 18th Nov 2015 at 6:00 PM. The topic will be An Overview of Spark DataFrames with Scala. All of you are invited to join this session. Address: 30/29, First Floor, Above UCO Bank, Near Rajendra Place Metro Station, New Delhi, India. Please click here for more details.

Using Spark DataFrames for Word Count

Reading Time: 2 minutes As we all know, the DataFrame API was introduced in Spark 1.3.0, in March 2015. Its goal was to make distributed processing of “Big Data” more intuitive, by organizing a distributed collection of data (known as an RDD) into named columns. This enabled both engineers and data scientists to use Apache Spark for distributed processing of “Big Data” with ease. Also, the DataFrame API came with many under-the-hood optimizations Continue Reading
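A minimal word-count sketch with the 1.3-era DataFrame API (the sample lines are assumed): each line is exploded into words, then grouped and counted.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("df-wordcount"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Assumed sample input lines.
val lines = sc.parallelize(Seq("spark is fast", "spark is general")).toDF("line")

// explode turns each line into one row per word.
val words = lines.explode("line", "word") { line: String => line.split(" ") }
val counts = words.groupBy("word").count()
counts.show()
```

The same job with plain RDDs needs a `flatMap` / `map` / `reduceByKey` chain; the DataFrame version reads closer to the SQL one would write by hand.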

Demystifying Asynchronous Actions in Spark

Reading Time: 3 minutes What if we want to execute two actions concurrently on different RDDs? Spark actions are always synchronous: if we perform two actions one after the other, they always execute sequentially. Let's see an example: val rdd = sc.parallelize(List(32, 34, 2, 3, 4, 54, 3), 4) rdd.collect().map{ x => println("Items in the lists:" + x)} val rddCount = sc.parallelize(List(434, 3, 2, 43, Continue Reading
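The asynchronous counterparts of these actions are `collectAsync` and `countAsync`, which return `FutureAction`s so both jobs can be submitted without waiting for each other. A sketch (the second RDD's contents are assumed, since the original example is truncated):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.ExecutionContext.Implicits.global

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("async-actions"))
val rdd = sc.parallelize(List(32, 34, 2, 3, 4, 54, 3), 4)
val rddCount = sc.parallelize(1 to 100, 4)   // assumed data

val itemsF = rdd.collectAsync()     // FutureAction[Seq[Int]] – job submitted, not awaited
val countF = rddCount.countAsync()  // FutureAction[Long] – submitted immediately after

// Both jobs are now in flight; attach callbacks instead of blocking.
itemsF.foreach(items => items.foreach(x => println("Items in the lists:" + x)))
countF.foreach(n => println("Count: " + n))
```

Note that true parallel execution of the two jobs also depends on the scheduler having free resources (e.g. the FAIR scheduling mode); the async API only removes the blocking on the driver side.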

Scala Days 2015 Pulse – Sentiment Analysis of the Sessions

Reading Time: 2 minutes As you might have read, the Pulse application from Knoldus created a flutter at Scala Days 2015. In this post, let us dig deeper into some findings. We are already getting some cool requests to open source the code, which we will do soon; we would just like to strip out the dictionaries that we have used. We collected 5K+ tweets over the period of Continue Reading

Tuning an Apache Spark application with speculation

Reading Time: 2 minutes What happens if a Spark job is slow? That is a big question for application performance, so we can optimize jobs in Spark with speculation. Speculation basically starts a copy of a slow job on another worker. It does not stop the slow execution; both workers execute the job simultaneously. To make our job speculative we need to set Continue Reading
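A sketch of the speculation knobs set on a SparkConf; the values shown are Spark's documented defaults apart from `spark.speculation` itself, which is off by default:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("speculative-app")
  .set("spark.speculation", "true")           // enable speculative execution (default: false)
  .set("spark.speculation.interval", "100ms") // how often to check for slow tasks
  .set("spark.speculation.multiplier", "1.5") // how many times slower than the median counts as slow
  .set("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish before checking
```

The same properties can equally be passed with `--conf` on spark-submit instead of being hard-coded.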

Start/Deploy an Apache Spark application programmatically using SparkLauncher

Reading Time: 1 minute Sometimes we need to start our Spark application from another Scala/Java application, and for that we can use SparkLauncher. Here we have an example in which we build a Spark application and run it from another Scala application. Let's see our Spark application code. import org.apache.spark.SparkConf import org.apache.spark.SparkContext object SparkApp extends App { val conf = new SparkConf().setMaster("local[*]").setAppName("spark-app") val sc = new SparkContext(conf) val rdd = sc.parallelize(Array(2, 3, 2, 1)) rdd.saveAsTextFile("result") sc.stop() } This is our simple Spark Continue Reading
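The launching side can then be sketched as below (requires Spark 1.4+; the Spark home and jar paths are placeholders, not values from the post). `SparkLauncher.launch()` spawns a spark-submit child process and hands back its `Process` handle:

```scala
import org.apache.spark.launcher.SparkLauncher

// Sketch with assumed paths: launch the SparkApp object above from another JVM.
def launchSparkApp(sparkHome: String, appJar: String): Process =
  new SparkLauncher()
    .setSparkHome(sparkHome)      // assumed path to the Spark installation
    .setAppResource(appJar)       // assumed path to the jar containing SparkApp
    .setMainClass("SparkApp")
    .setMaster("local[*]")
    .launch()                     // returns a java.lang.Process for the child

// Usage (blocking until the Spark application finishes):
// val p = launchSparkApp("/path/to/spark", "/path/to/spark-app.jar")
// p.waitFor()
```

Because the result is a plain `Process`, the calling application can monitor, wait for, or kill the Spark job like any other child process.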

Stateful transformations on DStreams in Apache Spark, with a word-count example

Reading Time: 2 minutes Sometimes we have a use case in which we need to maintain the state of a paired DStream, to use it in the next DStream. So we take the example of a stateful word count over socketTextStream. As in the word-count example, if the word “xyz” comes twice in the first DStream or window, it is reduced to a value of 2, but that state will be lost in the next DStream Continue Reading
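The standard way to carry that state across batches is `updateStateByKey`; a minimal sketch (host, port and checkpoint directory are assumptions, and a checkpoint directory is mandatory for stateful streams):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("stateful-wordcount")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("checkpoint-dir")   // required by updateStateByKey (assumed path)

// Merge this batch's counts for a word into its previous running total.
val updateState = (newCounts: Seq[Int], state: Option[Int]) =>
  Some(newCounts.sum + state.getOrElse(0))

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val runningCounts = words.map(w => (w, 1)).updateStateByKey(updateState)
runningCounts.print()

ssc.start()
// ssc.awaitTermination()   // uncomment to block the driver until the stream is stopped
```

With this, “xyz” appearing twice in one batch and once in the next prints a running count of 3, instead of the per-batch 2 then 1 that a plain `reduceByKey` would give.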

Shuffling and repartitioning of RDDs in Apache Spark

Reading Time: 3 minutes To write an optimized Spark application, you should use transformations and actions carefully; the wrong transformation or action will make your application slow. So while writing an application, some points should be kept in mind to make it more optimized. 1. Number of partitions when creating an RDD By default, Spark creates one partition for each block of the file in HDFS it is Continue Reading
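The partition-related part of this can be sketched as follows (the data and partition counts are assumed for illustration): the partition count can be set when creating the RDD, and `repartition` versus `coalesce` differ in whether they trigger a shuffle.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("partitions"))

val rdd = sc.parallelize(1 to 1000, 8)   // ask for 8 partitions explicitly
println(rdd.partitions.length)

val more = rdd.repartition(16)   // increases partitions, always triggers a full shuffle
val fewer = rdd.coalesce(2)      // decreases partitions, avoids a shuffle
println(more.partitions.length)
println(fewer.partitions.length)
```

Because `coalesce` merely merges existing partitions on the same workers, it is the cheaper choice when you only ever need to reduce the partition count.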
