Author Archives: sandeep

About sandeep

I am working as a software consultant at Knoldus Software LLP. I work on Scala, Play, Spark, Hive, HDFS, Hadoop and many other big data technologies.

Knolx: Kick-Start with SMACK Stack


Hi all, Knoldus organized a 30-minute session on Kick-Start with SMACK Stack. Many people joined and enjoyed the session. I am sharing the slides here. Please let me know if you have any questions … Continue reading

Posted in Scala | Leave a comment

Apache Spark internals


In these slides, we will see the internal architecture of a Spark cluster, i.e. what the driver, workers, executors and cluster manager are, how a Spark program runs on the cluster, and what jobs, stages and tasks are. For the video of the above session, click here
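
As a small illustration of those terms (not taken from the slides themselves), the sketch below runs a word count locally; the object name and input path are made up. The single collect() action submits one job, the shuffle introduced by reduceByKey splits it into two stages, and each stage runs one task per partition on the executors:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobStagesDemo {
  def main(args: Array[String]): Unit = {
    // The driver creates the SparkContext and coordinates the cluster manager and executors.
    val conf = new SparkConf().setAppName("job-stages-demo").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                   // shuffle boundary => a new stage

    // The action below submits one job; each stage runs as one task per partition.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```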

Posted in Scala | Leave a comment

Email spam detection using Apache Spark MLlib


In this blog we will see a real use case of Spark MLlib: email spam detection. With the help of the Apache Spark MLlib component, we will detect whether an email should go to the spam folder or the primary … Continue reading
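
A minimal sketch of the idea using the older RDD-based MLlib API (Spark 1.x/2.x); the input file names, feature size and sample message are assumptions for illustration, not taken from the post:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

object SpamDetectionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spam-detection").setMaster("local[2]"))

    // Hypothetical training files: one message per line.
    val spam = sc.textFile("spam.txt")
    val ham  = sc.textFile("ham.txt")

    // Turn each message into a term-frequency feature vector.
    val tf = new HashingTF(10000)
    val spamFeatures = spam.map(m => LabeledPoint(1.0, tf.transform(m.split(" ").toSeq)))
    val hamFeatures  = ham.map(m => LabeledPoint(0.0, tf.transform(m.split(" ").toSeq)))

    val training = spamFeatures.union(hamFeatures).cache()
    val model = LogisticRegressionWithSGD.train(training, 100)

    // 1.0 => spam folder, 0.0 => primary.
    println(model.predict(tf.transform("win a free prize now".split(" ").toSeq)))

    sc.stop()
  }
}
```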

Posted in apache spark, big data, Scala, Spark | Tagged , , | 3 Comments

UDF overloading in Spark


UDFs are User Defined Functions which are registered with the Hive context so that custom functions can be used in Spark SQL queries. For example, if you want to prepend some string to any other string or column, you can create the following … Continue reading
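
For instance, a prepending UDF can be registered and used from SQL roughly like this (a sketch using SparkSession rather than the Hive context; the UDF and view names are made up). Since a registered name is bound to a single function, registering the same name again replaces the earlier one, so variants with different signatures typically need distinct names:

```scala
import org.apache.spark.sql.SparkSession

object UdfPrependSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-prepend").master("local[2]").getOrCreate()
    import spark.implicits._

    // Two variants registered under different names.
    spark.udf.register("prepend", (prefix: String, value: String) => prefix + value)
    spark.udf.register("prependDefault", (value: String) => "knoldus_" + value)

    Seq("spark", "scala").toDF("name").createOrReplaceTempView("names")
    spark.sql("SELECT prepend('mr_', name), prependDefault(name) FROM names").show()

    spark.stop()
  }
}
```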

Posted in apache spark, big data, Scala, Spark | Tagged , , | 3 Comments

Cassandra cluster on local machine with CCM


CCM (Cassandra Cluster Manager) is a tool for creating a Cassandra cluster on a local machine without any difficulty. CCM is only meant for test clusters on a local machine; it is not intended for production use. So we start … Continue reading
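
For reference, a typical CCM workflow looks roughly like this (the Cassandra version and cluster name are just examples):

```bash
# install ccm (one option; it can also be installed from source)
pip install ccm

# create a 3-node local cluster running Cassandra 3.11.4 and start it
ccm create test -v 3.11.4 -n 3 -s

# check node status and open a CQL shell on the first node
ccm status
ccm node1 cqlsh

# stop and remove the cluster when done
ccm stop
ccm remove test
```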

Posted in Scala | Leave a comment

Logging Spark Application on standalone cluster


Logging an application is very important for debugging it, and logging a Spark application on a standalone cluster is a little bit different. We have two components in our Spark application – the driver and the executors. Spark by default uses the log4j logger to log  … Continue reading
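
As a rough sketch (paths, sizes and patterns are illustrative, not the exact setup from the post), each JVM reads a log4j.properties; a rolling file appender keeps the logs on disk:

```properties
# conf/log4j.properties (based on log4j.properties.template)
log4j.rootCategory=INFO, console, file

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/spark/app.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=5
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

To make the driver and the executors pick up such a file, it is usually referenced through `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` with `-Dlog4j.configuration=file:/path/to/log4j.properties`.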

Posted in apache spark, Scala, Spark | Tagged , | 5 Comments

Spark-shell on YARN resource manager: basic steps to create a Hadoop cluster and run Spark on it


In this blog we will install and configure HDFS and YARN with minimal configuration to create a local machine cluster. After that, we will try to submit a job to the YARN cluster with the help of spark-shell, so let's start. Before … Continue reading
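
Once HDFS and YARN are up, launching the shell against the cluster looks roughly like this (the Hadoop path and resource sizes are illustrative):

```bash
# point Spark at the Hadoop/YARN client configuration
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

# interactive shell whose executors run as YARN containers
spark-shell --master yarn --deploy-mode client --num-executors 2 --executor-memory 1g
```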

Posted in Scala | 5 Comments

Ganglia Cluster Monitoring: monitoring a Spark cluster


Ganglia is a cluster monitoring tool for monitoring the health of distributed Spark and Hadoop clusters. I know you all have the question: we already have an Application UI (http://masternode:4040) and a Cluster UI (http://masternode:8080), so why do we need Ganglia? So … Continue reading
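
For context, Spark reports to Ganglia through a metrics sink configured in conf/metrics.properties; the host, port and period below are illustrative, and the GangliaSink class ships in the separate spark-ganglia-lgpl package rather than the default build:

```properties
# conf/metrics.properties
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=239.2.11.71
*.sink.ganglia.port=8649
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds
*.sink.ganglia.mode=multicast
```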

Posted in Scala | 1 Comment

Demystifying Asynchronous Actions in Spark


What if we want to execute two actions concurrently on different RDDs? Spark actions are normally synchronous: if we perform two actions one after the other, they always execute sequentially. Let's see an example. In the … Continue reading
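
A minimal sketch of the idea: countAsync returns a FutureAction, so two jobs can be submitted without waiting for each other (the RDD contents and sizes here are made up):

```scala
import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

object AsyncActionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("async-actions").setMaster("local[4]"))

    val rdd1 = sc.parallelize(1 to 1000000)
    val rdd2 = sc.parallelize(1 to 1000000)

    // Both jobs are submitted immediately and can run concurrently,
    // given enough cores (and, if needed, a fair scheduler pool).
    val count1 = rdd1.countAsync()
    val count2 = rdd2.map(_ * 2).countAsync()

    val total = for {
      c1 <- count1
      c2 <- count2
    } yield c1 + c2

    println(Await.result(total, Duration.Inf))
    sc.stop()
  }
}
```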

Posted in apache spark, Scala, Spark | Tagged , , , | 10 Comments

Tuning an Apache Spark application with speculation


What happens if a Spark job is slow? That is a big question for application performance, so we can optimize jobs in Spark with speculation. It basically starts a copy of a task on another worker if the existing one is … Continue reading
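
For reference, speculation is controlled by a handful of Spark configuration properties, settable on the SparkConf (or via spark-submit --conf); the values below are illustrative, not tuning advice from the post:

```scala
import org.apache.spark.SparkConf

// e.g. in spark-shell or when building the SparkContext
val conf = new SparkConf()
  .setAppName("speculative-app")
  .set("spark.speculation", "true")            // re-launch slow tasks on other executors
  .set("spark.speculation.interval", "100ms")  // how often to check for slow tasks
  .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median counts as slow
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking
```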

Posted in apache spark, big data, Scala, Spark | Tagged , , , , , | 3 Comments