Author: Anubhav Tarar

Spark Stream-Stream Join

Tuning Spark on YARN

Reading Time: 2 minutes In this blog, we will learn how to tune YARN with Spark in both yarn-client and yarn-cluster modes. The only requirement to get started is that you must have a Hadoop-based YARN-Spark cluster with you. In case you want to create a cluster, you can follow this blog here. 1. yarn-client mode: In client mode, the driver runs in the client process, and the application master is only used Continue Reading
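A minimal sketch of the client-mode configuration discussed in the post, assuming Spark 2.x; the memory and core values below are placeholders to tune per cluster:

import org.apache.spark.SparkConf

// client mode: the driver runs in the submitting process, and the YARN
// application master only negotiates executor containers
val conf = new SparkConf()
  .setAppName("yarn-tuning-demo")
  .setMaster("yarn")
  .set("spark.submit.deployMode", "client")
  .set("spark.executor.memory", "4g")  // placeholder, size to your nodes
  .set("spark.executor.cores", "2")    // placeholder
  .set("spark.yarn.am.memory", "1g")   // application master memory (client mode only)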

A Step-by-Step Guide to Setting Up a Multi-Node Mesos Cluster with Spark and HDFS on EC2

Reading Time: 3 minutes Apache Mesos is an open-source project for managing computer clusters, originally developed at the University of California. It sits between the application layer and the operating system so that applications run efficiently on large-scale distributed environments. In this blog, we will see how to set up a Mesos client and master on EC2 from scratch. Step 1: Launch EC2 with the configuration below: AMI Server: Ubuntu Server (ami-41e0b93b) Continue Reading

How Does Spark Use MapReduce?

Reading Time: 2 minutes In this post, we will talk about an interesting scenario: does Spark use MapReduce or not? The answer to the question is yes, it uses MapReduce, but only the idea, not the exact implementation. Let's talk about an example. To read a text file from Spark, what we all do is spark.sparkContext.textFile("fileName"), but do you know how it actually works? Try to Ctrl-click on this method Continue Reading
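Paraphrasing what that Ctrl-click reveals: textFile delegates to the classic MapReduce TextInputFormat through hadoopFile. A rough sketch of the equivalent call, assuming a SparkSession named spark:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// textFile("fileName") boils down to a hadoopFile call like this one
val lines = spark.sparkContext.hadoopFile(
  "fileName",               // the same path textFile takes
  classOf[TextInputFormat], // the classic MapReduce input format
  classOf[LongWritable],    // key: byte offset of each line in the file
  classOf[Text]             // value: the line itself
).map(pair => pair._2.toString) // keep only the line text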

How To Use Hive Without Hadoop

Reading Time: < 1 minute The reason for writing this blog is to answer the most common question: can we use Hive without Hadoop? So let's get started. The answer is yes: starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local mode automatically. You just have to do two things: first, create your warehouse on the local file system, and then set the default fs name to local. Put these Continue Reading
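As a sketch of the same two settings, here they are expressed through Spark's embedded Hive support (the local paths are hypothetical; the post itself configures plain Hive):

import org.apache.spark.sql.SparkSession

// both properties point Hive at the local file system instead of HDFS
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///home/user/hive/warehouse") // hypothetical local warehouse
  .config("spark.hadoop.fs.defaultFS", "file:///")                       // default fs name set to local
  .enableHiveSupport()
  .getOrCreate()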

How to Query an External Hive Metastore From Spark

Reading Time: < 1 minute In this blog we will learn how to access tables from a Hive metastore in Spark, so let's get started. Start your Hive metastore as a service with the following command: hive --service metastore. By default it will start the metastore on port 9083. Then go to your Spark client and tell the driver to connect to the metastore listening on port 9083 when starting the Hive thrift server Continue Reading
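A minimal sketch of the Spark side, assuming the metastore service started above is listening on the default port 9083 on localhost:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("external-metastore-demo")
  .config("hive.metastore.uris", "thrift://localhost:9083") // the metastore started above
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show tables").show() // these tables now come from the external metastore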

Spark on Mesos (Installation)

Reading Time: 2 minutes In this article we will learn how to use Mesos with Spark, so let's get started. All you require as a prerequisite is Spark on your machine. Here are the steps to configure it: 1. Download the latest Mesos version from here. 2. Extract the archive. 3. Install the Mesos dependencies: $ sudo apt-get -y install build-essential python-dev python-six python-virtualenv libcurl4-nss-dev libsasl2-dev libsasl2-modules maven libapr1-dev libsvn-dev 4. Install libz, which is Continue Reading
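Once Mesos is installed, pointing Spark at it is mostly a master URL; a sketch with placeholder host and tarball locations:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("spark-on-mesos-demo")
  .setMaster("mesos://mesos-master-host:5050") // placeholder Mesos master URL
  .set("spark.executor.uri",
    "hdfs:///tmp/spark-2.x.x-bin-hadoop2.7.tgz") // placeholder Spark tarball executors download

val spark = SparkSession.builder().config(conf).getOrCreate()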

Why Dataset Over DataFrame?

Reading Time: < 1 minute In this blog we will learn what advantage the Dataset API in Spark 2 really has over the DataFrame API. A DataFrame is weakly typed, and developers aren't getting the benefits of the type system; that's why the Dataset API was introduced in Spark 2. To understand this, please look at the following scenario: suppose you want to read the result from a csv file Continue Reading
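A sketch of that scenario with a hypothetical people.csv and matching case class:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long) // hypothetical schema of the csv

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// DataFrame: weakly typed, so a wrong column name fails only at runtime
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("people.csv")

// Dataset: every row carries the Person type, checked at compile time
val ds = df.as[Person]
ds.map(p => p.age + 1) // p.agee would not compile; df.select("agee") would only fail at runtime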

Create Your Own Metastore Event Listeners in Hive With Scala

Reading Time: 2 minutes Hive metastore event listeners are used to detect every single event that takes place whenever an event is executed in Hive. In case you want some action to take place for an event, you can override MetaStorePreEventListener and provide it your own implementation. In this article, we will learn how to create our own metastore event listeners in Hive using Scala and sbt, so let's get Continue Reading
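A minimal sketch of such a listener, here using MetaStoreEventListener's create-table hook (the class name is hypothetical; it would be registered through the hive.metastore.event.listeners property):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.metastore.MetaStoreEventListener
import org.apache.hadoop.hive.metastore.events.CreateTableEvent

// logs every table creation the metastore sees
class TableCreationListener(config: Configuration) extends MetaStoreEventListener(config) {
  override def onCreateTable(event: CreateTableEvent): Unit = {
    println(s"table created: ${event.getTable.getTableName}")
  }
}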

How To Use Vectorized Reader In Hive

Reading Time: 2 minutes The reason for writing this blog is that I tried to use the vectorized reader in Hive but faced some problems with its documentation; that's why I decided to write this post. Introduction: Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves Continue Reading
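The switches themselves are one-liners; a sketch issued over Hive's JDBC driver (hypothetical endpoint and credentials), keeping in mind that vectorization applies to ORC-backed tables:

import java.sql.DriverManager

// hypothetical HiveServer2 endpoint
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hiveuser", "")
val stmt = conn.createStatement()

stmt.execute("set hive.vectorized.execution.enabled = true")
// vectorized execution only kicks in for ORC storage
stmt.execute("create table if not exists events_orc (id int, msg string) stored as orc")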

Play-Spark2: A Simple Application

Reading Time: 3 minutes In this blog we will create a very simple application with the Play Framework and Spark. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. It also supports a rich set of higher-level tools, including Shark and Spark Streaming. Play Continue Reading
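A sketch of how the two can be wired together in Play 2.6+ (the controller is hypothetical; a real application would manage the SparkSession lifecycle more carefully):

import javax.inject.Inject
import play.api.mvc.{AbstractController, ControllerComponents}
import org.apache.spark.sql.SparkSession

class SparkController @Inject()(cc: ControllerComponents) extends AbstractController(cc) {

  // one lazily created local SparkSession shared by the controller
  private lazy val spark = SparkSession.builder()
    .master("local[*]")
    .appName("play-spark2")
    .getOrCreate()

  // e.g. GET /count routed here: run a tiny Spark job and return the result
  def count = Action {
    Ok(s"count = ${spark.range(1000).count()}")
  }
}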

Partitioning in Apache Hive

Reading Time: 2 minutes Partitions: Hive is a good tool for performing queries on large datasets, especially datasets that require full table scans. But quite often there are instances where users need to filter the data on specific column values; that's where partitioning comes into play. A partition is nothing but a directory that contains a chunk of the data. When we do partitioning, we create a partition for each unique value Continue Reading
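A sketch of the directory-per-value idea, issued here through a Hive-enabled SparkSession named spark (table and column names are hypothetical):

// each distinct country value becomes its own directory under the table location
spark.sql("create table users (name string, age int) partitioned by (country string)")
spark.sql("insert into users partition (country = 'IN') select 'anubhav', 25")

// this filter only scans the country=IN directory instead of the full table
spark.sql("select name from users where country = 'IN'").show()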

Understanding the Optimized Logical Plan in Spark

Reading Time: 2 minutes A LogicalPlan is a tree that represents both schema and data; these trees are manipulated and optimized by the Catalyst framework. There are three types of logical plans: the parsed logical plan, the analysed logical plan, and the optimized logical plan. The analysed logical plan goes through a series of rules that resolve it, and then the optimized plan is produced. The optimized plan normally allows Spark to plug in a set of optimization rules; even a developer can plug Continue Reading
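All three plans are visible on any Dataset through queryExecution; a small sketch, assuming a SparkSession named spark:

val df = spark.range(100).filter("id > 50").select("id")
val qe = df.queryExecution

println(qe.logical)       // parsed logical plan
println(qe.analyzed)      // analysed logical plan, after resolution rules
println(qe.optimizedPlan) // optimized logical plan, after Catalyst's rule batches

df.explain(true)          // or print all of them, plus the physical plan, at once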

Starting a Hive Client Programmatically With Scala

Reading Time: 2 minutes Hive defines a simple SQL-like query language for querying and managing large datasets, called HiveQL (HQL). It's easy to use if you're familiar with SQL. Hive also allows programmers who are familiar with the MapReduce framework to plug in custom mappers and reducers for more sophisticated analysis. In this blog, we will learn how to create a Hive client with Scala to execute basic HQL commands. First, create Continue Reading
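A minimal sketch of one way to build such a client, over Hive's JDBC driver (endpoint and credentials are hypothetical):

import java.sql.DriverManager

// load the HiveServer2 JDBC driver and connect
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hiveuser", "")
val stmt = conn.createStatement()

// run a basic HQL command and walk the results
val rs = stmt.executeQuery("show tables")
while (rs.next()) println(rs.getString(1))
conn.close()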