Big Data

Installing and Running Presto

Reading Time: 4 minutes Hi folks! In my previous blog, I talked about Getting Introduced with Presto. In today’s blog, I will cover setting up (installing) and running Presto. The basic prerequisites for setting up Presto are: Linux or Mac OS X; Java 8, 64-bit; Python 2.4+. Installation: download the Presto tarball from here, then unpack the tarball. After unpacking you will see a directory presto-server-0.175 which Continue Reading

Partition-Aware Data Loading in Spark SQL

Reading Time: 3 minutes Data loading, in Spark SQL, means loading data into the memory/cache of Spark worker nodes. For this we usually write the following code:
    import java.util.Properties
    val connectionProperties = new Properties()
    connectionProperties.put("user", "username")
    connectionProperties.put("password", "password")
    val jdbcDF = spark.read.jdbc("jdbc:postgresql:dbserver", "schema.table", connectionProperties)
Here we are using the jdbc function of the DataFrameReader API of Spark SQL to load the data from a table into the Spark executors’ memory, no matter how many rows are Continue Reading

Short Interview With SMACK Tech Stack !!!

Reading Time: 3 minutes Hello guys, today we conduct a short interview with SMACK about its architecture and its uses. Let’s start with an introduction. Interviewer: How would you describe yourself? SMACK: I am SMACK (Spark, Mesos, Akka, Cassandra and Kafka), and all of my components are open source technologies. A Mesosphere and Cisco collaboration bundles these technologies together to create a product called Infinity, which is used to solve Continue Reading

Tableau: Getting into Tableau Public

Reading Time: 2 minutes Big Data visualization and Business Intelligence have become easy with Tableau: millions or billions of records can be analyzed in one go, and whether your data format is Excel, CSV, text or a database, Tableau makes it easy for you. So you have finally made up your mind to generate visualizations using Tableau and want to know how far Tableau can go in visualization? You are Continue Reading

Business Intelligence-Data Visualization: Tableau

Reading Time: 3 minutes Spark, Big Data, NoSQL and Hadoop are some of the most widely used, top-of-the-charts technologies that we frequently work with at Knoldus. When these terms are used, one thing comes into the picture: ‘huge data, millions/billions of records’. Knoldus developers use these terms frequently. Managing such an amount of data (managing here means storing data, rectifying it, normalizing it, cleaning it and much more) is really Continue Reading

Setting Up Multi-Node Hadoop Cluster , just got easy !

Reading Time: 3 minutes In this blog, we are going to embark on the journey of setting up a Hadoop multi-node cluster in a distributed environment. So let’s not waste any time and get started. Here are the steps you need to perform. Prerequisites: 1. Download and install Hadoop for the local machine (Single Node Setup) from http://hadoop.apache.org/releases.html (2.7.3), using Java jdk1.8.0_111. 2. Download Apache Spark from http://spark.apache.org/downloads.html and choose Spark release Continue Reading

Cassandra Data Modeling – Primary , Clustering , Partition , Compound Keys

Reading Time: 5 minutes In this post we are going to discuss the different keys available in Cassandra. The primary key concept in Cassandra is different from that in relational databases, so it is worth spending time to understand it. Let’s take an example and create a student table which has student_id as its primary key column. 1) Primary key: create table person (student_id int primary key, fname Continue Reading

Spark – IoT : Combining Big Data Analysis with IoT

Reading Time: 3 minutes Welcome back, folks! Time for a new gig! I think the last series, Scala – IOT, was pretty amazing and got an overwhelming response from you all, which pumped up the idea for this new web series, Spark-IOT. So let’s get started. What was the motivation? I have been active in the IoT community here, and I found Continue Reading

Hive-Metastore : A Basic Introduction

Reading Time: 3 minutes As we know, the database is the most important and powerful part of any organisation. It is an organized collection of data: schemas, tables, relationships, queries and views. But have you ever thought about these questions: How does a database manage all its tables? How does a database manage all its relationships? How do we perform all operations so easily? Is there Continue Reading

Is using Accumulators really worth ? Apache Spark

Reading Time: 2 minutes Before jumping right into the topic you must know what Accumulators are; for that you can refer to this blog. Now that we know the what and why of Accumulators, let’s jump to the main point. Description: Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. Example: if the node running a partition of a map() operation crashes, Spark will rerun it Continue Reading

Broadcast variables in Spark, how and when to use them?

Reading Time: 2 minutes As the documentation for Spark broadcast variables states, they are immutable shared variables which are cached on each worker node of a Spark cluster. In this blog, we will demonstrate a simple use case of broadcast variables. When to use a broadcast variable? Think of a problem such as counting grammar elements in any random English paragraph, document or file. Suppose you have the Map of each word as specific Continue Reading

Aggregating Neighboring vertices with Apache Spark GraphX Library

Reading Time: 2 minutes To understand the problems addressed by “Neighborhood Aggregation”, we can think of queries like: “Who has the maximum number of followers under 20 on Twitter?” In this blog, we will learn how to aggregate properties of neighboring vertices on a graph with Apache Spark’s GraphX library. The Spark shell will be enough to understand the code example. So, let us get back to the problem statement. Let Continue Reading

Saving Spark DataFrames on Amazon S3 got Easier !!!

Reading Time: 1 minute In our previous blog post, Congregating Spark Files on S3, we explained how to upload files (saved in a Spark cluster) to Amazon S3. Well, I agree that the method explained in that post was a bit complex and hard to apply. Also, it adds a lot of boilerplate to our code. So, we started working on simplifying it and finding an easier way to provide a wrapper around Spark Continue Reading
