Author: Anuj Saxena

Optimizations In Spark: For BETTER OR For WORSE

Reading Time: 5 minutes This blog focuses on some of the problems faced while working with the Spark SQL

Let’s get to know Data Streaming: A dev’s point of view

Reading Time: 5 minutes Data streaming is the need of the hour. This blog focuses on the developer’s need to process such streams, the benefits of doing so, and the challenges it introduces.

Using Vertica with Spark-Kafka: Write using Structured Streaming

Reading Time: 3 minutes In two previous blogs, we explored Vertica and how it can be connected to Apache Spark. The first blog in this mini-series was about reading data from Vertica using Spark and saving that data into Kafka. The next blog explained the reverse flow, i.e., reading data from Kafka and writing data to Vertica, but in batch mode, i.e., reading data from Kafka …

Using Vertica with Spark-Kafka: Writing

Reading Time: 4 minutes In the previous blog of this series, we took a glance at the basic definitions of Spark and Vertica. We also walked through the code for reading data from Vertica using Spark as a DataFrame and saving the data into Kafka. In this blog we will work on the reverse flow, i.e., reading the data from Kafka as a DataFrame and writing that DataFrame into …

Using Vertica with Spark-Kafka: Reading

Reading Time: 4 minutes We live in a world of Big Data, where the volume of data is huge even for small results. This is the result of the rapid increase in data collection in the modern world. Data on this scale calls for tools that can work on such big chunks of it. I am pretty sure that you guys …

Spark Streaming vs. Structured Streaming

Reading Time: 6 minutes Fan of Apache Spark? I am too. The reason is simple: interesting APIs to work with, fast and distributed processing, far less disk I/O overhead than MapReduce, fault tolerance, and much more. With this much, you can do a lot in this world of Big Data and Fast Data. From “processing huge chunks of data” to “working on streaming data”, Spark handles them all. In this …

Spark Structured Streaming with Elasticsearch

Reading Time: 3 minutes We have been working with streaming data for a long time. Using Apache Spark for that can be very convenient. Spark provides two APIs for streaming data: one is Spark Streaming, which is a separate library provided by Spark; the other is Structured Streaming, which is built upon the Spark SQL library. We will discuss the trade-offs and differences between these two libraries in …

MachineX: Logistic Regression with KSAI

Reading Time: 2 minutes Logistic regression, a predictive analysis technique, is mostly used for classification with binary outcome variables and can also be extended to multiple classes. We have already studied the algorithm in depth in a previous blog. Today we will be using the KSAI library to build our logistic regression model.

Running Spark on DC/OS

Reading Time: 6 minutes DevOps engineers have long needed an open source tool that makes it easy to deploy code, one that has come through all the ups and downs to reach this far and is considerably more capable of evolving (pun intended). As we all know, in this world of agile we need to shift our requirements after a short duration of time. Be it the addition of a feature or tweaking …

KnolX: Machine Learning with Artificial Neural Networks

Reading Time: < 1 minute Hi all, Knoldus organized a 30-minute session on 8th December 2017 at 4:15 PM. The topic was Machine Learning with Artificial Neural Networks. Many people joined and enjoyed the session. I am going to share the slides here. Please let me know if you have any questions related to the linked slides. Here’s the video of the …

What is Deep Learning??

Reading Time: 4 minutes The term “Deep Learning” has been on fire for the past two decades. Every machine learning enthusiast wants to work on it, and many big companies are already making an impact on the data science field by exploring it, e.g., the Google Brain project from Google or DeepFace from Facebook. The reason is simple; experts say, and I quote, “for most flavors of the old generations of learning algorithms … performance will …

Spark Streaming: Unit Testing DStreams

Reading Time: 3 minutes Frankly, I don’t think there’s any need to tell us, “The Developers”, about the need for proper testing, or unit testing to be precise (QAs, don’t be flattered :P). Unit test cases are the quickest way to know there’s something wrong with our code. “Unit testing is important because it is one of the earliest testing efforts performed on the code and the earlier defects are detected, the easier …

Artificial Intelligence vs Machine Learning vs Deep Learning

Reading Time: 3 minutes The world as we know it is moving towards machines big time. But we cannot fully utilize any machine without a lot of human interaction. To get around that, we need some kind of intelligence for the machines. This is where Artificial Intelligence comes in. It is the concept of machines being smart enough to carry out numerous tasks without …