ML, AI and Data Engineering

MachineX: The second dimensionality reduction method

Reading Time: 5 minutes In the previous blog, we saw how more data, or more precisely more dimensions in the data, creates problems such as overfitting in classification and regression algorithms. This is known as the “curse of dimensionality”. We then looked at the solution to the problem, i.e. dimensionality reduction, focusing mainly on one dimensionality reduction method called feature selection. In this Continue Reading

Kafka And Spark Streams: The happily ever after !!

Reading Time: 4 minutes Hi everyone, today we are going to learn a bit about using Spark Streaming to transform and transport data between Kafka topics. The demand for stream processing is increasing every day, because processing big volumes of data is often not enough. We need real-time processing of data, especially when we need to handle continuously increasing volumes of data and also need Continue Reading

Unveiling The Mystery Of Serverless

Reading Time: 2 minutes In this blog, we will explore Serverless and why it is trending so much. Serverless sounds like a self-explanatory word: it suggests there are no servers. But is that really true? No, it is not. Serverless does not mean the absence of servers. There are servers; it’s just that we don’t have to manage them. All the infrastructure is provided by companies like AWS, Google, Azure Continue Reading

MachineX: When data is a curse to learning

Reading Time: 4 minutes Data and learning are like best friends, though perhaps learning is too dependent on data for them to be called friends. When data overwhelms, learning acts pricey, so it feels more like a girlfriend-boyfriend sort of relationship. Well, don’t get confused or bothered by how I am comparing data and learning; it is just my depiction of something called dimensionality reduction in machine learning. On Continue Reading

MachineX: Simplifying Logistic Regression

Reading Time: 3 minutes Logistic regression is one of the most popular machine learning algorithms for binary classification. This is because it is a simple algorithm that performs very well on a wide range of problems. It is used when you know that the data is linearly separable/classifiable and the outcome is binary (dichotomous), but it can be extended to cases where the dependent variable has more than two categories. It Continue Reading

Introduction to AWS Step Function

Reading Time: 2 minutes Step Functions is a state-machine-based workflow coordination service provided by AWS. It gives application developers a straightforward way to create an execution workflow that coordinates multiple AWS Lambda or Amazon Elastic Compute Cloud (EC2) components in distributed applications running in the cloud. Before digging into AWS Step Functions, let’s take an example. You want to send an automated verification message to the new user Continue Reading

No need to predict your application load in advance with Amazon DynamoDB

Reading Time: 2 minutes Hello everyone! In this blog, I will try to explain what Amazon DynamoDB is and how it is more powerful than other NoSQL databases. What Is Amazon DynamoDB? Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. We do not need to predict our application load in advance. In order to have a clear understanding, let us take a Continue Reading

They said Spark Streaming simply means Discretized Stream

Reading Time: 3 minutes I work at a company (Knoldus Software LLP) where Apache Spark practically runs in people’s blood, meaning there are people here who are really good at it. If you ever visit our blogging page and search for anything related to Spark, you will find enough content to answer most of your Spark-related queries, starting from introductions to solutions for Continue Reading

Terraform: Enabling developer to create and manage deployment through code

Reading Time: 2 minutes In this blog post, we will walk you through Terraform, a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform enables developers to properly manage infrastructure through code. The set of files used to describe infrastructure in Terraform is simply known as a Terraform configuration. These files have the extension .tf. Configuration files describe to Terraform the components needed to Continue Reading

KnolX: Understanding Spark Structured Streaming

Reading Time: < 1 minute Hello everyone, Knoldus organized a session on 5th January 2018. The topic was “Understanding Spark Structured Streaming”. Many people attended and enjoyed the session. In this blog post, I am going to share the slides and video of the session. Slides:

Cool Breeze of Scala for Easy Computation: Introduction to Breeze Library

Reading Time: 4 minutes Mathematics is a core part of machine learning, and to dive deep into machine learning one should possess basic knowledge of mathematical concepts. But when you start developing algorithms, mathematics can be a real pain. Thankfully, we have some awesome libraries that reduce some of our pain and allow us to focus more on our basic requirement rather than on manipulation techniques. While Continue Reading

Developers Needs SDKMAN Not Super-Man

Reading Time: 4 minutes Every developer knows the pain of setting up a development environment on his/her machine, with all the installations it involves. Sometimes the pain gets worse, when we need to test the same application on multiple versions of SDKs or virtual machines. If you are a Mac user, you have the best option: the brew installer. But if you are a Linux user, your pain is unpredictable. We are JVM stack developers Continue Reading

Spark Streaming: Unit Testing DStreams

Reading Time: 3 minutes Frankly, I don’t think anyone needs to tell us, “The Developers”, about the need for proper testing, or unit testing to be precise (QAs, don’t be flattered :P). Unit test cases are the quickest way to know there’s something wrong with our code. “Unit testing is important because it is one of the earliest testing efforts performed on the code and the earlier defects are detected, the easier Continue Reading