data engineering

Apache Beam: Side input Pattern

Reading Time: 3 minutes Apache Beam is a unified programming model for defining both batch and streaming data-parallel processing pipelines. It is a modern way of defining data processing pipelines and offers a rich set of APIs and mechanisms for solving complex use cases. In some use cases, a pipeline needs additional inputs beyond its main data source while it processes elements. For example, in streaming analytics applications, it Continue Reading
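The side input pattern can be seen in miniature without Beam itself: a main collection is processed element by element while a small auxiliary dataset is made available to every element. This is a plain-Python analogy only (in Beam the side collection would be a PCollection wrapped with something like `beam.pvalue.AsDict`); the currency-rate scenario and all names here are illustrative assumptions, not taken from the post.

```python
# Plain-Python analogy of Beam's side input pattern (illustrative, not the Beam API).
# The "main input" is a stream of (currency, amount) pairs; the "side input" is a
# small rate table that every per-element step can consult.

def build_side_input(rates_source):
    # In Beam, this role is played by a PCollection viewed as a dict/singleton.
    return dict(rates_source)

def convert(amounts, rates):
    # The per-element processing (a DoFn in Beam) receives the side input
    # alongside each main-input element.
    return [(currency, amount * rates[currency]) for currency, amount in amounts]

rates = build_side_input([("EUR", 1.1), ("GBP", 1.3)])
main_input = [("EUR", 100), ("GBP", 200)]
result = convert(main_input, rates)
```

The key property mirrored here is that the side data is computed once and broadcast to all elements, rather than joined row by row.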

Fundamentals Of Classification Models Part-2

Reading Time: 3 minutes This article is the continuation of “Fundamentals of Classification Models Part – 1”; you should go through that part before learning about classifier models. Classifier Models As discussed in the previous article, “we prepare the data for training the algorithm”: the first step is to pre-process and clean the data. The cleaning we need for this dataset is to change the Continue Reading
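The pre-processing step mentioned above can be sketched with a tiny, hypothetical dataset: drop incomplete rows and encode string class labels as integers before training. The fields and rows below are invented for illustration and are not from the post.

```python
# Minimal, hypothetical cleaning sketch before training a classifier:
# remove rows with missing values, then encode string labels as integers.
raw_rows = [
    {"age": 25, "label": "yes"},
    {"age": None, "label": "no"},  # incomplete row, dropped during cleaning
    {"age": 40, "label": "no"},
]

def clean(rows):
    # Keep only rows where every field has a value.
    complete = [r for r in rows if all(v is not None for v in r.values())]
    # Map each distinct label to an integer, e.g. {"no": 0, "yes": 1}.
    labels = sorted({r["label"] for r in complete})
    encoding = {lab: i for i, lab in enumerate(labels)}
    return [{"age": r["age"], "label": encoding[r["label"]]} for r in complete]

cleaned = clean(raw_rows)
```

Real pipelines would typically delegate this to a library (e.g. pandas or scikit-learn encoders), but the shape of the step is the same.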

How To Find Correlation Value Of Categorical Variables.

Reading Time: 4 minutes Hey folks, in this blog we are going to find the correlation of categorical variables. What is a Categorical Variable? In statistics, a categorical variable has two or more categories, but there is no intrinsic ordering to the categories. For example, a binary variable (such as a yes/no question) is a categorical variable with two categories (yes or no), and there is no intrinsic ordering to the categories. Continue Reading
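One common way to measure association between two categorical variables is Cramér's V, derived from the chi-square statistic of their contingency table. Whether the post uses this exact measure is an assumption; the sketch below implements the textbook formula with the standard library only.

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of observed counts.

    V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 (no association)
    to 1 (perfect association).
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# A perfectly associated 2x2 table gives V = 1.0; an independent one gives V = 0.0.
perfect = cramers_v([[10, 0], [0, 10]])
independent = cramers_v([[5, 5], [5, 5]])
```

In practice `scipy.stats.chi2_contingency` would supply the chi-square value, but the arithmetic above shows what the measure actually computes.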

Apache Spark: Handle Corrupt/Bad Records

Reading Time: 3 minutes Writing ETL jobs becomes very expensive when it comes to handling corrupt records, and in such cases ETL pipelines need a good solution for handling them. The larger the ETL pipeline, the more complex it becomes to handle such bad records in between. Corrupt data includes: Missing information Incomplete information Schema mismatch Differing formats or data types Apache Spark: Continue Reading
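Spark's DataFrame readers expose parse modes (PERMISSIVE, DROPMALFORMED, FAILFAST) for exactly this problem; the core idea is to quarantine bad records instead of failing the whole job. The following is a plain-Python sketch of that routing idea, not Spark code; the two-column schema (`id: int, name: str`) is an invented example.

```python
# Plain-Python sketch of "route bad records aside" (the idea behind Spark's
# PERMISSIVE mode and corrupt-record columns). Schema here is illustrative:
# each line should be "id,name" with an integer id.
def parse_records(lines):
    good, corrupt = [], []
    for line in lines:
        parts = line.split(",")
        try:
            if len(parts) != 2:
                raise ValueError("schema mismatch")
            good.append({"id": int(parts[0]), "name": parts[1]})
        except ValueError:
            corrupt.append(line)  # quarantined instead of aborting the whole job
    return good, corrupt

good, bad = parse_records(["1,alice", "oops", "x,bob", "2,carol"])
```

DROPMALFORMED would simply discard the `corrupt` list, while FAILFAST corresponds to re-raising on the first bad line.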

Amazon EMR

Reading Time: 3 minutes Businesses worldwide are discovering the power of new big data processing and analytics frameworks like Apache Hadoop and Apache Spark, but they are also discovering some of the challenges of operating these technologies in on-premises data lake environments. They may also have concerns about the future of their current distribution vendor. Common problems of on-premises big data environments include a lack of agility, excessive costs, Continue Reading

Modernizing Data Storage for fuelling Digital Transformation

Reading Time: 5 minutes As companies mature in their digital transformation journey, old technologies and rules of doing business are being re-defined. Capturing customers is no longer enough and companies are focusing on how to keep them engaged with hyper-personalized experiences. There’s an explosion of data sources as everyone and everything is connected with mobile devices, social media, and IoT.  What this means for a business is an exponential Continue Reading

Apache Spark: Tricks to Increase Job Performance

Reading Time: 2 minutes Apache Spark is quickly being adopted in the real world, and many companies, such as Uber, are using it in production. Spark is gaining popularity in the market as it also lets you develop streaming applications and do machine learning, which helps companies get better results in production along with proper analysis using Spark. Although companies are using Spark in Continue Reading

Apache Spark

Deep Dive into Apache Spark Transformations and Actions

Reading Time: 4 minutes In our previous blog on Apache Spark, we discussed a little about what Transformations & Actions are. Now we will dig deeper into the topic and understand what they actually are and how they play a vital role in working with Apache Spark. What is a Spark RDD? Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable, fault-tolerant, distributed collection of objects Continue Reading
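The transformation/action split is essentially lazy evaluation: transformations like `map` and `filter` only describe work, and an action like `collect` forces it. A plain-Python analogy using generators (not Spark itself; the function names are invented for illustration):

```python
# Plain-Python analogy of Spark's lazy transformations vs. eager actions.
def fake_map(it, f):
    return (f(x) for x in it)          # lazy, like an RDD transformation

def fake_filter(it, pred):
    return (x for x in it if pred(x))  # also lazy: nothing runs yet

nums = range(1, 6)
# Build the "lineage": square each number, then keep the odd results.
plan = fake_filter(fake_map(nums, lambda x: x * x), lambda x: x % 2 == 1)
# No computation has happened yet; list() plays the role of an action
# like collect(), which finally pulls data through the whole chain.
result = list(plan)  # [1, 9, 25]
```

As with real RDDs, chaining more transformations onto `plan` costs nothing until an action is invoked, which is what lets Spark optimize and pipeline the work.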