Data Analysis

Use Plotly Library for Visualization

Reading Time: 4 minutes Plotly is a very useful and beautiful data science library. It is an open-source library, and it works with both plain Python and Django. It offers many types of graphs, such as scatter, bar, pie, bubble, dot, and treemap. What is Plotly? Plotly is an open-source library that is free of charge. Using Plotly for statistical analysis of data enables Continue Reading
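As a quick taste of what the post covers, here is a minimal sketch of one of the chart types listed above, a scatter plot, using Plotly Express and its bundled iris sample dataset (the post itself may use different data):

```python
import plotly.express as px

# Load a small sample dataset bundled with Plotly Express
df = px.data.iris()

# One of the chart types mentioned above: a scatter plot, colored by category
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()  # opens an interactive chart in the browser/notebook
```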

Apache Beam: Side input Pattern

Reading Time: 3 minutes Apache Beam is a unified programming model for defining both batch and streaming data-parallel processing pipelines. It is a modern way of defining data processing pipelines, with a rich set of APIs and mechanisms to solve complex use cases. In some use cases, while we define our data pipelines, the requirement is that the pipeline should use some additional inputs. For example, in streaming analytics applications, it Continue Reading
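The side input pattern in the Beam Python SDK looks roughly like the sketch below; the element values and the length limit are made up for illustration:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # Main input: the elements flowing through the pipeline
    words = pipeline | "Words" >> beam.Create(["apache", "beam", "side", "input"])

    # Additional input, materialized as a side input
    # (a single value here; it could also be a list or dict)
    max_len = pipeline | "MaxLen" >> beam.Create([5])

    # Each element is processed together with the side input's value
    short = words | beam.Filter(
        lambda word, limit: len(word) <= limit,
        limit=beam.pvalue.AsSingleton(max_len),
    )
    short | beam.Map(print)
```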

Data Analysis Using Python

Reading Time: 4 minutes In this blog, we will give an overview of the Python packages used for data analysis. Finally, we will learn how to import and export data to and from Python, and how to obtain basic insights from datasets. For an understanding of the basic concepts of Data Analytics, you can go through this link. Python packages for Data Analysis: In order to do analysis Continue Reading
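For a flavor of what the post walks through, a minimal pandas sketch of import, basic insights, and export (file names are hypothetical):

```python
import pandas as pd

# Import data into Python from a CSV file (hypothetical path)
df = pd.read_csv("data.csv")

# Obtain basic insights from the dataset
print(df.head())      # first few rows
print(df.dtypes)      # data type of each column
print(df.describe())  # summary statistics for numeric columns

# Export data back out of Python
df.to_csv("output.csv", index=False)
```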

How to Find the Correlation Value of Categorical Variables

Reading Time: 4 minutes Hey folks, in this blog we are going to find out the correlation of categorical variables. What is a categorical variable? In statistics, a categorical variable has two or more categories, but there is no intrinsic ordering to the categories. For example, a binary variable (such as a yes/no question) is a categorical variable having two categories (yes or no), and there is no intrinsic ordering to the categories. Continue Reading
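A common way to quantify the association between two categorical variables is a chi-square test combined with Cramér's V; a minimal sketch follows (the post may use a different method, and the sample data here is made up):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V: strength of association between two categorical
    variables, on a scale from 0 (none) to 1 (perfect)."""
    table = pd.crosstab(x, y)          # contingency table of the categories
    chi2 = chi2_contingency(table)[0]  # chi-square statistic
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt((chi2 / n) / min(r - 1, k - 1))

# Made-up binary variables for illustration
df = pd.DataFrame({
    "smoker":  ["yes", "no", "yes", "no", "yes", "no"],
    "disease": ["yes", "no", "yes", "yes", "no", "no"],
})
print(cramers_v(df["smoker"], df["disease"]))
```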

A Quick Demo: Kafka to Flink to Cassandra

Reading Time: 3 minutes Hi Folks!! In this blog, we are going to learn how we can integrate Flink with Kafka and Cassandra to build a simple streaming data pipeline. Apache Flink is a framework and distributed processing engine used for stateful computations over unbounded and bounded data streams. Kafka is a scalable, high-performance, low-latency platform that allows reading and writing streams of data like a messaging system. Cassandra: a distributed and wide-column Continue Reading
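A rough sketch of the Kafka-reading half of such a pipeline in PyFlink's Table API is below. The topic, fields, and servers are hypothetical, the Flink Kafka connector jar must be on the classpath, and a print sink stands in for Cassandra, whose sink is usually wired up through Flink's Java/Scala DataStream connector:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a Kafka topic (hypothetical names; requires the Kafka connector jar)
t_env.execute_sql("""
    CREATE TABLE readings (
        sensor_id STRING,
        temperature DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'readings',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Sink: printed to stdout here; in the real pipeline this would be Cassandra
t_env.execute_sql("""
    CREATE TABLE sink (
        sensor_id STRING,
        temperature DOUBLE
    ) WITH ('connector' = 'print')
""")

t_env.execute_sql("INSERT INTO sink SELECT * FROM readings").wait()
```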

Creating a Data Pipeline with Spark Streaming, Kafka and Cassandra

Reading Time: 3 minutes Hi Folks!! In this blog, we are going to learn how we can integrate Spark Structured Streaming with Kafka and Cassandra to build a simple data pipeline. Spark Structured Streaming is a component of the Apache Spark framework that enables scalable, high-throughput, fault-tolerant processing of data streams. Apache Kafka is a scalable, high-performance, low-latency platform that allows reading and writing streams of data Continue Reading
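A minimal sketch of such a pipeline in PySpark follows; the topic, schema, keyspace, and table names are hypothetical, and it assumes the spark-sql-kafka and spark-cassandra-connector packages are on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = (SparkSession.builder
         .appName("kafka-to-cassandra")
         .config("spark.cassandra.connection.host", "localhost")
         .getOrCreate())

schema = StructType().add("id", StringType()).add("value", DoubleType())

# Read a stream of JSON events from a Kafka topic (hypothetical names)
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Write each micro-batch to Cassandra (hypothetical keyspace/table)
def write_to_cassandra(batch_df, batch_id):
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="demo", table="events")
     .mode("append")
     .save())

query = events.writeStream.foreachBatch(write_to_cassandra).start()
query.awaitTermination()
```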

Analysis of campus placement dataset using decision tree

Reading Time: 3 minutes KNIME Analytics Platform is open-source software for creating data science applications and services. Intuitive, open, and continuously integrating new developments, KNIME makes understanding data and designing data science workflows and reusable components accessible to everyone. With KNIME Analytics Platform, you can create visual workflows with an intuitive, drag-and-drop graphical interface, without the need for coding. Hello, folks! In this blog, we will analyse the campus placement data Continue Reading
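The post builds its workflow visually in KNIME, with no code; purely for reference, a minimal code equivalent of the underlying technique, a decision tree classifier, might look like this in scikit-learn (file and column names are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical campus placement dataset with a "status" target column
df = pd.read_csv("placement.csv")
X = pd.get_dummies(df.drop(columns=["status"]))  # one-hot encode categoricals
y = df["status"]                                 # e.g. Placed / Not Placed

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```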

ICC Test Cricket Data Analysis using KNIME

Reading Time: 4 minutes KNIME Analytics Platform is open-source software for creating data science applications and services. Intuitive, open, and continuously integrating new developments, KNIME makes understanding data and designing data science workflows and reusable components accessible to everyone. With KNIME Analytics Platform, you can create visual workflows with an intuitive, drag-and-drop graphical interface, without the need for coding. Hello, folks! In this blog, we will analyse Continue Reading

KNIME Analytics Platform: A Dream for a Data Scientist

Reading Time: 3 minutes In this blog, we are going to see what the KNIME Analytics Platform is and the important features it offers for creating an analytics workflow in an easy way. Introduction to the KNIME Analytics Platform KNIME is a platform built for powerful analytics on a GUI-based workflow. This means you do not have to know how to code to be able to work using KNIME and derive Continue Reading

Apache Spark: Delta Lake as a Solution – Part II

Reading Time: 3 minutes Well, we have already covered the missing features in Apache Spark & the issues they cause in executing a Data Lake in Part 1. Today, however, we will be talking about what Delta Lake is & how it provides the solution to all those problems discussed in Delta Lake as a Solution: Part 1. As we all know, Spark is just a processing engine; it doesn't Continue Reading
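For context, here is a minimal sketch of writing and reading a Delta table from Spark, the kind of ACID, versioned storage the post discusses; it assumes the delta-spark package is available, and the path is hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is on the classpath
spark = (SparkSession.builder
         .appName("delta-demo")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.range(5)

# ACID, versioned write -- the kind of guarantee plain Spark lacks
df.write.format("delta").mode("overwrite").save("/tmp/delta-table")

spark.read.format("delta").load("/tmp/delta-table").show()
```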

Apache Spark: Delta Lake as a Solution – Part I

Reading Time: 3 minutes Today, everyone is talking about Delta Lake. Why? Ever tried to find the answer to this question? Yes or no, it doesn't matter; don't worry, here in Part 1 we will be discussing the same & also targeting the following questions: What features are missing from Apache Spark? What kind of issues do they cause in executing a Data Lake? Answering the above questions will definitely Continue Reading


MachineX: Analysing COVID-19 Pandemic

Reading Time: 5 minutes Introduction COVID-19 disease, caused by the SARS-CoV-2 virus, was identified in December 2019 in China and declared a global pandemic by the WHO (World Health Organization) on 11 March 2020. The disease originated in Wuhan, China, and has since spread globally, affecting more than 200 countries. Coronavirus disease 2019 (COVID-19) is a highly infectious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The Number Continue Reading

Apache Spark: Handle Corrupt/Bad Records

Reading Time: 3 minutes Most of the time, writing ETL jobs becomes very expensive when it comes to handling corrupt records, and in such cases ETL pipelines need a good solution for handling corrupted records, because the larger the ETL pipeline is, the more complex it becomes to handle such bad records in between. Corrupt data includes: missing information, incomplete information, schema mismatch, and differing formats or data types. Apache Spark: Continue Reading
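One built-in option Spark offers for this is the PERMISSIVE read mode with a corrupt-record column; a minimal sketch (the input path is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("bad-records").getOrCreate()

# Explicit schema with an extra column that collects unparsable rows
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),
])

df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")  # keep bad rows instead of failing
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/tmp/input.json"))      # hypothetical input path

df.cache()  # required before querying the corrupt-record column alone

good = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
bad = df.filter(df["_corrupt_record"].isNotNull())  # route these aside
```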