Big Data

Dataframe and Datasets: Apache Spark’s Developer-Friendly Structured APIs

Reading Time: 4 minutes This is a two-part blog series: the first part covers the DataFrame API and the second covers Datasets. Spark 2.x brought structure to Spark through two ideas. The first is to express computations using common patterns found in data analysis, such as filtering, selecting, counting, aggregating, and grouping. The second is to order and structure your data in a Continue Reading

How To Use Regular Expression In Scala

Reading Time: 3 minutes Hi folks, in this article I am going to explain regular expressions and how to form them in Scala. What is a Regular Expression? A regular expression is a string of characters and punctuation that represents a search pattern. Popularized by Perl and command-line utilities like grep, regular expressions are a standard feature in the libraries of most programming languages, including Scala. In Scala, we Continue Reading

BigQuery: Efficient Data Warehouse Schema Design

Reading Time: 3 minutes Conventional data warehouses support data models based on star schema and snowflake schema. In these models, there are a number of fact tables and dimension tables. To minimize redundancy, it is recommended to split data into multiple tables. This is a normalization process. Normalization is the technique of eliminating redundant data. It minimizes insertion, deletion, and update anomalies. It saves the disk Continue Reading

Welcome to the world of Apache Spark

Reading Time: 5 minutes Welcome to another very important and interesting big data topic: Apache Spark. What is Apache Spark? Spark has been called a “general-purpose distributed data processing engine” for big data and machine learning. It lets you process big data sets faster by splitting the work up into chunks and assigning those chunks across computational resources. Why would you want to use Spark? Spark has some Continue Reading

Dynamic Partitioning in Apache Hive

Reading Time: 3 minutes Introduction We are back with another important big data concept: dynamic partitioning in Hive. Before moving to dynamic partitioning, we should understand static partitioning, which I explained in the blog Static partitioning. Now it’s time to dive deep into the dynamic one. How Dynamic Differs from Static Partitioning In dynamic partitioning, the values of the partition columns are only known at EXECUTION TIME. User is Continue Reading

Overview of Static Partitioning in Apache Hive

Reading Time: 4 minutes What is Partitioning? In simple words, partitioning is the process of dividing something into sections or parts, with the aim of making it easier to understand and manage. Apache Hive allows us to organize a table into multiple partitions where we can group the same kind of data together. It is used for distributing the load horizontally, which also helps to increase query Continue Reading

Apache Airflow Operators and Tasks

Reading Time: 3 minutes Context: What is Airflow? Airflow is a free, open-source Apache tool for managing workflows. It is one of the most popular workflow management systems out there, with great community support. What is a DAG? DAG stands for Directed Acyclic Graph. Directed means the flow is one-directional. Acyclic means the flow will never come back to Continue Reading

Best Way of Optimization: Bucketing in Hive

Reading Time: 4 minutes Apache Hive is an open-source data warehouse system used to query and analyze large datasets. Data in Apache Hive can be categorized into the following three parts: tables, partitions, and buckets. What is Bucketing in Hive? Bucketing in Hive is the concept of breaking data down into ranges, known as buckets, to give extra structure to the data so it may be Continue Reading

Apache Beam: Side input Pattern

Reading Time: 3 minutes Apache Beam is a unified programming model for defining both batch and streaming data-parallel processing pipelines. It is a modern way of defining data processing pipelines. It offers a rich set of APIs and mechanisms for solving complex use cases. In some use cases, the pipeline we define needs to use some additional inputs. For example, in streaming analytics applications, it Continue Reading

Google BigQuery: An Introduction to Big Data Analytics Platform.

Reading Time: 6 minutes Hey folks, today we are going to discuss Google BigQuery, an enterprise data warehouse with built-in machine learning capabilities. Before getting to BigQuery, let’s understand what Google Cloud Platform is. Google Cloud Platform is a suite of public cloud computing services offered by Google. The platform includes a range of hosted services for compute, storage, and application development that run on Google hardware. Google Cloud protects your data, applications, Continue Reading

Big Data Analytics: An Introduction

Reading Time: 5 minutes DATA ANALYTICS Data can help businesses better understand their customers and improve their advertising campaigns. It can also help them personalise their content and improve their bottom lines. The advantages of data are many, but you can’t access these benefits without the proper data analytics tools and processes. While raw data has a lot of potential, you need data analytics to unlock the power to grow Continue Reading

Flink: Implementing the Session window.

Reading Time: 3 minutes In the previous blogs, we learned about Tumbling, Sliding, and Count windows in Flink. Flink offers another useful way to window data: the Session window. In this blog, we will explore the Session window in detail with an example. In the real world, all the work that we do online, such as visiting a website, clicking around the website, doing online Continue Reading