Big Data and Fast Data

Apache Kafka: Log Compaction

Reading Time: 3 minutes As we all know, many systems use Kafka for distributed, real-time processing of messages at large scale. Before starting on this topic, I assume that you are familiar with basic Kafka concepts such as brokers, partitions, topics, producers, and consumers. Here we are discussing log compaction. What is Log Compaction? Kafka log compaction is a hybrid approach that makes Continue Reading
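The core idea behind log compaction can be sketched in a few lines: for each key, only the record with the latest offset survives. This is a pure-Python analogy of the semantics, not Kafka's actual implementation (which compacts log segments in the background):

```python
# Minimal sketch of Kafka-style log compaction (an analogy, not Kafka's code):
# keep only the latest record per key, emitted in the order of the surviving offsets.
def compact(log):
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)   # a later write for the same key overwrites the earlier one
    # emit surviving records in offset order, like a compacted segment
    return [(key, value) for key, (offset, value) in
            sorted(latest.items(), key=lambda item: item[1][0])]

log = [("user1", "a"), ("user2", "b"), ("user1", "c")]
print(compact(log))   # [('user2', 'b'), ('user1', 'c')]
```

In real Kafka this behavior is enabled per topic with `cleanup.policy=compact`.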

Best Way of Optimization: Bucketing in Hive

Reading Time: 4 minutes Apache Hive is an open-source data warehouse system used to query and analyze large datasets. Data in Apache Hive can be categorized into the following three parts: tables, partitions, and buckets. What is Bucketing in Hive? Bucketing in Hive is the concept of breaking data down into ranges, known as buckets, to give extra structure to the data so it may be Continue Reading
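Bucketing assigns each row to one of a fixed number of buckets by hashing the bucketing column, as in Hive's `CLUSTERED BY (user_id) INTO 4 BUCKETS`. A minimal sketch of that idea, with CRC32 standing in for Hive's internal hash function (an assumption for illustration only):

```python
import zlib

# Sketch of hash-based bucketing: a row lands in bucket hash(key) % num_buckets.
# CRC32 is used here only as a deterministic stand-in for Hive's hash function.
def bucket_for(key, num_buckets):
    return zlib.crc32(str(key).encode()) % num_buckets

def bucketize(rows, key_fn, num_buckets):
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(key_fn(row), num_buckets)].append(row)
    return buckets
```

Because the same key always hashes to the same bucket, joins and sampling on the bucketed column can skip most of the data.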

Deep Dive into Hadoop MapReduce, Part 2

Reading Time: 8 minutes Prerequisite: basic Hadoop knowledge and an understanding of the Deep Dive into Hadoop MapReduce, Part 1 blog. MapReduce Tutorial: Introduction In this MapReduce tutorial blog, I am going to introduce you to MapReduce, one of the core building blocks of processing in the Hadoop framework. Before moving ahead, I would suggest getting familiar with HDFS concepts, which I covered in my previous HDFS tutorial blog. Continue Reading
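The map, shuffle, and reduce phases described in the tutorial can be sketched as a toy word count in plain Python; the function names are illustrative, and in real Hadoop the shuffle is done by the framework between distributed mappers and reducers:

```python
from collections import defaultdict

# Toy word count mirroring Hadoop's map -> shuffle -> reduce phases.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)                 # mapper emits (word, 1) pairs

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)          # framework groups all values by key
    return grouped

def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

result = reduce_phase(shuffle(map_phase(["hello world", "hello hadoop"])))
print(result)   # {'hello': 2, 'world': 1, 'hadoop': 1}
```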

Introduction To Apache Kafka

Reading Time: 6 minutes Introduction Apache Kafka is a framework implementation of a software bus using stream processing. It is an open-source platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream-processing library. Apache Continue Reading

Apache Beam: Ways to join PCollections

Reading Time: 4 minutes Joining multiple sets of data into a single entity is a very common task when working with data pipelines. In this blog, we will cover how we can perform join operations between datasets in Apache Beam. There are different ways to join PCollections in Apache Beam: extension-based joins, group-by-key-based joins, and joins using side input. Let's understand these different ways to perform a join with examples. We Continue Reading
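The group-by-key-based join in Beam is built on `CoGroupByKey`: for each key, it gathers the matching elements from every input PCollection. A pure-Python sketch of those semantics (not the Beam API itself), with sample data that is purely illustrative:

```python
from collections import defaultdict

# Pure-Python sketch of the semantics of Beam's CoGroupByKey-based join:
# for each key, collect the matching values from both keyed collections.
def cogroup_join(left, right):
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)

emails = [("alice", "a@x.com")]
phones = [("alice", "555-1234"), ("bob", "555-9876")]
print(cogroup_join(emails, phones))
# {'alice': (['a@x.com'], ['555-1234']), 'bob': ([], ['555-9876'])}
```

Note that keys present in only one input still appear in the result with an empty list on the other side, which is how Beam lets you implement outer-join behavior on top of `CoGroupByKey`.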

Kafka Kerberos Authentication

Reading Time: 2 minutes In this article, we will start looking into Kerberos authentication, focusing on the client-side configuration required to authenticate with clusters configured to use Kerberos. Kafka supports four different communication protocols between consumers, producers, and brokers. Each protocol addresses different security aspects; PLAINTEXT is the old, insecure communication protocol. PLAINTEXT (non-authenticated, non-encrypted), SSL (SSL authentication, encrypted), PLAINTEXT+SASL (authenticated, non-encrypted), SSL+SASL (encrypted authentication, encrypted Continue Reading
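A Kerberos-authenticated client typically sets the SASL/GSSAPI properties below. The keytab path, principal, and realm are placeholders you would replace with your own; this is a minimal sketch, not a complete production configuration:

```properties
# Client-side Kafka properties for Kerberos (SASL/GSSAPI) authentication.
# Use SASL_SSL instead of SASL_PLAINTEXT if the brokers also encrypt traffic.
security.protocol=SASL_PLAINTEXT
sasl.mechanism=GSSAPI
sasl.kerberos.service.name=kafka
sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
    useKeyTab=true \
    keyTab="/etc/security/keytabs/client.keytab" \
    principal="client@EXAMPLE.COM";
```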

Apache Beam: Side input Pattern

Reading Time: 3 minutes Apache Beam is a unified programming model for defining both batch and streaming data-parallel processing pipelines. It is a modern way of defining data processing pipelines, with a rich set of APIs and mechanisms to solve complex use cases. In some use cases, while we define our data pipelines, the requirement is that the pipeline should use some additional inputs. For example, in streaming analytics applications, it Continue Reading
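The side-input pattern means every element of the main input is processed together with a small auxiliary dataset. A pure-Python sketch of the idea (in Beam itself the side input is passed to a `ParDo`/`Map` via `beam.pvalue.AsDict` or similar); the exchange-rate data here is purely illustrative:

```python
# Pure-Python sketch of Beam's side-input pattern: each element of the main
# input is transformed with access to a small auxiliary dataset (the side input).
def apply_with_side_input(main_input, side_input, fn):
    return [fn(element, side_input) for element in main_input]

exchange_rates = {"EUR": 1.1, "GBP": 1.3}      # side input: an assumed lookup table
orders = [("EUR", 100.0), ("GBP", 50.0)]       # main input: (currency, amount)

usd = apply_with_side_input(
    orders, exchange_rates,
    lambda order, rates: round(order[1] * rates[order[0]], 2))
print(usd)   # [110.0, 65.0]
```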

DataMesh: A Web Netting

Reading Time: 4 minutes Enterprise Data Challenges Global data creation is projected to exceed 180 zettabytes in the next five years. Current data platforms have several architectural failings that hinder enterprise data processing and inhibit business growth. How is today's enterprise data managed? Today's technology and organization design divide data into two categories: operational data and analytical data. Operational data is transactional; it is stored in an RDBMS at the backend and helps the Continue Reading

Comparing Data Streaming Frameworks | Scala

Reading Time: 4 minutes In this era of technology, the amount of data is growing exponentially, and every bit of data holds value. According to some reports, the number of bytes generated and stored in the world has already exceeded the number of stars in the sky. Since every bit is useful, it is very important to store data without losing any of it. When Continue Reading

Which Technology is Better: GSM or CDMA?

Reading Time: 3 minutes What is CDMA? CDMA stands for Code Division Multiple Access. CDMA came into existence with the 2G and 3G generations. It is a wireless communication protocol based on spread-spectrum technology that makes optimal use of the available bandwidth, allowing each user to transmit data over the entire frequency spectrum at any time. Whenever a call Continue Reading

Internet of Things: An Implementation Aspect

Reading Time: 3 minutes What is the Internet of Things? The Internet of Things, or IoT, refers to the billions of physical devices around the world that are now connected to the internet, all collecting and sharing data. IoT is an intelligent technology that reduces human effort and provides easy access to physical devices. In other words, IoT devices are basically smart devices which have support for Continue Reading

Google BigQuery: An Introduction to a Big Data Analytics Platform

Reading Time: 6 minutes Hey folks, today we are going to discuss Google BigQuery, an enterprise data warehouse with built-in machine learning capabilities. Before getting to BigQuery, let's understand what Google Cloud Platform is. Google Cloud Platform is a suite of public cloud computing services offered by Google. The platform includes a range of hosted services for compute, storage, and application development that run on Google hardware. Google Cloud protects your data, applications, Continue Reading

Apache Calcite: Evaluating REX Expressions

Reading Time: 2 minutes Apache Calcite is a dynamic data management framework. It provides a SQL parser, validator, and optimizer. Using the sub-project Avatica, we also have the ability to execute our optimized queries against external databases. Every row expression (rex) in a SQL query is defined internally as a 'RexNode', which can be an identifier, a literal, or a function. In this blog we will illustrate evaluating functions involving literals Continue Reading
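The RexNode structure described above is a small expression tree: a node is a literal or a function call over child expressions. A pure-Python analogue of evaluating such a tree (this mirrors the idea only; it is not Calcite's actual API, and the operator table here is an assumption):

```python
# Tiny analogue of Calcite's RexNode tree: a row expression is either a
# literal or a call applying an operator to sub-expressions, evaluated recursively.
OPS = {"+": lambda a, b: a + b,
       "*": lambda a, b: a * b,
       "UPPER": lambda s: s.upper()}

def evaluate(node):
    kind = node[0]
    if kind == "literal":
        return node[1]
    if kind == "call":
        op, args = node[1], node[2]
        return OPS[op](*[evaluate(arg) for arg in args])
    raise ValueError(f"unknown node kind: {kind}")

# 2 + 3 * 4, written as a nested expression tree
expr = ("call", "+", [("literal", 2),
                      ("call", "*", [("literal", 3), ("literal", 4)])])
print(evaluate(expr))   # 14
```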