Author: Amarjeet Singh

Spark Structured Streaming

Reading Time: 3 minutes

Overview: Structured Streaming was introduced in Spark 2.0 for building continuous applications. It lets you apply processing logic to streaming data in much the same way as to batch data. It also provides scalable, fault-tolerant processing through checkpointing and write-ahead logs. The engine is built on top of Spark SQL. It processes data in real time from sources and ... Continue Reading

An Overview of Elasticsearch

Reading Time: 3 minutes

Introduction: Elasticsearch is a distributed, open-source full-text search and analytics engine that stores data as schema-free JSON documents. It is built on the Apache Lucene library and is an important part of the ELK stack. Data can be stored, searched, and analyzed in near real time, and results can be retrieved in milliseconds. Data is stored in documents rather than tables. It also comes with a rich ... Continue Reading

Collections in Scala

Reading Time: 2 minutes

Scala has a rich collections library. Collections are containers that hold a sequenced, linear set of items. Collections may be strict or lazy; the elements of a lazy collection are not evaluated until they are accessed. Collections can also be mutable or immutable.

ArrayBuffer: As we know, arrays are homogeneous and mutable. You can change the values they hold but cannot change the size ... Continue Reading

Spark 3.0: Adaptive Query Execution (AQE)

Reading Time: 3 minutes

Introduction: Optimization plays an important role in the success of Spark SQL, so a lot of work has been done in this direction. Before Spark 3.0, cost-based optimization was a major step forward: different strategies are compared by cost (based on time efficiency and estimated CPU and I/O usage), and the strategy that minimizes the cost is executed. But, because ... Continue Reading