Category Archives: big data

Having Issue How To Order Streamed Dataframe?


A few days ago, I had to perform an aggregation on a streaming DataFrame. The moment I applied groupBy for the aggregation, the data got shuffled. So the question arose: how do I maintain order? Yes, I can use orderBy with a streaming DataFrame using … Continue reading

Posted in Apache Kafka, apache spark, big data, Scala, Spark, Streaming | Leave a comment

Can we stop talking about Big Data now?


If it were still 2012, I would have eagerly listened and responded to any conversation about Big Data. Well, it was the buzz, and you had to speak the magic words to get people to listen to the latest … Continue reading

Posted in big data, Scala | 1 Comment

What to do for overriding the PureConfig behavior in Scala?


PureConfig has its own predefined behavior for reading and writing configuration files, but sometimes we get a tricky requirement that needs some specific behavior, for example when reading the config. It is possible to override the … Continue reading

Posted in Agile, Best Practices, big data, knoldus, Reactive, Scala | 1 Comment

Simple Java program to Append to a file in Hdfs


In this blog, I will present a Java program to append to a file in HDFS. I will be using Maven as the build tool. To start with, we need to add the Maven dependencies in pom.xml. … Continue reading

Posted in big data, HDFS, Java | 1 Comment

Spark Streaming vs Kafka Stream


The demand for stream processing is increasing a lot these days. The reason is that processing big volumes of data is often not enough. Data has to be processed fast, so that a firm can react to changing business conditions … Continue reading

Posted in Apache Kafka, apache spark, big data, Scala, Streaming | 1 Comment

Introducing Kafka Streams: Processing made easy


If you are working with a huge amount of data, you might have heard about Kafka. At a very high level, Kafka is a fault-tolerant, distributed publish-subscribe messaging system that is designed for fast processing of data and the ability … Continue reading

Posted in big data, Java, Streaming | 1 Comment

Apache Spark: Reading csv using custom timestamp format


In this blog, we consider a situation where I wanted to read a CSV through Spark, but the CSV contains some timestamp columns. Is this going to be a problem while inferring the schema at the time of reading the CSV using Spark? Well, … Continue reading

Posted in apache spark, big data, Functional Programming, Scala | 1 Comment

Resolving the Failure Issue of NameNode


In the previous blog, “Smattering of HDFS”, we learnt that “The NameNode is a Single Point of Failure for the HDFS Cluster”. Each cluster had a single NameNode, and if that machine became unavailable, the whole cluster would become unavailable … Continue reading

Posted in big data, HDFS, Scala | 1 Comment

Apache Spark: 3 Reasons Why You Should Not Use RDDs


Apache Spark: whenever we hear these two words, the first thing that comes to mind is RDDs, i.e., Resilient Distributed Datasets. It has been more than 5 years since Apache Spark came into existence, and after its arrival a lot … Continue reading

Posted in apache spark, big data, Scala, Spark | 1 Comment

Apache Spark : Handle null timestamp while reading csv in Spark 2.0.0


Hello folks, hope you are all doing well! In this blog, I will discuss a problem I faced a few days back. One thing to keep in mind is that this problem is specific to Spark version 2.0.0. Other … Continue reading

Posted in apache spark, big data, Scala, Spark | Leave a comment