Category Archives: big data

What to do for overriding the PureConfig behavior in Scala ?


PureConfig has its own predefined behavior for reading and writing configuration files, but sometimes we get a tricky requirement that needs some specific behavior, for example when reading the config. It is possible to override the … Continue reading

Posted in Agile, Best Practices, big data, knoldus, Reactive, Scala | 1 Comment

Simple Java program to Append to a file in Hdfs


In this blog, I will present a Java program that appends to a file in HDFS. I will be using Maven as the build tool. To start with, we need to add the Maven dependencies in pom.xml. … Continue reading

Posted in big data, HDFS, Java | 1 Comment

Spark Streaming vs Kafka Stream


The demand for stream processing is increasing a lot these days. The reason is that processing big volumes of data is often not enough. Data has to be processed fast, so that a firm can react to changing business conditions … Continue reading

Posted in Apache Kafka, apache spark, big data, Scala, Streaming | 1 Comment

Introducing Kafka Streams: Processing made easy


If you are working with huge amounts of data, you might have heard about Kafka. At a very high level, Kafka is a fault-tolerant, distributed publish-subscribe messaging system that is designed for fast processing of data and the ability … Continue reading

Posted in Java, big data, Streaming | 1 Comment

Apache Spark: Reading csv using custom timestamp format


In this blog, we consider a situation where we want to read a CSV through Spark, but the CSV contains some timestamp columns. Is this going to be a problem when inferring the schema while reading the CSV with Spark? Well, … Continue reading

Posted in apache spark, big data, Functional Programming, Scala | 1 Comment

Resolving the Failure Issue of NameNode


In the previous blog, “Smattering of HDFS”, we learnt that “The NameNode is a Single Point of Failure for the HDFS Cluster”. Each cluster had a single NameNode, and if that machine became unavailable, the whole cluster would become unavailable … Continue reading

Posted in big data, HDFS, Scala | 1 Comment

Apache Spark: 3 Reasons Why You Should Not Use RDDs


Whenever we hear the two words Apache Spark, the first thing that comes to mind is RDDs, i.e., Resilient Distributed Datasets. It has now been more than 5 years since Apache Spark came into existence, and after its arrival a lot … Continue reading

Posted in apache spark, big data, Scala, Spark | 1 Comment

Apache Spark : Handle null timestamp while reading csv in Spark 2.0.0


Hello folks, hope you are all doing well! In this blog, I will discuss a problem that I faced some days back. One thing to keep in mind is that this problem is specific to Spark version 2.0.0. Other … Continue reading

Posted in apache spark, big data, Scala, Spark | Leave a comment

Installing and Running Presto


Hi folks! In my previous blog, I talked about Getting Introduced with Presto. In today’s blog, I will talk about setting up (installing) and running Presto. The basic prerequisites for setting up Presto are: Linux or Mac OS … Continue reading

Posted in big data, database, Scala | Leave a comment

Getting Introduced with Presto


Hi folks! In today’s blog I will introduce you to a new open source distributed SQL query engine: Presto. It is designed for running SQL queries over big data (petabytes of data). It was designed by the people … Continue reading

Posted in big data, Scala | 2 Comments