Resilient Distributed Dataset

Spark: Type Safety in Dataset vs DataFrame

May 16, 2020 | Apache Spark, Big Data and Fast Data, Studio-Scala | DataFrame, datasets, RDD, Resilient Distributed Dataset

Reading Time: 4 minutes. With type safety, a programming language prevents type errors; put another way, the compiler validates types at compile time and reports an error when we try to assign a value of the wrong type to a variable. Spark, a unified analytics engine for big data processing, provides two very useful APIs, DataFrame and Dataset, that are easy to use, intuitive, and expressive, which makes ... Continue Reading

RDD: Spark’s Fault Tolerant In-Memory weapon

August 26, 2018 | Apache Spark, Big Data and Fast Data, Studio-Scala, Tech Blogs | Apache Spark RDD, How RDD is useful?, How to create RDD?, Resilient Distributed Dataset, What is RDD?

Reading Time: 5 minutes. A fault-tolerant collection of elements that can be operated on in parallel: “Resilient Distributed Dataset”, a.k.a. RDD. An RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark: an immutable collection of objects that is computed on different nodes of the cluster. Every dataset in a Spark RDD is logically partitioned across many servers so that it can be computed on ... Continue Reading