Spark: createDataFrame() vs toDF()


There are two different ways to create a DataFrame in Spark: toDF() and createDataFrame(). In this blog we will see how to create a DataFrame using each of these methods and what exactly the difference between them is.

toDF()

The toDF() method provides a very concise way to create a DataFrame and can be called on a sequence of objects. To access the toDF() method, we have to import spark.implicits._ after creating the SparkSession.

val empDataFrame = Seq(("Alice", 24), ("Bob", 26)).toDF("name","age")
empDataFrame: org.apache.spark.sql.DataFrame = [name: string, age: int]

In the above code we have applied toDF() to a sequence of Tuple2 and passed two strings, “name” and “age”, as column names. These two strings map to the columns of empDataFrame. Let’s print the schema of empDataFrame.
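A minimal sketch of inspecting the schema; the output shown assumes the empDataFrame built from Seq(("Alice", 24), ("Bob", 26)) above:

```scala
// Print the schema Spark inferred from the sequence of tuples
empDataFrame.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)
```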

We can see that Spark has inferred a column type and a nullable flag for every column. The column name has type string with the nullable flag set to true, while the column age has type integer with the nullable flag set to false (Scala’s Int is a primitive and can never hold null). So we can conclude that with the toDF() method we don’t have control over the column types and nullable flags; there is no control over schema customization. In most cases, the toDF() method is only suitable for local testing.

createDataFrame()

The createDataFrame() method addresses the limitations of the toDF() method. With the createDataFrame() method we have complete control over schema customization.

import org.apache.spark.sql.Row

val empData = Seq(Row("Alice", 24), Row("Bob", 26))
empData: Seq[org.apache.spark.sql.Row] = List([Alice,24], [Bob,26])

Let’s define the Schema for above empData.

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val empSchema = List(StructField("name", StringType, true), StructField("age", IntegerType, true))
empSchema: List[org.apache.spark.sql.types.StructField] = List(StructField(name, StringType, true), StructField(age, IntegerType, true))

The schema for empData has been defined as a list of StructField. Each StructField takes three parameters: the name of the column, the type of the column, and a nullable flag. Now let’s pass empData and empSchema to the createDataFrame() method and create empDataFrame.

val empDataFrame = spark.createDataFrame(spark.sparkContext.parallelize(empData), StructType(empSchema))
empDataFrame: org.apache.spark.sql.DataFrame = [name: string, age: int]
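To confirm that the custom schema took effect, we can print the schema again; the output shown assumes the empSchema defined above, where both fields were declared nullable:

```scala
// Both columns now carry the nullable flag we declared in empSchema
empDataFrame.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = true)
```

Unlike the toDF() version, age is now nullable = true, because that is what we declared in the schema.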

In this way, we have control over the column names, column types, and nullable flags. When running code on a cluster or in production it is better to use the createDataFrame() method, and it works just as well for local testing.

Conclusion

createDataFrame() and toDF() are two different ways to create a DataFrame in Spark. With the toDF() method we don’t have control over schema customization, whereas with the createDataFrame() method we have complete control over it. Use the toDF() method only for local testing, but the createDataFrame() method for both local testing and production code.
