Saving Spark DataFrames on Amazon S3 got Easier !!!


In our previous blog post, Congregating Spark Files on S3, we explained how to upload files saved in a Spark cluster to Amazon S3. Admittedly, the method explained in that post was somewhat complex and hard to apply, and it added a lot of boilerplate to our code.

So, we started looking for an easier way: a wrapper around Spark DataFrames that would help us save them on S3. The solution we found was a Spark package: spark-s3. It makes saving Spark DataFrames on S3 a piece of cake, as the code below shows:

dataFrame.write
  .format("com.knoldus.spark.s3")
  .option("accessKey","s3_access_key")
  .option("secretKey","s3_secret_key")
  .option("bucket","bucket_name")
  .option("fileType","json")
  .save("sample.json")

As the code shows, we no longer have to put any extra effort into saving Spark DataFrames on Amazon S3. All we need to do is include spark-s3 in our project dependencies and we are done.

Right now spark-s3 supports only the Scala & Java APIs, but we are working on providing support for Python and R too. So, stay tuned!

To know more about it, please read its documentation on GitHub.


3 Responses to Saving Spark DataFrames on Amazon S3 got Easier !!!

  1. virgil says:

    Wait, why aren’t you writing directly to S3? This works out of the box on my cluster:

    dataFrame.write.format("json").save("s3://bucket/prefix/")

    Depending on how you spin up the cluster and on the Spark version, you may have to use either s3:// (on EMR, because EMRFS is implemented over s3://) or s3n:// or s3a:// (on Spark standalone; s3a is included by default with Hadoop 2.7, I think; for older versions you may have to use s3n).

  2. Javier Alba says:

    This works for me in spark 2.0.2 (python):

    my_df.write.save("s3n://my-bucket/my_path/", format="csv")
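
The rule of thumb from the comments above (s3:// on EMR, s3a:// or s3n:// on a standalone cluster) can be written down as a small Scala helper. This is only a sketch: the S3Scheme object and everything inside it are hypothetical names made up for illustration, not part of Spark or spark-s3, and the right scheme for a real cluster still depends on its Hadoop build and configuration.

```scala
// Hypothetical helper (not part of Spark or spark-s3): it only encodes the
// scheme choice described in the comments as plain string logic.
object S3Scheme {
  sealed trait Cluster
  case object Emr extends Cluster              // EMRFS is exposed as s3://
  case object StandaloneS3a extends Cluster    // Hadoop build that ships s3a
  case object StandaloneLegacy extends Cluster // older Hadoop build, s3n only

  // Maps the kind of cluster to the URI scheme suggested in the comments.
  def scheme(cluster: Cluster): String = cluster match {
    case Emr              => "s3"
    case StandaloneS3a    => "s3a"
    case StandaloneLegacy => "s3n"
  }

  // Builds a directory-style destination such as s3a://bucket/prefix/
  def destination(cluster: Cluster, bucket: String, prefix: String): String = {
    val cleaned = prefix.stripPrefix("/").stripSuffix("/")
    val base = s"${scheme(cluster)}://$bucket/"
    if (cleaned.isEmpty) base else s"$base$cleaned/"
  }
}

// Usage with an existing DataFrame (needs Spark and a matching S3 connector
// on the classpath, so it is left as a comment here):
// dataFrame.write.format("json")
//   .save(S3Scheme.destination(S3Scheme.StandaloneS3a, "bucket", "prefix"))
```

Keeping the scheme choice in one place means the write call itself stays the same when the job moves between EMR and a standalone cluster.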
