Saving Spark DataFrames on Amazon S3 got Easier !!!

In our previous blog post, Congregating Spark Files on S3, we explained how to upload files saved in a Spark cluster to Amazon S3. Admittedly, the method described in that post was somewhat complex and hard to apply, and it added a lot of boilerplate to our code.

So we started looking for a simpler approach: a wrapper around Spark DataFrames that would help us save them to S3. The solution we found was a Spark package, spark-s3. It makes saving Spark DataFrames to S3 a piece of cake, as the code below shows:


As the code shows, we no longer have to put any extra effort into saving Spark DataFrames to Amazon S3. All we need to do is include spark-s3 in our project dependencies, and we are done.

Right now spark-s3 supports only the Scala and Java APIs, but we are working on providing support for Python and R too. So, stay tuned!

To learn more about it, please read its documentation on GitHub.

This entry was posted in Amazon, Scala, Spark.

4 Responses to Saving Spark DataFrames on Amazon S3 got Easier !!!

  1. virgil says:

    Wait, why aren’t you writing directly to S3? This works out of the box on my cluster:


    Depending on how you spin up the cluster and which Spark version you use, you may have to use either s3:// (on EMR, because EMRFS is implemented over s3://) or s3n:// or s3a:// (on Spark standalone; s3a is included by default with Hadoop 2.7, I think; for older versions you may have to use s3n).
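
    A direct write of the kind described here might look like the minimal sketch below. It is not the commenter's original snippet: the bucket name, path, and sample data are placeholders, and it assumes that the s3a connector (hadoop-aws) is on the classpath and that credentials are already available to the cluster.

    ```scala
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object DirectS3Write {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("direct-s3-write")
          .getOrCreate()
        import spark.implicits._

        // Small in-memory DataFrame standing in for real data.
        val df = Seq((1, "alpha"), (2, "beta")).toDF("id", "name")

        // Hypothetical bucket and path; swap the scheme for s3:// or s3n://
        // if that is what your cluster expects.
        df.write
          .mode(SaveMode.Overwrite)
          .parquet("s3a://my-bucket/my_path/")

        spark.stop()
      }
    }
    ```

    Only the URI scheme changes between the three variants; the DataFrameWriter call stays the same.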

  2. Javier Alba says:

    This works for me in Spark 2.0.2 (Python): "s3n://my-bucket/my_path/", format="csv")

    • Gocht says:

      How do you set the access_key and secret_key from your AWS account?

      • I believe it could look like:

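        import org.apache.spark.SparkConf

        // Property names as suggested here; the exact keys depend on the S3 connector in use.
        val conf = new SparkConf()
          .set("fs.s3.access.key", S3_ACCESS)
          .set("fs.s3.secret.key", S3_SECRET)
          .set("fs.s3.endpoint", S3_HOST)
          .set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")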
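
        For the s3a connector specifically, passing credentials could look like the sketch below. It assumes Hadoop's s3a property names (fs.s3a.access.key, fs.s3a.secret.key, fs.s3a.endpoint) and Spark's spark.hadoop. prefix, which copies a setting into the Hadoop configuration. The environment variable names and the endpoint value are placeholders for illustration.

        ```scala
        import org.apache.spark.sql.SparkSession

        object S3aCredentials {
          def main(args: Array[String]): Unit = {
            // Read secrets from the environment rather than hard-coding them.
            // sys.env(...) throws NoSuchElementException if a variable is unset.
            val accessKey = sys.env("AWS_ACCESS_KEY_ID")
            val secretKey = sys.env("AWS_SECRET_ACCESS_KEY")

            val spark = SparkSession.builder()
              .appName("s3a-credentials")
              .config("spark.hadoop.fs.s3a.access.key", accessKey)
              .config("spark.hadoop.fs.s3a.secret.key", secretKey)
              .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
              .getOrCreate()

            // Print one setting to check that it reached the Hadoop configuration.
            println(spark.sparkContext.hadoopConfiguration.get("fs.s3a.endpoint"))

            spark.stop()
          }
        }
        ```

        On EMR or on EC2 instances with an IAM role attached, explicit keys are usually unnecessary, so this applies mainly to standalone clusters.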
