Saving Spark DataFrames on Amazon S3 got Easier !!!


In our previous blog post, Congregating Spark Files on S3, we explained how to upload files saved on a Spark cluster to Amazon S3. Well, I agree that the method explained in that post was a bit complex and hard to apply, and it also added a lot of boilerplate to our code.

So, we started working on simplifying it and finding an easier way: a wrapper around Spark DataFrames that would help us save them on S3. The solution we found was a Spark package, spark-s3. It makes saving Spark DataFrames on S3 look like a piece of cake, as we can see from the code below.
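A rough sketch of such a save is shown below. The write/format/option/save calls are standard Spark DataFrameWriter API; the data-source name and the option keys (accessKey, secretKey, bucket, fileType) are only illustrative assumptions, so check the spark-s3 documentation on GitHub for the exact ones.

    // Rough sketch only: the data-source name and option keys below are
    // placeholders; the exact names are in the spark-s3 docs on GitHub.
    dataFrame.write
      .format("com.knoldus.spark.s3")              // assumed data-source name
      .option("accessKey", "YOUR_S3_ACCESS_KEY")   // placeholder credentials
      .option("secretKey", "YOUR_S3_SECRET_KEY")
      .option("bucket", "your-bucket-name")        // hypothetical bucket
      .option("fileType", "json")                  // output format to write
      .save("employees.json")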

The code itself shows that we no longer have to put any extra effort into saving Spark DataFrames on Amazon S3. All we need to do is include spark-s3 in our project dependencies and we are done.
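With sbt, for instance, that is a one-line addition to build.sbt. The coordinates below are placeholders, so pick up the real organization, artifact name, and version from the GitHub page:

    // build.sbt -- hypothetical coordinates; see the spark-s3 GitHub page for the real ones
    libraryDependencies += "com.knoldus" %% "spark-s3" % "x.y.z"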

Right now spark-s3 supports only the Scala and Java APIs, but we are working on providing support for Python and R too. So, stay tuned !!!

To know more about it, please read its documentation on GitHub.

Written by 

Himanshu Gupta is a software architect with more than 9 years of experience. He is always keen to learn new technologies. He likes not only programming languages but data analytics too. He has sound knowledge of machine learning and pattern recognition. He believes that the best results come when everyone works as a team. In his free time he likes listening to music, watching movies, and reading science-fiction books.

4 thoughts on “Saving Spark DataFrames on Amazon S3 got Easier !!!”

  1. Wait, why aren’t you writing directly to S3? This works out of the box on my cluster:

    dataFrame.write.format("json").save("s3://bucket/prefix/")

    Depending on how you spin up the cluster and on the Spark version, you may have to use either s3:// (on EMR, because EMRFS is implemented over s3://), s3n://, or s3a:// (on Spark standalone; s3a is included by default with Hadoop 2.7, I think; for older versions you may have to use s3n).

      1. I believe it could look like:

        // Builds a SparkConf with S3 credentials and the native S3 filesystem.
        // Note: Hadoop fs.* options set on a SparkConf usually need the
        // "spark.hadoop." prefix to reach the Hadoop configuration.
        import org.apache.spark.SparkConf

        val conf = new SparkConf()
          .setMaster(master)
          .setAppName(appName)
          .set("fs.s3.access.key", S3_ACCESS)
          .set("fs.s3.secret.key", S3_SECRET)
          .set("fs.s3.endpoint", S3_HOST)
          .set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
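Putting the two comments together, a hedged sketch of the direct-write approach could look like this. The fs.s3.* keys are taken verbatim from the reply above; depending on the Hadoop version they may instead need to be the fs.s3n.* variants (for example fs.s3n.awsAccessKeyId), and Hadoop options set through a SparkConf usually need the spark.hadoop. prefix to take effect:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // conf built as in the comment above; adjust key names / add the
    // "spark.hadoop." prefix as required by your Spark and Hadoop versions.
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical input path; read a DataFrame and write it straight to S3.
    val dataFrame = sqlContext.read.json("hdfs:///data/employees.json")
    dataFrame.write.format("json").save("s3://bucket/prefix/")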

