In our previous blog post, Congregating Spark Files on S3, we explained how to upload files saved on a Spark cluster to Amazon S3. Admittedly, the method described there was somewhat complex and hard to apply, and it also added a lot of boilerplate to our code.
So, we started working on simplifying it and finding an easier way: a wrapper around Spark DataFrames that would save them to S3 for us. The solution we found was a Spark package, spark-s3. It makes saving Spark DataFrames on S3 look like a piece of cake, as the code below shows:
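A minimal sketch of what such a save looks like. Note that the data-source name and the option keys here (`accessKey`, `secretKey`, `bucket`, `fileType`) are illustrative assumptions, not the package's confirmed API; the GitHub documentation linked below has the exact usage.

```scala
// Illustrative sketch only: the format name and option keys below are
// assumptions about the spark-s3 API; consult its GitHub docs for exact names.
dataFrame.write
  .format("com.knoldus.spark.s3")          // assumed spark-s3 data-source name
  .option("accessKey", "your_access_key")  // placeholder credentials
  .option("secretKey", "your_secret_key")
  .option("bucket", "your-bucket-name")
  .option("fileType", "json")
  .save("sample.json")
```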
The code speaks for itself: we no longer have to put any extra effort into saving Spark DataFrames on Amazon S3. All we need to do is include spark-s3 in our project dependencies and we are done.
Right now spark-s3 supports only the Scala & Java APIs, but we are working on providing support for Python and R too. So, stay tuned!
To know more about it, please read its documentation on GitHub.
Wait, why aren’t you writing directly to S3? This works out of the box on my cluster:
dataFrame.write.format("json").save("s3://bucket/prefix/")
Depending on how you spin up the cluster and the Spark version, you may have to use either s3:// (on EMR, because EMRFS is implemented over s3://), s3n://, or s3a:// (on Spark standalone; s3a:// is included by default with Hadoop 2.7, I think; for older versions you may have to use s3n://).
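On a standalone cluster, the scheme choice mostly comes down to which Hadoop filesystem implementation is on the classpath. A sketch of wiring up s3a:// via the Hadoop configuration (the property names follow the standard hadoop-aws module; the key values are placeholders, and an existing `SparkSession` named `spark` is assumed):

```scala
// Sketch: configure the s3a filesystem through Spark's Hadoop configuration.
// fs.s3a.access.key / fs.s3a.secret.key are the standard hadoop-aws properties;
// the credential values are placeholders.
val sc = spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

dataFrame.write.format("json").save("s3a://bucket/prefix/")
```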
This works for me in Spark 2.0.2 (Python):
my_df.write.save("s3n://my-bucket/my_path/", format="csv")
How do you set the access key and secret key from your AWS account?
I believe it could look like:
val conf = new SparkConf()
  .setMaster(master)
  .setAppName(appName)
  // Hadoop filesystem properties need the "spark.hadoop." prefix on SparkConf,
  // and the s3n credential keys are awsAccessKeyId / awsSecretAccessKey
  .set("spark.hadoop.fs.s3n.awsAccessKeyId", S3_ACCESS)
  .set("spark.hadoop.fs.s3n.awsSecretAccessKey", S3_SECRET)
  .set("spark.hadoop.fs.s3n.endpoint", S3_HOST)
  .set("spark.hadoop.fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")