In our previous blog post, Congregating Spark Files on S3, we explained how to upload files saved on a Spark cluster to Amazon S3. Admittedly, the method described there was somewhat complex and hard to apply, and it also added a lot of boilerplate to our code.
So, we started working on simplifying it and finding an easier way: a wrapper around Spark DataFrames that would save them to S3 for us. The solution we found was a Spark package, spark-s3. It makes saving Spark DataFrames on S3 look like a piece of cake, as the code below shows:
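A minimal sketch of what such a save looks like. Note that the data-source name and the option keys here (`accessKey`, `secretKey`, `bucket`, `fileType`) are illustrative assumptions, not the package's confirmed API; the GitHub documentation linked below has the exact usage.

```scala
// Illustrative sketch only: the format name and option keys below are
// assumptions about the spark-s3 API; consult its GitHub docs for exact names.
dataFrame.write
  .format("com.knoldus.spark.s3")          // assumed spark-s3 data-source name
  .option("accessKey", "your_access_key")  // placeholder credentials
  .option("secretKey", "your_secret_key")
  .option("bucket", "your-bucket-name")
  .option("fileType", "json")
  .save("sample.json")
```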
The code speaks for itself: we no longer have to put any extra effort into saving Spark DataFrames on Amazon S3. All we need to do is include spark-s3 in our project dependencies and we are done.
Right now spark-s3 supports only the Scala & Java APIs, but we are working on providing support for Python and R too. So, stay tuned!
To know more about it, please read its documentation on GitHub.
Wait, why aren’t you writing directly to S3? This works out of the box on my cluster:
dataFrame.write.format("json").save("s3://bucket/prefix/")
Depending on how you spin up the cluster and the Spark version, you may have to use either s3:// (on EMR, because EMRFS is implemented over s3://), s3n://, or s3a:// (on Spark standalone; s3a:// is included by default with Hadoop 2.7, I think; for older versions you may have to use s3n://).
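On a standalone cluster, the scheme choice mostly comes down to which Hadoop filesystem implementation is on the classpath. A sketch of wiring up s3a:// via the Hadoop configuration (the property names follow the standard hadoop-aws module; the key values are placeholders, and an existing `SparkSession` named `spark` is assumed):

```scala
// Sketch: configure the s3a filesystem through Spark's Hadoop configuration.
// fs.s3a.access.key / fs.s3a.secret.key are the standard hadoop-aws properties;
// the credential values are placeholders.
val sc = spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

dataFrame.write.format("json").save("s3a://bucket/prefix/")
```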
This works for me in Spark 2.0.2 (Python):
my_df.write.save("s3n://my-bucket/my_path/", format="csv")
How do you set the access key and secret key from your AWS account?
I believe it could look like:
val conf = new SparkConf()
  .setMaster(master)
  .setAppName(appName)
  // Hadoop filesystem properties need the "spark.hadoop." prefix on SparkConf,
  // and the s3n credential keys are awsAccessKeyId / awsSecretAccessKey
  .set("spark.hadoop.fs.s3n.awsAccessKeyId", S3_ACCESS)
  .set("spark.hadoop.fs.s3n.awsSecretAccessKey", S3_SECRET)
  .set("spark.hadoop.fs.s3n.endpoint", S3_HOST)
  .set("spark.hadoop.fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")