In our previous blog post, Congregating Spark Files on S3, we explained how to upload files saved on a Spark cluster to Amazon S3. Admittedly, the method explained in that post was a little complex and hard to apply, and it also added a lot of boilerplate to our code.
So, we started working on simplifying it and finding an easier way to wrap Spark DataFrames so that they can be saved on S3. The solution we found is a Spark package: spark-s3. It makes saving Spark DataFrames on S3 look like a piece of cake, as we can see from the code below:
dataFrame.write
  .format("com.knoldus.spark.s3")
  .option("accessKey", "s3_access_key")
  .option("secretKey", "s3_secret_key")
  .option("bucket", "bucket_name")
  .option("fileType", "json")
  .save("sample.json")
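To give a bit more context, here is a rough sketch of how that save call might sit inside a small Spark application. Only the write(...) chain above comes from spark-s3 itself; the SparkContext/SQLContext setup, the environment-variable names (AWS_ACCESS_KEY, AWS_SECRET_KEY), and the input file are our own assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SaveToS3 extends App {
  val conf = new SparkConf().setAppName("spark-s3-example").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  // Read credentials from the environment instead of hardcoding them in source.
  val accessKey = sys.env("AWS_ACCESS_KEY")
  val secretKey = sys.env("AWS_SECRET_KEY")

  // Any DataFrame works here; we just load a small JSON file as an example.
  val dataFrame = sqlContext.read.json("sample.json")

  dataFrame.write
    .format("com.knoldus.spark.s3")
    .option("accessKey", accessKey)
    .option("secretKey", secretKey)
    .option("bucket", "bucket_name")
    .option("fileType", "json")
    .save("sample.json")

  sc.stop()
}

Reading the keys from the environment is just a habit we would suggest anyway, so that credentials never end up committed to version control.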
The code speaks for itself: we no longer have to put any extra effort into saving Spark DataFrames on Amazon S3. All we need to do is include spark-s3 in our project dependencies and we are done.
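As a sketch, wiring that dependency into an sbt build could look something like the following. Please note that the group/artifact coordinates and the version numbers shown here are assumptions on our part; check the spark-s3 README on GitHub for the current ones:

// build.sbt (sketch; verify coordinates and versions against the spark-s3 README)
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  // Spark SQL is "provided" because the cluster already ships it at runtime.
  "org.apache.spark" %% "spark-sql" % "1.6.0" % "provided",
  // spark-s3 itself; coordinates and version assumed for illustration only.
  "com.knoldus" %% "spark-s3" % "0.0.1"
)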
Right now spark-s3 supports only the Scala and Java APIs, but we are working on adding support for Python and R too. So, stay tuned!
To learn more about spark-s3, please read its documentation on GitHub.