In our previous blog post, Congregating Spark Files on S3, we explained how to upload files saved in a Spark cluster to Amazon S3. Admittedly, the method described in that post was a bit complex and hard to apply. It also added a lot of boilerplate to our code.
So we started looking for a simpler approach: a wrapper around Spark DataFrames that would help us save them to S3. The solution we found was a Spark package, spark-s3. It makes saving Spark DataFrames to S3 a piece of cake, as we can see from the code below:
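As a rough sketch of what such a call can look like, here is a small helper built on Spark's generic `DataFrameWriter` API. Please treat the details as assumptions rather than the package's confirmed interface: the data source name `com.knoldus.spark.s3` and the option keys `accessKey`, `secretKey`, `bucket`, and `fileType` should be checked against the spark-s3 documentation, and `S3Writer.saveToS3` is a hypothetical helper of our own, not part of the package.

```scala
import org.apache.spark.sql.DataFrame

object S3Writer {

  // Hypothetical helper for illustration; not part of spark-s3.
  // Writes the given DataFrame as JSON to the given S3 bucket.
  def saveToS3(
      df: DataFrame,
      accessKey: String,
      secretKey: String,
      bucket: String,
      fileName: String): Unit = {
    df.write
      .format("com.knoldus.spark.s3")  // assumed data source name
      .option("accessKey", accessKey)  // assumed option key
      .option("secretKey", secretKey)  // assumed option key
      .option("bucket", bucket)        // assumed option key
      .option("fileType", "json")      // assumed option key and value
      .save(fileName)
  }
}
```

A call might then look like `S3Writer.saveToS3(df, sys.env("AWS_ACCESS_KEY_ID"), sys.env("AWS_SECRET_ACCESS_KEY"), "my-bucket", "sample.json")`, where the bucket and file names are placeholders. Reading the credentials from environment variables keeps them out of the source code.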
The code speaks for itself: we no longer have to put any extra effort into saving Spark DataFrames to Amazon S3. All we need to do is include spark-s3 in our project dependencies, and we are done.
Right now, spark-s3 supports only the Scala and Java APIs, but we are working on adding support for Python and R too. So stay tuned!
To learn more about it, please read its documentation on GitHub.