Apache Spark: Read Data from S3 Bucket

Reading Time: 2 minutes

Well, anyone working with Spark is familiar with the usual ways of reading a file, whether from a local path, a table, or HDFS.
But do you know how tricky it can be to read data into Spark from an S3 bucket?

So, this blog gives you a step-by-step walkthrough of how to read data from an S3 bucket.

Before moving to our actual topic, we should know what an S3 bucket is.

Amazon S3

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.

So, in short, S3 is an object store in which you can keep any type of data.

Accessing S3 Bucket through Spark

Now, coming to the actual topic: how to read data from an S3 bucket into Spark. You cannot do it simply by adding the spark-core dependency to your project and calling spark.read; Spark also needs the Hadoop S3A connector and your AWS credentials before it can reach the bucket.

So, to read data from S3, follow the steps below:

1. Edit the spark-defaults.conf file
You need to add the three lines below, which consist of your S3 access key, secret key, and file-system implementation:
spark.hadoop.fs.s3a.access.key AKIBJEKY6UIV6M32JXAQ
spark.hadoop.fs.s3a.secret.key IWT3f8BjqUFTZlbVXx+3Tk7eSUUHLj6CIRLWSP5lz0
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
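If you cannot edit spark-defaults.conf, the same three properties can also be set programmatically on the SparkSession's Hadoop configuration. Below is a minimal sketch, assuming the credentials are available in the standard AWS environment variables rather than hard-coded (the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("S3ReadExample") // illustrative name
  .getOrCreate()

// Same properties as in spark-defaults.conf, set at runtime.
// Reading them from environment variables keeps secrets out of source control.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
```

Either way, the properties end up in the same Hadoop configuration that the S3A connector reads.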

2. Start Spark with the AWS SDK package
Add the aws-java-sdk package along with the hadoop-aws package to your spark-shell, as in the command below.

./spark-shell --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
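If you are building a standalone application instead of working in spark-shell, the same packages can be declared as build dependencies. A sketch of the equivalent build.sbt entries, using the versions from the command above (the Spark artifact and version are assumptions; match them to your cluster):

```scala
// build.sbt -- hadoop-aws and aws-java-sdk versions match the --packages flags above
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"    % "2.4.0" % Provided, // assumed Spark version
  "org.apache.hadoop"  % "hadoop-aws"   % "2.7.3",
  "com.amazonaws"      % "aws-java-sdk" % "1.7.4"
)
```

Note that hadoop-aws must match the Hadoop version your Spark build was compiled against, or you may hit class-compatibility errors at runtime.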

3. Now read data from S3.
Let's say the S3 bucket contains your Parquet data; to read it, do as below:

spark.read.parquet("S3 Bucket URL")

Example:

spark.read.parquet("s3a://AKIBJEKY6UIV6M32JXAQ:IWT3f8BjqUFTZlbVXx+3Tk7eSUUHLj6CIRLWSP5lz0@tham.omniture/wzhou/data/2019_9months/")
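Note that embedding the access and secret key inside the URL, as in the example above, is deprecated in recent Hadoop versions, since the secret can leak into logs. With the credentials already set in spark-defaults.conf (step 1), a plain bucket path is enough. A minimal sketch, with a hypothetical bucket name and prefix:

```scala
// Credentials come from the configuration in step 1, so the URL stays clean.
// "my-bucket" and the prefix are placeholders -- substitute your own.
val df = spark.read.parquet("s3a://my-bucket/data/2019_9months/")

df.printSchema() // inspect the columns read from S3
df.show(5)       // preview the first few rows
```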

This is how you can access data from S3 Bucket through Spark.

If you like this blog, please show your appreciation by hitting the like button and sharing it. Also, drop a comment with any feedback or suggested improvements. Till then, HAPPY LEARNING.

 

Written by 

Divyansh Jain is a Software Consultant with one year of experience. He has a deep understanding of Big Data technologies, Hadoop, Spark, and Tableau, as well as web development. He is an amazing team player with self-learning skills and a self-motivated professional. He has also worked as a freelance web developer. He loves to explore real-time problems and Big Data. In his leisure time, he prefers LAN gaming and watching movies.
