Businesses worldwide are discovering the power of new big data processing and analytics frameworks like Apache Hadoop and Apache Spark, but they are also discovering some of the challenges of operating these technologies in on-premises data lake environments. They may also have concerns about the future of their current distribution vendor.
Common problems of on-premises big data environments include a lack of agility, excessive costs, and administrative headaches, as IT organizations wrestle with the effort of provisioning resources, handling uneven workloads at large scale, and keeping up with the pace of rapidly changing, community-driven, open-source software innovation. Many big data initiatives suffer from the delay and burden of evaluating, selecting, purchasing, receiving, deploying, integrating, provisioning, patching, maintaining, upgrading, and supporting the underlying hardware and software infrastructure.
To address these common problems, AWS offers Amazon Elastic MapReduce (EMR) as a managed solution. To understand what EMR is and how it solves these problems, let's take a deep dive into it.
Amazon EMR allows us to use a managed Hadoop Framework to process massive amounts of data across scalable EC2 instances.
EMR also allows running other distributed frameworks such as Apache Spark, HBase, Presto & Flink.
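To make this concrete, here is a minimal sketch (not a definitive setup) of how such a cluster could be launched through the boto3 `run_job_flow` API. The cluster name, release label, and instance types below are illustrative assumptions, and the actual API call is shown commented out because it requires AWS credentials.

```python
# Build a minimal run_job_flow request for a managed EMR cluster.
# Names, release label, and instance types are hypothetical placeholders.

def build_cluster_request(name, release_label, applications, instance_count):
    """Build the request dict passed to boto3's emr.run_job_flow()."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,        # EMR release, e.g. "emr-6.10.0"
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceCount": instance_count,  # leader + worker nodes
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",      # default EMR service role
    }

request = build_cluster_request(
    "demo-cluster", "emr-6.10.0", ["Spark", "HBase", "Presto", "Flink"], 3
)

# With valid AWS credentials, the cluster would then be launched like this:
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# response = emr.run_job_flow(**request)
```

Note how the `Applications` list is where Spark, HBase, Presto, and Flink are requested alongside the base Hadoop framework.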
Now that we have a brief idea of what EMR is, let's discuss the EMR cluster.
EMR clusters are collections of Amazon EC2 instances. Each instance in a cluster is a node, and each node has a node type that determines the role it plays within the cluster. These node types are:
- Leader Node:
  - Manages the cluster by coordinating the distribution of jobs and tasks
  - Tracks the status & health of the cluster
  - Also known as the "Master Node"
- Worker Node:
  - Core Node: runs tasks & stores data in HDFS
  - Task Node: runs tasks but doesn't store data
  - Also known as the "Slave Node"
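The node types above map onto EMR "instance groups" in a cluster request. The sketch below shows one possible mapping; the instance types and counts are illustrative assumptions, not recommendations.

```python
# Sketch: how MASTER, CORE, and TASK roles could be expressed as the
# InstanceGroups list of an EMR cluster request.

def build_instance_groups(core_count, task_count):
    """One leader (MASTER), plus CORE nodes (HDFS) and TASK nodes (no HDFS)."""
    return [
        {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
        {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": core_count},
        {"InstanceRole": "TASK",   "InstanceType": "m5.xlarge", "InstanceCount": task_count},
    ]

groups = build_instance_groups(core_count=2, task_count=4)
# The leader coordinates; CORE nodes run tasks and store HDFS blocks;
# TASK nodes only run tasks, so they can be added or removed freely.
```

Because task nodes hold no HDFS data, they are the natural place to scale a cluster in and out.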
How & where does EMR store data?
EMR can store data in three ways:
a) Hadoop Distributed File System (HDFS)
b) EMR File System (EMRFS)
c) Local File System
So, let’s discuss these storage types in detail.
1) Hadoop Distributed File System (HDFS)
- HDFS is a distributed & scalable file system for Hadoop.
- HDFS distributes data it stores across multiple instances in the cluster.
- It creates replicas of data on different instances.
- Best suited for intermediate results, since data stored in HDFS is lost once the cluster is terminated.
2) EMR File System (EMRFS)
- EMRFS allows EMR to directly access data stored in Amazon S3.
- Used to store input and output data, since data persisted in S3 can be reused after the cluster terminates.
3) Local File System
- Here the locally attached disk of each EC2 instance is used to store data.
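In job configurations, these three storage layers are distinguished by the URI scheme of a path. The helper below is a small illustration of that convention (it is my own sketch, not part of EMR's API):

```python
# Map a job path's URI scheme to the EMR storage layer it refers to.
from urllib.parse import urlparse

STORAGE_LAYERS = {
    "hdfs": "HDFS - distributed, replicated, lost at cluster termination",
    "s3": "EMRFS - persistent input/output data in Amazon S3",
    "file": "Local file system - instance-attached disk",
}

def storage_layer(path):
    """Return which EMR storage layer a job path refers to."""
    scheme = urlparse(path).scheme or "file"  # bare paths default to local disk
    return STORAGE_LAYERS[scheme]

print(storage_layer("hdfs:///tmp/intermediate"))  # HDFS
print(storage_layer("s3://my-bucket/input/"))     # EMRFS
print(storage_layer("/mnt/scratch/data"))         # local file system
```

So a typical job might read its input from an `s3://` path, shuffle intermediate results through `hdfs://`, and write final output back to `s3://`.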
EMR supports a broad range of tools that can be installed on the cluster, and the bootstrapping process can be used to install your own tools.
Here is the list of tools which EMR Cluster supports:
1) Apache Zeppelin
2) Apache Hadoop
There are many more tools supported by the EMR cluster; the full list is available in the AWS EMR documentation.
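Bootstrap actions for installing your own tools are supplied as part of the cluster request. Here is a hedged sketch of one such entry; the script path and arguments are hypothetical placeholders.

```python
# Build one entry for the BootstrapActions list of an EMR cluster request.
# The S3 script path and its arguments below are hypothetical.

def bootstrap_action(name, script_s3_path, args=None):
    """Describe a script that every node runs before Hadoop starts."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,  # script must be accessible in S3
            "Args": args or [],
        },
    }

action = bootstrap_action(
    "install-custom-tool",
    "s3://my-bucket/bootstrap/install_tool.sh",  # hypothetical script
    ["--version", "1.2"],
)
```

Because bootstrap actions run on every node before applications start, they are the standard hook for installing packages that EMR does not ship.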
EMR offers several notable features:
- Cost Saving
  – Can use AWS infrastructure instead of buying physical hardware
  – Can easily use AWS Reserved Instances to save cost
- Ease of Deployment
  – Easy to deploy big data tools & frameworks
  – Possible to customize the EMR cluster as per your needs
- Security
  – Use IAM to configure AWS users, groups, roles & policies for the cluster
  – Encryption to protect stored data
  – EC2 key pairs can be set up for secure access to the cluster
- AWS Integration
  – Can use S3 for data storage
  – Use IAM policies for security & permissions
  – Amazon VPC can be used to configure networking for the cluster's EC2 instances
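As one example of the encryption feature, EMR accepts a security configuration document describing how stored data is encrypted. The sketch below builds such a document for S3 (EMRFS) at-rest encryption; the KMS key ARN is a hypothetical placeholder, and the actual API call is commented out since it needs credentials.

```python
# Build an EMR security configuration enabling at-rest encryption for
# EMRFS data in S3 via SSE-KMS. The key ARN below is a placeholder.
import json

def build_security_configuration(kms_key_arn):
    """Security configuration document for S3 at-rest encryption."""
    return {
        "EncryptionConfiguration": {
            "EnableAtRestEncryption": True,
            "EnableInTransitEncryption": False,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                }
            },
        }
    }

config_json = json.dumps(
    build_security_configuration("arn:aws:kms:us-east-1:111122223333:key/example")
)

# With valid AWS credentials, this would be registered like so:
# import boto3
# boto3.client("emr").create_security_configuration(
#     Name="demo-sec-config", SecurityConfiguration=config_json)
```

The registered configuration can then be referenced by name when launching clusters, so the same encryption policy applies to every cluster that uses it.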
So, this was all about EMR & its features. In our next blog, we will discuss more about Amazon EMR.
If you like this blog, please show your appreciation by hitting the like button and sharing it. Also, drop a comment about the post & any improvements needed. Till then, HAPPY LEARNING.