Build an Enterprise Data Lake with AWS Cloud


Data Lake

A data lake is a single, common place to store enterprise data, which data wranglers can then access for their analytical needs. A data lake differs from an ordinary database: a database stores the current, updated data for an application, whereas a data lake can store current and historical data from many different systems, in raw form, for analysis. The data an organisation preserves can be of any shape – structured, semi-structured or unstructured – and saved in any desired format, such as CSV, Apache Parquet, XML or JSON. Because there is no practical limit on its size, we usually need a mechanism to ingest it via batch or stream processing. Users of this data also expect the lake to be secure and well governed, so we need a data lake with proper security and access controls – independent of how the data is accessed.

Data Lake Benefits

  • Accessibility: data stored in one common place is available to everyone, based on privileges set by data custodians (who manage and own the data).
  • Store raw data at scale for a low cost.
  • Unlock data from different domains in just a few clicks.
  • Provide a leading industry experience to different data personas.
  • Surface the value of each dataset stored in the lake, giving the organisation a competitive edge.
  • Make the lake comprehensive, with search, filtering and navigation capabilities – effectively a Google for your organisation.

Now, to make this data lake accessible to users, we need a web-based application. A data catalog is one way to address this need: it acts as a persistent metadata store that facilitates data exploration across different data stores.

Data Lake (ELT Tool) vs. Data Warehouse (ETL Tool)

Let’s try to understand how a data lake differs from a data warehouse. ETL (Extract, Transform and Load) is what happens within a data warehouse; ELT (Extract, Load and Transform) is what happens within a data lake. A DWH (data warehouse) serves as an integration platform for data from different sources: it creates structured data during ETL that can be used for various analytical needs. A DL (data lake), by contrast, can preserve data in structured, semi-structured or unstructured form without a specific purpose in mind; that data gains value over time through gradual transformation and other analytical processes. The schema of the data is defined at the time of processing or reading in the lake, so data in a data lake is highly configurable and agile. Data lakes work well for real-time and big-data needs. Hence, when a business has drastically changing data needs it should build a data lake, whereas for slowly changing, structured data needs it can build a data warehouse.

Data Lake for Big Data

In this age of big data, where systems collect millions of rows per second, data in any format can be stored and used with a data lake. Another addition here is the Data Vault methodology and modelling: a governed data lake that addresses some of the limitations of a DWH. A vault provides durability and accelerates business value.

Deploying Data Lakes on Cloud

A data lake is considered an ideal workload to deploy in the cloud for scalability, reliability, availability, performance and analytics. Users see the cloud as a way to gain better security, faster deployment times, elasticity, a pay-as-you-use model, and wider coverage across different geographies.

Build a Data Lake with AWS Cloud

Now let’s discuss the final part of this discussion – how we can build a data lake in the cloud using different AWS services.

Data Collection: Collect and extract data from different sources and formats – flat files, APIs, SQL or NoSQL databases, or cloud storage such as S3.
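For flat-file sources, the extraction step can be as simple as parsing raw CSV text into records. A minimal sketch (the function name and sample fields are illustrative, not from any particular AWS API):

```python
import csv
import io

def extract_csv_records(raw_text):
    """Parse raw CSV text (e.g. a flat-file export) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(raw_text)))

sample = "id,name\n1,alice\n2,bob\n"
records = extract_csv_records(sample)
# records[0] == {"id": "1", "name": "alice"}
```

The same records-as-dicts shape works whether the source was a file, an API response, or a database cursor, which keeps the downstream load step uniform.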

Data Load: Load this raw, unprocessed data into an AWS S3 bucket for storage. This bucket acts as the landing bucket.
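A common convention when landing raw data in S3 is to date-partition the object keys so later jobs can process one day at a time. A sketch of such a key builder, with the actual upload shown via boto3 (the bucket and source names are placeholders, and the upload needs real AWS credentials):

```python
from datetime import datetime, timezone

def landing_key(source, filename, now=None):
    """Build a date-partitioned S3 key, e.g. raw/crm/2024/01/15/orders.csv."""
    now = now or datetime.now(timezone.utc)
    return f"raw/{source}/{now:%Y/%m/%d}/{filename}"

# Uploading with boto3 (sketch only; requires AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# s3.upload_file("orders.csv", "my-landing-bucket",
#                landing_key("crm", "orders.csv"))
```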

Data Transformation: Then use an ETL tool such as AWS Glue for data processing and transformations.
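In a real Glue job the transformation would run over a DynamicFrame or a Spark DataFrame; the record-level logic, though, can be sketched in plain Python (field names like `amount` are illustrative assumptions):

```python
def transform_record(rec):
    """Example cleanup step: trim whitespace, lowercase the keys,
    and cast the (hypothetical) amount field to a float."""
    out = {k.strip().lower(): (v.strip() if isinstance(v, str) else v)
           for k, v in rec.items()}
    if "amount" in out:
        out["amount"] = float(out["amount"])
    return out

raw = {" Name ": " Alice ", "Amount": "42.50"}
clean = transform_record(raw)
# clean == {"name": "Alice", "amount": 42.5}
```

The same function could be applied per-row inside a Glue Python shell job or mapped over a Spark RDD.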

Data Governance: We can further enable security settings and access controls on this data to ensure governance on top of the transformed, processed data. A data catalog can be built to store metadata and support exploration across the different data stores.
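The essence of a data catalog is a lookup from (database, table) to location, format and schema – which is what the AWS Glue Data Catalog stores. A toy in-memory sketch of the idea (class and field names are my own, not Glue's API):

```python
class DataCatalog:
    """Toy metadata store mirroring the idea of a data catalog:
    each table entry records location, format and schema for discovery."""

    def __init__(self):
        self._tables = {}

    def register(self, database, table, location, fmt, columns):
        self._tables[(database, table)] = {
            "location": location, "format": fmt, "columns": columns}

    def lookup(self, database, table):
        return self._tables[(database, table)]

catalog = DataCatalog()
catalog.register("sales_db", "orders", "s3://curated-bucket/orders/",
                 "parquet", {"order_id": "string", "amount": "double"})
entry = catalog.lookup("sales_db", "orders")
# entry["format"] == "parquet"
```

In practice Glue crawlers populate these entries automatically, and services like Athena and Redshift Spectrum read them to resolve table schemas.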

Data Curation: We can curate this processed data into another target S3 bucket or into AWS Redshift (as a DWH).
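Loading curated files from S3 into Redshift is typically done with the Redshift COPY command. A sketch that composes such a statement (the table, bucket and IAM role are placeholders):

```python
def redshift_copy_sql(table, s3_path, iam_role):
    """Compose a Redshift COPY command to load curated Parquet files
    from S3. All identifiers here are placeholder values."""
    return (f"COPY {table} FROM '{s3_path}' "
            f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;")

sql = redshift_copy_sql(
    "analytics.orders",
    "s3://curated-bucket/orders/",
    "arn:aws:iam::123456789012:role/RedshiftLoad")
```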

Data Notification & Monitoring: AWS SNS can provide notifications and alerting for the various jobs, while AWS CloudWatch can be used for monitoring and logging.
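A job alert is usually just a structured payload published to an SNS topic. A sketch of building that payload, with the publish call shown via boto3 (job name, topic ARN and fields are illustrative):

```python
import json

def job_alert_message(job_name, status, detail=""):
    """Serialise an alert payload for a pipeline job; the field
    names are illustrative, not a fixed SNS schema."""
    return json.dumps({"job": job_name, "status": status, "detail": detail})

msg = job_alert_message("nightly-glue-etl", "FAILED", "schema mismatch")

# Publishing (sketch only; needs AWS credentials and a real topic ARN):
# import boto3
# boto3.client("sns").publish(
#     TopicArn="arn:aws:sns:eu-west-1:123456789012:etl-alerts", Message=msg)
```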

Data Analytics: From the second S3 bucket, or from Redshift, where the transformed data was curated, we can query and analyse data for various business requirements via AWS Athena and QuickSight. Data scientists can also use this data for building and training various ML models.
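Athena queries are submitted through the `start_query_execution` API, which takes the SQL, the catalog database, and an S3 location for results. A sketch that assembles those parameters (database, query and bucket names are placeholders):

```python
def athena_query_params(sql, database, output_s3):
    """Assemble keyword arguments for boto3 Athena's
    start_query_execution call; names here are placeholders."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = athena_query_params(
    "SELECT order_id, amount FROM orders LIMIT 10",
    "sales_db",
    "s3://athena-results-bucket/")

# Submitting (sketch only; needs AWS credentials):
# import boto3
# boto3.client("athena").start_query_execution(**params)
```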

Written by 

Karuna Puri is a Tech Lead at Knoldus Inc. with 8+ years of experience. She is a backend developer with expertise in the functional programming language Scala, and is also well versed in the cloud with AWS. She is a tech enthusiast with interests in other domains, including data mining and machine learning.