Over the past few years the data lake has gained a lot of momentum, and almost every large organization wants to build one. Most people think of a data lake as a “cheap way to store/manage data” that will help them reduce cost. But a data lake is much more about making money than saving money. A data lake is a powerful tool where you can store all of your organization’s data and then run analytics on it. Just think about the power of having the data in a single repository where you can aggregate, slice and dice data and run analytics.
The data lake enables the data science team to build the predictive and prescriptive analytics necessary to support the organization’s different business use cases and key business initiatives.
The key challenge is that nobody has clarity on how to embark on the path of creating a data lake, and in this blog we have tried to identify the different phases of this journey.
The diagram above shows the major phases involved in creating a data lake and managing it efficiently.
Discover: This is the most important phase of the journey. In this phase we look into all the data that flows through our applications, i.e. all the data needs of the organization, irrespective of where the data is located. We need to identify the data sources, build the data flow diagrams, define schemas where necessary, and create a strategy for how the data will be consumed by the data lake.
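One way to make the discover phase concrete is to keep a simple machine-readable inventory of sources as you find them. The sketch below is a minimal, illustrative registry; the source names, schemas and ingestion modes are assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field

@dataclass
class DataSource:
    name: str            # e.g. an application database or event stream
    kind: str            # "database", "stream", "files", ...
    ingestion: str       # "batch" or "streaming"
    schema: dict = field(default_factory=dict)  # column -> type, if known up front

# Hypothetical sources discovered across the organization
sources = [
    DataSource("orders_db", "database", "batch",
               {"order_id": "string", "amount": "decimal", "ts": "timestamp"}),
    DataSource("clickstream", "stream", "streaming"),  # schema defined on read
]

# The inventory immediately answers planning questions,
# e.g. which feeds need a streaming ingestion path:
streaming = [s.name for s in sources if s.ingestion == "streaming"]
```

Even a small registry like this doubles as the starting point for the data flow diagrams and the ingestion strategy mentioned above.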
Setup: In this phase we set up the infrastructure for the data lake. We calculate the size of the existing data in our systems and extrapolate how the data will grow over time. Object storage such as S3 or HDFS is the recommended storage for a data lake. The data lake infrastructure should be kept completely separate from your regular IT and application infrastructure. The key considerations are:
- Scalability (how easily the storage can grow)
- Support for a variety of data (structured, semi-structured and unstructured)
- Ease of data ingestion
- Separation of the storage and analytics layers
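Sizing the storage during setup can be as simple as compounding the current footprint by an assumed growth rate. The function below is a rough sketch; the 50 TB starting point and 5% monthly growth are made-up inputs you would replace with your own measurements.

```python
def projected_storage_tb(current_tb: float, monthly_growth_rate: float,
                         months: int) -> float:
    """Compound the current data footprint forward to estimate capacity."""
    return current_tb * (1 + monthly_growth_rate) ** months

# e.g. 50 TB today, growing ~5% per month, planned for two years
capacity = projected_storage_tb(50.0, 0.05, 24)  # roughly 161 TB
```

Because object stores like S3 scale elastically, this estimate is more useful for budgeting and partition planning than for provisioning hard limits.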
Migrate: Data lakes allow you to import any amount of data, including data that arrives in real time. Data is collected from multiple sources and moved into the data lake in its original format. This lets you scale to data of any size while saving the upfront work of defining data structures, schemas and transformations. In the migrate phase we transfer the existing data from other systems to the data lake and establish protocols for ingesting data continuously.
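A common convention when landing data in its original format is a date-partitioned "raw" zone in object storage. The key layout below (`raw/<source>/<yyyy>/<mm>/<dd>/<file>`) is one widely used pattern, shown here as an assumption rather than a standard:

```python
from datetime import datetime, timezone

def raw_object_key(source: str, filename: str, ts: datetime) -> str:
    """Place an incoming file, unmodified, under a date-partitioned raw zone."""
    return f"raw/{source}/{ts:%Y/%m/%d}/{filename}"

key = raw_object_key("orders_db", "orders_0001.csv",
                     datetime(2020, 3, 1, tzinfo=timezone.utc))
# key == "raw/orders_db/2020/03/01/orders_0001.csv"
```

Keeping the original bytes under predictable keys means both batch backfills and continuous feeds can write to the same layout, and later analytics jobs can prune by date without a catalog lookup.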
Governance: Just dumping raw data will not serve our use cases, so we need a defined data management process that the organization follows to ensure high-quality data is available at the right time, throughout the data’s full life cycle. A governed data lake contains clean, relevant data from structured and unstructured sources that can easily be found, accessed, managed and protected. The platform the data resides on is secure and reliable, and data entering the lake is cleaned, classified and protected in timely, controlled feeds that populate and document it with reliable information assets and metadata.
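In practice "findable, managed and protected" starts with attaching metadata to every dataset as it lands. The sketch below shows a minimal catalog entry; the field names and classification levels are illustrative assumptions (real deployments would use a metadata service such as a data catalog, not a dict):

```python
def catalog_entry(dataset: str, owner: str, classification: str,
                  source: str) -> dict:
    """Record the minimum metadata a governed lake needs:
    who owns a dataset, how sensitive it is, and where it came from."""
    allowed = {"public", "internal", "confidential"}
    if classification not in allowed:
        raise ValueError(f"classification must be one of {allowed}")
    return {"dataset": dataset, "owner": owner,
            "classification": classification, "source": source}

entry = catalog_entry("orders", "finance-team", "confidential", "orders_db")
```

With even this much metadata, access controls can key off `classification`, and lineage questions ("where did this come from?") have an answer.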
Monitor: Monitoring the data lake is a necessity to make sure everything is working correctly and to detect failures early, so that our initiatives don’t suffer setbacks. Monitoring covers the pipelines that move data into and out of the lake, the health of the storage system, anomaly detection, and preventive maintenance.
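One simple form of the anomaly detection mentioned above is checking each ingestion run's row count against recent history. This is a minimal sketch, assuming row counts per run are already being recorded somewhere; the three-standard-deviation threshold is a common but arbitrary default:

```python
from statistics import mean, stdev

def is_anomalous(history: list, latest: int, threshold: float = 3.0) -> bool:
    """Flag a run whose row count deviates from the recent mean
    by more than `threshold` standard deviations."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) > threshold * sigma

history = [10_000, 10_200, 9_900, 10_100, 10_050]
is_anomalous(history, 10_080)  # a normal run
is_anomalous(history, 500)     # flagged: likely a broken upstream feed
```

Catching a near-empty feed like this on the day it breaks, rather than weeks later in a dashboard, is exactly the kind of early detection that keeps downstream initiatives on track.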
Analytics: Analytics is the reason the data lake exists. It allows the various users in our system, such as data scientists, data developers and business analysts, to access data with their choice of analytic tools and frameworks. We can use open source frameworks such as Apache Hadoop, Presto and Apache Spark, as well as commercial offerings from data warehouse and business intelligence vendors. Data lakes let you run analytics without moving your data to a separate analytics system. We can generate different types of insights, from reporting on historical data to machine learning, where models forecast likely outcomes and suggest prescribed actions to achieve the optimal result.
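The aggregate/slice pattern these engines run looks the same regardless of scale. The sketch below uses an in-memory sqlite3 database purely as a stand-in: in a real lake the identical SQL would run on Presto or Spark SQL directly over files in S3/HDFS, without moving the data. The table and values are made up for illustration.

```python
import sqlite3

# Toy stand-in for a lake query engine
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("eu", 10.0), ("eu", 20.0), ("us", 5.0)])

# Aggregate and slice: revenue by region
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
# rows == [("eu", 30.0), ("us", 5.0)]
```

Because the storage and analytics layers are separate (a key consideration from the setup phase), the same raw files can serve this SQL reporting, a Spark ML pipeline, and a BI dashboard concurrently.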
These are the phases involved in building and operating a successful data lake. In the next blog I will discuss the architecture of a data lake and how to build it using open source tools.