A data lake is a large storage repository that holds vast amounts of raw data in its native format until it is needed. The data in a lake typically spans structured and unstructured content, along with binary objects such as images, blog posts, and videos. An “enterprise data lake” (EDL) is simply a data lake for enterprise-wide information storage and sharing.
The major stages of a data lake strategy can be broken down as follows:
(I) Data Collection
The main benefit of a data lake is the centralization of disparate content sources. Once gathered from their “information silos,” these sources can be combined and processed using big data, search, and analytics techniques that would otherwise have been impossible.
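As a rough illustration of this collection step, the sketch below normalizes records from two differently shaped silos into a single raw envelope before they land in the lake. The `Source`, `RawRecord`, and silo names are hypothetical, not a real ingestion API:

```scala
// Hypothetical sketch: collapsing disparate silos into one raw format.
case class RawRecord(source: String, id: String, payload: String)

trait Source {
  def name: String
  def fetch(): Seq[(String, String)] // (id, payload) pairs
}

// Two example silos with different content, reduced to the same envelope.
object CrmSilo extends Source {
  val name = "crm"
  def fetch(): Seq[(String, String)] = Seq(("c-1", """{"customer":"Acme"}"""))
}

object DocSilo extends Source {
  val name = "docs"
  def fetch(): Seq[(String, String)] = Seq(("d-9", "Quarterly report text"))
}

// Gather every silo's records, tagging each with its origin.
def collect(sources: Seq[Source]): Seq[RawRecord] =
  sources.flatMap(s => s.fetch().map { case (id, p) => RawRecord(s.name, id, p) })
```

Keeping the payload opaque at this stage mirrors the “native format until needed” idea: no silo-specific parsing happens before the data reaches the lake.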
(II) Data Preparation/Enrichment
Once the content is in the data lake, it can be normalized and enriched. This can include metadata extraction, format conversion, augmentation, entity extraction, cross-linking, aggregation, de-normalization, or indexing. Data is prepared “as needed,” reducing preparation costs over up-front processing (such as would be required by data warehouses). A big data compute fabric makes it possible to scale this processing to include the largest possible enterprise-wide data sets.
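The enrichment steps above (metadata extraction, entity extraction) can be sketched with a toy example. The keyword-matching “entity extractor” here is a deliberately simple stand-in for a real NLP pipeline, and the entity list is invented:

```scala
// Hypothetical "as needed" enrichment: attach extracted metadata to a raw doc.
case class Doc(id: String, text: String)
case class EnrichedDoc(id: String, text: String, entities: Set[String], wordCount: Int)

// Toy dictionary standing in for a real entity-recognition model.
val knownEntities = Set("aspirin", "ibuprofen", "insulin")

def enrich(doc: Doc): EnrichedDoc = {
  val words = doc.text.toLowerCase.split("\\W+").filter(_.nonEmpty)
  EnrichedDoc(doc.id, doc.text, words.toSet.intersect(knownEntities), words.length)
}
```

Because enrichment is a pure function over a raw document, it can be deferred and run lazily on just the subsets a consumer asks for, which is where the cost saving over up-front warehouse-style processing comes from.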
(III) Security, Governance and Access
Once all the data is gathered and enriched, users from different departments, potentially scattered around the globe, can have flexible access to the data lake and its content from anywhere. This increases the re-use of the content and helps the organization more easily collect the data required to drive business decisions.
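One simple way to picture data-level security in this setting: each record carries a classification label, and a user's roles determine which records a query may return. This is a minimal sketch with invented names, not a description of HDFS or Cassandra access control:

```scala
// Hypothetical data-level security: records are labeled, users hold roles,
// and a query only sees records whose label matches one of the user's roles.
case class Record(id: String, label: String, body: String)
case class User(name: String, roles: Set[String])

def visible(user: User, records: Seq[Record]): Seq[Record] =
  records.filter(r => user.roles.contains(r.label))
```

In a real lake this filtering happens in the storage or query layer, so that enforcement cannot be bypassed by individual applications.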
Such information drives innovation. A data lake puts enterprise-wide information into the hands of many more employees, making the organization more agile and giving people the tools to innovate faster.
(IV) Working with the Data Lake (Search)
The final step in the data lake cycle is actually working with the data. In this environment, search is a necessary tool:
- To find structured data as it is stored
- To retrieve enriched data through query languages such as GraphQL
- To work with unstructured data sets
- To handle analytics at scale
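A minimal sketch of the search layer over enriched documents is an inverted index mapping each term to the documents that contain it. A production lake would use a dedicated engine such as Elasticsearch; the names below are illustrative:

```scala
// Toy inverted index: term -> set of document ids containing it.
case class Indexed(id: String, text: String)

def buildIndex(docs: Seq[Indexed]): Map[String, Set[String]] =
  docs
    .flatMap(d => d.text.toLowerCase.split("\\W+").filter(_.nonEmpty).map(_ -> d.id))
    .groupBy(_._1)
    .map { case (term, pairs) => term -> pairs.map(_._2).toSet }

def search(index: Map[String, Set[String]], term: String): Set[String] =
  index.getOrElse(term.toLowerCase, Set.empty)
```

The same structure underlies full-text engines, which add ranking, tokenization, and distribution on top of this basic term-to-document mapping.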
At Knoldus, we work with many enterprise clients on their data lake design, architecture, and strategy, so we are frequently involved in building the roadmap for a data lake and its use in building revenue-generating products.
Case Study: Major Healthcare Client
For one of our major healthcare clients, we collected data not only from structured and unstructured sources but also from ontologies and triple data exposed by major scientific systems.
This data was then enriched with metadata, using machine learning and AI to detect patterns, and the result was used to build a graph data lake.
The resulting data lake was then made available through search to scientists across the organisation. These scientists would feed information back into the enriched data lake, making it a living lake. Analytics were run on the graph to power various data products used by end clients.
These data products also led to the ecosystem becoming a platform to which other third-party products could connect via APIs.
Our choice of tooling included:
- Data Ingestion – Scala, Rust, Spark pipelines, HDFS
- Data Enrichment – data cleansing with Spark and Akka; pattern recognition with neural networks (feedforward networks, convolutional networks, or MLPs) and tree-based models (gradient boosting such as XGBoost, or random forests)
- Data Security and Governance – access control, network isolation, and data-level security using HDFS, Cassandra, and in-memory Apache Ignite
- Search – SQL queries for in-memory searches, Elasticsearch for quick retrieval, and GraphQL
If your organization is looking to build or improve its data lake strategy for digital transformation, contact us to learn how our architects can help you build your roadmap.