Digital Transformation – Getting your Data Lake ready

Reading Time: 3 minutes

A data lake is a large storage repository that holds a vast amount of raw data in its native format until it is needed. The data in a lake usually consists of structured, unstructured, and object data such as pictures, blog posts, and videos. An “enterprise data lake” (EDL) is simply a data lake for enterprise-wide information storage and sharing.

The major stages of a data lake strategy can be broken down as follows.

Typical stages of a Data Lake

(I) Data collection

The main benefit of a data lake is the centralization of disparate content sources. Once gathered together (from their “information silos”), these sources can be combined and processed using big data, search, and analytics techniques that would otherwise have been impossible.
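As an illustration, the centralization step can be sketched in Scala. The connectors, source names, and record shape below are hypothetical, not a real Knoldus API: each silo gets a reader that emits records in a common raw envelope, and the silos are then unioned into one collection bound for the lake.

```scala
// Hypothetical sketch: each silo gets a reader that emits records in a
// common raw "envelope", preserving the data's native format.
case class RawRecord(source: String, format: String, payload: String)

// Stand-ins for real connectors to a CRM system and a blog platform.
def fromCrm(): Seq[RawRecord] =
  Seq(RawRecord("crm", "json", """{"customer":"acme"}"""))

def fromBlog(): Seq[RawRecord] =
  Seq(RawRecord("blog", "html", "<p>Launch announcement</p>"))

// Centralization: union all silos into one collection bound for the lake.
val lake: Seq[RawRecord] = fromCrm() ++ fromBlog()
```

Note that no parsing happens at this stage; the payload is kept as-is so that later enrichment can decide how to process each format.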

(II) Data Preparation/Enrichment

Once the content is in the data lake, it can be normalized and enriched. This can include metadata extraction, format conversion, augmentation, entity extraction, cross-linking, aggregation, de-normalization, or indexing. Data is prepared “as needed,” reducing preparation costs over up-front processing (such as would be required by data warehouses). A big data compute fabric makes it possible to scale this processing to include the largest possible enterprise-wide data sets.
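A minimal Scala sketch of the enrichment step, with a toy entity dictionary standing in for a real NLP model (the `Doc`/`Enriched` types and `knownEntities` are illustrative assumptions, not a production design): it extracts simple metadata (a word count) and matches known entities in the text.

```scala
// Illustrative types; a real pipeline would use NLP models and schema tools.
case class Doc(id: String, text: String)
case class Enriched(id: String, text: String, wordCount: Int, entities: Set[String])

// Toy entity dictionary standing in for a trained entity extractor.
val knownEntities = Set("Spark", "Scala", "HDFS")

def enrich(d: Doc): Enriched =
  Enriched(
    d.id,
    d.text,
    wordCount = d.text.split("\\s+").count(_.nonEmpty),
    entities  = knownEntities.filter(d.text.contains)
  )
```

Because `enrich` is a pure function over a single document, it is the kind of step that scales naturally on a big data compute fabric, "as needed" rather than up front.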

(III) Security, Governance and Access

Once all the data is present together and enriched, users from different departments, potentially scattered around the globe, can have flexible access to the data lake and its content from anywhere. This increases the re-use of the content and helps the organization more easily collect the data required to drive business decisions.
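Access rules of this kind can be sketched as a simple role-based check in Scala. The roles and the `restricted` flag below are invented for illustration; real governance layers (HDFS permissions, network isolation, and so on) are far richer.

```scala
// Invented roles and dataset flag, for illustration only.
sealed trait Role
case object Analyst extends Role
case object Admin extends Role

case class Dataset(name: String, restricted: Boolean)

// Rule: restricted datasets are visible to admins only.
def canRead(role: Role, ds: Dataset): Boolean = role match {
  case Admin   => true
  case Analyst => !ds.restricted
}
```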

Such information drives innovation. A data lake puts enterprise-wide information into the hands of many more employees, making the organization more agile and giving them the tools to innovate faster.

(IV) Working with the Data Lake (Search)

The final step in the data lake cycle is actually working with the data. In this environment, search is a necessary tool:

  • To find the structured data as stored
  • To extract enriched data via query interfaces such as GraphQL
  • To work with unstructured data sets
  • To handle analytics at scale
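To make the search idea concrete, here is a minimal inverted-index sketch in Scala, a stand-in for a real engine such as Elasticsearch (the corpus is toy data):

```scala
// Toy corpus standing in for documents stored in the lake.
val docs = Map(
  "d1" -> "patient trial results",
  "d2" -> "trial protocol draft"
)

// Build an inverted index: term -> ids of the documents containing it.
val index: Map[String, Set[String]] =
  docs.toSeq
    .flatMap { case (id, text) => text.split("\\s+").map(term => (term, id)) }
    .groupBy(_._1)
    .map { case (term, pairs) => term -> pairs.map(_._2).toSet }

def search(term: String): Set[String] = index.getOrElse(term, Set.empty)
```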

At Knoldus, since we work with quite a few enterprise clients on their data lake design, architecture, and strategy, we are frequently involved in building the roadmap for the data lake and its use in building various revenue-generating products.

Major Health care client case study

Enriched Data Lake for a major health care client

For one of our major healthcare clients, we collected data not only from structured and unstructured data sources but also from ontologies and triple information from major scientific systems.

This data was then enriched with meta information using machine learning and AI to detect patterns, and a graph data lake was built on top of it.
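One simple way to picture how enriched records become a graph is to link records that share detected entities. The sketch below is purely illustrative; the `Node` type and the shared-entity linking rule are assumptions, not the client's actual model.

```scala
// Each node is an enriched record with the entities detected in it.
case class Node(id: String, entities: Set[String])

// Link two records when they share at least one entity; a.id < b.id
// keeps each undirected edge unique.
def edges(nodes: Seq[Node]): Set[(String, String)] =
  (for {
    a <- nodes
    b <- nodes
    if a.id < b.id && (a.entities intersect b.entities).nonEmpty
  } yield (a.id, b.id)).toSet
```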

The resultant data lake was then made searchable by scientists across the organisation. These scientists would feed information back into the enriched data lake, making it a living lake. Analytics were run on the graph to enable various data products used by the end clients.

These data products also led to building the ecosystem as a platform to which other third-party products could connect via API platforms.

Our choice of tooling included:

  1. Data ingestion – Scala, Rust, Spark pipelines, HDFS
  2. Data enrichment – cleansing of data with Spark and Akka; pattern recognition with neural networks (feedforward nets, convolutional nets, or MLPs would do) and decision trees (gradient boosting, XGBoost, or random forests)
  3. Data security and governance – access control, network isolation, and data-level security using HDFS, Cassandra, and in-memory Ignite
  4. Search – SQL queries for in-memory searches, Elasticsearch for quick retrieval, GraphQL

If your organization is looking to build or improve its data lake strategy for digital transformation, contact us to learn how our architects can help you build your roadmap.

Written by 

Vikas is the CEO and Co-Founder of Knoldus Inc. Knoldus does niche Reactive and Big Data product development on Scala, Spark, and Functional Java. Knoldus has a strong focus on software craftsmanship which ensures high-quality software development. It partners with the best in the industry like Lightbend (Scala Ecosystem), Databricks (Spark Ecosystem), Confluent (Kafka) and Datastax (Cassandra). Vikas has been working in the cutting edge tech industry for 20+ years. He was an ardent fan of Java with multiple high load enterprise systems to boast of till he met Scala. His current passions include utilizing the power of Scala, Akka and Play to make Reactive and Big Data systems for niche startups and enterprises who would like to change the way software is developed. To know more, send a mail to or visit