Intended Audience: Senior Managers & CTOs
“The descendants of Noah were living in the area of Mesopotamia in Babylon. They settled in a land named Shinar. The population was growing and they all spoke one language. The people decided to build a tall, proud symbol of how great they had made their nation. The Babylonians wanted a tower that would “reach to the heavens” so that they could be like God and that they would not need Him.
God did not like the pride and arrogance in the hearts of the people. God caused the people to suddenly speak different languages so they could not communicate and work together to build the tower. This caused the people to scatter across the land. The tower was named The Tower of Babel because the word Babel means confusion.”
The world of the internet is no different. Most content on the internet lives in HTML, which humans can read but machines cannot readily interpret. Anyone who has worked on web scraping understands this very well. Databases are well understood by machines, but only within the purview of their defined schemas, which are too narrowly (locally) defined. As a result, integrating knowledge from various domains has remained an unsolved problem.
Benefits of solving the problem:
The benefits of connecting knowledge are immense. Imagine you are a scientist researching cancer. The challenge in cancer research is that it requires knowledge of cell biology, pharmaceutical chemistry, medicinal chemistry, genetics, radiology and even computer science. For example, imagine you need to answer the question, “Which compound has a boiling point greater than 10 degrees, is similar in structure to thiocyanate phosphine, and has no side effects in people with the BRCA gene?” This question requires querying 3 different domains. Unfortunately, this knowledge at best exists in separate knowledge stores, which need humans to communicate and sift through them.
Are today’s data lakes not solving this problem already?
Data lakes are a meek effort to solve this problem of combining knowledge. At best, data lakes reduce the physical distance between pieces of knowledge, but do nothing to connect them. It is the same as bringing a Chinese speaker and a Russian speaker together and providing them with 2 translators who each translate into English.
There is one solution on the horizon for this problem. Tim Berners-Lee, inventor of the World Wide Web, proposed the “Semantic Web”. The idea is simple: represent knowledge as structured ‘Triples’.
Let’s take an entity (equivalent to a Java class) called Human. Humans are born to humans, are either male or female, have a name and live at a location. We can represent this knowledge as the following ‘triples’.
- Human ChildOf Human
- Male isA Sex
- Female isA Sex
- Human hasType Sex
- Human livesAt Address
- Address isA Place
- Human hasName Name
In the object-oriented world, you can think of these as a way to define a class hierarchy. Once the hierarchy is defined, we can define objects.
- Ram isTypeof Human
- Ram name “Ram Indukuri” String
- Address1 label “56 N Averry Ct, Palatine, IL USA” String
- Ram livesAt Address1
- Ram typeOf Male
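The triples above can be sketched in a few lines of Python. This is only an illustration, not how a real triple store is implemented: the `Triple` type, the `TRIPLES` list and the `match` helper are hypothetical names introduced here, and the predicate names are copied from the examples above.

```python
from typing import Iterator, List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Class-level statements (the "schema") and object-level statements (the data)
# sit side by side in the same collection.
TRIPLES: List[Triple] = [
    Triple("Male", "isA", "Sex"),
    Triple("Female", "isA", "Sex"),
    Triple("Human", "livesAt", "Address"),
    Triple("Ram", "isTypeof", "Human"),
    Triple("Ram", "name", "Ram Indukuri"),
    Triple("Address1", "label", "56 N Averry Ct, Palatine, IL USA"),
    Triple("Ram", "livesAt", "Address1"),
    Triple("Ram", "typeOf", "Male"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the given fields; None is a wildcard."""
    for t in triples:
        if subject is not None and t.subject != subject:
            continue
        if predicate is not None and t.predicate != predicate:
            continue
        if obj is not None and t.obj != obj:
            continue
        yield t


# Where does Ram live?
ram_addresses = [t.obj for t in match(TRIPLES, subject="Ram", predicate="livesAt")]
```

Every question about the data becomes a pattern over subject, predicate and object, which is the same idea a query language for triples builds on.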
Now, let’s say we discovered a new property of humans: humans have a hair color. We can describe the class and new objects as follows.
- HairColor isA Color
- Human hasA HairColor
- Black isA Color
- Chris hairColor Black
So, what is so special about it?
- All information is defined with a ‘Triple’ that contains a “Subject”, a “Predicate” and an “Object”.
- The classes and objects are stored physically as text.
- At the most basic level, the physical structure required to store this data is a table with 3 columns.
- Creating new classes does not require changing the database structure. In other words, the schema of the data and the data itself are stored in the same store.
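The last two points can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The table name `triples`, its column names and the `add` helper are assumptions made for this example; production triple stores use their own storage layouts.

```python
import sqlite3

# One table with three text columns is enough to hold both schema and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")


def add(subject: str, predicate: str, obj: str) -> None:
    """Insert one triple as one row."""
    conn.execute("INSERT INTO triples VALUES (?, ?, ?)", (subject, predicate, obj))


# Existing knowledge.
add("Human", "hasName", "Name")
add("Ram", "isTypeof", "Human")
add("Ram", "name", "Ram Indukuri")

# A newly discovered property: only rows are added, no ALTER TABLE is needed.
add("HairColor", "isA", "Color")
add("Human", "hasA", "HairColor")
add("Black", "isA", "Color")
add("Chris", "hairColor", "Black")

# The physical structure is unchanged after introducing the new class.
columns = [row[1] for row in conn.execute("PRAGMA table_info(triples)")]

hair = conn.execute(
    "SELECT object FROM triples WHERE subject = ? AND predicate = ?",
    ("Chris", "hairColor"),
).fetchone()[0]
```

Here the new class `HairColor` arrives as ordinary rows, so the table definition never has to change.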
Storing schema and data together as triples makes it remarkably easy to join two different knowledge bases. For example, suppose humans are defined in Domain A (Social) with attributes like friends, enemies, likes and dislikes, and in Domain B (Work) with attributes like skills, level, company and earnings. All it takes to connect them is a new triple stating that a person in Domain A is the same as a person in Domain B.
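A minimal sketch of that linking step follows. All identifiers and facts in it (`social:ram`, `work:emp42`, the `sameAs` predicate name, and the helper functions) are made up for illustration and do not come from any real knowledge base.

```python
# Two independently maintained knowledge bases (hypothetical sample data).
SOCIAL = [
    ("social:ram", "friendOf", "social:chris"),
    ("social:ram", "likes", "Cricket"),
]
WORK = [
    ("work:emp42", "hasSkill", "Scala"),
    ("work:emp42", "worksAt", "CompanyX"),
]

# The single triple that connects the two domains.
LINKS = [("social:ram", "sameAs", "work:emp42")]


def aliases(entity, triples):
    """Return the entity plus everything linked to it by sameAs, in either direction."""
    known = {entity}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p != "sameAs":
                continue
            if s in known and o not in known:
                known.add(o)
                changed = True
            elif o in known and s not in known:
                known.add(s)
                changed = True
    return known


def describe(entity, triples):
    """Collect (predicate, object) facts stated about any alias of the entity."""
    names = aliases(entity, triples)
    return sorted((p, o) for s, p, o in triples if s in names and p != "sameAs")


unlinked = describe("social:ram", SOCIAL + WORK)
linked = describe("social:ram", SOCIAL + WORK + LINKS)
```

Without the link, `describe` sees only the social facts; with it, the work facts appear in the same profile even though neither source was modified.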
Describing knowledge as linkable triples is the concept behind the “Semantic Web”. Its roots were laid around 400 to 300 BC by Socrates and Plato. They created the first classification of knowledge into ideas and things, wherein ideas (modern-day classes) are abstract and things (objects) are tangible and visible. Since then, many scientists and philosophers have contributed to “Ontologies”, which are essentially formally defined sets of knowledge.
Why should enterprises be interested?
Semantic Web concepts can be applied within enterprises to build a Knowledge Graph (instead of a data lake) that brings multiple domains of knowledge together into one data store, which:
- represents every piece of knowledge in that one table
- lets you add new attributes and entities without changing the database schema
- lets you traverse knowledge like a graph database
- absorbs structured and unstructured data
- brings together knowledge in different domains with a few triples that define relationships between domains
One question may pop up: how is this different from a graph database? Database technologies, including many graph databases, typically expect the schema to be defined upfront, before data is ingested; they differ mainly in the query patterns they serve once structured data is stored. Semantic Web technologies store data as text strings in a triple structure, and the schema/metadata is provided within the data itself, so no schema changes (DDLs) are required when new classes or attributes are detected.
The number of publicly available semantic data stores is growing rapidly, making more intelligence available, but it remains unused by enterprises due to a lack of the right tools and infrastructure. Enterprise knowledge graphs, if built and used, can change things dramatically. Google, along with several other tech giants, is actively encouraging and pushing Semantic Web concepts.
There are several challenges and limitations in building a Semantic Web based application. For example, due to the flatness of the data structure, the number of triples easily runs into billions and even trillions. Graph queries (typically written in a query language called SPARQL), if complex, result in self joins that may drastically impact performance. But there are ways around this, if the system is architected right.
At Knoldus, we have built enterprise knowledge graphs for our customers using the latest advances in distributed computing. We built the graphs using a combination of domain knowledge, distributed computing, data engineering, search technologies and machine learning. I will discuss architecture choices for building a knowledge graph in my next post.
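To see where the self joins come from, here is a rough sketch that expresses a three-pattern graph query against a plain three-column table using Python's built-in `sqlite3`. It is an illustration in relational terms only, under the assumption of a single `triples` table; actual SPARQL engines have their own storage and query planning, and the sample rows reuse the earlier examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Ram", "typeOf", "Male"),
        ("Ram", "livesAt", "Address1"),
        ("Address1", "label", "56 N Averry Ct, Palatine, IL USA"),
        ("Chris", "hairColor", "Black"),
    ],
)

# Informal graph pattern: ?person typeOf Male . ?person livesAt ?addr . ?addr label ?label
# Each pattern reads the same table again, so three patterns mean two self joins.
QUERY = """
SELECT t1.subject, t3.object
FROM triples AS t1
JOIN triples AS t2 ON t2.subject = t1.subject
JOIN triples AS t3 ON t3.subject = t2.object
WHERE t1.predicate = 'typeOf' AND t1.object = 'Male'
  AND t2.predicate = 'livesAt'
  AND t3.predicate = 'label'
"""

rows = conn.execute(QUERY).fetchall()
```

Every additional pattern in the question adds another pass over the same table, which is why query cost becomes a concern once that table holds billions of rows.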