The story of the Semantic Web is not new; it is interesting, though, how some ideas become more and more important with the passage of time. The term was coined by Sir Tim Berners-Lee in May 2001, yet it took us around 14 years to get into its details, when a large publishing house approached us to help them build a system on the Scala and Spark ecosystem utilizing the concept. (This is not very different from the Actor Model, which originated in 1973 and became popular with our beloved Akka framework.) Good food is cooked slowly!
So, what is the Semantic Web?
The vision of the Semantic Web is to extend the principles of the Web from documents to data. This is an important distinction. Hang on! Currently, if you notice, we are document dependent: documents link to other documents, hyperlinks lead to further documents, and so on. The data within those documents is treated as secondary and does not have an identity of its own (well, mostly!). In the Semantic Web vision, data is of prime importance:
- Data should be accessible through the general Web architecture, e.g., via URIs.
- Data should be related to one another just as documents (or portions of documents) already are.
- Data could be shared and reused across application, enterprise, and community boundaries, to be processed automatically by tools as well as manually, including revealing possible new relationships among pieces of data.
Building block – Triple
Before we look at how data can relate to other data, let us understand the building block of the Semantic Web, called the Triple. A Triple is a statement relating a Subject, a Predicate, and an Object. Let's consider the following example:
Here, our subject is MovieID:Gladiator, which has predicates like name, year, and director, and the objects are the literals "Gladiator", 2000, and "Ridley Scott". A predicate is a property of the entity to which it is attached. Notice that "Ridley Scott" is a literal. If we wanted to use "Ridley Scott" as the subject of another triple, that would be a problem while it remains just a literal. Hence, let us convert "Ridley Scott" into an ID as well.
If you notice, the predicate 'name' is now reused for the person as well as the movie to represent the relationship. OK, so how does having PersonID:RS help us? Well, now we can define many more triples for this person ID, for example:
So now we know that Ridley Scott directed other movies as well, and that he is also a producer of The Martian along with being its director.
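To make the idea concrete, here is a minimal Python sketch (the tuple representation and the describe helper are illustrative, not part of any RDF standard) that models these movie triples as plain (subject, predicate, object) tuples:

```python
# A triple is just (subject, predicate, object). Using the movie example:
triples = [
    ("MovieID:Gladiator", "name", "Gladiator"),
    ("MovieID:Gladiator", "year", 2000),
    ("MovieID:Gladiator", "director", "PersonID:RS"),
    # Because "Ridley Scott" is now an ID, it can be a subject too:
    ("PersonID:RS", "name", "Ridley Scott"),
    ("PersonID:RS", "directed", "MovieID:Martian"),
    ("PersonID:RS", "produced", "MovieID:Martian"),
]

def describe(subject, triples):
    # Collect every predicate/object pair attached to a subject.
    return {p: o for s, p, o in triples if s == subject}

print(describe("PersonID:RS", triples))
```

Note how promoting "Ridley Scott" from a literal to an ID is what lets PersonID:RS appear in the subject position of its own triples.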
The interesting thing about these triples is that they can form chains of relationships. Multiple triples can be tied together by using the same subjects and objects in different triples, and as we assemble these chains of relationships, they form a directed graph.
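Such a chain can be sketched as edge-following over an adjacency list, where each triple is one directed edge (again a toy illustration with plain tuples, not real RDF machinery):

```python
from collections import defaultdict

triples = [
    ("MovieID:Gladiator", "director", "PersonID:RS"),
    ("PersonID:RS", "directed", "MovieID:Martian"),
    ("MovieID:Martian", "year", 2015),
]

# Each triple is a directed edge: subject --predicate--> object.
graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))

def follow(node, *predicates):
    # Walk one edge per predicate and return the node we end up at.
    for pred in predicates:
        node = next(o for p, o in graph[node] if p == pred)
    return node

# Gladiator --director--> Ridley Scott --directed--> The Martian
print(follow("MovieID:Gladiator", "director", "directed"))
```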
So, assuming that we understand Triples now, let us look at how the data could be connected. Say, for example, that you or your forward-looking organization would like to build a highly scalable, concurrent, and fault-tolerant system using reactive product development techniques and would like to evaluate Knoldus for that. (We are a direct fit, btw 😉)
You are interested in the following:
- Details about Knoldus
- Stock holding details of the company
- Partnerships of the company and what the partners represent
- Consulting charges of the company (assuming it is available on another site)
- Clients and client details of the company
Currently, fetching all this information involves manual processing. You would get some details from http://www.knoldus.com. Then, on the basis of the company ID, you would have to look up the stock holding status on the ministry of commercial affairs site. Then, to find the partnership information, you would visit the partner sites. Then you would go to the consulting charges site and finally find out what domains the clients are in.
This is quite a bit of document trawling. Now let us see how, with the Semantic Web in action, it might become much easier. Let us say that the agent which crawls the Knoldus website gets the following information about Knoldus in terms of Triples:
nsk:CompanyID nsk:name "Knoldus Software LLP"
nsk:CompanyID nsk:location nsk:CountryID
nsk:CompanyID nsk:partner nsk:lightbend
nsk:lightbend nsk:name "Lightbend"
nsk:lightbend nsk:speciality nsk:platform
nsk:CompanyID nsk:partner nsk:databricks
nsk:databricks nsk:name "Databricks"
nsk:databricks nsk:speciality nsk:platform
Now let us try to decipher it. nsk is the namespace, say, for Knoldus. nsk:CompanyID identifies the resource, which here is the Knoldus website.
nsk:CompanyID nsk:name "Knoldus Software LLP"
represents a subject nsk:CompanyID which has an attribute (predicate) called nsk:name with the value "Knoldus Software LLP".
nsk:CompanyID nsk:partner nsk:lightbend
represents the same subject with another predicate called partner, which uses the ID of another subject as its object. The other subject in this case is nsk:lightbend, which has further predicates of nsk:name and nsk:speciality.
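As a rough illustration, these nsk triples can already be queried in a few lines of Python (plain tuples stand in for RDF terms; the two-step lookup is a simplification of what a real query engine does):

```python
nsk = [
    ("nsk:CompanyID", "nsk:name", "Knoldus Software LLP"),
    ("nsk:CompanyID", "nsk:location", "nsk:CountryID"),
    ("nsk:CompanyID", "nsk:partner", "nsk:lightbend"),
    ("nsk:lightbend", "nsk:name", "Lightbend"),
    ("nsk:lightbend", "nsk:speciality", "nsk:platform"),
    ("nsk:CompanyID", "nsk:partner", "nsk:databricks"),
    ("nsk:databricks", "nsk:name", "Databricks"),
    ("nsk:databricks", "nsk:speciality", "nsk:platform"),
]

# "What are the names of Knoldus's partners?"
# Step 1: follow the partner edges; step 2: follow each partner's name edge.
partners = [o for s, p, o in nsk if s == "nsk:CompanyID" and p == "nsk:partner"]
names = [o for s, p, o in nsk if s in partners and p == "nsk:name"]
print(names)
```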
In a graphical format, this information would be represented like this:
Now, let's say that the same agent went to the ministry of commerce website and got the following information:
nsm:mcID nsm:active nsm:knoldusID
nsm:knoldusID nsm:name "Knoldus Software LLP"
nsm:knoldusID nsm:location nsm:cityID
nsm:knoldusID nsm:partner nsm:p1
nsm:knoldusID nsm:partner nsm:p2
nsm:p1 nsm:name "Vikas Hazrati"
...
Looking at this information, we can see how the two sets relate. The agent is also intelligent enough to relate them together: it understands that nsk:CompanyID in the first namespace is the same as nsm:knoldusID in the second namespace. It automatically adds the following statement to its statement collection:
nsk:CompanyID sameAs nsm:knoldusID
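One naive way to picture this gluing step is to rewrite every aliased ID to a single canonical ID and then take the union of the triple sets. A hedged Python sketch (the merge helper and the trimmed-down sample data are illustrative, not the agent's actual algorithm):

```python
nsk = [
    ("nsk:CompanyID", "nsk:name", "Knoldus Software LLP"),
    ("nsk:CompanyID", "nsk:partner", "nsk:lightbend"),
]
nsm = [
    ("nsm:knoldusID", "nsm:name", "Knoldus Software LLP"),
    ("nsm:knoldusID", "nsm:partner", "nsm:p1"),
    ("nsm:p1", "nsm:name", "Vikas Hazrati"),
]

# The sameAs statement the agent inferred, as (canonical, alias) pairs:
same_as = {("nsk:CompanyID", "nsm:knoldusID")}

def merge(graphs, same_as):
    # Rewrite each aliased ID to its canonical ID, then union the graphs.
    canon = {alias: canonical for canonical, alias in same_as}
    merged = set()
    for g in graphs:
        for s, p, o in g:
            merged.add((canon.get(s, s), p, canon.get(o, o)))
    return merged

merged = merge([nsk, nsm], same_as)
# nsk:CompanyID now carries predicates from both sources:
print(sorted(p for s, p, o in merged if s == "nsk:CompanyID"))
```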
Once this is done, the agent has found a way to glue the graphs together. With this, we get the following graph:
Similarly, the agent went to the Lightbend website and found the following information
nsl:companyID nsm:name "Lightbend"
nsl:companyID nsm:speciality "Reactive, Big Data"
nsl:partnerID nsm:partner nsl:knoldusID
nsl:knoldusID nsm:name "Knoldus Software LLP"
...
Again, as you will notice, the agent would be able to form a glue and merge the two graphs together for fetching information. I'll let you create the next graph on your own 😉, but you get the idea.
The set of questions that this integration agent can answer keeps growing as it glues together more graphs through common subjects and objects.
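Answering such questions can be pictured as pattern matching over the merged triple collection. Here is a minimal sketch (the match helper and the sample store are hypothetical; real systems would use a query language like SPARQL for this):

```python
# A tiny in-memory triple store with pattern queries; None acts as a wildcard.
def match(triples, s=None, p=None, o=None):
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

store = [
    ("nsk:CompanyID", "nsk:name", "Knoldus Software LLP"),
    ("nsk:CompanyID", "nsk:partner", "nsk:lightbend"),
    ("nsk:lightbend", "nsk:speciality", "nsk:platform"),
]

# "Who are the partners of Knoldus, and what is each partner's speciality?"
for _, _, partner in match(store, s="nsk:CompanyID", p="nsk:partner"):
    for _, _, spec in match(store, s=partner, p="nsk:speciality"):
        print(partner, "->", spec)
```

Every new graph glued into the store makes more of these patterns answerable, without any change to the query mechanism itself.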
So what would we need for this agent to be successful?
- Each statement collected by the agent represents a piece of knowledge.
- Since this knowledge should be understood by machines, there needs to be a standard model in which this knowledge is available on the web.
- This model has to be accepted as a standard by all Web sites; otherwise, statements contained in different Web sites would not share a common pattern.
- There has to be a way to create these statements on each Web site. For example, they could be either manually added or automatically generated.
- There should be common standards for domains. Say, for example, a person should be represented in the same format across domains.
Hence, the Semantic Web can be understood as a brand-new layer built on top of the current Web, one that adds machine-understandable meanings (or "semantics") to it. It would enable automatic data integration and linked data for machines.
Stay tuned, more to come ….