Loading and Indexing Data in MarkLogic

Because MarkLogic is a document-oriented database, data is commonly stored as JSON or XML documents.

If the data to bring into MarkLogic is not already structured as JSON or XML, for example because it currently lives in a relational database, there are various ways to export or transform it from the source.

For example, many relational databases provide an option to export relational data in XML or JSON format, or a SQL script can be written to fetch the data from the database and output it in an XML or JSON structure. Alternatively, MarkLogic Content Pump can import the rows of a .csv file as XML or JSON documents.

In any case, it is normal to denormalize the data being exported from the relational database first, so that the content is put back together in its original state. Denormalization, which occurs naturally when working with documents in their original form, greatly reduces the need for joins and accelerates performance.
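
As a rough illustration of denormalization, the sketch below folds two relational-style tables into one self-contained document per customer. The table names, columns, and sample rows are hypothetical and exist only for this example; a real export would read them from the source database.

```javascript
// Hypothetical relational rows, as they might come out of a SQL export.
const customers = [
  { id: 1, name: "Asha" },
  { id: 2, name: "Ravi" },
];
const orders = [
  { id: 202, customerId: 1, product: "Mouse", price: 1000, quantity: 3 },
  { id: 203, customerId: 1, product: "Keyboard", price: 2000, quantity: 2 },
  { id: 204, customerId: 2, product: "Mouse", price: 1000, quantity: 1 },
];

// Build one document per customer and embed that customer's orders,
// so reading a customer no longer requires a join.
function denormalize(customerRows, orderRows) {
  return customerRows.map((customer) => ({
    customer: {
      id: customer.id,
      name: customer.name,
      orders: orderRows
        .filter((order) => order.customerId === customer.id)
        // Drop the foreign key; the nesting already expresses the relation.
        .map(({ customerId, ...rest }) => rest),
    },
  }));
}

const customerDocs = denormalize(customers, orders);
console.log(JSON.stringify(customerDocs[0], null, 2));
```

Each element of customerDocs could then be loaded as its own JSON document.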

Schema Agnostic

A schema is a set of rules that defines the structure of the data in a database. Schemas help with data quality: data that follows a known structure is more reliable, and a database built on reliable data is easier to act on.

A schema-agnostic database is not bound by any schema, although it is aware of schemas. Schemas are optional in MarkLogic, so data can be loaded in its original form. To address a group of documents within a database, directories, collections, and the internal structure of the documents can be used. This is how MarkLogic easily supports data from disparate systems, all in the same database.

Required Document Size and Structure

When loading documents, the best choice is to have one document per entity. MarkLogic is most performant with many small documents rather than one large document. The target document size is 1 KB to 100 KB, but documents can be larger.

For example, rather than loading a group of students all as one document, make each student its own document.
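
A minimal sketch of that idea follows: one array of students becomes one small document per student. The field names and the "/students/<id>.json" URI pattern are assumptions made for this example, not a MarkLogic requirement.

```javascript
// Hypothetical input: one large array holding every student.
const students = [
  { id: "s-001", name: "Meera", course: "course-101" },
  { id: "s-002", name: "Arjun", course: "course-101" },
];

// Produce one { uri, content } pair per student.
// The URI pattern is an assumption chosen for this sketch.
function toStudentDocuments(rows) {
  return rows.map((student) => ({
    uri: `/students/${student.id}.json`,
    content: student,
  }));
}

const studentDocs = toStudentDocuments(students);
console.log(studentDocs.map((doc) => doc.uri));
```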

When defining a document, choose the XML element and attribute names, or the JSON property names, with care. Make the names human-readable and do not create generic names. Following this convention helps the indexes be efficient.

<items>
  <item>
    <product>Mouse</product>
    <price>1000</price>
    <quantity>3</quantity>
  </item>
  <item>
    <product>Keyboard</product>
    <price>2000</price>
    <quantity>2</quantity>
  </item>
</items>

Indexing Documents

As documents are loaded, all the words in each document, and the structure of each document, are indexed, so the documents are easily searchable.

Documents can be loaded into MarkLogic in many ways:

MarkLogic Content Pump (MLCP)
Data Movement SDK
REST API
Java API or Node.js API
XQuery
JavaScript functions

Reading a Document

To read a document, the URI of the document is used.

XQuery Example : fn:doc("college/course-101.json")

JavaScript Example : fn.doc("account/order-202.json")

REST API Example : curl --anyauth --user admin:admin -X GET "http://localhost:8055/v1/documents?uri=/accounting/order-10072.json"
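
When the request URL is assembled in code rather than typed by hand, the document URI should be percent-encoded as a query parameter. The sketch below builds the URL for the /v1/documents endpoint of the REST Client API. The host and port are copied from the curl example and are placeholders, the documentUrl helper is hypothetical, and authentication is left out.

```javascript
// Build the URL for reading one document through the REST API.
// The host and port are placeholders; adjust them for a real server.
function documentUrl(baseUrl, uri) {
  const url = new URL("/v1/documents", baseUrl);
  // searchParams.set percent-encodes reserved characters in the URI.
  url.searchParams.set("uri", uri);
  return url.toString();
}

const readUrl = documentUrl("http://localhost:8055", "/accounting/order-10072.json");
console.log(readUrl);
```

Encoding matters most for document URIs that contain spaces, ampersands, or other characters with a special meaning in a query string.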

Splitting Feature of MLCP

MLCP can split large XML documents so that each occurrence of a designated element becomes an individual XML document in the database. This is useful when multiple records are all contained in one large XML file, such as a list of students, courses, or details.

The -input_file_type aggregates option tells MLCP to split a large document into individual documents. The -aggregate_record_element option names the element that designates a new document. The -uri_id option names the element whose value is used to create the URI for each document.
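
To make the effect of these options concrete, the sketch below imitates the split on a tiny in-memory file. It is an illustration only and says nothing about how MLCP is implemented: the splitAggregate helper is hypothetical, and the regular-expression matching is safe only because the sample record elements are not nested and carry no attributes.

```javascript
// A small aggregate file with two records.
const aggregateXml = `
<items>
  <item><product>Mouse</product><price>1000</price><quantity>3</quantity></item>
  <item><product>Keyboard</product><price>2000</price><quantity>2</quantity></item>
</items>`;

// Split on a record element and take the URI from one child element,
// mirroring the roles of the record-element and URI-id options.
function splitAggregate(xml, recordElement, uriIdElement) {
  const recordPattern = new RegExp(
    `<${recordElement}>[\\s\\S]*?</${recordElement}>`, "g");
  const idPattern = new RegExp(`<${uriIdElement}>([^<]*)</${uriIdElement}>`);
  return (xml.match(recordPattern) || []).map((record) => {
    const idMatch = record.match(idPattern);
    if (!idMatch) throw new Error(`record has no <${uriIdElement}> element`);
    return { uri: idMatch[1].trim(), content: record };
  });
}

const itemDocs = splitAggregate(aggregateXml, "item", "product");
console.log(itemDocs.map((doc) => doc.uri));
```

Each entry in itemDocs corresponds to one document that the split would produce, with the value of the chosen element serving as its URI.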

While it is fine to have a mix of XML and JSON documents in the same database, it is also possible to transform content from one format to the other. For example, the following XQuery converts an XML document to JSON:

xquery version "1.0-ml";
import module namespace json = "http://marklogic.com/xdmp/json" at "/MarkLogic/json/json.xqy";

json:transform-to-json(fn:doc("doc-01.xml"), json:config("custom"))

MarkLogic Content Pump can also import the rows of a .csv file into a MarkLogic database. The data can be transformed during the load or afterward in the database. Ways to modify content once it is already in the database include the Data Movement SDK, XQuery, and JavaScript.
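
The row-to-document mapping can be pictured with the sketch below, which turns each CSV row into one JSON document. The column names, sample rows, URI prefix, and csvToDocuments helper are all hypothetical, and the parsing is deliberately naive: fields are split on commas, so quoted fields that contain commas are not handled.

```javascript
// Hypothetical CSV content; the first line is the header row.
const csvText = [
  "id,product,price,quantity",
  "10072,Mouse,1000,3",
  "10073,Keyboard,2000,2",
].join("\n");

// Convert each data row into a { uri, content } pair.
// All values stay strings; no type conversion is attempted.
function csvToDocuments(text, idColumn, uriPrefix) {
  const [headerLine, ...rowLines] = text.trim().split(/\r?\n/);
  const headers = headerLine.split(",");
  return rowLines.map((line) => {
    const values = line.split(",");
    const content = Object.fromEntries(
      headers.map((header, index) => [header, values[index]]));
    return { uri: `${uriPrefix}${content[idColumn]}.json`, content };
  });
}

const orderDocs = csvToDocuments(csvText, "id", "/accounting/order-");
console.log(orderDocs[0]);
```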

Conclusion

MarkLogic covers the full path for bringing data into the database: loading it, indexing it as it loads, transforming it between formats, and splitting large files into individual documents.
