Introduction to Pentaho Data Integration – Theory For Foundational Understanding

In this blog we will go over the basic theoretical concepts you must understand if you want to use Pentaho Data Integration.

To begin with, we need to know what ETL is.

What is ETL?

ETL stands for Extract, Transform and Load. It is the process of gathering data from different data sources, converting it into the required format, and then loading it into a destination data storage system.

The ETL process is made up of three steps.

The first step is extraction, which comprises all the processing required to connect to various data sources and extract data from them. A few examples of data sources are Excel files, a table in a SQL database, JSON and XML files, or even data stored in a cloud storage service. Getting data from all these sources may sound straightforward, but it is one of the main challenges in ETL.

The second step is transforming the data. It includes any function applied to the extracted data, such as data cleaning and quality checks, modifying data values and structure, integrating data, and calculating or aggregating values on the processed data.

The last step is loading the data. This includes all the processing required to load data into a target system. It is not simply a matter of uploading data in bulk to the target; it involves several processes, such as managing surrogate keys, dimension tables, and other related operations.
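To make these three steps concrete, here is a minimal Python sketch of an extract-transform-load flow. The file name, column names, and the SQLite target are hypothetical stand-ins for real sources and a real warehouse, not part of any Pentaho workflow.

```python
import csv
import sqlite3

# Extract: read raw order rows from a (hypothetical) CSV export.
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean values and derive a new field.
for row in rows:
    row["customer"] = row["customer"].strip().title()                 # basic cleaning
    row["total"] = float(row["quantity"]) * float(row["unit_price"])  # calculated value

# Load: write the processed rows into a target table (SQLite standing in for a warehouse).
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [(r["customer"], r["total"]) for r in rows],
)
conn.commit()
conn.close()
```

In a tool like Pentaho, each of these stages would be a step in a transformation rather than hand-written code, but the flow of data is the same.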

Usually, the last step involves loading the data into a data warehouse or a data mart. Our next step is to understand the context of these terms and how a data warehouse differs from an operational database.

Data Warehouse and Operational Database

In an organization, data is typically stored in two types of databases. One is an operational database and the other is a data warehouse. While both systems store data, their objectives differ greatly.

In an operational database, information pertaining to a particular aspect of a business is stored in real-time. For instance, when you place an order on Amazon, that order is put into an operational database. In this database, only the orders coming from their site are stored.

A data warehouse, on the other hand, gathers data from many different sources within an organization, and its purpose is to be utilized for analytics and reporting. Reports generated from complex queries against the warehouse then serve as the basis for business decisions. Therefore, the ultimate goal of a data warehouse is to facilitate business decisions.

Unlike operational databases, data warehouses store a large amount of historical information so that historical data can also be analyzed.

An important difference is that a data warehouse isn’t updated as frequently as an operational database. Thus, in our Amazon example, each time a new order is placed, the operational database is updated. The data warehouse, however, might only be updated once a day or once a week.

Difference between an operational database and a data warehouse

In operational databases, OLTP (Online Transaction Processing) handles thousands of short transactions that insert, update, replace, and delete data. Each transaction touches only a small amount of data, but updates happen constantly. This type of processing responds immediately to user requests and is used to manage an organization’s daily operations in real time.

If we compare OLTP with a data warehouse, we see that the former usually focuses on a single business process, whereas the latter collects data from many different business processes. Likewise, an OLTP system usually works against a single data source, whereas the data warehouse collates data from several sources.

In OLTP, insert and update queries are the dominant type of queries, since new rows of data are inserted and updated frequently. In a data warehouse, read queries dominate, because data is mostly extracted from the warehouse for analytics and reports.
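The contrast in query shapes can be sketched in a few lines of Python, with SQLite standing in for both kinds of systems and an invented orders table purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# OLTP-style workload: many short transactions, each inserting or updating a few rows.
conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("alice", 49.99))
conn.execute("UPDATE orders SET amount = ? WHERE customer = ?", (59.99, "alice"))
conn.commit()

# Warehouse-style workload: read-heavy queries that scan and aggregate many rows.
for customer, total in conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
):
    print(customer, total)

conn.close()
```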

Another major difference is the amount of data. An OLTP database typically holds comparatively little data, often well under 100 GB, while a data warehouse commonly stores terabytes of data.

Data warehouses are primarily organization-wide. On the other hand, OLTP supports an operational business process.

A reason for doing ETL is to combine data from several OLTP databases into a single data warehouse, which can later be used to produce analytical reports so that business decisions can be made.

What is Data Mart?

Data marts are similar to data warehouses, but they are subsets of them. A data mart is also not an operational database. The aim of a data mart is to serve only a particular unit of an organization, such as sales, finance, or marketing. The idea is that all sales-related data is available to the sales team, while finance-related data is not.

Thus, a data warehouse assists in setting up organization-wide strategies, while a data mart assists specific departments in creating departmental strategies.
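One simple way to picture a data mart is as a department-level slice of the warehouse. The sketch below, with an invented department field, just filters warehouse rows so that only sales-related data ends up in the sales mart.

```python
# Hypothetical warehouse rows tagged with the owning department.
warehouse = [
    {"department": "sales",   "metric": "orders",  "value": 1200},
    {"department": "finance", "metric": "revenue", "value": 98000},
    {"department": "sales",   "metric": "returns", "value": 35},
]

# The sales data mart holds only the sales team's slice of the warehouse.
sales_mart = [row for row in warehouse if row["department"] == "sales"]
print(sales_mart)
```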

Inmon vs Kimball Architectures

There are two popular data warehouse architectures proposed by renowned authors. The first is by Inmon and the second is by Kimball.

Let us first discuss the Inmon Model.

Inmon Model

According to Inmon, organizations should first compile all the data from several sources and then create a data warehouse. The data warehouse will provide a central repository for all data across the organization.

From this data warehouse, data will be segregated for use by different departments, and it is this segregation that creates the data mart. Then, each department creates an analytics layer on the data mart to extract information and analytics from it. Lastly, reports are generated from the data mart analytics.

This entire process involves ETL in two steps: data from several sources is first collected and put into the warehouse, and then data is taken out of the warehouse and put into the data marts. During the first step, not much transformation is done; the data is typically extracted and loaded straight away, so the warehouse holds the data in its raw format. Most of the processing and transformation happens in the second step, when the data is put into the data marts. This is the Inmon model.
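A minimal sketch of this two-step flow, with made-up source data and department names, might look like the following: raw rows land in the warehouse almost untouched, and the cleaning and casting happen when a departmental mart is carved out.

```python
# Step 1: extract from several (hypothetical) sources and load raw into the warehouse.
source_a = [{"dept": "sales", "amount": "100.5"}, {"dept": "finance", "amount": "20"}]
source_b = [{"dept": "sales", "amount": "7.25"}]
warehouse = source_a + source_b          # little to no transformation at this stage

# Step 2: transform while carving department-specific data marts out of the warehouse.
sales_mart = [
    {"dept": row["dept"], "amount": float(row["amount"])}   # cast/clean during mart load
    for row in warehouse
    if row["dept"] == "sales"
]
print(sales_mart)
```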

Kimball Model

In the Kimball model, which is more popular, data from the data sources is first collated into a data mart. This means that Kimball was suggesting that each department in the organization maintain its own data mart. The data from these data marts, which are maintained by different functions of the organization, is then collated and put into a data warehouse.

For instance, the sales department will collect and maintain its data, the finance department will collect and maintain its data, the marketing department will collect and maintain its data, and so on. All of these databases will be maintained by the different functions. The data will then be collected and put into the main warehouse.

Organizations are more likely to use this type of architecture. Cleansing and validation happen in the first ETL step: as data is extracted from the various sources, it is usually cleaned and put into the data marts. When data then moves from the data marts into the data warehouse, it is only processed further to make it more meaningful.

Analytics and reporting are then handled by the data warehouse rather than by the individual data marts.
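The Kimball flow runs in the opposite direction from Inmon's. A rough sketch, again with invented departmental data, shows already-cleaned marts being consolidated into the warehouse, where only light processing is applied.

```python
# Each department maintains its own (already cleaned) data mart.
sales_mart = [{"dept": "sales", "amount": 107.75}]
finance_mart = [{"dept": "finance", "amount": 20.0}]

# The warehouse is built by consolidating the departmental marts,
# with only light processing to make the combined data more meaningful.
warehouse = sales_mart + finance_mart
total_by_dept = {}
for row in warehouse:
    total_by_dept[row["dept"]] = total_by_dept.get(row["dept"], 0) + row["amount"]
print(total_by_dept)
```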

ETL vs ELT

As mentioned earlier, ETL stands for Extract, Transform, Load, while ELT stands for Extract, Load, and then Transform.

ETL

A typical ETL process starts with extracting data from several sources and putting it into a staging area, which is simply a collection of data from all the sources stored in its raw format. The staging area contains raw data that has not yet been processed.

After the staging area, we transform the data according to the requirements and then load it into the data warehouse. The data in the warehouse is then used to create analytics and analytics-based reports. This is the typical ETL flow, and tools such as Pentaho, Talend, and Informatica can be used to implement it.

The other option is ELT, which follows extract, load, and then transform.

ELT

In ELT, we first extract the data from all the sources and put it straight into the data warehouse without any transformations, so there is no need for a staging area. The data warehouse itself contains the data in its raw format. The transformations are then done on the data within the warehouse, and this transformed data is used for analytics.

The major difference is that in ETL the transformations are done in the tool, such as Pentaho, whereas in ELT the transformations are done inside the data warehouse using complex SQL queries.
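To illustrate the ELT variant, here is a hedged Python sketch in which SQLite stands in for the warehouse: the raw rows are loaded first, untouched, and the transformation is then expressed as SQL that runs inside the database rather than in an ETL tool. Table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")     # stand-in for the data warehouse

# Extract + Load: raw, untransformed rows go straight into the warehouse.
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(" Alice ", "100.5"), ("BOB", "20")],
)

# Transform: done afterwards, inside the warehouse, with SQL.
conn.execute("""
    CREATE TABLE clean_orders AS
    SELECT TRIM(LOWER(customer)) AS customer,
           CAST(amount AS REAL)  AS amount
    FROM raw_orders
""")
print(conn.execute("SELECT * FROM clean_orders").fetchall())
conn.close()
```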

Now, which is used when? ELT is typically used when we have a very large amount of data, so large that transforming it in a software tool is not feasible, and we have to put the data in the data warehouse and use SQL queries to transform it. In most scenarios, the data is not that large, which is why ETL is usually preferred, given the availability of comprehensive tools such as Pentaho, Talend, and Informatica that can easily handle the entire ETL process. So for most practical purposes, ETL is preferred over ELT.

References –

To know more about Pentaho Data Integration, you can visit this link.

To read more tech blogs, feel free to visit Knoldus Blogs.

Written by 

Akshat Mathur is a Software Consultant at Knoldus Inc. He has worked on Java, Angular 11, and Spring Boot for more than three years. Learning is his passion and solving problems is his forte.
