Basic Overview Of Pentaho Data Integration

Reading Time: 4 minutes

Let us go through a basic overview of Pentaho Data Integration: what it is, why it matters, and the ETL process it implements. So, let's start.

The first question that arises is: what is Pentaho?

Pentaho is a leading business intelligence tool that makes it possible for an organization to easily access, organize, and analyze data. It is very popular today and has set the benchmark as one of the most used and preferred components for data integration.

Pentaho Data Integration provides Extract, Transform, and Load (ETL) capabilities that facilitate capturing, cleaning, and storing data in a uniform format that is accessible and compatible not only for end users but also for IoT technologies.

Extract, Transform, and Load (ETL) Process

ETL is a three-step process. First, data is extracted from different source systems. Next, it is transformed, applying actions such as changing data types or performing required calculations. After the data is transformed, it is loaded into the target datastore, typically a data warehouse.

The ETL process requires active inputs from various stakeholders, including developers, analysts, testers, and top executives, and it is technically challenging.
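The three steps above can be sketched in a few lines of code. This is a minimal, generic illustration using an in-memory SQLite database as the "target datastore"; the table and column names are illustrative assumptions, not part of any Pentaho API.

```python
# Minimal sketch of the three ETL steps: extract, transform, load.
# Table and column names here are illustrative assumptions.
import sqlite3

# 1. Extract: pull raw records from a source system (here, a list of dicts).
raw_rows = [
    {"name": "  Alice ", "amount": "100.5"},
    {"name": "Bob",      "amount": "42"},
]

# 2. Transform: clean the strings and change data types (str -> float).
clean_rows = [(r["name"].strip(), float(r["amount"])) for r in raw_rows]

# 3. Load: write the transformed rows into the target warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 142.5
```

In PDI these same steps are built visually in Spoon as a transformation, but the underlying data flow is the same.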

There are a few development tools for implementing ETL processes in Pentaho:

  • Spoon- A desktop application used as a graphical interface and editor for transformations and jobs. With Spoon, we can author, edit, run, and debug transformations. We can also use Spoon to enter license keys, add data connections, and define security settings.
  • Pan- Pan is the PDI command-line tool for executing transformations. Transformations can be from a PDI repository or a local file.
  • Kitchen- Kitchen is a program that can execute jobs designed by Spoon in XML or in a database repository.
  • Carte- Carte is a simple web server that allows you to run transformations and jobs remotely. It receives XML (using a small servlet) that contains the transformation to run and the execution configuration.
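As a sketch of how Pan and Kitchen are typically invoked from an automation script, the snippet below assembles the command lines without running them. The install path and the .ktr/.kjb file names are illustrative assumptions; the -file, -level, and -param options are standard PDI command-line options.

```python
# Sketch of invoking the PDI command-line tools from a script.
# The install path and file names below are illustrative assumptions.
import shlex

PDI_HOME = "/opt/pentaho/data-integration"

# Pan runs a single transformation from a local .ktr file (or a repository):
pan_cmd = [f"{PDI_HOME}/pan.sh",
           "-file=/opt/etl/load_customers.ktr",
           "-level=Basic"]

# Kitchen runs a job (.kjb), optionally with named parameters:
kitchen_cmd = [f"{PDI_HOME}/kitchen.sh",
               "-file=/opt/etl/nightly.kjb",
               "-param:TARGET_DB=warehouse"]

# In a real deployment these lists would be passed to subprocess.run();
# here we only print the assembled commands as a dry run.
for cmd in (pan_cmd, kitchen_cmd):
    print(shlex.join(cmd))
```

Scheduling these commands with cron or a job scheduler is a common way to run PDI transformations and jobs unattended.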

How to do a database join with PDI?

  • If we want to join 2 tables from the same database, we can use a “Table Input” step and do the join in SQL itself.
  • If we want to join 2 tables that are not in the same database, we can use the “Database Join” step.
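The two strategies above can be illustrated with plain code. The sketch below uses two in-memory SQLite databases to stand in for separate connections; all table and column names are illustrative assumptions. The same-database case is a single SQL join (what a “Table Input” step would run), while the cross-database case runs a parameterized lookup query per input row, which is conceptually what the “Database Join” step does.

```python
# Sketch of the two join strategies, using two in-memory SQLite
# databases to stand in for separate connections (names are illustrative).
import sqlite3

# Same database: both tables live in one connection, so the join
# can be written directly in SQL, as in a single Table Input step.
db_a = sqlite3.connect(":memory:")
db_a.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders    (customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders    VALUES (1, 10.0), (2, 20.0);
""")
same_db = db_a.execute("""
    SELECT c.name, o.total
    FROM customers c JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()

# Different databases: for each input row, run a parameterized query
# against the second connection (conceptually what Database Join does).
db_b = sqlite3.connect(":memory:")
db_b.execute("CREATE TABLE ratings (customer_id INTEGER, stars INTEGER)")
db_b.executemany("INSERT INTO ratings VALUES (?, ?)", [(1, 5), (2, 4)])

cross_db = []
for cust_id, name in db_a.execute("SELECT id, name FROM customers ORDER BY id"):
    stars = db_b.execute(
        "SELECT stars FROM ratings WHERE customer_id = ?", (cust_id,)
    ).fetchone()[0]
    cross_db.append((name, stars))

print(same_db)   # [('Alice', 10.0), ('Bob', 20.0)]
print(cross_db)  # [('Alice', 5), ('Bob', 4)]
```

The per-row lookup in the second case also explains why the Database Join step is slower than an in-database join: it issues one query per incoming row.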

Get Started with the PDI client

The PDI client (also known as Spoon) is a desktop application that enables us to build transformations and to schedule and run jobs.

Common uses of PDI clients include:

  • Data migration between different databases and applications
  • Loading huge data sets into databases taking full advantage of cloud, clustered and massively parallel processing environments
  • Data cleansing with steps ranging from very simple to very complex transformations
  • Data integration including the ability to leverage real-time ETL as a data source for Pentaho Reporting

Use Pentaho Repositories in Pentaho Data Integration

The Pentaho Data Integration client offers several different types of file storage. If a team needs an ETL collaboration environment, it is recommended to use a Pentaho Repository.

In addition to maintaining and managing our operations and changes, the Pentaho Repository also provides a comprehensive revision history to track changes, compare updates, and revert to previous versions if necessary. These features, along with enterprise security and content locking, make the Pentaho Repository an ideal platform for collaboration.

Features of Pentaho Data Integration

  • It provides the capability of an almost no-code development approach.
  • It provides platform independence for both development and deployment.
  • Pentaho Data Integration can deal with huge data volumes; in some scenarios, data size runs into terabytes.
  • It can integrate with big data technologies, i.e. it supports Avro, Cassandra, Hadoop, MongoDB, and many more.
  • Pentaho Data Integration can interact with REST and HTTP clients.

What’s new in Pentaho 9.2

Access to Microsoft Azure from Pentaho- We can now access Microsoft Azure from Pentaho using the following methods:

  • Via VFS connections to Azure Data Lake Storage Gen2 and Blob Storage services.
  • As PDI and PUC data sources for an Azure SQL database.
  • Load data into an Azure SQL database from Azure Data Lake Storage using the new Bulk load into Azure SQL DB PDI job entry.

Access to Cloudera Data Platform from PDI– We can now access and process data from the Cloudera Data Platform in PDI. Cloudera Data Platform (CDP) is an analytics and management platform that provides self-service access to integrated, multi-function analytics on centrally managed business data with security and governance.

Access to HDFS copy file operations from PDI through an Apache Hadoop driver- We can access and use the installed Apache Hadoop driver for HDFS copy file operations as well as for executing input-output transformations and jobs. The driver works with both secure and unsecured clusters.

Pentaho Upgrade Installer- The new Pentaho Upgrade Installer is an easy-to-use interface tool that automatically applies the new release version to your Pentaho products. Using this simplified process, we can upgrade version 8.3 of our Pentaho products on a server or a workstation directly to version 9.2.

Minor platform enhancements-

  • Performance logging- Users can take advantage of improvements to how Pentaho logs are configured for auditing and tracking to combine logs from multiple applications for a comprehensive view and analysis of activity across the platform.
  • HBase parameter support of namespace and table name files- HBase steps in PDI now support the use of defaults and variables in namespace and table name file definitions.


Nowadays, it is very important to use the right business intelligence tool so that an organization can make the correct decision at the right time, and tools like Pentaho are widely used to make this happen. Its powerful components and high performance enable organizations to unlock the real value of data. Choosing which data will benefit our organization is itself a major task, so data needs to be properly collected, cleaned, and analyzed to identify strengths and risks and to gain new insights. That is why Pentaho came into the market: it is a complete suite that offers exceptional data integration, reporting, and presentation capabilities. It is also capable of handling large volumes of data, processing data at speed, and working with various data sources.

Written by 

Hi, I'm a Software Consultant with experience in technologies like Core Java, Advanced Java, and Functional Programming, and I am looking forward to learning and exploring more in this field. I also love competitive programming and solving live problems on LeetCode and CodeChef.
