Pentaho Data Integration – Getting Started With Transformations

Reading Time: 5 minutes

Pentaho Data Integration (PDI) is an extract, transform, and load (ETL) solution that uses an innovative metadata-driven approach.

PDI includes the DI Server, a design tool (Spoon), three command-line utilities (Pan, Kitchen, and Carte), and several plugins.

You can download Pentaho from https://sourceforge.net/projects/pentaho/

Uses of Pentaho Data Integration

Pentaho Data Integration is an extremely flexible tool that addresses a broad range of use cases, including:

  • Data warehouse population with built-in support for slowly changing dimensions and surrogate key creation
  • Data migration between different databases and applications
  • Loading huge data sets into databases taking full advantage of cloud, clustered, and massively parallel processing environments
  • Data Cleansing with steps ranging from very simple to very complex transformations
  • Data Integration including the ability to leverage real-time ETL as a data source for Pentaho Reporting
  • Rapid prototyping of ROLAP schemas
  • Hadoop functions: Hadoop job execution and scheduling, simple Hadoop MapReduce design, Amazon EMR integration

Transformations

A transformation is a network of logical tasks called steps. Transformations are essentially data flows. In the example below, the database developer has created a transformation that reads a flat file, filters it, sorts it, and loads it into a relational database table.


The two main components associated with transformations are steps and hops:

Steps

Steps are the building blocks of a transformation, for example a text file input or a table output. There are over 140 steps available in Pentaho Data Integration and they are grouped according to function; for example, input, output, scripting, and so on. Each step in a transformation is designed to perform a specific task, such as reading data from a flat file, filtering rows, and logging to a database as shown in the example above. Steps can be configured to perform the tasks you require.

Hops

Hops are data pathways that connect steps together and allow schema metadata to pass from one step to another. In the image above, the steps appear to execute in sequence; in fact they do not. Hops determine the flow of data through the steps, not necessarily the sequence in which the steps run. When you run a transformation, each step starts up in its own thread and pushes and passes data to the steps downstream.
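This threaded, push-based execution model can be sketched in plain Python. The pipeline below is a toy illustration, not a PDI API: each "step" runs in its own thread, and each "hop" is a queue carrying rows between steps.

```python
import queue
import threading

SENTINEL = None  # marks the end of the row stream

def generate_rows(out_hop, limit):
    # Input step: emits rows into its outgoing hop.
    for i in range(limit):
        out_hop.put({"id": i})
    out_hop.put(SENTINEL)

def filter_rows(in_hop, out_hop):
    # Middle step: passes through only rows with an even id.
    while (row := in_hop.get()) is not SENTINEL:
        if row["id"] % 2 == 0:
            out_hop.put(row)
    out_hop.put(SENTINEL)

def collect_rows(in_hop, sink):
    # Output step: drains its incoming hop.
    while (row := in_hop.get()) is not SENTINEL:
        sink.append(row)

hop1, hop2, result = queue.Queue(), queue.Queue(), []
steps = [
    threading.Thread(target=generate_rows, args=(hop1, 10)),
    threading.Thread(target=filter_rows, args=(hop1, hop2)),
    threading.Thread(target=collect_rows, args=(hop2, result)),
]
for s in steps:   # every step starts at once, as in PDI
    s.start()
for s in steps:
    s.join()
print(len(result))  # 5 rows survive the filter
```

Note that no step waits for an upstream step to finish before starting: all three threads run concurrently, and rows flow through the hops as soon as they are produced.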

Create Transformations

Follow these instructions to begin creating your transformation.

  1. Click File > New > Transformation.
  2. Under the Design tab, expand the Input node, then select and drag a Generate Rows step onto the canvas. Note: If you don’t know where to find a step, there is a search function in the upper-left corner of Spoon. Type the name of the step in the search box. Possible matches appear under their associated nodes. Clear your search criteria when you are done searching.
  3. Expand the Flow node; click and drag a Dummy (do nothing) step onto the canvas.
  4. To connect the steps to each other, you must add a hop. Hops describe the flow of data between steps in your transformation. To create the hop, click the Generate Rows step, then press and hold the <SHIFT> key and draw a line to the Dummy (do nothing) step. Note: Alternatively, you can draw hops by hovering over a step until the hover menu appears. Drag the hop painter icon from the source step to your target step.
  5. Double-click the Generate Rows step to open its edit properties dialog box.
  6. In the Limit field, type 100000. This limits the number of generated rows to 100,000.
  7. Under Name, type FirstCol.
  8. Under Type, select String.
  9. Under Length, type 150.
  10. Under Value, type My First Step. Click OK to exit the Generate Rows edit properties dialog box.
  11. Now, save your transformation.
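Conceptually, the Generate Rows step configured above produces 100,000 identical rows, each with a single String field named FirstCol. A minimal Python stand-in (illustrative only, not how PDI is implemented) looks like this:

```python
# Toy stand-in for the Generate Rows step configured above:
# 100,000 identical rows with one String field, FirstCol.
LIMIT = 100_000

def generate_rows(limit=LIMIT):
    value = "My First Step"
    for _ in range(limit):
        yield {"FirstCol": value}

rows = list(generate_rows())
print(len(rows))            # 100000
print(rows[0]["FirstCol"])  # My First Step
```

The Limit field plays the role of the loop bound here; the Name, Type, Length, and Value columns define the one field each generated row carries.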

Save Your Transformation

  1. In Spoon, click File > Save As. The Transformation Properties dialog box appears.
  2. In the Transformation Name field, type First Transformation.
  3. In the Directory field, click the Folder Icon to select a repository folder where you will save your transformation.
  4. Expand the Home directory and double-click the admin folder. Your transformation will be saved in the admin folder in the DI Repository.
  5. Click OK to exit the Transformation Properties dialog box. The Enter Comment dialog box appears.
  6. Click in the Enter Comment dialog box and press <Delete> to remove the default text string. Type a meaningful comment about your transformation. The comment and your transformation are tracked for version control purposes in the DI Repository.
  7. Click OK to exit the Enter Comment dialog box and save your transformation.

Run Your Transformation Locally

  1. In Spoon, go to File > Open. The contents of the repository appear.
  2. Navigate to the folder that contains your transformation. If you are a user with administrative rights, you may see the folders of other users.
  3. Double-click on your transformation to open it in the Spoon workspace. Note: If you followed the exercise instructions, the name of the transformation is First Transformation.
  4. In the upper left corner of the workspace, click Run. The Execute a Transformation dialog box appears. Notice that Local Execution is enabled by default.
  5. Click Launch. The Execution Results appear in the lower pane.
  6. Examine the contents under Step Metrics.

Build a Job

  1. In the Spoon menu bar, go to File > New > Job. Alternatively, click New in the toolbar.
  2. Click the Design tab. The nodes that contain job entries appear.
  3. Expand the General node and select the Start job entry.
  4. Drag the Start job entry to the workspace (canvas) on the right. The Start job entry defines where the execution will begin.
  5. Expand the General node, select and drag a Transformation job entry on to the workspace.
  6. Use a hop to connect the Start job entry to the Transformation job entry.
  7. Double-click on the Transformation job entry to open its properties dialog box.
  8. Under Transformation specification, click Specify by name and directory.
  9. Click Browse to locate your transformation in the solution repository.
  10. In the Select repository object view, expand the directories. Locate First Transformation and click OK. The name of the transformation and its location appear next to the Specify by name and directory option.
  11. Under Transformation specification, click OK.
  12. Save your job; call it First Job. Steps used to save a job are nearly identical to saving a transformation.
  13. Click Run. The Execute a Job dialog box appears.
  14. Choose Local Execution and click Launch. The Execution Results panel opens, displaying the job metrics and log information for the job execution.

Executing Transformations

When you are done modifying a transformation or job, you can run it by clicking Run in the main menu toolbar, or by pressing F9. There are three options that allow you to decide where you want your transformation to be executed:

  • Local Execution — The transformation or job executes on the machine you are currently using.
  • Execute remotely — Allows you to specify a remote server where you want the execution to take place. This feature requires that you have the Data Integration Server running, or Data Integration installed on a remote machine running the Carte service. To use remote execution, you first must set up a slave server.
  • Execute clustered — Allows you to execute a transformation in a clustered environment.
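Transformations can also be run outside of Spoon with Pan, PDI's command-line transformation runner (Kitchen is the equivalent for jobs). The sketch below assembles a Pan invocation in Python; the .ktr file path is hypothetical, and only the widely documented -file and -level flags are used:

```python
# Sketch: building a Pan command line to run a transformation from a
# file, outside of Spoon. The file path below is hypothetical; adjust
# it to your own environment.
def pan_command(ktr_path, log_level="Basic"):
    return [
        "./pan.sh",            # pan.bat on Windows
        f"-file={ktr_path}",   # path to the .ktr transformation file
        f"-level={log_level}", # logging level, e.g. Basic or Detailed
    ]

cmd = pan_command("/home/admin/First Transformation.ktr")
print(" ".join(cmd))
```

In practice you would pass this list to a process runner (for example, `subprocess.run(cmd)` from the PDI installation directory) or simply type the equivalent command in a shell.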

Conclusion

In this blog, we learned about transformations: how to create, save, and execute them, and how to wrap a transformation in a job.


Written by 

Chiranjeev Kumar is a software intern at Knoldus. He is passionate about Java programming. He is recognized as a good team player, a dedicated and responsible professional, and a technology enthusiast. He is a quick learner, curious to explore new technologies. His hobbies include listening to music and playing video games.
