Prerequisites: Basic knowledge of Big Data and ETL.
What is PDI?
PDI stands for Pentaho Data Integration. It is a tool that provides ETL capabilities to effectively manage large and complex data ingestion pipelines.
Its use cases include:
- Loading huge data sets into databases.
- Performing simple to complex transformations on data.
- Data migration between different databases.
and many more…
Installing PDI in your system:
We will start off by installing the PDI desktop client application, known as Spoon. Please follow the steps below to install PDI on your system:
Step 1: Download from SourceForge: https://sourceforge.net/projects/pentaho/files/latest/download
Step 2: Make sure Java 8 or higher is installed on your system.
Step 3: Extract the downloaded zip file.
Step 4: Make sure the environment variables are set.
There is a Windows batch file (set-pentaho-env.bat) that lets you set up environment-specific variables for PDI. You will find it in the data-integration folder.
Step 5: Create a shortcut and run Spoon. You will find a batch file named spoon.bat. Create a shortcut for it on your desktop or any other preferred location. Finally, run the file (allow it network access through the firewall).
You will be greeted with an intuitive, user-friendly interface with which you can create data ingestion pipelines.
Working with Spoon:
We will walk through a basic transformation to show how Spoon works.
We’ll start off by creating a transformation. To open a new transformation, click the “New Transformation” option on the welcome screen, or right-click “Transformations” in the View tab and create one from there.
Now you can see the various steps available in the “Design” tab, grouped under their respective categories. Go to Input and drag and drop the “CSV file input” step into the work area. Double-click the “CSV file input” step and a configuration window will appear. Next, click on Browse and select your CSV file. For demonstration purposes we will use a sample CSV file already provided in the installation folder under the path: samples\transformations\files\sales_data.csv
Now click on “Get Fields” and the grid will update with the fields detected in the CSV file. The “Preview” option shows a preview of the file’s data.
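Conceptually, “Get Fields” reads the header row of the file to discover its columns. Here is a minimal Python sketch of that idea; the real sales_data.csv has many more columns, and this three-column subset is assumed purely for illustration:

```python
import csv
import io

# A small stand-in for the sample file (assumed columns, not the real file).
sample = io.StringIO(
    "ORDERNUMBER,CITY,POSTALCODE\n"
    "10107,NYC,10022\n"
    "10121,Reims,51100\n"
)

reader = csv.DictReader(sample)
print(reader.fieldnames)  # the column names a "Get Fields"-style scan would detect
```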
Now we can further process the data obtained from the input step by adding a filter step. Go to the Flow category and drag and drop the “Filter rows” step. Click on the “CSV file input” step, then click on its output connector and connect it to the “Filter rows” step. You will be prompted to choose the kind of hop.
Select “Main output of step” as shown in the picture above.
Now, double-click the “Filter rows” step to configure it.
Using this filter we want to filter out the null values in the “POSTALCODE” column. To do this, we need to set the filter condition. Please set the filter condition as mentioned below.
Select the condition as “NOT”, the field as “POSTALCODE”, and finally click on the value and click OK to pass a null value.
Next, we need to route the output from the filter.
From the Output category, add a “Text file output” step and a “Microsoft Excel output” step. Connect an output hop from “Filter rows” to “Microsoft Excel output” and select the “Result is FALSE” option, as we want the rows with null values sent to this Excel file for further rectification. Then connect another output hop from “Filter rows” to “Text file output” and select the “Result is TRUE” option, as we want the rows without null values in this text file.
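The routing that these two hops perform can be sketched in plain Python like this (the rows and the two-column layout are made up for illustration):

```python
# Toy rows standing in for the CSV input; only two columns are shown.
rows = [
    {"ORDERNUMBER": "10107", "POSTALCODE": "10022"},
    {"ORDERNUMBER": "10121", "POSTALCODE": ""},      # null/empty postal code
    {"ORDERNUMBER": "10134", "POSTALCODE": "75016"},
]

def postalcode_present(row):
    # Mirrors the filter condition: true when POSTALCODE is not null.
    return row["POSTALCODE"] not in (None, "")

# "Result is TRUE" hop -> text file output (clean rows).
true_branch = [r for r in rows if postalcode_present(r)]

# "Result is FALSE" hop -> Excel output (rows needing rectification).
false_branch = [r for r in rows if not postalcode_present(r)]
```

Each input row goes down exactly one of the two branches, which is why both hops together account for every row the filter receives.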
Now, let’s configure the output files one by one. First, double-click on “Text file output” to configure it. Click on Browse, set your preferred location, and name the file. Then change the extension to .csv to get the output as a CSV file. Also, go to the Content tab and set the “Separator” to a comma (,).
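What the “Text file output” step produces here is a plain comma-separated file. A rough Python sketch of that output (the rows and column names are assumptions, not the real sample data):

```python
import csv
import io

# Hypothetical rows arriving on the "Result is TRUE" hop.
clean_rows = [
    {"ORDERNUMBER": "10107", "POSTALCODE": "10022"},
    {"ORDERNUMBER": "10134", "POSTALCODE": "75016"},
]

# Write with a comma separator, like the Content tab setting above.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["ORDERNUMBER", "POSTALCODE"], delimiter=",")
writer.writeheader()
writer.writerows(clean_rows)
print(out.getvalue())
```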
Similarly, set the path of the “Microsoft Excel output” file.
Finally, run the transformation by clicking on Run. Spoon will prompt you to save the transformation first; go ahead and do it. After the transformation has finished, you will find the output files in their specified locations. Check the output files to experience the magic of PDI.
This was a very basic example of a transformation to get you started. PDI offers many other steps and features for building and managing large, complex data ingestion pipelines.
To get more information about all the features, please check the official documentation here: https://help.hitachivantara.com/Documentation/Pentaho/
To stay updated with the new and upcoming technologies, please stay tuned to: Knoldus Blogs