Executing Pentaho (Kettle) Jobs through Java

Reading Time: 4 minutes

Prerequisite: Basic Details of Pentaho

An enterprise-grade BI solution consists of several components: reporting tools, ETL processes, databases, and often some sort of web portal, all of which should be well integrated. ETL is usually a scheduled process, but you often want to let business users trigger it manually. A great way to do this is through some simple interface on the web portal. That way the users don't need to know the backend infrastructure, and you can handle user management, access, and so on. There are several ways to run ETL from a Java application.

The easiest approach – running an external process

This may be your best bet. It doesn't look elegant, but it works, and that's ultimately what matters. It can be as simple as:

public static void runProcess(String filename) throws IOException, InterruptedException {
    // KITCHEN_PATH points to the kitchen.sh / Kitchen.bat launcher of a local PDI installation
    Process p = new ProcessBuilder(KITCHEN_PATH, "/file:" + filename)
            .inheritIO()   // forward Kitchen's console output to our own
            .start();
    p.waitFor();
}

You will probably want to move this to a separate thread, expand and configure it so it isn't platform-dependent, inspect the output, and so on. It requires PDI installed on the server. The main downside of this approach is that the ETL runs on the same machine as the web portal and can slow it down. Still, I would happily use this method if it fits the needs – for example, for a few small jobs that business users run occasionally.
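A slightly fuller sketch of the same idea could pick the right launcher script per platform and check the exit code. The class name, the `KITCHEN_DIR` constant, and its path are my own assumptions for illustration, not part of PDI:

```java
import java.io.IOException;

public class KitchenLauncher {

    // Assumed path to a local PDI installation; adjust to your environment.
    private static final String KITCHEN_DIR = "/opt/data-integration/";

    // Kitchen.bat on Windows, kitchen.sh on Unix-like systems.
    static String kitchenScriptFor(String osName) {
        return osName.toLowerCase().contains("win") ? "Kitchen.bat" : "kitchen.sh";
    }

    public static int runJob(String jobFile) throws IOException, InterruptedException {
        String kitchen = KITCHEN_DIR + kitchenScriptFor(System.getProperty("os.name"));
        Process p = new ProcessBuilder(kitchen, "/file:" + jobFile)
                .inheritIO()   // forward Kitchen's console output to our own
                .start();
        return p.waitFor(); // Kitchen exits non-zero when the job fails
    }
}
```

Checking the exit code matters here: Kitchen reports job failures through its exit status, and swallowing it would make failed ETL runs invisible to the portal.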

The cooler approach – using the Pentaho Data Integration libraries

Pentaho provides Java libraries that allow you to run jobs and transformations directly from your Java code. I'll show a simple example of using Maven to import the required libraries and then run a simple job.
The code uses Maven to build the project. The Maven dependencies vary based on the Kettle steps involved. If the Kettle transformation involves database connectivity, you will also need to add the corresponding database driver. The pom.xml for the project is as below:

<dependencies>
        <!-- Pentaho Kettle core dependencies -->
        <dependency>
            <groupId>pentaho-kettle</groupId>
            <artifactId>kettle-core</artifactId>
            <version>5.0.0.1</version>
        </dependency>
        <dependency>
            <groupId>pentaho-kettle</groupId>
            <artifactId>kettle-dbdialog</artifactId>
            <version>5.0.0.1</version>
        </dependency>
        <dependency>
            <groupId>pentaho-kettle</groupId>
            <artifactId>kettle-engine</artifactId>
            <version>5.0.0.1</version>
        </dependency>
        <dependency>
            <groupId>pentaho-kettle</groupId>
            <artifactId>kettle-ui-swt</artifactId>
            <version>5.0.0.1</version>
        </dependency>
        <dependency>
            <groupId>pentaho-kettle</groupId>
            <artifactId>kettle5-log4j-plugin</artifactId>
            <version>5.0.0.1</version>
        </dependency>
        
        <!-- The database driver dependency. Use it if your kettle file involves database connectivity. -->
        <dependency>
            <groupId>postgresql</groupId>
            <artifactId>postgresql</artifactId>
            <version>9.1-902.jdbc4</version>
        </dependency>

</dependencies>

And then we can use an embedded Kettle environment from our code:

public static void runInternal(String filename) {
    try {
        // Initialize the embedded Kettle environment
        KettleEnvironment.init();
        // Load the job definition from the .kjb file
        JobMeta jobMeta = new JobMeta(filename, null);
        Job job = new Job(null, jobMeta);
        job.setLogLevel(LogLevel.BASIC);
        job.start();
        job.waitUntilFinished();
        if (job.getErrors() != 0) {
            System.out.println("We have errors");
        }
    } catch (KettleException e) {
        e.printStackTrace();
    }
}

I called this a cool method because, in my experience working in a custom software development company, it gives us more control over the execution of the jobs. We can read jobs from the repository, set parameters, read output parameters, monitor the log, and so on. It is like having Kitchen embedded in our application. The possibilities here are endless – we can even use PDI transformations to handle some of the business logic in our app. The drawback, as in the previous example, is that execution happens within the JVM. This can overload the web server and even lead to crashes. Here we do not need PDI pre-installed on the machine, but the libraries will be packaged in the application, which makes the deployment bigger.
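Setting parameters is where this approach really pays off. As a sketch, assuming a job that declares a named parameter called `START_DATE` (the parameter name and class name are my own; the `Job`/`JobMeta` calls are from the kettle-engine library used above):

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class ParamJobRunner {

    // Runs a job file, passing a value for a named parameter the job declares.
    // START_DATE is a hypothetical parameter name used for illustration.
    public static int runWithParam(String filename, String startDate) throws Exception {
        KettleEnvironment.init();
        JobMeta jobMeta = new JobMeta(filename, null);
        Job job = new Job(null, jobMeta);
        job.setParameterValue("START_DATE", startDate);
        job.activateParameters();   // make the parameter value visible to the job
        job.start();
        job.waitUntilFinished();
        return job.getErrors();     // 0 means the job finished cleanly
    }
}
```

This is how a web form on the portal can feed user input (date ranges, file names, flags) straight into the ETL without touching the .kjb file.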

Taking the above two approaches to the next level

The best thing about these techniques is that they live in our Java code, which of course means we can do whatever we want with that code and extend it in any way. That is kind of obvious, but I still wanted to mention it because it lets us do simple workarounds and avoid the risks. For instance, the biggest downside we noticed here is that execution happens within the JVM and can load up the web server. With a better architecture of our enterprise application we can easily move that execution to another instance of the JVM (another server) or even load-balance it across different servers. A simple solution would be to create a separate web service that executes the ETL and call it from the web portal. Another approach would be to use a messaging service and create listeners that execute jobs using some of the above methods.
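One hedged sketch of the "separate web service" idea, using only the JDK's built-in HTTP server. The class name, port, and endpoint are made up, and the actual job execution (here just a placeholder print) would delegate to whichever of the earlier methods you prefer:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EtlService {

    // Pulls the job file name out of a query string like "job=/etl/daily.kjb".
    static String jobFromQuery(String query) {
        if (query == null) return null;
        for (String pair : query.split("&")) {
            if (pair.startsWith("job=")) return pair.substring(4);
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // A small worker pool keeps long ETL runs off the HTTP request threads.
        ExecutorService etlPool = Executors.newFixedThreadPool(2);
        HttpServer server = HttpServer.create(new InetSocketAddress(8090), 0);
        server.createContext("/runJob", exchange -> {
            String jobFile = jobFromQuery(exchange.getRequestURI().getQuery());
            if (jobFile != null) {
                // Delegate to one of the earlier approaches (external Kitchen
                // process or the embedded Kettle environment) on a worker thread.
                etlPool.submit(() -> System.out.println("Would run: " + jobFile));
            }
            byte[] body = "accepted".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(202, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
    }
}
```

Replying 202 Accepted immediately and running the job asynchronously is the point of the design: the portal stays responsive no matter how long the ETL takes.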

The enterprise way – without writing a single piece of code

Pentaho Data Integration comes with a tool known as Carte, which essentially provides a web service interface to the Pentaho server, allowing us to execute jobs remotely. Running it is quite straightforward – in the data-integration/pwd folder you have some basic configuration XMLs for the server, and there is great documentation on how to configure it according to your needs. It will also require a repository setup for the jobs. Once running, it can be accessed through a simple web interface.

To run a process, you do a call like:

http://carte.server:8080/kettle/runJob/?job=MY_JOB_ON_THE_REPOSITORY&level=Debug

This approach permits remote execution on the server, so it doesn't suffer from the main drawback of the previous methods. If you run complex ETLs that take hours and need to run on different machines and servers, this should be your method.
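Calling Carte from Java needs nothing beyond the JDK. A minimal sketch, assuming Carte's stock basic authentication with its default cluster/cluster account (the class and helper names are my own):

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CarteClient {

    // Builds the runJob URL, encoding the job name so spaces etc. survive.
    static String buildRunJobUrl(String host, int port, String jobName, String level)
            throws Exception {
        return "http://" + host + ":" + port + "/kettle/runJob/?job="
                + URLEncoder.encode(jobName, StandardCharsets.UTF_8.name())
                + "&level=" + level;
    }

    public static String runJob(String host, int port, String jobName) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL(buildRunJobUrl(host, port, jobName, "Basic")).openConnection();
        // Carte ships with basic authentication; cluster/cluster is its default account.
        String auth = Base64.getEncoder()
                .encodeToString("cluster:cluster".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        try (InputStream in = conn.getInputStream()) {
            // Carte answers with a small XML status document
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

From here, the returned XML can be parsed to surface the job status back on the portal.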

Conclusion

There are several ways to execute our Pentaho Data Integration jobs from Java code. I covered just three, but there are probably more. For enterprise applications, most people should go for the enterprise way, because it is the most robust and, once set up, probably the easiest to use. It does make the infrastructure more complicated – you need a server and a repository, and some of its advanced features even require the enterprise edition of Pentaho Data Integration. There are scenarios where the other two approaches work just fine, so choose the best one for your application!

For more info, please visit the below site:

knoldus