Pentaho – Hadoop Cluster connection

Prerequisite: Basic overview of Pentaho.

With Pentaho you can tackle many big-data analytics problems without writing a single line of code and generate the output required for analysis. It can establish connections with Big Data platforms such as Google Dataproc, Hortonworks Data Platform (HDP), Amazon Elastic MapReduce (EMR), etc.

It can also be integrated with their services, such as HDFS, HBase, Oozie, and ZooKeeper, for flexibility.
Please refer to the generalized architecture diagram below.

Steps to connect to a Hadoop cluster:

1. Download the driver and set up the property file

  • To download the driver, visit the Drivers and Other info link for your client version.
  • In your server’s plugins directory, open the property file for the big-data plugin, find the hadoop.configurations.path property, and set its value to the metastore directory. For example, /home/devuser/.pentaho/metastore
  • After making these changes, start the Pentaho Server.
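
The property change in this step can also be scripted. The following is a minimal sketch, not an official Pentaho tool: set_property is a hypothetical helper name, and it assumes the property file uses the usual Java-style key=value layout. The path of the property file is not stated here, so you must pass in the location from your own installation.

```python
from pathlib import Path


def set_property(properties_file: Path, key: str, value: str) -> None:
    """Set key=value in a Java-style properties file (hypothetical helper).

    An existing entry for the key is replaced in place; if the key is
    missing, a new entry is appended at the end of the file.
    """
    lines = properties_file.read_text().splitlines() if properties_file.exists() else []
    updated = []
    found = False
    for line in lines:
        stripped = line.strip()
        # Comments and unrelated lines are kept exactly as they are.
        is_comment = stripped.startswith("#")
        if not is_comment and stripped.split("=", 1)[0].strip() == key:
            updated.append(f"{key}={value}")
            found = True
        else:
            updated.append(line)
    if not found:
        updated.append(f"{key}={value}")
    properties_file.write_text("\n".join(updated) + "\n")
```

For example, calling set_property(path_to_your_property_file, "hadoop.configurations.path", "/home/devuser/.pentaho/metastore") would apply the value shown in the example above. Back up the file first, since the sketch rewrites it in place.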

2. Install the driver

  • Browse to the <pentahohomedir>/server/pentaho-server/pentaho-solutions/ADDITIONAL-FILES/drivers directory, where <pentahohomedir> is the directory where Pentaho is installed.
  • Select the driver (.kar file) you want to add (i.e. the compatible driver file downloaded from the support page) and copy it to that server directory.
  • Restart the server and services.
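
If you install drivers on several servers, the copy step can be sketched in a few lines of Python. This is only an illustration: install_driver is a hypothetical helper name, and the directory layout is taken from the path given in this step, so verify it against your own installation before relying on it.

```python
import shutil
from pathlib import Path


def install_driver(kar_file: Path, pentaho_home: Path) -> Path:
    """Copy a driver .kar file into the server's drivers directory.

    Hypothetical helper: pentaho_home is the directory where Pentaho is
    installed. Returns the path of the copied file.
    """
    if kar_file.suffix != ".kar":
        raise ValueError(f"expected a .kar file, got {kar_file.name}")
    drivers_dir = (
        pentaho_home / "server" / "pentaho-server" / "pentaho-solutions"
        / "ADDITIONAL-FILES" / "drivers"
    )
    # Fail early instead of creating the directory, so a wrong
    # pentaho_home is noticed rather than silently accepted.
    if not drivers_dir.is_dir():
        raise FileNotFoundError(f"drivers directory not found: {drivers_dir}")
    return Path(shutil.copy2(kar_file, drivers_dir))
```

The sketch only copies the file; the server and services still need to be restarted afterwards, as noted above.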

3. Install the client plugins as described below

Go to “Tools -> Marketplace”, search for “Apache Hadoop”, and click Install.
Please refer to the image below.

Obtaining the Installation Materials

Consult the Welcome Kit email that was sent to you after completing the sales process. This email contains user credentials for the Enterprise Edition FTP site, where you can download individual archive packages for the Data Integration for Hadoop package, and the desktop client tools needed to design Hadoop jobs and transformations. Here are the packages you need for each platform and distribution:

  • Data Integration for Hadoop: phd-ee-4.2.0-GA.tar.gz
  • Client tool Windows package:
  • Client tool Linux/Solaris/OS X package: pdi-ee-client-4.2.0-GA.tar.gz
  • Pentaho client tool patches for Apache Hadoop deployments:

For more details, please visit the links below:

Sample Core Design

And Other…