Working with Big Data and Hadoop in PDI

Big Data in Pentaho

The term big data applies to very large, complex, and dynamic datasets that need to be stored and managed over a long time. To derive benefits from big data you need the ability to access, process, and analyze data as it is being created. The size and structure of big data make it very inefficient to maintain and process it using traditional relational databases.

Big data solutions re-engineer the components of traditional databases—data storage, retrieval, query, processing—and massively scale them.

Pentaho Big Data Overview

Pentaho increases speed-of-thought analysis against even the largest of big data stores by focusing on the features that deliver performance.

  • Instant access—Pentaho provides visual tools to make it easy to define the sets of data that are important to you for interactive analysis. These data sets and associated analytics can be easily shared with others, and as new business questions arise, new views of data can be defined for interactive analysis.
  • High performance platform—Pentaho is built on a modern, lightweight, high performance platform. This platform fully leverages 64-bit, multi-core processors and large memory spaces to efficiently leverage the power of contemporary hardware.
  • Extreme-scale, in-memory caching— Pentaho is unique in leveraging external data grid technologies like Infinispan and Memcached which load vast amounts of data into memory so that it is instantly available for speed-of-thought analysis.
  • Federated data integration—Data can be extracted from multiple sources including big data and traditional data stores, integrated together and flowed directly into reports, without needing an enterprise data warehouse or data mart.

If you are familiar with Pentaho Data Integration (PDI) ETL steps, transformations, job entries, and jobs, you can apply the same concepts to working with big data and Hadoop in PDI. Here are some of the typical functions you might perform using PDI with Hadoop:

1. Load data into a Hadoop cluster

2. Transform data within a Hadoop cluster

3. Extract data from a Hadoop cluster

4. Report on data within a Hadoop cluster

5. Access other big data-related technology and databases, such as MongoDB, Cassandra, Hive, HBase, Sqoop, and Oozie using PDI transformations or jobs.

Hadoop and PDI Workflows

There are two different ways to use PDI/Spoon jobs to connect to and extract Hadoop data, transform it, execute MapReduce operations, and if you want to, return the data to Hadoop.

• Use Pentaho MapReduce to interactively design the data flow for a MapReduce job without writing scripts or code.

• Use PDI Hadoop Job Executor to schedule and execute your existing MapReduce applications.

Pentaho MapReduce Workflow

PDI and Pentaho MapReduce enable you to pull data from a Hadoop cluster, transform it, and pass it back to the cluster. Here is how you would approach doing this.

PDI Transformation

Start by deciding what you want to do with your data. Open a PDI transformation, drag the appropriate steps onto the canvas, and configure them to meet your data requirements. Then drag the purpose-built Hadoop MapReduce Input and Hadoop MapReduce Output steps onto the canvas and configure both as needed. PDI provides these steps so that you do not have to write Java classes for this functionality. Once you have configured all the steps, add hops to sequence them into a transformation.

Hadoop communicates in key/value pairs. PDI uses the MapReduce Input step to define how key/value pairs from Hadoop are interpreted by PDI. You configure the step in the MapReduce Input dialog box.

PDI uses a MapReduce Output step to pass the output back to Hadoop. The MapReduce Output dialog box enables you to configure the MapReduce Output step.
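To make the key/value contract concrete, here is a minimal plain-Python sketch of what a mapper does with the pairs it receives. It is not PDI or Hadoop code. It assumes text input where the key is the byte offset of a line and the value is the line itself, and the sample records, the field layout, and the function name `map_records` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical sample input: (byte offset, line of text) pairs, the shape
# that plain text input typically hands to a mapper.
RECORDS = [
    (0, "2024-01-03,DE,19.90"),
    (20, "2024-01-03,FR,5.00"),
    (39, "2024-01-04,DE,7.10"),
]


def map_records(pairs: Iterable[Tuple[int, str]]) -> Iterator[Tuple[str, float]]:
    """Turn each incoming (offset, line) pair into an outgoing (country, amount) pair."""
    for _offset, line in pairs:
        # The incoming key (the offset) is ignored; only the value is parsed.
        _date, country, amount = line.split(",")
        # The outgoing key/value pair is what would be handed back to Hadoop.
        yield country, float(amount)


mapped = list(map_records(RECORDS))
print(mapped)
```

In a PDI transformation, the MapReduce Input step plays the role of the function's argument and the MapReduce Output step plays the role of the `yield`, with ordinary steps in between doing the parsing.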


Once the mapper transformation is created, you are ready to include it in a Pentaho MapReduce job entry and build a MapReduce job. Again, there is no need to provide a Java class. Configure the Pentaho MapReduce entry to use the transformation as a mapper. Drag a Start job entry, any other job entries you need, and result job entries to handle the output onto the canvas. Add hops to sequence the entries into a job that you execute in PDI.

The finished job runs from the Start entry, through the Pentaho MapReduce entry, to the entries that handle the result.

PDI Hadoop Job Workflow

PDI enables you to execute a Java class from within a PDI/Spoon job to perform operations on Hadoop data. You approach this the same way you would any other PDI job. The Hadoop Job Executor entry handles the Java class; in the original illustration, it appears as the WordCount – Advanced entry.
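For readers who have not seen a WordCount application, the following is an in-memory Python sketch of the map, shuffle, and reduce phases such a job performs. It is a simulation for illustration only, not the Java class that the job entry runs and not the Hadoop API; the function names and the sample lines are assumptions.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


# Hypothetical input lines.
counts = word_count(["big data in PDI", "Big data in Hadoop"])
print(counts)
```

On a real cluster, the map and reduce phases run in parallel across nodes and the shuffle is handled by Hadoop; the sketch only shows the data flow.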

Hadoop Hive-Specific SQL Limitations

There are a few key limitations in Hive that prevent regular Metadata Editor features from working as intended and that limit the structure of your SQL queries in Report Designer.

1. Outer joins are not supported.

2. Each column can be used only once in the SELECT clause. Duplicate columns in SELECT statements cause errors.

3. Conditional joins can only use the = conditional unless you use a WHERE clause. Any non-equal conditional in a FROM statement forces the Metadata Editor to use a cartesian join and a WHERE clause conditional to limit it. This is not much of a limitation, but it may seem unusual to experienced Metadata Editor users who are accustomed to working with SQL databases.
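The cartesian-join rewrite described above can be sketched as follows. This example uses SQLite through Python's built-in `sqlite3` module as a stand-in, not Hive, simply to show that a non-equal join condition and a cartesian join limited by a WHERE clause return the same rows. The `orders` and `tiers` tables and their contents are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    CREATE TABLE tiers (tier TEXT, min_amount REAL, max_amount REAL);
    INSERT INTO orders VALUES (1, 5.0), (2, 50.0), (3, 500.0);
    INSERT INTO tiers VALUES ('small', 0, 10), ('medium', 10, 100), ('large', 100, 1000);
""")

# Form with a non-equal conditional in the join itself.
non_equi_join = """
    SELECT o.order_id, t.tier
    FROM orders o
    JOIN tiers t ON o.amount >= t.min_amount AND o.amount < t.max_amount
    ORDER BY o.order_id
"""

# Equivalent form: a cartesian join, limited by a WHERE clause conditional.
cross_join_with_where = """
    SELECT o.order_id, t.tier
    FROM orders o CROSS JOIN tiers t
    WHERE o.amount >= t.min_amount AND o.amount < t.max_amount
    ORDER BY o.order_id
"""

rows_on = conn.execute(non_equi_join).fetchall()
rows_where = conn.execute(cross_join_with_where).fetchall()
print(rows_on == rows_where, rows_where)
```

The second query is the shape the text describes: every row of one table is paired with every row of the other, and the WHERE clause then discards the pairs that do not satisfy the condition.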


Pentaho provides a complete big data analytics solution that supports the entire big data analytics process. From big data aggregation, preparation, and integration, to interactive visualization, analysis, and prediction, Pentaho allows you to harvest the meaningful patterns buried in big data stores. Analyzing your big data sets gives you the ability to identify new revenue sources, develop loyal and profitable customer relationships, and run your organization more efficiently and cost effectively.

Written by 

Mohd Uzair is a software intern at Knoldus. He is passionate about Java programming. He is recognized as a good team player, a dedicated and responsible professional, and a technology enthusiast. He is a quick learner and curious to learn new technologies. His hobbies include watching movies, surfing YouTube, and playing video games.