MERGING DATA STREAMS, DATA CLEANSING AND VALIDATION IN PENTAHO

Reading Time: 5 minutes

Merging Data Streams

This section of the blog will discuss the considerations to be kept in mind when merging data from different streams.

In spite of the fact that there are many steps that can receive multiple inputs and therefore merge data, it is better to use specialized steps. When using specialized steps, it is ensured that the rows are merged in a particular order.  In the absence of a specialized step, the rows will be merged randomly, which is not desirable.

It is generally desired to add all the data of one stream to the end of the other stream. Below is an example, which shows two data streams. Upon merging these two streams, the second stream will simply be placed below the first stream. This is known as appending the streams.

Appending Data Streams

The other option is merging the rows and sorting them according to a particular order. As an example, assume we have two data streams as shown in the image below.

Sorted Merge Streams

Merging these two streams and sorting them according to column A is what is known as a sorted merge. In a sorted merge, data from two streams are merged and then sorted according to one of the columns. Therefore, either the entire block of one stream will be added to the other block, which is defined as appending stream, or the two data will be combined and sorted based on one of the columns, which is defined as sorted merge. Merging data streams usually entails these two options.

Whenever we merge streams, we also need to consider duplication. Even if there is no duplication in the individual streams, it is possible that some records are repeated in two or more streams. When these streams are merged, the final stream will contain duplicate values. Hence, the standard practice would be to deduplicate the primary key column after merging streams so as not to end up with duplicate values in the primary key column. However, what should be done if the output stream contains duplicate values?

A simple solution is to delete one of the rows. This is usually the easiest solution and works fine. Consider the case where the primary key is duplicated, but other values are different. For instance, corresponding to a particular product ID, which functions as the primary key in the product table, there is also an office chair, as well as office paper. Now, the primary key, which is the product ID, is supposed to identify only one product, since it should be unique. But how do you know which one is correct and which one should be deleted? For such a situation, we apply error handling measures.

While discussing cleansing and validating data, we will cover the complete error handling process. In this case, we will use the simplest error handling method, which is to discard the second instance of the same primary key row. The duplicate row, if any, will simply be removed.

Another thing that is of concern while merging data is the importance of sorting before deduplicating. The step that we use to remove duplicates is called the unique rows step. If we use this step, it will always ask us to sort the data beforehand. In this way, it is much more efficient and faster as well. A lot of processing power is already used while sorting, and once sorted, it is easy to deduplicate, since similar rows are beside each other.

Last but not least, we must ensure that the metadata of the streams we merge is the same. In other words, the variable names, their types, their order of occurrence, their lengths should all match.

Data Cleansing

Let’s talk about data cleansing now. Data cleansing is the process of correcting mistakes or typing errors made when collecting or aggregating the data.

Additionally, it includes the format in which the data is collected. As an example, you may enter 5.0 for a variable that has integer values, instead of 5. Likewise, when noting the date, only the month and date were entered, but not the year. While you know it would be the current year, you would like your data to show that year as well. These minor changes we make to improve the quality of the data are called data cleansing.

For data cleansing, validation, or transformation, the steps are nearly identical. In the same manner, as cleansing is done, validation is also carried out. The cleansing process begins when data is extracted and merged. The last example in the previous section was also data cleansing, as we ensured the primary key values were unique and removed duplicates from the product table.

Data Validation

Since we have discussed data cleansing, let us discuss data validation. Data cleansing is simply correcting data errors or changing formats to improve the quality of data. Data validation is the act of applying business rules to which the data needs to conform.

When collecting customer data, for example, if all our customers are supposed to be adults (18 years or older) then, based on our business rules, we have to make sure that our customer age is always over 18 years in our dataset. Other examples of data validation rules include that the age field should be of an integer type, and the reference values from one table should be available in the reference table. We often see that customers enter a credit card number, email address, or phone number, and they have a predefined format. Therefore, we want the entered values for these fields to conform to the predefined format.

All businesses have business rules, which must be applied to the data. When data does not conform to the applied rule, then we say there are some error rows. When error rows don’t match up with data validation rules, then we need error handling.

Error handling refers to what we do with rows that do not comply with our data validation rules.

In case you find error rows in your data, there are four options you can choose from.

One is simply discarding those error rows. By applying a filter and mentioning the business rule as a constraint, whichever rows do not meet the constraint can be immediately removed from the stream of data. There are specific steps in your flow that can help you delete those data rows.

The second option is to remove the error rows, process them, treat them, get them corrected, and then add them back into the mainstream. It can be tricky to set this up because we do not know what type of error will occur, but if there are some standard categories of errors, you can set up error processing and merge those corrected rows with the mainstream.

Third, we can log the error rows, so that when we run any transformation, we can see what steps our software has taken. 

Fourth, all these error rows can be written into a file or a dedicated table and shared with the corresponding team, which can correct the error rows.

References –

To know more about Pentaho Data Integration, you can visit this link.

To read more tech blogs, feel free to visit Knoldus Blogs.

knoldus

Written by 

Akshat Mathur is a Software Consultant at Knoldus Inc.. He has worked on Java, Angular 11 and Spring Boot for more than a year. Learning is his passion and solving problems is his forte.