Linear Regression is perhaps one of the best-known and best-understood algorithms in Statistics and Machine Learning. It models the relationship between one or more independent variables and a continuous dependent variable by fitting a linear equation of the form
Y = b0 + b1*x1 + b2*x2 + ... Here,
the x values represent the independent variables,
the b values are the coefficients of the independent variables (b0 is the intercept), and
Y represents the output or predicted value. The fitted equation is the best-fit line for the data: it predicts the output value for given input values with minimum error.
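To make the equation concrete, here is a minimal Python sketch that evaluates it for one observation. The function name `predict` and the coefficient values are made up for illustration; they are not taken from the Lego model.

```python
def predict(intercept, coefficients, features):
    """Evaluate Y = b0 + b1*x1 + b2*x2 + ... for a single observation."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical values, chosen only to show the arithmetic.
b0 = 5.0                 # intercept
b = [0.5, 2.0]           # b1, b2
x = [100, 3]             # x1, x2

y = predict(b0, b, x)    # 5.0 + 0.5*100 + 2.0*3
print(y)                 # 61.0
```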
In this blog, we will see how to implement Linear Regression with Knime.
Exploring the Dataset
LEGO is a popular brand of toy building bricks. They are often sold in sets to build a specific object. Each set is designed for a particular age-group, with a theme in mind and containing a different number of pieces. Each set has a different rating and price. Using this data, we want to design a Linear Regression model with Knime that can predict the price of a given Lego set.
The Lego Dataset we are using looks like this:
The different features in the dataset are:
|Feature||Description||Type|
|age||age category the set is designed for||String|
|list_price||price of the set (in $)||Double|
|num_reviews||number of reviews for the set||Integer|
|piece_count||number of pieces in the set||Integer|
|review_difficulty||difficulty level of the set||String|
|theme_name||theme the set belongs to||String|
Pre-Processing and Cleaning the data
Having a look at the data, you may notice that some of the features in the dataset are textual in nature. A linear regression model works on numbers, so these features cannot be used in the prediction model as they are.
So, once the file is read into Knime using a File Reader node, we apply the first pre-processing step to the data. We take the features with nominal values and map every category in each feature to an integer. Knime’s Category to Number node does the job for us.
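For readers who want to see what such a mapping does, here is a rough Python sketch. It assumes codes start at 0 and are assigned in order of first appearance, which may differ from how the node is configured. The category labels are hypothetical.

```python
def category_to_number(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    encoded = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)  # next unused integer code
        encoded.append(mapping[value])
    return encoded, mapping


# Hypothetical difficulty labels, for illustration only.
difficulty = ["Easy", "Average", "Easy", "Challenging"]
codes, mapping = category_to_number(difficulty)
print(codes)    # [0, 1, 0, 2]
print(mapping)  # {'Easy': 0, 'Average': 1, 'Challenging': 2}
```

Keep in mind that integer codes imply an ordering between categories, and a linear model will treat the codes as quantities.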
Now, our complete dataset is in a numerical format. So, the next step is to remove any numeric outliers that may exist. Outliers are extreme values in a feature that deviate markedly from the other observations. They may arise from experimental errors or from variability in measurement. We remove them because they can distort statistics such as the mean and, in turn, the fitted regression line. Knime’s Numeric Outliers node gives us an option to remove the rows with outliers.
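A common way to flag outliers is the interquartile range (IQR) rule. The sketch below assumes that rule with a multiplier of 1.5; the exact quartile method and settings of the node may differ. The function name and the sample rows are hypothetical.

```python
import statistics


def remove_outlier_rows(rows, column, k=1.5):
    """Drop rows whose value in `column` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [row[column] for row in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the column
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [row for row in rows if low <= row[column] <= high]


# Hypothetical rows: one set has far more pieces than the rest.
rows = [{"piece_count": c} for c in [10, 12, 11, 13, 12, 500]]
cleaned = remove_outlier_rows(rows, "piece_count")
print([row["piece_count"] for row in cleaned])  # [10, 12, 11, 13, 12]
```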
After the outliers are removed, the next step is to use Knime’s Missing Value node that allows us to replace all missing values in a feature with a fixed value, the feature’s mean, or any other statistic.
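As an illustration of the mean option, here is a small Python sketch. It assumes missing values are represented as `None`; the function name and the sample values are made up.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)  # raises StatisticsError if nothing is observed
    return [fill if v is None else v for v in values]


# Hypothetical column with one missing value.
num_reviews = [4, None, 8]
print(fill_missing_with_mean(num_reviews))  # [4, 6, 8]
```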
A Linear Regression model works under the assumption that the independent features are not strongly correlated with one another. Correlation should exist only between the independent features and the target feature. If multicollinearity exists, the coefficient estimates become unstable and the overall performance of the model is affected.
To calculate the correlation between the independent features, we configure the Rank Correlation node to use Spearman’s Rank Correlation. The output of the node is a correlation matrix.
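Spearman’s rank correlation is the Pearson correlation computed on the ranks of the values, with tied values sharing an average rank. The following Python sketch shows the idea for a single pair of columns; the function names are our own and the numbers are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is a tie.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rank correlation of two equally long sequences."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotonic but non-linear relationship still scores close to 1.
rho = spearman([1, 2, 3, 4], [1, 4, 9, 16])
```

Repeating this for every pair of columns gives the correlation matrix.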
From the output, it is clear that some independent features are highly correlated with each other. To filter these columns out, we use Knime’s Correlation Filter node, which lets us set a threshold on the correlation values of the output matrix. It filters out columns whose correlation with another column exceeds the threshold.
From the output of the above node, it is clear that we don’t want to keep the star_rating, theme_name, and val_star_rating features. So, we apply the Column Filter node to our cleaned data to filter out the unwanted features.
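The idea of threshold-based filtering can be sketched in Python as below. This is a simple greedy version and not necessarily the strategy the Correlation Filter node uses to decide which column of a correlated pair survives. The column names `a`, `b`, `c` and the correlation values are hypothetical.

```python
def filter_correlated(columns, corr, threshold):
    """Keep a column only if it is not too correlated with any kept column."""
    kept = []
    for col in columns:
        if all(abs(corr[frozenset((col, other))]) <= threshold for other in kept):
            kept.append(col)
    return kept


# Hypothetical pairwise correlations, keyed by unordered column pairs.
corr = {
    frozenset(("a", "b")): 0.95,
    frozenset(("a", "c")): 0.10,
    frozenset(("b", "c")): 0.20,
}
print(filter_correlated(["a", "b", "c"], corr, threshold=0.8))  # ['a', 'c']
```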
Train – Test Split
Finally, we have our dataset in a form that can be used for training a linear regressor and testing it. Before that, the last step is to split the complete data into Train and Test data. To do so, we use Knime’s Partitioning node. In its configuration, we choose a random split with 70% of the rows as our train data and the remaining 30% as our test data.
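A random 70/30 split like the one described above can be sketched in Python as follows. The function name and the fixed seed are our own choices, used here only to make the example repeatable.

```python
import random


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = partition(range(10))
print(len(train), len(test))  # 7 3
```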
Training and Testing the model
Knime provides the Linear Regression Learner and Regression Predictor nodes for training a linear regression model and applying it to new data. We feed the train data from the Partitioning node to the Learner node, and it produces a regression model.
Then, we feed the output model and Test data to the Predictor node that churns out the predicted values for lego set prices.
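The learner/predictor pairing can be illustrated with a small Python sketch. For brevity it fits a single feature by ordinary least squares, whereas our workflow uses several features. The function names and the numbers are hypothetical.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_all(model, x):
    """Apply a fitted (intercept, slope) model to new feature values."""
    intercept, slope = model
    return [intercept + slope * value for value in x]


# "Learner" step on the train data, "Predictor" step on the test data.
model = fit_simple_regression([1, 2, 3], [3, 5, 7])
print(model)                       # (1.0, 2.0)
print(predict_all(model, [4, 5]))  # [9.0, 11.0]
```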
Evaluating the Model
We use the following metrics for evaluating the performance of the linear regression model:
1. R-Square value
2. Mean Absolute Error
3. Mean Square Error
4. Root Mean Square Error
The three error metrics measure how much the predicted values deviate from the actual values, while R-Square measures how much of the variation in the actual values the model explains. We can directly calculate these metrics using Knime’s Numeric Scorer node, which takes the predicted feature values and actual feature values as input and produces the metrics.
From the metrics, we see that our model has an R-square value of 72.8%, which means that the model explains 72.8% of the variance in the list price of the Lego sets.
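For reference, the four metrics listed above can be computed by hand as in the Python sketch below. The function name and the sample values are made up and unrelated to the Lego results.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Compute R-square, MAE, MSE and RMSE for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": mae,
        "mse": mse,
        "rmse": sqrt(mse),
    }


# Hypothetical values: errors are -1, 0 and 1.
metrics = regression_metrics([2, 4, 6], [3, 4, 5])
print(metrics["r2"])  # 0.75
```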
The complete workflow can be seen here.