KNIME is definitely a dream for data scientists. It makes the work of a data scientist much easier. If you haven’t heard about KNIME, you can find all about it in our blog Knime Analytics Platform: A dream for a data scientist.
Continuing on, in this blog we will see how easy it is to create visualizations in KNIME. We will look at 4 types of visualizations that are most commonly used in data exploration.
The data we are using for the visualizations comes from the Kaggle House Prices dataset, which can be downloaded from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
A histogram is used to examine the distribution of a continuous variable. It divides the data into bins, and plots the frequency of occurrences within the range of each bin. The column we are going to visualize is SalePrice.
- Search for the Histogram node in the Node Repository, and drag it to the Workflow Editor.
- Connect the Histogram node to the File Reader node.
- Right-click on the Histogram node and select Configure. Choose SalePrice as both the Binning Column and the Aggregation Column. The Number of Bins can be anything that makes sense for the data. The default is 10. Click OK to save this configuration and close the dialog.
- Execute the workflow by clicking on the double green arrow at the top.
- To view the histogram of SalePrice, right-click on the Histogram node, and select ‘View: Histogram View’. The x-axis labels show the SalePrice range of each bin, and the y-axis shows the frequency of samples that fall in each bin. You can expand the window to get a better view of the labels.
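To make the binning idea behind the histogram concrete, here is a minimal Python sketch of equal-width binning. It is only an illustration of the concept, not how KNIME implements its Histogram node; the function name `equal_width_histogram` and the toy prices are made up for this example and are not taken from the Kaggle file.

```python
def equal_width_histogram(values, num_bins=10):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            # All values are identical, so everything lands in the first bin.
            index = 0
        else:
            # The maximum value belongs to the last bin, not a new one.
            index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Made-up prices for illustration only.
toy_prices = [100_000, 120_000, 125_000, 150_000,
              180_000, 200_000, 210_000, 300_000]

edges, counts = equal_width_histogram(toy_prices, num_bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:>9.0f} to {right:>9.0f}: {count}")
```

Changing `num_bins` here plays the same role as the Number of Bins setting in the node's configuration: fewer bins give a coarser picture of the distribution, more bins a finer one.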
A scatter plot draws points on a graph to show the relationship between two variables, and can be used to visually inspect the correlation between them. The columns we are selecting for the scatter plot are GrLivArea and SalePrice.
- Search for the Scatter Plot node in the Node Repository, and drag it to the Workflow Editor, and connect it to the File Reader node.
- No further configuration is necessary. Execute the workflow, open the view of the Scatter Plot node, and in the Column Selection tab set the X Column to GrLivArea and the Y Column to SalePrice.
The plot shows that there is a strong positive linear correlation between SalePrice and GrLivArea. We will look at the correlation matrix at the end of this blog.
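What the eye picks up in a scatter plot can be put into a number with the Pearson correlation coefficient. The sketch below computes it in plain Python; the helper `pearson_r` and the toy values are assumptions made for illustration, not values from the House Prices data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up living areas (sq ft) and prices for illustration only.
living_area = [900, 1200, 1500, 1800, 2400]
sale_price = [110_000, 140_000, 175_000, 190_000, 260_000]

r = pearson_r(living_area, sale_price)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the points lie close to a rising straight line, a value close to -1 means a falling line, and a value near 0 means no linear relationship.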
A box plot can be used to compare different distributions. Data for a numeric variable is partitioned into categories, and a box plot is created to show the distribution for each category. The box plots are then shown on a single graph to compare the different categories. The column we are selecting for this example is OverallQual.
- Search for the Box Plot node in the Node Repository, and drag it to the Workflow Editor, and connect it to the File Reader node.
- No further configuration is necessary. Execute the workflow and open the view of the Box Plot node. In the Column Selection tab all columns are selected by default. Exclude them all at once with the << button, and then select the columns for which you want to see the box plot.
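A box plot is essentially a drawing of a five-number summary: minimum, first quartile, median, third quartile and maximum. The sketch below computes that summary with Python's standard `statistics` module. Quartile conventions differ between tools, so the numbers may not match KNIME's view exactly; the function name `five_number_summary` and the toy ratings are made up for illustration.

```python
import statistics


def five_number_summary(values):
    """Return min, quartiles and max, the values a box plot is drawn from."""
    # method="inclusive" treats the data as the whole population;
    # other tools may use a slightly different quartile definition.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Made-up quality ratings on a 1 to 10 scale, for illustration only.
toy_ratings = [3, 5, 5, 6, 6, 7, 7, 8, 10]

summary = five_number_summary(toy_ratings)
print(summary)
```

The box spans `q1` to `q3`, the line inside the box marks the median, and the whiskers extend toward the minimum and maximum.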
A correlation matrix is a table showing the correlation coefficients between pairs of variables. Each variable (Xi) in the table is correlated with each of the other variables in the table (Xj). This allows you to see which pairs have the highest correlation. We will build a correlation matrix for a subset of the variables.
- Search for the Linear Correlation node in the Node Repository, and drag it to the Workflow Editor, and connect it to the File Reader node.
- Right-click on the node and select Configure.
- Select the columns you want to include in the matrix.
- There are various settings under Output Column Pairs and p-value which can be changed.
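To show what the resulting table contains, the sketch below builds a small correlation matrix in plain Python by computing the Pearson coefficient for every pair of columns. This is an illustration of the idea only, not KNIME's implementation; the helpers `pearson_r` and `correlation_matrix` are hypothetical names, and the values are made up rather than read from the Kaggle file.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in names}
        for a in names
    }


# Made-up values for three columns, for illustration only.
toy_columns = {
    "GrLivArea": [900, 1200, 1500, 1800, 2400],
    "OverallQual": [4, 5, 7, 6, 9],
    "SalePrice": [110_000, 140_000, 175_000, 190_000, 260_000],
}

matrix = correlation_matrix(toy_columns)
for name, row in matrix.items():
    print(f"{name:<12}", [round(value, 2) for value in row.values()])
```

The diagonal is always 1 because every column is perfectly correlated with itself, and the matrix is symmetric because the correlation of Xi with Xj equals that of Xj with Xi.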
In upcoming blogs we will extend this workflow to create a complete ML model using KNIME.