Hey folks, in this blog we are going to find out the correlation of categorical variables.
What is a Categorical Variable?
In statistics, a categorical variable has two or more categories.
But there is no intrinsic ordering to the categories.
For example, a binary variable (such as a yes/no question) is a categorical variable with two categories (yes or no), and neither category ranks above the other.
Categorical variables represent types of data that may be divided into groups.
Examples of categorical variables are race, sex, age group, and educational level.
What is Correlation?
Correlation is a statistical measure that expresses the extent to which two variables are linearly related.
This means that they change together at a constant rate.
It is a common tool for describing simple relationships without making a statement about cause and effect.
Correlation is a statistic that measures the degree to which two variables move in relation to each other.
It shows the strength of a relationship between two variables, expressed numerically by the correlation coefficient.
The correlation coefficient’s values range between -1.0 and 1.0.
A positive correlation implies that as one variable moves, either up or down, the other variable moves in the same direction.
A negative correlation means that the two variables move in opposite directions, while a zero correlation implies no linear relationship at all.
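To make these three cases concrete, here is a minimal sketch that computes Pearson's correlation coefficient by hand in plain Python. The helper name pearson_r and the toy numbers are made up for illustration; they are not part of pandas or dython.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # moves in the same direction as hours
falling = [10, 8, 6, 4, 2]    # moves in the opposite direction

r_pos = pearson_r(hours, rising)
r_neg = pearson_r(hours, falling)

# A perfect but non-linear relationship (y = x squared) still gives zero.
xs = [-2, -1, 0, 1, 2]
squares = [4, 1, 0, 1, 4]
r_zero = pearson_r(xs, squares)

print(r_pos)   # 1.0
print(r_neg)   # -1.0
print(r_zero)  # 0.0
```

The last case is worth noticing: a coefficient of zero only rules out a linear relationship, not every kind of relationship.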
Let’s Find the Correlation of Categorical Variables.
We are not going to deep dive into the mathematics behind the correlation coefficient.
Unlike other data types such as numerical or boolean, we cannot use the built-in methods of pandas to generate the correlation matrix for categorical variables.
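One way to see why a numeric correlation does not carry over to categories: if we swap the labels for integer codes, the result depends on which arbitrary code each label happens to get. The sketch below uses made-up toy values (not taken from any real dataset) and a hypothetical helper called pearson_r.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: a categorical column and a numeric column (illustrative values).
types = ["Fire", "Water", "Grass", "Fire", "Water", "Grass"]
attack = [80, 50, 60, 85, 55, 65]

# Two equally arbitrary ways of turning the labels into numbers.
encoding_a = {"Fire": 0, "Water": 1, "Grass": 2}
encoding_b = {"Fire": 2, "Water": 0, "Grass": 1}

r_a = pearson_r([encoding_a[t] for t in types], attack)
r_b = pearson_r([encoding_b[t] for t in types], attack)

# Same data, different label codes: even the sign of the result changes.
print(r_a, r_b)
```

Since the labels have no intrinsic order, neither number means anything, which is why we need a measure built for categories.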
To find the correlation of categorical variables, we are going to use a library called dython.
Dython is a set of data analysis tools in Python 3.x that helps you get more insights into your data.
This library was designed with analysis usage in mind.
Ease-of-use, functionality, and readability are the core values of this library.
Dython will automatically find which features are categorical and which are numerical, compute a relevant measure of association between each and every feature, and plot it all as an easy-to-read heat-map.
We can easily install dython using the pip tool:
pip install dython
Or we can install it using the conda package manager:
conda install -c conda-forge dython
If we would like to use the source code instead, we can install directly from it using any of the following methods:
- Installing the source code with pip:
pip install git+https://github.com/shakedzy/dython.git
Dython requires Python 3.5 or higher, and the following packages:
Importing Necessary Libraries
We are going to use two libraries:
Pandas for loading the dataset.
If you want to explore more about Pandas, check this out: Pandas for Data Analysis.
The second library we are going to use is dython to calculate the correlation.
import pandas as pd
from dython.nominal import associations
We are going to use the pokemon dataset for our analysis.
Link to Dataset.
URL = 'https://raw.githubusercontent.com/adamerose/datasets/master/pokemon.csv'
df = pd.read_csv(URL)
Identifying the Categorical Variables
We can use the function identify_nominal_columns(dataset) of the dython library to identify the categorical variables in the dataset.
from dython.nominal import identify_nominal_columns
categorical_features = identify_nominal_columns(df)
categorical_features
['Name', 'Type 1', 'Type 2']
We have identified Name, Type 1, and Type 2 as categorical features in the Pokemon dataset.
Generating Correlation Matrix and Heat-Map.
To generate the correlation matrix, we are going to use the associations function of the dython library.
associations(dataset, nominal_columns='auto', numerical_columns=None, mark_columns=False, nom_nom_assoc='cramer', num_num_assoc='pearson', bias_correction=True, nan_strategy=_REPLACE, nan_replace_value=_DEFAULT_REPLACE_VALUE, ax=None, figsize=None, annot=True, fmt='.2f', cmap=None, sv_color='silver', cbar=True, vmax=1.0, vmin=None, plot=True, compute_only=False, clustering=False, title=None, filename=None)
It calculates the correlation/strength-of-association of features in the data-set with both categorical and continuous features using: Pearson’s R for continuous-continuous cases, Correlation Ratio for categorical-continuous cases, Cramer’s V or Theil’s U for categorical-categorical cases.
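We are still not deep diving into the mathematics, but for the curious here is a rough sketch of two of these measures in plain Python: Cramér's V for the categorical-categorical case and the correlation ratio for the categorical-continuous case. The function names and toy data are illustrative only, and this is not dython's implementation; in particular, it skips the bias correction that the bias_correction=True default in the signature above refers to.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences (no bias correction)."""
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    pair_counts = Counter(zip(xs, ys))
    # Chi-squared statistic: compare observed pair counts with the counts
    # we would expect if the two variables were independent.
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    return math.sqrt(chi2 / (n * k))


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric sequence."""
    n = len(values)
    grand_mean = sum(values) / n
    groups = {}
    for c, v in zip(categories, values):
        groups.setdefault(c, []).append(v)
    # Spread of the group means around the grand mean, weighted by group size.
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    total = sum((v - grand_mean) ** 2 for v in values)
    if total == 0:
        return 0.0  # a constant numeric column cannot be explained by anything
    return math.sqrt(between / total)


# Toy data: 'group' fully determines 'label' and 'score_by_group',
# but tells us nothing about 'coin' or 'score_mixed'.
group = ["a", "a", "b", "b"]
label = ["x", "x", "y", "y"]
coin = ["x", "y", "x", "y"]
score_by_group = [1, 1, 5, 5]
score_mixed = [1, 5, 1, 5]

print(cramers_v(group, label))                   # 1.0
print(cramers_v(group, coin))                    # 0.0
print(correlation_ratio(group, score_by_group))  # 1.0
print(correlation_ratio(group, score_mixed))     # 0.0
```

Both measures run from 0 (no association) to 1 (one variable fully determines the other), so unlike Pearson's R they have no negative side.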
The associations function returns a dictionary with two keys:
- ‘corr’: A DataFrame of the correlation between all features.
- ‘ax’: A matplotlib Axes object which contains the correlation heatmap.
First, let’s find the correlation matrix for the whole pokemon dataset.
complete_correlation= associations(df, filename= 'complete_correlation.png', figsize=(10,10))
Correlation Matrix Of Complete Dataset
You can extract the correlation matrix by using the code below.
df_complete_corr = complete_correlation['corr']
# Styler.set_precision was removed in pandas 2.0; format(precision=2) replaces it.
df_complete_corr.dropna(axis=1, how='all').dropna(axis=0, how='all').style.background_gradient(cmap='coolwarm', axis=None).format(precision=2)
Heat Map Of Complete Dataset
The generated heat map can be saved by passing a filename with a suitable format extension, such as png or jpeg.
The filename parameter defaults to None. If it is not None, the plot is saved to the given filename.
Correlation Matrix of Categorical Variables Only.
To generate the correlation matrix for only the categorical variables, we are going to copy all the categorical variables into a separate data frame.
selected_column = df[categorical_features]
categorical_df = selected_column.copy()
After preparing the separate data frame, we are going to use the below code to generate the correlation for categorical variables.
categorical_correlation= associations(categorical_df, filename= 'categorical_correlation.png', figsize=(10,10))
So, in this blog, we have briefly discussed categorical variables and the correlation matrix. We have learned how to find the correlation matrix of categorical variables. The correlation matrix helps us identify which features are suitable for model training.