Logistic Regression in Machine Learning: A Guided Tour

Reading Time: 4 minutes

In this blog post we will look at logistic regression and walk through a practical implementation on a loan prediction data set.

What is Logistic Regression?

Logistic regression is a statistical and machine learning technique for classifying the records of a data set based on the values of the input fields. Let’s say we have a loan data set that we’d like to analyse in order to understand which customers might be eligible for a loan. This is historical customer data in which each row represents one customer. Imagine that you’re an analyst at this company and you have to determine each customer’s loan eligibility. You’ll use the data set to build a model from historical records (e.g. credit score, education, income) and then use that model to predict whether future applicants are eligible.

How is it different from Linear Regression?

Logistic regression is analogous to linear regression, but it predicts a categorical or discrete target field instead of a numeric one. In linear regression, we might try to predict a continuous value such as the price of a house, the blood pressure of a patient, or the fuel consumption of a car. In logistic regression, we predict a variable that is binary, such as yes/no, true/false, successful/not successful, and so on. The model still computes a weighted sum of the inputs, but it passes that sum through the logistic (sigmoid) function so that the output can be read as a probability between 0 and 1.
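
To make the difference from linear regression concrete, here is a minimal sketch of how a linear score becomes a probability and then a yes/no prediction. The weights, bias and applicant values are made up purely for illustration; they do not come from the loan data set.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, chosen only for illustration
weights = np.array([0.8, -0.5])
bias = 0.1

# Two made-up applicants, each described by two numeric features
applicants = np.array([[2.0, 1.0],
                       [-1.0, 3.0]])

scores = applicants @ weights + bias               # linear part, as in linear regression
probabilities = sigmoid(scores)                    # squashed into the range (0, 1)
predictions = (probabilities >= 0.5).astype(int)   # 1 if the probability is at least 0.5

print(probabilities)
print(predictions)
```

A positive score gives a probability above 0.5 and a negative score gives one below 0.5, which is why the first applicant is predicted as 1 and the second as 0.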

Implementation:

Now let's see how you can use logistic regression on a data set. You can download the loan data set from here. First, we have to import the required packages.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Import the train and test sets from your local machine.

df_test=pd.read_csv(r"local-path/loan-test.csv")
df_train=pd.read_csv(r"local-path/loan-train.csv")

df_train

Displaying the data frame lets you look at the data set and see which features it has.

Before moving forward, we need to check for missing values and pre-process the data. It is a good habit to always do so before any kind of model training or prediction. Let's do it.

df_train.isnull().sum()

As we can see, there are many missing or null values in the data set. Since this data set is quite small, dropping those rows is not a good choice: it would leave too little data to train a reliable model. Therefore we need to fill these missing values. One way to go is to fill each categorical column with its most frequent value (the mode). Let's see how to do it.

print(df_train.Gender.value_counts())
print(df_test.Gender.value_counts())

Here the most frequent value for Gender is Male, so let's fill the missing entries with it.

#Filling missing values with male
df_train["Gender"]= df_train["Gender"].fillna("Male")
df_test["Gender"]= df_test["Gender"].fillna("Male")

Similarly, we will do the same for all the other features that have missing values.

#Filling missing value with yes
df_train["Married"]= df_train["Married"].fillna("Yes")
df_test["Married"]= df_test["Married"].fillna("Yes")

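#Filling missing values with "0"
#Assumption: Dependents is read as text (it holds values such as "3+"),
#so we fill with the string "0" to keep the column a single type for the encoder
df_train["Dependents"]= df_train["Dependents"].fillna("0")
df_test["Dependents"]= df_test["Dependents"].fillna("0")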

#Filling Missing values with No
df_train["Self_Employed"]= df_train["Self_Employed"].fillna("No")
df_test["Self_Employed"]= df_test["Self_Employed"].fillna("No")

#Filling missing values as mean of Loan Amount
df_train['LoanAmount']= df_train['LoanAmount'].fillna(df_train['LoanAmount'].mean())
df_test['LoanAmount']= df_test['LoanAmount'].fillna(df_test['LoanAmount'].mean())

#Filling missing values with 360
df_train["Loan_Amount_Term"]= df_train["Loan_Amount_Term"].fillna(360)
df_test["Loan_Amount_Term"]= df_test["Loan_Amount_Term"].fillna(360)

#Filling Missing values with 1.0
df_train["Credit_History"]= df_train["Credit_History"].fillna(1.0)
df_test["Credit_History"]= df_test["Credit_History"].fillna(1.0)
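
The fill values above are typed in by hand after reading the value counts. As a rough sketch of the same idea done programmatically, the snippet below computes the mode and the mean from a training frame and applies them to both frames. The two small data frames are made up for illustration and only stand in for the loan train and test sets; learning the fill values from the training data alone is a design choice of this sketch, not something the steps above require.

```python
import numpy as np
import pandas as pd

# Tiny made-up frames standing in for the loan train/test sets
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", np.nan],
    "LoanAmount": [100.0, np.nan, 200.0, 150.0],
})
test = pd.DataFrame({
    "Gender": [np.nan, "Female"],
    "LoanAmount": [np.nan, 120.0],
})

# Learn the fill values from the training data only
gender_mode = train["Gender"].mode()[0]   # most frequent category
loan_mean = train["LoanAmount"].mean()    # mean of the observed amounts

# Apply the same fill values to both frames
for frame in (train, test):
    frame["Gender"] = frame["Gender"].fillna(gender_mode)
    frame["LoanAmount"] = frame["LoanAmount"].fillna(loan_mean)

# Count the remaining missing values in each frame
print(train.isnull().sum().sum(), test.isnull().sum().sum())
```

Computing the values this way avoids typos in hand-written constants and keeps the train and test sets filled consistently.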

Great, we have now filled all the missing values. Let's check.

df_train.isnull().sum()

Now we can use a label encoder to convert the categorical (text) columns into integer codes, since the model needs numeric inputs. Let's do it.

from sklearn import preprocessing

# Columns that hold text categories and need integer codes
categorical_cols = ['Gender', 'Dependents', 'Married', 'Education', 'Self_Employed', 'Property_Area']

for col in categorical_cols:
    labelEncoder = preprocessing.LabelEncoder()
    # Learn the mapping on the training column, then reuse it for the test column
    df_train[col] = labelEncoder.fit_transform(df_train[col])
    df_test[col] = labelEncoder.transform(df_test[col])

# The target column exists only in the training set
df_train['Loan_Status'] = preprocessing.LabelEncoder().fit_transform(df_train['Loan_Status'])
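
To see what the label encoder actually does to a column, here is a small sketch on made-up values that stand in for the Property_Area column. It fits the encoder once on the training values and reuses the same mapping for the test values, which is one way to guarantee that a given category gets the same code in both sets.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values standing in for Property_Area in the train and test sets
train_area = pd.Series(["Urban", "Rural", "Semiurban", "Urban"])
test_area = pd.Series(["Rural", "Urban"])

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_area)   # learn the mapping and apply it
test_codes = encoder.transform(test_area)         # reuse the learned mapping

# Classes are sorted alphabetically; the code is the position in this list
print(encoder.classes_.tolist())   # ['Rural', 'Semiurban', 'Urban']
print(train_codes.tolist())        # [2, 0, 1, 2]
print(test_codes.tolist())         # [0, 2]

# The codes can be mapped back to the original labels
print(encoder.inverse_transform(test_codes).tolist())   # ['Rural', 'Urban']
```

Note that the integer codes imply an ordering (Rural < Semiurban < Urban) that has no real meaning; for a quick first model that is usually acceptable, but it is worth keeping in mind when interpreting the coefficients.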

Now we can create our model. I have dropped Loan_ID and Loan_Status from the feature matrix, so every remaining column is used as a feature and Loan_Status is the target. Feel free to play around with other feature subsets as well.

#Building Model
#Features: every column except the identifier and the target
x = df_train.drop(columns=["Loan_ID", "Loan_Status"])
#Target: the encoded loan status
y = df_train["Loan_Status"]

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
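
As a quick illustration of what this split produces, the sketch below runs train_test_split on a small made-up array with the same test_size and random_state. The stratify argument is an optional addition that is not used in the walkthrough above; it asks scikit-learn to keep the class proportions roughly equal in both parts, which can help when one class is much rarer than the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, and an imbalanced binary target
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 14 + [1] * 6)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.3,      # 30% of the rows go to the test part
    random_state=42,    # fixed seed so the split is reproducible
    stratify=y,         # optional: preserve the class proportions
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())   # share of class 1 in each part
```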

The sklearn library lets us perform logistic regression in a few lines using the LogisticRegression class, as shown below.

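from sklearn.linear_model import LogisticRegression
#Fit a logistic regression classifier on the training split
model=LogisticRegression()
model.fit(x_train,y_train)

#Mean accuracy on the held-out split
print(model.score(x_test,y_test))

#Mean accuracy on the training split, for comparison
print(model.score(x_train,y_train))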
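
Accuracy is a single number, so it can hide which kind of mistakes the model makes. The sketch below shows two other things a fitted LogisticRegression offers: a confusion matrix and per-row class probabilities. It uses synthetic data generated on the spot rather than the loan data set, and max_iter=1000 is simply a generous setting chosen for this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data: the label is 1 when the two features sum to a positive number
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

y_pred = clf.predict(X_te)                             # hard 0/1 predictions
cm = confusion_matrix(y_te, y_pred, labels=[0, 1])     # rows: true class, columns: predicted class
proba = clf.predict_proba(X_te[:3])                    # columns: P(class 0), P(class 1)

print(clf.score(X_te, y_te))
print(cm)
print(proba)
```

The off-diagonal entries of the confusion matrix count the two kinds of error separately, which matters for loans because approving a bad application and rejecting a good one usually have different costs.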

The two print calls output the accuracy on the held-out split, followed by the accuracy on the training split.

Conclusion:

We have seen how to apply the logistic regression algorithm to a data set. Hopefully this also gives the reader an idea of the basic steps that need to be done before building a model. To learn more, see the scikit-learn documentation.
