Regression Analysis


Introduction

Regression analysis is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).

“Regression” comes from “regress”, which in turn comes from the Latin “regressus” – to go back (to something). In that sense, it is the technique that allows us “to go back” from messy, hard-to-interpret data to a clearer and more meaningful model.

For the basic concepts of data analysis, you may go through this link.

In this blog we will cover the related terminology, the types of regression, and how to select the right model.

  • Dependent Variable: The variable whose value depends on the independent variables. It is also referred to as the target variable.
  • Independent Variable: The factors which influence the dependent variable, or which are used to predict its values, are referred to as independent variables, also called predictors.
  • Outliers: An outlier is an observation that contains either a very low or a very high value in comparison to the other observed values. An outlier may distort the result, so it should be handled or removed.
  • Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
  • Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is overfitting. If our algorithm does not perform well even on the training dataset, the problem is underfitting.

Why use regression analysis?

Regression helps in the prediction of a continuous variable. There are various real-world scenarios where we need future predictions, such as weather conditions, sales, marketing trends, etc. For such cases we need a technique that can make predictions accurately, and regression analysis, widely used in machine learning and data science, is such a method. Below are some other reasons for using regression analysis:

  • It estimates the relationship between the target and the independent variables.
  • It is used to find trends in data.
  • It helps to predict real/continuous values.
  • It helps determine the most important factor, the least important factor, and how each factor affects the others.

Types

Linear Regression

  • It is a statistical method used for predictive analysis.
  • It shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis).
  • If there is only one input variable (x), it is called simple linear regression; if there is more than one input variable, it is called multiple linear regression.
  • The relationship between the variables can be pictured as a straight line; for example, predicting the salary of an employee on the basis of years of experience.

Y = aX + b

Here, Y = dependent variable (target variable),

X = independent variable (predictor variable),

a and b are the linear coefficients (slope and intercept).

Points to keep in mind:

  1. Note that this model is more susceptible to outliers; hence, it should not be used on large datasets without treating outliers first.
  2. There should be a linear relationship between the independent and dependent variables.
  3. There is only one independent variable and one dependent variable.
  4. The model is a best-fit straight line (a short code sketch follows below).
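
As a quick illustration, here is a minimal sketch (not from the original post; scikit-learn and made-up years-of-experience and salary figures are assumed) that fits the Y = aX + b line:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up example data: years of experience (X) and salary (Y)
    X = np.array([[1], [2], [3], [4], [5], [6]])   # shape (n_samples, 1)
    Y = np.array([30000, 35000, 41000, 45000, 52000, 58000])

    model = LinearRegression().fit(X, Y)
    print("a (slope):    ", model.coef_[0])
    print("b (intercept):", model.intercept_)
    print("salary at 7 years:", model.predict([[7]])[0])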

Multiple Linear Regression

Simple linear regression allows a data scientist or data analyst to predict a target variable from a single predictor by training the model. A multiple linear regression model extends this to more than one predictor.

Expression

yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + … + βₚxᵢₚ + ϵ

where, for i = 1, …, n observations:

y = dependent variable

x = explanatory variables

β₀ = y-intercept (constant term)

βₚ = slope coefficient for each explanatory variable

ϵ = the model's error term (also known as the residual)

Points to keep in mind:

  1. Multiple linear regression can exhibit multicollinearity, autocorrelation, and heteroscedasticity.
  2. Multicollinearity increases the variance of the coefficient estimates and makes the estimates very sensitive to minor changes in the model. As a result, the coefficient estimates are unstable.
  3. In the case of multiple independent variables, we can use forward selection, backward elimination, or a stepwise approach for feature selection (a sketch of a fitted model follows below).
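
A minimal sketch of a multiple linear regression with two made-up predictors, showing the fitted β coefficients (illustrative only; scikit-learn is assumed):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up data: two explanatory variables (x1, x2) and one target y
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
    y = np.array([7.1, 7.9, 14.2, 15.1, 20.0])

    model = LinearRegression().fit(X, y)
    print("beta_0 (intercept):", model.intercept_)
    print("beta_1, beta_2:    ", model.coef_)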

Polynomial Regression

  • Polynomial regression models a non-linear dataset using a linear model.
  • It is similar to multiple linear regression, but it fits a non-linear curve between the values of x and the corresponding conditional values of y.
  • Suppose there is a dataset whose data points follow a non-linear pattern; a straight line will not fit those points well. To cover such cases, we need polynomial regression.
  • Here, the original features are transformed into polynomial features of a given degree and then modeled using a linear model, which means the data points are best fitted with a polynomial curve.

Points to keep in mind:

Fitting a higher-degree polynomial to get a lower error can result in overfitting. Plot the relationships to see the fit, and make sure that the curve fits according to the nature of the problem.
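
One common way to do this (a sketch, assuming scikit-learn and made-up data, not from the original post) is to expand the features with PolynomialFeatures and then fit an ordinary linear model on the expanded features:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    # Made-up non-linear data: y roughly follows a quadratic curve in x
    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 20).reshape(-1, 1)
    y = 2 * x.ravel() ** 2 + x.ravel() + rng.normal(scale=1.0, size=20)

    # Degree 2 keeps the curve simple; a much higher degree risks overfitting
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(x, y)
    print("R^2 on training data:", model.score(x, y))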

Logistic Regression

  • It is another supervised learning algorithm, used to solve classification problems. In classification problems, the dependent variable is in a binary or discrete format such as 0 or 1.
  • This algorithm works with categorical target variables such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc.
  • It is a predictive analysis algorithm that works on the concept of probability.
  • It is a type of regression, but it differs from the linear regression algorithm in how it is used.
  • This model uses the sigmoid (logistic) function to map the linear combination of inputs to a probability, and its cost function is more complex than the one used in linear regression. The sigmoid function is:

f(x) = 1 / (1 + e^(-x))

  • f(x) = output, a value between 0 and 1.
  • x = input to the function.
  • e = base of the natural logarithm.
  • When we provide the input values (data) to the function, it produces an S-shaped curve.
  • It uses the concept of a threshold: values above the threshold are rounded to 1, and values below the threshold are rounded to 0 (see the sketch below).
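
A minimal sketch of the sigmoid and a logistic regression with a 0.5 threshold (scikit-learn and made-up pass/fail data are assumptions, not from the original post):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sigmoid(x):
        # f(x) = 1 / (1 + e^(-x)); the output always lies between 0 and 1
        return 1.0 / (1.0 + np.exp(-x))

    print("sigmoid(0) =", sigmoid(0))

    # Made-up binary data: hours studied vs. pass (1) / fail (0)
    X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    model = LogisticRegression().fit(X, y)
    probs = model.predict_proba(X)[:, 1]      # probabilities along the S-curve
    preds = (probs >= 0.5).astype(int)        # threshold at 0.5: above -> 1, below -> 0
    print(probs.round(3), preds)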

Ridge Regression

  • It is one of the most robust versions of linear regression, in which a small amount of bias is introduced so that we can get better long-term predictions.
  • The amount of bias added to the model is referred to as the Ridge Regression penalty. This penalty term is computed by multiplying lambda by the squared weight of each individual feature.
  • A general linear or polynomial regression will fail if there is high collinearity between the independent variables; to solve such problems, Ridge regression can be used.
  • It is used to reduce the complexity of the model and is also referred to as L2 regularization.
  • It helps to solve problems where we have more parameters than samples.

Below is the equation used for Ridge Regression, where λ (lambda) helps resolve the multicollinearity issue:

  β = (X′X+λI)^−1 (X′Y)
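
The same formula can be written out directly with NumPy. The sketch below (illustrative, with made-up data) computes β = (X′X + λI)⁻¹ X′Y and compares it with scikit-learn's Ridge:

    import numpy as np
    from sklearn.linear_model import Ridge

    # Made-up design matrix X (two highly correlated columns) and target Y
    X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]])
    Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    lam = 1.0  # the regularization strength λ

    # Closed-form ridge solution: beta = (X'X + λI)^-1 X'Y
    beta = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ (X.T @ Y)
    print("closed-form beta:", beta)

    # scikit-learn's Ridge (fit_intercept=False so it matches the formula above)
    print("sklearn beta:    ", Ridge(alpha=lam, fit_intercept=False).fit(X, Y).coef_)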

Lasso Regression

The full form of LASSO is Least Absolute Shrinkage and Selection Operator. As the name suggests, LASSO uses the “shrinkage” technique, in which coefficients are determined and shrunk towards a central point, such as the mean.

LASSO regularization favors simple models with fewer parameters. We get a better interpretation of the models due to the shrinkage process, which also enables the identification of the variables most strongly associated with the target.

  • It is another regularization technique used to reduce the complexity of the model.
  • It is similar to Ridge regression, except that the penalty term contains the absolute values of the weights instead of their squares.
  • Since it takes absolute values, it can shrink a slope all the way to 0, whereas Ridge can only shrink it close to 0.
  • It is also referred to as L1 regularization; the Lasso penalty term is λ multiplied by the sum of the absolute values of the weights (a sketch follows below).
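
A small sketch (illustrative, assuming scikit-learn and made-up data) showing how Lasso can shrink some coefficients exactly to zero while Ridge only shrinks them towards zero:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    # Only the first two features actually matter; the other three are noise
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

    print("Lasso coefficients:", Lasso(alpha=0.5).fit(X, y).coef_.round(3))
    print("Ridge coefficients:", Ridge(alpha=0.5).fit(X, y).coef_.round(3))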

Bayesian Linear Regression

Bayesian linear regression is used to find the values of the regression coefficients. Instead of finding least-squares estimates, the posterior distribution of the coefficients is determined. Bayesian linear regression combines ideas from linear and Ridge regression and is more stable than simple linear regression.
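
A minimal sketch using scikit-learn's BayesianRidge (one common implementation of Bayesian linear regression; the data below is made up for illustration):

    import numpy as np
    from sklearn.linear_model import BayesianRidge

    # Made-up one-feature data
    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
    y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

    model = BayesianRidge().fit(X, y)
    # return_std=True also returns the standard deviation of the posterior prediction
    mean, std = model.predict(np.array([[6.0]]), return_std=True)
    print("prediction:", mean[0], "+/-", std[0])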

Decision Tree Regression

The decision tree, as the name suggests, works on the principle of conditions. It is an efficient and powerful algorithm used for predictive analysis. Its main components are internal nodes, branches, and terminal (leaf) nodes.

Every internal node holds a “test” on an attribute, the branches hold the outcomes of the test, and every leaf node holds a class label (or, for regression, a predicted value). Decision trees are used for both classification and regression, both of which are supervised learning tasks. Decision trees are very sensitive to the data they are trained on: small changes to the training set can result in significantly different tree structures.
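
A minimal sketch of a decision tree used for regression (scikit-learn and made-up data are assumptions); limiting max_depth keeps the tree from memorizing small fluctuations in the training set:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(42)
    X = rng.uniform(0, 10, size=(80, 1))
    y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=80)

    # A shallow tree generalizes better than one grown to full depth
    tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
    print("prediction at x = 5:", tree.predict([[5.0]])[0])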

Random Forest Regression

Random forest, as its name suggests, comprises a large number of individual decision trees that operate as a group or, as they say, an ensemble. For regression, every individual tree in the random forest outputs a prediction and the predictions are averaged; for classification, the class with the most votes becomes the model's prediction.

Random forest achieves this diversity by permitting every individual tree to randomly sample from the dataset with replacement, producing different trees. This is known as bagging.
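
A minimal sketch of bagging in action with scikit-learn's RandomForestRegressor (made-up data; each tree sees a bootstrap sample and the trees' predictions are averaged):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(7)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X.ravel()) + rng.normal(scale=0.2, size=200)

    # bootstrap=True (the default) gives each tree a sample drawn with replacement
    forest = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)
    forest.fit(X, y)
    print("averaged prediction at x = 5:", forest.predict([[5.0]])[0])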

How to select the right model?

Each type of regression model performs differently and the model efficiency depends on the data structure. Different types of algorithms help determine which parameters are necessary for creating predictions. There are a few methods to perform model selection.

  1. Adjusted R-squared and predicted R-squared: The models with larger adjusted and predicted R-squared values are more efficient. These statistics help you avoid the fundamental problem with regular R-squared: it always increases when you add an independent variable. This property can lead to more complex models, which can sometimes produce misleading results.
  • Adjusted R-squared increases only when a new parameter improves the model; low-quality parameters can decrease it (a small sketch follows after this list).
  • Predicted R-squared is a cross-validation-based statistic that can also decrease when low-quality terms are added. Cross-validation partitions the data to determine whether the model generalizes beyond the sample it was fitted on.

2. P-values for the independent variables: In regression, p-values smaller than the significance level indicate that the corresponding term is statistically significant. “Reducing the model” is the process of including all the parameters in the model and then repeatedly removing the term with the highest non-significant p-value until the model contains only significant terms (illustrated in the sketch after the list).

3. Stepwise and best subsets regression: When we have a large number of independent variables and require a variable selection process, these automated methods can be very helpful.
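
As an illustration of points 1 and 2, here is a small sketch (made-up data; statsmodels is assumed, and the 0.05 significance level is an assumption) that fits an OLS model, reports R-squared, adjusted R-squared, and per-term p-values, and notes the term a backward-elimination step would drop:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))                 # x1 and x2 matter, x3 is pure noise
    y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

    X_const = sm.add_constant(X)                  # adds the intercept column
    results = sm.OLS(y, X_const).fit()

    print("R-squared:         ", round(results.rsquared, 3))
    print("Adjusted R-squared:", round(results.rsquared_adj, 3))
    print("p-values:", results.pvalues.round(4))

    # Backward elimination: drop the term with the largest non-significant p-value
    # (e.g. above 0.05), refit, and repeat until only significant terms remain.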

Conclusion

The different types of regression analysis in data science and machine learning discussed in this blog can be used to build a model, depending on the structure of the training data, in order to achieve optimum model accuracy.

Written by 

Lokesh Kumar is an intern in the AI/ML studio at Knoldus. He is passionate about Artificial Intelligence and Machine Learning, with knowledge of C, C++, Python, Data Analytics, and much more. He is recognised as a good team player, a dedicated and responsible professional, and a technology enthusiast. He is a quick learner, curious to learn new technologies.
