For any data scientist, it's normal to work with data sets that have missing values and still be expected to build a good predictive model from them. Here we will discuss some techniques for handling missing data in a given data set.
A missing value occurs when no data is stored for a variable or feature in an observation. It may be represented as “?”, “NA”, “0”, or just a blank cell.
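Before handling missing values, it helps to make them visible to pandas. The sketch below assumes a small made-up table in which “?” marks a missing entry; the column names only mimic the loan example and the values are invented for illustration. Whether “0” should also count as missing depends on the data set, so it is left alone here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the values are made up for illustration.
raw = pd.DataFrame({
    "Gender": ["Male", "?", "Female", "Male"],
    "LoanAmount": [120.0, np.nan, 66.0, np.nan],
})

# Turn the "?" placeholder into a real missing value (NaN)
# so that pandas can detect it.
clean = raw.replace("?", np.nan)

# Count the missing values in each column.
missing_counts = clean.isna().sum()
print(missing_counts)
```

Once every placeholder is converted to NaN, methods such as isna and dropna treat all of the missing entries the same way.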
How to deal with missing data?
There are many ways to deal with missing values, regardless of whether you use Python, R, or any other tool.
The first is to check whether the person or group that collected the data can go back and find what the actual value should be. Another possibility is simply to remove the data where the missing value is found.
When you drop data, you can either drop the whole variable or just the single data entry with the missing value. If only a few observations have missing data, dropping those particular entries is usually best. If you're removing data, you want to do it in the way that has the least impact.
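The two dropping options can be compared on a tiny example. This is only a sketch: the data frame, the “ApplicantIncome” column name, and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing loan amount.
df_demo = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0],
})

# Option 1: drop every row (entry) that contains a missing value.
rows_dropped = df_demo.dropna(axis=0)

# Option 2: drop every column (variable) that contains a missing value.
cols_dropped = df_demo.dropna(axis=1)

print(rows_dropped.shape)  # one row removed, both columns kept
print(cols_dropped.shape)  # all rows kept, one column removed
```

Here dropping the row loses one observation, while dropping the column loses the whole “LoanAmount” variable, which is why dropping single entries tends to have less impact when few values are missing.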
Replacing data is often better, since no data is wasted. However, it is less accurate, since we need to replace the missing data with a guess of what the data should be. One standard replacement technique is to replace missing values with the average value of the entire variable.
But what if the values cannot be averaged, as with categorical variables? In this case, one possibility is to use the mode, the most common value in that feature or column.
Now let’s see some common practices used in Python to handle missing values.
As an example, we will use a loan data set.
We see that there is a NaN value in the column “LoanAmount”.
We will use the dropna function to drop the row in which “LoanAmount” is missing. With the dropna method, you can choose to drop rows or columns that contain missing values, like NaN. To modify the data frame itself, you have to set the parameter “inplace” equal to True.
df.dropna(subset=["LoanAmount"], axis=0, inplace=True)
“inplace=True” writes the result back into the data frame. Don’t forget that the same call without “inplace=True” does not change the data frame. It returns a modified copy instead, which is a good way to make sure that you are performing the correct operation. Let’s see what changes have been made to our data frame.
As you can observe, the row containing the NaN value has been dropped.
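The difference between calling dropna with and without “inplace=True” can be checked on a small example. This is a sketch with made-up data; the frame name loans and the “Loan_ID” values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing loan amount.
loans = pd.DataFrame({
    "Loan_ID": ["A1", "A2", "A3"],
    "LoanAmount": [np.nan, 128.0, 66.0],
})

# Without inplace=True, dropna returns a new data frame
# and leaves `loans` untouched.
preview = loans.dropna(subset=["LoanAmount"], axis=0)
rows_before = len(loans)
rows_in_preview = len(preview)

# With inplace=True, `loans` itself is modified.
loans.dropna(subset=["LoanAmount"], axis=0, inplace=True)
rows_after = len(loans)

print(rows_before, rows_in_preview, rows_after)
```

An alternative to “inplace=True” is to assign the returned frame back to the same name, which has the same effect on the variable.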
You should always check the documentation if you are not familiar with a function or method.
To replace missing values like NaN with actual values, the pandas library has a built-in method called “replace”, which can be used to fill in the missing values with newly calculated values.
As an example, assume that we want to replace the missing values of the variable “LoanAmount” with the mean of the variable.
Let’s check for “LoanAmount” feature only:
import numpy as np
mean = df["LoanAmount"].mean()
df["LoanAmount"] = df["LoanAmount"].replace(np.nan, mean)
The following would be the result:
In the case of categorical values, finding the mean is not an option. For such cases, one approach is to fill them with the value that appears most frequently.
Let’s look for the column “Gender”.
We see that “Male” occurs quite a bit more often than any other value. So we can fill the missing values with “Male”.
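One way to do this is sketched below. It is not necessarily the code originally used here: the data is made up, the frame name people is hypothetical, and fillna is used as an alternative to replace.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Male" is the most frequent value.
people = pd.DataFrame(
    {"Gender": ["Male", "Female", "Male", np.nan, "Male", np.nan]}
)

# Frequency of each category (NaN is ignored by default).
counts = people["Gender"].value_counts()
print(counts)

# mode() returns the most frequent value(s); take the first one.
most_common = people["Gender"].mode()[0]

# Fill the missing entries with the most frequent category.
people["Gender"] = people["Gender"].fillna(most_common)
```

Computing the mode instead of hard-coding “Male” keeps the code correct if the most frequent category turns out to be different in another data set.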
This is a fairly simplified way of replacing missing values.
There are of course other techniques, such as replacing missing values with the average of the group instead of the average of the entire data set. Check out the pandas documentation for more information.
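As a minimal sketch of the group-average idea, the example below fills each missing “LoanAmount” with the mean of its group. Grouping by “Gender” is only an assumption made for illustration, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing amount in each group.
loans = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "LoanAmount": [100.0, 140.0, np.nan, 60.0, np.nan, 80.0],
})

# For every row, compute the mean LoanAmount of that row's group.
# NaN values are skipped when the mean is calculated.
group_mean = loans.groupby("Gender")["LoanAmount"].transform("mean")

# Fill each missing value with the mean of its own group.
loans["LoanAmount"] = loans["LoanAmount"].fillna(group_mean)
print(loans)
```

Because transform returns a result aligned with the original rows, fillna can pick the matching group mean for each missing entry.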