MachineX: Data Cleaning in Python

In this blog, we are going to learn how to do data cleaning in Python.

Most data scientists spend only about 20 percent of their time on actual data analysis; the other 80 percent goes to finding, cleaning, and reorganizing huge amounts of data, which is an inefficient way to work.

The reason data scientists are hired in the first place is to develop algorithms and build machine learning models, and these are typically the parts of the job they enjoy most. Yet in most companies today, the bulk of a data scientist’s valuable time is spent simply wrangling data into a usable shape.

If you are just stepping into this field or planning a career in it, it is important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.

In this tutorial, we will use Python’s NumPy and Pandas libraries to clean data and explore the many ways we can use them.

Datasets

We are using some common datasets, each chosen to match the cleaning technique it demonstrates, so you can also download your own dataset and follow the instructions.

Here are the different datasets that we will be using; you can download them through the links below or directly from the GitHub repo:

  • BL-Flickr-Images-Book.csv – A CSV file containing information about books from the British Library
  • university_towns.txt – A text file containing names of college towns in every US state
  • olympics.csv – A CSV file summarizing the participation of all countries in the Summer and Winter Olympics

Note: I am using a Jupyter Notebook and recommend the same to follow along.

Let’s import the required modules and get started!

>>> import pandas as pd
>>> import numpy as np

Now, let’s start our data cleaning with the help of these two modules. We import them under the aliases pd and np, the conventional shorthand for referring to them in code.

Dropping Unwanted Columns in a DataFrame

Pandas provides a handy way of removing unwanted columns or rows from a DataFrame with the drop() function. Let’s look at a simple example where we drop a number of columns from a DataFrame. First, let’s create a DataFrame out of the CSV file ‘BL-Flickr-Images-Book.csv’. In the examples below, we pass a relative path to pd.read_csv, meaning that all of the datasets are in a folder named Datasets in our current working directory:

>>> dataF = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
>>> dataF.head()

    Identifier             Edition Statement      Place of Publication  \
0         206                           NaN                    London
1         216                           NaN  London; Virtue & Yorston
2         218                           NaN                    London
3         472                           NaN                    London
4         480  A new edition, revised, etc.                    London

  Date of Publication              Publisher  \
0         1879 [1878]       S. Tinsley & Co.
1                1868           Virtue & Co.
2                1869  Bradbury, Evans & Co.
3                1851          James Darling
4                1857   Wertheim & Macintosh

                                               Title     Author  \
0                  Walter Forbes. [A novel.] By A. A      A. A.
1  All for Greed. [A novel. The dedication signed...  A., A. A.
2  Love the Avenger. By the author of “All for Gr...  A., A. A.
3  Welsh Sketches, chiefly ecclesiastical, to the...  A., E. S.
4  [The World in which I live, and my place in it...  A., E. S.

                                   Contributors  Corporate Author  \
0                               FORBES, Walter.               NaN
1  BLAZE DE BURY, Marie Pauline Rose - Baroness               NaN
2  BLAZE DE BURY, Marie Pauline Rose - Baroness               NaN
3                   Appleyard, Ernest Silvanus.               NaN
4                           BROOME, John Henry.               NaN

   Corporate Contributors Former owner  Engraver Issuance type  \
0                     NaN          NaN       NaN   monographic
1                     NaN          NaN       NaN   monographic
2                     NaN          NaN       NaN   monographic
3                     NaN          NaN       NaN   monographic
4                     NaN          NaN       NaN   monographic

                                          Flickr URL  \
0  http://www.flickr.com/photos/britishlibrary/ta...
1  http://www.flickr.com/photos/britishlibrary/ta...
2  http://www.flickr.com/photos/britishlibrary/ta...
3  http://www.flickr.com/photos/britishlibrary/ta...
4  http://www.flickr.com/photos/britishlibrary/ta...

                            Shelfmarks
0    British Library HMNTS 12641.b.30.
1    British Library HMNTS 12626.cc.2.
2    British Library HMNTS 12625.dd.1.
3    British Library HMNTS 10369.bbb.15.
4    British Library HMNTS 9007.d.28.

When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Contributors, Issuance type, and Shelfmarks.

Dropping Columns

We can drop these columns in the following way:

>>> data_to_drop = ['Edition Statement',
...            'Corporate Author',
...            'Corporate Contributors',
...            'Former owner',
...            'Engraver',
...            'Contributors',
...            'Issuance type',
...            'Shelfmarks']

>>> dataF.drop(data_to_drop, inplace=True, axis=1)

Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object.
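
Equivalently, the same drop could have been written with the columns keyword instead of axis=1; both forms are part of the standard Pandas API, so the choice is purely stylistic:

>>> dataF.drop(columns=data_to_drop, inplace=True)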

When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:

>>> dataF.head()
   Identifier      Place of Publication Date of Publication  \
0         206                    London         1879 [1878]
1         216  London; Virtue & Yorston                1868
2         218                    London                1869
3         472                    London                1851
4         480                    London                1857

               Publisher                                              Title  \
0       S. Tinsley & Co.                  Walter Forbes. [A novel.] By A. A
1           Virtue & Co.  All for Greed. [A novel. The dedication signed...
2  Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...
3          James Darling  Welsh Sketches, chiefly ecclesiastical, to the...
4   Wertheim & Macintosh  [The World in which I live, and my place in it...

      Author                                         Flickr URL
0      A. A.  http://www.flickr.com/photos/britishlibrary/ta...
1  A., A. A.  http://www.flickr.com/photos/britishlibrary/ta...
2  A., A. A.  http://www.flickr.com/photos/britishlibrary/ta...
3  A., E. S.  http://www.flickr.com/photos/britishlibrary/ta...
4  A., E. S.  http://www.flickr.com/photos/britishlibrary/ta...

Change the Index of a DataFrame

A Pandas Index extends the functionality of NumPy arrays to allow for more versatile slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of the data as its index.

For example, in the dataset used in the previous section, it can be expected that when a librarian searches for a record, they may input the unique identifier (values in the Identifier column) for a book:

>>> dataF['Identifier'].is_unique
True

Let’s replace the existing index with this column using set_index:

>>> dataF = dataF.set_index('Identifier')
>>> dataF.head()
                Place of Publication Date of Publication  \
206                           London         1879 [1878]
216         London; Virtue & Yorston                1868
218                           London                1869
472                           London                1851
480                           London                1857

                        Publisher  \
206              S. Tinsley & Co.
216                  Virtue & Co.
218         Bradbury, Evans & Co.
472                 James Darling
480          Wertheim & Macintosh

                                                        Title     Author  \
206                         Walter Forbes. [A novel.] By A. A      A. A.
216         All for Greed. [A novel. The dedication signed...  A., A. A.
218         Love the Avenger. By the author of “All for Gr...  A., A. A.
472         Welsh Sketches, chiefly ecclesiastical, to the...  A., E. S.
480         [The World in which I live, and my place in it...  A., E. S.

                                                   Flickr URL
206         http://www.flickr.com/photos/britishlibrary/ta...
216         http://www.flickr.com/photos/britishlibrary/ta...
218         http://www.flickr.com/photos/britishlibrary/ta...
472         http://www.flickr.com/photos/britishlibrary/ta...
480         http://www.flickr.com/photos/britishlibrary/ta...
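
As an aside, set_index() also accepts an inplace flag, so we could have mutated dataF in place instead of reassigning it:

>>> dataF.set_index('Identifier', inplace=True)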

Each record can be accessed with loc[], which lets us do label-based indexing: addressing a row or record by its label, without regard to its position:

>>> dataF.loc[206]
Place of Publication                                               London
Date of Publication                                           1879 [1878]
Publisher                                                S. Tinsley & Co.
Title                                   Walter Forbes. [A novel.] By A. A
Author                                                              A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object

In other words, 206 is the first label of the index. To access it by position, we could use iloc[0], which does position-based indexing.
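
As a quick sanity check (our own addition, not part of the original walkthrough), we can confirm that the two accessors return the very same record here:

>>> dataF.iloc[0].equals(dataF.loc[206])
True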

Tidying up Fields in the Data

So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.

Upon inspection, all of the data types are currently the object dtype, which is roughly analogous to str in native Python.

It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:

>>> dataF.dtypes.value_counts()
object    6
dtype: int64

One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:

>>> dataF.loc[1905:, 'Date of Publication'].head(10) 
Identifier 
1905           1888 
1929    1839, 38-54 
2836        [1897?] 
2854           1865 
2956        1860-63 
2957           1873 
3017           1866 
3131           1899 
4598           1814 
4884           1820 
Name: Date of Publication, dtype: object 

A particular book can have only one date of publication. Therefore, we need to do the following:

  • Remove the extra dates in square brackets, wherever present: 1879 [1878]
  • Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
  • Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
  • Convert the string nan to NumPy’s NaN value

Synthesizing these patterns, we can actually take advantage of a single regular expression to extract the publication year:

regex = r'^(\d{4})' 

The above regular expression finds any four digits at the beginning of a string, which suffices for our case. It is a raw string (meaning that a backslash is no longer an escape character), which is standard practice with regular expressions.

The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start of a string, and the parentheses denote a capturing group, which signals to Pandas that we want to extract that part of the regex. (We want ^ to avoid cases where [ starts off the string.)
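
Before handing the pattern to Pandas, we can try it against a few of the problem values with Python’s built-in re module:

>>> import re
>>> for s in ['1879 [1878]', '1860-63', '[1897?]']:
...     match = re.match(r'^(\d{4})', s)
...     print(s, '->', match.group(1) if match else None)
...
1879 [1878] -> 1879
1860-63 -> 1860
[1897?] -> None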

Let’s see what happens when we run this regex across our dataset:

>>> extr = dataF['Date of Publication'].str.extract(r'^(\d{4})', expand=False) 
>>> extr.head() 
Identifier 
206    1879 
216    1868 
218    1869 
472    1851 
480    1857 
Name: Date of Publication, dtype: object 

Not familiar with regex? You can inspect the expression above at regex101.com and read more in the Python Regular Expressions HOWTO.

Technically, this column still has object dtype, but we can easily get its numerical version with pd.to_numeric:

>>> dataF['Date of Publication'] = pd.to_numeric(extr)
>>> dataF['Date of Publication'].dtype
dtype('float64')
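
If the extracted column might still contain stray strings that are not parseable as numbers, pd.to_numeric also accepts errors='coerce', which turns anything unparseable into NaN instead of raising; a slightly more defensive variant of the same step:

>>> dataF['Date of Publication'] = pd.to_numeric(extr, errors='coerce')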

This results in about one in every ten values being missing, which is a small price to pay for now being able to do computations on the remaining valid values:

>>> dataF['Date of Publication'].isnull().sum() / len(dataF)
0.11717147339205986
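
Being numeric, the column now supports ordinary computations. For instance, we could ask for the earliest recorded publication year (the exact result depends on your copy of the dataset, so the output is omitted here):

>>> dataF['Date of Publication'].min()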

Great, that’s done!

Cleaning the Entire Dataset Using the applymap Function

There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame. Pandas’ .applymap() method is similar to the built-in map() function; it simply applies a function to every element in a DataFrame.

We will create a DataFrame out of the “university_towns.txt” file:

$ head Datasets/university_towns.txt
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]

We see that we have periodic state names followed by the university towns in that state: StateA TownA1 TownA2 StateB TownB1 TownB2.... If we look at the way state names are written in the file, we’ll see that all of them have the “[edit]” substring in them.

We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that list in a DataFrame:

>>> university_towns = []
>>> with open('Datasets/university_towns.txt') as file:
...     for line in file:
...         if '[edit]' in line:
...             # Remember this `state` until the next is found
...             state = line
...         else:
...             # Otherwise, we have a city; keep `state` as last-seen
...             university_towns.append((state, line))

>>> university_towns[:5]
[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
 ('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
 ('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
 ('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
 ('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]

We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. Pandas will take each element in the list and set State to the left value and RegionName to the right value.

>>> towns_dataF = pd.DataFrame(university_towns,
...                         columns=['State', 'RegionName'])

>>> towns_dataF.head()
             State                                         RegionName
0  Alabama[edit]\n                    Auburn (Auburn University)[1]\n
1  Alabama[edit]\n           Florence (University of North Alabama)\n
2  Alabama[edit]\n  Jacksonville (Jacksonville State University)[2]\n
3  Alabama[edit]\n       Livingston (University of West Alabama)[2]\n
4  Alabama[edit]\n         Montevallo (University of Montevallo)[2]\n

applymap()

While we could have cleaned these strings in the for loop above, Pandas makes it easy. We only need the state name and the town name and can remove everything else. While we could use Pandas’ .str methods again here, we could also use applymap() to apply a Python callable to each element of the DataFrame.
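
For comparison, here is a minimal sketch of the .str route for the RegionName column, using a regex of our own devising (an assumption, not part of the original walkthrough) that keeps everything before the first ( or [:

>>> towns_dataF['RegionName'].str.extract(r'^([^(\[]+)', expand=False).str.strip().head()
0          Auburn
1        Florence
2    Jacksonville
3      Livingston
4      Montevallo
Name: RegionName, dtype: object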

We have been using the term element, but what exactly do we mean by it? Consider the following “toy” DataFrame:

        0        1
0    Mock  Dataset
1  Python   Pandas
2    Real   Python
3   NumPy    Clean

In this example, each cell (‘Mock’, ‘Dataset’, ‘Python’, ‘Pandas’, etc.) is an element. Therefore, applymap() will apply a function to each of these independently. Let’s define that function:

>>> def get_citystate(item): 
...     if ' (' in item: 
...         return item[:item.find(' (')] 
...     elif '[' in item: 
...         return item[:item.find('[')] 
...     else: 
...         return item 
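
Before applying it everywhere, we can sanity-check the function on a couple of raw values from the file:

>>> get_citystate('Alabama[edit]\n')
'Alabama'
>>> get_citystate('Auburn (Auburn University)[1]\n')
'Auburn'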

Pandas’ .applymap() only takes one parameter, which is the function (callable) that should be applied to each element:

>>> towns_dataF = towns_dataF.applymap(get_citystate)

First, we define a Python function that takes an element from the DataFrame as its parameter. Inside the function, checks are performed to determine whether the element contains a ( or a [.

The function returns a value accordingly, depending on which check matches. Finally, the applymap() function is called on our object. Now the DataFrame is much neater:

>>> towns_dataF.head()
     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3  Alabama    Livingston
4  Alabama    Montevallo

The applymap() method took each element from the DataFrame, passed it to the function, and replaced the original value with the returned value. It’s that simple!
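
One version note: in recent releases of Pandas (2.1 and later), applymap() is deprecated in favour of DataFrame.map(), which behaves the same way for this use case:

>>> towns_dataF = towns_dataF.map(get_citystate)  # Pandas >= 2.1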

Happy learning!

Check out the links below to find additional resources that will help you on your Python data science journey:

References:
realpython.com
dataquest.io


Written by

Shubham Goyal is a Data Scientist at Knoldus Inc. He is also an artificial intelligence researcher, interested in problems across different domains, and a regular contributor to the community through blogs and webinars on machine learning and artificial intelligence. He has written a few research papers on machine learning, and is a conference speaker and an official author at Towards Data Science.
