A Little Hands-on with NumPy


An introduction to NumPy, and what makes it stand out for big data processing, is already covered in the blog NumPy – Say Bye to Loops. This blog takes you a little deeper and gets you ready for a hands-on with NumPy.

Here, we will analyse a small census data set with NumPy and explore what we can learn from it. But first, we need to have a look at the data.

Exploring the data

Before moving on with the hands-on in NumPy, the first step is to look at the data and understand what each column and value means.

Age  Education-Num  Race  Sex  Capital-Gain  Capital-Loss  Income  Hours-per-week
39   13             4     1    2174          0             0       40
50   13             4     1    0             0             0       13
38   9              4     1    0             0             0       40
53   7              2     1    0             0             0       40
31   14             4     0    14084         0             1       50
42   13             4     1    5178          0             1       40
37   10             2     1    0             0             1       80

The dataset has the following features:

Feature         Description
Age             Age of the person
Education-Num   Number of years of education the person had
Race            Person’s race (key: 0 = Amer-Indian-Eskimo, 1 = Asian-Pac-Islander, 2 = Black, 3 = Other, 4 = White)
Sex             Person’s gender (key: 0 = Female, 1 = Male)
Capital-Gain    Income from investment sources, apart from wages/salary
Capital-Loss    Losses from investment sources, apart from wages/salary
Income          Annual income of the person (key: 0 = less than or equal to 50K, 1 = more than 50K)
Hours-per-week  Number of hours per week the person works

Now, let’s assume that the path to the data set is stored in a variable named path. Thus, we can read it from the location using the following code in NumPy:

>>> import numpy as np
>>> census = np.genfromtxt(path, delimiter=",", skip_header=1)

The delimiter="," argument tells NumPy that the values are comma-separated (CSV format), and skip_header=1 tells it to skip the first line, which holds the column names.

Now, let’s move on to the hands-on with NumPy and have a look at the various analyses that can be done on this sample data set.
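To try the loading step without a file on disk, here is a minimal sketch that feeds the same call an in-memory CSV. The csv_text string and its header names are assumptions made for illustration; the seven data rows are the sample rows from the table above.

```python
import io
import numpy as np

# Hypothetical stand-in for the file at `path`: a header line followed by
# the seven sample rows from the table above.
csv_text = """age,education-num,race,sex,capital-gain,capital-loss,income,hours-per-week
39,13,4,1,2174,0,0,40
50,13,4,1,0,0,0,13
38,9,4,1,0,0,0,40
53,7,2,1,0,0,0,40
31,14,4,0,14084,0,1,50
42,13,4,1,5178,0,1,40
37,10,2,1,0,0,1,80
"""

# genfromtxt accepts a file-like object as well as a path
census = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(census.shape)  # (number of rows, number of columns)
print(census.dtype)  # values are parsed as floats by default
```

Every value comes back as a float, which is why the printed arrays later in this blog show numbers such as 39. and 13. rather than plain integers.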

Analysis of Age Distribution

NumPy makes simple analysis of any numeric data type easy through its built-in methods. Because the Age feature in the data set is numeric, we can apply any of NumPy’s statistical operations to it.

The Age feature is in the first column. Before analysing it, we need to extract it from the rest of the data, which can be done with the indexing that NumPy’s ndarray supports:

# Fetching all rows of the column at index 0
>>> age = census[:,0]

Once we extract the age column, we can perform analysis on this column as follows:

# max value in age column
max_age = np.max(age)
# min value in age column
min_age = np.min(age)
# mean value of age column
age_mean = np.mean(age)
# standard deviation of age column
age_std = np.std(age)
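As a quick check of what these calls return, here is a small sketch that runs them on the seven ages from the sample table, typed in by hand rather than read from the file.

```python
import numpy as np

# Ages taken from the seven sample rows shown earlier
age = np.array([39, 50, 38, 53, 31, 42, 37], dtype=float)

max_age = np.max(age)
min_age = np.min(age)
age_mean = np.mean(age)
# np.std divides by N (population standard deviation) by default;
# pass ddof=1 to divide by N - 1 (sample standard deviation) instead
age_std = np.std(age)
age_std_sample = np.std(age, ddof=1)

print(max_age, min_age, age_mean, age_std, age_std_sample)
```

Whether the population or the sample standard deviation is appropriate depends on whether the rows are treated as the whole population or as a sample of it.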

Analysis of Race Distribution

Now we want to look at the race distribution of the country and identify the minority race.

First, we will extract the rows of each race into separate variables. This can be done using another NumPy concept called Boolean indexing.

>>> census[:,2]==4
[ True  True  True  False  True  True  False  ...  ]

For instance, the above code creates a boolean ndarray containing only True / False values, with True wherever the column at index 2 (the Race column) equals 4.

Further on, we can filter the rows matching our condition using the concept of masking. It means that for an array A and a boolean condition, A[condition] will result in an array that contains only the values which satisfy the given condition.

So, if we apply the above condition on our census array, we get the following result:

# Extracting all columns of rows matching the boolean condition
>>> race_4 = census[census[:,2]==4, :]
>>> print(race_4)
[[39. 13.  4. ...  0. 0.  40.]
 [50. 13.  4. ...  0. 0.  13.]
 [38.  9.  4. ...  0. 0.  40.]
 ...
 [40. 10.  4. ...  0. 0.  40.]
 [39. 13.  4. ...  0. 1.  50.]
 [50.  9.  4. ...  0. 0.  40.]]

Every row with the value 4 at column index 2 is extracted from the data set.
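To make the masking step concrete, here is a small sketch on the seven sample rows from the table at the top of this blog. The variable names are chosen for illustration, and the second filter, which combines two conditions with &, is an extra not covered in the text above.

```python
import numpy as np

# The seven sample rows: age, education-num, race, sex,
# capital-gain, capital-loss, income, hours-per-week
sample = np.array([
    [39, 13, 4, 1,  2174, 0, 0, 40],
    [50, 13, 4, 1,     0, 0, 0, 13],
    [38,  9, 4, 1,     0, 0, 0, 40],
    [53,  7, 2, 1,     0, 0, 0, 40],
    [31, 14, 4, 0, 14084, 0, 1, 50],
    [42, 13, 4, 1,  5178, 0, 1, 40],
    [37, 10, 2, 1,     0, 0, 1, 80],
], dtype=float)

# Boolean mask: one True/False value per row
is_race_4 = sample[:, 2] == 4

# Masking keeps only the rows where the mask is True
race_4_rows = sample[is_race_4, :]

# Conditions can be combined element-wise with & (and) or | (or);
# each condition needs its own parentheses
race_4_male_rows = sample[(sample[:, 2] == 4) & (sample[:, 3] == 1), :]

print(is_race_4)
print(race_4_rows.shape, race_4_male_rows.shape)
```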

The complete code to identify the minority race among the races in our data set is:

# Boolean mask that is True for rows whose race column (index 2) equals raceId
def extractRace(raceId):
    return census[:,2]==raceId

# Extract the rows of each race (keys 0 to 4)
race_0 = census[extractRace(0), :]
race_1 = census[extractRace(1), :]
race_2 = census[extractRace(2), :]
race_3 = census[extractRace(3), :]
race_4 = census[extractRace(4), :]

# Number of rows in each ndarray created
len_0 = len(race_0)
len_1 = len(race_1)
len_2 = len(race_2)
len_3 = len(race_3)
len_4 = len(race_4)

races = np.array([len_0, len_1, len_2, len_3, len_4])

# Finding the smallest row count
minority_race_number = np.min(races)

# Index of the smallest count, which is also the key of the minority race
minority_race = int(np.argmin(races))
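As an alternative to counting each race separately, the same question can be sketched with np.unique, which returns the distinct values of a column together with how often each occurs. This is a different technique from the code above, shown here on the Race column of the seven sample rows only.

```python
import numpy as np

# Race column (index 2) of the seven sample rows shown earlier
race = np.array([4, 4, 4, 2, 4, 4, 2], dtype=float)

# Distinct race keys and the number of rows for each
race_keys, race_counts = np.unique(race, return_counts=True)

# Key of the race with the fewest rows
minority_race = int(race_keys[np.argmin(race_counts)])

print(race_keys, race_counts, minority_race)
```

Note that np.unique only reports races that actually appear in the data, so a race with zero rows is never picked as the minority here.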

Analysis of Senior Welfare

Continuing with the NumPy hands-on, we now want to confirm that citizens above the age of sixty do not work more than 25 hours per week on average. The following code can be used to do that:

# Applying boolean filter for all values in age column at index 0 and
# extracting the data for senior citizens
>>> senior_citizens = census[census[:,0]>60,:]
# Calculating the average working hours for senior citizens
>>> working_hours_sum = np.sum(senior_citizens[:,7])
>>> senior_citizens_len = len(senior_citizens)
>>> avg_working_hours = working_hours_sum/senior_citizens_len

In summary, the above code applies a boolean condition to the age feature of the data set and, using boolean indexing, extracts all the rows that satisfy it. The final step is to calculate the average working hours of the people in that category, which can then be compared with the 25-hour limit.
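The sample rows shown earlier contain nobody above sixty, so the sketch below uses a few made-up rows purely for illustration. It assumes the same column order as the census array (age at index 0, hours-per-week at index 7) and adds the comparison with the 25-hour limit.

```python
import numpy as np

# Made-up rows for illustration only; they are NOT from the census data set.
# Columns: age, education-num, race, sex,
# capital-gain, capital-loss, income, hours-per-week
people = np.array([
    [65,  9, 4, 1, 0, 0, 0, 20],
    [70, 10, 4, 0, 0, 0, 0, 30],
    [45, 13, 4, 1, 0, 0, 1, 40],
    [62,  7, 2, 1, 0, 0, 0, 10],
], dtype=float)

# Rows of people older than sixty
seniors = people[people[:, 0] > 60, :]

# np.mean gives the same result as dividing the sum by the row count
avg_working_hours = np.mean(seniors[:, 7])

# True if seniors work 25 hours per week or less on average
within_limit = avg_working_hours <= 25

print(len(seniors), avg_working_hours, within_limit)
```

If the filter matches no rows at all, the mean of the empty selection is nan, so it is worth checking len(seniors) before trusting the comparison.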

More education years – Better paying job

Finally, let’s check whether a better paying job depends on the number of years a person spent on their education.

To do so, let’s separate the rows with more than ten years of education from the rows with less than ten years of education. Rows with exactly ten years fall into neither group. This can also be done using boolean indexing and masking.

# Data where education years > 10
>>> high = census[census[:,1]>10,:]
# Data where education years < 10
>>> low = census[census[:,1]<10,:]

As a result, we get two NumPy arrays with filtered data. Now, we can take the mean of the Income column for each group and compare the two. Because Income is coded as 0 or 1, this mean is the share of people in the group earning more than 50K.

# Income is the column at index 6 (0 : <=50K, 1 : >50K)
>>> avg_pay_high = np.mean(high[:,6])
>>> avg_pay_low = np.mean(low[:,6])
>>> avg_pay_high > avg_pay_low
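Here is a small sketch of this comparison on the seven sample rows shown at the start. It assumes, as listed in the feature table, that Income is the 0/1 column (0 for at most 50K, 1 for more than 50K), so each mean is read as a share of higher earners. Seven rows are far too few to draw a conclusion; the point is only to show the mechanics.

```python
import numpy as np

# Education-Num and Income columns of the seven sample rows
education_years = np.array([13, 13, 9, 7, 14, 13, 10], dtype=float)
income = np.array([0, 0, 0, 0, 1, 1, 1], dtype=float)

# Boolean masks for the two groups; exactly ten years is in neither
high_mask = education_years > 10
low_mask = education_years < 10

# Mean of a 0/1 column is the fraction of ones in it
share_high = np.mean(income[high_mask])
share_low = np.mean(income[low_mask])

print(share_high, share_low, share_high > share_low)
```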

Conclusion

In conclusion, when looking at the data at hand, the first step should always be to understand what each value means. Once that is done, work out the questions you want the data to answer. As you do, every question you answer will raise a whole lot of new questions that need answers too.

Since NumPy provides various built-in methods and concepts like vectorization, broadcasting, and indexing, you can focus on answering those questions rather than on how to code the solutions. NumPy handles most of that for you.
