An introduction to NumPy, and what makes it stand out for big data processing, is already covered in the blog NumPy – Say Bye to Loops. This blog takes you a little deeper and gets you ready for a hands-on with NumPy.
Here, we will analyze a small census data set with NumPy and explore what we can learn from it. But first, we need to have a look at the data.
Exploring the data
Before moving on with the hands-on in NumPy, the first step is to look at the data and understand what each column and value means.
|Age||Education-Num||Race||Sex||Capital-Gain||Capital-Loss||Income||Hours-per-week|
The dataset has the following features:
|Age||Age of the person|
|Education-Num||No. of years of education they had|
|Race||Race of the person|
KEY==> 0 : Amer-Indian-Eskimo
1 : Asian-Pac-Islander
2 : Black
3 : Other
4 : White
|Sex||Sex of the person|
KEY==> 0 : Female
1 : Male
|Capital-Gain||Income from investment sources, apart from wages/salary|
|Capital-Loss||Losses from investment sources, apart from wages/salary|
|Income||Annual income of the person|
KEY==> 0 : Less than or equal to 50K
1 : More than 50K
|Hours-per-week||No. of hours per week the person works|
Now, let’s assume that the path to the data set is stored in a variable named path. Thus, we can read it from the location using the following code in NumPy:
>>> import numpy as np
>>> census = np.genfromtxt(path, delimiter=",", skip_header=1)
Here, delimiter="," tells NumPy that the values are comma-separated (CSV format), and skip_header=1 skips the first line of the file, which holds the column names. The result is a two-dimensional nd-array of floats with one row per person.
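To try the call without the actual file, here is a minimal sketch that reads a tiny in-memory CSV instead. The header names and the three rows are made up for illustration, and the column order is assumed to follow the feature list above.

```python
import io
import numpy as np

# Hypothetical three-row sample; header names and values are made up
# and only mimic the column order of the feature list.
csv_text = (
    "age,education-num,race,sex,capital-gain,capital-loss,income,hours-per-week\n"
    "39,13,4,1,2174,0,0,40\n"
    "50,13,4,1,0,0,0,13\n"
    "28,13,2,0,0,0,1,40\n"
)

# genfromtxt accepts a file-like object as well as a path
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(sample.shape)  # (number of rows, number of columns)
print(sample.dtype)  # genfromtxt parses numeric text as floats by default
```

Every value comes back as a float, which is why the outputs later in this blog show numbers such as 39. and 13. rather than plain integers.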
Now, let’s move on to the hands-on with NumPy and have a look at the kinds of analysis that can be done on this sample data set.
Analysis of Age Distribution
NumPy makes simple analysis of numeric data easy through its built-in functions. Because the Age feature in the data set is numeric, we can apply any of NumPy’s statistical functions to it.
Age is the first column (index 0), but before we can work with it, we need to extract it from the rest of the data. This can be done using the indexing supported by NumPy’s nd-array as follows:
# Fetching all rows of the column at index 0
>>> age = census[:,0]
Once we extract the age column, we can perform analysis on this column as follows:
# max value in age column
max_age = np.max(age)
# min value in age column
min_age = np.min(age)
# mean value of age column
age_mean = np.mean(age)
# standard deviation of age column
age_std = np.std(age)
Analysis of Race Distribution
Now we want to look at the race distribution of the country and identify the minority race.
First, we will extract the rows for each race into separate variables. This can be done using another NumPy concept called Boolean indexing.
>>> census[:,2]==4
[ True True True False True True False ... ]
For instance, the above code creates a boolean nd-array containing only True / False values, with True wherever the column at index 2 (Race) has the value 4.
Further on, we can filter the rows matching our condition using the concept of masking. It means that for an array A and a boolean condition, A[condition] will result in an array that contains only the values which satisfy the given condition.
So, if we apply the above condition on our census array, we get the following result:
# Extracting all columns of rows matching the boolean condition
>>> race_4 = census[census[:,2]==4, :]
>>> race_4
[[39. 13. 4. ... 0. 0. 40.]
 [50. 13. 4. ... 0. 0. 13.]
 [38. 9. 4. ... 0. 0. 40.]
 ...
 [40. 10. 4. ... 0. 0. 40.]
 [39. 13. 4. ... 0. 1. 50.]
 [50. 9. 4. ... 0. 0. 40.]]
All the rows whose value at column index 2 is 4 are extracted from the data set.
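To see masking in isolation, here is a small sketch on made-up arrays: first on a one-dimensional array, then on the rows of a two-dimensional one, which is the pattern used on the census data. The names A, M, selected and rows are only for illustration.

```python
import numpy as np

# 1-D case: keep only the elements that satisfy the condition
A = np.array([3, 8, 1, 9, 4])
condition = A > 3          # boolean array with the same shape as A
selected = A[condition]    # elements of A where condition is True
print(condition)
print(selected)

# 2-D case: keep whole rows whose value at column index 2 equals 4
# (made-up rows with three columns: age, education-num, race)
M = np.array([[39.0, 13.0, 4.0],
              [28.0, 13.0, 2.0],
              [50.0,  9.0, 4.0]])
rows = M[M[:, 2] == 4, :]
print(rows)
```

The mask always has one entry per row, so the filtered array keeps every column but only the matching rows.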
Thus, the complete code to identify the minority race from the list of races in our data set is:
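# Generic method that builds the boolean mask for a given raceId
def extractRace(raceId):
    return census[:,2]==raceId

# Extract the rows of each race (IDs 0 to 4, see the key above)
race_0 = census[extractRace(0), :]
race_1 = census[extractRace(1), :]
race_2 = census[extractRace(2), :]
race_3 = census[extractRace(3), :]
race_4 = census[extractRace(4), :]
# length of each nd-array created
len_0 = len(race_0)
len_1 = len(race_1)
len_2 = len(race_2)
len_3 = len(race_3)
len_4 = len(race_4)
races = np.array([len_0, len_1, len_2, len_3, len_4])
# finding the size of the smallest array
minority_race_number = np.min(races)
# Fetching the index (race ID) of the array with the smallest length
minority_race = list(races).index(minority_race_number)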
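The same count can also be obtained without one variable per race. The sketch below is an alternative, not the approach used above: it relies on np.bincount and np.argmin, and the race column is made up for illustration. Since genfromtxt returns floats, the IDs are cast to integers first.

```python
import numpy as np

# Hypothetical race column (IDs 0 to 4) as floats, the way genfromtxt
# would return it; the values are made up.
race = np.array([4, 4, 2, 4, 1, 2, 4, 0, 3, 3, 2, 4, 0], dtype=float)

# bincount needs integers; minlength=5 keeps one slot per race ID
# even if an ID never occurs in the data
counts = np.bincount(race.astype(int), minlength=5)

# argmin returns the index of the smallest count, i.e. the race ID
minority_race = int(np.argmin(counts))

print(counts)
print(minority_race)
```

If two races tie for the smallest count, np.argmin returns the lower ID, which matches what list(races).index(...) does.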
Analysis of Senior Welfare
Continuing with the NumPy hands-on, we now want to check whether citizens above age sixty work, on average, no more than 25 hours per week. The following code can be used to do that:
# Applying boolean filter for all values in age column at index 0 and
# extracting the data for senior citizens
>>> senior_citizens = census[census[:,0]>60,:]
# Calculating the average working hours for senior citizens
>>> working_hours_sum = np.sum(senior_citizens[:,7])
>>> senior_citizens_len = len(senior_citizens)
>>> avg_working_hours = working_hours_sum/senior_citizens_len
In summary, the above code applies a boolean condition on the age feature of the data set and using boolean indexing, it extracts all the rows which fall under that condition. Once done, the final step is to calculate average working hours of people falling in the category.
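As a quick check of the steps above, here is a sketch on a made-up two-column array (age, hours per week). It also shows that dividing the sum by the number of rows gives the same result as calling mean directly. The names data, seniors and avg_hours are illustrative only.

```python
import numpy as np

# Made-up rows: [age, hours-per-week]
data = np.array([
    [65.0, 20.0],
    [30.0, 40.0],
    [72.0, 10.0],
    [61.0, 30.0],
    [45.0, 50.0],
])

# Keep only the rows where age is above sixty
seniors = data[data[:, 0] > 60]

# Sum divided by row count ...
avg_hours = np.sum(seniors[:, 1]) / len(seniors)
# ... is the same as the mean of the column
assert avg_hours == seniors[:, 1].mean()

print(avg_hours)        # 20.0
print(avg_hours <= 25)  # True for this sample
```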
More education years – Better paying job
Finally, let’s check whether a better-paying job goes along with the number of years a person spent on their education.
To do so, let’s separate the records with more than ten years of education from those with fewer than ten years (records with exactly ten years fall into neither group). This can also be done using boolean indexing and masking.
# Data where education years > 10
>>> high = census[census[:,1]>10,:]
# Data where education years < 10
>>> low = census[census[:,1]<10,:]
As a result, we get two NumPy arrays with filtered data. Since Income is coded as 0 or 1, the mean of that column is the share of people earning more than 50K, so we can compute it for both groups and compare the results.
# Income is at index 6 in the feature list above; index 7 holds Hours-per-week
>>> avg_pay_high = np.mean(high[:,6])
>>> avg_pay_low = np.mean(low[:,6])
>>> avg_pay_high > avg_pay_low
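Because Income is coded as 0 or 1 in the key above, averaging that column gives the fraction of rows with the value 1. The sketch below illustrates this on a made-up two-column array (education years, income flag); the names and numbers are assumptions for illustration, not values from the census data.

```python
import numpy as np

# Made-up rows: [education-num, income flag (0 or 1)]
data = np.array([
    [13.0, 1.0],
    [ 9.0, 0.0],
    [14.0, 1.0],
    [ 7.0, 0.0],
    [12.0, 0.0],
    [ 9.0, 1.0],
    [16.0, 1.0],
    [10.0, 0.0],   # exactly ten years: lands in neither group
])

high = data[data[:, 0] > 10]
low = data[data[:, 0] < 10]

# Mean of a 0/1 column = share of rows with the value 1
share_high = high[:, 1].mean()
share_low = low[:, 1].mean()

print(share_high, share_low)
print(share_high > share_low)
```

A comparison like this only shows that the two go together in the sample; it does not by itself show that more education causes higher pay.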
In conclusion, when looking at the data at hand, the first step should always be to understand what each value means. Once that is done, work out the questions you want the data to answer. As you start answering them, every answer will bring up a whole lot of new questions that need answers too.
Since NumPy provides various built-in methods and concepts like vectorization, broadcasting, and indexing, you can focus on answering those questions rather than on how to code the solutions. NumPy handles most of that for you.