Introduction
Resume screening is the process of determining whether a candidate is qualified for a role based on the education, experience, and other information captured on their resume. It is a form of pattern matching between a job’s requirements and a candidate’s qualifications.
The goal of screening resumes is to decide whether to move a candidate forward – usually onto an interview – or to reject them.
Why do we need Resume Screening?
- For each opening, companies collect resumes and referrals and go through them manually.
- Companies often receive thousands of resumes for every job posting.
- Once the resumes are collected, they are categorized according to the requirements and sent to the hiring teams.
- Reading one or two resumes is no problem, but it is very difficult for a hiring team to go through thousands of them and select the best match.
- To solve this problem, we will screen resumes using machine learning and NLP in Python, so that days of work can be completed in a few minutes.
Modules Description
For this blog, we will be using KNN with TfidfVectorizer to build our automated resume screening system.
KNN (K Nearest Neighbor algorithm)
The K Nearest Neighbor algorithm falls under the supervised learning category and is used for classification (most commonly) and regression. As the name suggests, it considers the K nearest neighbors (data points) to predict the class or continuous value for a new data point.
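To make the idea concrete, here is a minimal sketch of KNN classification with scikit-learn. The 2-D points and labels are made up for illustration and have nothing to do with the resume data.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points (invented for illustration): two well-separated groups.
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
     [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y = [0, 0, 0, 1, 1, 1]

# With k=3, each prediction is a majority vote among the 3 closest training points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

new_points = [[1.1, 1.0], [5.1, 5.0]]
predicted = knn.predict(new_points)
print(predicted)  # [0 1]
```

Each new point takes the label shared by most of its three nearest neighbors, so the point near (1, 1) lands in class 0 and the point near (5, 5) in class 1.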
TfidfVectorizer
TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency. It is a very common way to transform text: each document is converted into a numeric vector in which a term’s weight grows with how often it appears in the document and shrinks with how many documents contain it. These vectors are then used to fit machine learning algorithms for prediction.
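The short sketch below shows the effect on three made-up sentences (not taken from the dataset). It uses TfidfVectorizer with default settings, which differ from the settings used later in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented one-line "documents".
docs = [
    "python developer with machine learning experience",
    "java developer with spring experience",
    "hr manager with recruitment experience",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

vocab = vectorizer.vocabulary_           # maps each term to its column index
row = tfidf[0].toarray().ravel()         # weights for the first document

print(tfidf.shape)
# "experience" occurs in every document, "python" in only one,
# so "python" receives the larger weight in the first document.
print(row[vocab["python"]], row[vocab["experience"]])
```

Terms that appear everywhere, such as "experience" here, carry little information about a document's category, which is why they are weighted down.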
Building of Resume Screening System
In this section, we will see the step-by-step implementation of resume screening using Python.
Data Used
We use a publicly available dataset from Kaggle. You can download it from the link below.
https://www.kaggle.com/gauravduttakiit/resume-dataset
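Once downloaded, the CSV can be loaded with pandas. The sketch below assumes the file contains the ‘Category’ and ‘Resume’ columns used in the rest of this post. The helper name load_resumes and the file path in the comment are placeholders, not part of the original code.

```python
import pandas as pd

def load_resumes(csv_source):
    """Read the resume CSV and check that the columns used in this post exist."""
    df = pd.read_csv(csv_source)
    missing = {"Category", "Resume"} - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return df

# Hypothetical usage; replace the path with wherever you saved the download:
# resumeDataSet = load_resumes("path/to/downloaded_file.csv")
# print(resumeDataSet["Category"].value_counts())
```

Checking the columns up front gives a clear error message if a different file is loaded by mistake.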
Data Preprocessing
Step 1: Clean the ‘Resume’ column
In this step, we remove any unnecessary information from resumes like URLs, hashtags, and special characters.
import re

def cleanResume(resumeText):
    resumeText = re.sub(r'http\S+\s*', ' ', resumeText)  # remove URLs
    resumeText = re.sub('RT|cc', ' ', resumeText)  # remove RT and cc
    resumeText = re.sub(r'#\S+', '', resumeText)  # remove hashtags
    resumeText = re.sub(r'@\S+', ' ', resumeText)  # remove mentions
    resumeText = re.sub('[%s]' % re.escape("""!"#$%&'()*+,-./:;<=>?@[]^_`{|}~"""), ' ', resumeText)  # remove punctuation
    resumeText = re.sub(r'[^\x00-\x7f]', ' ', resumeText)  # remove non-ASCII characters
    resumeText = re.sub(r'\s+', ' ', resumeText)  # collapse extra whitespace
    return resumeText

# resumeDataSet is the DataFrame loaded from the Kaggle CSV
resumeDataSet['cleaned_resume'] = resumeDataSet.Resume.apply(cleanResume)
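To see what these substitutions do, the sketch below applies the same kind of patterns, in order, to a made-up string. The names CLEANING_STEPS and clean_text are illustrative only, and string.punctuation is used as a stand-in for the explicit punctuation list.

```python
import re
import string

# (pattern, replacement) pairs mirroring the cleaning steps described above.
CLEANING_STEPS = [
    (r"http\S+\s*", " "),                           # URLs
    ("RT|cc", " "),                                 # RT and cc
    (r"#\S+", ""),                                  # hashtags
    (r"@\S+", " "),                                 # mentions
    ("[%s]" % re.escape(string.punctuation), " "),  # punctuation
    (r"[^\x00-\x7f]", " "),                         # non-ASCII characters
    (r"\s+", " "),                                  # runs of whitespace
]

def clean_text(text):
    """Apply each substitution in order and return the cleaned string."""
    for pattern, replacement in CLEANING_STEPS:
        text = re.sub(pattern, replacement, text)
    return text

# Invented sample text containing a hashtag, a URL, and a mention.
sample = "Skills: Python, SQL #datascience see https://example.com/cv @recruiter"
cleaned = clean_text(sample)
print(repr(cleaned))
```

The order matters: URLs, hashtags, and mentions are removed before punctuation, because stripping the ‘#’, ‘@’, and ‘:’ characters first would leave their text behind.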

Step 2: Encoding ‘Category’
Now we will encode the ‘Category’ column using label encoding. Even though ‘Category’ is nominal data, we use LabelEncoder because it is our target column. After encoding, each category becomes a class, and we will build a multiclass classification model.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Map each category name to an integer class label
resumeDataSet['Category'] = le.fit_transform(resumeDataSet['Category'])
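As a small illustration of what the encoder does, the sketch below runs LabelEncoder on a made-up list of category names (the names are examples, not a claim about the dataset's contents).

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample of category names.
categories = ["Java Developer", "HR", "Data Science", "HR"]

le = LabelEncoder()
encoded = le.fit_transform(categories)

print(list(le.classes_))                     # classes are sorted alphabetically
print(list(encoded))                         # integer label for each input
print(list(le.inverse_transform(encoded)))   # back to the original names
```

Keeping the fitted encoder around is useful later: inverse_transform turns the model's integer predictions back into readable category names.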
Step 3: Preprocessing the ‘cleaned_resume’ column
Here we will convert the ‘cleaned_resume’ column into vectors. There are many ways to do that, such as ‘Bag of Words’, ‘Tf-Idf’, ‘Word2Vec’, or a combination of these methods.
We will be using the ‘Tf-Idf’ method to get the vectors in this approach.
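from sklearn.feature_extraction.text import TfidfVectorizer

requiredText = resumeDataSet['cleaned_resume'].values
requiredTarget = resumeDataSet['Category'].values

word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,     # use 1 + log(tf) instead of raw term counts
    stop_words='english',  # drop common English stop words
    max_features=1500)     # keep the 1500 most frequent terms
word_vectorizer.fit(requiredText)
WordFeatures = word_vectorizer.transform(requiredText)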
After this step, we have ‘WordFeatures’ as the feature vectors and ‘requiredTarget’ as the target.
Model Building
We will be using the ‘One vs Rest’ method with ‘KNeighborsClassifier’ to build this multiclass classification model.
We will use 80% of the data for training and 20% for testing. Let’s split the data into training and test sets.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    WordFeatures, requiredTarget, random_state=0, test_size=0.2)
print(X_train.shape)
print(X_test.shape)
Output:
(769, 1500)
(193, 1500)
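Now that we have training and test data, let’s build the model.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

clf = OneVsRestClassifier(KNeighborsClassifier())
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)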
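To classify a resume the model has not seen, the new text has to go through the same fitted vectorizer before it reaches the classifier. The sketch below shows this on a tiny made-up corpus, using a scikit-learn pipeline to keep the two steps together. The texts, labels, and the choice of n_neighbors=1 (needed because the toy corpus is so small) are assumptions for illustration, not the settings used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Invented mini-corpus standing in for cleaned resumes and their categories.
texts = [
    "python machine learning pandas statistics",
    "deep learning python tensorflow models",
    "recruitment payroll employee relations onboarding",
    "hiring interviews payroll employee engagement",
    "java spring hibernate microservices backend",
    "java servlets spring rest backend",
]
labels = ["Data Science", "Data Science", "HR", "HR",
          "Java Developer", "Java Developer"]

# The pipeline fits the vectorizer and the classifier together, so new text
# is always transformed with the vocabulary learned from the training data.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(KNeighborsClassifier(n_neighbors=1)),
)
model.fit(texts, labels)

new_resume = "experienced backend engineer using java and spring"
predicted = model.predict([new_resume])
print(predicted[0])  # Java Developer
```

With the code from this post, the equivalent would be to run cleanResume on the new text, call word_vectorizer.transform on it (not fit), and pass the result to clf.predict.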
Results
Let’s look at the results.
print('Accuracy of KNeighbors Classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of KNeighbors Classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
Output:
Accuracy of KNeighbors Classifier on training set: 0.99
Accuracy of KNeighbors Classifier on test set: 0.99
The results are very good: the model assigns resumes to the correct category with 99% accuracy on both the training and test sets.
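A single accuracy number can hide weak categories, so it may be worth looking at per-class results as well. The sketch below uses scikit-learn's confusion_matrix and classification_report on made-up labels; in practice y_true and y_pred would be y_test and prediction from the steps above.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels standing in for y_test and prediction.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision, recall, and F1 score for each class.
print(classification_report(y_true, y_pred, zero_division=0))
```

In this toy example the only mistake is one class-1 sample predicted as class 2, which shows up as the off-diagonal entry in the second row of the matrix.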
Conclusion
In this blog, we learned how the K Nearest Neighbor algorithm can be applied to build a resume screening system. We classified almost 1,000 resumes into their respective categories in a few minutes with 99% accuracy.