Best Approach For Resume screening by Machine Learning-Part 1

Resume screening is the process of determining whether a candidate is qualified for a role based on his or her education, experience, and other information captured on their resume. It’s a form of pattern matching between a job’s requirements and the qualifications of a candidate based on their resume.

The goal of screening resumes is to decide whether to move a candidate forward – usually onto an interview – or to reject them.

Why do we need Resume Screening?

  • For each opening, companies collect resumes and referrals and review them manually.
  • Companies often receive thousands of resumes for every job posting.
  • After collecting resumes, companies categorize them according to their requirements and send them to the hiring teams.
  • Reading one or two resumes is no problem, but it is very difficult for hiring teams to go through thousands of resumes and select the best ones.
  • To solve this problem, we will screen resumes using machine learning and NLP in Python, so that days of work can be completed in a few minutes.

Modules Description

For this blog, we will use KNN with TfidfVectorizer to build our automated resume screening system.

KNN (K Nearest Neighbor algorithm)

The K Nearest Neighbor algorithm falls under the supervised learning category and is used for classification (most commonly) and regression. As the name suggests, it considers the K nearest neighbors (data points) to predict the class or continuous value for a new data point.
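To make the idea concrete, here is a minimal sketch of KNN classification in plain Python. It is only an illustration, not the scikit-learn implementation used later in this blog; the function name `knn_predict`, the 2-D points, and the labels are hypothetical toy values.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent label among the neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["Data Science", "Data Science", "Data Science", "Testing", "Testing", "Testing"]

print(knn_predict(train_points, train_labels, (1.1, 0.9), k=3))
```

In the resume setting, each point would be a TF-IDF vector of a resume instead of a 2-D coordinate.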


TF-IDF (Term Frequency Inverse Document Frequency)

TF-IDF is an abbreviation for Term Frequency Inverse Document Frequency. It is a very common technique for transforming text: the text is converted into a meaningful numerical representation, which is then used to fit machine learning algorithms for prediction.
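As a rough illustration, the sketch below computes the textbook form of TF-IDF (term frequency times log of inverse document frequency) in plain Python. The function name `tf_idf` and the three sample sentences are hypothetical. scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini "resumes".
docs = [
    "python developer with machine learning experience",
    "java developer with spring experience",
    "manual testing and automation testing",
]
weights = tf_idf(docs)

# "python" appears in one document, "developer" in two, so "python" is weighted higher.
print(weights[0]["python"] > weights[0]["developer"])
```

Terms that are frequent in one resume but rare across the collection get the highest weights, which is what makes them useful for telling categories apart.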

Building of Resume Screening System

In this section, we will walk through the step-by-step implementation of resume screening using Python.

Data Used

We have publicly available data from Kaggle. You can download the data using the below link.

Data Preprocessing

Step 1: Clean the ‘Resume’ column

In this step, we remove any unnecessary information from resumes like URLs, hashtags, and special characters.

import re
import string

def cleanResume(resumeText):
    resumeText = re.sub(r'http\S+\s*', ' ', resumeText)  # remove URLs
    resumeText = re.sub(r'RT|cc', ' ', resumeText)  # remove RT and cc
    resumeText = re.sub(r'#\S+', '', resumeText)  # remove hashtags
    resumeText = re.sub(r'@\S+', '  ', resumeText)  # remove mentions
    resumeText = re.sub('[%s]' % re.escape(string.punctuation), ' ', resumeText)  # remove punctuation
    resumeText = re.sub(r'[^\x00-\x7f]', ' ', resumeText)  # replace non-ASCII characters
    resumeText = re.sub(r'\s+', ' ', resumeText)  # collapse extra whitespace
    return resumeText

resumeDataSet['cleaned_resume'] = resumeDataSet.Resume.apply(cleanResume)

Step 2: Encoding ‘Category’

Now, we will encode the ‘Category’ column using label encoding. Even though ‘Category’ is nominal data, we use LabelEncoder because it is our target column. After label encoding, each category becomes a class, and we will build a multiclass classification model.

from sklearn.preprocessing import LabelEncoder

var_mod = ['Category']
le = LabelEncoder()
for i in var_mod:
    resumeDataSet[i] = le.fit_transform(resumeDataSet[i])
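The sketch below mimics what label encoding does, in plain Python: the unique category names are sorted and each one is mapped to an integer. The sample category list is hypothetical. With the real encoder above, `le.classes_` holds the sorted names and `le.inverse_transform` maps integers back to names.

```python
# Hypothetical sample of the 'Category' column.
categories = ["Testing", "Data Science", "HR", "Data Science"]

# Sort the unique names and assign each one an integer id (0, 1, 2, ...).
classes = sorted(set(categories))
to_id = {name: index for index, name in enumerate(classes)}

encoded = [to_id[name] for name in categories]   # names -> integers
decoded = [classes[index] for index in encoded]  # integers -> names

print(classes)
print(encoded)
```

Keeping the mapping around matters later: the model predicts integers, and they have to be translated back into category names for the hiring team.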

Step 3: Preprocessing the ‘cleaned_resume’ column

Here we will preprocess the ‘cleaned_resume’ column and convert it into vectors. There are many ways to do that, such as ‘Bag of Words’, ‘Tf-Idf’, ‘Word2Vec’, or a combination of these methods.

We will be using the ‘Tf-Idf’ method to get the vectors in this approach.

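from sklearn.feature_extraction.text import TfidfVectorizer

requiredText = resumeDataSet['cleaned_resume'].values
requiredTarget = resumeDataSet['Category'].values
# max_features=1500 is an assumption inferred from the 1500-column shapes shown later; other arguments were lost
word_vectorizer = TfidfVectorizer(max_features=1500)
WordFeatures = word_vectorizer.fit_transform(requiredText)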

After this step, we have ‘WordFeatures’ as the feature vectors and ‘requiredTarget’ as the target.

Model Building

We will be using the ‘One vs Rest’ method with ‘KNeighborsClassifier’ to build this multiclass classification model.
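As a rough sketch of the ‘One vs Rest’ idea, the plain-Python example below treats each category as its own binary problem ("this category" vs. "everything else"), scores it by the fraction of the k nearest neighbors that belong to it, and then picks the category with the highest score. The function name `one_vs_rest_knn` and the toy points are hypothetical; scikit-learn's OneVsRestClassifier is the actual implementation used in this blog.

```python
import math

def one_vs_rest_knn(train_points, train_labels, query, k=3):
    """Score each category as a separate binary KNN problem and return the best one."""
    # Find the k training examples closest to the query.
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]

    scores = {}
    for category in sorted(set(train_labels)):
        # Binary view: a neighbor is "positive" only if it belongs to this category.
        votes = sum(1 for _, label in nearest if label == category)
        scores[category] = votes / k

    # The category whose binary problem is most confident wins.
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy data: three clusters, one per category.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 0), (10, 1), (11, 0)]
train_labels = ["Data Science"] * 3 + ["HR"] * 3 + ["Testing"] * 3

best, scores = one_vs_rest_knn(train_points, train_labels, (5.2, 5.1), k=3)
print(best, scores)
```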

We will use 80% of the data for training and 20% for validation. Let’s split the data into training and test sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(WordFeatures, requiredTarget, random_state=0, test_size=0.2)
print(X_train.shape)
print(X_test.shape)

(769, 1500)
(193, 1500)

Now that we have training and test data, let’s build the model.

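from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
clf = OneVsRestClassifier(KNeighborsClassifier()).fit(X_train, y_train)
prediction = clf.predict(X_test)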


Let’s see the results we have

print('Accuracy of KNeighbors Classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of KNeighbors Classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))


Accuracy of KNeighbors Classifier on training set: 0.99
Accuracy of KNeighbors Classifier on test set: 0.99

The results are strong: the model predicts the category of a given resume with 99% accuracy on both the training and test sets.
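Overall accuracy can hide categories that the model handles poorly, so it may be worth checking each category separately. The sketch below is a plain-Python illustration with hypothetical toy labels; with the real model, `y_test` and `prediction` from above would take the place of the toy lists, and scikit-learn's `classification_report` gives similar per-class numbers.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its examples predicted correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical encoded labels: class 2 is rare and half of it is misclassified.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 2, 0]

print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```

In this toy case the overall accuracy looks high while one category is only half right, which is exactly the situation a per-class check is meant to reveal.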


In this blog, we learned how the K Nearest Neighbor machine learning algorithm can be applied to build a resume screening system. We classified almost 1000 resumes into their respective categories in a few minutes with 99% accuracy.

Written by 

Tanishka Garg is a Software Consultant working in AI/ML domain.