MachineX: Heart Disease Detection Using Machine Learning


In this blog, we will see how machine learning and data science can be used to detect or predict potential heart disease.


Heart disease describes a range of conditions that affect your heart. Diseases under the heart disease umbrella include blood vessel diseases, such as coronary artery disease, heart rhythm problems (arrhythmias) and heart defects you’re born with (congenital heart defects), among others.

The term “heart disease” is often used interchangeably with the term “cardiovascular disease”. This refers to conditions that involve narrowed or blocked blood vessels that can lead to a heart attack, chest pain (angina) or stroke.

Other heart conditions, such as those that affect your heart’s muscle, valves or rhythm, also are considered forms of heart disease.

Machine learning is used across many fields around the world, and the healthcare industry is no exception. It can play an essential role in predicting the presence or absence of locomotor disorders, heart disease and more. Such information, if predicted well in advance, can give doctors important insights, so that they can adapt their diagnosis and treatment on a per-patient basis.


Detecting heart diseases is one of the major problems facing doctors and healthcare specialists nowadays.

Preventing heart disease is important. Good data-driven systems for predicting heart disease can improve the entire research and prevention process, making sure that more people can live healthy lives.

In the United States, the Centers for Disease Control and Prevention is a good resource for information about heart disease. According to their website:

  • About 610,000 people die of heart disease in the United States every year–that’s 1 in every 4 deaths.
  • Heart disease is the leading cause of death for both men and women. More than half of the deaths due to heart disease in 2009 were in men.
  • Coronary heart disease (CHD) is the most common type of heart disease, killing over 370,000 people annually.
  • Every year about 735,000 Americans have a heart attack. Of these, 525,000 are a first heart attack and 210,000 happen in people who have already had a heart attack.
  • Heart disease is the leading cause of death for people of most ethnicities in the United States. For American Indians or Alaska Natives and Asians or Pacific Islanders, heart disease is second only to cancer.

I am neither a doctor nor a healthcare expert. Still, this type of data analysis and machine learning solution can be useful to doctors as a second opinion.

Data Description

The dataset has been taken from Kaggle.

There are a total of 13 features and 1 target variable. There are also no missing values, so we don’t need to handle any nulls.
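The "no missing values" check can be sketched as follows. The Kaggle file itself is not reproduced here, so this uses a hypothetical stand-in DataFrame with 13 randomly generated features and a binary target; the column names and the row count of 300 are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle data: 13 numeric features plus a
# binary target. Names, values, and row count are illustrative only.
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(1, 14)]
df = pd.DataFrame(rng.normal(size=(300, 13)), columns=feature_names)
df["target"] = rng.integers(0, 2, size=300)

# Count nulls per column, then overall.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

print(df.shape)
print(total_missing)
```

With the real dataset, the same two calls (`df.isnull().sum()` and `df.shape`) would be run on the DataFrame loaded from the downloaded file.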

This is what the dataset looks like:


I took four algorithms, varied their parameters, and compared the final models. I split the dataset into 67% training data and 33% testing data.
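A minimal sketch of a 67/33 split, assuming scikit-learn's `train_test_split`. The data is synthetic (generated with `make_classification`), and both the sample count and the `random_state` values are assumptions chosen only to make the example reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 samples with 13 features (not the real heart data).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Hold out 33% of the rows for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

print(X_train.shape, X_test.shape)
```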

The project involved analysing the heart disease patient dataset with proper data processing. I then trained four models and tested them. Their maximum scores were as follows:

K Neighbors Classifier: 87%

The classification score varies with the number of neighbors we choose. So I’ll plot the score for different values of K (neighbors) and check where the best score is achieved.

From the plot, it is clear that the maximum score of 0.87 was achieved with 8 neighbors.
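The sweep over K can be sketched like this, assuming scikit-learn's `KNeighborsClassifier`. The data is synthetic and the range of 1 to 20 neighbors is an assumption, so the scores printed here will not match the 0.87 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart disease data (13 features).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Fit one model per value of K and record its accuracy on the test set.
knn_scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    knn_scores[k] = model.score(X_test, y_test)

# Pick the K with the highest test accuracy.
best_k = max(knn_scores, key=knn_scores.get)
print(best_k, round(knn_scores[best_k], 2))
```

The `knn_scores` dictionary holds exactly the values a score-versus-K plot would show.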

Support Vector Classifier: 83%

There are several kernels for the Support Vector Classifier. I’ll test some of them and check which one gives the best score.

The linear kernel performed the best, scoring slightly better than the rbf kernel.
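A kernel comparison along these lines might look as follows, assuming scikit-learn's `SVC`. The list of four kernels is an assumption (the post only names linear and rbf), and the data is synthetic, so the ranking printed here need not match the one above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the heart disease data (13 features).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Train one classifier per kernel and record its test accuracy.
kernels = ["linear", "poly", "rbf", "sigmoid"]  # assumed set of kernels
svc_scores = {}
for kernel in kernels:
    model = SVC(kernel=kernel)
    model.fit(X_train, y_train)
    svc_scores[kernel] = model.score(X_test, y_test)

best_kernel = max(svc_scores, key=svc_scores.get)
print(best_kernel, round(svc_scores[best_kernel], 2))
```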

Decision Tree Classifier: 79%

Here, I’ll use the Decision Tree Classifier to model the problem at hand. I’ll try a range of values for max_features and see which gives the best accuracy.

The model achieved its best accuracy at three values of maximum features: 2, 4 and 18.
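The `max_features` sweep can be sketched as below, assuming scikit-learn's `DecisionTreeClassifier`. The synthetic data here has only 13 columns, so the sweep stops at 13; `max_features` cannot exceed the number of columns the model is trained on. The fixed `random_state` is an assumption added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (13 features).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Try every allowed value of max_features, from 1 up to the column count.
n_features = X.shape[1]
dt_scores = {}
for m in range(1, n_features + 1):
    model = DecisionTreeClassifier(max_features=m, random_state=0)
    model.fit(X_train, y_train)
    dt_scores[m] = model.score(X_test, y_test)

# Several values can tie for the best accuracy, so collect all of them.
best_score = max(dt_scores.values())
best_values = [m for m, s in dt_scores.items() if s == best_score]
print(best_values, round(best_score, 2))
```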

Random Forest Classifier: 84%

Now, I’ll use an ensemble method, the Random Forest Classifier, to create the model and vary the number of estimators to see its effect.

The maximum score is achieved when the number of estimators is 100 or 500.
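A sketch of varying the number of estimators, assuming scikit-learn's `RandomForestClassifier`. The list of estimator counts is an assumption (the post only mentions 100 and 500), and the data is synthetic, so the scores will differ from those reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (13 features).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Train one forest per estimator count and record its test accuracy.
estimator_counts = [10, 100, 200, 500]  # assumed values for illustration
rf_scores = {}
for n in estimator_counts:
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    rf_scores[n] = model.score(X_test, y_test)

for n, score in rf_scores.items():
    print(n, round(score, 2))
```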

Overall, the K Neighbors Classifier achieved the best score, 87%, with 8 neighbors.

Happy learning 🙂

Follow MachineX for more.

Written by 

Shubham Goyal is a Data Scientist at Knoldus Inc. He is also an artificial intelligence researcher, interested in research on problems across different domains, and a regular contributor to the community through blogs and webinars on machine learning and artificial intelligence. He has also written a few research papers on machine learning. In addition, he is a conference speaker and an official author at Towards Data Science.