Music Genre Classification: Identification Of The Audio


In this post, we will build a music genre classification model that predicts the genre label of a song.

Music Genre Classification

Today we will build a TensorFlow Sequential model that automatically classifies input audio files into musical genres.


To train our ML classifier to predict the genre of an audio file, we will use the GTZAN dataset.
The dataset can be downloaded from Kaggle.
The dataset contains 2 directories and 2 CSV files.

  • genres_original: Collection of 1000 audio files, 10 genres consisting of 100 audio files in each genre.
    • Blues
    • Classical
    • Country
    • Disco
    • Hip-hop
    • Jazz
    • Metal
    • Pop
    • Reggae
    • Rock
  • images_original: A visual representation of each audio file. The audio files were converted to Mel spectrograms so that they can also be classified with image-based neural networks.
  • features_30_sec.csv: Containing features of the audio files (30 seconds long). Mean and variance are computed over multiple features that can be extracted from an audio file.
  • features_3_sec.csv: Same structure as above, but each song was first split into 3-second segments (giving 10 times as much data for the classification model).
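
To make the "10 times the data" point concrete, here is a minimal sketch of cutting one 30-second clip into 3-second segments. The function name split_into_segments is hypothetical, and the sample rate of 22050 Hz is an assumption (it is the default that librosa.load resamples to); this is not necessarily how the CSV was produced.

```python
import numpy as np


def split_into_segments(signal, sample_rate, segment_seconds=3):
    # Number of samples in one segment.
    segment_len = sample_rate * segment_seconds
    # Drop any leftover samples that do not fill a whole segment.
    n_segments = len(signal) // segment_len
    return signal[: n_segments * segment_len].reshape(n_segments, segment_len)


sample_rate = 22050                      # assumed sample rate
clip = np.zeros(30 * sample_rate)        # stand-in for a 30-second song
segments = split_into_segments(clip, sample_rate)
print(segments.shape)                    # (10, 66150): 10 segments of 3 seconds
```

Each row can then be passed through the same feature extraction as a full clip, which is why the 3-second CSV has roughly ten rows per song.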

Feature Extraction

The first step of most machine learning projects is data preprocessing.
To build the music genre classification model, we will extract features from the audio files, keeping the components that carry useful information and discarding noise.

Mel Frequency Cepstral Coefficients

In sound processing, the Mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. Mel-frequency cepstral coefficients are coefficients that collectively make up an MFC.

The MFCC feature extraction technique consists of windowing the signal, applying the DFT, mapping the power spectrum onto the Mel scale with a filterbank, taking the log of the filterbank energies, and then applying the DCT.

  • Audio signals change constantly, so we divide them into short frames, each around 20-40 ms long.
  • We identify the frequencies present in each frame.
  • We separate the informative frequencies from the noise.
  • To discard the noise, we take the discrete cosine transform (DCT) of these frequencies and keep only the coefficients that are most likely to carry information.
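
The steps above can be sketched in plain NumPy. This is an illustration only, using the common ordering (frame, window, power spectrum, Mel filterbank, log, DCT); it is not what librosa does internally, and the names simple_mfcc and mel_filterbank, the 25 ms frame, the 10 ms hop, and the 13 coefficients are all assumptions chosen for the example.

```python
import numpy as np


def hz_to_mel(freq_hz):
    # A widely used Mel-scale formula (one of several variants).
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose centres are evenly spaced on the Mel scale.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(n_filters):
        left, centre, right = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - left) / (centre - left)
        falling = (right - bin_freqs) / (right - centre)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def simple_mfcc(signal, sample_rate, frame_ms=25, hop_ms=10,
                n_filters=40, n_mfcc=13):
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len

    # 1. Cut the signal into overlapping frames and apply a window.
    starts = hop_len * np.arange(n_frames)[:, None]
    frames = signal[starts + np.arange(frame_len)[None, :]]
    frames = frames * np.hamming(frame_len)

    # 2. Power spectrum of each frame (DFT).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # 3. Mel filterbank energies, then log.
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    log_energies = np.log(power @ bank.T + 1e-10)

    # 4. DCT-II; keep only the first n_mfcc coefficients.
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_filters)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_energies @ dct_basis.T


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate       # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)           # a 440 Hz test tone
mfcc = simple_mfcc(tone, sample_rate)
print(mfcc.shape)                              # (n_frames, n_mfcc)
```

In the rest of this post we rely on librosa for this step, which handles the details (frame sizes, normalisation, resampling) more carefully than this sketch.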

Let’s code now.

# importing necessary libraries for feature extraction
import os

import librosa
import librosa.display
import numpy as np
import pandas as pd
from tqdm import tqdm

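# function to extract the scaled MFCC features for a given input file
# returns a 1-D array of MFCC values averaged over all frames of the file
def feature_extractor(input_file):
    # 'kaiser_fast' is a faster, lower-quality resampling mode
    audio, sample_rate = librosa.load(input_file, res_type='kaiser_fast')
    # n_mfcc=40 matches the model's input_shape=(40,) used later
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    # average over time so every file yields a fixed-length vector
    mfccs_features_scaled = np.mean(mfccs_features.T, axis=0)
    return mfccs_features_scaled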

# function to run the extraction over all the audio files present in
# the dataset.
# returns a dataframe containing the scaled MFCC features for each audio file
# METADATA_PATH and DATASET_PATH are assumed to be defined elsewhere (the
# metadata CSV and the genres_original directory, respectively)
def feature_extraction():
    metadata = pd.read_csv(METADATA_PATH)
    extracted_features = []
    for index, row in tqdm(metadata.iterrows(), desc='Extracting'):
        try:
            class_labels = row['label']
            # assumes the metadata CSV has a 'filename' column,
            # as features_30_sec.csv does
            input_file_name = os.path.join(os.path.abspath(DATASET_PATH),
                                           class_labels,
                                           str(row['filename']))
            extracted_data = feature_extractor(input_file_name)
            extracted_features.append([extracted_data, class_labels])
        except Exception as e:
            # skip files that cannot be read instead of aborting the run
            print(f'Error Occurred: {e}')
    extracted_features_df = pd.DataFrame(extracted_features,
                                         columns=['feature', 'class'])
    return extracted_features_df

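The output data frame has one row per audio file: the feature column holds the 40 averaged MFCC values and the class column holds the genre label.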
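
Averaging the MFCC matrix over time is what lets clips of different lengths share one fixed-length feature vector. Here is a small sketch of that step on random stand-in data; the name mean_pool_mfcc and the frame counts are made up for illustration.

```python
import numpy as np


def mean_pool_mfcc(mfcc_matrix):
    # mfcc_matrix has shape (n_mfcc, n_frames), the layout that
    # librosa.feature.mfcc returns. Averaging over frames gives (n_mfcc,).
    return np.mean(mfcc_matrix.T, axis=0)


rng = np.random.default_rng(0)
short_clip = rng.normal(size=(40, 130))     # fewer frames
long_clip = rng.normal(size=(40, 1293))     # more frames
print(mean_pool_mfcc(short_clip).shape)     # (40,)
print(mean_pool_mfcc(long_clip).shape)      # (40,)
```

The trade-off is that the average throws away how the sound evolves over time; it keeps the model simple at the cost of temporal detail.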

Split dataset

Now, let’s split the dataset.
We use a 9:1 split: 90% of the data is used for training and 10% for testing.
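
The splitting code itself is not shown in this post, so the following is only a minimal NumPy sketch of the idea: encode the genre names as integers, one-hot encode them, shuffle, and hold out 10%. The function name encode_and_split, the seed, and the toy data are assumptions; in practice scikit-learn's LabelEncoder and train_test_split together with Keras's to_categorical do the same job.

```python
import numpy as np


def encode_and_split(features, labels, test_fraction=0.1, seed=42):
    # Stack per-file feature vectors into one (n_samples, n_features) array.
    X = np.stack(features)
    # Map genre names to integers 0..n_classes-1 (sorted alphabetically).
    classes, y_int = np.unique(labels, return_inverse=True)
    # One-hot encode so each label becomes a row of length n_classes.
    y = np.eye(len(classes))[y_int]
    # Shuffle, then hold out the requested fraction for testing.
    order = np.random.default_rng(seed).permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx], classes


# Toy stand-in for the extracted features: 20 files, 40 values each.
toy_features = [np.full(40, float(i)) for i in range(20)]
toy_labels = ["blues", "rock"] * 10
X_train, X_test, y_train, y_test, classes = encode_and_split(toy_features,
                                                             toy_labels)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Keeping the classes array around matters later: it is the mapping needed to turn a predicted index back into a genre name.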

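With 40 MFCC values per file and 10 genres, the feature arrays have shape (n, 40) and the one-hot label arrays have shape (n, 10), where n is about 900 for training and about 100 for testing.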

Model Training

Now, let’s build and train our classification model.
We are going to use the Sequential model to build our classifier.
A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
The Keras documentation covers the Sequential model in more detail.
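
To see what "a plain stack of layers" means, here is a rough NumPy sketch of a forward pass through fully connected layers of the same sizes as our model. It is not Keras code; the ReLU activation for the hidden layers, the random weights, and the function names are assumptions made for illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def softmax(x):
    # Subtract the row maximum for numerical stability.
    exps = np.exp(x - x.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)


def init_stack(layer_sizes, seed=0):
    # One (weights, bias) pair per layer; random weights stand in for
    # values that training would normally learn.
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.05, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]


def forward(x, params):
    # Each layer takes exactly one tensor in and passes one tensor on.
    for weights, bias in params[:-1]:
        x = relu(x @ weights + bias)
    weights, bias = params[-1]
    return softmax(x @ weights + bias)


layer_sizes = [40, 1024, 512, 256, 128, 64, 32, 10]
params = init_stack(layer_sizes)
batch = np.random.default_rng(1).normal(size=(3, 40))   # 3 fake MFCC vectors
probabilities = forward(batch, params)
print(probabilities.shape)          # (3, 10): one probability per genre
```

The softmax in the last layer turns the 10 outputs into probabilities that sum to 1, which is why the predicted genre is simply the index of the largest value.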

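from datetime import datetime

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

EPOCHS = 100
NUM_LABELS = y_train.shape[1]

# Defining the model layers
# Only the first layer needs input_shape (40 MFCC values per file).
# activation="relu" is an assumption for the hidden layers.
model = Sequential()
model.add(Dense(1024, input_shape=(40,), activation="relu"))
model.add(Dense(512, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(64, activation="relu"))
model.add(Dense(32, activation="relu"))

# Final Layer
model.add(Dense(NUM_LABELS, activation="softmax"))

# The model must be compiled before training. Categorical cross-entropy
# matches the one-hot labels; the Adam optimizer is an assumption.
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Defining Checkpointer for callbacks
# (the file path and save_best_only=True are hypothetical placeholders)
check_pointer = ModelCheckpoint(filepath="genre_classifier.keras",
                                save_best_only=True, verbose=1)

# Storing Start time
start_time = datetime.now()

# Training model
history_model = model.fit(X_train, y_train,
                          epochs=EPOCHS,
                          validation_data=(X_test, y_test),
                          callbacks=[check_pointer])

print(f'Total Time taken in training is {datetime.now() - start_time}')
print("-----Model Evaluation----")
print(model.evaluate(X_test, y_test, verbose=0))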

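The model reaches 95% accuracy on the training data. Note that the model.evaluate call above reports the loss and metrics on the held-out test data, which is the better indicator of how the model generalises.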

Making Predictions

Now that we have trained our model, it’s time to test it and make some predictions.

To test the model, we are going to pick a song whose genre is Blues.
Let’s see what genre the model predicts for our song.

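# predicting genre of audio
# mfccs_scaled_features is assumed to be the output of feature_extractor()
# for the test file, reshaped to (1, 40) so it forms a batch of one sample

# predicting the label for test audio
label_predicted = np.argmax(model.predict(mfccs_scaled_features), axis=1)
print(f'The predicted label: {label_predicted}')

# predicting class of the test audio
# The encoder must be fitted before inverse_transform() can be called.
# train_class_names is an assumed name for the genre labels used in training;
# fitting on the same labels reproduces the same integer mapping.
encoder = split_dataset.LabelEncoder()
encoder.fit(train_class_names)
class_predicted = encoder.inverse_transform(label_predicted)
print(f'The predicted class: {class_predicted}')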


Our model correctly identified the genre/label for our given input.
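
For readers who want to see the decoding step in isolation, here is a small sketch that turns a row of softmax probabilities into a genre name. The probabilities are made up, the function name decode_prediction is hypothetical, and the genre list assumes the labels were encoded in alphabetical order of the GTZAN folder names, which is how a label encoder that sorts its classes would number them.

```python
import numpy as np

# GTZAN folder names in alphabetical order (assumed label order).
GENRES = np.array(["blues", "classical", "country", "disco", "hiphop",
                   "jazz", "metal", "pop", "reggae", "rock"])


def decode_prediction(probabilities, classes=GENRES):
    # Accept a single vector or a batch of vectors.
    probabilities = np.atleast_2d(probabilities)
    # The predicted label is the index of the largest probability.
    label = np.argmax(probabilities, axis=1)
    return classes[label]


# Made-up model output for one song: highest probability at index 0.
fake_output = np.array([0.82, 0.01, 0.03, 0.01, 0.02,
                        0.05, 0.01, 0.01, 0.02, 0.02])
print(decode_prediction(fake_output))      # ['blues']
```

The important detail is that the index-to-name mapping used here must be the same one that was used to encode the training labels.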


In this post, we learned about the Sequential model, feature extraction from audio files, and Mel Frequency Cepstral Coefficients. We successfully built and trained a classification model to predict the genre of a piece of music.


Written by 

Durgesh Gupta is a Software Consultant working in the domain of AI/ML.