In this blog, we will build a music genre classification model that predicts the genre/label of a song.
Music Genre Classification
Today we will build a TensorFlow Sequential model to automatically classify the musical genre of a given input audio file.
Dataset
To train our ML classifier to predict an audio file's genre/label, we will use the GTZAN Dataset.
You can download the dataset from here.
The dataset contains 2 directories and 2 CSV files.
- genres_original: A collection of 1,000 audio files across 10 genres, with 100 audio files per genre.
- Blues
- Classical
- Country
- Disco
- Hip-hop
- Jazz
- Metal
- Pop
- Reggae
- Rock
- images_original: A visual representation of each audio file, so the data can also be classified with image-based neural networks if we want. Each audio file was converted to a Mel spectrogram to make this possible.
- features_30_sec.csv: Features of the full 30-second audio files. The mean and variance are computed over multiple features extracted from each audio file.
- features_3_sec.csv: Same structure as above, but computed on the audio files split into 3-second segments (increasing the amount of data we provide to our classification model tenfold).
Feature Extraction
The very first step of every machine learning/AI project is data preprocessing.
To build the music genre classification model, we will extract features from the audio files, keeping the informative content and discarding the noise.
Mel Frequency Cepstral Coefficients
In sound processing, the Mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. Mel-frequency cepstral coefficients are coefficients that collectively make up an MFC.
The MFCC feature extraction technique essentially involves windowing the signal, applying the DFT, taking the log of the magnitude, warping the frequencies onto the Mel scale, and finally applying the inverse DCT.
At a high level, the steps are:
- Audio signals are constantly changing, so we divide the signal into short frames, each around 20-40 ms long.
- For each frame, we identify the different frequencies present.
- The frequencies carrying the (linguistic) content are separated from the noise.
- To discard the noise, a discrete cosine transform (DCT) is applied to these frequencies, keeping only the subset of coefficients that carries most of the information.
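To make the frame structure concrete, here is a tiny, illustrative snippet (it uses one of librosa's bundled example clips purely for demonstration; any audio file works):
import librosa

# computing 40 MFCCs for a short example clip
y, sr = librosa.load(librosa.example('trumpet'))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

# one column of 40 coefficients per short-time frame
print(mfcc.shape)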
Let’s code now.
# importing necessary libraries for feature extraction
import pandas as pd
import numpy as np
import os
import librosa
import librosa.display
from tqdm import tqdm


# function to extract the scaled MFCC features for a given input file
# returns the scaled MFCC feature vector for the input file
def feature_extractor(input_file):
    # loading the audio file; 'kaiser_fast' resampling speeds up loading
    audio, sample_rate = librosa.load(input_file,
                                      res_type='kaiser_fast')
    # computing 40 MFCCs per frame
    mfccs_features = librosa.feature.mfcc(y=audio,
                                          sr=sample_rate,
                                          n_mfcc=40)
    # averaging the coefficients over all frames to get a fixed-size vector
    mfccs_features_scaled = np.mean(mfccs_features.T, axis=0)
    return mfccs_features_scaled
# function to run the extraction over all the audio files in the dataset
# returns a dataframe containing the scaled MFCC features for each audio file
def feature_extraction():
    metadata = pd.read_csv(METADATA_PATH)
    extracted_features = []
    for index, row in tqdm(metadata.iterrows(), desc='Extracting'):
        try:
            class_label = row['label']
            input_file_name = os.path.join(os.path.abspath(DATASET_PATH),
                                           class_label,
                                           str(row['filename']))
            extracted_data = feature_extractor(input_file_name)
            extracted_features.append([extracted_data, class_label])
        except Exception as e:
            print(f'Error Occurred: {e}')
            continue
    extracted_features_df = pd.DataFrame(extracted_features,
                                         columns=['feature', 'class'])
    return extracted_features_df
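Assuming DATASET_PATH points at the genres_original folder and METADATA_PATH at features_30_sec.csv (the exact paths depend on where you unpacked the dataset), the extraction can be run like this:
# illustrative paths; adjust them to wherever you unpacked the GTZAN dataset
DATASET_PATH = 'Data/genres_original'
METADATA_PATH = 'Data/features_30_sec.csv'

extracted_features_df = feature_extraction()
print(extracted_features_df.head())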
The output data frame will look like this

Split dataset
Now, let's split the dataset.
We will split it 90:10, which means 90% of the data will be used for training and 10% for testing. Since the genre names are strings, we also encode them as one-hot vectors before training, as shown in the sketch below.
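Here is one way to do it, assuming scikit-learn's train_test_split and LabelEncoder together with Keras' to_categorical (the variable names, such as label_encoder, are ours):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# stacking the per-file MFCC vectors into a (num_files, 40) feature matrix
X = np.array(extracted_features_df['feature'].tolist())

# encoding the genre names as integers, then as one-hot vectors
label_encoder = LabelEncoder()
y = to_categorical(label_encoder.fit_transform(extracted_features_df['class']))

# 90/10 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)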
The shapes of the training and testing data will be:



Model Training
Now, let’s build and train our classification model.
We are going to use a Keras Sequential model to build our classifier.
A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
To learn more about the Sequential model, click here!
# importing the libraries needed for building and training the model
from datetime import datetime
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint

EPOCHS = 100
BATCH_SIZE = 32
NUM_LABELS = y_train.shape[1]

# Defining the model layers
model = Sequential()
# only the first layer needs input_shape; each input is a 40-dimensional MFCC vector
model.add(Dense(1024, input_shape=(40,), activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(256, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(64, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(32, activation="relu"))
model.add(Dropout(0.3))
# Final layer: one unit per genre with softmax probabilities
model.add(Dense(NUM_LABELS, activation="softmax"))
print(model.summary())

# Compiling the model; categorical cross-entropy matches the one-hot labels
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
# Defining the checkpointer callback to save the best model during training
check_pointer = ModelCheckpoint(filepath='saved_models/genre_classification.hdf5',
                                verbose=1,
                                save_best_only=True)

# Storing the start time
start_time = datetime.now()

# Training the model
history_model = model.fit(X_train,
                          y_train,
                          batch_size=BATCH_SIZE,
                          epochs=EPOCHS,
                          validation_data=(X_test, y_test),
                          callbacks=[check_pointer],
                          verbose=1)

print(f'Total Time taken in training is {datetime.now() - start_time}')

print("-----Model Evaluation----")
print(model.evaluate(X_test, y_test, verbose=0))



The accuracy of the model is 95% on the training data.
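If you want to see how the accuracy evolved over the epochs, the history object returned by fit can be plotted. Here is a quick sketch using matplotlib (assuming the model was compiled with the accuracy metric, as above):
import matplotlib.pyplot as plt

# plotting training vs. validation accuracy across epochs
plt.plot(history_model.history['accuracy'], label='train accuracy')
plt.plot(history_model.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()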
Making Predictions
Now that we have trained our model, it's time to test it and make some predictions.
To test the model, we are going to pick a song whose genre is Blues.
Let’s see what genre the model predicts for our song.
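Since the checkpoint callback saved the best weights to saved_models/genre_classification.hdf5, we can optionally reload them before predicting (a small sketch):
from tensorflow.keras.models import load_model

# loading the best weights saved by the ModelCheckpoint callback
model = load_model('saved_models/genre_classification.hdf5')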
# extracting the MFCC features for the test audio and reshaping them to the
# (1, 40) input shape the model expects; the file path is only a placeholder
test_file = 'Data/genres_original/blues/blues.00000.wav'
mfccs_scaled_features = feature_extractor(test_file).reshape(1, -1)

# predicting the label for the test audio
label_predicted = np.argmax(model.predict(mfccs_scaled_features), axis=1)
print(f'The predicted label: {label_predicted}')

# mapping the label back to its class name using the label encoder
# fitted while splitting the dataset
class_predicted = label_encoder.inverse_transform(label_predicted)
print(class_predicted[0])



Our model correctly identified the genre/label for our given input.
Conclusion
In this blog, we learned about the Sequential model, feature extraction from audio files, and Mel Frequency Cepstral Coefficients. We successfully built and trained a classification model to predict the genre of a piece of music.
References
- https://techhub.knoldus.com/dashboard/projects/ml/62b2f79a5b00007f34abf0ad
- https://www.tensorflow.org/guide/keras/sequential_model
- https://en.wikipedia.org/wiki/Mel-frequency_cepstrum