Speech Command Recognition: The Ultimate Guide
Speech command recognition systems have become integral to modern technology, enabling seamless interaction with devices through spoken commands. From virtual assistants like Siri and Alexa to automotive voice control, these systems play a crucial role in enhancing user experience and accessibility. Having discussed the most important sound features in our Sound Properties Guide, let’s now put that knowledge to work and select appropriate features for building a speech command recognition system that identifies spoken words in audio recordings. In this article we’ll cover the following:
- Choosing suitable features for the task.
- Installing and importing the required libraries.
- Downloading, loading, and splitting the Mini Speech Commands Dataset.
- Preprocessing the data and computing the features.
- Building and training the model.
- Evaluating the model.
Choosing The Suitable Features
In this task the goal is to classify spoken words from a limited vocabulary rather than to build a full-fledged speech recognition system, so features that mimic the human auditory process are a natural choice. That narrows the options to Mel-spectrogram or MFCC features. Since we want a compact feature set that keeps the pipeline simple, we’ll go with MFCCs, which provide a compressed representation well suited to this classification task.
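To make that compression concrete, here is a minimal, self-contained sketch (separate from the pipeline built below) that compares the sizes of a linear spectrogram, a mel spectrogram, and an MFCC matrix for one second of 16 kHz audio. The bin counts used here (80 mel bands, 13 coefficients) are illustrative choices, not the values we use later in this article.
import tensorflow as tf

audio = tf.random.normal([16000])  # hypothetical 1-second waveform at 16 kHz
stft = tf.signal.stft(audio, frame_length=400, frame_step=160, fft_length=400)
spectrogram = tf.abs(stft)  # (98 frames, 201 frequency bins)
mel_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins=80, num_spectrogram_bins=stft.shape[-1],
    sample_rate=16000, lower_edge_hertz=0.0, upper_edge_hertz=8000.0)
mel_spec = tf.tensordot(spectrogram, mel_matrix, 1)  # (98, 80)
log_mel = tf.math.log(mel_spec + 1e-6)
mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :13]  # (98, 13)
print(spectrogram.shape, mel_spec.shape, mfccs.shape)
Going from 201 linear-frequency bins to 80 mel bands to a handful of cepstral coefficients is exactly the kind of compression that makes MFCCs attractive for a small command classifier.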
Building the system
Let’s break down the process of constructing a speech command recognition system step by step.
Installing required libraries
!pip install -U --pre tensorflow tensorflow_datasets
!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2
Importing libraries
import os
import pathlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models
from IPython import display
# Set the seed value for experiment reproducibility.
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)
Loading the dataset
We will use Google’s Mini Speech Commands Dataset, containing the following commands:
- down
- go
- left
- no
- right
- stop
- up
- yes
DATASET_PATH = 'data/mini_speech_commands'
data_dir = pathlib.Path(DATASET_PATH)
if not data_dir.exists():
  tf.keras.utils.get_file(
      'mini_speech_commands.zip',
      origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
      extract=True,
      cache_dir='.', cache_subdir='data')
commands = np.array(tf.io.gfile.listdir(str(data_dir)))
commands = commands[commands != 'README.md']
print('Commands:', commands)
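As an optional sanity check (assuming the per-command folder layout created by the download above), we can count how many audio files are available for each command:
# Count the .wav files in each command subdirectory.
for command in commands:
  n_files = len(tf.io.gfile.glob(str(data_dir/command) + '/*.wav'))
  print(command, ':', n_files, 'files')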
Splitting the dataset
Now let’s split our dataset into train, validation and test sets.
batch_size = 32
train_ds, val_ds = tf.keras.utils.audio_dataset_from_directory(
    directory=data_dir,
    batch_size=batch_size,
    validation_split=0.2,
    seed=0,
    output_sequence_length=16000,
    subset='both')
label_names = np.array(train_ds.class_names)
print()
print("label names:", label_names)
def squeeze(audio, labels):
  audio = tf.squeeze(audio, axis=-1)
  return audio, labels
train_ds = train_ds.map(squeeze, tf.data.AUTOTUNE)
val_ds = val_ds.map(squeeze, tf.data.AUTOTUNE)
test_ds = val_ds.shard(num_shards=2, index=0)
val_ds = val_ds.shard(num_shards=2, index=1)
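To verify the split, we can print the number of batches in each subset; cardinality() returns the batch count when it is statically known (a small optional check, not required for training):
# Number of batches in each split (may report -2 if the cardinality is unknown).
print('Train batches:', train_ds.cardinality().numpy())
print('Validation batches:', val_ds.cardinality().numpy())
print('Test batches:', test_ds.cardinality().numpy())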
Data preprocessing
Now let’s define our feature extraction function then use it to extract the features.
# The sampling rate of the audio, in Hz.
sr = 16000
# The window length in samples (25 ms at 16 kHz).
frame_length = int(sr/40)
# The number of samples to step between windows (10 ms at 16 kHz).
frame_step = int(sr/100)
# The size of the FFT to apply.
fft_length = int(sr/40)
# The number of mel filterbanks.
num_feats = 40
def get_mfccs(
    audio,
    sample_rate=16000,
    frame_length=400,
    frame_step=160,
    fft_length=400,
    num_feats=40
):
  stfts = tf.signal.stft(audio, frame_length=frame_length, frame_step=frame_step, fft_length=fft_length)
  spectrograms = tf.abs(stfts)
  # Warp the linear-scale spectrograms into the mel scale.
  num_spectrogram_bins = stfts.shape[-1]
  lower_edge_hertz, upper_edge_hertz, num_mel_bins = 0, sample_rate/2, num_feats
  linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
      num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz, upper_edge_hertz)
  mel_spectrograms = tf.tensordot(
      spectrograms, linear_to_mel_weight_matrix, 1)
  mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
      linear_to_mel_weight_matrix.shape[-1:]))
  # Compute a stabilized log to get log-magnitude mel-scale spectrograms.
  log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
  # Compute MFCCs from the log-mel spectrograms.
  mfccs = tf.signal.mfccs_from_log_mel_spectrograms(
      log_mel_spectrograms)
  # Add a trailing channel axis so the features can feed a Conv2D model.
  mfccs = mfccs[..., tf.newaxis]
  return mfccs
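Before mapping the whole dataset, it can be helpful to sanity-check the function on a single hypothetical 1-second waveform and confirm the output shape matches our expectation of (98, 40, 1):
# Quick shape check on a random 1-second, 16 kHz waveform.
dummy = tf.random.normal([sr])
print(get_mfccs(dummy, sample_rate=sr, frame_length=frame_length,
                frame_step=frame_step, fft_length=fft_length,
                num_feats=num_feats).shape)  # expected: (98, 40, 1)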
def make_spec_ds(ds):
  return ds.map(
      map_func=lambda audio, label: (get_mfccs(audio, sample_rate=sr, frame_length=frame_length, frame_step=frame_step, fft_length=fft_length, num_feats=num_feats), label),
      num_parallel_calls=tf.data.AUTOTUNE)
train_mfcc_ds = make_spec_ds(train_ds)
val_mfcc_ds = make_spec_ds(val_ds)
test_mfcc_ds = make_spec_ds(test_ds)
train_mfcc_ds = train_mfcc_ds.cache().shuffle(len(train_mfcc_ds)*batch_size).prefetch(tf.data.AUTOTUNE)
val_mfcc_ds = val_mfcc_ds.cache().prefetch(tf.data.AUTOTUNE)
test_mfcc_ds = test_mfcc_ds.cache().prefetch(tf.data.AUTOTUNE)
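It is worth confirming that the mapped datasets now yield MFCC tensors rather than raw audio; printing the element spec is a quick way to do that:
# Each element should now be a batch of (frames, coefficients, 1) features plus labels.
print(train_mfcc_ds.element_spec)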
Data visualization
Now let’s visualize our data by plotting an audio waveform (Figure 1) alongside its MFCC features (Figure 2).
def plot_mfcc(mfcc, ax):
  if len(mfcc.shape) > 2:
    assert len(mfcc.shape) == 3
    mfcc = np.squeeze(mfcc, axis=-1)
  # Transpose so that time is represented on the x-axis (columns). MFCCs can be
  # negative, so we plot the coefficients directly rather than taking a log.
  mfcc = mfcc.T
  height = mfcc.shape[0]
  width = mfcc.shape[1]
  X = np.linspace(0, np.size(mfcc), num=width, dtype=int)
  Y = range(height)
  ax.pcolormesh(X, Y, mfcc)
for example_audio, example_labels in train_ds.take(1):
  label = label_names[example_labels[0]]
  waveform = example_audio[0]
  mfcc = get_mfccs(waveform)

fig, axes = plt.subplots(2, figsize=(12, 8))
timescale = np.arange(waveform.shape[0])
axes[0].plot(timescale, waveform.numpy())
axes[0].set_title('Waveform')
axes[0].set_xlim([0, sr])
plot_mfcc(mfcc.numpy(), axes[1])
axes[1].set_title('MFCC')
plt.suptitle(label.title())
plt.show()

print('Label:', label)
print(f'Waveform shape: {len(waveform)} samples')
print(f'Waveform duration: {len(waveform) / sr} sec')
print(f'MFCC expected shape: ({(len(waveform) - frame_length) // frame_step + 1}, {num_feats}, 1)')
print('MFCC actual shape:', mfcc.shape)
print('Audio playback')
display.display(display.Audio(waveform, rate=sr))


Figures 1 & 2 showcase an audio waveform and the corresponding Mel-Frequency Cepstral Coefficients (MFCCs) extracted from a sample in the Mini Speech Commands Dataset. The waveform depicts the raw signal’s amplitude over time, offering fundamental insight into the temporal characteristics of the audio. In contrast, the MFCC features capture the essential frequency information in a compressed format, making them a robust and compact input for speech command recognition models.
Input and output shapes
for example_mfcc, example_spect_labels in train_mfcc_ds.take(1):
  input_shape = example_mfcc.shape[1:]

num_labels = len(commands)
print(f"Input shape: {input_shape}")
print(f"Num of labels: {num_labels}")
Defining the model
Define the model architecture using convolutional layers followed by max pooling and dense layers.
# train_mfcc_ds is already batched, so its length equals the number of steps per epoch.
steps_per_epoch = len(train_mfcc_ds)
reduce_lr=tf.keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=0.1, patience=2, verbose=1, min_lr=1e-7)
callback = [reduce_lr]
def build_model(input_shape, num_labels):
  x_s = layers.Input(shape=input_shape)
  x = layers.Conv2D(64, (3, 3), padding='same')(x_s)
  x = layers.MaxPooling2D((2, 2), padding='same')(x)
  x = layers.Conv2D(64, (3, 3), padding='same')(x)
  x = layers.MaxPooling2D((2, 2), padding='same')(x)
  x = layers.Conv2D(64, (3, 3), padding='same')(x)
  x = layers.MaxPooling2D((2, 2), padding='same')(x)
  x = layers.Flatten()(x)
  x = layers.Dense(1024, activation='relu')(x)
  x = layers.Dropout(0.3)(x)
  x_e = layers.Dense(num_labels, activation='softmax')(x)
  model = models.Model(inputs=x_s, outputs=x_e)
  opt = tf.keras.optimizers.Adam(learning_rate=0.001)
  model.compile(opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  model.summary()
  return model
Building & Training the Model
epochs = 10
model = build_model(input_shape, num_labels)
history = model.fit(
    train_mfcc_ds,
    validation_data=val_mfcc_ds,
    epochs=epochs,
    callbacks=callback,
    steps_per_epoch=steps_per_epoch,
)
model.save(f"epoch-{epochs}.h5")
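The saved file can later be reloaded for inference without rebuilding the architecture; a minimal sketch, assuming the same epoch-10.h5 path produced above:
# Restore the trained model from disk and verify its architecture.
restored_model = tf.keras.models.load_model(f"epoch-{epochs}.h5")
restored_model.summary()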


Model Evaluation
Let’s start by plotting both the training and validation accuracy and loss curves.
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs_range = range(len(acc))
plt.plot(epochs_range, acc, 'r', label='Training accuracy')
plt.plot(epochs_range, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.show()
plt.clf()
plt.plot(epochs_range, loss, 'r', label='Training Loss')
plt.plot(epochs_range, val_loss, 'b', label='Validation Loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

Upon analyzing the training and validation curves (Figures 3 & 4), we observe a slight gap between training and validation performance. This points to mild overfitting, which could be addressed through regularization techniques such as dropout or weight decay, as sketched below.
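A minimal sketch of one such adjustment: the same architecture as build_model, but with L2 weight decay on the dense layer and a higher dropout rate. The l2 and dropout values are illustrative guesses, not tuned settings, and the snippet reuses the imports from the top of the article.
def build_regularized_model(input_shape, num_labels, l2=1e-4, dropout=0.5):
  # Same topology as build_model, plus L2 weight decay and stronger dropout.
  inputs = layers.Input(shape=input_shape)
  x = inputs
  for _ in range(3):
    x = layers.Conv2D(64, (3, 3), padding='same')(x)
    x = layers.MaxPooling2D((2, 2), padding='same')(x)
  x = layers.Flatten()(x)
  x = layers.Dense(1024, activation='relu',
                   kernel_regularizer=tf.keras.regularizers.l2(l2))(x)
  x = layers.Dropout(dropout)(x)
  outputs = layers.Dense(num_labels, activation='softmax')(x)
  model = models.Model(inputs=inputs, outputs=outputs)
  model.compile(tf.keras.optimizers.Adam(learning_rate=0.001),
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
  return model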

Now let’s evaluate our model on the test set.
test_loss, test_acc = model.evaluate(test_mfcc_ds)
print('test_acc:', test_acc, 'test_loss', test_loss)
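Beyond the aggregate accuracy, it can be instructive to look at a few individual predictions from one test batch (an optional inspection step, not part of the original pipeline):
# Compare predicted and true labels for the first few examples of one test batch.
for mfcc_batch, label_batch in test_mfcc_ds.take(1):
  probs = model.predict(mfcc_batch)
  preds = np.argmax(probs, axis=1)
  for true_idx, pred_idx in zip(label_batch.numpy()[:5], preds[:5]):
    print('true:', label_names[true_idx], '| predicted:', label_names[pred_idx])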

To improve our model’s performance, we can experiment with adjusting hyperparameters such as the learning rate, batch size, and dropout rate. For example, reducing the learning rate can help the model converge more gradually, potentially leading to better generalization.
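A sketch of what retraining from scratch with a smaller learning rate might look like (3e-4 is an illustrative value, not a tuned recommendation):
# Rebuild the model and recompile it with a smaller learning rate before retraining.
model = build_model(input_shape, num_labels)
model.compile(tf.keras.optimizers.Adam(learning_rate=3e-4),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(train_mfcc_ds, validation_data=val_mfcc_ds,
                    epochs=epochs, callbacks=callback)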
Conclusion
By employing appropriate features and training a straightforward network for just 10 epochs, we achieved an impressive 89% accuracy on the test set. This outcome surpasses the 86% accuracy reported in the Simple audio recognition: Recognizing keywords tutorial on Kaggle.