Speech Command Recognition: The Ultimate Guide
Speech command recognition systems have become integral to modern technology, enabling seamless interaction with devices through spoken commands. From virtual assistants like Siri and Alexa to automotive voice control, these systems play a crucial role in enhancing user experience and accessibility. Having discussed the most important sound features in our Sound Properties Guide, let’s now put that knowledge to work and select appropriate features for building a speech command recognition system that identifies spoken words in audio recordings. In this article we’ll cover the following:
- Choosing suitable features for the task.
- Installing and importing the required libraries.
- Downloading, loading, and splitting the Mini Speech Commands Dataset.
- Preprocessing the data and computing the features.
- Building and training the model.
- Evaluating the model.
Choosing The Suitable Features
In this task the goal is to classify spoken words from a limited vocabulary rather than to build a full-fledged speech recognition system, so features that mimic the human auditory process are a natural choice. That narrows the options to Mel-spectrogram or MFCC features. Since we want a compact feature set that keeps the pipeline simple, we’ll go with MFCCs, which provide a compressed representation well suited to this classification task.
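To make that compression concrete, here is a minimal, self-contained sketch (separate from the pipeline built below) that compares the sizes of a linear spectrogram, a mel spectrogram, and an MFCC matrix for one second of 16 kHz audio. The bin counts used here (80 mel bands, 13 coefficients) are illustrative choices, not the values we use later in this article.
import tensorflow as tf

audio = tf.random.normal([16000])  # hypothetical 1-second waveform at 16 kHz
stft = tf.signal.stft(audio, frame_length=400, frame_step=160, fft_length=400)
spectrogram = tf.abs(stft)  # (98 frames, 201 frequency bins)
mel_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins=80, num_spectrogram_bins=stft.shape[-1],
    sample_rate=16000, lower_edge_hertz=0.0, upper_edge_hertz=8000.0)
mel_spec = tf.tensordot(spectrogram, mel_matrix, 1)  # (98, 80)
log_mel = tf.math.log(mel_spec + 1e-6)
mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :13]  # (98, 13)
print(spectrogram.shape, mel_spec.shape, mfccs.shape)
Going from 201 linear-frequency bins to 80 mel bands to a handful of cepstral coefficients is exactly the kind of compression that makes MFCCs attractive for a small command classifier.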
Building the system
Let’s break down the process of constructing a speech command recognition system step by step.
Installing required libraries
!pip install -U --pre tensorflow tensorflow_datasets
!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2
Importing libraries
import os
import pathlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models
from IPython import display
# Set the seed value for experiment reproducibility.
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)
Loading the dataset
We will use Google’s Mini Speech Commands Dataset, containing the following commands:
- down
- go
- left
- no
- right
- stop
- up
- yes
DATASET_PATH = 'data/mini_speech_commands'
data_dir = pathlib.Path(DATASET_PATH)
if not data_dir.exists():
  tf.keras.utils.get_file(
      'mini_speech_commands.zip',
      origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
      extract=True,
      cache_dir='.', cache_subdir='data')
commands = np.array(tf.io.gfile.listdir(str(data_dir)))
commands = commands[commands != 'README.md']
print('Commands:', commands)
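As an optional sanity check (assuming the per-command folder layout created by the download above), we can count how many audio files are available for each command:
# Count the .wav files in each command subdirectory.
for command in commands:
  n_files = len(tf.io.gfile.glob(str(data_dir/command) + '/*.wav'))
  print(command, ':', n_files, 'files')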
Splitting the dataset
Now let’s split our dataset into train, validation and test sets.
batch_size = 32
train_ds, val_ds = tf.keras.utils.audio_dataset_from_directory(
    directory=data_dir,
    batch_size=batch_size,
    validation_split=0.2,
    seed=0,
    output_sequence_length=16000,
    subset='both')
label_names = np.array(train_ds.class_names)
print()
print("label names:", label_names)
def squeeze(audio, labels):
  audio = tf.squeeze(audio, axis=-1)
  return audio, labels
train_ds = train_ds.map(squeeze, tf.data.AUTOTUNE)
val_ds = val_ds.map(squeeze, tf.data.AUTOTUNE)
test_ds = val_ds.shard(num_shards=2, index=0)
val_ds = val_ds.shard(num_shards=2, index=1)
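To verify the split, we can print the number of batches in each subset; cardinality() returns the batch count when it is statically known (a small optional check, not required for training):
# Number of batches in each split (may report -2 if the cardinality is unknown).
print('Train batches:', train_ds.cardinality().numpy())
print('Validation batches:', val_ds.cardinality().numpy())
print('Test batches:', test_ds.cardinality().numpy())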
Data preprocessing
Now let’s define our feature extraction function then use it to extract the features.
# The sampling rate of the audio, in Hz.
sr = 16000
# The window length in samples (25 ms at 16 kHz).
frame_length = int(sr/40)
# The number of samples to step between windows (10 ms at 16 kHz).
frame_step = int(sr/100)
# The size of the FFT to apply.
fft_length = int(sr/40)
# The number of mel filterbanks.
num_feats = 40
def get_mfccs(
    audio,
    sample_rate=16000,
    frame_length=400,
    frame_step=160,
    fft_length=400,
    num_feats=40
):
  stfts = tf.signal.stft(audio, frame_length=frame_length, frame_step=frame_step, fft_length=fft_length)
  spectrograms = tf.abs(stfts)
  # Warp the linear-scale spectrograms into the mel scale.
  num_spectrogram_bins = stfts.shape[-1]
  lower_edge_hertz, upper_edge_hertz, num_mel_bins = 0, sample_rate/2, num_feats
  linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
      num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz, upper_edge_hertz)
  mel_spectrograms = tf.tensordot(
      spectrograms, linear_to_mel_weight_matrix, 1)
  mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
      linear_to_mel_weight_matrix.shape[-1:]))
  # Compute a stabilized log to get log-magnitude mel-scale spectrograms.
  log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
  # Compute MFCCs from the log-mel spectrograms.
  mfccs = tf.signal.mfccs_from_log_mel_spectrograms(
      log_mel_spectrograms)
  # Add a trailing channel axis so the features can feed a Conv2D model.
  mfccs = mfccs[..., tf.newaxis]
  return mfccs
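Before mapping the whole dataset, it can be helpful to sanity-check the function on a single hypothetical 1-second waveform and confirm the output shape matches our expectation of (98, 40, 1):
# Quick shape check on a random 1-second, 16 kHz waveform.
dummy = tf.random.normal([sr])
print(get_mfccs(dummy, sample_rate=sr, frame_length=frame_length,
                frame_step=frame_step, fft_length=fft_length,
                num_feats=num_feats).shape)  # expected: (98, 40, 1)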
def make_spec_ds(ds):
  return ds.map(
      map_func=lambda audio, label: (get_mfccs(audio, sample_rate=sr, frame_length=frame_length, frame_step=frame_step, fft_length=fft_length, num_feats=num_feats), label),
      num_parallel_calls=tf.data.AUTOTUNE)
train_mfcc_ds = make_spec_ds(train_ds)
val_mfcc_ds = make_spec_ds(val_ds)
test_mfcc_ds = make_spec_ds(test_ds)
train_mfcc_ds = train_mfcc_ds.cache().shuffle(len(train_mfcc_ds)*batch_size).prefetch(tf.data.AUTOTUNE)
val_mfcc_ds = val_mfcc_ds.cache().prefetch(tf.data.AUTOTUNE)
test_mfcc_ds = test_mfcc_ds.cache().prefetch(tf.data.AUTOTUNE)
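It is worth confirming that the mapped datasets now yield MFCC tensors rather than raw audio; printing the element spec is a quick way to do that:
# Each element should now be a batch of (frames, coefficients, 1) features plus labels.
print(train_mfcc_ds.element_spec)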
Data visualization
Now let’s visualize our data by plotting an audio waveform (Figure 1) alongside its MFCC features (Figure 2).
def plot_mfcc(mfcc, ax):
  if len(mfcc.shape) > 2:
    assert len(mfcc.shape) == 3
    mfcc = np.squeeze(mfcc, axis=-1)
  # Transpose so that time is represented on the x-axis (columns). MFCCs can be
  # negative, so we plot the coefficients directly rather than taking a log.
  mfcc = mfcc.T
  height = mfcc.shape[0]
  width = mfcc.shape[1]
  X = np.linspace(0, np.size(mfcc), num=width, dtype=int)
  Y = range(height)
  ax.pcolormesh(X, Y, mfcc)
for example_audio, example_labels in train_ds.take(1):
  label = label_names[example_labels[0]]
  waveform = example_audio[0]
  mfcc = get_mfccs(waveform)

fig, axes = plt.subplots(2, figsize=(12, 8))
timescale = np.arange(waveform.shape[0])
axes[0].plot(timescale, waveform.numpy())
axes[0].set_title('Waveform')
axes[0].set_xlim([0, sr])
plot_mfcc(mfcc.numpy(), axes[1])
axes[1].set_title('MFCC')
plt.suptitle(label.title())
plt.show()

print('Label:', label)
print(f'Waveform shape: {len(waveform)} samples')
print(f'Waveform duration: {len(waveform) / sr} sec')
print(f'MFCC expected shape: ({(len(waveform) - frame_length) // frame_step + 1}, {num_feats}, 1)')
print('MFCC actual shape:', mfcc.shape)
print('Audio playback')
display.display(display.Audio(waveform, rate=sr))


Figures 1 & 2 showcase an audio waveform and the corresponding Mel-Frequency Cepstral Coefficients (MFCCs) extracted from a sample in the Mini Speech Commands Dataset. The waveform depicts the raw signal’s amplitude over time, offering fundamental insight into the temporal characteristics of the audio. In contrast, the MFCC features capture the essential frequency information in a compressed format, making them a robust and compact input for speech command recognition models.
Input and output shapes
for example_mfcc, example_spect_labels in train_mfcc_ds.take(1):
  input_shape = example_mfcc.shape[1:]

num_labels = len(commands)
print(f"Input shape: {input_shape}")
print(f"Num of labels: {num_labels}")
Defining the model
Define the model architecture using convolutional layers followed by max pooling and dense layers.
# train_mfcc_ds is already batched, so its length equals the number of steps per epoch.
steps_per_epoch = len(train_mfcc_ds)
reduce_lr=tf.keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=0.1, patience=2, verbose=1, min_lr=1e-7)
callback = [reduce_lr]
def build_model(input_shape, num_labels):
  x_s = layers.Input(shape=input_shape)
  x = layers.Conv2D(64, (3, 3), padding='same')(x_s)
  x = layers.MaxPooling2D((2, 2), padding='same')(x)
  x = layers.Conv2D(64, (3, 3), padding='same')(x)
  x = layers.MaxPooling2D((2, 2), padding='same')(x)
  x = layers.Conv2D(64, (3, 3), padding='same')(x)
  x = layers.MaxPooling2D((2, 2), padding='same')(x)
  x = layers.Flatten()(x)
  x = layers.Dense(1024, activation='relu')(x)
  x = layers.Dropout(0.3)(x)
  x_e = layers.Dense(num_labels, activation='softmax')(x)
  model = models.Model(inputs=x_s, outputs=x_e)
  opt = tf.keras.optimizers.Adam(learning_rate=0.001)
  model.compile(opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  model.summary()
  return model
Building & Training the Model
epochs = 10
model = build_model(input_shape, num_labels)
history = model.fit(
    train_mfcc_ds,
    validation_data=val_mfcc_ds,
    epochs=epochs,
    callbacks=callback,
    steps_per_epoch=steps_per_epoch,
)
model.save(f"epoch-{epochs}.h5")
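The saved file can later be reloaded for inference without rebuilding the architecture; a minimal sketch, assuming the same epoch-10.h5 path produced above:
# Restore the trained model from disk and verify its architecture.
restored_model = tf.keras.models.load_model(f"epoch-{epochs}.h5")
restored_model.summary()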


Model Evaluation
Let’s start by plotting both the training and validation accuracy and loss curves.
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs_range = range(len(acc))
plt.plot(epochs_range, acc, 'r', label='Training accuracy')
plt.plot(epochs_range, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.show()
plt.clf()
plt.plot(epochs_range, loss, 'r', label='Training Loss')
plt.plot(epochs_range, val_loss, 'b', label='Validation Loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

Upon analyzing the training and validation curves (Figures 3 & 4), we observe a slight gap between training and validation performance. This points to mild overfitting, which could be addressed through regularization techniques such as dropout or weight decay, as sketched below.
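A minimal sketch of one such adjustment: the same architecture as build_model, but with L2 weight decay on the dense layer and a higher dropout rate. The l2 and dropout values are illustrative guesses, not tuned settings, and the snippet reuses the imports from the top of the article.
def build_regularized_model(input_shape, num_labels, l2=1e-4, dropout=0.5):
  # Same topology as build_model, plus L2 weight decay and stronger dropout.
  inputs = layers.Input(shape=input_shape)
  x = inputs
  for _ in range(3):
    x = layers.Conv2D(64, (3, 3), padding='same')(x)
    x = layers.MaxPooling2D((2, 2), padding='same')(x)
  x = layers.Flatten()(x)
  x = layers.Dense(1024, activation='relu',
                   kernel_regularizer=tf.keras.regularizers.l2(l2))(x)
  x = layers.Dropout(dropout)(x)
  outputs = layers.Dense(num_labels, activation='softmax')(x)
  model = models.Model(inputs=inputs, outputs=outputs)
  model.compile(tf.keras.optimizers.Adam(learning_rate=0.001),
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
  return model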

Now let’s evaluate our model on the test set.
test_loss, test_acc = model.evaluate(test_mfcc_ds)
print('test_acc:', test_acc, 'test_loss', test_loss)
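Beyond the aggregate accuracy, it can be instructive to look at a few individual predictions from one test batch (an optional inspection step, not part of the original pipeline):
# Compare predicted and true labels for the first few examples of one test batch.
for mfcc_batch, label_batch in test_mfcc_ds.take(1):
  probs = model.predict(mfcc_batch)
  preds = np.argmax(probs, axis=1)
  for true_idx, pred_idx in zip(label_batch.numpy()[:5], preds[:5]):
    print('true:', label_names[true_idx], '| predicted:', label_names[pred_idx])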

To improve our model’s performance, we can experiment with adjusting hyperparameters such as the learning rate, batch size, and dropout rate. For example, reducing the learning rate can help the model converge more gradually, potentially leading to better generalization.
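A sketch of what retraining from scratch with a smaller learning rate might look like (3e-4 is an illustrative value, not a tuned recommendation):
# Rebuild the model and recompile it with a smaller learning rate before retraining.
model = build_model(input_shape, num_labels)
model.compile(tf.keras.optimizers.Adam(learning_rate=3e-4),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(train_mfcc_ds, validation_data=val_mfcc_ds,
                    epochs=epochs, callbacks=callback)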
Conclusion
By employing appropriate features and training a straightforward network for just 10 epochs, we achieved an impressive 89% accuracy on the test set. This outcome surpasses the 86% accuracy reported in the Simple audio recognition: Recognizing keywords tutorial on Kaggle.