Keyword Spotting - Part I (Data Collection and Data Annotation)

Overview

This tutorial is a step-by-step guide on how to build keyword spotting applications using the SensiML keyword spotting pipeline. Keyword spotting is a technique used to recognize specific words or phrases in audio signals, usually with the intent of triggering an action. It is commonly used for command recognition, allowing users to control a device or application by speaking predetermined commands. For example, a keyword spotting system might control a smart home device by recognizing commands like “turn on the light” or “play music”. Keyword spotting algorithms are often used in voice-enabled applications such as voice-activated assistants, smart speakers, and interactive customer service. Other applications involve detecting acoustic events, such as a baby’s cry or coughing, or identifying speaker characteristics, such as gender or age. Keyword spotting algorithms can identify speech in noisy environments, such as a crowded room or outdoors. They can also be used in wake word applications, which are primarily found in smart home devices. Most home edge devices remain in a low-power sleep mode to conserve energy; when the wake word is heard, the device wakes up and responds to further voice commands and/or runs more complex tasks.

The process of keyword spotting typically involves breaking the audio signal into a series of small, overlapping windows and applying a signal processing technique such as the Fast Fourier Transform (FFT) to each window. This converts the audio signal from the time domain into the frequency domain, so that relevant features can be extracted from it. These features are then passed through a machine learning algorithm, in this case a deep convolutional neural network, to identify keywords. The neural network is trained on a labeled dataset of audio samples, both with and without the keywords. Once trained, the network can predict whether a keyword or phrase is present within a given frame of the audio signal.
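To make the windowing idea concrete, the short Python sketch below splits a 1-second, 16 kHz signal into overlapping frames and computes the magnitude spectrum of each frame with an FFT. The frame and hop lengths are illustrative assumptions, and this is not the exact feature pipeline used by the SensiML templates.

```python
import numpy as np

SAMPLE_RATE = 16000      # Hz, matching the rate used later in this tutorial
FRAME_LEN = 400          # 25 ms frames (illustrative assumption)
HOP_LEN = 160            # 10 ms hop, so consecutive frames overlap

def frame_features(signal: np.ndarray) -> np.ndarray:
    """Return magnitude spectra with shape (num_frames, FRAME_LEN // 2 + 1)."""
    num_frames = 1 + (len(signal) - FRAME_LEN) // HOP_LEN
    window = np.hanning(FRAME_LEN)
    frames = np.stack([
        signal[i * HOP_LEN : i * HOP_LEN + FRAME_LEN] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))  # frequency-domain features per frame

# One second of random samples stands in for a real recording.
features = frame_features(np.random.randn(SAMPLE_RATE))
print(features.shape)  # (98, 201)
```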

Objectives

This tutorial shows how to build a keyword spotting model that identifies four keywords: “On”, “Off”, “Yes”, and “No”. The major steps are

  • Data Collection. You will need to collect audio samples for the keywords you want to detect. The audio data must meet the requirements of the desired application. For example, the audio quality should match the conditions of the deployment environment, including but not limited to the audio amplitude, background noise level, user gender, and diversity of the speakers. The collected audio samples must fairly represent the real-world scenario.
  • Data Annotation. Each segment of the audio signal that covers a specific keyword must be labelled accordingly.
  • Training. Once the data is fully annotated, you will train a model offered by one of the SensiML keyword spotting templates. Each template loads a neural network which is pre-trained using the Google speech command dataset. Adopting the transfer learning approach, the training process re-tunes some of the network parameters to accommodate the keywords of your dataset.
  • Testing/Evaluating. You will need to evaluate the trained model to make sure it is detecting the keywords accurately. To do this, you can use a test set of audio samples that you set aside from the beginning for this purpose.
  • Deploying. If you are satisfied with the accuracy of the generated model, you can compile it, download it, and flash it to the edge device of interest. Once the model is deployed, it is important to monitor its performance and accuracy in the production environment. If the model accuracy does not meet the requirements, you can iteratively revisit previous steps, for instance collecting additional data in a more realistic environment and/or adjusting the training parameters.

Required Software/Hardware

This tutorial uses the SensiML Toolkit to manage collecting and annotating sensor data, creating a sensor preprocessing pipeline, and generating the firmware. Before you start, sign up for the free SensiML Community Edition to get access to the SensiML Analytics Toolkit.

Software

  • SensiML Data Studio (Windows 10) to collect and label the audio data.
  • SensiML Analytics Studio for data preparation and managing pipelines. This interface enables you to generate an appropriate Knowledge Pack for deployment on any of the supported devices.
  • Optional: SensiML Open Gateway or Putty/Tera Term to display the model classifications in real-time.

Hardware

:memo: Note: Although this tutorial is agnostic to the chosen edge device, due to the complexity of the generated model, some devices might not have enough memory to store the model or to generate classifications in a reasonable time. Devices capable of accelerating matrix arithmetic operations are recommended, but not required.

Collecting Data

Starting with the Data Studio

We use the Data Studio (install) to connect to the audio sensor and collect data. If you have already collected your audio data, you can follow these steps to import your captured data into the SensiML server. If you are about to collect new data, please first consult the Supported Devices section in the left menu bar of the SensiML documentation and flash the proper Data Collection Firmware to the device. If you don’t find your device in the list, please refer to this page to learn how to integrate your data into the Data Studio.

As an example, we have collected some data and stored them in a SensiML project. Follow the steps below to import the project to your account.

  1. Download the example project

Keyword Spotting Demo.zip

  2. Import the project to your account using the Data Studio.

This project includes an example dataset of WAV files and four labels representing the keywords “On”, “Off”, “Yes”, and “No”, as well as the additional label “Unknown”, which is reserved for background noise and any other audio that does not include the known keywords.

In the next image, you can see one of the WAV files displayed in the Data Studio.

  • The upper track shows the audio signal in blue with the labeled segments highlighted in orange.
  • The lower track illustrates the corresponding audio spectrogram. A spectrogram is a visual representation of how the sound energy is distributed over different frequencies as time progresses. It helps with identifying patterns, tracking changes over time, and examining the frequency balance of an audio signal.
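If you want to inspect a recording outside of the Data Studio, a spectrogram can be computed with standard Python tools. The sketch below uses SciPy and Matplotlib on a hypothetical mono WAV file; it only approximates the view that the Data Studio renders for you.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, audio = wavfile.read("on_example.wav")      # hypothetical mono recording
f, t, sxx = spectrogram(audio, fs=rate, nperseg=400, noverlap=240)

plt.pcolormesh(t, f, 10 * np.log10(sxx + 1e-10))  # power on a dB scale
plt.xlabel("Time [s]")
plt.ylabel("Frequency [Hz]")
plt.title("Spectrogram of a keyword recording")
plt.show()
```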

:memo: Note: The SensiML keyword spotting algorithm requires the audio dataset to include some data labeled as Unknown. The “Unknown” label must be spelled exactly this way (i.e., capitalized). To see the list of labels, from the top menu of the Data Studio, click on Edit> Project Properties. You can click on the “+” sign to add new labels or right click on any of the labels to modify/delete them.

../_images/kw_image4.png

Recording Audio Data

We use Data Studio to connect to a device and collect audio data. Please also refer to this tutorial for further details on how to collect data using the Data Studio. Here, we briefly cover the main steps of data collection.

:memo: Note: Make sure that your device has been flashed with data collection firmware.

To collect new data, click on the Live Capture button in the left navigation bar

../_images/dcl-navigation-bar-left-live-capture-button.png

Now, we will prepare the Data Studio to communicate with the device and record data at the desired sampling rate. At the bottom of the Data Studio window, clicking on the Connect button opens a window that allows you to scan your system for available devices. Find your device and adjust the capture settings (as explained here). The current SensiML keyword spotting models require an audio sampling rate of 16,000 Hz (16 kHz).
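If you plan to import pre-recorded audio instead of capturing it live, it is worth confirming the sampling rate first. The following Python sketch checks a WAV file and resamples it to 16 kHz when needed; the file names are placeholders, and resample_poly is just one reasonable choice of resampler.

```python
from scipy.io import wavfile
from scipy.signal import resample_poly

TARGET_RATE = 16000

rate, audio = wavfile.read("recording_44100.wav")   # hypothetical 44.1 kHz capture
if rate != TARGET_RATE:
    # Rational-factor resampling; SciPy reduces up/down by their GCD internally.
    audio = resample_poly(audio, up=TARGET_RATE, down=rate)
    # Cast back to 16-bit PCM, assuming the original file was 16-bit audio.
    wavfile.write("recording_16000.wav", TARGET_RATE, audio.astype("int16"))
```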

../_images/dcl-sensor-connect.png

Once your device is connected to the Data Studio, the audio signal is displayed in real-time. At this point you can Start Recording the audio. You also have the option to adjust Capture Settings such as the recording length and the size and range of the display window.

../_images/dcl-live-capture-start-recording.png

If you are recording data for the keywords, leave enough space between the keyword events to make annotation easier later. To better track your workflow as you extend your dataset, it is recommended that each recording include only one keyword.

When you are done recording a file, click on the Stop Recording button and fill in the metadata form accurately.

We suggest you decide on the metadata fields before you start your data collection. You can add as many metadata fields as necessary.

In this tutorial, we require users to include a specific metadata field to keep track of data subsets. Usually 20-30% of the collected data is set aside for cross-validation and testing purposes. Training data should not be used in the validation and testing tasks. To make sure this condition always holds, we define a metadata field “Set” to store the category that each recording belongs to. The Set column can take three values: “Train”, “Validate”, and “Test”. By adding these options, we can guarantee the same data is never used for more than one of training, validation, and testing.

In this project, each recording belongs to only one Set and consists of one audio keyword.
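The snippet below is a small bookkeeping sketch of this idea: it maps each (hypothetical) recording name to its Set value, reports the split proportions, and relies on the fact that a recording can only carry one Set value to keep the subsets disjoint. It is not part of the SensiML tooling, just an illustration of how the metadata can be used.

```python
from collections import Counter

# Hypothetical mapping from recording name to its "Set" metadata value.
set_metadata = {
    "on_speaker1_01.wav": "Train",
    "on_speaker2_01.wav": "Validate",
    "off_speaker1_01.wav": "Train",
    "off_speaker3_01.wav": "Test",
}

counts = Counter(set_metadata.values())
total = sum(counts.values())
for split in ("Train", "Validate", "Test"):
    print(f"{split}: {counts[split]} files ({100 * counts[split] / total:.0f}%)")

# Since each recording carries exactly one Set value, a file can never end up
# in more than one of the training, validation, and test subsets.
```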

If you are collecting data from multiple individuals, dedicate a separate metadata field to keep track of speakers.

Do not worry if you missed a few metadata fields; you can introduce them later as your project evolves.

You always have the option to add, review, and modify metadata fields. To do so, from the main menu of the Data Studio click on Edit> Project Properties and go to the Metadata section. You can right click on any of the metadata items to delete/modify them or use the plus icon (+) to introduce new ones.

../_images/kw_image9.png

In this example, double clicking on “Set” opens the list of all possible options that will be accessible through a dropdown menu.

../_images/kw_image10.png

If you want to assign values of your newly defined metadata field to your previous recordings, or change their values, open the list of all recordings by clicking on the “Project Explorer” on the top left, right click on the file name, and select Metadata> Edit.

Data Annotation

Defining Labels

If you have downloaded the example project, it already includes all four keyword labels (“Yes”, “No”, “On” and “Off”) plus the “Unknown” label for annotating audio noise and random speech events.

If you have created a new project for another set of keywords and have not yet defined your desired labels in the Data Studio, you can go to Top Menu> Edit> Project Properties and define as many labels as your project needs by using the plus icon (+) on the bottom right side of the window.

Defining Labeling Session

The Data Studio organizes label information in sessions. A session separates your labeled events (segments) into a group. This improves your workflow by letting you store different versions of labels in separate sessions, which can later be targeted by the data query block of the modeling pipeline. To make a new session, click on the session options button above the graph.

../_images/dcl-session-segment-explorer-change.png

Click on “Add New Session” to create a new one. You can also switch between multiple labelling sessions.

Sessions can be leveraged in multiple ways. For instance, they can be used to keep track of the classifications made by various models on the same test data, or to store annotations produced by different protocols.

In this example, we use a “Training Session” to store the labels used to build our keyword spotting model. We devote a separate session, “Model Testing”, to storing the labels that are generated when testing a model.

Labeling the Audio Data

First, we switch to the labeling session we wish to use for training our model. Here, it is called “Training Session”.

Known Keywords

In this project, the keyword spotting model operates on 1-second audio segments, consisting of 16,000 audio samples at a rate of 16 kHz. Don’t worry if your keywords are slightly longer than 1 second. Our model is still capable of making reasonable classifications if a significant part of each keyword is covered within a 1-second data window.

Although each annotated segment of data must include at least 16,000 samples, we recommend increasing the size of segments by about 25% and including about 20,000 samples in each segment. The only condition is that every 16,000-sample subsegment must cover a significant part of the audio event. Segments smaller than 16,000 samples are not considered in the model building process.
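The arithmetic behind this recommendation is easy to check. Assuming a hop of 1,000 samples between candidate windows (an illustrative value, not a SensiML setting), the sketch below lists every 16,000-sample window that fits inside a 20,000-sample segment; with the keyword centered, each window still overlaps the bulk of the event.

```python
SEGMENT_LEN = 20_000   # ~1.25 s at 16 kHz, the recommended segment size
WINDOW_LEN = 16_000    # 1 s at 16 kHz, the model's input size
HOP = 1_000            # illustrative step between candidate windows

for start in range(0, SEGMENT_LEN - WINDOW_LEN + 1, HOP):
    print(f"window covers samples [{start}, {start + WINDOW_LEN})")
# With the keyword centered in the segment, every window above still overlaps
# most of the event, so any of them is a valid model input.
```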

It is easiest to set the default segment size to a reasonable value up front. In this project, we set it to 20,000 samples.

Change the default segment size by going to Top Menu> Edit> Settings> Label and setting the Default Segment Length to 20,000 samples.

Once you set the default segment size, open your keyword files, and label them accordingly.

  • To generate a new segment, right click on the signal where you want the segment to start. Try to place keywords at the center of segments.
  • Select the “Edit” button in the segment explorer box to change the label. By dragging the mouse pointer, you can also adjust the location and size of a selected segment.
  • During the data collection process, try to leave enough space between keywords to avoid overlapping segments. It is still acceptable if the edges of adjacent segments mildly overlap, as long as the main body of each event is covered by a different segment.

Unknown Audio Signals, Background Noise

As previously mentioned, in addition to recording audio instances for each target keyword, you need to collect samples for a variety of background noise conditions, such as traffic noise, room noise, and various types of ambient noise. We label these audio samples with the “Unknown” label. Having a good set of Unknown audio signals helps to improve the model accuracy by reducing the rate of false positive detections, i.e., cases where a signal is incorrectly classified as one of the model keywords while it actually belongs to a different random word or an environmental sound.

In this project, we consider two kinds of Unknown signals: (1) background noise and (2) random audio words.

For every project, the Unknown signals should be carefully selected to address the project specifications. For most human keyword spotting applications, we suggest you include a good dataset of white/pink/blue random noise signals. You can simply play various environmental noise videos from the internet and capture the audio signal with your device. A few examples of background noise that might influence the performance of any audio classification model are “fan noise”, “crowd noise”, “street noise”, “shower noise”, and “kitchen noise”. Depending on the design of your device’s microphone, some audio noises might have significant effects and some might not even be detected.
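If you would like a few synthetic noise clips to complement real recordings, the sketch below writes one second of white noise and of approximate pink noise as 16 kHz WAV files using NumPy and SciPy. The file names and scaling are illustrative, and real environmental recordings remain preferable.

```python
import numpy as np
from scipy.io import wavfile

RATE = 16000
white = np.random.randn(RATE)                      # one second of white noise

# Approximate pink noise by shaping white noise with a 1/sqrt(f) spectral envelope.
spectrum = np.fft.rfft(white)
freqs = np.fft.rfftfreq(RATE, d=1 / RATE)
spectrum[1:] /= np.sqrt(freqs[1:])
pink = np.fft.irfft(spectrum, n=RATE)

for name, signal in (("white_noise.wav", white), ("pink_noise.wav", pink)):
    scaled = (0.1 * signal / np.max(np.abs(signal)) * 32767).astype(np.int16)
    wavfile.write(name, RATE, scaled)              # quiet 16-bit PCM clips
```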

The audio signal displayed in the following figure was recorded in a noisy restaurant. The entire range of the signal has been annotated with the “Unknown” label.

In addition to random noise, you need a set of random audio words in your Unknown dataset to reduce the chance of false positive detections. These audio events usually consist of words with variable lengths and different pronunciations. The content of these audio words should not be too similar to the project keywords or to other audio signals that are already in the dataset. These types of Unknown signals are not limited to the human voice and can include very distinct, intense audio events that may happen frequently in the deployment environment.

As an example, the following figure displays an audio signal made by knocking on a table. These events are labeled with segments of the same default length as is used for the keywords.