What is the architecture of the KWS model offered by Piccolo AI? Are there alternative ML solutions known for being more compact while delivering superior performance?
Overview
The Piccolo AI wake-word system is built on a depthwise separable convolutional neural network (DS-CNN), which is designed to be simpler and more efficient than models such as ResNet and VGG-X. For more details, please refer to this document: (Sørensen, P.M., Epp, B. & May, T. A depthwise separable convolutional neural network for keyword spotting on an embedded system. J. Audio Speech Music Proc. 2020, 10 (2020))
This keyword spotting system has four main stages:
- [A] Audio data is captured at a rate of 16 kHz. Recognition is activated only if the energy level of the incoming signal meets the energy threshold constraints.
- [B] Audio data is processed in windows of 480 samples with a slide of 320 samples. 23 MFCC features are generated for each window. Features of 49 successive windows are stacked to cover approximately one second of data, resulting in an array of size 49x23 for every second of the data stream (see the sketch after this list).
- [C] Feature tensors are processed by a DS-CNN.
- [D] The flattened arrays are then mapped to the corresponding keywords.
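The front-end processing in stages [A] and [B] can be sketched as follows. This is a minimal illustration in Python, assuming librosa as a stand-in MFCC front end and an arbitrary energy threshold; Piccolo AI's actual preprocessing pipeline and threshold values may differ.

```python
import numpy as np
import librosa  # assumed stand-in for the actual Piccolo AI audio front end

SAMPLE_RATE = 16_000   # stage [A]: audio captured at 16 kHz
FRAME_LEN   = 480      # stage [B]: 480-sample (30 ms) analysis window
HOP_LEN     = 320      # 320-sample (20 ms) slide between windows
N_MFCC      = 23       # MFCC features per window
N_FRAMES    = 49       # 49 stacked windows cover ~1 second of audio

def is_active(audio, energy_threshold=1e-3):
    """Stage [A]: simple energy gate; the threshold value is a placeholder."""
    return np.mean(audio.astype(np.float32) ** 2) >= energy_threshold

def extract_features(audio):
    """Stage [B]: build the 49x23 MFCC feature matrix for ~1 s of audio."""
    mfcc = librosa.feature.mfcc(
        y=audio, sr=SAMPLE_RATE, n_mfcc=N_MFCC,
        n_fft=FRAME_LEN, hop_length=HOP_LEN, center=False,
    )                                 # shape: (23, n_windows)
    mfcc = mfcc.T[:N_FRAMES]          # shape: (49, 23)
    return mfcc[..., np.newaxis]      # shape: (49, 23, 1), the DS-CNN input
```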
The DS-CNN classifier has the following form. The number of depthwise convolutional blocks, N, and the number of filters in each layer depend on the complexity of the task and the resources available on the target embedded device.
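As an illustration of this structure, below is a minimal Keras sketch of a DS-CNN classifier with N depthwise separable blocks. The kernel sizes, strides, and filter counts are placeholders, not the exact configuration shipped with Piccolo AI.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_ds_cnn(n_blocks=4, n_filters=64, n_keywords=10, input_shape=(49, 23, 1)):
    """Illustrative DS-CNN keyword classifier.

    n_blocks and n_filters correspond to N and the per-layer filter count
    mentioned above; the values used here are placeholders.
    """
    inputs = tf.keras.Input(shape=input_shape)

    # Initial standard convolution over the time-frequency feature map.
    x = layers.Conv2D(n_filters, kernel_size=(10, 4), strides=(2, 2),
                      padding="same", use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)

    # N depthwise separable blocks: depthwise conv + pointwise 1x1 conv.
    for _ in range(n_blocks):
        x = layers.DepthwiseConv2D(kernel_size=(3, 3), padding="same",
                                    use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Conv2D(n_filters, kernel_size=(1, 1), use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)

    # Stages [C]/[D]: flatten and map to keyword probabilities.
    x = layers.Flatten()(x)
    outputs = layers.Dense(n_keywords, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs, name="ds_cnn_kws")
```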
We take a transfer learning approach to build our keyword spotting system. First, we train a foundation model using a large dataset covering numerous keywords. Then we customize the model by removing the last fully connected layer and replacing it with several new fully connected layers. This new network is subsequently trained on the available data for the specific keyword(s).
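A minimal sketch of this customization step is shown below, assuming a Keras foundation model whose final layer is the original fully connected classifier; the layer sizes and training settings are illustrative only.

```python
import tensorflow as tf
from tensorflow.keras import layers

def adapt_for_keywords(foundation_model, n_new_keywords, n_hidden=64):
    """Replace the foundation model's last fully connected layer with new
    fully connected layers and prepare it for fine-tuning on the
    keyword-specific data. Layer indices and sizes are illustrative."""
    # Drop the original classification head (the last fully connected layer).
    backbone = tf.keras.Model(
        inputs=foundation_model.input,
        outputs=foundation_model.layers[-2].output,
    )
    backbone.trainable = False  # keep the pre-trained feature extractor frozen

    # New fully connected layers for the target keyword(s).
    x = layers.Dense(n_hidden, activation="relu")(backbone.output)
    outputs = layers.Dense(n_new_keywords, activation="softmax")(x)
    model = tf.keras.Model(backbone.input, outputs)

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```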
Notes
To optimize latency, we can utilize only the portion of the feature matrix that retains the most relevant information in the frequency space. The following figure illustrates the original (top) and trimmed (bottom) versions of the same spectrogram. The temporal axis is horizontal, and the frequency dimension is vertical. In this case, a few of the lowest and highest frequency rows are removed so that 16 of the 23 frequency rows remain, and the resulting tensor has shape [49, 16, 1]. While this trimming does not significantly affect accuracy, it reduces the number of matrix operations and thus decreases recognition latency.
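A minimal sketch of such trimming, assuming the (49, 23, 1) feature layout described above; the exact number of rows removed at each end is a tunable placeholder here.

```python
import numpy as np

def trim_frequency_rows(features, low=2, high=5):
    """Remove the `low` lowest and `high` highest frequency rows.

    With these placeholder defaults, a (49, 23, 1) feature matrix
    becomes (49, 16, 1).
    """
    return features[:, low:features.shape[1] - high, :]

# Example: trimming a dummy one-second feature matrix.
features = np.zeros((49, 23, 1), dtype=np.float32)
print(trim_frequency_rows(features).shape)  # (49, 16, 1)
```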