Building a Smart Lock Application using an Audio Sensor

Imagine a smart lock that doesn’t require fumbling with keys or codes. Here, I explore the exciting possibility of using acoustic data to detect the status of a door lock.

For more thorough explanations, you can also refer to the SensiML documentation on how to build a Smart Lock Demo. Explore more about this application by following our blog post series.

Goals

The objective of this model is to detect and classify four different acoustic events:

  • Door Knocking: Ideally, all sorts of knocks on the door where the lock is installed. Other similar sounds should be ignored and classified as Unknown.
  • Locking/Unlocking: Turning the key inside the keyhole or using the knob. This class also includes other engagements with the door lock. If an intruder attempts to pick the lock and the engagement time exceeds a specific threshold, the application can report the incident and warn the landlord, security forces, or a custodian for further follow-ups.
  • Key In/Out: This event occurs when the key is inserted into or removed from the keyhole.
  • Unknown: Any other intense acoustic events, whether similar or dissimilar to the previously known events, to be logged and ignored.

We have considered the EFR32xG24 Dev Kit to build a Proof of Concept (PoC). This kit features an Arm Cortex-M33 MCU, two microphones, IMU sensors, and an AI/ML accelerator. SensiML Data Studio is utilized to collect, manage, and annotate our data.

For a sample dataset, please refer to: Smart Lock Demo.zip

We mainly processed our data using Piccolo AI. Given the limited amount of collected data and the difficulty of recreating each class in a controlled lab environment, the collected data is imbalanced. To balance and enrich the dataset, we altered some of the original data by adding background noise and randomly shifting it along the time axis. As a result, the model performs better when deployed in noisy environments.
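As a rough illustration of this kind of augmentation, the sketch below adds scaled background noise and a random time shift to a single recording. It assumes the recordings and noise clips are already loaded as 16 kHz NumPy arrays; the function name, SNR target, and shift range are hypothetical choices rather than the exact values used for the demo dataset.

import numpy as np

def augment(signal, noise, snr_db=10.0, max_shift=1600, rng=None):
    """Add background noise at roughly snr_db and apply a random circular time shift."""
    rng = rng or np.random.default_rng()
    signal = signal.astype(np.float32)
    noise = noise[: len(signal)].astype(np.float32)  # assume the noise clip is at least as long

    # Scale the noise so the resulting signal-to-noise ratio is about snr_db
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    noisy = signal + gain * noise

    # Shift the event randomly along the time axis (here up to 1600 samples, i.e. 100 ms)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(noisy, shift)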


Model Definition and Training

The nature of the problem requires us to use a more complex ML algorithm, such as a deep neural network. We use Mel-Frequency Cepstral Coefficients (MFCCs) to extract acoustic features and train a simplified convolutional neural network (CNN) to generate a model sensitive to the target sound events.
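As a point of reference, the snippet below computes an MFCC matrix for one audio segment using librosa, which serves here only as a stand-in for the MFCC feature generator in Piccolo AI. With 20 coefficients, 400-sample windows, and a 400-sample hop (the windowing described later in this post), a 6000-sample segment yields a 20×15 feature matrix, consistent with the input shape of the model below.

import librosa
import numpy as np

def extract_mfcc(segment, sample_rate=16000, n_mfcc=20, frame_len=400, hop_len=400):
    """Return an (n_mfcc x n_frames) MFCC matrix for one fixed-length audio segment."""
    return librosa.feature.mfcc(
        y=segment.astype(np.float32),
        sr=sample_rate,
        n_mfcc=n_mfcc,       # cepstral coefficients kept per window
        n_fft=frame_len,     # 400-sample analysis window (25 ms at 16 kHz)
        hop_length=hop_len,  # 400-sample hop, i.e. non-overlapping windows
        center=False,        # no padding, so a 6000-sample segment gives exactly 15 windows
    )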

Our recommended model is designed to capture all prominent details in the feature space that aid in the classification process while being compact and efficient enough to fit within the resource constraints of the chosen device.

We trained our model iteratively. Before each training epoch, we select a random but balanced subsample of our dataset, undersampling the classes that have too many instances compared to the others. This random undersampling prior to each epoch exposes the model to a greater variety of instances, allowing it to stochastically adapt to general trends instead of overfitting. It also helps prevent the model from falling into local optima.
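A minimal sketch of this per-epoch balancing is shown below. The helper name is hypothetical; x_all and y_all stand for the full set of feature matrices and their integer labels, and tf_model is the compiled Keras model defined in the next code block.

import numpy as np
import tensorflow as tf

def balanced_subsample(x_all, y_all, rng):
    """Randomly undersample every class down to the size of the smallest class."""
    classes, counts = np.unique(y_all, return_counts=True)
    n_keep = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y_all == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(idx)
    return x_all[idx], y_all[idx]

rng = np.random.default_rng(42)
for epoch in range(5):  # 5 epochs, as reported below
    x_ep, y_ep = balanced_subsample(x_all, y_all, rng)
    # add the channel axis expected by the CNN input and one-hot encode the 4 labels
    tf_model.fit(x_ep[..., np.newaxis],
                 tf.keras.utils.to_categorical(y_ep, num_classes=4),
                 epochs=1, verbose=0)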

Here is the model implementation in TensorFlow:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

optimization_metric = "accuracy"

tf_model = tf.keras.Sequential()

# input layer: one MFCC feature matrix per segment, with a single channel
# (x_train and class_map come from the preceding feature-generation step)
tf_model.add(keras.Input(shape=(x_train[0].shape[0], x_train[0].shape[1], 1)))

# convolutional block #1
tf_model.add(layers.Conv2D(16, (2, 2), padding="valid", activation="relu"))
tf_model.add(layers.Dropout(0.25))
tf_model.add(layers.Conv2D(16, (2, 2), padding="valid", activation="relu"))

# avoiding overfitting
tf_model.add(layers.BatchNormalization(axis=3))
tf_model.add(layers.Dropout(0.25))

# convolutional block #2
tf_model.add(layers.Conv2D(8, (2, 2), padding="valid", activation="relu"))
tf_model.add(layers.Dropout(0.25))
tf_model.add(layers.Conv2D(8, (2, 2), padding="valid", activation="relu"))

# fully connected layers
tf_model.add(layers.Flatten())
tf_model.add(layers.Dense(16, activation="relu"))

# output layer: one probability per class
tf_model.add(layers.Dense(len(class_map.keys()), activation="softmax"))
tf_model.compile(optimizer="Adam", loss="categorical_crossentropy", metrics=[optimization_metric])

tf_model.summary()

The summary table of the model parameters is as follows:

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 conv2d (Conv2D)             (None, 19, 14, 16)        80

 dropout (Dropout)           (None, 19, 14, 16)        0

 conv2d_1 (Conv2D)           (None, 18, 13, 16)        1040

 batch_normalization (BatchN  (None, 18, 13, 16)       64
 ormalization)

 dropout_1 (Dropout)         (None, 18, 13, 16)        0

 conv2d_2 (Conv2D)           (None, 17, 12, 8)         520

 dropout_2 (Dropout)         (None, 17, 12, 8)         0

 conv2d_3 (Conv2D)           (None, 16, 11, 8)         264

 flatten (Flatten)           (None, 1408)              0

 dense (Dense)               (None, 16)                22544

 dense_1 (Dense)             (None, 4)                 68

=================================================================
Total params: 24,580
Trainable params: 24,548
Non-trainable params: 32
_________________________________________________________________

After training the model for 5 epochs, we achieved a classification accuracy of 96%. However, this accuracy was reduced to 93% after quantization.
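The quantization step itself is not shown above. A typical post-training full-integer conversion with the TensorFlow Lite converter looks roughly like the sketch below; the representative-dataset generator and output file name are illustrative, not the exact export settings used for the demo.

import numpy as np
import tensorflow as tf

def representative_data_gen():
    # Feed a few hundred real feature matrices so the converter can calibrate value ranges
    for sample in x_train[:200]:
        yield [sample[np.newaxis, ..., np.newaxis].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(tf_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("smart_lock_model.tflite", "wb") as f:
    f.write(tflite_model)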

Below is the detailed confusion matrix. The horizontal axis represents the ground truth classes, while the vertical axis represents the predicted labels.

Real-World Challenges and Performance Optimization

While our initial model demonstrates the potential of audio-based smart locks, deploying it in real-world environments presents additional challenges, such as:

  • Background Noise: Real-world settings can introduce background noise from traffic, conversations, or even appliances. This can lead to a decrease in model performance compared to the controlled demo environment.

  • Environmental Variations: The texture and material of the surface upon which the lock is mounted can slightly alter the audio signature of the acoustic events received by the device. This highlights the importance of collecting data under various conditions.

Strategies for Enhanced Performance

Here are some approaches to improve performance and prepare the model for real-world scenarios:

  • Improving Data Collection: Collecting more data under different noise conditions and on various surfaces allows the model to learn a wider range of audio patterns.

  • Advanced Data Augmentation Techniques: More sophisticated data augmentation techniques can be utilized to simulate even more diverse scenarios and edge cases, improving the model robustness.

  • Hyperparameter Tuning: Hyperparameters are settings within the model and feature pipeline that can be adjusted to influence performance. Techniques like grid search can be used to evaluate multiple model architectures and hyperparameter combinations (a minimal grid-search sketch follows this list). Some potential hyperparameters that can be optimized are as follows:

    • Number of MFCC Parameters: Adjust the number of MFCCs used in feature generation to find the optimal set for accurate classification.

    • Audio Segment Size: Currently, our model requires segments consisting of 6000 samples (equivalent to 375 milliseconds at 16 kHz) to generate classifications. By experimenting with different segment sizes, one may identify the optimal value that maximizes model performance.

    • Window/Sliding Size: We currently use a window size of 400 samples to generate MFCC features, with 15 successive windows covering the entire segment of 6000 samples. The sliding value is 400, meaning windows do not overlap. To improve the model performance and to increase the sensitivity of the extracted feature vector to finer details, one may decrease the sliding parameter to allow some overlap between windows.
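The sketch below outlines what such a grid search could look like. Here, build_features and build_model are hypothetical helpers standing in for the MFCC generation and CNN construction steps above, and the grid values are illustrative rather than the ones used for the demo.

import itertools

# Hypothetical search grids; adjust to taste
param_grid = {
    "n_mfcc":      [16, 20, 24],        # MFCC coefficients per window
    "segment_len": [4000, 6000, 8000],  # samples per classification segment
    "hop_len":     [200, 400],          # window slide; 200 gives 50% overlap
}

best_acc, best_params = 0.0, None
for n_mfcc, segment_len, hop_len in itertools.product(*param_grid.values()):
    # build_features / build_model are placeholders for the feature and model steps above
    x_tr, y_tr, x_val, y_val = build_features(n_mfcc, segment_len, hop_len)
    model = build_model(input_shape=x_tr.shape[1:])
    model.fit(x_tr, y_tr, epochs=5, verbose=0)
    _, acc = model.evaluate(x_val, y_val, verbose=0)
    if acc > best_acc:
        best_acc, best_params = acc, dict(n_mfcc=n_mfcc, segment_len=segment_len, hop_len=hop_len)

print(f"best validation accuracy: {best_acc:.3f} with {best_params}")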

Complexity and Optimization

One may experiment with the model architecture to improve performance. Options include adding or removing layers, adjusting their sizes, and/or changing their operations. This can be done systematically through grid search or by adopting an evolutionary framework such as a genetic algorithm.
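As one way to do this, the builder below parameterizes the depth and width of a CNN similar to the one above (batch normalization is omitted for brevity); the function name and candidate tuples are hypothetical.

from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape, num_classes=4, conv_filters=(16, 16, 8, 8), dense_units=16, dropout=0.25):
    """Build a CNN like the one above with configurable depth and layer widths."""
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    for filters in conv_filters:  # one Conv2D block per entry; add or remove entries to change depth
        model.add(layers.Conv2D(filters, (2, 2), padding="valid", activation="relu"))
        model.add(layers.Dropout(dropout))
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="Adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Example candidates for a grid search or genetic algorithm over architectures
candidate_filters = [(16, 16, 8, 8), (16, 8), (32, 16, 8)]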

Profiling

The Silicon Labs xG24 SoC includes a dedicated hardware AI accelerator and is supported by a TensorFlow-optimized SDK for developing advanced inference models with efficient AI acceleration. To explore the potential of this device, we conducted thorough real-time performance evaluations across a range of audio models with varying complexities.

This particular device integrates an AI acceleration unit specifically designed to expedite the matrix multiplication tasks essential for neural network inference. To gauge the effectiveness of this AI acceleration, we measured the MCU cycle counts required for model execution with and without the accelerator. These counts were then converted into latency values using the device's nominal 78 MHz clock speed, providing clear insight into the performance improvements.
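The cycle-to-latency conversion itself is just a scaling by the clock frequency; for example (the cycle count below is made up purely for illustration):

CLOCK_HZ = 78_000_000  # nominal xG24 core clock

def cycles_to_ms(cycle_count):
    """Convert a measured MCU cycle count into milliseconds at 78 MHz."""
    return cycle_count * 1000.0 / CLOCK_HZ

print(cycles_to_ms(1_560_000))  # a hypothetical 1.56 M-cycle run corresponds to 20.0 ms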

Our algorithm divides the classification process into two primary stages. First, it captures audio data at a 16 kHz sampling rate and extracts feature vectors such as MFCCs. This stage runs entirely in MCU software, using optimized signal-processing functions from the Arm CMSIS-DSP library. The second stage feeds the extracted feature vectors into a quantized CNN for classification. We systematically vary the CNN input size and explore its impact on overall classification performance under different operational scenarios.

We explore three different categories. Category 1 comprises models that require 15×400 audio samples (375 msec at 16 kHz) to make the classification. Categories 2 and 3 consist of smaller models that need 250 and 150 msec of audio input, respectively. The following diagram is an illustration of the Category 1 model.

We begin with the Category 1 model to demonstrate the benefits of the AI accelerator in computational tasks. For each network architecture, we run the model with and without accelerator assistance and illustrate the classification latencies in the diagram below. Open circles indicate latencies when classifications are handled solely by the MCU, while green-filled points show latencies for the same models when the accelerator is utilized.

In analyzing our findings, we observe a significant reduction in classification time thanks to the accelerator, typically by a factor of 1.5 or more. This efficiency gain becomes more pronounced as model complexity increases, necessitating more matrix operations. For instance, a CNN model with approximately one million parameters experiences nearly a twofold acceleration in classification speed when utilizing the accelerator.

However, for our smart lock application, achieving reliable classifications within a tight 25-millisecond window (equivalent to 400 audio samples) presents challenges, especially in noisy environments. Complex models designed for high accuracy in controlled settings often struggle during real-time execution on the device due to the inability to process streaming data concurrently with classification. To address this issue, it may be advantageous to develop models that can tolerate some data loss or to adopt simpler CNN architectures with smaller feature vectors. These approaches can help maintain robust performance in dynamic operational environments where data continuity is not guaranteed during classification intervals.

In the following plot, we tested Category 2 and 3 models with varying complexities. Open symbols denote classification latencies using only the MCU, while filled symbols indicate latencies for corresponding models utilizing the accelerator. It is evident that the accelerator reduces classification latencies by a factor ranging from 1.5 to 1.7.

The primary objective when developing applications for edge devices is to create accurate models that operate efficiently within the desired inference rate, as indicated by the dashed horizontal line in the diagram above. The AI accelerator proves beneficial by enabling the deployment of models featuring 10×20 size feature vectors, capable of processing outputs at the audio sampling rate.

It is important to emphasize that this case study serves only as an example, demonstrating how this methodology can be applied to other applications with different requirements and device configurations.