Removing Outliers

I have collected data from an IMU sensor to build a gesture recognition application. How can I remove outliers before the model building process?

Outliers can significantly affect the performance of machine learning models, so it’s important to handle them appropriately.

1. Understand Your Data

  • Visualize Data: Plot your IMU sensor data to visually inspect it for any obvious outliers. Libraries such as Matplotlib or Seaborn in Python are useful for this purpose, and you can use the SensiML Python SDK to access segments of your data and explore them with these well-known libraries.
  • Statistical Summary: Calculate statistical measures (mean, median, standard deviation) to understand the data distribution. Values that lie farther from the mean than a threshold defined as a multiple of the standard deviation can be treated as outliers, or studied in more detail to find the cause of their high variation. A minimal sketch of both steps follows this list.
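
For example, here is a minimal sketch of both steps, assuming the segments have already been exported to a CSV file (the file name and column names below are hypothetical and should be adjusted to match your own captures):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV export of IMU segments; adjust the file name and
# column names to match your own captures.
df = pd.read_csv("gesture_capture.csv")
accel_cols = ["AccelerometerX", "AccelerometerY", "AccelerometerZ"]

# Plot each accelerometer channel over time to eyeball spikes, drop-outs, and drift.
df[accel_cols].plot(subplots=True, figsize=(10, 6))
plt.tight_layout()
plt.show()

# Statistical summary: count, mean, std, min/max, and quartiles per channel.
print(df[accel_cols].describe())
```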

2. Define What Constitutes an Outlier

  • Domain Knowledge: Use your knowledge of the application to define thresholds for what values are considered normal and what values are outliers.
  • Statistical Methods: Use statistical techniques such as Z-score or IQR (Interquartile Range) to identify outliers, as sketched below.
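
As a rough illustration, both techniques can be applied to a single sensor channel as follows; the 3-sigma and 1.5-IQR cutoffs are common defaults, not requirements:

```python
import pandas as pd

def zscore_inliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where the absolute z-score is within the threshold."""
    z = (series - series.mean()) / series.std()
    return z.abs() <= threshold

def iqr_inliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where the value lies inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Example usage on a hypothetical accelerometer column:
# df = pd.read_csv("gesture_capture.csv")
# clean = df[zscore_inliers(df["AccelerometerX"]) & iqr_inliers(df["AccelerometerX"])]
```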

How to remove outliers in the Piccolo AI pipeline

This guide shows how to remove outliers from your IMU sensor data when building your model with the SensiML model builder.

1- In the Feature Extraction block of your pipeline, click the “+” sign and select an “Outlier Filter” block.

2- Define one or more outlier filters. The objective here is to configure these filters to remove as much unwanted data as possible before starting the modeling process, so any prior data exploration and insight can be very helpful at this stage. Here is the list of all offered filters; a rough scikit-learn sketch of several of them follows the list:


  • Local Outlier Factor Filtering: The local outlier factor (LOF) is an unsupervised outlier detection method that measures the local deviation of a given data point with respect to its neighbors by comparing their local densities. Samples with a substantially lower density than their neighbors are treated as outliers and removed.

  • Zscore Filter: A z-score filter standardizes feature vectors by transforming each feature in the vector to have a mean of zero and a standard deviation of one. The z-score, or standard score, measures how many standard deviations a data point is from the mean of the distribution. Feature vectors whose z-scores fall outside a cutoff threshold are removed.

  • Sigma Outliers Filtering: A sigma outlier filter algorithm is a technique used to identify and remove outliers from feature vectors based on their deviation from the mean. In this algorithm, an outlier is defined as a data point that falls outside a certain number of standard deviations (sigma) from the mean of the distribution.

  • One Class SVM filtering: An unsupervised outlier detection method that estimates the support of a high-dimensional distribution. The implementation is based on libsvm.

  • Robust Covariance Filtering: An unsupervised outlier detection method for detecting outliers in a Gaussian-distributed dataset.

  • Isolation Forest Filtering: The Isolation Forest algorithm returns an anomaly score for each sample. It isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
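
These pipeline blocks expose their own parameters inside Piccolo AI, which are not reproduced here. If you want to prototype the effect of the density- and model-based filters on your extracted feature vectors before configuring the pipeline, the underlying algorithms are available in scikit-learn. The sketch below is a rough stand-in, not the pipeline's implementation; it assumes your features are already in a NumPy array X with one row per segment, and the contamination and neighbor settings are illustrative defaults:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

# Placeholder feature matrix: one row per segment (replace with your own features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))

filters = {
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20),
    "One Class SVM": OneClassSVM(nu=0.05),
    "Robust Covariance": EllipticEnvelope(contamination=0.05),
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=0),
}

for name, estimator in filters.items():
    # fit_predict returns +1 for inliers and -1 for outliers.
    labels = estimator.fit_predict(X)
    print(f"{name}: kept {int((labels == 1).sum())} of {X.shape[0]} feature vectors")
```

The Zscore and Sigma filters in the list can be reproduced with the simple mean/standard-deviation mask shown earlier, applied per feature.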