At the end of my previous post, I mentioned that my newly re-trained wake word model might have been overfitting to noise in the training data:
As you can see, the model performed much better with data collected through my headphones, but it now struggled with samples collected through my laptop's microphone. Based on the training logs, this is likely happening because the model is overfitting to noise in the training data.
Here's the data from the training logs that I was referring to:
Training set accuracy: 0.99747
Test set accuracy: 0.99956
Dev set accuracy: 0.83334
It's clear that the model performed well on data it had already seen (the training set), and on unseen data that followed a similar distribution (the test set). Yet it failed to generalize with the same accuracy to the data it would see in the "real world" (the dev set). In general, this problem is called overfitting, and it can have quite a few causes, such as:
For my particular case, I think #2 is the most likely culprit. In this post, I'll share some of my reasoning for this, as well as how I confirmed this suspicion.
Before getting into the reasoning behind my hypothesis, it's helpful to first understand the data sources I used, as well as how they were transformed before being passed to the model for training. I used three different sources of data:
Since the overall number of samples was quite low, I duplicated the samples used for training the model, and then used SpecAugment to randomly mask out frequencies so that the duplicated copies were slightly different from the originals. A rough sketch of this masking step is shown just below.
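Conceptually, the augmentation looked something like this minimal NumPy sketch of SpecAugment-style frequency masking. This isn't my exact code; the function names, mask widths, and spectrogram shape are assumptions for illustration.

```python
import numpy as np

def freq_mask(spectrogram, max_mask_size=8, num_masks=2, rng=None):
    """Zero out random bands of frequency bins (SpecAugment-style frequency masking).

    spectrogram: array of shape (num_freq_bins, num_time_frames).
    """
    rng = rng or np.random.default_rng()
    augmented = spectrogram.copy()
    num_bins = augmented.shape[0]
    for _ in range(num_masks):
        width = rng.integers(1, max_mask_size + 1)         # how many frequency bins to mask
        start = rng.integers(0, max(1, num_bins - width))  # where the masked band starts
        augmented[start:start + width, :] = 0.0            # "remove" those frequencies
    return augmented

def duplicate_and_augment(spectrograms):
    """Keep the originals and add a masked copy of each, so the duplicates differ slightly."""
    return list(spectrograms) + [freq_mask(s) for s in spectrograms]
```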
This whole process is illustrated below:
Ultimately, the model would be evaluated through the test app, and that evaluation would consist entirely of samples of my own voice. Given this, it's easy to see why I had the intuition that the data used to train the model wasn't representative of how it would be evaluated. In particular, I was concerned that:
In this section I'll share how I confirmed my hypothesis, and how I improved the model's accuracy.
First, I wanted to test the data sources used to train the model to see which of them were actually helpful. To do this, I trained 3 different versions of the model and evaluated each one's performance on the dev set. Here's how the 3 models differed:
After training these models, here's how each of them performed on the dev set:
As you can see, the model with the highest dev set accuracy was the one trained only on samples from my test app and the ambient noise data. The model trained only on samples from my test app also did quite well. This helped confirm my theory that the model was overfitting to the data in the Common Voice dataset.
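To make that experimental setup concrete, here's a rough sketch of the ablation harness. The loaders and training function are placeholder stand-ins (the real data loading and wake word model aren't shown), and the exact source combinations are inferred from the results above rather than copied from my code.

```python
import numpy as np

rng = np.random.default_rng(0)

def load_source(name, n=100, n_mels=40, n_frames=50):
    """Placeholder loader: returns fake (spectrogram, label) pairs for a data source."""
    X = rng.normal(size=(n, n_mels, n_frames))
    y = rng.integers(0, 2, size=n)  # 1 = wake word, 0 = not
    return X, y

def train_and_evaluate(train_sets, dev_set):
    """Placeholder: train the same architecture on the combined sets, return dev-set accuracy."""
    X = np.concatenate([X for X, _ in train_sets])
    y = np.concatenate([y for _, y in train_sets])
    # ...train the wake word model on (X, y) and score it on dev_set here...
    return 0.0  # stand-in for the real dev-set accuracy

dev_set = load_source("dev")
variants = {
    "test app only": ["test_app"],
    "test app + ambient noise": ["test_app", "ambient_noise"],
    "test app + ambient noise + Common Voice": ["test_app", "ambient_noise", "common_voice"],
}
for label, sources in variants.items():
    accuracy = train_and_evaluate([load_source(s) for s in sources], dev_set)
    print(f"{label}: dev accuracy = {accuracy:.3f}")
```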
I also wanted to test the effect that the different data augmentation techniques (duplication and SpecAugment) had on the model's performance. To evaluate this, I trained 5 different versions of the model, all on the same data sources (test app samples and ambient noise samples) to keep things consistent. Here's how the 5 models differed:
Here's how each of these models performed on the dev set:
From these results, it's harder to conclude definitively which augmentation setup was most useful for improving the model's performance, but we can draw two conclusions:
It looks like tweaking the data sources used to train the model had the biggest impact on improving its accuracy, but from this whole process I had two other important takeaways: