State of the Art Audio Data Augmentation with Google Brain’s SpecAugment and Pytorch

Implementing SpecAugment with Pytorch & TorchAudio

Zach C · Apr 30

Google Brain recently published SpecAugment: A New Data Augmentation Method for Automatic Speech Recognition, which achieved state-of-the-art results on various speech recognition tasks.

Unfortunately, Google Brain did not release code, and their implementation appears to be written in TensorFlow.

For practitioners who prefer Pytorch, I’ve published an implementation of SpecAugment using Pytorch’s great companion library torchaudio and some functionality borrowed from an ongoing collaboration with other FastAI students: fastai-audio.

SpecAugment Basics

In speech recognition, raw audio is often transformed into an image-based representation.

These images are typically spectrograms, which encode properties of sound in a format that many models find easier to learn.

Instead of doing data augmentation on raw audio signal, SpecAugment borrows ideas from computer vision and operates on spectrograms.

SpecAugment works.

Google Brain reports fantastic results:

[Figure: SOTA results using SpecAugment]

SpecAugment features three augmentations.

Time Warp

[Figure: time warping a spectrogram]

Put simply, Time Warp shifts the spectrogram in time by using interpolation to squeeze and stretch the data in a randomly chosen direction.

Time Warp is SpecAugment’s most complex and computationally expensive augmentation.

Deep learning engineer Jenny Cai and I worked through TensorFlow's sparse_image_warp functionality until we had a working Pytorch equivalent.

If you’re interested in the nitty-gritty details, you can check out SparseImageWarp.ipynb in the repo.

Google Brain’s research suggests that Time Warp is the least effective of the augmentations so, if performance is an issue, you might consider dropping this one first.
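For intuition only, here is a much-simplified stand-in for Time Warp (the real implementation ports TensorFlow's sparse_image_warp): pick a random centre point along the time axis, then linearly stretch one side and squeeze the other so the total length is unchanged. The function name, the default warp bound `W`, and the use of plain bilinear resizing are all my own simplifications, not the paper's method:

```python
import torch
import torch.nn.functional as F

def time_warp(spec: torch.Tensor, W: int = 5) -> torch.Tensor:
    """Rough stand-in for Time Warp: stretch the spectrogram on one side
    of a random centre point and squeeze the other, keeping total length.
    spec has shape (freq, time); W bounds the warp distance."""
    num_freq, num_time = spec.shape
    if num_time <= 2 * W + 1:  # too short to warp safely
        return spec
    centre = torch.randint(W + 1, num_time - W, (1,)).item()
    w = torch.randint(-W, W + 1, (1,)).item()  # signed warp distance
    if w == 0:
        return spec
    x = spec[None, None]  # (1, 1, freq, time), as F.interpolate expects
    left = F.interpolate(x[..., :centre], size=(num_freq, centre + w),
                         mode="bilinear", align_corners=False)
    right = F.interpolate(x[..., centre:], size=(num_freq, num_time - centre - w),
                          mode="bilinear", align_corners=False)
    return torch.cat([left, right], dim=-1)[0, 0]
```

The output always has the same shape as the input, which matters when spectrograms are batched for training.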

Frequency and Time Masking

Frequency Masking and Time Masking are similar to the cutout data augmentation technique commonly used in computer vision.

Put simply, we mask a randomly chosen band of frequencies or slice of time steps with the mean value of the spectrogram or, if you prefer, zero.
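A minimal sketch of both masks follows; the function names and default mask sizes are mine for illustration, and the versions in the repo differ in detail:

```python
import torch

def freq_mask(spec: torch.Tensor, F: int = 15, num_masks: int = 1,
              replace_with_zero: bool = False) -> torch.Tensor:
    """Mask `num_masks` random bands of up to `F` consecutive frequency
    bins with the spectrogram mean (or zero). spec has shape (freq, time)."""
    spec = spec.clone()
    num_freq = spec.shape[0]
    fill = 0.0 if replace_with_zero else spec.mean().item()
    for _ in range(num_masks):
        f = torch.randint(0, F + 1, (1,)).item()              # band width
        f0 = torch.randint(0, num_freq - f + 1, (1,)).item()  # band start
        spec[f0:f0 + f, :] = fill
    return spec

def time_mask(spec: torch.Tensor, T: int = 20, num_masks: int = 1,
              replace_with_zero: bool = False) -> torch.Tensor:
    """Same idea along the time axis: transpose, mask, transpose back."""
    return freq_mask(spec.t(), T, num_masks, replace_with_zero).t()
```

Because the input is cloned, the original spectrogram is left untouched, so the same sample can be re-augmented differently on each epoch.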

With time on the X axis and frequency bands on the Y axis, here’s what Time Masking looks like:

[Figure: time masking a spectrogram]

And here’s Frequency Masking:

[Figure: frequency masking a spectrogram]

Naturally, you can apply all three augmentations to a single spectrogram:

[Figure: all three augmentations combined on a single spectrogram]

Hopefully these new Pytorch functions will prove useful in your deep learning workflows.

Thanks for reading!
