Introducing Wav2letter++

The results were so promising that the FAIR team decided to open source an initial implementation of this approach.

Wav2letter++

Recent advancements in deep learning technologies have resulted in an increase in the number of automatic speech recognition (ASR) frameworks and toolkits available to developers. However, the progress shown by the fully convolutional speech recognition model inspired the FAIR team to create Wav2letter++, a deep learning speech recognition toolkit written entirely in C++. The core design of Wav2letter++ is motivated by three key principles:

1) Implement the foundation needed to efficiently train models on datasets containing many thousands of hours of speech.

2) Enable a simple and extensible model for expressing and incorporating new network architectures, loss functions, and other core operations in speech recognition systems.

3) Streamline the transition from research to deployment of speech recognition models.

With those design principles as a guideline, Wav2letter++ implements a very straightforward architecture, shown in the following diagram:

There are a few points worth highlighting in order to better understand the Wav2letter++ architecture:

· ArrayFire Tensor Library: Wav2letter++ uses ArrayFire as its primary library for tensor operations. ArrayFire enables high-performance, parallel computations in a hardware-agnostic model that can execute on multiple back-ends, including a CUDA GPU back-end and a CPU back-end.

· Data Preparation and Feature Extraction: Wav2letter++ supports feature extraction across different audio formats. The framework computes features on the fly prior to each network evaluation, using asynchrony and parallelization to maximize efficiency during model training.

· Models: Wav2letter++ includes a rich portfolio of end-to-end sequence models as well as a wide range of network architectures and activation functions.

· Scalable Training: Wav2letter++ supports three main modes of training: train (flat-start training), continue (continuing from a checkpoint state), and fork (e.g. for transfer learning). The training pipeline scales seamlessly using data-parallel, synchronous stochastic gradient descent, with inter-process communication powered by the NVIDIA Collective Communication Library.

· Decoding: The Wav2letter++ decoder is based on the beam-search decoder of the fully convolutional architecture explored previously. The decoder is responsible for outputting the final transcription of an audio file.

Wav2letter++ in Action

The FAIR team tested Wav2letter++ against a series of speech recognition frameworks such as ESPNet, Kaldi, and OpenSeq2Seq. The experiments were based on the well-known Wall Street Journal CSR dataset. The initial results showed that Wav2letter++ outperformed the alternatives in every aspect of the training cycle.

Implementing speech recognition systems based entirely on CNNs is certainly an interesting approach that can reduce the computing power and training data required by these types of deep learning models. Facebook's Wav2letter++ implementation of this approach already ranks as one of the fastest speech recognition frameworks on the market. We are likely to see more advancements in this area of research in the near future.
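To give a flavor of the beam-search decoding idea mentioned above, here is a minimal, illustrative sketch in Python. It is not wav2letter++'s actual C++ decoder (which also integrates a language model and lexicon constraints); it simply keeps the top-k hypotheses per audio frame by cumulative log-probability. The function name, alphabet, and toy inputs are all hypothetical.

```python
import math

def beam_search(log_probs, alphabet, beam_size=3):
    """Toy beam-search decoder.

    log_probs: list of per-frame log-probability vectors over `alphabet`.
    Returns the highest-scoring (hypothesis, log_prob) pair.
    """
    beams = [("", 0.0)]  # (partial transcription, cumulative log-prob)
    for frame in log_probs:
        candidates = []
        for hyp, score in beams:
            for i, lp in enumerate(frame):
                # extend every surviving hypothesis with every symbol
                candidates.append((hyp + alphabet[i], score + lp))
        # prune: keep only the `beam_size` best-scoring hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]

# Toy usage: two frames over a two-symbol alphabet
alphabet = ["a", "b"]
frames = [[math.log(0.9), math.log(0.1)],
          [math.log(0.4), math.log(0.6)]]
best, score = beam_search(frames, alphabet)
# best == "ab" (the per-frame argmax path here)
```

A real ASR decoder would additionally collapse repeated symbols and blanks (as in CTC-style criteria) and weight each extension with a language-model score, but the prune-and-extend loop above is the core of the beam-search strategy.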