A team of researchers from Google Brain has improved the SOTA on the LibriSpeech automatic speech recognition task, achieving word error rates of 1.4 percent / 2.6 percent and bettering the previous 1.7 percent / 3.3 percent. The team's novel approach leverages a combination of recent advances in semi-supervised learning, using noisy student training with adaptive SpecAugment as the iterative self-training pipeline and giant Conformer models pretrained with the wav2vec 2.0 pretraining method.
Semi-supervised learning aims to combine a large unlabelled dataset with a small amount of labelled data during training to improve the performance of a machine learning task. The Google Brain team used audio from the Libri-Light dataset as the unlabelled data and details their automatic speech recognition method in a new paper.
The researchers note that self-training has been a promising research direction for semi-supervised learning, and they chose the noisy student training method. This involves training a series of models in which a given model serves as a teacher, generating labels on the unlabelled dataset. The student models are then trained on a dataset obtained by combining the supervised set with the teacher-labelled dataset. Google Brain employed a series of pretrained models in this self-training loop, with each model pretrained using wav2vec 2.0, a method inspired by recent successes in NLP research.
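To make that pipeline concrete, here is a minimal Python sketch of a noisy student self-training loop of this kind. The function names and toy data are hypothetical placeholders rather than the paper's implementation, and a simple augmentation flag stands in for adaptive SpecAugment.

```python
def train_model(dataset, augment):
    """Hypothetical stand-in for training an ASR model on (audio, transcript) pairs."""
    return {"trained_on": len(dataset), "augment": augment}

def transcribe(model, audio):
    """Hypothetical stand-in for decoding an utterance with the current teacher."""
    return f"pseudo-label for {audio}"

def noisy_student_loop(labelled, unlabelled, generations=4):
    # Generation 0: the first teacher is trained on the supervised set only.
    teacher = train_model(labelled, augment=True)
    for gen in range(generations):
        # The teacher labels the unlabelled audio (Libri-Light in the paper).
        pseudo_labelled = [(audio, transcribe(teacher, audio)) for audio in unlabelled]
        # The student trains on the supervised set combined with the teacher-labelled
        # set, with noise injected via augmentation (adaptive SpecAugment in the paper).
        student = train_model(labelled + pseudo_labelled, augment=True)
        # The student becomes the teacher for the next generation.
        teacher = student
    return teacher

# Toy usage with stand-in data.
labelled = [("utt_1.wav", "hello world"), ("utt_2.wav", "speech recognition")]
unlabelled = ["utt_3.wav", "utt_4.wav", "utt_5.wav"]
print(noisy_student_loop(labelled, unlabelled, generations=2))
```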
The automatic speech recognition network's core is a sequence transducer with an LSTM decoder and a Conformer encoder. "The Conformer encoder could naturally be split into a 'feature encoder' consisting of the convolution subsampling block and a 'context network' made of a linear layer and a stack of Conformer blocks," explains the paper. This split reduces the feature sequence length, and the pretraining method optimizes a contrastive loss over the encoder's outputs to improve model performance.
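As a rough illustration of that split, the sketch below separates a subsampling "feature encoder" from a "context network" and scores the result with a simplified contrastive objective. The dimensions, subsampling factor, and exact loss form are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def feature_encoder(features, subsample=4):
    # Stand-in for the convolution subsampling block: it shortens the
    # feature sequence before the context network processes it.
    return features[::subsample]

def context_network(encoded):
    # Stand-in for the linear layer plus stack of Conformer blocks.
    rng = np.random.default_rng(0)
    projection = rng.standard_normal((encoded.shape[-1], encoded.shape[-1]))
    return encoded @ projection

def contrastive_loss(context, targets, temperature=0.1):
    # Simplified contrastive objective: each context vector should be closer
    # (by cosine similarity) to its own target than to the other targets.
    c = context / np.linalg.norm(context, axis=-1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=-1, keepdims=True)
    logits = (c @ t.T) / temperature
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: 80-dim log-mel-like features over 100 frames.
features = np.random.default_rng(1).standard_normal((100, 80))
encoded = feature_encoder(features)        # shorter sequence after subsampling
context = context_network(encoded)
loss = contrastive_loss(context, encoded)  # wav2vec 2.0-style pretraining signal
print(encoded.shape, round(float(loss), 3))
```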
The researchers introduced scaled-up and giant versions of the Conformers, dubbed Conformer XL, Conformer XXL, and Conformer XXL+, with 600 million, 1 billion, and 1.05 billion parameters, respectively. The team emphasized that making the model bigger on its own does not result in performance gains — the benefits of enlarged model size are observed only after applying semi-supervised learning methods.