Facebook AI open-sourced a new deep-learning natural-language processing (NLP) model, robustly-optimized BERT approach (RoBERTa). Based on Google’s BERT pre-training model, RoBERTa includes additional pre-training improvements that achieve state-of-the-art results on several benchmarks, using only unlabeled text from the world-wide web, with minimal fine-tuning and no data augmentation.
The Facebook team announced their work in a recent blog post as “part of Facebook’s ongoing commitment to advancing the state-of-the-art in self-supervised systems that can be developed with less reliance on time- and resource-intensive data labeling.” The team re-implemented Google’s BERT neural-network architecture in PyTorch, made several changes to the model’s hyperparameters, and trained the network on an order of magnitude more data and for more iterations. The model was evaluated on three common NLP benchmarks: General Language Understanding Evaluation (GLUE), Stanford Question Answering Dataset (SQuAD), and ReAding Comprehension from Examinations (RACE). RoBERTa outperformed BERT on these tests, and in some cases also outperformed the current leading model, XLNet.
Many machine-learning tasks require a labeled dataset, which consists of input examples tied to correct output values, against which the training process checks the AI’s answers. Because they often require human work, very large labeled datasets are relatively rare, especially compared to the wealth of unlabeled data for NLP that exist on the internet; for example, the contents of Wikipedia or Google News. Pre-training is an NLP strategy that uses large unlabeled datasets to create “general purpose language representation models”, which can then be “fine-tuned” for a specific NLP task on smaller labeled datasets. Open-sourced in late 2018, BERT, or Bidirectional Encoder Representations from Transformers, is an NLP architecture that uses pre-training to learn relationships between words, by predicting masked words in input sentences. BERT is based on the Transformer architecture, and was the first bi-directional deep-learning NLP model, meaning it could use words after the masked word, as well as those preceding it, as context for predicting the answer. BERT also models relationships between sentences by training on next-sentence prediction (NSP); given two sentences, does the second sentence truly follow the first in the original text?
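The masked-word objective described above can be illustrated with a minimal Python sketch. It is not BERT's actual pre-processing code: the `mask_tokens` helper, the whitespace tokenization, and the 15% default masking rate are assumptions made for illustration, and the real model applies further refinements when choosing replacement tokens.

```python
import random

MASK = "[MASK]"  # placeholder token; the name is chosen for illustration


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide a random subset of tokens and record the words the model must predict.

    Returns (masked_tokens, labels). labels[i] holds the original word at each
    masked position and None elsewhere, so a loss would only be computed on
    the masked positions.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model sees the placeholder...
            labels.append(tok)    # ...and is trained to recover this word
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    sentence = "the cat sat on the mat because it was tired".split()
    masked, labels = mask_tokens(sentence, mask_prob=0.3, rng=random.Random(1))
    print(masked)
    print(labels)
```

Because the unmasked words on both sides of each placeholder remain visible, a model trained on such inputs can draw on context before and after the hidden word, which is the bidirectional property the article describes.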
In creating RoBERTa, the Facebook team first ported BERT from Google’s TensorFlow deep-learning framework to their own framework, PyTorch. Next, they modified the word-masking strategy: BERT used a static mask, where words were masked from sentences once during pre-processing, whereas RoBERTa uses dynamic masking, with a new masking pattern generated each time a sentence is fed into training. They also eliminated the NSP training, as Facebook’s analysis showed that it actually hurt performance. Finally, RoBERTa was trained using larger mini-batch sizes: 8K sequences compared to BERT’s 256.
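The difference between static and dynamic masking can be sketched in a few lines of Python. This is a simplified illustration rather than the RoBERTa training code: the function names, the per-token masking probability, and the epoch-based generators are all assumptions.

```python
import random

MASK = "[MASK]"  # placeholder token; the name is chosen for illustration


def apply_mask(tokens, rng, mask_prob=0.15):
    """Replace each token with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]


def static_masking_epochs(sentences, num_epochs, mask_prob=0.15, seed=0):
    """Static masking: mask once during pre-processing, reuse the result every epoch."""
    rng = random.Random(seed)
    premasked = [apply_mask(s, rng, mask_prob) for s in sentences]
    for _ in range(num_epochs):
        yield [list(s) for s in premasked]


def dynamic_masking_epochs(sentences, num_epochs, mask_prob=0.15, seed=0):
    """Dynamic masking: draw a fresh masking pattern each time a sentence is used."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        yield [apply_mask(s, rng, mask_prob) for s in sentences]


if __name__ == "__main__":
    data = ["the quick brown fox jumps over the lazy dog".split()]
    for name, epochs in [("static", static_masking_epochs(data, 3, 0.3)),
                         ("dynamic", dynamic_masking_epochs(data, 3, 0.3))]:
        for i, epoch in enumerate(epochs):
            print(name, i, " ".join(epoch[0]))
```

With static masking the model sees the same hidden positions for a sentence in every pass, while dynamic masking exposes different positions over time, so longer training keeps presenting new prediction targets.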
RoBERTa was evaluated against common NLP benchmarks and compared to the original BERT results and to XLNet, another transformer-based architecture that currently holds the top scores on several of the benchmarks. RoBERTa outscored BERT and XLNet on both the RACE benchmark and GLUE’s single-task benchmark. GLUE also has a public leaderboard for its ensemble benchmark, and RoBERTa achieved the “highest average score to date” on it. On the SQuAD v2.0 “dev” benchmark, RoBERTa set a new high score, and on SQuAD’s public leaderboard it is the top system that does not rely on training data augmentation.
RoBERTa’s technical details and experiments are described more fully in a paper published on arXiv. Paper co-author Myle Ott joined a Reddit comment thread about the paper, providing more context and answering several questions. Ott said that “more data isn’t as important as training longer,” and added:
Even training for significantly more epochs than past work, we still couldn’t overfit the BERT objective and consistently saw improved end-task results each time we trained for longer.
One commenter pointed out that the comparison with XLNet was not quite “apples-to-apples.” Ott agreed, saying:
Another difference, in addition to the ones you noted, is the data size and composition are different between XLNet and RoBERTa. We ultimately abandoned doing a direct comparison to XLNet-large for this work, since we wouldn’t be able to control for the data unless we retrained XLNet on our data.