How To Train a BERT Model

Many articles have focused on BERT — the model that came to dominate the world of natural language processing (NLP) and marked a new age for language models.

For those of you who may not have used transformer models (of which BERT is one) before, the process looks a little like this:

  • pip install transformers
  • Initialize a pre-trained transformers model — from_pretrained.
  • Test it on some data.
  • Maybe fine-tune the model (train it some more).
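In code, that usual workflow looks something like the sketch below (a minimal example of our own, assuming a generic pre-trained checkpoint such as bert-base-uncased):

from transformers import AutoTokenizer, AutoModelForMaskedLM

# initialize a pre-trained tokenizer and model with from_pretrained
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

# test it on some data
inputs = tokenizer('Rome is the [MASK] of Italy.', return_tensors='pt')
outputs = model(**inputs)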

Now, this is a great approach, but if we only ever do this, we lack the understanding behind creating our own transformer models.

And, if we cannot create our own transformer models — we must rely on there being a pre-trained model that fits our problem, and this is not always the case.

So in this article, we will explore the steps we must take to build our own transformer model — specifically a further developed version of BERT, called RoBERTa.

An Overview

There are a few steps to the process, so before we dive in let’s first summarize what we need to do. In total, there are four key parts:

  • Getting the data
  • Building a tokenizer
  • Creating an input pipeline
  • Training the model

Once we have worked through each of these sections, we will take the tokenizer and model we have built — and save them both so that we can then use them in the same way we usually would with from_pretrained.

Getting The Data

As with any machine learning project, we need data. In terms of data for training a transformer model, we really are spoilt for choice — we can use almost any text data.

And, if there’s one thing that we have plenty of on the internet — it’s unstructured text data.

One of the largest datasets in the domain of text scraped from the internet is the OSCAR dataset.

The OSCAR dataset boasts a huge number of different languages — and one of the clearest use-cases for training from scratch is so that we can apply BERT to some less commonly used languages, such as Telugu or Navajo.

Unfortunately, the only language I can speak with any degree of competency is English — but my girlfriend is Italian, and so she — Laura — will be assessing the results of our Italian-speaking BERT model — FiliBERTo.

So, to download the Italian segment of the OSCAR dataset we will be using HuggingFace’s datasets library — which we can install with pip install datasets. Then we download OSCAR_IT with:

In [1]: from datasets import load_dataset
Load the Italian part of the OSCAR dataset. This is a huge dataset so download can take a long time:
In [2]: dataset = load_dataset('oscar', 'unshuffled_deduplicated_it')
Reusing dataset oscar (C:\Users\James\.cache\huggingface\datasets\oscar\unshuffled_deduplicated_it\1.0.0\e4f06cecc7ae02f7adf85640b4019bf476d44453f251a1d84aebae28b0f8d51d)

Let’s take a look at the dataset object.

The dataset is a DatasetDict containing a single train dataset.
In [3]: dataset
Out[3]:
DatasetDict({
    train: Dataset({
        features: ['id', 'text'],
        num_rows: 28522082
    })
})

We can access the dataset itself through the train key. From here we can view more information, like the number of rows and structure of the dataset.
In [5]: dataset['train']
Out[5]:

Dataset({
    features: ['id', 'text'],
    num_rows: 28522082
})
In [6]: dataset['train'].features
Out[6]:
{'id': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None)}

Let’s take a look at a single sample:

In [7]: dataset['train'][0]
Out[7]:
{'id': 0,
 'text': "La estrazione numero 48 del 10 e LOTTO ogni 5 minuti e' avvenuta sabato 15 settembre 2018 alle ore 04:00 a Roma, nel Centro Elaborazione Dati della Lottomatica Italia (ora GTech SpA), con la supervisione della Amministrazione Autonoma dei Monopoli di Stato (AAMS), incaricata di vigilare sulla regolarità delle operazioni di sorteggio.\nIl Montepremi della 48ª estrazione viene ripartito tra i vincitori delle singole categorie di premio.\nRicorda di controllare il Numero ORO 53. E, se lo hai giocato, anche il DOPPIO ORO 53 e 66. Se indovini puoi vincere premi più ricchi.\nIl nostro sito web impiega cookies per migliorare la navigazione del visitatore. L’utente è consapevole che, continuando a visitare il nostro sito web, accetta l’utilizzo dei cookies Accetto Informazioni\n(C) Copyright 2013-2017 10elotto.biz | Il presente sito è da considerarsi un sito indipendente, NON collegato alla rete ufficiale Gtech SpA."}

Great, now let’s store our data in a format that we can use when building our tokenizer. We need to create a set of plaintext files containing just the text feature from our dataset, and we will split each sample using a newline \n.

In [8]:
from tqdm.auto import tqdm

text_data = []
file_count = 0

for sample in tqdm(dataset['train']):
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # once we hit the 10K mark, save to file
        with open(f'../../data/text/oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1
# after saving in 10K chunks, we will have ~2082 leftover samples — we save those now too
with open(f'../../data/text/oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))
100%|██████████| 28522082/28522082 [33:32<00:00, 14173.48it/s]

Over in our data/text/oscar_it directory, we will find our newly created plaintext files — one file for every 10,000 samples, plus a final smaller file for the ~2,082 leftover samples.

Building a Tokenizer

Next up is the tokenizer! When using transformers we typically load a tokenizer, alongside its respective transformer model — the tokenizer is a key component in the process.

When building our tokenizer we will feed it all of our OSCAR data, specify our vocabulary size (number of tokens in the tokenizer), and any special tokens.

Now, the RoBERTa special tokens look like this:

  • <s> — beginning of sequence (BOS) or classifier (CLS) token
  • </s> — end of sequence (EOS) or separator (SEP) token
  • <unk> — unknown token
  • <pad> — padding token
  • <mask> — masking token
So, we make sure to include them within the special_tokens parameter of our tokenizer’s train method call.

Get a list of paths to each file in our oscar_it directory.

In [1]:
from pathlib import Path
paths = [str(x) for x in Path('../../data/text/oscar_it').glob('**/*.txt')]

Now we move onto training the tokenizer. We use a byte-level Byte-pair encoding (BPE) tokenizer. This allows us to build the vocabulary from an alphabet of single bytes, meaning all words will be decomposable into tokens.

In [2]:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
In [3]:
tokenizer.train(files=paths[:5], vocab_size=30_522, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])
Our tokenizer is now ready, and we can save its files for later use:
In [4]:
import os

os.mkdir('./filiberto')

tokenizer.save_model('filiberto')
Out[4]:
['./filiberto\\vocab.json', './filiberto\\merges.txt']
Now we have two files that define our new FiliBERTo tokenizer:
  • merges.txt — performs the initial mapping of text to tokens
  • vocab.json — maps the tokens to token IDs

And with those, we can move on to initializing our tokenizer so that we can use it as we would use any other from_pretrained tokenizer.
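Before doing that, if you are curious what these two files actually contain, here is a quick peek (a small sketch of our own, assuming the filiberto directory created above):

import json

# vocab.json is a plain token -> token ID mapping
with open('./filiberto/vocab.json', 'r', encoding='utf-8') as fp:
    vocab = json.load(fp)
print(len(vocab))  # should be close to our requested vocab_size of 30_522

# merges.txt stores the learned BPE merge rules
with open('./filiberto/merges.txt', 'r', encoding='utf-8') as fp:
    print(fp.readlines()[:5])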

Initializing the Tokenizer

We first initialize the tokenizer using the two files we built before — using a simple from_pretrained:

In [1]:
from transformers import RobertaTokenizer

# initialize the tokenizer using the vocab and merges files we saved earlier
tokenizer = RobertaTokenizer.from_pretrained('filiberto', max_len=512)
Now that our tokenizer is ready, we can try encoding some text with it. We encode just as we typically would — by calling the tokenizer directly on a string (or a list of strings), or using its encode method.
In [6]:
# test our tokenizer on a simple sentence
tokens = tokenizer('ciao, come va?')
In [7]:
print(tokens)
{'input_ids': [0, 16834, 16, 488, 611, 35, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
In [11]:
tokens.input_ids
Out[11]:
[0, 16834, 16, 488, 611, 35, 2]
From the encodings object tokens we will be extracting the input_ids and attention_mask tensors for use with FiliBERTo.
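As a quick sanity check (our own addition, not part of the original notebook), we can decode those IDs straight back into text:

print(tokenizer.decode(tokens.input_ids))
# expected to return the original text wrapped in special tokens,
# something like: <s>ciao, come va?</s>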
Creating the Input Pipeline

The input pipeline is the most complex part of the entire training process. It consists of us taking our raw OSCAR training data, transforming it, and loading it into a DataLoader ready for training.

Preparing the Data

We’ll start with a single sample and work through the preparation logic.

First, we need to open our files — the same .txt files we saved earlier. We split each file on newline characters \n, as these mark the individual samples.

In [3]:
with open('../../data/text/oscar_it/text_0.txt', 'r', encoding='utf-8') as fp:
    lines = fp.read().split('\n')
Then we encode our data using the tokenizer — making sure to include key parameters like max_length, padding, and truncation.
In [4]:
batch = tokenizer(lines, max_length=512, padding='max_length', truncation=True)
len(batch['input_ids'])
Out[4]:
10000
And now we can move onto creating our tensors — we will be training our model through masked-language modeling (MLM). So, we need three tensors:
  • input_ids — our token_ids with ~15% of tokens masked using the mask token <mask>.
  • attention_mask — a tensor of 1s and 0s, marking the position of ‘real’ tokens/padding tokens — used in attention calculations.
  • labels — our token_ids with no masking.

If you’re not familiar with MLM, I’ve explained it here.

Our attention_mask and labels tensors are simply extracted from our batch. The input_ids tensor requires more attention, however — for this tensor we mask ~15% of the tokens, assigning them the token ID 3.

In [6]:
import torch

labels = torch.tensor(batch['input_ids'])
mask = torch.tensor(batch['attention_mask'])
In [7]:
# make copy of labels tensor, this will be input_ids
input_ids = labels.detach().clone()
# create random array of floats with equal dims to input_ids
rand = torch.rand(input_ids.shape)
# mask random 15% where token is not 0 [PAD], 1 [CLS], or 2 [SEP]
mask_arr = (rand < .15) * (input_ids != 0) * (input_ids != 1) * (input_ids != 2)
# loop through each row in input_ids tensor (cannot do in parallel)
for i in range(input_ids.shape[0]):
    # get indices of mask positions from mask array
    selection = torch.flatten(mask_arr[i].nonzero()).tolist()
    # mask input_ids
    input_ids[i, selection] = 3  # our custom [MASK] token == 3

We have 10000 tokenized sequences, each containing 512 tokens.

In [8]:
input_ids.shape
Out[8]:
torch.Size([10000, 512])

We can see the special tokens here: 1 is our [CLS] token, 2 our [SEP] token, 3 our [MASK] token, and at the end we have two 0 – or [PAD] – tokens.

In [9]:
input_ids[0][:200]
Out[9]:
tensor([    1,   693, 18623,  1358,  7752,     3,  1056,   280,     3,  6321,
          776,     3,  2145,   280,    11, 10205,  3778,  1266,     3,  1197,
            3,  1142, 10293,    30,   552,     3,  1340,    16,   385,     3,
          458,  9777,  5942,   376, 25475,  2870,  1201,   391,  2691,   421,
        17927, 16996,   739,     3,     3, 22814,   376,  7950, 17824,   980,
          435, 18388,  1475,     3,     3,   391,    37, 24909,   739,  2689,
        27869,   275,  5803,   625,   770, 13459,   483,  4779,   275, 12870,
          532,    18,   680,  3867, 24138,   376,  7752, 17630, 18623,  1134,
         8882,   269,   431,   287, 12450,     3,  8041,  6056,   275,  5286,
           18, 11755,     3,   275,  6161,   317, 10528,     3,     3, 13181,
           18,   458,     3,   372,   456,  2150, 12054,    16,     3,   317,
         6122,  5324,  3329,   570,  1594, 13181,   280, 14634,    18,   763,
            3,  6323,  2484,  6544,  5085,   469,  9106,    18,   680,     3,
          842,  1518, 25737,  3653,   303,  3300,   306,  3063,   292,     3,
           18,   381,   330,  2872,   343,  4722,     3,    16, 16848,   267,
         5216,   317,  1009,   842,  1518,    16,     3,   338,   330,  2757,
          435,  3653, 27081, 10965,    12,    39,    13,     3,  1865,    17,
         5580,  1056,   992,   363,     3,   360,    94,  1182,   589,  1729,
            3,     3,   351, 12863,   300,     3,  5240,     3,     3, 10799,
          480,  2261,     3,   421, 14591,     3,    18,     2,     0,     0])
In the final output, we can see part of an encoded input_ids tensor. The very first token ID is 1 — the [CLS] token. Dotted around the tensor we have several 3 token IDs — these are our newly added [MASK] tokens.
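As another quick check of our own, we can verify that roughly 15% of the non-special tokens really were masked:

# positions where input_ids differ from labels are the ones we masked
masked = (input_ids != labels).sum().item()
# 'real' tokens are everything except the excluded special tokens (0, 1, 2)
real = ((labels != 0) & (labels != 1) & (labels != 2)).sum().item()
print(f'{masked / real:.1%}')  # should come out at roughly 15%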

Building the DataLoader

Next, we define our Dataset class — which we use to initialize our three encoded tensors as PyTorch torch.utils.data.Dataset objects.

In [7]:
encodings = {'input_ids': input_ids, 'attention_mask': mask, 'labels': labels}
In [8]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # store encodings internally
        self.encodings = encodings

    def __len__(self):
        # return the number of samples
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        # return dictionary of input_ids, attention_mask, and labels for index i
        return {key: tensor[i] for key, tensor in self.encodings.items()}

Next we initialize our Dataset.

In [9]:
dataset = Dataset(encodings)

And initialize the dataloader, which will load the data into the model during training.

In [10]:
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
The DataLoader will feed shuffled batches of 16 samples, each of 512 tokens, into the model during training.
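To confirm everything lines up, we can pull a single batch from the loader and check the tensor shapes (again, a quick check of our own):

batch = next(iter(loader))
print(batch['input_ids'].shape)       # torch.Size([16, 512])
print(batch['attention_mask'].shape)  # torch.Size([16, 512])
print(batch['labels'].shape)          # torch.Size([16, 512])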

Training the Model

We need two things for training, our DataLoader and a model. The DataLoader we have — but no model.

Initializing the Model

For training, we need a raw (not pre-trained) RobertaForMaskedLM model. To create that, we first need to create a RoBERTa config object to describe the parameters we’d like to initialize FiliBERTo with.

In [11]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=30_522,  # we align this to the tokenizer vocab_size
    max_position_embeddings=514,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1
)
Then, we import and initialize our RoBERTa model with a language modeling (LM) head.
In [12]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config)
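As a quick check (our own addition), we can print the number of parameters in the freshly initialized model; with only 6 hidden layers it comes out well below the roughly 125M parameters of the full RoBERTa-base:

print(f'{model.num_parameters():,} parameters')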
Training Preparation
Before moving onto our training loop we need to set up a few things. First, we set up GPU/CPU usage. Then we activate the training mode of our model — and finally, initialize our optimizer.

Setup GPU/CPU usage.

In [13]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# and move our model over to the selected device
model.to(device)

Activate the training mode of our model, and initialize our optimizer (AdamW — Adam with weight decay, which reduces the chance of overfitting).

In [15]:
from transformers import AdamW

# activate training mode
model.train()
# initialize optimizer
optim = AdamW(model.parameters(), lr=1e-4)
Training

Finally — training time! We train just as we usually would when training via PyTorch.

In [20]:
from tqdm.auto import tqdm  # for the progress bar

epochs = 2

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # initialize calculated gradients (from prev step)
        optim.zero_grad()
        # pull all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # process
        outputs = model(input_ids, attention_mask=attention_mask,
                        labels=labels)
        # extract loss
        loss = outputs.loss
        # calculate loss for every parameter that needs grad update
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
Epoch 0: 100%|██████████| 12500/12500 [1:29:47<00:00,  2.32it/s, loss=0.358]
Epoch 1: 100%|██████████| 12500/12500 [1:22:20<00:00,  2.53it/s, loss=0.31]
In [21]:
model.save_pretrained('./filiberto')  # and don't forget to save filiBERTo!
If we head on over to Tensorboard we’ll find our loss over time — it looks promising.
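The logging setup is not shown above; if you would like to reproduce that loss curve yourself, a minimal sketch using PyTorch's built-in SummaryWriter (the log directory name here is just an example) looks like this:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/filiberto')  # example log directory
step = 0

# then, inside the inner training loop, right after the loss is computed:
#     writer.add_scalar('Loss/train', loss.item(), step)
#     step += 1

# once training is finished
writer.close()

Running tensorboard --logdir runs then displays the curve in the browser.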

The Real Test

Now it’s time for the real test. We set up an MLM pipeline — and ask Laura to assess the results. You can watch the video review at 22:44 here.

We first initialize a pipeline object, using the 'fill-mask' argument. Then we begin testing our model like so:

In [1]:
from transformers import pipeline
In [2]:
fill = pipeline('fill-mask', model='filiberto', tokenizer='filiberto')
Some weights of RobertaModel were not initialized from the model checkpoint at filiberto and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In [3]:
fill(f'ciao {fill.tokenizer.mask_token} va?')
Out[3]:
[{'sequence': '<s>ciao come va?</s>',
  'score': 0.33601945638656616,
  'token': 482,
  'token_str': 'Ġcome'},
 {'sequence': '<s>ciao, va?</s>',
  'score': 0.13736604154109955,
  'token': 16,
  'token_str': ','},
 {'sequence': '<s>ciao mi va?</s>',
  'score': 0.05658061057329178,
  'token': 474,
  'token_str': 'Ġmi'},
 {'sequence': '<s>ciao chi va?</s>',
  'score': 0.047595467418432236,
  'token': 586,
  'token_str': 'Ġchi'},
 {'sequence': '<s>ciao ci va?</s>',
  'score': 0.03684385493397713,
  'token': 435,
  'token_str': 'Ġci'}]
“ciao come va?” is the right answer! That’s as advanced as my Italian gets — so, let’s hand it over to Laura.

We start with “buongiorno, come va?” — or “good day, how are you?”:

In [3]:
fill(f'buongiorno, {fill.tokenizer.mask_token} va?')
Out[3]:
[{'sequence': '<s>buongiorno, chi va?</s>',
  'score': 0.299,
  'token': 586,
  'token_str': 'Ġchi'},
 {'sequence': '<s>buongiorno, come va?</s>',
  'score': 0.245,
  'token': 482,
  'token_str': 'Ġcome'},
 {'sequence': '<s>buongiorno, cosa va?</s>',
  'score': 0.116,
  'token': 1021,
  'token_str': 'Ġcosa'},
 {'sequence': '<s>buongiorno, non va?</s>',
  'score': 0.041,
  'token': 382,
  'token_str': 'Ġnon'},
 {'sequence': '<s>buongiorno, che va?</s>',
  'score': 0.037,
  'token': 313,
  'token_str': 'Ġche'}]
The first answer, “buongiorno, chi va?”, means “good day, who is there?” — which is nonsensical in this context. But our second answer is correct!

Next up, a slightly harder phrase, “ciao, dove ci incontriamo oggi pomeriggio?” — or “hi, where are we going to meet this afternoon?”:

In [3]:
fill(f'ciao, dove ci {fill.tokenizer.mask_token} oggi pomeriggio? ')
Out[3]:
[{'sequence': '<s>ciao, dove ci vediamo oggi pomeriggio? </s>',
  'score': 0.400,
  'token': 7105,
  'token_str': 'Ġvediamo'},
 {'sequence': '<s>ciao, dove ci incontriamo oggi pomeriggio? </s>',
  'score': 0.118,
  'token': 27211,
  'token_str': 'Ġincontriamo'},
 {'sequence': '<s>ciao, dove ci siamo oggi pomeriggio? </s>',
  'score': 0.087,
  'token': 1550,
  'token_str': 'Ġsiamo'},
 {'sequence': '<s>ciao, dove ci troviamo oggi pomeriggio? </s>',
  'score': 0.048,
  'token': 5748,
  'token_str': 'Ġtroviamo'},
 {'sequence': '<s>ciao, dove ci ritroviamo oggi pomeriggio? </s>',
  'score': 0.046,
  'token': 22070,
  'token_str': 'Ġritroviamo'}]
And we return some more positive results:
✅ "hi, where do we see each other this afternoon?"
✅ "hi, where do we meet this afternoon?"
❌ "hi, where here we are this afternoon?"
✅ "hi, where are we meeting this afternoon?"
✅ "hi, where do we meet this afternoon?"

Finally, one more, harder sentence, “cosa sarebbe successo se avessimo scelto un altro giorno?” — or “what would have happened if we had chosen another day?”:

In [3]:
fill(f'cosa sarebbe successo se {fill.tokenizer.mask_token} scelto un altro giorno?')
Out[3]:
[{'sequence': '<s>cosa sarebbe successo se avesse scelto un altro giorno?</s>',
  'score': 0.251,
  'token': 6691,
  'token_str': 'Ġavesse'},
 {'sequence': '<s>cosa sarebbe successo se avessi scelto un altro giorno?</s>',
  'score': 0.241,
  'token': 12574,
  'token_str': 'Ġavessi'},
 {'sequence': '<s>cosa sarebbe successo se avessero scelto un altro giorno?</s>',
  'score': 0.217,
  'token': 14193,
  'token_str': 'Ġavessero'},
 {'sequence': '<s>cosa sarebbe successo se avete scelto un altro giorno?</s>',
  'score': 0.081,
  'token': 3609,
  'token_str': 'Ġavete'},
 {'sequence': '<s>cosa sarebbe successo se venisse scelto un altro giorno?</s>',
  'score': 0.042,
  'token': 17216,
  'token_str': 'Ġvenisse'}]
We get a few more good answers here too:
✅ "what would have happened if we had chosen another day?"
✅ "what would have happened if I had chosen another day?"
✅ "what would have happened if they had chosen another day?"
✅ "what would have happened if you had chosen another day?"
❌ "what would have happened if another day was chosen?"

Overall, it looks like our model passed Laura’s tests — and we now have a competent Italian language model called FiliBERTo!

That’s it for this walkthrough of training a BERT model from scratch!

We’ve covered a lot of ground, from getting and formatting our data — all the way through to using language modeling to train our raw BERT model.
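And, as promised at the start, everything we built can be loaded back with from_pretrained just like any other checkpoint (a short sketch, assuming the ./filiberto directory we saved to above):

from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

tokenizer = RobertaTokenizer.from_pretrained('./filiberto')
model = RobertaForMaskedLM.from_pretrained('./filiberto')

fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
print(fill(f'buongiorno, {tokenizer.mask_token} va?'))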
