Finetuning DziriBERT for Dialect Detection
DziriBERT is a BERT language model trained and tested on 1.1 million Algerian tweets. It was introduced in this paper and is available on the Hugging Face Model Hub at this link, which means it is fairly easy to use and finetune. In this blog post I'll show how to finetune DziriBERT for dialect detection. The dataset I will be using is one of the MSDA open datasets; it contains tweets labelled by country-level dialect. All five dialects in this dataset have their roots in Arabic, which means there are lots of similarities between them, especially the ones that are geographically close (e.g. the Moroccan and Algerian dialects).
We start by installing the transformers library and importing what we need for the implementation.
!pip install -q transformers
import pandas as pd
from sklearn.model_selection import train_test_split
from pathlib import Path
import torch
from torch.utils.data import Dataset, DataLoader
import transformers
from transformers import AutoModel, BertTokenizerFast, AutoModelForSequenceClassification, Trainer, TrainingArguments
from google.colab import drive
drive.mount('/content/drive')
path = Path('/content/drive/MyDrive/ml/projects/msda_dataset/dialect')
The dataset is available as a csv file and can be found at this link. I want to mention that you can apply the finetuning process in this post to the other MSDA text classification datasets; you just might need to take into account that they are imbalanced.
df = pd.read_csv(path/'dialect.csv')
df.tail()
This is the number of training examples for each class in our dataset.
df.dialect.value_counts().plot.bar(title='Tweet Distribution')
We encode the labels by mapping each dialect to an index.
dialects = df['dialect'].unique()
lbl2idx = {d: i for i, d in enumerate(dialects)}
df['dialect'] = df['dialect'].map(lbl2idx)
df['dialect'].unique()
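As a small optional addition (not needed for training), we can also keep the reverse mapping so that predicted indices can be turned back into dialect names later:
idx2lbl = {i: d for d, i in lbl2idx.items()}
idx2lbl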
We split the dataset randomly: 70% will be used for training, and 30% for validation and testing of the model.
train_texts, temp_texts, train_labels, temp_labels = train_test_split(df['Twits'], df['dialect'],
                                                                      random_state=42, test_size=0.3)
val_texts, test_texts, val_labels, test_labels = train_test_split(temp_texts, temp_labels,
                                                                  random_state=42, test_size=0.5)
len(train_texts), len(val_texts), len(test_texts)
We need to specify the name of the model on the Hugging Face Hub. We first download the pretrained DziriBERT tokenizer, which will take care of tokenizing and numericalizing our sequences the way the DziriBERT model expects them.
BERT_MODEL_NAME = 'alger-ia/dziribert'
tokenizer = BertTokenizerFast.from_pretrained(BERT_MODEL_NAME)
Tokenization is the process of splitting text into tokens, which can be words, characters or subwords. BERT uses an algorithm called WordPiece for tokenization. It's a subword tokenization algorithm that was pretrained to learn which groups of characters to keep together; the most frequent word pieces are the ones kept in the vocabulary of the model. We can see below that some words are split into pieces while others are kept as they are. Some tokens are preceded by "##", which means they are the continuation of a word and should be attached to the previous token when decoding.
tokens = tokenizer.tokenize(train_texts[0])
print(tokens)
This is the original sequence.
train_texts[0]
Before feeding our inputs to the model, we need to convert them to numbers. What we see below is the index of each token in the vocabulary of the pretrained DziriBERT WordPiece tokenizer. When finetuning the model, these indices will be used to retrieve the embedding vector of each token that was learned during pretraining.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
A vocabulary is a dictionary that maps the tokens learned by WordPiece to indices; tokens that are not in the vocabulary are usually replaced with a special token (e.g. [UNK]).
tokenizer.vocab[tokens[0]]
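As a quick illustration (the string below is made up, not from the dataset), a token that does not exist in the vocabulary is mapped to the id of the [UNK] token:
print(tokenizer.unk_token, tokenizer.unk_token_id)           # the unknown token and its index
print(tokenizer.convert_tokens_to_ids('zzzz_not_in_vocab'))  # an out-of-vocabulary token gets the [UNK] id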
encode_plus chains multiple steps to prepare the inputs of our model, including the ones we discussed before (tokenize and convert_tokens_to_ids), along with others like padding. We can see it has two outputs: input_ids, which is similar to the output of convert_tokens_to_ids, and another output called attention_mask, which tells the model where the padding tokens are so that the attention layers can ignore them when computing token representations. 1 indicates an original token and 0 a padding token. We can't see any 0 in the output because padding is not needed here; it is mainly useful when batching sequences together, which we will see below. We will also see where the 2 and 3 in the input_ids came from.
tokenizer.encode_plus(train_texts[0], return_token_type_ids=False)
We usually want our model to take a batch of sequences. To do that, the sequences must have the same length; one way to achieve this is to grab the longest sequence in a batch and pad all the other sequences to the same length.
batch_encode_plus encodes a batch of sequences. We can see that the second sentence is padded with zeros to the length of the longest sequence in the batch, which is the first sentence, and its attention_mask also contains zeros indicating which tokens are padding.
tokenizer.batch_encode_plus(train_texts[:2].to_list(), padding=True, return_token_type_ids=False)
Decoding reverses the operations that were applied to the text, but we see that there are two new tokens ([CLS] and [SEP]) in our sequence. These are called special tokens and their indices in the vocabulary are 2 and 3. They are added by the tokenizer, and to explain why, I will talk a bit about the training objectives of BERT.
BERT language models are usually trained with two training objectives, Masked Language Modelling (MLM) and Next Sentence Prediction (NSP).
Masked Language Modelling (MLM) is used in most if not all transformer-based models. First, we mask a percentage of the input tokens: 15% in the original BERT, but 25% in the case of DziriBERT as stated in their paper, the reason being that DziriBERT is trained on tweets, which are short. The model is then asked to predict the masked tokens based on their context (the non-masked tokens); this way the model learns language representations that can be useful when finetuning on downstream tasks.
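As a quick illustration of MLM (this cell is not in the original post, and it assumes the DziriBERT checkpoint on the Hub still includes the masked-LM head from pretraining), we can mask one word of a training tweet and ask the pretrained model to fill it in:
from transformers import pipeline
# assumption: the alger-ia/dziribert checkpoint ships the masked-LM head from pretraining
fill_mask = pipeline('fill-mask', model=BERT_MODEL_NAME)
words = train_texts[0].split()
words[0] = fill_mask.tokenizer.mask_token   # replace the first word with the [MASK] token
fill_mask(' '.join(words))[:3]              # top candidates predicted for the masked position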
Next Sentence Prediction is a classification task where BERT takes a pair of sentences and tries to predict whether the second sentence follows the first or not. This was shown to increase the performance of BERT, especially on tasks that involve pairs of sentences (e.g. Question Answering (QA), Natural Language Inference (NLI), ...), as reported in the BERT paper. The pair of sentences is separated by the [SEP] token, and you can see below what the input of BERT looks like when doing NSP. The [CLS] token is used at the beginning of each input: BERT outputs a vector representation for each token in the sequence, but it uses only the representation of the [CLS] token to do Next Sentence Prediction. This way, the [CLS] token is thought to encode information about the whole input sequence, and that is why it is usually used in text classification tasks like ours.
out = tokenizer.encode_plus(train_texts[0], return_token_type_ids=False)
tokenizer.decode(out['input_ids'])
This is what the input of BERT would normally look like during pretraining with an NSP objective.
tok_out = tokenizer(train_texts[0], train_texts[2])
tokenizer.decode(tok_out['input_ids'])
We take a look at the length of sequences in our training set.
seq_len = [len(tokenizer.encode(i)) for i in train_texts]
pd.Series(seq_len).hist(bins = 30)
The maximum sequence length that BERT can take is 512, but that would be too much for our case, since most sequences contain fewer than 40 tokens. We specify this for our tokenizer with max_length and set truncation to True, which means that any sequence longer than 40 tokens will be truncated to max_seq_len. We will lose some information, but it will reduce the training time, and that matters a lot when training on free Colab 😄.
max_seq_len = 40
train_encodings = tokenizer(train_texts.to_list(), truncation=True, padding=True, max_length=max_seq_len)
val_encodings = tokenizer(val_texts.to_list(), truncation=True, padding=True, max_length=max_seq_len)
test_encodings = tokenizer(test_texts.to_list(), truncation=True, padding=True, max_length=max_seq_len)
We create a PyTorch dataset class, which has two essential methods: __len__, which should return the number of samples in the dataset, and __getitem__, which returns the item (encoding and label) at index idx.
class TweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels.to_list()

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
train_dataset = TweetDataset(train_encodings, train_labels)
val_dataset = TweetDataset(val_encodings, val_labels)
test_dataset = TweetDataset(test_encodings, test_labels)
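As a quick sanity check (this cell is not in the original post), we can grab one item from the training dataset to confirm the keys and shapes the Trainer will receive:
sample = train_dataset[0]
len(train_dataset), {k: v.shape for k, v in sample.items()}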
To evaluate our model during training, we need to pass a function that takes the model's predictions and computes the metrics we care about. In our case we will compute the accuracy, F1 score, precision and recall, both because our dataset is imbalanced and to compare with the baselines in the MSDA paper.
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='micro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
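To make sure the function behaves as expected, here is a quick check with a made-up prediction object (the values below are dummy and only for illustration):
import numpy as np
from types import SimpleNamespace

dummy_pred = SimpleNamespace(
    label_ids=np.array([0, 1, 2, 1]),                  # fake ground-truth labels
    predictions=np.array([[0.9, 0.1, 0.0, 0.0, 0.0],   # fake logits for 4 examples and 5 classes
                          [0.1, 0.7, 0.1, 0.05, 0.05],
                          [0.2, 0.2, 0.4, 0.1, 0.1],
                          [0.3, 0.3, 0.2, 0.1, 0.1]])
)
compute_metrics(dummy_pred)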
Now it is time to load the pretrained DziriBERT model from the Hugging Face Hub. What AutoModelForSequenceClassification does is remove the pretraining head of the model and replace it with a randomly initialized classification head. This classification head is just a linear layer that takes something called pooler_output as input and outputs 5 numbers (as specified in num_labels) for each sequence. These are called logits, and they are passed through a softmax function to get probabilities when computing the loss or the metrics. pooler_output is one of the outputs of a BERT model; it is the representation of the [CLS] token after it has been passed through a linear layer followed by a tanh activation function.
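To make this concrete (this cell is not in the original post; it just uses the AutoModel class we imported at the start), we can load the base model and look at the shape of pooler_output for a single encoded tweet:
base_model = AutoModel.from_pretrained(BERT_MODEL_NAME)
enc = tokenizer(train_texts[0], return_tensors='pt')
with torch.no_grad():
    base_out = base_model(**enc)
base_out.pooler_output.shape   # (batch_size, hidden_size): the vector the classification head takes as input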
model = AutoModelForSequenceClassification.from_pretrained(BERT_MODEL_NAME, num_labels=5)
Taking a look at logits.
out = tokenizer.encode_plus(train_texts[0], return_token_type_ids=False, return_tensors='pt')
model.eval()
with torch.no_grad():
    print(model(**out)['logits'])
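Since the classification head is randomly initialized, these logits are not meaningful yet, but we can already pass them through a softmax to see how they become probabilities (this small cell is an addition to the original post):
with torch.no_grad():
    logits = model(**out)['logits']
probs = torch.softmax(logits, dim=-1)   # probabilities over the 5 dialects
print(probs, probs.sum())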
The Trainer is what takes care of training our models. It takes arguments through the TrainingArguments class, along with our model, our datasets and the method to compute metrics, which it displays in a nice table during training. Calling train will start finetuning our classification model, and what we need now is to find something to do instead of looking at the progress bar 😁.
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=5,              # total number of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    evaluation_strategy='epoch',     # evaluate at the end of each epoch
    logging_strategy='epoch'         # log at the end of each epoch
)
trainer = Trainer(
    model=model,                     # the instantiated Transformers model
    args=training_args,              # training arguments
    train_dataset=train_dataset,     # training dataset
    eval_dataset=val_dataset,        # evaluation dataset
    compute_metrics=compute_metrics  # the method we defined before to compute our metrics
)
trainer.train()
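Once training is done, we can evaluate the finetuned model on the test set we kept aside earlier. This last cell is a small addition to the original post: trainer.predict returns the predictions along with the metrics computed by our compute_metrics function.
test_results = trainer.predict(test_dataset)
test_results.metrics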