Building a language model for Moroccan Darija using fastai
This is a small effort to build a Darija language model. I use the Moroccan Darija Wikipedia to train an AWD_LSTM model with fastai. It's a small dataset, which means this language model won't be perfect for language generation, but it might be useful to finetune it on a task like text classification following the ULMFiT approach: first train a language model on Wikipedia text, like we do in this notebook, to gain some knowledge about the language of your choice; then finetune it on domain-specific data using the same objective as your pretrained language model, in order to bridge the gap between the language used in Wikipedia text and the language used in your dataset (e.g., formal language -> informal language); and finally, finetune the language model on the task of your choice. A sketch of these later ULMFiT steps is given at the end of this notebook.
This model can be improved by:
- Throwing more data at it of course
- Some text preprocessing
- Tuning the hyperparameters
- I also thought about pretraining on Arabic, which might be a good idea given the similarities between Arabic and Darija
Let's start by upgrading fastai and installing SentencePiece to use for subword tokenization:
!pip install fastai -q --upgrade
!pip install -q "sentencepiece!=0.1.90,!=0.1.91"
import sys
from gensim.corpora import WikiCorpus
from fastai.text.all import *
import torch
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive/')
path = Path('/content/drive/MyDrive/ml/projects/darija/')
dls_path = path/'dls'
model_path = path/'models'
spm_path = model_path/'spm'
dls_path.mkdir(exist_ok=True, parents=True)
model_path.mkdir(exist_ok=True, parents=True)
spm_path.mkdir(exist_ok=True, parents=True)
This is how we can download the Moroccan Darija Wikipedia dump; it's available at this link.
!wget https://dumps.wikimedia.org/arywiki/latest/arywiki-latest-pages-articles.xml.bz2 -O '/content/drive/MyDrive/ml/projects/darija/arywiki-latest-pages-articles.xml.bz2'
We make use of WikiCorpus from gensim to convert the XML file we downloaded to a text corpus.
def make_corpus(in_f, out_f):
    """Convert a Wikipedia XML dump file to a text corpus."""
    output = open(out_f, 'w')
    wiki = WikiCorpus(in_f)
    for i, text in enumerate(wiki.get_texts()):
        output.write(' '.join(text) + '\n')
        if i % 1000 == 0:
            print('Processed ' + str(i) + ' articles')
    output.close()
    print('Processing complete!')
make_corpus(f'{path}/arywiki-latest-pages-articles.xml.bz2', f'{path}/wiki_darija.txt')
path.ls()
Now we load our text data as a pandas dataframe and take a look at it using the most advanced EDA technique 😄. We can see that there are words from other languages; these will most likely disappear due to their low frequency, since we can tell fastai the minimum word frequency we tolerate (by default it's 3) through the fastai DataBlock API that we discuss below.
df = pd.read_csv(path/'wiki_darija.txt', header=None, names=['text'])
df.head()
Subword tokenization refers to constructing our vocabulary from the most frequently occurring groups of letters; for instance, the word "transformer" could be split into "trans" and "former". I find it better to use subword tokenization with a relatively small vocabulary size when the dataset is small, to avoid the p >> n problem (where the number of features exceeds the number of training examples). Also, if we decided to use whole words as our tokens, we would have a lot of words that appear only a few times throughout the corpus, and the model wouldn't be given a decent chance to learn about them.
I use a maximum vocabulary size of 1000, specified by the max_vocab_sz parameter, but you can use less or more; it's another hyperparameter you can tune based on the metric you care about.
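To get a feel for what subword tokenization produces, here is a minimal sketch using the sentencepiece library directly (the demo model prefix and the sample sentence are just illustrative; the fastai SubwordTokenizer we use below trains a SentencePiece model in the same spirit):
import sentencepiece as spm

# Train a small SentencePiece model on the corpus we built above
# ('demo' is just a hypothetical prefix for this illustration).
spm.SentencePieceTrainer.train(input=str(path/'wiki_darija.txt'),
                               model_prefix=str(spm_path/'demo'),
                               vocab_size=1000)
sp = spm.SentencePieceProcessor(model_file=str(spm_path/'demo.model'))
print(sp.encode('رسام صانع طباعة', out_type=str))
# Frequent groups of letters become tokens; ▁ marks the start of a word.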
The data block API is provided by fastai to customize the creation of our dataloaders. The blocks parameter specifies the type of our independent and dependent variables; when a TextBlock is passed, fastai takes care of the preprocessing for us, we just need to pass it our subword tokenizer since it uses word tokenization by default. We also tell fastai that we are building this for language modeling with is_lm, and that our text is in a dataframe.
And finally we create our dataloaders. It's dataloaders with an s because it includes both the training and validation dataloaders; the validation set is 10% of our data, as we specify in our RandomSplitter.
bs = 128
tok = SubwordTokenizer(cache_dir=spm_path, max_vocab_sz=1000)
dls_lm = DataBlock(blocks=TextBlock.from_df('text', is_lm=True, tok=tok),
                   splitter=RandomSplitter(0.1, seed=42),
                   get_x=ColReader('text')
                   ).dataloaders(df, bs=bs)
We save our dataloader since we can't afford to create it each time because of our huge dataset 😅.
torch.save(dls_lm, dls_path/'dls_lm.pkl')
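Later, we can load the saved dataloaders back instead of rebuilding them (this assumes the same fastai version is installed when loading):
dls_lm = torch.load(dls_path/'dls_lm.pkl')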
This is what our preprocessed text looks like: spaces in the original text are replaced by ▁, and xxbos is a special token added by fastai to mark the beginning of a text. fastai also adds other special tokens to make learning easier for the model; we can see them when we check our vocab below.
dls_lm.show_batch(max_n=6)
Special tokens in fastai start with the letters xx; they help our model handle the shift from the original text to our preprocessed text. For example, xxunk replaces tokens that don't exist in our vocab, which helps the model learn to deal with missing tokens.
print(dls_lm.vocab[:20])
Now it's time to create our language model learner: we pass it our dataloaders and use the version of AWD_LSTM provided by fastai.
Perplexity is usually used to evaluate language models: a model with a low perplexity is one that assigns a high probability to the correct output. In our case, the model learns by trying to predict the next token in a sequence, so the lower the perplexity, the better our model is at predicting the next token correctly. Perplexity is a good metric to look at when training your language model and tuning the different hyperparameters, but I think the best way to measure the quality of a language model is to actually apply it to a task (text classification, question answering, ...) and watch your accuracy (or any other metric) go up or down.
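Concretely, perplexity is just the exponential of the mean cross-entropy loss, which is how fastai's Perplexity() metric computes it. A tiny sketch with made-up tensors:
import torch
import torch.nn.functional as F

logits = torch.randn(8, 1000)            # fake model outputs: 8 positions, vocab size 1000
targets = torch.randint(0, 1000, (8,))   # fake "next token" targets
ce = F.cross_entropy(logits, targets)    # mean negative log-likelihood
print(ce.item(), torch.exp(ce).item())   # a random model sits near ln(1000) ≈ 6.9, i.e. perplexity ≈ 1000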
learn = language_model_learner(dls_lm, AWD_LSTM,
                               metrics=[accuracy, Perplexity()], pretrained=False)
AWD_LSTM is just LSTM layers with lots of regularization, and we can see that in the hyperparameters below from fastai's implementation: all the parameters that end with 'p' are the amount of dropout applied to some part of the network. Other regularization techniques are also used, like activation regularization, which is similar to weight decay but applied to the activations instead of the weights. Another interesting technique is weight tying. It is based on the intuition that our embedding layer is a mapping from Darija to a vector representation, and our output layer is a mapping from a vector representation back to Darija, so why not use the same weight matrix for both of them? This turns out to be a useful way to reduce the number of parameters of a language model, especially when we have a huge vocabulary size.
awd_lstm_lm_config
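Here is a minimal sketch of the weight tying idea (plain PyTorch for illustration, not fastai's actual code): the output layer simply reuses the embedding matrix.
import torch.nn as nn

vocab_sz, emb_sz = 1000, 400
emb = nn.Embedding(vocab_sz, emb_sz)                 # token id -> vector
out_layer = nn.Linear(emb_sz, vocab_sz, bias=False)  # vector -> scores over tokens
out_layer.weight = emb.weight                        # same parameter object shared by both layers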
Time to train our model using the one-cycle policy, which was introduced in this paper. It is a method that varies the learning rate in two phases: one where we go from a small learning rate up to a maximum learning rate that we specify (in this case max_lr=1e-2), then one where we decrease it down to a value lower than the one we started with.
Starting from a low learning rate acts as a warm-up step and lets the model get used to the data before jumping to a high learning rate. When we reach the maximum learning rate, it acts as a regularizer that helps the model escape saddle points, avoid steep areas of the loss, and prefer a flatter minimum, which we can then navigate while decreasing our learning rate in the second phase.
learn.fit_one_cycle(n_epoch=50, lr_max=1e-2)
learn.save(model_path/'darija_lm')
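If you're curious about the schedule that was actually used, fastai's Recorder keeps track of the per-batch learning rate and momentum, and we can plot them (just a sanity check, not a required step):
learn.recorder.plot_sched()  # shows the warm-up then the annealing of the learning rate (and momentum)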
def decoder(sentence):
    s = ''.join(sentence)
    return s.split('▁')
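A quick check of what this decoder does on a made-up prediction: it glues the subword pieces back together and turns each ▁ marker into a word boundary (the leading empty string comes from the first ▁):
decoder(['▁trans', 'former', '▁model'])
# -> ['', 'transformer', 'model']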
We use the predict method to generate two sequences of 100 subwords each. It takes a piece of text as input and then does the usual work of predicting the next token. We don't just take the token with the highest probability; instead we randomly sample from a probability distribution (the output of the softmax). This is done because we want our model to be a little creative and not just keep repeating itself; a high temperature will smooth this probability distribution and give tokens with low probability a higher chance of being sampled.
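Here is a rough sketch of what temperature sampling means (illustrative code, not necessarily fastai's exact internals): the scores are divided by the temperature before the softmax, and the next token is drawn from the resulting distribution instead of taking the argmax.
import torch

def sample_next_token(logits, temperature=0.75):
    probs = torch.softmax(logits / temperature, dim=-1)  # higher temperature -> flatter distribution
    return torch.multinomial(probs, num_samples=1).item()

fake_logits = torch.tensor([2.0, 1.0, 0.1])  # made-up scores over a 3-token vocab
print([sample_next_token(fake_logits) for _ in range(10)])  # mostly token 0, sometimes 1 or 2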
text = 'رسام صانع طباعة'
n_toks = 100
n_sentences = 2
preds = [learn.predict(text, n_toks, temperature=0.75, decoder=decoder)
         for _ in range(n_sentences)]
This is the output of our model: we have ourselves a drunk GPT-3 😄. Still, we can see that it's able to generate words correctly even though we are using subwords; this is more apparent when we look at the output of the model without the decoder below.
preds
text = 'رسام صانع طباعة'
n_toks = 100
n_sentences = 2
preds = [learn.predict(text, n_toks, temperature=0.75)
         for _ in range(n_sentences)]
This is the output without a decoder: tokens are separated by spaces, while the actual space is replaced by ▁. You might notice that the output differs from the previous one because we are randomly sampling.
preds
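To follow up on the ULMFiT approach described in the introduction, here is a hedged sketch of how this pretrained model could be reused downstream. The dataframes df_domain and df_labeled (with a text and a label column) are hypothetical placeholders you would supply yourself; the rest reuses objects defined in this notebook.
# 1) Finetune the language model on domain-specific text.
dls_domain = DataBlock(blocks=TextBlock.from_df('text', is_lm=True, tok=tok),
                       splitter=RandomSplitter(0.1, seed=42),
                       get_x=ColReader('text')
                       ).dataloaders(df_domain, bs=128)
lm_learn = language_model_learner(dls_domain, AWD_LSTM, pretrained=False)
lm_learn.load(model_path/'darija_lm')   # start from the wiki-pretrained weights saved above
lm_learn.fit_one_cycle(5, 1e-3)
lm_learn.save_encoder('darija_enc')     # keep everything except the LM head

# 2) Train a classifier on the labeled task data, reusing the finetuned encoder.
dls_clas = DataBlock(blocks=(TextBlock.from_df('text', tok=tok, vocab=dls_domain.vocab),
                             CategoryBlock),
                     splitter=RandomSplitter(0.1, seed=42),
                     get_x=ColReader('text'), get_y=ColReader('label')
                     ).dataloaders(df_labeled, bs=64)
clas_learn = text_classifier_learner(dls_clas, AWD_LSTM, pretrained=False, metrics=accuracy)
clas_learn.load_encoder('darija_enc')
clas_learn.fit_one_cycle(3, 1e-2)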