I have some free time this weekend, and I have always wanted to fine-tune a language
model to get some hands-on experience. Huge thanks to Hugging Face for making
everything so accessible and easy.
In this blog post, I share what I learned from this fun project and break the
idea of fine-tuning a language model into three parts:
Data
Model
Trainer
Data
In this example, we use the
wikitext dataset. The WikiText
language modeling dataset is a collection of over 100 million tokens extracted
from the set of verified Good and Featured articles on Wikipedia.
Dataset Preview:
As we can see, some of the texts are full paragraphs of a Wikipedia article,
while others are just titles or empty lines. The first step is to tokenize them
and then split them into small chunks of a certain block_size using the following
code:
We can refer to the dummy example below to understand how the group_texts()
function works. The idea is to concatenate the input_ids of all examples (i.e., make one long list of ints)
and then split it into small chunks (i.e., a list of lists of ints).
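To make this concrete, here is a pure-Python dummy run of the same logic; the block_size of 8 and the 20 dummy tokens are hypothetical values I picked so that a 4-token remainder gets dropped:

```python
block_size = 8  # hypothetical chunk length for this dummy example

def group_texts(examples):
    # Concatenate all input_ids into one long list of ints,
    # then split it into block_size chunks, dropping the remainder.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Three dummy "examples" with 20 tokens in total: 20 // 8 = 2 full chunks,
# so the 4-token remainder [16, 17, 18, 19] is excluded.
dummy = {"input_ids": [list(range(7)), list(range(7, 13)), list(range(13, 20))]}
out = group_texts(dummy)
print(out["input_ids"])
# → [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]
print(out["labels"] == out["input_ids"])
# → True
```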
The 4-token remainder is excluded at the end. You may also notice that we duplicate input_ids for the labels. This is because the HF model class automatically applies the right shift internally when computing the loss, so we don't need to do anything manually. One can refer to the official PyTorch code implementation to learn more about that.
Model
For the model part, we just need to load the pretrained DistilGPT-2 model and define the training arguments.
Since we want to push the final trained model to the HF model hub, we set the argument push_to_hub=True and set the output_dir argument (i.e., the first positional argument of TrainingArguments()). If all goes well, we will later be able to download and try our model under our own namespace.
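A sketch of this step; the specific hyperparameters (learning rate, weight decay, epoch count) are placeholders of mine, not values from the original run:

```python
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# The first positional argument is output_dir; with push_to_hub=True it
# also becomes the name of the repo on the Hugging Face Hub.
training_args = TrainingArguments(
    "distilgpt2-finetuned-wikitext2",
    learning_rate=2e-5,   # placeholder hyperparameters, adjust to taste
    weight_decay=0.01,
    num_train_epochs=3,
    push_to_hub=True,
)
```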
Trainer
We can then start training as usual. Afterwards, we just need to run two more commands to push the model and the tokenizer to our repo: distilgpt2-finetuned-wikitext2.
Inference example
One of the easiest ways to try the model is to use pipeline(), as follows:
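A sketch of the inference step; the prompt is made up, and I use the base distilgpt2 checkpoint here just to demonstrate the API — you would substitute the model id with your own Hub repo once the push has finished:

```python
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuation reproducible

# Swap "distilgpt2" for "<your-namespace>/distilgpt2-finetuned-wikitext2"
# once your fine-tuned model is on the Hub.
generator = pipeline("text-generation", model="distilgpt2")
outputs = generator(
    "Wikipedia is a free online", max_new_tokens=20, num_return_sequences=1
)
print(outputs[0]["generated_text"])
```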