How to train a spaCy text classifier for multiclass or multilabel problems
This article describes the steps of training spaCy text classifiers.
This afternoon, I stumbled across a spaCy tweet on my Twitter timeline and
realised that I haven’t used or trained a spaCy model for a long time. So
I thought it might be a good idea to train a simple spaCy text classification
model to refresh my memory. This is definitely not a new or hot topic in the AI
world: the field is moving fast, and everyone is talking about large
language models (LLMs), ChatGPT, and other big models.
This post aims to give a quick introduction to spaCy text classification models
for others and also to serve as a learning note for future use.
In this article, the term “multiclass” means that a given example belongs to
exactly one positive class out of K classes, whereas “multilabel” means that
a given example can carry zero or more labels at the same time. For
instance, we use multiclass to classify animal types like [dog, cat, fish]
(obviously an animal can’t be both a dog and a cat), while a multilabel task
can be used for movie categorisation, tagging a movie as [romance, funny]
simultaneously.
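To make the distinction concrete, here is a minimal sketch of how the two kinds of target could be encoded as label-to-flag mappings. The helper names `encode` and `is_valid_multiclass` are hypothetical, and the extra movie label "horror" is made up so that the multilabel example has one negative label.

```python
ANIMAL_LABELS = ["dog", "cat", "fish"]
MOVIE_LABELS = ["romance", "funny", "horror"]  # "horror" is a made-up extra label


def encode(labels, positives):
    """Return a {label: 0 or 1} mapping that covers every known label."""
    unknown = set(positives) - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return {label: int(label in positives) for label in labels}


def is_valid_multiclass(cats):
    """A multiclass target has exactly one positive label."""
    return sum(cats.values()) == 1


# Multiclass: exactly one of K classes is positive.
multiclass_example = encode(ANIMAL_LABELS, {"dog"})

# Multilabel: zero or more labels may be positive at once.
multilabel_example = encode(MOVIE_LABELS, {"romance", "funny"})
```

The multilabel mapping fails the `is_valid_multiclass` check by design, which is exactly the difference between the two problem types.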
We can break down this classification task into 5 parts:
Load dataset
Dataset Conversion
Generate Config
Train
Evaluate
1. Load dataset
In this example, we are training a Chinese legal text classification model. The
CAIL dataset is a Chinese
legal NLP dataset for judgment prediction and contains over 1 million criminal
cases. The dataset provides labels for four prediction tasks: the relevant
article of the criminal code, the charge (type of crime), the imprisonment term
(period), and the monetary penalty. Here the goal is to predict how severe the
committed crime was with respect to the imprisonment term. We approximate crime
severity by the length of the imprisonment term, split into 6 classes (0, <=12,
<=36, <=60, <=120, >120 months).
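The six classes can be read as bins over the imprisonment term in months. The sketch below shows one way to map a raw month count to its class name. It is only an illustration of the binning described above: the function name `severity_class` is hypothetical, and the dataset may already ship the labels in this form.

```python
from bisect import bisect_left

# Inclusive upper bounds (in months) of the first five classes.
UPPER_BOUNDS = [0, 12, 36, 60, 120]
CLASS_NAMES = ["0", "<=12", "<=36", "<=60", "<=120", ">120"]


def severity_class(months):
    """Map an imprisonment term in months to one of the six class names."""
    if months < 0:
        raise ValueError("months must be non-negative")
    # bisect_left returns the index of the first bound that is >= months,
    # or len(UPPER_BOUNDS) when months exceeds every bound.
    return CLASS_NAMES[bisect_left(UPPER_BOUNDS, months)]
```

For example, `severity_class(36)` falls in the "<=36" class, while `severity_class(37)` moves to "<=60".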
2. Dataset Conversion
Note that, for each sample, we have to specify 1 or 0 for every possible label
in the doc.cats attribute of each doc. For instance, suppose you are doing
sentiment analysis with 3 possible labels: [positive, neutral, negative]. A
positive sentiment article should set positive to 1 and the other two labels
to 0.
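As a minimal sketch, the mapping for a positive article and the conversion into spaCy's binary training format could look like the following. The helper names `make_cats` and `to_docbin` and the file name `train.spacy` are assumptions for illustration, and `to_docbin` needs spaCy v3 installed.

```python
LABELS = ["positive", "neutral", "negative"]


def make_cats(gold_label, labels=LABELS):
    """Build the doc.cats mapping: 1.0 for the gold label, 0.0 for the rest."""
    if gold_label not in labels:
        raise ValueError(f"unknown label: {gold_label!r}")
    return {label: 1.0 if label == gold_label else 0.0 for label in labels}


def to_docbin(records, lang="en"):
    """Convert (text, gold_label) pairs into a DocBin (requires spaCy v3)."""
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)  # tokenizer only, no trained components
    doc_bin = DocBin()
    for text, gold_label in records:
        doc = nlp.make_doc(text)
        doc.cats = make_cats(gold_label)  # every label gets an explicit value
        doc_bin.add(doc)
    return doc_bin


# The mapping for a positive sentiment article.
positive_cats = make_cats("positive")

# With spaCy installed, the training file could then be written with:
# to_docbin([("I loved it", "positive")]).to_disk("train.spacy")
```

For a multilabel task the same structure applies, except that more than one label (or none) may be set to 1.0.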
3. Generate Config
We use the spacy init config command to generate a default config file to train
a multiclass ensemble text classifier focusing on accuracy.
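The command can also be assembled and run from Python, as sketched below. The output name `config.cfg` and the helper names are assumptions. `textcat` is the pipeline name for the multiclass component, `textcat_multilabel` is its multilabel counterpart, and `--optimize accuracy` selects the accuracy-oriented preset.

```python
import subprocess
import sys


def init_config_args(output_path="config.cfg", lang="zh", multilabel=False):
    """Build the argument list for `python -m spacy init config`."""
    pipeline = "textcat_multilabel" if multilabel else "textcat"
    return [
        sys.executable, "-m", "spacy", "init", "config", output_path,
        "--lang", lang,
        "--pipeline", pipeline,
        "--optimize", "accuracy",
    ]


def generate_config(**kwargs):
    """Run the command (requires spaCy); raises if the CLI reports an error."""
    subprocess.run(init_config_args(**kwargs), check=True)


# Equivalent shell command for the multiclass case:
# python -m spacy init config config.cfg --lang zh --pipeline textcat --optimize accuracy
multiclass_args = init_config_args()
```

Depending on the spaCy version, the accuracy preset may reference a pretrained vectors package that has to be installed separately, so it is worth reading the generated file before training.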
4. Train
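Training is normally started with the spacy train command, and spaCy v3 also exposes the same entry point as a Python function. The sketch below assumes the config from the previous step is saved as `config.cfg` and that the converted data lives in `train.spacy` and `dev.spacy`. These file names, the `output` directory, and the helper names are assumptions.

```python
def path_overrides(train_path, dev_path):
    """Config overrides that point the training corpora at our files."""
    return {"paths.train": str(train_path), "paths.dev": str(dev_path)}


def run_training(config_path="config.cfg", output_dir="output",
                 train_path="train.spacy", dev_path="dev.spacy"):
    """Train the pipeline described by the config (requires spaCy v3)."""
    from spacy.cli.train import train

    train(
        config_path,
        output_path=output_dir,
        overrides=path_overrides(train_path, dev_path),
    )


# Equivalent shell command:
# python -m spacy train config.cfg --output ./output \
#     --paths.train ./train.spacy --paths.dev ./dev.spacy
```

During training spaCy writes the best and the most recent checkpoints to `model-best` and `model-last` inside the output directory.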
5. Evaluate
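A trained pipeline can be scored on held-out data with the spacy evaluate command or its Python counterpart. The sketch below assumes the best model was saved under `output/model-best` and the evaluation data is in `dev.spacy`. Both paths and the helper names are assumptions. The `cats_` prefix is the one spaCy's scorer uses for text classification metrics.

```python
from pathlib import Path


def summarise(scores):
    """Keep only the scalar text classification metrics from a score dict."""
    return {
        key: value
        for key, value in scores.items()
        if key.startswith("cats_") and isinstance(value, (int, float))
    }


def run_evaluation(model_dir="output/model-best", data_path="dev.spacy"):
    """Score a saved pipeline on a .spacy file (requires spaCy v3)."""
    from spacy.cli.evaluate import evaluate

    scores = evaluate(model_dir, Path(data_path))
    return summarise(scores)


# Equivalent shell command:
# python -m spacy evaluate output/model-best dev.spacy
```

The filter drops per-label breakdowns and non-numeric entries, leaving overall figures such as the macro and micro F-scores.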
We can also load the saved model and predict on an unseen sample.
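A minimal sketch of that prediction step follows, assuming the model was saved under `output/model-best` (an assumed path). The helpers `top_label` and `labels_above` are hypothetical names: the first suits a multiclass model, the second a multilabel one, and the 0.5 threshold is an arbitrary choice.

```python
def top_label(cats):
    """Multiclass: pick the label with the highest predicted score."""
    return max(cats, key=cats.get)


def labels_above(cats, threshold=0.5):
    """Multilabel: keep every label whose score reaches the threshold."""
    return sorted(label for label, score in cats.items() if score >= threshold)


def predict(text, model_dir="output/model-best"):
    """Load the saved pipeline and return its doc.cats scores (requires spaCy)."""
    import spacy

    nlp = spacy.load(model_dir)
    return nlp(text).cats


# With a trained model on disk:
# cats = predict("some unseen case description")
# print(top_label(cats))
```

Loading the pipeline once and reusing the `nlp` object is preferable when predicting on many texts, since `spacy.load` is the slow part.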
That’s it. Check out the following notebooks for more information: