TIL: Evaluator from Hugging Face

This article describes how to use the Evaluator from Hugging Face.

June 4, 2023 - 3 minute read

TIL that Hugging Face has a library called evaluate, which includes a handy Evaluator class for quickly evaluating transformer models on datasets. In this post, I will cover how to use the Evaluator class and also address a minor issue it currently has, along with a workaround.

The following code loads the banking77 dataset and one of the models that I fine-tuned and pushed to the HF model hub:

!pip install -U transformers datasets evaluate

import evaluate
from evaluate import evaluator
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# build a text-classification evaluator and load the banking77 dataset
task_evaluator = evaluator("text-classification")
data = load_dataset("banking77")

# load my fine-tuned banking-intent classifier from the HF model hub
tokenizer = AutoTokenizer.from_pretrained("lxyuan/banking-intent-distilbert-classifier")
model = AutoModelForSequenceClassification.from_pretrained("lxyuan/banking-intent-distilbert-classifier")

# development of this feature is currently on hold, so we will use sklearn
# for the macro-averaged metrics instead (see workaround below):
# https://discuss.huggingface.co/t/combining-metrics-for-multiclass-predictions-evaluations/21792/11
results = task_evaluator.compute(
    model_or_pipeline=model,
    data=data["test"],
    metric=evaluate.combine([
        "accuracy",
#        evaluate.load("precision", average="macro"),
#        evaluate.load("recall", average="macro"),
#        evaluate.load("f1", average="macro")
    ]),
    tokenizer=tokenizer,
    label_mapping=model.config.label2id,
    strategy="simple",
)

>>> {'accuracy': 0.9243506493506494,
 'total_time_in_seconds': 30.070161925999855,
 'samples_per_second': 102.42711720607363,
 'latency_in_seconds': 0.009763039586363589}
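As a side note, the compute call above uses strategy="simple", which returns point estimates. The Evaluator also documents a bootstrap strategy that attaches confidence intervals to each metric via resampling. The sketch below is untested here and assumes the documented strategy, confidence_level, and n_resamples arguments:

results_with_ci = task_evaluator.compute(
    model_or_pipeline=model,
    data=data["test"],
    metric="accuracy",
    tokenizer=tokenizer,
    label_mapping=model.config.label2id,
    # "bootstrap" resamples the test set to estimate confidence intervals;
    # this is much slower than strategy="simple"
    strategy="bootstrap",
    confidence_level=0.95,
    n_resamples=1000,
)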

As mentioned in the code snippet, development of this Evaluator feature is currently on hold, so it does not support precision/recall/F1 metrics for multiclass problems. If you uncomment those metrics and run the code, you will see an error message like:

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
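Note that the metrics themselves handle multiclass targets fine when average is passed at compute time; the limitation is that the Evaluator's combine path has no way to forward it. For example, this works when calling the metric directly (a minimal sketch with toy predictions):

f1_metric = evaluate.load("f1")
# average is accepted here, at compute time, rather than at load time
print(f1_metric.compute(predictions=[0, 2, 1, 1], references=[0, 1, 2, 1], average="macro"))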

The workaround is to simply switch to sklearn's classification_report function, as follows:

from transformers import pipeline
from sklearn.metrics import classification_report

# wrap the model in an inference pipeline (device=0 runs it on the first GPU)
text_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

x_test = list(data["test"]["text"])
y_test = list(data["test"]["label"])

# the pipeline returns dicts like {'label': ..., 'score': ...};
# map the predicted label names back to their integer ids
y_pred = text_classifier(x_test)
y_pred = [model.config.label2id[pred["label"]] for pred in y_pred]

label_names = list(model.config.id2label.values())
report = classification_report(y_test, y_pred, target_names=label_names, digits=4)
print("Classification Report:\n", report)

>>>
Classification Report:
                                                   precision    recall  f1-score   support

                                activate_my_card     1.0000    0.9750    0.9873        40
                                       age_limit     0.9756    1.0000    0.9877        40
                         apple_pay_or_google_pay     1.0000    1.0000    1.0000        40
                                     atm_support     0.9750    0.9750    0.9750        40
                                automatic_top_up     1.0000    0.9000    0.9474        40
                                ...
                                   verify_top_up     1.0000    1.0000    1.0000        40
                        virtual_card_not_working     1.0000    0.9250    0.9610        40
                              visa_or_mastercard     0.9737    0.9250    0.9487        40
                             why_verify_identity     0.9118    0.7750    0.8378        40
                   wrong_amount_of_cash_received     1.0000    0.8750    0.9333        40
         wrong_exchange_rate_for_cash_withdrawal     0.9730    0.9000    0.9351        40

                                        accuracy                         0.9244      3080
                                       macro avg     0.9282    0.9244    0.9243      3080
                                    weighted avg     0.9282    0.9244    0.9243      3080
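If you need these numbers programmatically instead of as a printed table, classification_report also accepts output_dict=True, which returns the same report as nested dicts:

# output_dict=True returns the report as nested dicts, handy for logging
report_dict = classification_report(y_test, y_pred, target_names=label_names, output_dict=True)
print(report_dict["macro avg"])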

That’s it!
