According to this article:

Zero-shot text classification is a task in natural language processing where a model is trained on a set of labeled examples but is then able to classify new examples from previously unseen classes.

Simply put, zero-shot text classification means using pre-existing models on classification tasks that they were not trained on. Large Language Models backed by attention have many great applications, such as summarization, chatbots and code completion. This also gives zero-shot text classification huge potential, since most LLMs are pretrained on tremendous amounts of data that already cover most common use cases. LLMs with strong reasoning capability, such as DeepSeek, can even perform well on unseen data. In this article, I want to discuss some practical ways to use pretrained LLMs for zero-shot classification with 🤗 Transformers.

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    TextClassificationPipeline,
    TextGenerationPipeline
)
import torch

import pandas as pd

There has been a lot of research on zero-shot classification, for example [1] and [2]. Given a set of labels \(\{y_i\}\) and a sample \(x\), we will implement the basic method

$$\operatorname*{arg\,max}_i P(y_i \mid x).$$
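
For an autoregressive LLM, one convenient way to realize this, and the one this article builds toward, is to prompt the model with \(x\) and read the label scores off the next-token distribution. Since softmax is monotone, taking the argmax over probabilities is the same as taking it over logits:

$$\operatorname*{arg\,max}_i P(y_i \mid x) \;\approx\; \operatorname*{arg\,max}_i\, z_{t(y_i)},$$

where \(z\) is the next-token logit vector produced by the model for the prompt built from \(x\), and \(t(y_i)\) denotes the first token of label \(y_i\).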


As an example, we will use a sentiment dataset, financial-sentiment. In this dataset, we need to classify some financial text into three labels: negative, neutral and positive.

# https://huggingface.co/datasets/vumichien/financial-sentiment
# use the valid split - 
data_path = "hf://datasets/vumichien/financial-sentiment/data/valid-00000-of-00001.parquet"
df_data = pd.read_parquet(data_path)
df_data = df_data.rename(columns={"label_experts": "label"})
print(f"Sample size: {df_data.shape[0]}")
print(f"Labels: {df_data['label'].unique().tolist()}")
for label in df_data["label"].unique():
    label_size = df_data[df_data["label"] == label].shape[0]
    print(f"Label sample size for {label}: {label_size}")
Sample size: 453
Labels: ['negative', 'neutral', 'positive']
Label sample size for negative: 61
Label sample size for neutral: 265
Label sample size for positive: 127

There are roughly two types of pretrained models: base models and instruct models. Base models are pretrained only on a large corpus of text. Instruct models are further trained on instructions, which gives them the capability to follow instructions. I am going to use Qwen2.5-family models, as the family offers both base and instruct models. For simplicity, I will use Qwen/Qwen2.5-0.5B-Instruct throughout the article. In theory, one can use any completion model, base or instruct. Personally, I prefer instruct models as they follow instructions better, as you can already tell from "Instruct" in their names.

Instruct Model as Text Classifier

GPT-style models are generally pretrained on predicting the next token. The idea here is to use the LLM as a text classifier by creating a prompt such that the next token from the LLM is limited to the labels of the classification task. For our example with financial sentiment analysis, we should create a prompt so that the LLM outputs negative, neutral, or positive. Here is one possible prompt:

What is the sentiment of the following text related to finance?
negative, neutral or positive: {text}
Give your answer in one word.
model_path = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
lm_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto"
)
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.

Let's get our prompt template.

user_prompt_template = """What is the sentiment of the following text related to finance?
negative, neutral or positive: {text}
Give your answer in one word."""
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": user_prompt_template}
]
prompt_template = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(prompt_template)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the sentiment of the following text related to finance?
negative, neutral or positive: {text}
Give your answer in one word.<|im_end|>
<|im_start|>assistant

# apply the prompt template to all samples
df_data["prompt"] = df_data["text"].apply(lambda text: prompt_template.format(text=text))

We have our prompt template ready. There is a placeholder {text} where we put in the text to be classified. Let's run it on one text sample.

text_sample = df_data.iloc[0]
print(text_sample["text"])
print(f"Sentiment (Truth): {text_sample['label']}")
In Q2 of 2009 , profit before taxes amounted to EUR 13.6 mn , down from EUR 26.8 mn in Q2 of 2008 .
Sentiment (Truth): negative
# the prompt we will send to the model
prompt = text_sample["prompt"]
print(prompt)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the sentiment of the following text related to finance?
negative, neutral or positive: In Q2 of 2009 , profit before taxes amounted to EUR 13.6 mn , down from EUR 26.8 mn in Q2 of 2008 .
Give your answer in one word.<|im_end|>
<|im_start|>assistant

# get the prediction from the model
model_inputs = tokenizer([prompt], return_tensors="pt").to(lm_model.device) # step 1

with torch.no_grad():
    lm_model_output = lm_model(**model_inputs) # step 2

predicted_token_id = lm_model_output.logits[0, -1].argmax() # step 3
predicted_label = tokenizer.decode(predicted_token_id) # step 4

print(text_sample["text"])
print(f"Sentiment (Truth): {text_sample['label']}")
print(f"Sentiment (Predicted): {predicted_label}")
In Q2 of 2009 , profit before taxes amounted to EUR 13.6 mn , down from EUR 26.8 mn in Q2 of 2008 .
Sentiment (Truth): negative
Sentiment (Predicted): negative

Working great! We get the correct label for the sample using the model. But how does it work? Let's break it down step by step.

Step 1. Tokenize the text input into numbers that can be accepted by the model.

Step 2. Run the model to get the next-token predictions. Note that we don't just get one predicted token, but one prediction per position. For example, if the text input is "This is an input sentence", then the model is roughly making the following predictions:
- This -> is
- This is -> an
- This is an -> input
- This is an input -> sentence
- This is an input sentence -> ?

So only the last prediction is interesting to us, as this is the token that continues the whole input sequence.

Step 3. Get the predicted token id for the last prediction.

Step 4. Convert the token id back to a word.
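
To make steps 3 and 4 more concrete, here is a small optional check that reuses lm_model_output from above and prints the five highest-scoring candidate tokens for the last position; the predicted label ("negative" for this sample) sits at the top.

# optional: inspect the top-5 candidate tokens predicted at the last position
last_logits = lm_model_output.logits[0, -1]
top5 = torch.topk(last_logits, k=5)
for token_id, logit in zip(top5.indices.tolist(), top5.values.tolist()):
    print(f"{tokenizer.decode(token_id)!r}: logit = {logit:.2f}")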

Now, we will show how to run the model on all samples more efficiently.

def classify_with_lm(tokenizer, model, texts, batch_size=32):
    """Get the prediction for a list of texts.

    For faster inference in a memory-efficient way, we will do batch scoring.
    """
    predictions = []
    texts = list(texts)
    texts_count = len(texts)
    # batch runs
    for batch_begin in range(0, texts_count, batch_size):
        batch_end = batch_begin + batch_size
        batch_texts = texts[batch_begin: batch_end]
        # tokenize batch of texts
        batch_inputs = tokenizer(
            batch_texts,
            padding=True,
            return_tensors="pt",
        ).to(model.device)
        # scoring
        with torch.no_grad():
            batch_outputs = model(**batch_inputs)
        # get the last predicted tokens
        logits = batch_outputs.logits
        input_ids = batch_inputs.input_ids
        # since the inputs have different lengths, they are padded;
        # we want the last non-pad location
        non_pad_mask = (input_ids != tokenizer.pad_token_id).to(logits.device, torch.int32)
        token_indices = torch.arange(input_ids.shape[-1], device=logits.device)
        last_non_pad_token = (token_indices * non_pad_mask).argmax(-1)
        # get logits of last non pad token
        pooled_logits = logits[torch.arange(len(batch_texts), device=logits.device), last_non_pad_token]
        # get the predicted token ids
        predicted_token_ids = pooled_logits.argmax(-1)
        # convert token ids back to tokens
        predicted_tokens = tokenizer.batch_decode(predicted_token_ids.reshape((-1, 1)))
        predictions.extend(predicted_tokens)
    return predictions

The above function classify_with_lm does batch inference for efficiency. In order to do batch inference, the inputs must be padded to the same length. That is why we need non_pad_mask: to locate the last non-pad token of each sequence, as illustrated below.
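
Here is a toy illustration of that indexing trick, with made-up token ids and a hypothetical pad token id of 0 (the real pad id comes from the tokenizer):

# toy example: find the last non-pad position per row (pad id assumed to be 0 here)
toy_input_ids = torch.tensor([
    [5, 7, 9, 0, 0],   # length 3, right padded
    [4, 6, 8, 3, 2],   # length 5, no padding
])
toy_mask = (toy_input_ids != 0).int()
toy_positions = torch.arange(toy_input_ids.shape[-1])
print((toy_positions * toy_mask).argmax(-1))  # tensor([2, 4])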

lm_results = classify_with_lm(tokenizer, lm_model, df_data["prompt"])
print(f"The set of predicted labels: {set(lm_results)}")
count = 0
for label, prediction in zip(df_data["label"], lm_results):
    count += label == prediction
print(f"Correct count = {count}/{len(lm_results)} = {count/len(lm_results):.2%}")
The set of predicted labels: {'Negative', 'Neutral', 'positive', 'neutral', 'negative', 'Positive'}
Correct count = 323/453 = 71.30%

We get it working, and the accuracy is not bad. But we immediately notice a problem: the predicted labels "Negative" and "Positive" are not exactly the wanted labels. In our example, they are easy to handle: we just need to lowercase them. However, there is still a risk that the predicted token falls outside the label set entirely.

# accuracy when we lowercase the predicted labels
count = 0
for label, prediction in zip(df_data["label"], lm_results):
    count += label == prediction.lower()
print(f"Correct count = {count}/{len(lm_results)} = {count/len(lm_results):.2%}")
Correct count = 350/453 = 77.26%

Before I end the discussion of using a CausalLM as a classifier, I want to point out that we can use TextGenerationPipeline to replace classify_with_lm. But keep in mind that TextGenerationPipeline requires padding the sequences on the left side for batch inference.

# load a tokenizer with left padding
lm_tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    padding_side="left" # NOTE: must set to left for batch inference in TextGenerationPipeline
)
# set up the pipeline
lm_pipe = TextGenerationPipeline(
    tokenizer=lm_tokenizer,
    model=lm_model,
)
# run on a sample
lm_pipe_result = lm_pipe(
    text_sample["prompt"],
    return_full_text=False, # we just need the output token
    max_new_tokens=1, # we only want one token in the output
    do_sample=False, # we want the token with maximal probability
    # unset the following arguments to avoid warning
    repetition_penalty=None,
    temperature=None,
    top_k=None,
    top_p=None
)[0]

print(f"Predicted label from lm_pipe: {lm_pipe_result['generated_text']}")
Device set to use mps
Predicted label from lm_pipe: negative
# run the pipeline on all samples
lm_pipe_results = lm_pipe(
    df_data["prompt"].tolist(),
    batch_size=32,
    return_full_text=False, # we just need the output token
    max_new_tokens=1, # we only want one token in the output
    do_sample=False, # we want the token with maximal probability
    # unset the following arguments to avoid warning and discrepancy
    repetition_penalty=None,
    temperature=None,
    top_k=None,
    top_p=None
)

lm_pipe_results_clean = [
    result[0]["generated_text"]
    for result in lm_pipe_results
]

# accuracy with TextGenerationPipeline
count = 0
for label, prediction in zip(df_data["label"], lm_pipe_results_clean):
    count += label == prediction.lower()
print(f"Correct count = {count}/{len(lm_results)} = {count/len(lm_results):.2%}")
Correct count = 352/453 = 77.70%

There is a slight discrepancy between the results from classify_with_lm and TextGenerationPipeline. The reason is that we do right padding in classify_with_lm but left padding in TextGenerationPipeline. Please note that we set do_sample to False since we want the token with maximal probability instead of sampling. We unset temperature, top_k and top_p to be compatible with do_sample=False. We also unset repetition_penalty to minimize the discrepancy, as we didn't apply a repetition penalty in classify_with_lm either.
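
To see the padding difference concretely, here is a small illustration using the two tokenizers loaded above (tokenizer pads on the right, as assumed in classify_with_lm, while lm_tokenizer pads on the left); the sample texts are arbitrary.

# illustrate right vs. left padding on a toy batch
demo_texts = ["short text", "a somewhat longer text sample"]
right_padded = tokenizer(demo_texts, padding=True)    # right padding
left_padded = lm_tokenizer(demo_texts, padding=True)  # left padding
print(right_padded["input_ids"])  # pad ids appended: the short text ends with pad tokens
print(left_padded["input_ids"])   # pad ids prepended: the last position is always a real token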

Load the Model as a Sequence Classifier

To avoid outputs outside the label set, we can force the output tokens to be among the labels. The idea is to ignore the logits of all tokens except the labels. We could simply modify the classify_with_lm function above to focus only on the logits of the labels, as sketched below. I will go one step further and convert the model from CausalLM to SequenceClassification.
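
Here is a minimal sketch of that "modify classify_with_lm" route for a single prompt, reusing prompt, tokenizer and lm_model from above (the label list and its order here are for illustration only):

# sketch: restrict the next-token argmax to the label tokens only
candidate_labels = ["negative", "neutral", "positive"]
candidate_token_ids = [tokenizer.encode(label)[0] for label in candidate_labels]

model_inputs = tokenizer([prompt], return_tensors="pt").to(lm_model.device)
with torch.no_grad():
    output = lm_model(**model_inputs)

label_logits = output.logits[0, -1].cpu()[candidate_token_ids]  # keep only the 3 label logits
print(candidate_labels[label_logits.argmax().item()])           # -> negative for our sample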

labels = ["positive", "negative", "neutral"]
labels_token_ids = [
    tokenizer.encode(label)[0]
    for label in labels
]

for label, label_id in zip(labels, labels_token_ids):
    print(f"The label `{label}` has token id: {label_id}")
The label `positive` has token id: 30487
The label `negative` has token id: 42224
The label `neutral` has token id: 59568

The order of the labels is chosen so that the token ids are in ascending order. This is an important detail if we want results consistent with the CausalLM. I will leave it as an exercise for you to figure out why it matters; a hint is the behavior of torch.argmax when there are multiple maxima.

# load classification model
cls_model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    num_labels=len(labels),
    torch_dtype="auto",
    device_map="auto"
)

# extract the weights for the labels
state_dict = lm_model.state_dict()
cls_weights = state_dict["lm_head.weight"][labels_token_ids]

# load them into the scoring module
with torch.no_grad():
    cls_model.score.weight.copy_(cls_weights)
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Okay, what is going on here? LLMs are classification models as well: they predict a label, the next token, over the whole vocabulary. For our sentiment analysis, we only need their predictions on a limited set of 3 tokens: negative, neutral and positive. So we can extract the weights that calculate the logits for these labels and use them to initialize a SequenceClassification model. Let's see how to use it on a sample.

# get the prediction from the model
model_inputs = tokenizer([prompt], return_tensors="pt").to(cls_model.device)

with torch.no_grad():
    cls_model_output = cls_model(**model_inputs)

label_id = cls_model_output.logits.argmax(dim=-1)
predicted_label = labels[label_id]

print(text_sample["text"])
print(f"Sentiment (Truth): {text_sample['label']}")
print(f"Sentiment (Predicted): {predicted_label}")
In Q2 of 2009 , profit before taxes amounted to EUR 13.6 mn , down from EUR 26.8 mn in Q2 of 2008 .
Sentiment (Truth): negative
Sentiment (Predicted): negative
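
As a quick sanity check on the weight-copying argument, we can compare the two models directly (reusing lm_model_output and cls_model_output, which were computed on the same prompt): the three classification logits should match the CausalLM's last-token logits at the label token ids, up to floating-point noise.

# the classification logits are just the CausalLM last-token logits at the label tokens
lm_label_logits = lm_model_output.logits[0, -1][labels_token_ids]
print(lm_label_logits)             # CausalLM logits for positive, negative, neutral
print(cls_model_output.logits[0])  # SequenceClassification logits, same label order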

One of the good things about using SequenceClassification is that we can assemble a TextClassificationPipeline.

# set the labels for the classification model
labels_count = len(labels)
label2id = dict(zip(labels, range(labels_count)))
id2label = dict(zip(range(labels_count), labels))
cls_model.config.label2id = label2id
cls_model.config.id2label = id2label

We set the label2id and id2label so that TextClassificationPipeline can return the correct labels.

# set the padding token so that we can run the pipeline in batches
cls_model.config.pad_token_id = tokenizer.pad_token_id
pipe = TextClassificationPipeline(
    tokenizer=tokenizer, model=cls_model
)

pipe(text_sample["prompt"])
Device set to use mps
[{'label': 'negative', 'score': 0.6845569014549255}]
cls_results = pipe(df_data["prompt"].to_list(), batch_size=32)
count = 0
for result, label in zip(cls_results, df_data["label"]):
    count += result["label"] == label
print(f"Correct count = {count}/{len(cls_results)} = {count/len(cls_results):.2%}")
Correct count = 351/453 = 77.48%

We set the batch size to 32 in the pipeline to be consistent with what we used before. Setting it to another number might affect the results a bit. The reason is that the number of padding tokens has some impact on the logits, since the embedding of the padding token is non-zero. We almost reproduce the same result as with the CausalLM above, except that we get one more sample right.

for result1, result2 in zip(lm_results, cls_results):
    if result1.lower() != result2["label"]:
        print(f"LM prediction: {result1}")
        print(f"CLS prediction: {result2['label']}")
LM prediction: Neutral
CLS prediction: positive

Since we only allow "positive", "negative" and "neutral" in the sequence classifier, the behavior changes whenever the CausalLM outputs a label other than these three.

Prompt Matters

In zero-shot classification, the input prompt matters, as it is the main lever we have to steer the pretrained LLM towards our task. To demonstrate the effect, let's run another prompt that states the task in the system message.

system_prompt = """"You are a helpful assistant that good at sentiment analysis on texts related to finance.
""".strip()
prompt2 = """What is the sentiment of the following text related to finance?
negative, neutral or positive: {text}
Give your answer in one word."""
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt2}
]
prompt_template2 = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(prompt_template2)
<|im_start|>system
"You are a helpful assistant that good at sentiment analysis on texts related to finance.<|im_end|>
<|im_start|>user
What is the sentiment of the following text related to finance?
negative, neutral or positive: {text}
Give your answer in one word.<|im_end|>
<|im_start|>assistant

df_data["prompt2"] = df_data["text"].apply(lambda text: prompt_template2.format(text=text))
cls_results2 = pipe(df_data["prompt2"].to_list(), batch_size=32)
count = 0
for result, label in zip(cls_results2, df_data["label"]):
    count += result["label"] == label
print(f"Correct count = {count}/{len(cls_results2)} = {count/len(cls_results2):.2%}")
Correct count = 374/453 = 82.56%

With a simple update to the system message, we gain 5% in performance! We could keep doing prompt engineering following best practices, but that is not efficient. A more efficient way is to do fine-tuning with some data. Since our focus is zero-shot classification, I will leave that discussion out, but one of the advantages of setting up a SequenceClassifier is that it is readily fine-tunable as a classifier on task-specific data.
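
To illustrate that last point only (it is outside the zero-shot workflow), a fine-tuning run could be sketched with the 🤗 Trainer roughly as follows. The hyperparameters are placeholders, and in practice you would train on a separate split rather than on this validation data.

# sketch only: fine-tune cls_model as a classifier with the 🤗 Trainer
from datasets import Dataset
from transformers import Trainer, TrainingArguments

train_ds = Dataset.from_pandas(df_data[["prompt", "label"]])
train_ds = train_ds.map(lambda row: {
    "input_ids": tokenizer(row["prompt"], truncation=True)["input_ids"],
    "labels": label2id[row["label"]],
})

trainer = Trainer(
    model=cls_model,
    args=TrainingArguments(output_dir="qwen2.5-financial-sentiment", num_train_epochs=1),
    train_dataset=train_ds,
    tokenizer=tokenizer,  # enables dynamic padding of each batch
)
# trainer.train()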