
RAG-Shot Learning

Labeled data can be used in a RAG-like pipeline instead of tuning, to enhance performance of unmodified models. Let's benchmark it.

Motivation

With a small corpus of labeled data, one can improve LLM performance by tuning, but practical realities of infrastructure do not always allow it. The same labeled data, however, can be mined for relevant examples on a per-inference basis and delivered as a few-shot prompt.

Few-shot Learning

Few-shot learning is the simple addition of explicit examples to the prompt.

Task: Classify the review as either `1` (bad) or `2` (good). Write nothing else.

Example Review: `Worst sandwich ever!`
Sentiment: `1`

Example Review: `I love this bar!`
Sentiment: `2`

RAG

RAG is any method where cheap retrieval is used to select helpful information for inclusion in a prompt.

Often, this means a knowledge base with an approximate-nearest-neighbor index over text fragments and their embeddings. That's a blunt instrument, and you can build a more sophisticated, application-specific pipeline, but for this, the blunt instrument is enough.

RAG-Shot

Combine these like so.

  1. Build a training set. Can be small.
  2. Embed the training set's inputs. Build a table.
  3. During inference, embed the input.
  4. Take the top-K input-output pairs from the training set, according to the similarity of each training input to your actual input.
  5. Pass these in with your prompt, as sketched below.
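
Here's a minimal sketch of that recipe in Python. It uses hnswlib for the index, as the benchmarks below do, but the embedder (sentence-transformers), the toy training set, and the prompt template are stand-ins, not the exact apparatus behind these results.

```python
import hnswlib
from sentence_transformers import SentenceTransformer

# Stand-in embedder; not necessarily what the benchmarks below use.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Steps 1-2: a small labeled training set, embedded once, up front.
train_texts  = ["Worst sandwich ever!", "I love this bar!", "Service was slow and rude."]
train_labels = ["1", "2", "1"]
train_vecs   = embedder.encode(train_texts, normalize_embeddings=True)

index = hnswlib.Index(space="cosine", dim=train_vecs.shape[1])
index.init_index(max_elements=len(train_texts), ef_construction=200, M=16)
index.add_items(train_vecs, list(range(len(train_texts))))

def ragshot_prompt(query: str, k: int = 2) -> str:
    # Steps 3-4: embed the incoming input, pull the top-K nearest training pairs.
    qvec = embedder.encode([query], normalize_embeddings=True)
    ids, _ = index.knn_query(qvec, k=k)
    # Step 5: render them as a few-shot prompt ahead of the real input.
    shots = "\n\n".join(
        f"Example Review: `{train_texts[int(i)]}`\nSentiment: `{train_labels[int(i)]}`"
        for i in ids[0]
    )
    task = "Task: Classify the review as either `1` (bad) or `2` (good). Write nothing else."
    return f"{task}\n\n{shots}\n\nReview: `{query}`\nSentiment:"

print(ragshot_prompt("The burgers here are fantastic."))
```

The returned string goes to the model as-is; nothing about the model itself changes.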

Benchmarking

It's not enough to imagine it works. Let's try it.

All benchmarks are run with a logit bias; the model is restricted to emitting only the task's valid labels. All benchmarks use hnswlib for ANN (approximate nearest neighbor) lookups.
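
For the curious, a hard logit bias can be expressed as a logits processor in Hugging Face transformers. This sketch assumes local generation with Llama-3.2-1B-Instruct and a two-class review task like the Yelp benchmark below; the exact mechanism used for these benchmarks may differ.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class AllowOnly(LogitsProcessor):
    """Hard logit bias: every token outside `allowed_ids` is pushed to -inf."""
    def __init__(self, allowed_ids):
        self.allowed_ids = allowed_ids

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0.0
        return scores + mask

name = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Token ids of the only valid answers for the task: "1" and "2".
allowed = [tok.encode(t, add_special_tokens=False)[0] for t in ("1", "2")]

prompt = "Classify the review as `1` (bad) or `2` (good). Review: `I love this bar!` Sentiment: "
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=1, do_sample=False,
                     logits_processor=LogitsProcessorList([AllowOnly(allowed)]))
print(tok.decode(out[0, inputs["input_ids"].shape[1]:]))  # always "1" or "2"
```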

Yelp Review Polarity

This might be too simple, but I'd like to compare results to my prefix-tuning experiment, in which I run the Yelp Review Polarity benchmark on Llama-3.2-1B-Instruct.

Baseline

With two classes, rand() would give us 50% accuracy.

The first test bypasses the forward pass altogether: it takes the approximate nearest neighbor of the test sample's embedding and assumes that training sample's classification. This is the bluntest, simplest way to solve this task with a language model.
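
In code, that baseline is just a k=1 nearest-neighbor lookup over the training embeddings. A sketch, again with hnswlib and a stand-in embedder:

```python
import hnswlib
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedder

def ann_classifier(train_texts, train_labels):
    vecs = embedder.encode(train_texts, normalize_embeddings=True)
    index = hnswlib.Index(space="cosine", dim=vecs.shape[1])
    index.init_index(max_elements=len(train_texts), ef_construction=200, M=16)
    index.add_items(vecs, np.arange(len(train_texts)))

    def classify(text: str) -> str:
        ids, _ = index.knn_query(embedder.encode([text], normalize_embeddings=True), k=1)
        return train_labels[int(ids[0, 0])]  # inherit the nearest neighbor's label

    return classify

classify = ann_classifier(["Worst sandwich ever!", "I love this bar!"], ["1", "2"])
print(classify("Terrible food, never again."))  # expected: "1"
```

No forward pass, no prompt; the only model involvement is the embedding.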

The second test is a plain, minimal instruction.

The third test adds a hand-written, minimal pair of examples.

Tests with llama3-1B-Instruct. Score is points out of 2048 samples.

| N Train | N Examples | Points | Accuracy |
|---------|------------|--------|----------|
| 1024    | ANN        | 1398   | 68%      |
| N/A     | 0          | 1752   | 85%      |
| Fixed   | 2          | 1525   | 74%      |

Curiously, the few-shot prompt performs worse than zero-shot. Perhaps the task is too easy, and the examples can only distract?

Onward:

RAG-shot tests with llama3-1B-Instruct. Score is out of 2048.

| N Train | N Examples   | Points | Accuracy |
|---------|--------------|--------|----------|
| 1024    | 2            | 1775   | 86%      |
| 1024    | 4            | 1870   | 91%      |
| 1024    | 4 (Balanced) | 1896   | 92%      |

With four samples, we get 91%, which approaches the 96% figure achieved with prefix-tuning on this model. I've written two versions of this test: a pure-ANN version (ragshot.py) and a "balanced" version (ragshot.balanced.py) which ensures the examples represent both classifications.
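
The balancing step is simple in principle: over-fetch neighbors, then fill the example slots class by class, nearest-first. The sketch below is my guess at the shape of that logic, not a transcription of ragshot.balanced.py; `overfetch` is an arbitrary cushion.

```python
def balanced_examples(index, query_vec, train_labels, k=4, overfetch=64):
    """Pick k nearest training rows, spreading picks across classes.
    `index` is an hnswlib index over the training embeddings, as built above."""
    n = min(overfetch, index.get_current_count())
    ids, _ = index.knn_query(query_vec, k=n)

    # Bucket candidate ids by their label, preserving nearest-first order.
    by_class = {}
    for i in ids[0]:
        by_class.setdefault(train_labels[int(i)], []).append(int(i))

    # Round-robin across classes so each retrieved label contributes
    # at least one example (when k allows).
    picked = []
    while len(picked) < k and any(by_class.values()):
        for bucket in by_class.values():
            if bucket and len(picked) < k:
                picked.append(bucket.pop(0))
    return picked
```

The picked indices then feed the same prompt template as the unbalanced version.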

I tried again with a larger model.

Same tests, with the larger llama3-3B-Instruct.

| N Train | N Examples   | Score | Accuracy |
|---------|--------------|-------|----------|
| 1024    | ANN          | 1441  | 70%      |
| N/A     | 0            | 1835  | 89%      |
| Fixed   | 2            | 1987  | 97%      |
| 1024    | 2            | 1851  | 90%      |
| 1024    | 4            | 1836  | 89%      |
| 1024    | 4 (Balanced) | 1989  | 97%      |

Here, the tables turn against it; the instruction prompt with two minimal examples scores 97%! The RAG-enhanced few-shot scores slightly higher, but only with four balanced samples. You can see how sensitive it is. The results from the 1B model do not transfer to the 3B variant.

You can find similar effects if, for example, you're running ChatGPT-3.5-Turbo and decide to drop in ChatGPT-4o. Is it better? Are you sure? How would you know? In a previous experiment, GPT-4 smoked every newer model in a deduplication task.

And why not? None of them were trained for my task, so their ability to complete it is incidental.

Go Emotions

Google's Go Emotions seemed like a good test; short fragments are classified into twenty-seven categories. This puts a heavy load on the means of expression.

Text: I stand corrected. As I said, wonderful charities. I will always enjoy GDQ, just not revalant to my experience.
Label: joy
Llama 1B: disappointment, disapproval, disapproval, disgust

Both of these labels are debatable, but one is correct "officially".

This task proved exceptionally hard for off-the-shelf, pre-trained LLMs. I expect it will crack with a fine-tune or a trained classifier head, but we're not trying that today.

What's odd about this task, and what makes the technique relevant, is that the labels aren't always intuitive. Deep Learning is imitative, and the actual job is to imitate the labelers. You're automating their framing and intuition. Perhaps a given text, labeled "joy", doesn't look like "joy" to everyone, but the labelers are the designated authoritative source, so it's "joy" officially.

I decided to admit only single-label examples, in order to keep this task in reach of the small models.
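
Concretely, that filter is one line against the Hugging Face go_emotions dataset; this sketch assumes the "simplified" configuration, where each row carries a list of label ids.

```python
from datasets import load_dataset

ds = load_dataset("go_emotions", "simplified")

# Keep only rows annotated with exactly one label, in every split.
single = ds.filter(lambda row: len(row["labels"]) == 1)

train = single["train"]
names = train.features["labels"].feature.names  # 27 emotions plus "neutral"
print(len(train), "single-label training rows")
print(train[0]["text"], "->", names[train[0]["labels"][0]])
```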

Crude Baseline

With 28 categories (their 27-part list plus "neutral"), rand() should manage roughly 3.6% accuracy.

We try ANN first:

With llama3-1B-Instruct embeddings, classify test samples via LUT.

| N Train | Score | Accuracy |
|---------|-------|----------|
| 1024    | 455   | 22%      |
| 4096    | 506   | 24%      |
| 16384   | 573   | 27%      |

We then try the models with a zero-shot prompt.

Off-the-shelf models with instructions only.

| Model       | Score | Accuracy |
|-------------|-------|----------|
| Llama3-1B   | 88    | 4%       |
| Qwen3-8B    | 78    | 3%       |
| ChatGPT-4.1 | 92    | 4%       |

These models are roughly as good as rand(), and a lot more expensive. Usually, that means we've made a mistake, but a close examination found nothing wrong; the task is just very ambiguous.

RAG-Shot

With llama3-1B-Instruct vs. 2048 test samples.

| N Train | N Examples | Score | Accuracy |
|---------|------------|-------|----------|
| 1024    | 2          | 238   | 11%      |
| 1024    | 8          | 490   | 23%      |
| 1024    | 16         | 693   | 33%      |
| 1024    | 24         | 722   | 35%      |
| 4096    | 24         | 750   | 36%      |
| 16384   | 24         | 803   | 39%      |

With qwen3-8B vs. 2048 test samples.

| N Train | N Examples | Score | Accuracy |
|---------|------------|-------|----------|
| 1024    | 2          | 151   | 7%       |
| 4096    | 8          | 354   | 17%      |
| 4096    | 24         | 524   | 25%      |

I tried to improve scores by translating the numeric classifications into label names for the model, instead of providing a lookup table in the instructions. Surprisingly, this hurt performance in every configuration. Models, eh?

I ported the test apparatus to ChatGPT-4.1, taking care to provide the examples in the system message, because it's trained against self-imitation (see: the Waluigi Effect).
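
Concretely, the retrieved pairs land in the system message and only the text under test goes in the user turn. A sketch against the OpenAI chat-completions API; the model name, prompt wording, and `classify` helper are illustrative, not the actual test apparatus.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(text: str, examples: list[tuple[str, str]]) -> str:
    # Retrieved (text, label) pairs go in the system message rather than as
    # fabricated assistant turns, so the model isn't asked to imitate itself.
    shots = "\n\n".join(f"Text: {t}\nLabel: {l}" for t, l in examples)
    system = ("Classify the text into exactly one Go Emotions label. "
              "Answer with the label only.\n\n" + shots)
    resp = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"Text: {text}\nLabel:"},
        ],
    )
    return resp.choices[0].message.content.strip()

print(classify("I stand corrected. As I said, wonderful charities.",
               [("What a lovely surprise!", "joy"),
                ("This ruined my day.", "disappointment")]))
```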

With ChatGPT-4.1 vs. 2048 test samples.

| N Train | N Examples | Score | Accuracy |
|---------|------------|-------|----------|
| 0       | 0          | 92    | 4%       |
| 1024    | 4          | 144   | 7%       |
| 16384   | 24         | 428   | 20%      |

OK. I scoured the code for some evidence of a fault. I tweaked and adjusted. I can find nothing wrong that would justify this.

Comparing Llama-3.2-1B and ChatGPT-4.1 may seem off topic, but the benchmarks are only meaningful if they work. These numbers could imply the model is answering from the examples and ignoring the actual payload. However, the apparatus seemed to be in working order. My best guess is informed by evaluations like these:

Text: I’m really sorry about your situation :( Although I love the names Sapphira, Cirilla, and Scarlett!
Label: remorse
ChatGPT: caring

ChatGPT gives answers that don't match the labelers' but seem plausible. Maybe its "inferior" score betrays a higher innate disagreeableness?

Conclusion

We can see that RAG-enhanced few-shot learning boosts performance over the zero-shot baseline in every case, using exclusively default, unmodified models. But it's never self-evident how the technique transfers from one task and model to another.

Whatever you're doing, don't overwork it until you've got a plan to measure! Every LLM call that matters should have a benchmark and test plan. Don't drive blind; automate testing.