
RAG-Shot Learning

Labeled data can be used in a RAG-like pipeline instead of tuning, to enhance performance of unmodified models. Let's benchmark it.

Motivation

With a small corpus of labeled data, one can improve LLM performance by tuning, but practical realities of infrastructure do not always allow it. The same labeled data, however, can be mined for relevant examples on a per-inference basis and delivered as a few-shot prompt.

Few-shot Learning

Few-shot learning is the simple addition of explicit examples to the prompt.

Task: Classify the review as either `1` (bad) or `2` (good). Write nothing else.

Example Review: `Worst sandwich ever!`
Sentiment: `1`

Example Review: `I love this bar!`
Sentiment: `2`

RAG

RAG is any method where cheap retrieval is used to select helpful information for inclusion in a prompt.

Often, this means a knowledge base with an approximate-nearest-neighbor index over text fragments and their embeddings. That's a blunt instrument, and you can build a more sophisticated, application-specific pipeline, but for this, the blunt instrument is enough.

RAG-Shot

Combine these like so.

  1. Build a training set. Can be small.
  2. Embed the training set's inputs. Build a table.
  3. During inference, embed the input.
  4. Take the top-K input-output pairs from the training set, according to the similarity of each training input to your actual input.
  5. Pass these in with your prompt, as sketched below.
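
Here's a minimal sketch of that recipe in Python. It uses hnswlib for the index, as the benchmarks below do, but the embedder (sentence-transformers), the toy training set, and the prompt template are stand-ins, not the exact apparatus behind these results.

```python
import hnswlib
from sentence_transformers import SentenceTransformer

# Stand-in embedder; not necessarily what the benchmarks below use.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Steps 1-2: a small labeled training set, embedded once, up front.
train_texts  = ["Worst sandwich ever!", "I love this bar!", "Service was slow and rude."]
train_labels = ["1", "2", "1"]
train_vecs   = embedder.encode(train_texts, normalize_embeddings=True)

index = hnswlib.Index(space="cosine", dim=train_vecs.shape[1])
index.init_index(max_elements=len(train_texts), ef_construction=200, M=16)
index.add_items(train_vecs, list(range(len(train_texts))))

def ragshot_prompt(query: str, k: int = 2) -> str:
    # Steps 3-4: embed the incoming input, pull the top-K nearest training pairs.
    qvec = embedder.encode([query], normalize_embeddings=True)
    ids, _ = index.knn_query(qvec, k=k)
    # Step 5: render them as a few-shot prompt ahead of the real input.
    shots = "\n\n".join(
        f"Example Review: `{train_texts[int(i)]}`\nSentiment: `{train_labels[int(i)]}`"
        for i in ids[0]
    )
    task = "Task: Classify the review as either `1` (bad) or `2` (good). Write nothing else."
    return f"{task}\n\n{shots}\n\nReview: `{query}`\nSentiment:"

print(ragshot_prompt("The burgers here are fantastic."))
```

The returned string goes to the model as-is; nothing about the model itself changes.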

Benchmarking

It's not enough to imagine it works. Let's try it.

All benchmarks are run with a logit bias; the model is restricted to emitting only the task's valid labels. All benchmarks use hnswlib for ANN (approximate nearest neighbor) lookups.
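
For the curious, a hard logit bias can be expressed as a logits processor in Hugging Face transformers. This sketch assumes local generation with Llama-3.2-1B-Instruct and a two-class review task like the Yelp benchmark below; the exact mechanism used for these benchmarks may differ.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class AllowOnly(LogitsProcessor):
    """Hard logit bias: every token outside `allowed_ids` is pushed to -inf."""
    def __init__(self, allowed_ids):
        self.allowed_ids = allowed_ids

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0.0
        return scores + mask

name = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Token ids of the only valid answers for the task: "1" and "2".
allowed = [tok.encode(t, add_special_tokens=False)[0] for t in ("1", "2")]

prompt = "Classify the review as `1` (bad) or `2` (good). Review: `I love this bar!` Sentiment: "
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=1, do_sample=False,
                     logits_processor=LogitsProcessorList([AllowOnly(allowed)]))
print(tok.decode(out[0, inputs["input_ids"].shape[1]:]))  # always "1" or "2"
```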

Yelp Review Polarity

This might be too simple, but I'd like to compare results to my prefix-tuning experiment, in which I run the Yelp Review Polarity benchmark on Llama-3.2-1B-Instruct.

Baseline

With two classes, rand() would give us 50% accuracy.

The first test bypasses the forward pass altogether: it takes the approximate nearest neighbor of the test sample's embedding and assumes that training sample's classification. This is the bluntest, simplest way to solve this task with a language model.
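
In code, that baseline is just a k=1 nearest-neighbor lookup over the training embeddings. A sketch, again with hnswlib and a stand-in embedder:

```python
import hnswlib
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedder

def ann_classifier(train_texts, train_labels):
    vecs = embedder.encode(train_texts, normalize_embeddings=True)
    index = hnswlib.Index(space="cosine", dim=vecs.shape[1])
    index.init_index(max_elements=len(train_texts), ef_construction=200, M=16)
    index.add_items(vecs, np.arange(len(train_texts)))

    def classify(text: str) -> str:
        ids, _ = index.knn_query(embedder.encode([text], normalize_embeddings=True), k=1)
        return train_labels[int(ids[0, 0])]  # inherit the nearest neighbor's label

    return classify

classify = ann_classifier(["Worst sandwich ever!", "I love this bar!"], ["1", "2"])
print(classify("Terrible food, never again."))  # expected: "1"
```

No forward pass, no prompt; the only model involvement is the embedding.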

The second test is a plain, minimal instruction.

The third test adds a hand-written, minimal pair of examples.

Tests with llama3-1B-Instruct. Score is points out of 2048 samples.

| N Train | N Examples | Points | Accuracy |
|---------|------------|--------|----------|
| 1024    | ANN        | 1398   | 68%      |
| N/A     | 0          | 1752   | 85%      |
| Fixed   | 2          | 1525   | 74%      |

Curiously, the few-shot prompt performs worse than zero-shot. Perhaps the task is too easy, and the examples can only distract?

Onward:

RAG-shot tests with llama3-1B-Instruct. Score is out of 2048.

| N Train | N Examples   | Points | Accuracy |
|---------|--------------|--------|----------|
| 1024    | 2            | 1775   | 86%      |
| 1024    | 4            | 1870   | 91%      |
| 1024    | 4 (Balanced) | 1896   | 92%      |

With four samples, we get 91%, which approaches the 96% figure achieved with prefix-tuning on this model. I've written two versions of this test: a pure-ANN version (ragshot.py) and a "balanced" version (ragshot.balanced.py) which ensures the examples represent both classifications.
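
The balancing step is simple in principle: over-fetch neighbors, then fill the example slots class by class, nearest-first. The sketch below is my guess at the shape of that logic, not a transcription of ragshot.balanced.py; `overfetch` is an arbitrary cushion.

```python
def balanced_examples(index, query_vec, train_labels, k=4, overfetch=64):
    """Pick k nearest training rows, spreading picks across classes.
    `index` is an hnswlib index over the training embeddings, as built above."""
    n = min(overfetch, index.get_current_count())
    ids, _ = index.knn_query(query_vec, k=n)

    # Bucket candidate ids by their label, preserving nearest-first order.
    by_class = {}
    for i in ids[0]:
        by_class.setdefault(train_labels[int(i)], []).append(int(i))

    # Round-robin across classes so each retrieved label contributes
    # at least one example (when k allows).
    picked = []
    while len(picked) < k and any(by_class.values()):
        for bucket in by_class.values():
            if bucket and len(picked) < k:
                picked.append(bucket.pop(0))
    return picked
```

The picked indices then feed the same prompt template as the unbalanced version.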

I tried again with a larger model.

Same tests, with the larger llama3-3B-Instruct.

| N Train | N Examples   | Score | Accuracy |
|---------|--------------|-------|----------|
| 1024    | ANN          | 1441  | 70%      |
| N/A     | 0            | 1835  | 89%      |
| Fixed   | 2            | 1987  | 97%      |
| 1024    | 2            | 1851  | 90%      |
| 1024    | 4            | 1836  | 89%      |
| 1024    | 4 (Balanced) | 1989  | 97%      |

Here, the tables turn against it; the instruction prompt with two minimal examples scores 97%! The RAG-enhanced few-shot scores slightly higher, but only with four balanced samples. You can see how sensitive it is. The results from the 1B model do not transfer to the 3B variant.

You can find similar effects if, for example, you're running ChatGPT-3.5-Turbo and decide to drop in ChatGPT-4o. Is it better? Are you sure? How would you know? In a previous experiment, GPT-4 smoked every newer model in a deduplication task.

And why not? None of them were trained for my task, so their ability to complete it is incidental.

Go Emotions

Google's Go Emotions seemed like a good test; short fragments are classified into twenty-seven categories. This puts a heavy load on the means of expression.

Text: I stand corrected. As I said, wonderful charities. I will always enjoy GDQ, just not revalant to my experience.
Label: joy
Llama 1B: disappointment, disapproval, disapproval, disgust

Both of these labels are debatable, but one is correct "officially".

This task proved exceptionally hard for off-the-shelf, pre-trained LLMs. I expect it will crack with a fine-tune or a trained classifier head, but we're not trying that today.

What's odd about this task, and what makes the technique relevant, is that the labels aren't always intuitive. Deep Learning is imitative, and the actual job is to imitate the labelers. You're automating their framing and intuition. Perhaps a given text, labeled "joy", doesn't look like "joy" to everyone, but the labelers are the designated authoritative source, so it's "joy" officially.

I decided to admit only single-label examples, in order to keep this task in reach of the small models.
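
Concretely, that filter is one line against the Hugging Face go_emotions dataset; this sketch assumes the "simplified" configuration, where each row carries a list of label ids.

```python
from datasets import load_dataset

ds = load_dataset("go_emotions", "simplified")

# Keep only rows annotated with exactly one label, in every split.
single = ds.filter(lambda row: len(row["labels"]) == 1)

train = single["train"]
names = train.features["labels"].feature.names  # 27 emotions plus "neutral"
print(len(train), "single-label training rows")
print(train[0]["text"], "->", names[train[0]["labels"][0]])
```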

Crude Baseline

With 28 categories (their 27-part list plus "neutral"), rand() should manage roughly 3.6% accuracy.

We try ANN first:

With llama3-1B-Instruct embeddings, classify test samples via LUT.

| N Train | Score | Accuracy |
|---------|-------|----------|
| 1024    | 455   | 22%      |
| 4096    | 506   | 24%      |
| 16384   | 573   | 27%      |

We then try the models with a zero-shot prompt.

Off-the-shelf models with instructions only.

| Model       | Score | Accuracy |
|-------------|-------|----------|
| Llama3-1B   | 88    | 4%       |
| Qwen3-8B    | 78    | 3%       |
| ChatGPT-4.1 | 92    | 4%       |

These models are roughly as good as rand(), and a lot more expensive. Usually, that means we've made a mistake, but a close examination found nothing wrong; the task is just very ambiguous.

RAG-Shot

With llama3-1B-Instruct vs. 2048 test samples.

| N Train | N Examples | Score | Accuracy |
|---------|------------|-------|----------|
| 1024    | 2          | 238   | 11%      |
| 1024    | 8          | 490   | 23%      |
| 1024    | 16         | 693   | 33%      |
| 1024    | 24         | 722   | 35%      |
| 4096    | 24         | 750   | 36%      |
| 16384   | 24         | 803   | 39%      |

With qwen3-8B vs. 2048 test samples.

| N Train | N Examples | Score | Accuracy |
|---------|------------|-------|----------|
| 1024    | 2          | 151   | 7%       |
| 4096    | 8          | 354   | 17%      |
| 4096    | 24         | 524   | 25%      |

I tried to improve scores by translating the numeric classifications into label names for the model, instead of providing a lookup table in the instructions. Surprisingly, this hurt performance in every configuration. Models, eh?

I ported the test apparatus to ChatGPT-4.1, taking care to provide the examples in the system message, because it's trained against self-imitation (see: the Waluigi Effect).
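
Concretely, the retrieved pairs land in the system message and only the text under test goes in the user turn. A sketch against the OpenAI chat-completions API; the model name, prompt wording, and `classify` helper are illustrative, not the actual test apparatus.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(text: str, examples: list[tuple[str, str]]) -> str:
    # Retrieved (text, label) pairs go in the system message rather than as
    # fabricated assistant turns, so the model isn't asked to imitate itself.
    shots = "\n\n".join(f"Text: {t}\nLabel: {l}" for t, l in examples)
    system = ("Classify the text into exactly one Go Emotions label. "
              "Answer with the label only.\n\n" + shots)
    resp = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"Text: {text}\nLabel:"},
        ],
    )
    return resp.choices[0].message.content.strip()

print(classify("I stand corrected. As I said, wonderful charities.",
               [("What a lovely surprise!", "joy"),
                ("This ruined my day.", "disappointment")]))
```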

With ChatGPT-4.1 vs. 2048 test samples.

| N Train | N Examples | Score | Accuracy |
|---------|------------|-------|----------|
| 0       | 0          | 92    | 4%       |
| 1024    | 4          | 144   | 7%       |
| 16384   | 24         | 428   | 20%      |

OK. I scoured the code for some evidence of a fault. I tweaked and adjusted. I can find nothing wrong that would justify this.

Comparing Llama-3.2-1B and ChatGPT-4.1 may seem off topic, but the benchmarks are only meaningful if they work. These numbers could imply the model is answering from the examples and ignoring the actual payload. However, the apparatus seemed to be in working order. My best guess is informed by evaluations like these:

Text: I’m really sorry about your situation :( Although I love the names Sapphira, Cirilla, and Scarlett!
Label: remorse
ChatGPT: caring

ChatGPT gives answers that don't match the labelers' but seem plausible. Maybe its "inferior" score betrays a higher innate disagreeableness?

Conclusion

We can see that RAG-enhanced few-shot learning boosts performance over the zero-shot baseline in every case, using exclusively default, unmodified models. But it's never self-evident how the technique transfers from one task and model to another.

Whatever you're doing, don't overwork it until you've got a plan to measure! Every LLM call that matters should have a benchmark and test plan. Don't drive blind; automate testing.