
Semantic Deduplication

In this article, I'm going to benchmark various models to find the best way to deduplicate factoids. This is my first post writing about AI as an application developer.

Our Task

I would like to deduplicate factoids.

I'm extracting granular factoids from text, and ingesting them into an index for use with a chat bot. If factoids X and Y are similar enough to each other, any query which leads to X will also lead to Y, and if we can only load N records into our context window, the duplicates will crowd out novel information. We would like to find duplicate claims and remove them.

It's not clear to me how to put this all together. Let's find out.

I see three problems.

  1. X and Y can be semantically identical, but differ significantly on a bit-for-bit basis.
  2. The number of possible comparisons scales quadratically with the number of factoids, as a triangular number: N factoids yield N(N-1)/2 possible pairings.
  3. The claim "X and Y are redundant" is "subjective."

The first item means a good comparison requires the kind of reading comprehension that language models afford us. This makes each comparison expensive.

The second item means a large number of comparisons will be made, and therefore each comparison must be cheap, and skipped altogether where possible.

The third item cuts to the heart of what Deep Learning is and why we employ it.

Subjective?

In classical software, you employ formal methods of exploiting quantitative data (math). Everything is reframed as numerical systems the computer can deal with. A "fighter jet" in Civilization is but a skin over a record in a table with range, state and HP.

But in language modeling, you grapple with qualitative data. Your methods can be formal, but not formal in the way a mathematical equation is formal; they can be formal in the sense that a court is formal, where a process exists, but it matters very much who stands in each role.

Let's consider "Mary had a little lamb" vs. "Mary had a tiny lamb". These claims are very similar. But do they mean the same thing? Certainly, "tiny" is a stronger adjective than "little". You may feel they're the same, but another person may feel they're not. It cannot be decided independently; a person (a "subject") must feel a certain way about it.

Semantic deduplication can have no formal solution independent of a person.

Solving Subjectivity

And yet, we must commit. For any given pairing in your dataset, you need a simple, scalar, boolean conclusion. The pairing shall be deemed redundant, or not redundant. The conclusion shall be recorded, and the matter shall be considered closed. That's where Deep Learning comes in.

Deep Learning is Imitative

Models are trained from corpora, which represent authoritative examples of ideal behavior. The authority may be yourself, a hired labeler, user feedback, or another, stronger model which you have designated to stand in for you. Subjectivity is solved by literally choosing an authority, and training the computer to imitate that person.

Context Matters

Fuzzy, ambiguous, contestable comparisons tend to become very clear once you commit to a purpose and context. For example, if we know we're judging poetic rhetoric, "little" and "tiny" are not the same. But if the question is "who owns what?", "little" and "tiny" are probably irrelevant.

Acceptance

Finally, we accept at the outset that what we are doing is more akin to manufacturing than math. There is not only a COGS, but a yield. Instead of physical artifacts, we mass manufacture conclusions, and some of those conclusions will be wrong.

The rate of defective conclusions can be reduced with more labor and compute, but never to zero, as their validity is a function of agreement, and you (the authority) likely don't even agree with yourself 100% of the time. That is life.

Creating a Benchmark

Our central duty. Ideally, we would create a benchmark that resembles actual data. However, before we've built and shipped a system, we might not in fact possess any. Any benchmark you create yourself will serve you better than a common, off-the-shelf benchmark, thanks to Goodhart's Law. Case in point: the Quora Duplicate Questions dataset seems like a good approximation of our activity, except that every model we're testing has in fact been trained on it, tuned with it, or both.

We're doing semantic deduplication; we would like to be able to say, for any given pair of records, whether the pair in question is "redundant" or not. And recall that we need to both model the behavior and judge the model's behavior. So to start, we're going to need to gather a realistic set of sentence pairs, then label them.

Raw Material

For demonstration purposes, I've decided to start with MS MARCO. All I need is a large pile of sentences, and MS MARCO's "passages" collection is certainly that. Finding the passages TSV, we get 8,841,823 lines like so:

6 Nor will it attempt to substitute for the extraordinarily rich literature on the atomic bombs and the end of World War II. This collection does not attempt to document the origins and development of the Manhattan Project.
8 In June 1942, the United States Army Corps of Engineersbegan the Manhattan Project- The secret name for the 2 atomic bombs.
9 One of the main reasons Hanford was selected as a site for the Manhattan Project's B Reactor was its proximity to the Columbia River, the largest river flowing into the Pacific Ocean from the North American coast.

Each of these records can be quite ambiguous if compared to another, because they contain multiple overlapping claims. In real life, data will be quite messy, but let's assume we'll deal with more granular information than this. With a simple script, I break each passage into sentences, to produce a second file.
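
The script itself isn't interesting; a minimal sketch of the idea, with a naive regex splitter and illustrative file names, looks like this:

```python
import re

# Naive splitter: break on ., ! or ? followed by whitespace and a capital letter.
# Real text needs more care (abbreviations, decimals, quotes), but this is the idea.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(passage: str) -> list[str]:
    return [s.strip() for s in BOUNDARY.split(passage) if s.strip()]

sentence_id = 0
with open("collection.tsv", encoding="utf-8") as passages, \
        open("sentences.tsv", "w", encoding="utf-8") as out:
    for line in passages:
        pid, text = line.rstrip("\n").split("\t", 1)
        for sentence in split_sentences(text):
            out.write(f"{sentence_id}\t{sentence}\n")
            sentence_id += 1
```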

23641370 Photo credit: Jeff Golden
23641371 In chemistry, a salt is defined as an ionic compound that is formed from the reaction of an acid and a base
23641374 A mixture of copper (blue) and strontium (red) makes purple

This gives us 23,641,377 records. That, in turn, gives us roughly 279 trillion pairings; a perfect example of why we cannot simply send off each pair to the latest ChatGPT (pricing).

Forming Pairs

We don't need every pairing for the benchmark. Nearly all possible pairings are non-redundant, and we would like a small pile with a good portion of examples of redundancy. Our method does not need to be perfect. It's enough to find a set of pairings which probably include some redundant pairings. And we can do that with an embedding model.

I chose MiniLM-L6, one of the most popular and basic embedding models. I vectorized a large portion of the dataset, then loaded the vectors into a vector index (hnswlib). Then, taking a random subset of the records (I chose 57k), I found each record's approximate nearest neighbor. This gives me candidate pairings, like so:

"The town was incorporated on April 4, 1910" "The city incorporated on December 2, 1896"
"The population was 23,727 at the 2010 census" "The population was 5,677 at the 2010 census"
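
Roughly, the candidate-generation step looks like this (the index parameters here are illustrative defaults, not tuned values):

```python
import random

import hnswlib
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# sentences.tsv is the file produced above; encoding millions of lines takes a while.
sentences = [line.rstrip("\n").split("\t", 1)[1]
             for line in open("sentences.tsv", encoding="utf-8")]
vectors = model.encode(sentences, normalize_embeddings=True, show_progress_bar=True)

index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
index.add_items(vectors, list(range(len(vectors))))

candidates = []
for i in random.sample(range(len(sentences)), k=57_000):
    # k=2 because every vector's nearest neighbor is itself.
    labels, distances = index.knn_query(vectors[i], k=2)
    j = int(labels[0][1])
    candidates.append((sentences[i], sentences[j], float(distances[0][1])))
```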

First Labels

This feels like a job for an LLM, but how do we establish that the LLM will, in fact, answer the way we want?

The mistake rate of an LLM is less obvious when you're working in a product like Cursor. As you type, it writes ahead. When it's right, it feels like magic, and you tab-complete. When it's wrong, you just keep typing over it. You have, in fact, an extremely high tolerance for its madness, provided that it's wholly supervised.

Here, we will not be supervising.

I took a small slice of samples and labeled them by hand:

Alfa Bravo Redundant
"The town was incorporated on April 4, 1910" "The city incorporated on December 2, 1896" false
"The population was 23,727 at the 2010 census" "The population was 5,677 at the 2010 census" false
"The Manhattan Project was a research and development undertaking during World War II that produced the first nuclear weapons" "The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb" true

I run my labeled samples through various LLMs and prompts until I can't improve the score any further:

Screenshot of a few-shot prompt, next to test results for an LLM.
A prompt is written, with both an instruction and examples.
Screenshot of a few-shot prompt, next to test results for an LLM.
Test samples are compared to my labels.
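
The harness behind those screenshots is nothing fancy. A sketch follows, with an abbreviated stand-in for the pictured prompt; the non-OpenAI models go through their own compatible endpoints, and the "GPA" figure in the table below is a separate scoring step not reproduced here:

```python
import json

from openai import OpenAI

client = OpenAI()

# Stand-in for the pictured prompt; the real instruction and few-shot examples are longer.
SYSTEM = ("Given a JSON array with a pair of strings, write `true` if both convey "
          "the same information, otherwise `false`. Write nothing else.")

def judge(alfa: str, bravo: str, model: str) -> bool:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": json.dumps([alfa, bravo])}],
        temperature=0,
        max_tokens=1,
    )
    return resp.choices[0].message.content.strip().lower() == "true"

def agreement(labeled: list[tuple[str, str, bool]], model: str) -> float:
    # Fraction of hand-labeled pairs where the model agrees with my labels.
    return sum(judge(a, b, model) == label for a, b, label in labeled) / len(labeled)
```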

I'm able to test a variety of models:

Model GPA Notes
GPT-4 5.7 New models defeat this older model on popular, public benchmarks, but not this internal one.
GPT-4o 5.4
Qwen-2.5-72B 5.3
gpt-3.5-turbo 5.2
llama-3.3-70b-instruct 4.8
claude-3.5-haiku 4.7 Claude is the one you're supposed to like on Twitter, and Haiku is the little one.
llama-3.1-405B-Instruct 4.2
Mixtral-8x22b-Instruct 0.1 Not compliant enough to work with the same test script. Would require special care.
Mistral-7B 0.0 Simply non-compliant. It's possible it would work reading the logits for the "true" and "false" tokens.
Just for fun, I've tested more models than I needed to and filled a little table.

GPT-4, the winner, is the oldest model on this list. GPT-3.5 beats Haiku. Neither Mistral nor Mixtral writes machine-readable output. You wouldn't guess this ranking from popular benchmarks. Is Goodhart's Law in play?

Second Labels

I would like the LLM to produce for me a substantially larger, synthetic dataset, so I have a strong diversity of test articles.

I decided against GPT-4 due to the higher token cost. I initially decided to go with Qwen on OpenRouter, but the system wouldn't correctly return logprobs, so I went with my third choice, GPT-4o, and the following prompt:

Role Content
system We're doing semantic deduplication. Given a JSON array with a pair of strings, write `true` (JSON boolean) if both texts largely convey the same declarative information. Ignore meta-text. Write `false` if the information is either contradictory or unrelated. Write nothing else.
user [\"eve has brown hair\",\"the brunette's eve\"]
assistant true
user [\"the moon is cheese\",\"jupiter is hydrogen\"]
assistant false
user [\"Putin topless on a horse\",\"photo: Putin topless on a horse\"]
assistant true
My prompt.

However, it's not enough to ask the model for a simple true or false. I would like to know the model's confidence in each claim, and retain only confident claims. Seemingly ambiguous pairings are valuable when they matter and I can weigh in with clear guidance; but when GPT-4o computes a 51:49 chance of redundancy, that's not what's happening. Highly ambiguous synthetic pairings don't form good requirements on downstream systems, and I would like to filter them out. To this end, I'm going to read the logprobs.

At each step, an LLM does not in fact emit a token, but a vector in which each element represents the weight of a particular token in its dictionary.

For example, suppose we pass to a completion model the text, "I like big butts and I cannot". We should expect a vector indicating a stronger score for "lie" than any other token on the menu, because that's the most probable next word. "stop" should have a lower weight; "banana" should have a lower weight still; "год" should have an even lower weight than that. (In practice, nearly all tokens should have a weight of about zero.)

By calling OpenAI with a single token limit, and requesting the probabilities, we can independently check the probability of both "true" and "false", even if the model would actually write another word altogether. This means:

  1. Each pairing can be placed on a continuum of redundancy.
  2. The model can no longer fail on compliance.
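
A sketch of that call (GPT-4o via the standard chat completions API, with the prompt from the table above; the renormalization over just the two tokens is my choice):

```python
import json
import math

from openai import OpenAI

client = OpenAI()

SYSTEM = ("We're doing semantic deduplication. Given a JSON array with a pair of strings, "
          "write `true` (JSON boolean) if both texts largely convey the same declarative "
          "information. Ignore meta-text. Write `false` if the information is either "
          "contradictory or unrelated. Write nothing else.")
# The few-shot user/assistant examples from the table above would be appended here.

def redundancy(alfa: str, bravo: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": json.dumps([alfa, bravo])}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    p_true = sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "true")
    p_false = sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "false")
    # Place the pair on a continuum regardless of what the model would actually write.
    return p_true / (p_true + p_false) if p_true + p_false else 0.5
```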

I can now run pairings through the model, producing redundancy scores like so:

Alfa Bravo Redundancy
Army Corps of Engineers Army Corps Of Engineers 0.999
cfo cfwv 0
"05, for example, is 3" "035, or 3" 0.377
"You wouldnât buy a used car without checking how many miles were on it, and you shouldnât buy a used DSLR without knowing how many clicks are on the shutter" Knowing the shutter count is like knowing how many miles are on a car and you should act on that knowledge accordingly 0.99991

Note, given that the corpus is in fact produced by GPT-4o, any model we create from it — even a single-parameter threshold — is a distillation of GPT-4o.

We then filter this list according to the model's confidence. That is, we would like to retain claims with labels near 0 or 1, and discard the middle.

retain = abs(label * 2 - 1) > 0.9
Redundancy prediction labels are filtered with an arbitrary, but tight, standard: only labels below 0.05 or above 0.95 survive.

This removes 8,752 records, leaving 48,050.

Strategy One: Embeddings

The bluntest, most obvious solution is to compare embedding vectors. What does this give us?

  1. We employ the model once per sample, not per pairing.
  2. An approximate-nearest-neighbor index optimizes finding candidates for comparison.
  3. Pairwise comparisons are cheap.

But something's missing.

Our question is dichotomous. Our method is not. Our method gives us a distance between two vectors, but that figure is meaningless. Does 0.1 mean redundant? What about 0.05?

This question here was in fact the starting point for this article. Where do I draw the line, and how?

MiniLM-L6

I've tried MiniLM-L6 first as it's the most popular and common embedding model. Let's see how it does.
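
For reference, the distance in question is just cosine distance over normalized sentence embeddings; a minimal sketch, where swapping the checkpoint string is essentially all the later embedding runs change:

```python
from sentence_transformers import SentenceTransformer

# Later sections swap in "paraphrase-mpnet-base-v2" and the nomic model.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_distance(alfa: str, bravo: str) -> float:
    a, b = model.encode([alfa, bravo], normalize_embeddings=True)
    return 1.0 - float(a @ b)

print(cosine_distance("Mary had a little lamb", "Mary had a tiny lamb"))
```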

Scatterplot compares redundancy vs. vector distance, showing little relationship.
Pairings placed according to their vector distance and predicted redundancy.

I joined GPT-4o's redundancy measures, which are expensive — literally, the X axis of this chart cost $30 — with each pair's cosine distance, according to MiniLM-L6 embeddings.

What a mess! We can see a few patterns, but they're not helpful. We see GPT-4o has classified most pairings as unambiguously left (distinct) or right (redundant), but these clouds stretch vertically across the same range, except for a handful of distinct pairings with high distance, visible in the upper left corner. (Note that this plot is rendered with the low-confidence claims I mentioned removing in the previous section.)

If we'd like to find the ideal threshold, we're going to need to compute it. For each candidate threshold, we classify the entire set, then count how often these classifications contradict our benchmark.
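
The sweep itself is a few lines (array names are illustrative; the labels are the benchmark's boolean conclusions after filtering):

```python
import numpy as np

def sweep(distances: np.ndarray, labels: np.ndarray, thresholds) -> list[tuple[float, float, float]]:
    """distances: cosine distance per pair; labels: True where the benchmark says redundant."""
    rows = []
    for t in thresholds:
        predicted = distances < t                     # close enough => call it redundant
        error = float((predicted != labels).mean())   # disagreement with the benchmark
        elimination = float(predicted.mean())         # share of pairs flagged redundant
        rows.append((float(t), error, elimination))
    return rows

# e.g. sweep(distances, labels, np.arange(0.01, 0.30, 0.01))
```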

Threshold Error Rate Elimination
0.01 0.131 0.022
0.03 0.120 0.050
0.04 0.118 0.066
0.05 0.119 0.088
0.07 0.135 0.146
0.09 0.177 0.231
Rates of error and elimination at various thresholds.
Benchmark showing consistently high error rates at all possible configurations.
We can see the dip on the left where errors are minimized.

We see here that the correlation is not zero; we can score some hits.

At the far left edge of this chart, the threshold is zero, meaning that no pairings are deemed close enough to count as redundant. Increasing to 0.04, we see the accuracy improve slightly as a handful of pairings are correctly identified as redundant. The corpus is contradicted at a rate of 11.8%, implying an accuracy of 88.2%, while 6.6% of pairings are flagged as redundant. Presumably, for every pairing we correctly flag as redundant, we flag another incorrectly.

Can we do better?

paraphrase-mpnet

mpnet is a larger model of the same family. According to the SBERT website, "The all-mpnet-base-v2 model provides the best quality, while all-MiniLM-L6-v2 is 5 times faster and still offers good quality." Furthermore, their paraphrase finetune should emphasize semantic similarity of text pairs. Its vectors are 768 dimensional instead of 384, so perhaps it can distinguish values with more nuance.

Threshold Error Rate Elimination
0.16 0.110 0.079
0.18 0.108 0.093
0.2 0.108 0.110
0.21 0.107 0.118
0.22 0.108 0.127
0.24 0.111 0.148
0.26 0.118 0.172
0.28 0.128 0.197
A benchmark graph showing a similar error rate, but better elimination rate.
Measurably better.

Notice the ideal threshold with MiniLM-L6 is 0.04, but with paraphrase-mpnet-base-v2 it's significantly higher, at 0.21. These are radically different numbers, showing how tightly coupled the threshold is to the actual model chosen.

If we had Googled a number to use "with embedding models" and found one, that number would be wrong.

Error rate is roughly the same as MiniLM-L6, but the elimination rate is better: 11.8% of pairings are flagged as redundant instead of 6.6%.

nomic-embed-text-v2-moe

From nomic.ai, nomic-embed-text-v2-moe is an embedding model advertised as beating OpenAI's large embedding models on popular benchmarks. Again, it's drop-in compatible with my benchmark script. I've got a good feeling about it!

Threshold Error Rate Elimination Rate
0.08 0.13168 0.03715
0.1 0.12926 0.04593
0.12 0.12679 0.05628
0.14 0.12491 0.06839
0.15 0.12437 0.07580
0.16 0.12452 0.08435
0.18 0.12683 0.10539
0.2 0.13197 0.13176
0.22 0.14308 0.16419
Chart showing an error rate that remains high.
Nomic employed with various thresholds.

That feeling is wrong. This model never beats mpnet.

OpenAI Text-Embedding-Large

I'm not actually going to try this, because I'm not going to deploy it. It's one thing to use GPT-4o in a one-off manner to produce a corpus. It's another to have to call an API just to get an embedding.

Screenshot from Terminator showing a killer machine.
The world if you run text-embedding-large on your own computer.

Enough Embeddings

I would like a better error rate than 10%, so we're going to move on to heavier, pairwise models. We will circle back and test how we can use embeddings to focus work in a large dataset, but that will require a different benchmark.

Cross-Encoder

Cross-encoders work pairwise; the model will take both texts and evaluate them together. Again, this means vastly more compute than we would need if embeddings were enough, but they're not enough. Onward.

SBERT offers a number of pretrained cross-encoder models, exactly for this purpose, and they're quite easy to use. The model emits a score in range [0,1] so we're again going to need to choose a threshold. The benchmark needs a simple rewrite and we're good to go.

There aren't so many options, and the probable best bet is actually fairly obvious: Quora-Roberta-Large. This model is tuned on the Quora duplicate questions dataset to identify probable duplicates, which is basically the same sort of thing we're doing here.
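
Using it takes only a few lines with the sentence-transformers CrossEncoder wrapper (the pairs here are lifted from the earlier table):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/quora-roberta-large")

pairs = [
    ("The town was incorporated on April 4, 1910",
     "The city incorporated on December 2, 1896"),
    ("The Manhattan Project was a research and development undertaking during World War II "
     "that produced the first nuclear weapons",
     "The Manhattan Project was the name for a project conducted during World War II, "
     "to develop the first atomic bomb"),
]

scores = model.predict(pairs)  # one score in [0, 1] per pair
for (alfa, bravo), score in zip(pairs, scores):
    print(f"{score:.3f}  {alfa[:40]}... / {bravo[:40]}...")
```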

Threshold Error Rate Elimination
0.1 0.0913 0.1313
0.15 0.0884 0.1229
0.17 0.0875 0.1168
0.19 0.0868 0.1101
0.21 0.0873 0.1039
0.23 0.0876 0.1011
0.28 0.0879 0.0961
Graph showing an ambiguous minimum for QRoberta.

This model does beat paraphrase-mpnet with an error rate of 8.6% and elimination rate of 11%. This is nice, provided that we can focus the work.

Good Enough to Start?

I don't see an obvious better model to try. I could experiment with a "small" LLM; however, the smallest are substantially larger than this one and would need to be tuned. And if I'm going to prepare a corpus for tuning, I could just as well tune this one. But that would require, in my opinion, a cleaner corpus better tuned to my actual intent, and that's more labor than it's worth for this article. Instead, let's treat this model as good enough and go to the next step.

Avoiding Work

Optimization is about avoiding compute. We wouldn't want to run this pairwise computation on every pairing in an actual dataset, if the dataset has become large enough. What we would like are ways to disqualify pairings cheaply, and that takes us back to embeddings. Let's have a look at the scatterplot, based now on mpnet:

Scatterplot compares redundancy vs. vector distance, showing some visible relatonship.
Redundancy vs. cosine distance in the labeled corpus.

Over a cosine distance of 0.75, nearly every pairing is labeled non-redundant. It would seem that most are under this line, but that's because our corpus is non-representative. For each sample taken, we took the approximate nearest neighbor according to MiniLM-L6. Two open questions:

  • What is the distribution of cosine distances across all pairings?
  • Can we safely skip analysis of pairings based entirely on their cosine distance?

Consider a loose cosine-distance rule to skip deeper analysis of pairings. In other words, we would only bother invoking the cross-encoder if the cosine distance falls below some threshold. How well would this work?

I'll create a second benchmark. First, I'll take an entirely random sample of 256 lines from the MS MARCO sentences. I will then take the Cartesian product, minus self-comparisons. Then, I will compute both their cosine distance and cross-encoder score. We can then compute the miss rate (false negatives) across the range of thresholds from 0.5 to 1.
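
A sketch of that second benchmark (the 0.19 cross-encoder threshold comes from the table above; unordered combinations stand in for the Cartesian product since the comparison is symmetric, and file names are illustrative):

```python
import itertools
import random

import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("paraphrase-mpnet-base-v2")
cross = CrossEncoder("cross-encoder/quora-roberta-large")

# The sentence file produced earlier.
sentences = [line.rstrip("\n").split("\t", 1)[1]
             for line in open("sentences.tsv", encoding="utf-8")]
sample = random.sample(sentences, k=256)

vectors = embedder.encode(sample, normalize_embeddings=True)
pair_idx = list(itertools.combinations(range(len(sample)), 2))
pairs = [(sample[i], sample[j]) for i, j in pair_idx]
distances = np.array([1.0 - float(vectors[i] @ vectors[j]) for i, j in pair_idx])
redundant = cross.predict(pairs) > 0.19           # Quora-Roberta-Large as ground truth

for ceiling in np.arange(1.0, 0.49, -0.01):
    skipped = distances > ceiling                 # pairs we would never cross-encode
    miss_rate = float((skipped & redundant).sum()) / max(int(redundant.sum()), 1)
    print(f"{ceiling:.2f}\tmiss {miss_rate:.3%}\tskip {float(skipped.mean()):.3%}")
```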

Scatterplot relates redundancy vs. vector distance in the context of a random sample.
Redundancy vs. cosine distance of a random sample, labeled with Quora-Roberta-Large.
Max. Distance Miss Rate Skip Rate
1.0 0.000% 0.000%
0.9 0.000% 0.021%
0.7 0.000% 2.056%
0.62 0.000% 5.245%
0.61 22.857% 6.039%
0.51 77.143% 21.523%
Miss rate and skip rate if pairings are only evaluated under a given cosine distance.
Line chart shows the effect of various dismissal thresholds.
The effect of various dismissal thresholds.

We find it's not that helpful! Discarding all pairings with a cosine distance over 0.62 lets us skip only 5.245% of pairings. As we bring the ceiling down to 0.61, we suddenly miss 22% of redundant pairings.

Concluding Thoughts

I feel the accuracy is workable, given that many of the contested pairings between the cross-encoder and the heavy LLM are legitimately confusing. The LLM is going off a few-shot prompt, and we could gain the same benefit in the cross-encoder with tuning. That tuning may also go a long way toward performance; it may be that, with a good corpus of production data, a smaller base model could be tuned to acceptable performance.