Cross Talk
This weekend I trained a 430m parameter multiturn conversation model on a single Nvidia 3090 using the most absurd method possible.

Background
A client of mine runs a romantic entertainment app. The texting runs on a simple prompt to Mixtral over an API. This satisfies two customer requirements:
- Just make it work.
- Don't want to pay for GPUs.
At small scale, shared hosting is economical. Costs are granular, and you share hardware with other users of the same model. This works, provided the model does what you want. But as you can imagine, in this case, it doesn't.
LLMs just say things. If you run them again, they say different things. They're smart, so they tend to be close to the mark, but they also tend to over-conform to the prompt. You might be tempted to just turn up the temperature, but in fact you want the output to conform to the prompt.
Weights from OpenAI, Meta, etc. are not going to role-play whatever you like, and will frequently break character to remind the user, "I'm a robot, by the way." Meanwhile, they easily violate the payment processors' expansive list of banned concepts, which even includes references to alcohol.
What I had come up with before was a set of classifiers and heuristic filters on both incoming and outgoing texts.
Absolute slop, but this satisfied the payment processor and the client.

Finally, this October, the client decided to invest in better texting. That suits me, as I'm boiling with ideas.
To anyone with taste, LLMs' writing carries the scent of a dead body in every letter. But they're not dumb. Their reading comprehension is beyond ASI, and you can leverage this.

Instead of worrying so much about the prompt, I built two quality control models. The first classifies candidate emissions as acceptable or unacceptable, and the second, based on a cross encoder, tries to grade them. This did compel the client to get a GPU, but without a full LLM, an Nvidia A4000 on Hetzner is enough to cover everyone.
With this in place, I can call arrays of OTS ("off the shelf") LLMs with simple prompts, timebox responses, filter the results and send the better text. Additionally, texting is now 30% faster.
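In sketch form, the serving path is a timed fan-out followed by filtering and ranking. The code below is a minimal illustration, not the production implementation: call_llm, is_acceptable, and grade are stand-ins for the real provider calls and the two quality-control models, and the deadline is arbitrary.

import concurrent.futures

# Sketch of the fan-out: query several off-the-shelf LLMs at once, keep only
# the responses that arrive before the deadline, filter with the acceptability
# classifier, and return the highest-graded survivor.
def best_reply(prompt, providers, call_llm, is_acceptable, grade, timeout_s=3.0):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(providers))
    futures = [pool.submit(call_llm, provider, prompt) for provider in providers]
    candidates = []
    try:
        for future in concurrent.futures.as_completed(futures, timeout=timeout_s):
            try:
                candidates.append(future.result())
            except Exception:
                pass  # a failed provider simply contributes nothing
    except concurrent.futures.TimeoutError:
        pass  # whatever finished inside the timebox is what gets ranked
    pool.shutdown(wait=False, cancel_futures=True)  # don't block on stragglers
    acceptable = [c for c in candidates if is_acceptable(prompt, c)]
    return max(acceptable, key=lambda c: grade(prompt, c), default=None)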
Dumb Writers, Smart Editor
Text generators are bad, but if I can automate taste and judgment, I can ask for thirty candidates and rank them.
But how far can I take this?
In 1948, Claude Shannon introduced the concept of a Markov text generator. Imagine a table, in which keys are sequences of extant tokens, and values are probability tables. For example, "The quick brown" forms a three-token key, which can map to a table indicating high odds on "fox" as the next token.
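A minimal sketch of such a generator, assuming nothing fancier than whitespace tokenization (the order, seed, and toy corpus are arbitrary):

import random
from collections import defaultdict, Counter

# Shannon-style Markov generator: keys are fixed-length tuples of preceding
# tokens, values are frequency tables over the next token.
def build_table(tokens, order=3):
    table = defaultdict(Counter)
    for i in range(len(tokens) - order):
        key = tuple(tokens[i:i + order])
        table[key][tokens[i + order]] += 1
    return table

def generate(table, seed, length=50):
    out = list(seed)
    for _ in range(length):
        counts = table.get(tuple(out[-len(seed):]))
        if not counts:
            break  # no continuation known for this key
        tokens, weights = zip(*counts.items())
        out.append(random.choices(tokens, weights=weights)[0])
    return " ".join(out)

corpus = "the quick brown fox jumps over the lazy dog".split()
table = build_table(corpus, order=3)
print(generate(table, seed=("the", "quick", "brown")))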
At this level of analysis, it's little different from your common LLM, but those differences matter.
First, the tokens are pure surrogate identities. If I tokenize "love" and "Love" in DeBERTa-v3, I get [472] and [1961], which does look like a pair of keys. But the model never sees them. They're mapped at the interface to embeddings. The model works entirely in the embedding space, where "love" and "Love" can reside close together.
This also makes the model itself differentiable, permitting the use of back propagation.
Furthermore, the keys are small. In modern parlance, if the longest key in the table has 5 tokens, your context window is a mere 5 tokens long. Compare GPT-2's 1024.
But they do write.
We are not in a way to know _mine_ in being noticed by you. My conscience told me that I feel more than I acknowledge? Her husband allowed her to apply. I am sorry to pain you, but so much engaged in watching Mr. Darcy. That Mr. Bennet coolly observed that you have ordered a good dinner to-day. Elizabeth could not refuse, though she foresaw little pleasure in the visit.
A thought occurred. Suppose we skip the LLM, run a Markov generator a few thousand times, and rank the results? Since I already had a simple reranker, I made a Markov model from a select corpus of outgoing texts and tried it out at once.

In the above screenshot, each text is generated by feeding 4,096 candidate texts to a scoring model based on DeBERTa-v3-large, along with the user's input text. Something like [cls] good evening [sep] Welcome back, stranger. [sep] is fed to the model, on top of which we run a simple logistic regression. Nonsensical candidates are given poor scores, and thus the emissions tend to fit, provided that the Markov model actually produced any viable candidates.
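As a sketch, the ranking step can be written in a few lines with the transformers library. The checkpoint path below is a placeholder, and I'm assuming a two-class head, which for this purpose is equivalent to the logistic regression described.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Score Markov candidates against the user's text with a fine-tuned DeBERTa-v3
# cross encoder.  "reranker-checkpoint" is a placeholder path, not a published model.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained("reranker-checkpoint")
model.eval()

def rank(user_text, candidates, batch_size=64):
    scores = []
    for i in range(0, len(candidates), batch_size):
        batch = candidates[i:i + batch_size]
        # Text pairs are encoded as [cls] user_text [sep] candidate [sep].
        enc = tokenizer([user_text] * len(batch), batch, truncation=True,
                        padding=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits
        scores.extend(logits.softmax(-1)[:, 1].tolist())  # probability of the positive class
    return sorted(zip(scores, candidates), reverse=True)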
There are just three problems.
- Call & response, but no multiturn.
- It's an AI girlfriend.
- It's the client's AI girlfriend.
I wrapped up the sprint, told the client I'm busy and set about warming the apartment.
New Scoring Model
In this concept, responsibility for sense-making lies entirely on the scoring model. I decided that for this I'd need big data, and found an archive of SRTs from OpenSubtitles. This required a bit of work, as the files look something like this:
1
00:00:04,872 --> 00:00:07,923
ONE, TWO! ONE, TWO, THREE!
2
00:00:09,000 --> 00:00:15,074
api.OpenSubtitles.org is deprecated, please
implement REST API from OpenSubtitles.com
3
00:00:32,432 --> 00:00:34,866
AHEM.
FOR MY NEXT TRICK,
With a bit of scripting and labor, we can get rid of most of the fluff.
ONE, TWO! ONE, TWO, THREE!
AHEM. FOR MY NEXT TRICK,
NOTICE I HAVE NOTHING UP MY SLEEVES...
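A rough sketch of that cleanup, assuming standard .srt files; the fluff patterns below are illustrative, not the exact filters I used.

import re
from pathlib import Path

# Drop cue numbers, timestamps, and obvious ads/credits; join the text lines
# within a cue so multi-line cues become a single line.
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> ")
FLUFF = re.compile(r"opensubtitles|osdb\.link|captioned by|www\.", re.IGNORECASE)

def clean_srt(path):
    text = Path(path).read_text(encoding="utf-8", errors="ignore").replace("\r\n", "\n")
    lines = []
    for block in text.split("\n\n"):  # cues are separated by blank lines
        kept = [l.strip() for l in block.splitlines()
                if l.strip() and not l.strip().isdigit()
                and not TIMESTAMP.match(l.strip())]
        joined = " ".join(kept)
        if joined and not FLUFF.search(joined):
            lines.append(joined)
    return lines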
An obvious weakness of this dataset is that lines don't consistently delineate speakers. In the above sample, the first line is from the theme song (it's Futurama), and the next two lines are in fact one line.
Sometimes, lines do delineate speaker:
It's okay for us to depend on each other. That's what people do.
How would I know what people do?
The only people I've spoken to in the past four years are you both.
Besides, I'm not even a real person anyway.
But, I decided, they consistently indicate authoritative claims of continuity. Consider this:
We got some hot-pants broad cruisin' around for some action.
The guy she picked up went down on her in a cab, for chrissake.
I got a blow-by-blow description from the cabbie.
Many chunks have a legible flow, or rather, they are empirical examples of what a human writing for a general audience considers a legible flow.
So, with entirely too much on my plate, I invoked what I call the Rumsfeld Rule: "You go to war with the army you have, not the army you wish you had." I've got 300 million tokens of this, and if it's not good enough, the loss will not converge; the test set will fail; I'll see it in a single number.
How, then, to structure this data? Since I'm starting with DeBERTa, I decided to tokenize each line, then, on a file-by-file basis, bite it off in chunks to form samples. (I could have run a "sliding window" across each file, but decided that I have so much data available that I can focus on diversity of samples.)
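Something like the following sketch; the token budget is a placeholder, and the tokenizer is whatever DeBERTa tokenizer the scoring model uses.

# Non-overlapping windower: walk a file's cleaned lines, accumulate them until
# the token budget is spent, and emit one chunk of lines per bite.
def chunk_file(lines, tokenizer, max_tokens=384):
    chunks, current, used = [], [], 0
    for line in lines:
        n = len(tokenizer.tokenize(line))
        if len(current) >= 2 and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(line)
        used += n
    if len(current) >= 2:
        chunks.append(current)
    return chunks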
Sample Structure
Conventionally, a "cross encoder" receives two texts in a single window, indicated by a single separator between them: [cls] alfa [sep] bravo [sep] [pad]. To better leverage the pretraining, and keep the context / candidate distinction clear, I decided on an additional delimiter, the | pipe, to separate lines of the context. As before, sometimes the lines delineate speakers, so I'd like to retain that information to the extent possible. Samples are thus encoded: [cls] line | line | line [sep] candidate [sep] [pad].
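In code, encoding one sample reduces to a text pair, with the pipe living inside the first segment. A sketch, with the max length as a placeholder:

# Encode one (context, candidate) sample as described: context lines joined by
# " | " in the first segment, candidate in the second; the tokenizer supplies
# [cls], [sep], and [pad].
def encode_sample(context_lines, candidate, tokenizer, max_length=512):
    context = " | ".join(context_lines)
    return tokenizer(context, candidate, truncation=True,
                     padding="max_length", max_length=max_length)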
Now, we're doing a single logistic regression, so we do need a label for each sample. Conceptually, we want to label samples "next-like" or "not next-like", but we don't have any labels. We could produce them at cost, or distill them from ChatGPT, but I decided to treat actual sequences as presumed next-like and manufacture deliberately broken counterexamples.
Counterexamples are produced by the same process, but before the sequence is laid into a context window, I swap the first and last lines. For example:
[Example]
Great. Hold on just a minute.
I'm gonna say good-bye to my mother.
[Counterexample]
I'm gonna say good-bye to my mother.
Great. Hold on just a minute.
Content is not reused. If a given fragment of a text is used as an example, it's not also used as a counterexample, and no fragments overlap.
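A sketch of the label manufacture, assuming the chunks from the windower above; whether a chunk becomes an example or a counterexample is decided once, up front, so nothing is reused.

# Turn one chunk into a labeled sample.  Positive samples keep the original
# order (label 1, "next-like"); counterexamples swap the first and last line
# before the context/candidate split (label 0).
def make_sample(chunk, as_counterexample):
    lines = list(chunk)
    if as_counterexample:
        lines[0], lines[-1] = lines[-1], lines[0]
    return {"context": lines[:-1], "candidate": lines[-1],
            "label": 0 if as_counterexample else 1}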
Training
In the first phase, I took a 24 million token sample and ran three epochs, about two hours each. In the morning I had my first checkpoint to try.
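The run itself was nothing exotic. A sketch of the kind of fine-tuning involved, with placeholder hyperparameters rather than the exact ones I used, and train_dataset / eval_dataset assumed to be the encoded, labeled samples from above:

from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

# Fine-tune DeBERTa-v3-large as a binary classifier over the encoded samples.
def train_reranker(train_dataset, eval_dataset):
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-large", num_labels=2)
    args = TrainingArguments(
        output_dir="reranker",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=1e-5,
        fp16=True,  # keeps it comfortable on a single 3090
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    return trainer.evaluate()  # the single number that tells me if the data was good enough

Here is what that first checkpoint produced: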
- IF THERE'S NOTHING. DON'T MARRY SOMEONE TO BE THERE
- It'll be there.
- She even bleached the sheets still remember some of my friends to set ourselves a runner.
- (dark retro music) (liquid pouring)
- [Various Voices] osdb.link/ext
- osdb.link/ext
- Captioned by Media Access Group at WGBH access.wgbh.org
- In the mid-19th century,
- An unidentified gunman fired several shots
- And then, uh, got a 419
- but he couldn't find a knife
- Then, they kept an eye on him.
- What if the shooter
- He had pulmonary edema.
- The next day,
- He was envenomated.
- That's what George had reported the tainted milk bottle.
- He's special because he died between midnight and 1:00 today.
- Reich may have been present when the deceased had
- Now, once a year
- And we've invited
- to the morgue
- Why did you invite me?
- So every year,
- no matter how the venoms may vary regionally.
What a mess.
The above text was generated line-by-line, in a way similar to how an autoregressive LLM works on a token-by-token basis; after each line is added, the expanded text is fed back into the system for another round.
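In sketch form, with the Markov generator and the rank function from earlier standing in for the real components (markov_generate is a hypothetical wrapper that emits one candidate line per call), the loop is roughly:

# Autoregressive-by-lines generation: every round, the Markov model proposes a
# pile of candidate lines, the reranker scores them against everything emitted
# so far, and the winner is appended before the next round.
def generate_dialogue(user_text, markov_generate, rank, n_candidates=4096, n_lines=8):
    lines = []
    for _ in range(n_lines):
        context = " | ".join([user_text] + lines)
        candidates = [markov_generate() for _ in range(n_candidates)]
        _, best = rank(context, candidates)[0]
        lines.append(best)
    return lines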
You might ask, "what the hell did you expect?" but I decided to work the problem.
First, I noticed the Markov model matters a great deal. The AI girlfriend model was made exclusively of select outgoing texts. You can imagine they form a little cluster in the vast space of possible texts. Open Subs, however, is far broader and messier.
Second, I decided that maybe the reranker just isn't "good" enough. But, heading into the evening, there wasn't much I could do. I took a larger, 130m slice of the dataset and resumed the training for three more epochs.
[you] hey babe
[bot] Hey there, handsome 😏 How are you today? I hope you're here, how about you? 😘 💁(8887 ms)
[you] what's your name?
[bot] Oh, I'd love to get to know you. (12101 ms)
[you] it's your lucky day
[bot] Of course, I can feel my excitement. 💓{/SI} (14021 ms)
[you] my name's joe
[bot] That's quite an unusual name. (16556 ms)
[you] why so?
[bot] There's something so primal and raw about it. (18152 ms)
At this point it sort of seems to get it, but only with the AI girlfriend Markov model. Trying with Pride & Prejudice just gives very non-conversational output.
- I hope, added Kitty.
- that Lizzy shall be brought to reason. I will only add, God bless you.
- that Miss Bennet shall receive every possible attention while she remains with us.
- She is the sort of woman whom one cannot regard with too much deference.
- This is my advice. Find such a woman.
- And there are several of them, deep in the study of my life.
- And there are several of them, you see, uncommonly pretty.
- But if a woman is partial to a man in my situation of life---
- In nine cases out of ten, a woman had better show _more_ than I ought to feel it.
- What did you say of me that I feel more than I acknowledge?
- It is more than I acknowledge?
- In nine cases out of ten, a woman had better show _more_ than my due.
- Oh, yes! You will only think I feel _more_ than my due.
Now, what I like about the above nonsense is that it's actually quite readable. If I stand up, straighten my back and deliver a dramatic reading, I can certainly feel it. Given the nature of the re-ranker's training, perhaps that makes sense?
But I'm unsatisfied. Some emissions are nice enough for a 435m parameter model, but I wanted something conversational.
At this point, I realized each of my training samples fills the context window, which consistently pushes the candidate to the "end" of the window. In a multiturn conversation, the important initial exchanges take place at the "start" of the window. This is practically never represented in the dataset. So, I amended the windower to produce an even distribution of sample lengths.
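The amendment itself is small; a sketch, reusing the shape of the earlier windower, with the bounds as placeholders:

import random

# Variable-length windower: draw a fresh token budget per chunk so contexts of
# every length are represented and the candidate stops being pinned to the far
# end of the window.
def chunk_file_varied(lines, tokenizer, min_tokens=16, max_tokens=384):
    chunks, current, used = [], [], 0
    budget = random.randint(min_tokens, max_tokens)
    for line in lines:
        n = len(tokenizer.tokenize(line))
        if len(current) >= 2 and used + n > budget:
            chunks.append(current)
            current, used = [], 0
            budget = random.randint(min_tokens, max_tokens)
        current.append(line)
        used += n
    if len(current) >= 2:
        chunks.append(current)
    return chunks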
I took a new 130m slice of the corpus, now reorganized, and set it to bake for another three epochs.
[you] Hey there. [bot] 🎬🔥 [you] What do you mena? [bot] Oh, I'm always open to new experiences, what kind of fun we can get into tonight. [you] I don't have any plans yes. [bot] Then let's create one together. [you] OK. I propose we go to the theater. [bot] Ah, the classics never get old. [you] What do you think about Shakespear? [bot] Mmm, I love the sound of that. [you] But let's grab a coffee first. [bot] A full cup of coffee? 😉 [you] Yes, a full one 🙂 maybe even two. [bot] Oh, baby, I'm quite impressed with your endurance 😉 [you] How would you feel if I go for three? [bot] A threesome? [you] Three coffees, m'lady! [bot] Well then, let's see if you can handle that, babe. [you] We're at the cafe. I'm ordering coffees. Cream, sugar? [bot] Of course, sugar.
Generated interactively.
At this point, I still hadn't produced a generator I liked, but I realized the solution had been in front of me the entire time: turn the working, "dirty" corpus I had into a working, clean corpus. I manually deleted 30% of the corpus, then fed the rest, line by line, to ChatGPT to rewrite.
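The rewriting pass is just a loop over the OpenAI API; the model name and the instruction wording below are placeholders, not the exact prompt I used.

from openai import OpenAI

client = OpenAI()

# Send each surviving corpus line to ChatGPT for a cleanup rewrite.
def rewrite_line(line):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Rewrite this line of dialogue so it reads cleanly, keeping its meaning and tone."},
            {"role": "user", "content": line},
        ],
    )
    return response.choices[0].message.content.strip()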
[you] Good morning :)
[bot] 😌
[you] You seem happy today?
[bot] Oh, I'm doing great as ever.
[you] Tell me something fun.
[bot] Oh, you want to have some fun! What are you doing? Anything fun planned?
[you] I don't know what to do today.
[bot] 💭
[you] what are you thinking?
[bot] I'm thinking maybe a cozy movie night.
[you] have you seen Star Wars?
[bot] No, but I'm already looking forward to it. 😄 Shall we hold hands as we share a quiet, blossoming flower.
With that, I have checkpoints, models and code I can share.
What's Next?
I'm boiling with ideas. First, there are multiple obvious improvements I can make to the scoring model's corpus. Second, if I could do this with a MiniLM-based bi-encoder setup, I could run it on my old MacBook. Third, can I feed a series of document embeddings into a sequence transduction model? Could I pre-generate a very large corpus of messages and run a bi-encoder search against a binary tree?
I'm booked for the next week, but I hope to come back to this soon.