
Cross Talk

This weekend I trained a 435m parameter multiturn conversation model on a single Nvidia 3090, using an unorthodox architecture. Checkpoint on HuggingFace.

A screenshot of a texting conversation with a bot.
The text must flow.

Background

A client runs a romantic entertainment app and wants improved texting. The system runs a simple prompt on Mixtral over an API. Initially, this satisfied two customer requirements:

  1. Just make it work.
  2. Don't pay for GPUs.

At small scale, shared hosting is economical. Costs are granular, and you share hardware with other users of the same model. This works, provided the model does what you want. But as you can imagine, in this case, it doesn't.

Weights from OpenAI, Meta, etc. are not going to role-play anything you like, and will frequently break character to remind the user, "I'm a robot, by the way." Meanwhile, they easily violate the payment processors' expansive list of banned concepts, which even include references to alcohol.

What I had come up with before was a set of classifiers and heuristic filters on both incoming and outgoing texts.

Absolute slop, but this satisfied the payment processor and the client for a while.

Johnny Five, from the film Short Circuit, cooking. This will be some good slop, he says.

However, the client decided to invest in better texting. That suits me, as I'm boiling with ideas.

To anyone with taste, LLMs' writing carries the scent of a dead body in every letter. But they're not dumb. Their reading comprehension is beyond ASI, and you can leverage this.

Screenshot in which I sent ChatGPT a mixed-language message with heavy slang.
ChatGPT understands perfectly when I tell it I'm going through a breakup, and I need it to send me a heart emoji.

Instead of worrying so much about the prompt, I built two quality control models. The first classifies candidate emissions as acceptable or unacceptable, and the second, based on a cross encoder, tries to grade them. This did compel the client to get a GPU, but without a full LLM, an Nvidia A4000 on Hetzner is enough to cover everyone.

With this in place, I can call arrays of OTS ("off the shelf") LLMs with simple prompts, timebox responses, filter the results and send the best text. Additionally, texting is now 30% faster.
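
Roughly, that pipeline looks like the sketch below. The function names and the timeout are mine and purely illustrative; the generators are assumed to be async wrappers around whatever LLM APIs are in play, and accept/score stand in for the two quality control models.

import asyncio

async def best_reply(prompt, generators, accept, score, timeout=2.0):
    # Fan out the prompt to an array of off-the-shelf LLMs.
    tasks = [asyncio.create_task(g(prompt)) for g in generators]
    # Timebox: anything still running when the clock runs out is dropped.
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    candidates = [t.result() for t in done if t.exception() is None]
    # First QC model: keep only acceptable candidates.
    candidates = [c for c in candidates if accept(c)]
    # Second QC model: the cross-encoder grades what's left; send the best.
    return max(candidates, key=lambda c: score(prompt, c), default=None)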

Dumb Writers, Smart Editor

Text generators are bad, but if I can automate taste and judgment, I can ask for thirty candidates and rank them.

But how far can I take this?

In 1948, Claude Shannon introduced the concept of a Markov text generator. Imagine a table, in which keys are sequences of extant tokens, and values are probability tables. For example, "The quick brown" forms a three-token key, which can map to a table indicating high odds on "fox" as the next token.
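
A toy sketch of such a table, with the order and corpus as placeholder values:

import random
from collections import Counter, defaultdict

def build_table(tokens, order=3):
    # Map each order-token key to a frequency table of observed next tokens.
    table = defaultdict(Counter)
    for i in range(len(tokens) - order):
        table[tuple(tokens[i:i + order])][tokens[i + order]] += 1
    return table

def generate(table, seed, order=3, length=20):
    # Walk the table, sampling each next token in proportion to its count.
    out = list(seed)
    for _ in range(length):
        counts = table.get(tuple(out[-order:]))
        if not counts:
            break
        tokens, weights = zip(*counts.items())
        out.append(random.choices(tokens, weights=weights)[0])
    return out

corpus = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(generate(build_table(corpus), ("the", "quick", "brown"))))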

At this level of analysis, it's little different from your common LLM, but those differences matter.

If I tokenize "love" and "Love" in DeBERTa-v3, I get [472] and [1961], which does look like two simple keys. But the model never sees them. They're mapped at the interface to embeddings. The model works entirely in the embedding space, where "love" and "Love" can reside close together. This also makes the modern model differentiable, permitting the use of backpropagation.
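
To make that concrete, here's a small sketch. The IDs in the comments are the ones I saw; the similarity you get depends on the checkpoint.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModel.from_pretrained("microsoft/deberta-v3-large")

lo = tok.encode("love", add_special_tokens=False)   # [472]
hi = tok.encode("Love", add_special_tokens=False)   # [1961]

# The IDs are just table indices. The model only ever sees their embeddings,
# which can sit close together even when the IDs are far apart.
emb = model.get_input_embeddings()
v1 = emb(torch.tensor(lo)).mean(dim=0)
v2 = emb(torch.tensor(hi)).mean(dim=0)
print(torch.cosine_similarity(v1, v2, dim=0).item())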

Your classic, pre-Eisenhower text generator has no such spatialization of concepts, and cannot leverage deep learning. It is "just statistics."

Furthermore, the keys are small. In modern parlance, if the longest key in the table has 5 tokens, your context window is a mere 5 tokens long. Compare GPT-2's 1024. This is a natural consequence of the prior point. 5 tokens isn't enough to consistently make sense, but as you push that number higher, overfitting happens fast, and it begins to regurgitate documents verbatim.

But they do write.

We are not in a way to know _mine_ in being noticed by you. My conscience told me that I feel more than I acknowledge? Her husband allowed her to apply. I am sorry to pain you, but so much engaged in watching Mr. Darcy.
Random text from a Markov model built from Pride & Prejudice.

Suppose we skip the LLM, run a Markov generator a few thousand times, and rank the results? With a simple reranker in hand, I made a generative model from a select corpus of outgoing texts and typed in some cringe.

First test combining the Markov generator with reranker.
Here we see the Markov generator with a simple reranker.

In the above screenshot, each text is generated by feeding 4,096 candidate texts to a scoring model based on DeBERTa-v3-large, along with the user's input text. Something like { [cls] good evening [sep] Welcome back, stranger. [sep] } is fed to the model, with which we run a simple logistic regression. Nonsensical candidates are given poor scores, and thus the emissions tend to fit, provided that the Markov model actually produced any viable candidates. (As an aside, I use DeBERTa-v3's tokenizer & dictionary for the Markov model as well, so it directly emits compatible tokens.)
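
The scoring pass itself is plain Torch plus Transformers. A sketch of the mechanics, with an untrained single-logit head standing in for the fine-tuned scorer:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative base model; the real scorer is a fine-tuned DeBERTa-v3-large.
tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=1)

def score(context, candidates, batch_size=64):
    # Encode { [cls] context [sep] candidate [sep] } pairs and run the logistic head.
    scores = []
    model.eval()
    with torch.no_grad():
        for i in range(0, len(candidates), batch_size):
            batch = candidates[i:i + batch_size]
            enc = tok([context] * len(batch), batch, padding=True,
                      truncation=True, return_tensors="pt")
            logits = model(**enc).logits.squeeze(-1)
            scores.extend(torch.sigmoid(logits).tolist())
    return scores

candidates = ["Welcome back, stranger.", "purple monkey dishwasher"]
for s, c in sorted(zip(score("good evening", candidates), candidates), reverse=True):
    print(f"{s:.4f}  {c}")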

There are just three problems.

  1. Call & response, but no multiturn.
  2. It's an AI girlfriend.
  3. It's the client's AI girlfriend.

I wrapped up the sprint, told the client I'm busy and set about warming the apartment with my humble GPU.

New Scoring Model

In this concept, responsibility for sense-making lies entirely with the scoring model. I decided I'd need big data for this, and found an archive of SRTs from Open Subtitles. This required a bit of work, as the files look about like so:

1
00:00:04,872 --> 00:00:07,923
ONE, TWO! ONE, TWO, THREE!

2
00:00:09,000 --> 00:00:15,074
api.OpenSubtitles.org is deprecated, please
implement REST API from OpenSubtitles.com

3
00:00:32,432 --> 00:00:34,866
AHEM.
FOR MY NEXT TRICK,
SRT files encode subtitles for shows and films.

With a bit of scripting and labor, we can get rid of most fluff.

ONE, TWO! ONE, TWO, THREE!

AHEM. FOR MY NEXT TRICK,

NOTICE I HAVE NOTHING UP MY SLEEVES...
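
The cleanup amounts to dropping cue numbers, timestamps and the ad cues, then joining the rest of each cue into one line. Something along these lines, where the ad patterns are simply the ones I noticed in the dumps:

import re

TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> ")
ADS = ("opensubtitles", "osdb.link", "captioned by")

def clean_srt(text):
    # Split on blank lines into cues, strip numbering and timing, drop ad cues.
    lines = []
    for block in re.split(r"\n\s*\n", text):
        cue = [l.strip() for l in block.splitlines()
               if l.strip() and not l.strip().isdigit()
               and not TIMESTAMP.match(l.strip())]
        if not cue or any(p in l.lower() for l in cue for p in ADS):
            continue
        lines.append(" ".join(cue))
    return lines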

An obvious weakness of this dataset is that lines don't consistently delineate speakers. In the above sample, the first line is from the theme song (it's Futurama) and the next two lines are in fact one line.

Sometimes, lines do delineate speakers:

It's okay for us to depend on each other. That's what people do.

How would I know what people do?

But, I decided, they consistently indicate authoritative claims of continuity. Consider:

We got some hot-pants broad cruisin' around for some action.

The guy she picked up went down on her in a cab, for chrissake.

I got a blow-by-blow description from the cabbie.

I see a legible flow, but importantly, it's an empirical example of what a human writing for a general audience considers a legible flow.

So, I invoked the Rumsfeld Rule: "You go to war with the army you have, not the army you wish you had." If it's not, in fact, good enough, I'll see it empirically. The loss will converge, and the text will flow — or not.

How, then, to structure this data? Since I'm starting with DeBERTa, I decided to tokenize each line, then, on a file-by-file basis, bite it off in chunks to form samples. (I could have run a "sliding window" across each file, but decided that I have so much data available that I can focus on diversity of samples.)

Sample Structure

Conventionally, a "cross encoder" consumes two texts in a single window, indicated by a single separator between them. { [cls] alfa [sep] bravo [sep] [pad] … }. To better leverage the pretraining, and keep the context / candidate distinction clear, I decided on an additional delimiter, the | pipe, to separate lines of the context. (DeBERTa's dictionary does not include a newline.) As before, sometimes the lines delineate speakers, so I'd like to retain that information to the extent possible. Samples are thus encoded: { [cls] line | line | … [sep] candidate [sep] [pad] … }.
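
In code, building one such sample looks about like this; the max length is arbitrary here:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")

def encode_sample(context_lines, candidate, max_len=512):
    # { [cls] line | line | ... [sep] candidate [sep] [pad] ... }
    context = " | ".join(context_lines)
    return tok(context, candidate, padding="max_length",
               truncation=True, max_length=max_len, return_tensors="pt")

enc = encode_sample(["Great. Hold on just a minute.",
                     "I'm gonna say good-bye to my mother."],
                    "Okay, but hurry.")
print(tok.decode(enc["input_ids"][0][:32]))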

Now, we're doing a single logistic regression, so we do need a label for each sample. Conceptually, we want to label samples "next-like" or "not next-like". We could produce labels at cost, or perform a distillation off ChatGPT, but I decided to take sequences as presumed next-like, and deliberately manufacture counterexamples by deviating from the screenwriters' order. Before the sequence is laid into a context window, I flip the first and last line. For example:

[Example]
Great. Hold on just a minute.
I'm gonna say good-bye to my mother.

[Counterexample]
I'm gonna say good-bye to my mother.
Great. Hold on just a minute.

If a given fragment of a text is used as an example, it's not also used as a counterexample, and no fragments overlap. The windower alternates producing examples and counterexamples, which are then shuffled in training.
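
A sketch of the windower. The chunk length is a parameter here; in this first version every chunk fills the context window.

def make_sample(chunk, positive):
    # One non-overlapping chunk of lines becomes (context, candidate, label).
    chunk = list(chunk)
    if not positive:
        # Counterexample: flip the first and last line before splitting.
        chunk[0], chunk[-1] = chunk[-1], chunk[0]
    return chunk[:-1], chunk[-1], int(positive)

def windower(lines, chunk_len):
    # Walk the file in fixed-size, non-overlapping chunks, alternating labels.
    positive = True
    for i in range(0, len(lines) - chunk_len + 1, chunk_len):
        yield make_sample(lines[i:i + chunk_len], positive)
        positive = not positive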

Training

In the first phase, I took a 24 million token sample and ran three epochs, about two hours each. In the morning I had my first checkpoint to try.
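
The loop itself is nothing exotic. A minimal sketch, assuming a dataset of already-encoded samples carrying a float label per sample:

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification

def train(dataset, epochs=3, lr=1e-5, batch_size=8):
    # A single-logit head plus BCE gives the simple logistic regression over [cls].
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-large", num_labels=1).cuda()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for batch in DataLoader(dataset, batch_size=batch_size, shuffle=True):
            batch = {k: v.cuda() for k, v in batch.items()}
            logits = model(input_ids=batch["input_ids"],
                           attention_mask=batch["attention_mask"]).logits.squeeze(-1)
            loss = loss_fn(logits, batch["labels"].float())
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model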

- IF THERE'S NOTHING. DON'T MARRY SOMEONE TO BE THERE
- It'll be there.
- She even bleached the sheets still remember some of my friends to set ourselves a runner.
- (dark retro music) (liquid pouring)
- [Various Voices] osdb.link/ext
- osdb.link/ext
- Captioned by Media Access Group at WGBH access.wgbh.org
- In the mid-19th century,
- An unidentified gunman fired several shots
- And then, uh, got a 419
- but he couldn't find a knife
- Then, they kept an eye on him.
- What if the shooter
- He had pulmonary edema.
- The next day,
- He was envenomated.
Generated line-by-line, non-interactive.

What a mess.

After each line is added, the expanded text is fed back into the system for another round, similar to an autoregressive LLM but with a different level of granularity.
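
The non-interactive loop from the sample above is just this, with the Markov generator and the scorer passed in as callables:

def continue_text(context_lines, markov_generate, score, n_candidates=4096, n_lines=8):
    # Extend the text one line at a time: sample candidates, score them against
    # the running context, keep the best, and feed the result back in.
    lines = list(context_lines)
    for _ in range(n_lines):
        candidates = [markov_generate() for _ in range(n_candidates)]
        scores = score(" | ".join(lines), candidates)
        lines.append(candidates[max(range(len(scores)), key=scores.__getitem__)])
    return lines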

You might ask, "what the muffin did you expect?", but I decided to go ahead and work the problem.

First, I noticed the Markov model matters a great deal. The AI girlfriend model was made exclusively of select outgoing texts. You can imagine they form a little cluster in the vast space of possible texts. Open Subs is more expansive.

Second, I decided that maybe the reranker just isn't "good" enough. But, heading into the evening, there wasn't much I could do. I took a larger, 130m slice of the dataset and resumed the training for three more epochs.

[you] hey babe
[bot] Hey there, handsome 😏 How are you today? I hope you're here, how about you? 😘 💁
[you] what's your name?
[bot] Oh, I'd love to get to know you.
[you] it's your lucky day
[bot] Of course, I can feel my excitement. 💓{/SI} 
[you] my name's joe
[bot] That's quite an unusual name.
[you] why so?
[bot] There's something so primal and raw about it.
Generated line-by-line, interactive.

At this point it sorta seems to get it, but only with the AI girlfriend Markov model. Trying with Pride & Prejudice just gives very non-conversational output.

- I hope, added Kitty.
- that Lizzy shall be brought to reason. I will only add, God bless you.
- that Miss Bennet shall receive every possible attention while she remains with us.
- She is the sort of woman whom one cannot regard with too much deference.
- This is my advice. Find such a woman.
- And there are several of them, deep in the study of my life.
- And there are several of them, you see, uncommonly pretty.
- But if a woman is partial to a man in my situation of life---
- In nine cases out of ten, a woman had better show _more_ than I ought to feel it.

Now, what I like about the above nonsense is that it's actually quite readable. If I stand up, straighten my back and deliver a dramatic reading, I can certainly feel it. Given the nature of the reranker's training, perhaps that makes sense?

But I'm unsatisfied. Some emissions are nice enough for a 435m parameter model, but I wanted conversational output.

At this point, I realized each of my training samples fills the context window, which consistently pushes the candidate to the "end" of the window. In a multiturn conversation, the important initial exchanges take place at the "start" of the window. This is practically never represented in the dataset. So, I amended the windower to produce an even distribution of sample lengths.
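
The amendment is small: instead of always filling the window, draw each chunk's length uniformly, so short, early-conversation contexts show up as often as full ones. A line-based simplification, with the limits as illustrative values:

import random

def windower_even(lines, min_len=2, max_len=64):
    # Same non-overlapping walk as before, but with uniformly drawn chunk lengths.
    positive, i = True, 0
    while i + min_len <= len(lines):
        n = random.randint(min_len, max_len)
        chunk = lines[i:i + min(n, len(lines) - i)]
        i += len(chunk)
        yield make_sample(chunk, positive)   # make_sample from the earlier sketch
        positive = not positive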

I took a new 130m slice of the corpus, now reorganized, and set it to bake for another three epochs.

[you] Hey there.
[bot] 🎬🔥
[you] What do you  mena?
[bot] Oh, I'm always open to new experiences, what kind
      of fun we can get into tonight. 
[you] I don't have any plans yes.
[bot] Then let's create one together.
[you] OK. I propose we go to the theater.
[bot] Ah, the classics never get old.
[you] What do you think about Shakespear?
[bot] Mmm, I love the sound of that.
[you] But let's grab a coffee first.
[bot] A full cup of coffee? 😉
[you] Yes, a full one 🙂 maybe even two.
[bot] Oh, baby, I'm quite impressed with your endurance 😉
[you] How would you feel if I go for three?
[bot] A threesome?
[you] Three coffees, m'lady!
[bot] Well then, let's see if you can handle that, babe.
[you] We're at the cafe. I'm ordering coffees. Cream, sugar?
[bot] Of course, sugar.
Generated interactively. Typos are mine.

At this point, I had no luck producing a generator I liked, but realized the solution had been in front of me the entire time: turn the working, "dirty" corpus I had into a working, clean corpus. I manually deleted 30% of the corpus, then fed the rest, line by line, to ChatGPT to rewrite.

[you] Good morning :)
[bot] 😌 
[you] You seem happy today?
[bot] Oh, I'm doing great as ever.
[you] Tell me something fun.
[bot] Oh, you want to have some fun! What are you doing? Anything fun planned?
[you] I don't know what to do today.
[bot] 💭
[you] what are you thinking?
[bot] I'm thinking maybe a cozy movie night.
[you] have you seen Star Wars?
[bot] No, but I'm already looking forward to it. 😄 Shall we hold hands as we share a quiet, blossoming flower.

With that, I have checkpoints, models and code I can share.

The Sort

Let's have a peek into the sort. I write, "hey let's go see star wars tonight." The system generates 4k wholly random, zero-context candidates. Each one is then encoded with the stimulus and history, for example, { [cls] hey let's go see star wars tonight. [sep] Because it's like I should be arrested. [sep] }. The sequences are given to a Torch DataLoader and fed to the scoring model in batches. I've attached a log here, but let's look at the ends.

text score
Oh, definitely! I've heard great things about it! I promise to cherish this rose. 0.9707
Sounds like a plan! We're just making me more excited, aren't you? I like it. 0.9624
Oh, wow. That's great to hear! I'll see you there. You have some fun, friend. I'm yours. 0.9556
Oh hey, that sounds fun! 💭💭 0.9512
I can definitely arrange that. I value your company. 0.9468
is magical! The enchanting atmosphere of this forest. 0.0016
a game, and if you'd like to see next. 0.0015
when you call me that before we move forward. 0.0007
well — I'll treasure it always and thinkin' about you all day. 0.0004
describe that feeling, or maybe offer a listening ear. 0.0001
The top five and bottom five candidates.

We can see plainly it's entirely accidental that a suitable generation occurred at all.

One virtue of this approach is the total decoupling between the generator and scoring models. The generator can easily be modified or swapped out, with zero "catastrophic forgetting" or any other damage to the primary seat of the system's intelligence.

What's Next?

I've uploaded the latest checkpoint on HuggingFace.

I'm boiling with ideas. First, there are multiple obvious improvements I can make to the scoring model's corpus. Second, if I could do this with a MiniLM-based bi-encoder setup, I could run it on my old MacBook. Third, can I feed a series of document embeddings into a sequence transduction model? Could we pre-generate a very large corpus of messages and run a bi-encoder against a binary tree? Other possibilities include staged analysis with heuristic filters, and smarter generators.

I'm booked for the next week, but I hope to come back to this soon.