Your LLM-based document review is probably not defensible

But not because of LLMs.

 


A common question that we hear today is whether LLMs are defensible. As a data scientist, my response to this is always – anything can be made defensible, even if your predictions come from a place like this:

Screenshot 2024-09-19 at 11.27.39 PM

But jokes aside, what matters for defensibility is the process you establish to validate the results, not the process that generated the predictions in the first place, be it a first-gen machine learning model, the latest LLM, or tarot readings from a psychic.

Defensibility of the defensibility process

You can make anything defensible as long as you have a defensible process for establishing defensibility. Let’s say you were going to bet your life on the predictions a psychic would make. You’d probably want to check whether that psychic is legit first. An obvious way to do that is to ask them questions whose answers you already know, and see how accurate they are.

If we were to translate this into legalese, how you perform this testing determines whether your testing methodology is defensible. If your test of the psychic told you they were accurate in their predictions, but once you walked out the door none of their other predictions held up – you messed up somewhere in your testing workflow, i.e., it probably wasn’t defensible. If, on the other hand, the psychic’s predictions of the future were just as accurate as their predictions on your test data, then the workflow you used probably was defensible.
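
In data science terms, that sanity check is nothing more than measuring accuracy on questions whose answers you already know, and then comparing it with how later predictions pan out. A minimal sketch in Python, with made-up numbers purely for illustration:

    # Validate a "predictor" on questions whose answers we already know,
    # then compare against how its later predictions actually turned out.
    def accuracy(predictions, truths):
        return sum(p == t for p, t in zip(predictions, truths)) / len(truths)

    # Answers we already know (the "test" we give the psychic).
    test_truths      = [True, False, True, True, False]
    test_predictions = [True, False, True, False, False]

    # Predictions made after we walked out the door, once reality caught up.
    later_truths      = [False, True, True, False]
    later_predictions = [False, True, False, False]

    print("accuracy on known answers:", accuracy(test_predictions, test_truths))   # 0.8
    print("accuracy on later events: ", accuracy(later_predictions, later_truths)) # 0.75

    # If these two numbers track each other, the testing methodology held up;
    # if they diverge badly, the test itself was flawed.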

blog_animation_6_2

Can LLMs in document review be defensible?

The short answer is yes. If psychics can be made defensible, so can LLMs. So why are most applications of LLMs today not defensible? It has to do with how they are used by people. As it turns out, most workflows around LLMs today are fundamentally flawed, and can lead to:

  • Inadvertently reporting results that aren’t defensible
  • Significant delay and wasted effort re-establishing defensibility
  • Wasted cost in LLM requests

Why are most use cases of LLMs in document review not currently defensible?

If there is nothing inherently wrong with LLMs, why are most review workflows that use them today not defensible? It all comes down to the workflow. For comparison, let’s have a quick recap of the defensible workflow we are used to with a traditional first-generation TAR model, i.e., one trained on examples reviewed by humans:

blog_animation_6_1

Fundamentally, each snapshot of the process is no different from the example with the psychic above. We could freeze the process at any point in the training (e.g., after feeding an additional 1,000 examples into the model, training it, then pausing) and ask it to make predictions on the control set – a random sample drawn from the complete collection. Just like with the psychic example earlier, if the process just outlined is defensible, we would expect that at any point in the process the performance on the control sample would be the same as the performance on the rest of the data – after all, the courts, and the other side, only care about the performance on the unseen data:

blog_animation_6_3

As it turns out, although we glossed over some details, overall, the process above would be defensible, i.e., we would expect the performance measured on the control sample to closely reflect the performance on the rest of the collection.
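
As a rough illustration of that freeze-and-measure idea, here is a minimal sketch of such a loop in Python; the human_review, train_model, and evaluate helpers are hypothetical placeholders, not any particular TAR product:

    import random

    # Sketch of a traditional TAR loop with a true control set.
    # `human_review`, `train_model`, and `evaluate` are hypothetical placeholders.
    def defensible_tar_loop(collection, human_review, train_model, evaluate,
                            control_size=2000, batch_size=1000, rounds=10):
        # Carve out a random control sample up front and label it once.
        docs = list(collection)
        random.shuffle(docs)
        control_set = [(d, human_review(d)) for d in docs[:control_size]]
        remaining = docs[control_size:]

        model, training_examples = None, []
        for r in range(rounds):
            # Feed another batch of human-reviewed examples into the model.
            batch = remaining[r * batch_size:(r + 1) * batch_size]
            training_examples += [(d, human_review(d)) for d in batch]
            model = train_model(training_examples)

            # Freeze and measure on the control set only; it never enters training.
            print(f"round {r}: control-set metrics = {evaluate(model, control_set)}")
        return model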

How does this compare to LLM-based review workflows?


Now let’s see how this would work if we have an LLM review documents.

The first difference we encounter is that LLMs today aren’t trained on thousands of examples; instead, the common workflow is to engineer the prompt so that the model gives predictions that are as accurate as possible.

A typical workflow we commonly see deployed today is as follows:

  1. Sample a small random control set from the whole dataset
  2. Write an initial prompt
  3. Have the LLM make predictions on that control set
  4. If the results aren’t good enough, tweak the prompt and go back to step 3

blog_animation_6_4
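
To make the loop concrete, here is a minimal sketch of it in Python; the llm_predict, evaluate, and revise_prompt helpers are hypothetical placeholders rather than any specific vendor API:

    # Sketch of the common (and, as argued below, flawed) LLM review loop.
    # `llm_predict`, `evaluate`, and `revise_prompt` are hypothetical placeholders.
    def tune_prompt_on_control_set(control_set, prompt, llm_predict, evaluate,
                                   revise_prompt, target_recall=0.80, max_iterations=20):
        for i in range(max_iterations):
            predictions = [llm_predict(prompt, doc) for doc, _ in control_set]
            metrics = evaluate(predictions, [label for _, label in control_set])
            print(f"iteration {i}: {metrics}")
            if metrics["recall"] >= target_recall:
                break
            # The reviewer studies the errors on the control set itself and rewrites
            # the prompt accordingly -- which, as we will see, is where the trouble starts.
            errors = [(doc, label) for (doc, label), pred in zip(control_set, predictions)
                      if pred != label]
            prompt = revise_prompt(prompt, errors)
        return prompt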

 

So, what’s the problem with the LLM workflow?

Let’s summarize—the model is tuned (through prompts rather than examples), and performance is measured on a random sample from the data—just like we did before. Sounds like it should be defensible, right? Wrong.

Take a moment to reflect on this and see if you can spot where the process is fundamentally different from the defensible one we outlined in the previous section.

Ready?

What makes the above process not defensible?

As it turns out, just because we use a random sample and call it a “control set” isn’t sufficient to make the workflow defensible. What we should ask instead is – what makes a control set a control set to begin with? What is so special about it that lets us use the results reported on it as a defensible standard?

One aspect of it is the fact that it’s sampled randomly, yes – and in both cases, with an LLM and with a traditional model, the sampling is random. But the actual requirement for a control sample is that it satisfy two important criteria:

  1. Sampled randomly
  2. There is no information leakage between training data and control sample data

While the control sample commonly employed in LLM prompt tuning is a random sample (the first criterion), it does NOT satisfy the second criterion.

What is information leakage?

Information leakage simply means any knowledge gained from the control sample that is used to improve the performance of the model. In the case of a traditional TAR/CAL model, the Training set is used for this purpose—information contained in the training data is used to tune the model and maximize its performance. The Control set is not used to improve the model—it’s used exclusively to evaluate its performance.

In the case of the LLM workflow described earlier, the so-called “control set” effectively serves the same purpose as the training set in the traditional TAR workflow, except that instead of the model being tuned automatically, as in traditional TAR, the human crafting the prompts based on the prediction results on the “control set” performs that task. Who performs that task, or how, is entirely irrelevant – what matters is that information contained in the control sample is used to improve the model. That means that the so-called “control set” in an LLM workflow is a control set in name only—effectively it is a training set.
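
One way to see how subtle this is: the only part of the requirement you can verify mechanically is the data split itself. A sketch in Python, assuming documents are tracked by ID:

    import random

    # The two properties a control sample must have:
    #   1. It is drawn at random from the full collection.
    #   2. Nothing in it is ever used to tune the model (no information leakage).
    def split_collection(doc_ids, control_size, seed=42):
        rng = random.Random(seed)
        shuffled = list(doc_ids)
        rng.shuffle(shuffled)
        control_ids = set(shuffled[:control_size])     # property 1: a random draw
        training_pool = set(shuffled[control_size:])
        assert control_ids.isdisjoint(training_pool)   # necessary, but NOT sufficient
        return control_ids, training_pool

The disjointness check catches the obvious leak (the same documents appearing in both sets), but not the subtler one described above: a human reading control-set results and feeding that knowledge back into the prompt. That leak lives in the workflow, not in the data split.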

blog_animation_6_5

What does this imply for the defensibility of LLM-based review?

If this is the workflow you use, i.e., iterate on a prompt until you get satisfactory results on a control set, you are setting yourself up for big trouble.

First, the performance you get on that control set will no longer represent the performance of your model (i.e., your prompt as applied to that model) on the entirety of the data. The degree to which the control set performance and the general performance diverge will depend directly on the amount of training or tuning, i.e., prompt engineering iterations, that goes into it. This is a well-known phenomenon in machine learning called overfitting.

Regardless of whether we are using traditional TAR models or LLMs, the more training we do, the better the performance will be on the training data. However, at some point that performance stops reflecting the performance of the model on unseen data, i.e., the remainder of the dataset that we want to predict over – the model starts to “overfit”: it is trained to the test, rather than to the entire dataset:

blog_animation_6_6
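
One way to picture this is to track two curves as tuning proceeds – the metric on the data being tuned against, and the metric on data the tuning never sees. A minimal sketch in Python, with made-up numbers purely for illustration:

    # Watch for overfitting by comparing tuned-on vs. unseen performance per iteration.
    def divergence_report(tuned_on_metric, unseen_metric, tolerance=0.05):
        for i, (seen, unseen) in enumerate(zip(tuned_on_metric, unseen_metric)):
            gap = seen - unseen
            flag = "  <-- overfitting" if gap > tolerance else ""
            print(f"iteration {i}: tuned-on={seen:.2f}  unseen={unseen:.2f}  gap={gap:+.2f}{flag}")

    # Illustrative, made-up numbers: tuned-on performance keeps climbing while
    # performance on unseen data plateaus and then slips.
    divergence_report(
        tuned_on_metric=[0.60, 0.70, 0.78, 0.85, 0.91],
        unseen_metric=[0.58, 0.67, 0.74, 0.75, 0.73],
    )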

Why this mistake can be extremely costly

The main reason this workflow is so prevalent is that with many LLM-based review tools, reviewing a single document with an LLM requires sending it out to an external API, like GPT-4, which costs money and time. So instead of reviewing the whole collection, people optimize their prompts on a small sample before sending the rest of their million-plus document collection out for review. But as it often turns out, taking a shortcut without fully understanding its implications can cost much more in the long term.

Here is what the worst-case scenario may be if you use this workflow:

  1. You work hard iterating on prompts to get good results on the control set
  2. You are satisfied with the results on the control set
  3. You deploy your model on 10,000,000 documents and pay $5,000,000
  4. You get horrible performance, and have to go back to step 1 and redo everything

You have wasted time and money – and now you are back at square one.
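
The arithmetic behind steps 3 and 4 is easy to reproduce; the per-document price below is an assumption chosen purely to match the illustration above:

    # Back-of-the-envelope cost of a full-collection LLM pass (assumed prices).
    docs_in_collection = 10_000_000
    cost_per_document = 0.50      # assumed blended API cost per document, in dollars

    full_pass_cost = docs_in_collection * cost_per_document
    print(f"one full pass:    ${full_pass_cost:,.0f}")        # $5,000,000
    print(f"tune, fail, redo: ${2 * full_pass_cost:,.0f}")    # $10,000,000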

All of this can be boiled down to one core problem – estimating performance on the same set that you use to optimize performance.

What is the solution?

The solution is actually very simple – you need two samples: a training set, and a true control set that is never looked at during prompt engineering. All you are allowed to do with the control sample is compute performance, e.g., Precision and Recall metrics. As you iterate on the prompt, you look only at the training sample when optimizing it. So the correct workflow would be:

  1. Start with an initial prompt
  2. Compute performance on the control set
  3. If the performance is not satisfactory:
    1. Look at the training set and see where the model makes errors
    2. Adjust the prompt
  4. Repeat steps 2 and 3 until control set performance no longer improves, or starts to decrease
  5. Once satisfied, apply the model to the rest of your dataset (millions of documents)

blog_animation_6_8
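
Here is a minimal sketch of that corrected loop in Python. As before, llm_predict and revise_prompt are hypothetical placeholders; the important part is that the control set is only ever used to compute Precision and Recall, never to diagnose errors:

    # Corrected loop: errors are studied on the TRAINING sample,
    # performance is reported on the CONTROL sample, and the two never mix.
    def precision_recall(predictions, labels):
        tp = sum(p and l for p, l in zip(predictions, labels))
        fp = sum(p and not l for p, l in zip(predictions, labels))
        fn = sum(not p and l for p, l in zip(predictions, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    def tune_prompt_defensibly(training_set, control_set, prompt,
                               llm_predict, revise_prompt, max_iterations=20):
        best_recall = -1.0
        for i in range(max_iterations):
            # Report-only: the control set is used to measure, never to diagnose.
            control_preds = [llm_predict(prompt, doc) for doc, _ in control_set]
            precision, recall = precision_recall(control_preds, [l for _, l in control_set])
            print(f"iteration {i}: control precision={precision:.2f} recall={recall:.2f}")
            if recall <= best_recall:
                break                 # control-set performance stopped improving
            best_recall = recall

            # Diagnosis happens on the training sample only.
            train_preds = [llm_predict(prompt, doc) for doc, _ in training_set]
            errors = [(doc, label) for (doc, label), pred in zip(training_set, train_preds)
                      if pred != label]
            prompt = revise_prompt(prompt, errors)
        return prompt

The design choice that matters is the asymmetry: error analysis reads the training sample, reporting reads the control sample, and nothing flows from the control sample back into the prompt.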

The result of following this workflow is achieving the key objective of defensibility – a truthful assessment of the model’s performance on the complete collection, based only on a control sample. That means that as you continue iterating and tweaking prompts, you can be assured that, as long as you keep improving the performance on the control set, when you do bite the bullet and potentially spend millions on running the LLM on the complete dataset, the resulting performance will match your expectations.

blog_animation_6_7

That’s it!
