Let’s return to confidence assessments from LLMs and review the consequences of relying on them. As we discussed in our post on Assessments, scoring predictions is essential, and what matters most is not the scores themselves but the ranking those scores create across all predictions. We use scores to make rankings; we’ve been doing that since the early days of TAR 1.0.
Once our model (whichever kind: TAR 1.0, TAR 2.0/CAL, or LLM) gives us a score, we rank the examples and draw a cut-off line:
We decide by drawing a line that separates what the model calls Responsive from Not Responsive. The resulting separation will inevitably contain some errors, whether false positives or false negatives.
Now, when we change the scoring threshold, we naturally get different results and different error rates. By being more inclusive, we may accept more false positives but produce fewer false negatives. That's the fundamental tradeoff we have to make with any machine-learning model: there is no single correct place to draw the line; ultimately, it's a value judgement.
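To make that tradeoff concrete, here is a minimal sketch in Python with made-up document IDs, scores, and labels: we rank the documents by score, draw a cut-off, and count the errors a strict versus a more inclusive threshold produces. None of this is real data; it only illustrates the mechanics.

```python
# Sketch with hypothetical scores/labels: rank documents by model score,
# then apply a cut-off to split Responsive vs. Not Responsive and
# count the resulting errors at different thresholds.

scored_docs = [
    # (doc_id, model_score, true_label) -- illustrative values only
    ("doc_01", 0.95, 1), ("doc_02", 0.90, 1), ("doc_03", 0.80, 0),
    ("doc_04", 0.70, 1), ("doc_05", 0.55, 0), ("doc_06", 0.40, 1),
    ("doc_07", 0.30, 0), ("doc_08", 0.10, 0),
]

# Rank: highest score first (this ordering is what the cut-off line acts on)
ranked = sorted(scored_docs, key=lambda d: d[1], reverse=True)

def errors_at_threshold(ranked_docs, threshold):
    """Count false positives / false negatives for a given cut-off."""
    fp = sum(1 for _, score, label in ranked_docs if score >= threshold and label == 0)
    fn = sum(1 for _, score, label in ranked_docs if score < threshold and label == 1)
    return fp, fn

for t in (0.75, 0.35):  # a strict cut-off vs. a more inclusive one
    fp, fn = errors_at_threshold(ranked, t)
    print(f"threshold={t}: false positives={fp}, false negatives={fn}")
```

With the stricter cut-off we get fewer false positives but more false negatives; lowering the threshold reverses the balance, which is exactly the value judgement described above.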
The process described above has been standard for as long as TAR has existed. As long as a model gives us a score of some kind, this method can be used to make predictions. So what's the difference between the older models we are used to in TAR and LLMs?
The key difference is that scores obtained from an LLM, unlike those from the models we've used in the past, are not deterministic. This means that running the same model again and again, on the exact same data, will give different scores, and therefore a different ranking.
For example, asking our trusty old TAR model for a score repeatedly, we'd expect something like this:
Since different scores result in different rankings, each time you ask an LLM for predictions you should also expect different performance metrics, such as recall and precision. This is not good!
Ok, how do we solve this then?
We mentioned previously that the only way to increase the model's consistency in predictions is to average either the predictions themselves or the score. This means asking the LLM again and again for the score on the same document and averaging the results. With enough queries, the average converges to the true prediction score and stabilizes (i.e., it becomes effectively deterministic, like a traditional model).
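As a rough illustration of why that works, the sketch below stands in for repeated LLM queries with a fixed "true" score plus random noise (the score, noise level, and query counts are all hypothetical) and shows the running average stabilizing as the number of queries grows.

```python
import random

random.seed(0)

TRUE_SCORE = 0.62   # the "true" prediction score (hypothetical)
NOISE = 0.15        # per-query noise, standing in for LLM non-determinism

def query_llm_score():
    """Stand-in for one LLM call: the true score plus random noise, clipped to [0, 1]."""
    return min(1.0, max(0.0, random.gauss(TRUE_SCORE, NOISE)))

# Average over an increasing number of repeated queries; the running
# average converges toward the true score as the number of queries grows.
for n in (1, 5, 25, 100):
    avg = sum(query_llm_score() for _ in range(n)) / n
    print(f"queries={n:3d}  averaged score={avg:.3f}  (true={TRUE_SCORE})")
```

A single query can land well away from the true score; with enough repetitions, the average settles near it.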
But the problem is that repeating a query to an LLM many times is pretty much impractical, especially with something like GPT-4 where costs are often prohibitive.
Then what are the consequences?
So let’s look at what would happen if we don’t do that and simply rely on the confidence output of an LLM as-is.
It's easy to illustrate the consequences: all we have to do is compare the Precision-Recall curve of a deterministic model (like the first-generation TAR models we are used to) with that of a model whose predictions are noisy (i.e., each time you run it, it can produce a different result), like an LLM.
This is trivial to simulate: all we have to do is take a hypothetical set of scores, add some noise to them, and then compare the Precision-Recall curves:
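Here is a minimal version of that simulation, assuming NumPy, scikit-learn, and matplotlib are available. The prevalence, score distribution, noise level, and number of runs are all made up for illustration; it also plots a third curve, the noisy scores averaged over repeated runs, which is relevant to the point below.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)

# Hypothetical collection: 20% responsive, with "deterministic" scores that
# separate the two classes reasonably well (all numbers are made up).
n_docs = 5000
labels = rng.binomial(1, 0.2, size=n_docs)
clean_scores = np.clip(0.4 + 0.35 * labels + rng.normal(0, 0.15, size=n_docs), 0, 1)

# One noisy run (standing in for a single pass of an LLM over the collection)
noise_sd = 0.2
noisy_scores = np.clip(clean_scores + rng.normal(0, noise_sd, size=n_docs), 0, 1)

# The same noisy model averaged over many repeated runs
n_runs = 25
repeated = clean_scores + rng.normal(0, noise_sd, size=(n_runs, n_docs))
averaged_scores = np.clip(repeated, 0, 1).mean(axis=0)

for name, scores in [("deterministic", clean_scores),
                     ("noisy, single run", noisy_scores),
                     (f"noisy, averaged over {n_runs} runs", averaged_scores)]:
    precision, recall, _ = precision_recall_curve(labels, scores)
    plt.plot(recall, precision, label=name)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```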
What's fascinating here is that if the noisy model (i.e., an LLM) were simply run multiple times on the same data, and its scores averaged, then its Precision and Recall would approach those of a deterministic model, i.e., the blue curve. However, if we rely on a single run of its predictions (the red curve), the performance is way off!
In fact, a single run will always under-estimate the true performance of the model.
In the example above, the reported performance can be as low as half of what it actually is!
What are the practical implications of this?
In practice, when you look at that Precision-Recall curve from an LLM (the red curve in the example above), all you will see is that its performance is insufficient. You won't know why. Is the model itself poor, for example? Your obvious instinct will be to improve the model, perhaps by tuning prompts, adding some examples, or some other means. That is how we would operate in the TAR/CAL world we are used to.
But in this case, all that effort would be wasted! The performance gap is not because the model itself isn't good enough; it's simply that evaluating a single, non-deterministic run under-estimates the model's performance. The ONLY way to close that gap is to ask the LLM for scores multiple times and average its predictions.
Given that, for cost and time reasons, you won't be running your model 10x on each document, are there any solutions to ensure the model's performance is reported accurately? The answer is yes, and we will look at it in one of our next blog posts. Stay tuned!