What is your first instinct when your TAR model performs poorly on a control set?
A commonly held view is that more training data, i.e., more review effort, will push the performance (i.e., Recall/Precision) to where you want it to be, and that with enough money and time it will eventually get close to 100%.
Unfortunately, it may take hundreds of wasted review hours before one realizes that the model’s performance is capping out at some far-from-ideal number.
Why does this happen? More importantly, can we anticipate the ceiling on the model’s performance ahead of time, sparing ourselves the wasted hours, the money, and the shattered expectations?
Just as genes encode certain things about us at birth, your data encodes certain things about the future performance of any model applied to it. One of those things is the performance ceiling: the absolute best values of metrics such as Precision and Recall that you could ever expect on that data, regardless of the model you use, be it traditional TAR or GPT8. And it turns out that you can pre-compute these limits using the data you already have.
In this article, we will describe the root cause of how data can limit the performance of any model. The next article will provide a set of simple formulas and a calculator that you can use yourself, potentially saving you hundreds of hours and headaches of trying to train your models beyond the limits dictated by the data.
When I say that the data already encodes the fundamental limit on your model’s performance, what data are we talking about: the training data or the test data (i.e., the control set)? The reality is that it’s both, but in this article we will focus on the control set for two key reasons:
In other words, just by analyzing the statistics of your control set, we can already deduce the best performance you can ever report for any model, including those that don’t even exist today.
While there could be problems with the original data itself (e.g., poorly extracted text), in this article we are talking about the review tags created by human reviewers. As it turns out, the data problems that lead to performance ceilings boil down to the nature of the humans reviewing that data.
Remarkably, it is entirely sufficient to know just one characteristic about each reviewer to predict the performance ceiling of any model: how liberal vs. conservative they are in applying review tags. Quantitatively, this is reflected in their responsive rate, or the richness of the data they generate.
To illustrate why richness, or a reviewer’s responsive rate, is a critical determinant of the performance ceiling, we will use two prototypical reviewers: a Liberal reviewer (i.e., one who is generous in handing out the responsive tag when classifying documents) and a Conservative reviewer (i.e., one who is particularly selective in tagging documents). In reality, reviewers fall on a continuum in how they tag, but these two extremes will be helpful in explaining why this phenomenon leads to a performance cap.
To understand the root of the problem, let’s look at a very simple example.
Imagine that two reviewers, a Conservative one and a Liberal one, review documents from the same collection, and that each reviewer both trains a model and creates the control set used to evaluate that model.
Below is a simple visualization of this process. For illustration purposes, the Conservative reviewer has a responsive rate of 10%, and the Liberal reviewer has a responsive rate of 20%.
The question to answer is: What is the theoretical best performance possible for each model for each reviewer (i.e., what is the performance ceiling)?
If we set aside errors or self-inconsistency on the part of each reviewer, the best possible performance on each control set would, of course, be 100% (100% Recall and 100% Precision):
Now, let’s imagine that the situation was modified slightly – the control set for the model trained by the Conservative reviewer was created by the Liberal reviewer, and vice-versa. In other words, a different reviewer from the one who trained the model created the control set.
Before scrolling down further – try to answer the same question again: what is the best theoretical performance on each control set now?
This time, the answer is no longer 100%, and if you think about it for a moment, it will become obvious why. If the Liberal reviewer created the control set with a responsive rate of 20%, then the Conservative reviewer (and thus the model trained on their tags), with a responsive rate of 10%, would, even if maximally consistent, tag at most half of those documents as responsive, capping Recall at 50%.
Think about the implications of this for a moment: there is no amount of training the Conservative reviewer can do to increase that Recall. As long as the reviewer training the model and the reviewer testing it (i.e., via the control set) have a fundamental disparity in their responsive rates, Recall cannot get better than 50%!
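If you want to check this arithmetic for yourself, here is a minimal Python sketch of the two-reviewer case. It assumes the idealized picture above: perfectly self-consistent reviewers whose responsive documents are nested (everything the Conservative reviewer would tag responsive, the Liberal reviewer would too), and a model that perfectly mimics the reviewer who trained it. The function is just for illustration; the general formulas come in the next article.

```python
def two_reviewer_ceiling(model_rate, control_rate):
    """Best-case Recall and Precision when a model perfectly mimics a reviewer
    with responsive rate `model_rate`, but the control set was tagged by a
    reviewer with responsive rate `control_rate`.

    Assumes perfectly self-consistent reviewers whose responsive documents
    are nested (the stricter reviewer's positives are a subset of the more
    generous reviewer's).
    """
    max_recall = min(1.0, model_rate / control_rate)
    max_precision = min(1.0, control_rate / model_rate)
    return max_recall, max_precision

# Conservative-trained model (10%) scored on a Liberal control set (20%):
print(two_reviewer_ceiling(0.10, 0.20))   # (0.5, 1.0) -> Recall capped at 50%

# Liberal-trained model (20%) scored on a Conservative control set (10%):
print(two_reviewer_ceiling(0.20, 0.10))   # (1.0, 0.5) -> Precision capped at 50%
```

Note that under these assumptions, when the training reviewer is more liberal than the control-set reviewer, it is Precision rather than Recall that gets capped.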
The examples above illustrate the root cause of the problem. In practice, however, both the training data and the control set are created by multiple reviewers, with reviewers represented in varying proportions (and often in different proportions in the training data than in the control set).
While this complicates the formula for computing the performance ceiling, it doesn’t change the fundamental result, which is the main conclusion of this article:
As long as the control set contains tags by reviewers with different responsive rates, there will exist a performance ceiling that is completely independent of the model.
What’s fascinating is that this holds true regardless of how the training data was created, whether by only one of the reviewers (say, the Liberal or the Conservative one) or by some proportional combination of the two. The performance ceiling is determined purely by the disparity in the responsive rates of the reviewers in the control set, and the control set only.
Below is an example that illustrates the performance ceilings of models trained using varying proportions of training data from both the Conservative and Liberal reviewers. As you can see, in no scenario is 100% performance (in Recall AND Precision) achievable.
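If you would like to see this numerically, here is a small simulation sketch under the same idealized assumptions: reviewers who differ only in where they set the bar, and a model that can only decide from the document itself (here, by thresholding a single score). The specific numbers (100,000 documents, a 50/50 split of the control set between the two reviewers) are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_docs = 100_000                         # documents in a hypothetical control set
score = rng.random(n_docs)               # each document's underlying "responsiveness"

# Two perfectly self-consistent reviewers who differ only in where they set the bar:
# the Conservative reviewer tags the top 10% responsive, the Liberal the top 20%.
conservative_tag = score > 0.90
liberal_tag = score > 0.80

# Each control-set document happens to be reviewed by one of the two (50/50 here).
reviewed_by_liberal = rng.random(n_docs) < 0.5
control_tag = np.where(reviewed_by_liberal, liberal_tag, conservative_tag)

# The best any model can do is decide from the document alone, i.e., pick a cutoff.
# No cutoff reaches 100% Recall and 100% Precision at the same time.
for cutoff in (0.80, 0.85, 0.90):
    predicted = score > cutoff
    true_positives = (predicted & control_tag).sum()
    recall = true_positives / control_tag.sum()
    precision = true_positives / predicted.sum()
    print(f"cutoff {cutoff:.2f}: recall {recall:.1%}, precision {precision:.1%}")
```

The intuition is the same as before: documents that fall between the two reviewers’ bars carry a responsive tag when the Liberal reviewer happened to review them and a non-responsive tag when the Conservative reviewer did, so any single model must sacrifice either Recall or Precision on them.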
The only question left is: how can YOU compute the performance ceiling before wasting time and money trying to push your models to do the impossible? We will answer this question in the next article. As a preview: there is a simple formula you can use when there are only two reviewers, and a more complicated one, for which we will provide an online calculator, for the general setting of multiple reviewers. You will simply feed in each reviewer’s responsive rate and the proportion of control-set data they tagged, and immediately obtain the performance ceiling.
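As a rough preview of the kind of calculation such a calculator would perform (not the exact formula, which we will present next time), here is a minimal sketch that extends the nesting assumption above to multiple reviewers: you feed in each reviewer’s responsive rate, the proportion of the control set they tagged, and a hypothetical responsive rate for the model, and it returns the best-case Recall and Precision.

```python
def mixed_control_set_ceiling(rates, proportions, model_rate):
    """Best-case Recall and Precision on a control set whose documents were
    tagged by reviewers with responsive rates `rates`, in shares `proportions`
    (which should sum to 1), by a model that behaves like a single consistent
    reviewer flagging the top `model_rate` fraction of documents.

    Assumes the same idealized nesting as in the examples above.
    """
    richness = sum(w * r for w, r in zip(proportions, rates))   # control-set responsive rate
    agreed = sum(w * min(model_rate, r) for w, r in zip(proportions, rates))
    return agreed / richness, agreed / model_rate               # (Recall, Precision)

# Control set tagged half by the Conservative (10%) and half by the Liberal (20%) reviewer:
for m in (0.10, 0.15, 0.20):
    recall, precision = mixed_control_set_ceiling([0.10, 0.20], [0.5, 0.5], m)
    print(f"model responsive rate {m:.0%}: Recall <= {recall:.0%}, Precision <= {precision:.0%}")
```

Whatever responsive rate the model adopts, it can push Recall or Precision toward 100%, but not both at once.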
Till next time.