Greg Burnham leads Epoch’s benchmarking team. Tom Adamczewski is a senior research engineer who develops new benchmarks, including MirrorCode.
Topics we cover: why benchmark saturation isn’t as alarming as it seems, how AI can speed up benchmark development, the benchmark-reality gap, whether an AGI benchmark can exist, and the role of human evaluation in future benchmarks.
We also discuss MirrorCode, a benchmark (co-developed by Epoch and METR) of long-horizon coding tasks, and FrontierMath: Open Problems, Epoch’s benchmark of real unsolved math research problems.
Transcript
This is an edited transcript of the “Epoch After Hours” podcast.
Are AI benchmarks doomed? [00:00:36]
Anson
So AI benchmarks seem to have a really big problem right now. If you look across AI benchmarks, most of them are saturating really, really quickly. And by really quickly, I mean within months for most of them. If they’re really good, maybe they’ll last for a year or two. But for the most part, it seems very hard to build a benchmark that lasts a long time.
So there’s a looming question that revolves around all of this: are AI benchmarks doomed? To start off, I’d like to get a nice little vibe check of where you guys stand on whether AI benchmarks are doomed. So what do you guys think?
Tom
So I think benchmarks will continue to be important as long as people want to have some kind of qualitative description of what an AI system can do, or want to quickly compare when a new model comes out — which one is better. And so it seems like we’re sort of stuck with benchmarks, regardless of the many flaws that they might have, just because there is this obvious demand for information, this gap that they fill.
We might be in a situation where benchmarks are less useful than they used to be. Like they explain less of all that we might want to know about AI systems’ performance. But there’s still additional information, and so people are going to continue to release new benchmarks and look at benchmark results.
Greg
I’m a bit more of an optimist on this. I’d almost say we’re living through a golden age of benchmarking. It used to be that models were not that capable, so there was only so much for benchmarks to say. Now models are much more capable, but this just means there’s much more for benchmarks to potentially tell us.
So maybe, as Tom was saying, the percentage of questions you might want benchmarks to answer that benchmarks actually answer might be shrinking. But the amount of information we’re gleaning from benchmarks, I think in some sort of absolute terms, is growing.
And I think this is very exciting. I think benchmarks will survive and be important, even potentially central, so long as there are things we are curious whether AI systems can do — and it seems like there are still plenty of those questions. I think there are some benchmarks that might even — and I mean this loosely — survive the singularity.
The costs and benefits of benchmark development [00:03:13]
Anson
One thing I’d want to understand better is why some people are so much more pessimistic. I imagine some researchers in AI safety would probably say: if you look at benchmarks like FrontierMath, the researchers put quite a lot of effort into trying to make these benchmarks last for quite a bit of time. And it seems like maybe within one or two years — which is already relatively good for some of these benchmarks — they’re getting to the point of saturation. And now we’re having to spend millions of dollars to build these benchmarks. Can we really keep doing this? If it costs millions of dollars and the gains are maybe not that high, maybe it’s just hard.
I’m curious what you guys think about that.
Tom
I think what you said about the gains not being high — that’s really the key. Yes, I agree that as the tasks that AI can do get more and more impressive, creating benchmarks for those tasks becomes more and more costly.
And so then it just depends on whether the benefit side is high enough. And I sort of suspect that this will be the case, because while AI gets more powerful, it’s just more important to know what it can and can’t do, or which AI systems are better than others.
In the same way that everything is increasing — like AI companies’ compute spend — the cost that benchmark developers spend on developing new benchmarks is also increasing a lot. I think this is sort of fine as long as people care enough about the answers we get from these benchmarks.
Yeah, I may be caricaturing your pessimism slightly, but I feel like it can sometimes come from, “Oh, well, this benchmark has all these flaws — was it really worth all the effort?” Well, think about how unhappy you’d be if you had nothing at all.
If literally all benchmarks were saturated, that does seem like we’d be in a much worse position. And if we were in that world, the premium on being the one team in the world that has an unsaturated benchmark would be huge. So I do think that basically costs and benefits might keep pace with one another.
Greg
I think it’s not crazy to measure the benchmarking budget as a percent of revenue of AI companies. I also just wouldn’t underestimate human cleverness. I do think benchmarking used to be kind of super easy — too easy to make a benchmark that started at zero. And now you have to be more clever to find a benchmark that is unsaturated, and sometimes you’ll be wrong about what is or is not saturated. But that’s a fine trade-off. We should be generally happy to have opportunities to exercise our cleverness and try.
And I think there’s some historical examples. I think part of where this pessimism might be coming from is we have just seen this big ability spike — a qualitative abilities spike — with coding agents starting to just work. This means that some tasks that we had put in benchmarks thinking they were hard are doable now.
And I would just point out, this has happened at least twice before, roughly. Once, call it around GPT-4, models could suddenly do all these easier question-answering or language-manipulation tasks. And so some benchmarks were saturated and people did have to be clever to come up with harder benchmarks. Fine.
And then reasoning models came out and suddenly some math benchmarks were saturated. I think if we feel a little shell-shocked right now, that’s understandable. But if you just look around at the world, there are plenty of things systems can’t do. And if you have to spend some more money on benchmarks for them, fine — that’s just how it is.
You can have benchmarks that survive these paradigm shifts. I think GPQA is a really good example of this. It was made at the end of 2023, before reasoning models in their current form were even on the horizon. And I would argue it was only really saturated in the winter of 2025, two years later. I think that’s impressive. Reasoning models definitely did better on it — there’s a big spike around o1 — but it’s not like it was totally saturated.
It was a high-effort benchmark, though. You had many experts reviewing each question and testing out each other’s questions, so you could tell that the chemistry questions really are hard for the physicists. It was more effort, it was more expensive. People paid it, and it was worth it. And while some benchmarks that were supposed to be hard — math benchmarks, say — were completely saturated when o1 came out, GPQA wasn’t. So we’ll have some wins and losses on this metric.
The last thing I’ll say is: a saturated benchmark is not a problem. Even having a benchmark that is saturated upon release — a hundred percent — because you started developing it four months ago and AI progress happened to just hit the nail right on the head — that’s very useful to know, because it dramatically reduces your uncertainty about what this qualitative feel, this vibe, of AI progress actually means in terms of numbers. Even this is relevant. So while it’s a little disappointing if your benchmark is saturated on release, I still think it can be quite valuable.
And maybe there’s some lessons we can learn about how to try to build benchmarks like this, and we can come back to that. But I just feel like this pessimism is over-updating.
Anson
I guess one kind of counter-argument that comes to mind is that cost is one thing that maybe we’re willing to pay a lot more for because we at Epoch believe that it’s very valuable to have these kinds of benchmarks. But then what about the time it takes — the cost in terms of time — for trying to build these benchmarks?
I don’t want to underestimate human cleverness, but I also don’t want to underestimate AI cleverness. As AIs are getting really smart, they’re going to crack all of these benchmarks so soon. Even if we spend six months building out a benchmark, by the time we’re done it’s not going to be great — because it’s going to be saturated.
Tom
I mean, I do think this is an argument for developing smaller, bite-sized benchmarks faster. In some ways, put something out as a trial balloon that you think is toward the harder end of the distribution you’d want your benchmark to cover, and see what happens as you keep filling out the benchmark. And if that balloon gets popped, then you say, “I need to work on a different project,” or whatever. But again, that served its purpose.
I do think there is some lead-time risk for any benchmark where the fundamental infrastructure will take you six months before you could even have a sample. I’m not so worried about that, because I think any benchmark should kind of start with a manual experiment. You have some software task you sort of want to make a benchmark out of — you just ask Claude Code to do it and see how far it gets, and you get some sense of that. I do think starting benchmarks out with that is good, and something more like “agile” development of benchmarks would be a good lesson to learn.
But, yeah, I think it’s worth updating, just not updating all the way to “benchmarks are impractical now.” Because, again, to be grounded — as long as there’s a task that you, today, might practically want an AI system to do, and you put in like half a day’s work eliciting it and it doesn’t do it — there’s absolutely, today, a benchmark there.
Greg
I like the agile development point. I feel like that’s something that, maybe, historically, because benchmarks have come out of academia, it’s been very much — you don’t share anything with the world, you work for months until you have this super polished paper and then you release it. Maybe moving to something a little more gradual, a little more like open-source software development where there are continual improvements being made — maybe that’s promising.
Two responses to your calendar-time, lead-time objection. One is just: we need to look at what’s parallelizable and what’s not in the benchmark development process. For the parallelizable things, you can hope to just throw more resources at them and make it faster that way.
And then there will be some non-parallelizable portion. For that part, if the worry is that we as humans are just too slow and AI progress is very fast — well, AI systems are helping with everything, including benchmark development. This is something we see already in our own benchmark development work. For most technical work that I do, LLMs are a pretty essential tool and they speed me up a lot.
MirrorCode and scalable benchmarks [00:11:48]
Anson
So to make sure I’m understanding: the AIs are helping you build the benchmarks faster. And the other thing is, to what extent can we break this down into multiple chunks where we can just throw more resources at the problem?
I kind of want to dig into the second part a bit more, because you guys are the ones building the benchmarks on the ground. And I know, Tom, you’ve recently been working on a benchmark, and my understanding is it’s meant to be something like METR’s time-horizons task set 2.0. Could you say more about that?
Tom
So maybe I’ll not answer that directly, but take a step back first. With this question of how we make unsaturated benchmarks, one angle I really like — and have liked for a while — is: are there categories of tasks where you can just take the same setup and crank up the difficulty as much as you want? Ideally to infinity, but maybe it’s sufficient if you can just crank it up a lot.
So, I like this idea, and I’ve been working on a benchmark that is sort of my instantiation of this idea for the software engineering domain called MirrorCode. And it’s called MirrorCode because the AI has to re-implement some existing program and mirror its functionality perfectly.
Yeah, maybe a little bit on the setup. These are all programs with a command-line interface. That can range from simple command-line utilities, like dirname or ls, up to huge programs that just happen to have a command-line interface, such as interpreters for programming languages, type checkers, et cetera.
We give the AI system the documentation for the original program — we don’t give it the source code — and we give it access to a black-box reference implementation, so a binary of the original program that it can send inputs to and view the outputs. If things are underspecified in the documentation, or if it wants to see the exact output format or test new hypotheses, it can do that as much as it wants against this reference binary.
The hope with this is that you can really scale across several orders of magnitude in size of the original program, and hopefully also the amount of effort for the AI or humans to complete the reimplementation task. Programs that are really trivial and were like 10 or a hundred lines in the original, up to 10 million or tens of millions of lines of code like the Linux kernel, or really complicated compiler chains. I think there’s just a lot of room here for scaling up to the largest software projects ever in the history of software development.
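To make that setup concrete, here is a minimal sketch in Python of the kind of harness it implies. The binary paths, the example probe, and the helper names are hypothetical illustrations, not MirrorCode’s actual interface:

```python
import subprocess

# Hypothetical paths: the black-box original and the AI's reimplementation.
REFERENCE_BIN = "./reference/dirname"
CANDIDATE_BIN = "./candidate/dirname"

def run(binary, args, stdin=""):
    """Run a command-line program and capture (exit code, stdout, stderr)."""
    proc = subprocess.run([binary, *args], input=stdin,
                          capture_output=True, text=True, timeout=30)
    return proc.returncode, proc.stdout, proc.stderr

# During the task, the agent can probe the reference as often as it likes,
# for example to pin down behavior the documentation leaves underspecified:
print(run(REFERENCE_BIN, ["/usr/local/bin/"]))  # how is a trailing slash handled?

def mirrors(args, stdin=""):
    """True if the candidate matches the reference exactly on one input."""
    return run(CANDIDATE_BIN, args, stdin) == run(REFERENCE_BIN, args, stdin)
```

Grading then amounts to running something like `mirrors` over a large suite of inputs and requiring exact agreement.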
Anson
And how far did you scale it in fact?
Tom
So we’re still figuring out exactly what we’ll release. What we definitely have so far are a couple of programs that are in the roughly hundred-thousand-lines-of-code range, without counting dependencies in the original implementation. An example of that is Pkl, which is this new programming language that came out in 2024 from Apple.
In our experiments so far, the best AI systems — with something like hundreds of millions or a billion tokens over the course of the run — are not yet able to complete these very hardest tasks, but they’re able to do pretty reliably everything up to that level of difficulty —
As of recording this podcast, I feel very uncertain about whether, with more tokens, they would just be able to do everything. I would say it’s currently my best guess that yes, they would be able to do everything up to the hundred-thousand-line-or-so size.
With this benchmark, I did originally envision it as, “Okay, this is going to be a really hard benchmark for AI systems.” And we created a lot of tasks in the early phase of the project that are now saturated. It certainly shows that even when you think you might be setting the bar high enough accounting for how much progress AI will make, you might still be underestimating it.
For very precisely specified tasks, the AI really knows absolutely everything the program has to do — it has to output exactly this string on this kind of input, et cetera. AI systems can just keep going at it for many, many times the size of their context window, with compaction. And because the task is sufficiently precisely specified, they sort of know where they’re at in terms of their progress, and they can do even these very impressive tasks that we would guess represent several weeks of human work. There’s still a bit of room to go — scaling to the biggest human software projects ever — to help us answer: if we tell an AI system very precisely what to do, can it do anything in software engineering?
Anson
Let me make sure I’m contextualizing this correctly. This is supposed to be a bunch of, you were saying, multiple-week-long tasks — like hundreds of thousands of lines of code. And these are things that we thought were going to be really, really hard for the AI. But it seems like before we’ve even released the benchmark, AIs are already able to do a huge chunk of these — as long as they’re using — what was the token budget?
Tom
So just to be clear, these time estimates for how long it would take a human to do the task are guesses. We don’t have data on this. The multiple weeks is sort of my personal guess. The hardest task that AI can definitely do in MirrorCode is implementing the CommonMark spec — a formalization of Markdown that tells you exactly, for any Markdown input, how to convert it to HTML. The reference implementation for that is about 16,000 lines of C. My personal guess, which is extremely speculative, is that this would take an experienced software engineer who is completely unassisted by AI multiple weeks to reimplement.
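To make the grading concrete for a case like this: the CommonMark spec defines example Markdown-to-HTML pairs, so a grader can simply demand exact output matches. A hedged sketch, where the example file and the candidate binary path are assumptions:

```python
import json
import subprocess

# Assume the spec's example (markdown, html) pairs have been extracted to JSON;
# the file name here is hypothetical.
with open("commonmark_examples.json") as f:
    examples = json.load(f)  # [{"markdown": "...", "html": "..."}, ...]

def render(binary, markdown):
    """Pipe Markdown into a cmark-style converter on stdin; return the HTML."""
    return subprocess.run([binary], input=markdown, capture_output=True,
                          text=True, timeout=30).stdout

passed = sum(render("./candidate/md2html", ex["markdown"]) == ex["html"]
             for ex in examples)
print(f"exact matches: {passed}/{len(examples)}")
```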
Anson
I see. But then it’s still the case that if you were to invest in building the month-long versions of this, or maybe the year-long versions — which are the ideal things to build in the future — you think that there’s still plenty of room to keep scaling this up?
Tom
Well, I don’t want to make strong predictions about whether AI will be able to do it or not. And maybe I’m a little bit dodging the main question you want to ask with this podcast, but that’s sort of not my main interest. I think this is interesting just because it lets us describe AI capabilities on precisely specified software tasks across these orders of magnitude of difficulty. And it’s great to know whether AI can do that or not.
I care a bit less about whether it will be saturated by a certain date. And I agree it’s relevant because people want to be able to keep tracking AI progress. I don’t feel very confident about making predictions for that.
What I can say is that Nicholas Carlini stopped the Anthropic C compiler experiment based on — my impression is — pretty much his gut feeling of, “Ah, it’s gotten up to here, it seems to now be sort of stagnating, to be introducing bugs when it tries to introduce optimizations,” and he decided to stop it there. I don’t really know what his criteria were, or maybe he just wanted to spend up to $20,000 of compute and didn’t want to go further.
So it’s clearly the case that Carlini could have kept going — could have said, “Okay, no, no, the task is to compile all these projects and have the resulting code be as efficient as GCC.” Would the AI have gotten there? I feel torn between two inclinations. One is: it would just seem so crazy for AI to be able to rebuild the largest software engineering projects ever from the ground up, representing many years of work by hundreds of people. That still feels kind of intuitively shocking on some level.
But also, I obviously have updated on the results that it can do these really impressive things on CommonMark in our experiments. It can make substantial progress — although not fully solve our hardest task — within a billion tokens, and it can do Carlini’s C compiler. So between these two poles, I end up just being very uncertain.
Greg
Mm-hmm. Isn’t this a win for benchmarking? Or would your steelman pessimist claim that this is a problem?
Anson
Sorry, that what exactly is a problem?
Greg
The state of MirrorCode upon release as AI systems having perhaps made more progress on it than we would’ve predicted when Tom began working on it, call it however many months ago.
Anson
I would’ve thought naively that they would see this as evidence that it’s actually just really hard to make these kinds of new tasks. But then it depends on how far we push things, and on what the costs and benefits are — which, as we were saying, is the thing that matters.
AI speed-up in benchmark development [00:20:57]
Anson
I’m kind of curious for both of your takes, in the case of MirrorCode and in the case of FrontierMath: Open Problems. This relates back to what Tom said earlier: whether it makes sense to build these benchmarks, and whether you’ll have trouble continuing to build benchmarks that aren’t saturated, depends on whether AI can speed up the benchmark-building process, and also on how much you can parallelize things.
So on these two different dimensions — how much have you found AI to be helpful for speeding things up when you’re building benchmarks, and also to what extent is it the kind of thing where you can just absorb more resources and it’s very flexible?
Tom
So on absorbing more resources — MirrorCode could have benefited from a lot more full-time software engineers on it. I was basically the main person with a lot of engineering experience on the project, although I certainly had some help from collaborators. And I definitely feel that, both in terms of adding target programs to the benchmark and also setting up the infrastructure, just having three engineers on it would’ve sped it up a lot.
Obviously this is from a low base. If you have a 20-person team within Anthropic — can you still sort of scale that up to 50 or a hundred people and get similar speed-up? I feel more uncertain about that.
And then there’s just adding more samples to a benchmark. One would hope that this is sort of inherently pretty parallelizable.
Greg
I’m curious for the AI speed-up one.
Tom
Yeah. AI speed-up. I mean, we all know from METR’s research that people seem to be pretty bad at estimating this. And I myself feel very uncertain, but — you know, gun to my head, if you really forced me to pick a number — I would say 2x speed-up.
Greg
I suppose I’d give similar answers here. For FrontierMath: Open Problems, the problem contribution is embarrassingly parallelizable, limited only by the mathematicians. Well, I shouldn’t exactly say that — we have a review process; I review all the problems, and so that’s a limiting factor. But for the most part, it’s parallelizable.
And then each problem contributor develops their own verification program. So we have more diversity of AI speed-up — some of them certainly used AI. But anyway, I believe that speed-up is, you know, moderate there.
The bottleneck is more in having the idea for the problem. And I don’t think the AI systems are so good at finding problems that meet our admittedly somewhat unnatural constraints of being unsolved math problems of a certain degree of interestingness with solutions that happen to be verifiable.
The benchmark-reality gap [00:23:28]
Anson
So we’ve just covered a bunch of things about whether we think benchmarks are going to be doomed to be saturated as we try to build them out because AI progress is so fast.
But there’s another way in which benchmarks could be doomed, or at least as I understand it, which is that no matter what, benchmarks are just not going to be able to capture the things that we care about, no matter how much effort you put into trying to build them.
So the kind of examples here would be like GPQA Diamond — people often say it’s PhD-level science questions: if you can do GPQA Diamond, then you’re going to be able to do PhD-level science. Somewhere along the line the logic breaks down. The model can do GPQA Diamond, but then maybe it can’t do all of PhD-level science.
What is wrong with this particular line of argument? Is it wrong? Do we think that AI benchmarks are doomed in the sense of not being able to capture these real-world impacts?
Greg
I mean, I think the argument might be a little overstated already in the snapshot you gave. I’m pretty sure that models that did well on GPQA Diamond do indeed generalize to the task of answering questions qualitatively similar to those in GPQA Diamond.
One lesson to learn from this is just to make sure that when you say “if an AI model can solve this benchmark, then it can generally do tasks like the tasks in this benchmark,” you stay close to the letter of that. Short of abject cheating — training on the test — you won’t go wrong by saying, “Okay, what this means is if I give it a self-contained grad-level science problem, even one that you need to be an expert in the domain to solve, as was verified for GPQA, then it’ll solve that.”
And you just leave the listener to their own devices to generalize. How much will that help someone working in science? What sort of uplift will that give to a non-expert — a biologist doing a chemistry problem outside their comfort zone, whatever. But the benchmark was never going to tell you that, because that’s not what the benchmark was about.
I would say incidentally, we seem to be in a period where you don’t even get in that much trouble for generalizing a little, maybe, beyond the letter of the benchmark task. By which I mean coding agents seem genuinely useful even if many of the tasks we see are not obviously in distribution for benchmark tasks.
Some of this is contingent — this is happening only because the AI companies are, perhaps behind the scenes, shoving a lot more tasks than we see into distribution, into training. But still, short of cheating, you should expect benchmark generalization — machine learning works, it generalizes within the training distribution — and that’s fine.
And so I think what this means is we should be very careful about extrapolating benchmarks, but we should also be very thoughtful and put a lot of effort into trying to put the benchmark pin right in an important area — an area that tells us something we actually care about inherently. And I think the benchmarks we’ve talked about that Epoch has been busy developing — MirrorCode and FrontierMath: Open Problems — meet that spec to a clear degree.
MirrorCode is just: if I have a really clear, precisely specified test suite, or at least a spec, then I can expect AI systems to develop software of that nature at least up to a certain degree of complexity — which MirrorCode helps you understand. And I don’t think it’s a stretch to say that’s inherently of interest to someone who might be using the system for practical purposes, deciding whether to fire all of their software engineers, or even doing research on a software intelligence explosion. What sort of tasks go into AI research? How many of them are tasks like this? And this adds clarity in very helpful, practical ways.
So too with FrontierMath: Open Problems, even more so — these are problems where there’s no generalization required, at least for each individual problem. It’s something some mathematician would really care about personally, would care about seeing solved. If you’ve devised your benchmark well, you shouldn’t care about generalizing too far beyond the benchmark because the benchmark itself is from a distribution you genuinely care about.
Tom
One thing I’m a little unsure about, though: okay, this all sounds good. We can be pretty confident in the claim that if the AI does the benchmark task, it’s going to be able to do tasks very similar to it. But then what counts, exactly, as something that’s very similar?
In practice, people often do want to try to generalize these things further, and although we say we should be careful about generalizing further, it’s very hard to say exactly how much that is.
My one example here is GDPVal. I think in their paper they explicitly motivate it in the first few paragraphs: they want this to be something like a leading indicator of a lot of automation. And I guess, unfortunately, it wasn’t successful at that. They probably spent millions of dollars building this thing, and it doesn’t seem to fully reflect what we’ve been seeing in, say, productivity statistics and so on.
Greg
Well, they, I think, fell prey to a pun in the name — and it is catchy, GDPVal. It’s great.
I think you just have to look at the task and say — you may have a motte and bailey, but in the good sense: a core goal and a stretch goal, say. For GDPVal, the core goal is that saturation of this benchmark should be evidence that AI systems can do self-contained tasks drawn from a wide range of digital work. And I mean to emphasize self-contained quite a bit, because these tasks are very self-contained. You can do web search, but apart from that, you’re given the documents you need, you’re given your task, and you output basically a document — often just a text file. That is your output. So it’s quite self-contained compared to the actual work environments that humans face.
So the core goal is just: can you usefully offload tasks like this? I would say it’s extremely consistent with my experience that over the last year, for tasks of that complexity — like, less than a day’s worth of work for me to put together a written report on some topic that requires expertise — they’ve gotten a lot better. Of course they have.
Now for automation, I think it would just have been foolish to expect that this would automate jobs. Florian Brand, who worked on the same report on GDPVal, had a great analogy. He said the self-contained nature of these tasks is somewhat analogous to the self-contained nature of bug-fixing or small feature additions in software engineering. Just as AI systems currently have not automated software engineering as a whole profession, but have transformed the workflow — you now spend much more of your time delegating and managing than writing — so too, saturation on APEX-Agents or GDPVal or RLI would mean that, if you are a knowledge worker in these other domains, you too could see your daily workflow transformed.
But these benchmarks — GDPVal, anyway — just aren’t targeting automation enough for you to expect generalization there.
Anson
I think that makes a lot of sense. And one thing that I think maybe this suggests is that there’s a lot of value in digging into the details of what this benchmark actually tells us. Because it’s very easy to be like, “Oh, GDPVal, and then GDP,” but then actually we need to look into what exactly the tests are. And as you were saying, the specific tests actually seem like they do generalize better if you look at what those tests are rather than “GDP” or whatever.
Tom
Sure. I certainly agree that this sort of effect Anson was describing doesn’t mean that benchmarks are doomed. But I have a slightly different perspective, in the sense that this slogan of “benchmark-reality gap” does resonate with me a bit more.
If you told me in 2020 that AI would solve GPQA-style questions — where they’re Google-proof, so even with arbitrary web access you can’t just find the solution written somewhere, you have to not only combine a bunch of knowledge but also do a bit of reasoning about these pretty advanced science topics — I would’ve predicted much, much bigger effects of AI on the economy and society than we in fact saw when AIs were, say, at 50% on GPQA.
And I think this is the case for many people. And to some extent this is, “Okay, I should take the L.” I was naive in how I was thinking about benchmarking, and maybe some people were much wiser about it. But it does kind of ring true to me that there seems to be a systematic way in which we try to design a benchmark that we hope will capture this broader thing, and then we see AI do great at it, but the real-world usefulness or impact isn’t quite there.
For myself, I want to take into account the track record of how I’ve been surprised by this. The sense in which I feel like it goes beyond just, “Oh, well, you were wrong and naive about the benchmark at the start,” is maybe there’s just something inherently very difficult about squeezing all of the complexity of real-world, long-horizon tasks into something benchmarkable. And we’re going to keep systematically bumping against this, even as we try to make benchmarks better and more realistic.
So, I do feel like there’s something to be aware of here. But in terms of whether this dooms benchmarks — no, because it still seems like, even if we were wrong about what GPQA meant, we can try to take the lessons from that and design the next eval better. Basically, even if we continue to be a bit wrong about this, hopefully benchmarks are still useful.
Greg
Two responses. One leans into: yes, I think people do expect more from benchmarks than they ever should have. The one AI paper I wrote, long before Epoch, was a critique of benchmarks at the time, and of people not investing in making sure benchmarks matched the distributions they cared about, even a little. This was around 2019, and — you have to understand — the situation was much worse back then. It really wasn’t clear that benchmarks correlated with anything. So I think there’s some zen to what you should expect from benchmarks. And yet I think they’re better than they’ve ever been.
So the lesson learned over time is: we’ve got to make benchmarks out of something that isn’t just a random thing AI systems can’t do today — where, even if they could do it, I’m not sure I’d feel informed about anything other than that random niche. I think benchmarks used to look like that a lot, and they sometimes still do today, when people find quirks — whatever, “r’s in strawberry” or something — and make benchmarks out of them. But those are more hobby efforts on the side, and the big benchmarks people pay attention to have been centered on more meaningful distributions.
And I think this does point to the sort of progress you’re saying. And if you couple that with the perspective of modesty in inferring from benchmark results what impacts you’d expect on the world, then you can be very happy about benchmarking. Join me in happiness. The invitation’s open. It’s great here.
But the other thing I would say is: we have seen a lot of impact of AI on the world. We have this massive marshaling of societal resources to build more systems. The signals people needed to see in order to invest a lot of money — including, now, very meaningfully growing revenue from consumers, not just investors — were strong enough that people did say this is a big deal, and acted like it’s a big deal.
In some ways the benchmark progress did indicate real impact on the world. And the fact that we weren’t necessarily exactly right about the shape or the immediacy of what human-level performance on GPQA Diamond was — if you zoom out a little, maybe we were right? This is a big deal.
Or even going back further in benchmarking history — not that much further — to Winograd schemas, the ambiguous pronoun resolution tasks. This was included in — I forget where it was from — some list of, like, “AGI will be here when five things are true.” And one of them was a sufficient score on a more or less completely saturated Winograd schema test.
And I think what I was trying to get at was: look, this is a tricky task that requires world knowledge and fluency in natural language, and that’s got to be a big deal if it happens. Now, when systems started blowing this benchmark out of the water, the world wasn’t transformed literally overnight. But I think it was a big deal. It’s a big deal that we have AI systems that can do well on language tasks and can very flexibly use human language. This was one big blocker to AI being useful, and that blocker is mostly gone.
Tom
Well, but if AI had stopped progressing at the level where it did really well in the Winograd schema benchmark, I feel like we wouldn’t have seen that much impact.
Greg
I’m not sure that’s totally true. There’s maybe a narrow version of it that’s true. But if you give me a little rope: I think if AI progress had plateaued at GPT-4 levels, short of reasoning-model levels, there was already a lot of economic transformation — or whatever economic value — baked in, and it was just going to take a while to figure out how to use it everywhere. Linguistic flexibility, even without super precise reasoning, is a tech-of-the-decade kind of thing. That’s not bad.
And I think Winograd schemas being saturated probably was a meaningful sign that you were there. And if AI had plateaued, you still would’ve been like, “Wow. It used to be I couldn’t really talk to a computer, and now I can kind of talk to a computer, and that’s meaningful.”
And I think the benchmarks would’ve played their role in helping you at least dismiss the extremely reductive takes: “No, no, we used to have no idea how to solve these puzzles, it seems plausible that you need language skills to do it, and now you can. So — impact ahoy.”
Can an AGI benchmark exist? [00:38:26]
Anson
So one thing I wanted to make sure I’m understanding correctly, for both of you: do you guys think that an “AGI bench” can exist? A benchmark where, if you were to just train on it and hill-climb it and saturate it, now you’ve got AGI for sure?
Tom
I don’t find the term AGI very useful to begin with, because of this point that many, many people have made — I’m obviously not inventing it — that the capabilities of computers, even before AI, and now of AI systems, are heterogeneous: how good they are varies a lot across different things. And it seems like we could see huge impacts of AI on society and the economy before we have this generality where it can do all or almost all of the things that humans can do.
I just don’t think this AGI label is that useful. And instead we should be saying, “What are the capabilities that we think are especially relevant and important?” — and let’s try to build benchmarks for those.
Greg
I do think there’s a spirit to your question that’s fine. You could have a breadth of benchmarks, and I could concatenate them and say: here’s my mega-benchmark. Do I think that’s possible to build? I think it’d be very expensive. We’re talking a lot of tasks.
And I think there’s sort of a magic ingredient sitting behind these things, which is something like generalization: will we get a system where doing well on one task is strong evidence that it will be able to do well on another task? Humans sort of have something like this quality.
So I think this generalization question is very interesting. There have been attempts at benchmarks that could help you identify general reasoning — this is what ARC-AGI is supposed to be all about, you know, AGI arrives with ARC-AGI-6 or whatever. And I think that’s actually sort of a plausible view. They clearly haven’t pushed this to the human extremes, but there are other approaches you could take to try to measure this kind of out-of-distribution generalization — in-context learning kinds of things.
One idea I’ve heard discussed: you get the latest video game that’s popular on Steam and you see if an AI system can play it well, and that gives you some sense of whether it has generalized.
Tom
But I guess you might worry that even this concept of generalization, once you look under the hood, is this super weird multi-dimensional thing, and we can’t really conclude that much from performance on some random new video game on Steam. Maybe it just doesn’t tell us much about what happens if I bring in an AI as a new temp worker for some low-level administrative task — how well will it do on that? I would still worry that what you end up measuring is: can it generalize within this specific sub-domain, or at this type of task?
Greg
Of course. I do think there’s room for somewhat cautious optimism here because we have in fact seen sparks of AGI. I do think that’s a fair characterization, that we have seen some degree of generalization — unclear how much of that was from shoving things into the training distribution. That’s a big question.
But you could maybe hope to detect something like this. Like, whatever, we have a benchmark for boring temp work that we keep hidden and we have a benchmark for video games or whatever, and we see if progress is made at the same time on both of them. And if it was, I would say we’re seeing an interesting thing emerging.
But it’s also, of course, hard to know whether that just happened because someone in the lab happened to buy an RL environment that looks a lot like one of your hidden benchmarks. Ideas aren’t — it’s hard to be that original. So I do think this is something of a question.
But again, these lists of things that will herald AGI — I don’t think they have been terribly off base. I actually think we’ve learned some lessons. What are things that have not heralded AGI? I think that would include chess: Deep Blue beating Kasparov was not an AGI moment.
However, the techniques developed there — there’s still a little bit of, “No, it was correlated with the same thing that society was trying to do for a while.”
But fine, call that a loss. But I think a win is, these sort of broadish, hodgepodge of tasks show some general capability. And then, I don’t know, maybe this generalization is still something benchmarks should be paying attention to, over and above any particular task.
Beyond automated scoring [00:43:18]
Anson
So given all of these things — it sounds like in terms of saturation, you guys don’t think that the benchmarks are necessarily doomed. In the case of how much they can generalize, there are a lot of interesting questions, and I guess it’s a bit more complicated.
I think there is still a big looming question here, which is: where do we go next with benchmarks? What exactly will benchmarks look like in the future?
Tom
So one kind of categorization I find useful is in terms of how benchmarks are scored. The first category is completely machine-checkable: you have an algorithm, not based on language models, that just checks correctness — basically all traditional benchmarks. The second is some form of LLM-as-a-judge. And the third category is human judging — non-automated judging, where you just have humans score the AI outputs.
So I’m interested in people figuring out how to do the second category well. And then human grading: historically, I think it would’ve been basically ludicrous, because human time is just way too costly. When we had benchmarks with like a thousand samples and so on, it just wouldn’t have been feasible.
You know, now we’re seeing things like much smaller benchmarks, or actually even just demos like Anthropic’s C compiler, where there’s a single output and running the benchmark might be in the tens of thousands of dollars. There, maybe there’s a form of human rating that could make sense.
There’s so much more to explore with these alternative scoring methods. There’s a lot more juice to be had even in the completely algorithmically scoreable category.
Greg
It’s funny how I almost feel like we’ve got two poles here that are both very promising, and then this tempting — but I’m not sure how much I believe in it — middle ground of relying on fuzzy qualitative AI judgment for assessing AI outputs.
We’ve rarely had benchmarks outside of this math, science, coding domain. There are some attempts at creative writing benchmarks, and they’re good — I mean, no shade — but they’re just not that deep or compelling. And outside of that, it’s not only been this first —
Tom
— there are things that try law. I’m really interested in white-collar work that isn’t STEM. But I wish I had the time — I haven’t had the time to look at the literature.
Greg
We can talk a little about some of these. I think it’s interesting.
Recently, Epoch wrote a report reviewing three benchmarks that try to target economically valuable work outside of coding, math, science. And I think there are some interesting entries there. It’s also interesting to look at how they’re graded, because none of them are in this first category you were describing.
One called APEX-Agents targets tasks in corporate law, management consulting, and investment banking, and it uses detailed rubrics against which an LLM then assesses the output. It’s things like, “Did this customer’s data breach described in these documents violate GDPR, which you have a copy of over here — and here’s the contract the customer had with their client?” And the rubric says, “You lose a point if you don’t say how clause 10.3C or whatever was or was not violated” — so it’s very granular. I think I believe that this is doable.
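Mechanically, that kind of rubric grading can be as simple as looping a judge model over rubric items. A minimal sketch, assuming a hypothetical call_llm helper and illustrative rubric text (this is not APEX-Agents’ actual grader):

```python
import json

# Hypothetical rubric items, in the granular style described above.
RUBRIC = [
    "States whether the data breach described violated GDPR, citing articles.",
    "Explains how clause 10.3C of the client contract was or was not violated.",
    "Recommends concrete next steps for the client.",
]

JUDGE_PROMPT = """You are grading an AI-written legal memo against one rubric item.

Rubric item: {item}

Memo:
{memo}

Respond with JSON only: {{"satisfied": true or false, "justification": "..."}}"""

def grade(memo, call_llm):
    """Score the memo as the fraction of rubric items the judge marks satisfied.

    `call_llm` is a stand-in for whatever LLM client is available: it takes a
    prompt string and returns the model's text response.
    """
    hits = 0
    for item in RUBRIC:
        verdict = json.loads(call_llm(JUDGE_PROMPT.format(item=item, memo=memo)))
        hits += bool(verdict["satisfied"])
    return hits / len(RUBRIC)
```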
The other two benchmarks we looked at — GDPVal from OpenAI and Remote Labor Index from a Scale/CAIS collaboration — are just graded by humans. They just bit the bullet, and I think this is great. And it’s interesting: GDPVal is close to saturated, but Remote Labor Index is definitely not.
Tom
How many tasks are in Remote Labor Index, and do you know how much they paid the graders?
Greg
So, all good details to ask about. I don’t remember the exact task number — it’s in our report — but it’s on the order of a hundred, not 10. And they don’t give us much on how much they pay the graders.
The graders are given the AI output and the spec from the customer — these are real tasks taken from the gig-work platform Upwork — along with the reference output that was accepted by the customer. And they’re asked, “If this is what the customer was looking for, would this other output likely satisfy them?” What I take this to mean is, “Is the AI output even in the ballpark of the human output?” That’s a lower bound on the quality of judgment.
Most of these tasks — and to be clear, this was an innovative take — have multimedia outputs. So it’s kind of a visual gist judgment. And right now, at least, the failures are just dramatic. The first author on the paper was describing a test case they have of, like, “We asked you to draw the Superman logo in Inkscape and you submitted an unrecognizable blob.” That’s the level that models are at here.
I think fine-grained judgments will get harder, but I don’t think they invested that much in the human rating, and I sort of believe that’s perfectly reasonable. The benchmark is at least good enough to tell us the binary of, “Can AI even come close to doing this sort of task? Are the deficiencies fine-grained, or is it not even close?” And the answer there is that they’re not even close. And as you were saying, this is primarily a new form for benchmarks.
I’ll mention one other form that I think is a good example for going forward — incidentally, people paid a lot of attention to it, but I don’t think they appreciated it as a human-judged benchmark — which is the International Math Olympiad.
So for those who don’t know, this is a contest where some very smart high school kids solve math problems by writing proofs — arguments — and there’s a very well-developed, decades-long process for human judges to score these purported solutions from the students. They’re all double-judged, and the judges are given very extensive rubrics ahead of time, but they also evolve those rubrics during the scoring process as new things come up. And there’s an argument back and forth where the judges get to present their assessment. It’s very involved, very labor-intensive.
And Google’s solutions were submitted anonymously — that is how Google scored. The IMO gold claim from Google was properly judged by the same process used for judging humans.
I think it’s an amazing benchmark, and no one batted an eye at this. You know, this was a really good methodological benchmarking win that hadn’t really been done before. And it was just done by hooking into existing human infrastructure for judging work output. And I think for category three, this is something to be emulated.
Tom
Just to give you the opportunity to hammer home your point — what are some other examples of using these kinds of existing structures?
Greg
I’m worried I’m not remembering the one you maybe liked, that I mentioned when we were chatting earlier. But one category I can imagine is anything where there is currently a human contest you could submit something like this to.
So this isn’t the one I said earlier, but I was just thinking — there are various awards for fiction. So if you want to have your AI system write a novel and submit it — there are ethical concerns and ways of trying to make sure we’re not flooding the inboxes of editors and whatnot — but a very reasonable benchmark, in my opinion, for creative writing would be submitting a short story to a short story writing contest and have it graded or voted on the same way. I think this is a very reasonable benchmark.
What was, yeah —
Tom
Yeah, that one’s great. I also think just peer review — academic review of papers. Especially as AI becomes more important and gets used in academia a lot, you should eventually be able to persuade reviewers to spend five or ten percent of their reviewing time evaluating these AI outputs. Maybe AI labs would pay them a lot for that time.
This seems pretty feasible, and a way that you just hook into this infrastructure that applies to any time a paper is reviewed. So it’s pretty much any area of science.
Greg
And I think, unfortunately, there is something of a refereeing crisis — meaning a labor shortage — in certain academic fields. But this could be a synergistic opportunity: pay the money to solve that problem, and then have some gated process by which autonomously AI-authored papers are submitted to NeurIPS or whatever, and the benchmark is to get accepted, or win best paper, and get whatever accolades. I think this is pretty good.
And stepping back, one thing that’s funny about benchmarking is that, again, it used to be this almost purely academic exercise done right alongside the people who were developing the models. Now there are companies with annual budgets in the tens of billions of dollars, and growing, for AI system development.
And surely they’re not hanging off every word of benchmarks made by little shops like us. They have highly resourced internal benchmarking suites and they are surely trying to evaluate their systems.
I imagine part of what they’re doing, with the help of the data-collection companies, is trying to extract just such cases from real-world internal corporate use. Only some of these processes are legible to us as industry outsiders — peer review, or the existence of contests for public-facing consumer output; in the vast majority of industries they aren’t. But if you were in the guts of an insurance company, you’d have all sorts of, “Oh, here’s the step in the process where the senior claims adjuster signs off on a report authored by a junior claims adjuster” — and that’s their whole darn job, to do this.
And so I’m sure someone somewhere is trying to collect data to replicate that — and maybe even running human trials with some regularity. Okay: we did our messy RL-environment approximation of this, a shoddy benchmark used purely internally; we trained on that data; now we have a validation set and it looks like it’s doing well; and now we’re going to do some taste testing — which they wouldn’t necessarily call a benchmark, but it is a benchmark — of having a real senior insurance claims adjuster take a look at this report that the AI system tried to generate.
And to be clear, that’s exactly what GDPVal is sort of trying to externalize and do. But these are still relatively self-contained tasks, and I think we should just expand their scope — tasks that would take a human less than a day to do, or something. If that’s saturated, let’s go to week-long projects, and you get what you get.
Benchmarking in messy real-world contexts — I think that’s just where benchmarks will go. These might look more like case studies, and I think this is fine. You can have standardized-method case studies. And I think we should remember that every 18 months or so we see a big spike in capabilities. If we’re really in that regime, we shouldn’t feel too bad about running a “Can AI do this thing it obviously can’t do yet?” kind of contest every four months or something like that.
And then, you never know when the next spike is going to come. So set a baseline of AI not being able to do these things, hone your methodology so you’ll be able to say when a big spike has happened. And this will, I think, be a very fruitful mode for benchmarking to be in. And if anything, we’ll have less of a gap between the things we really care about and the raw benchmarking numbers.
And yes, it’ll be more expensive, and it won’t have some of the nice features that current cheapness has — like, there’s an interesting fast model from an upstart Chinese company; can we run it on the benchmark right now? That’s very easy to do today, and it won’t be so easy. But this is, I think, an acceptable price to pay. The scores will move slower but will come regularly. I think this will be very informative, and we’ve hardly explored it at all — there’s plenty to squeeze there.
Tom
So — if I think about what’s next after some version of MirrorCode is released — a few things seem kind of interesting. One is staying within the MirrorCode idea of easily scoreable software engineering tasks.
Seeing as AIs are pretty good at MirrorCode-style reimplementation, can we see whether this generalizes if you put the AI in something more akin to the situations humans are in? So you would be pushing the frontier, with access to any code base you want, any existing tools.
And so the examples there would be like: can you speed up some widely used software where speed is a real bottleneck and there’s already been a substantial amount of effort on optimizing it?
One example that comes to mind here is Rust compilation. People really like the Rust programming language but complain a lot about the compilation being slow, because it fundamentally just has to do a lot more with borrow checking and other things than other languages. Yeah, that feels like a kind of natural next step, “Oh, AI is really good at precisely specified software tasks.” Can we get it to a point where this would produce an artifact that would actually be useful in the real world? That’s kind of one angle.
Something else I’m interested in is — a lot of people are interested in the effect of AI on speeding up AI R&D, and I’m quite curious to think about the question of how much of those tasks are kind of MirrorCode-style, where there’s a pretty clear goal or metric? How well does AI do with those?
So one thing I’m actually a little bit confused about — I should look into it more — is RE-Bench, which seems to have this property. It actually seems quite similar to MirrorCode, in terms of knowing precisely what to do and being able to get feedback as you go.
My impression is that it’s not the case that every single RE-Bench task is at, you know, superhuman, more-than-eight-hour time-horizon levels. I’d like to understand more about whether that’s the case and, if so, why — and then potentially see if there are benchmarks we can design that really target this AI R&D thing.
Greg
One thing I’ll throw in: I think this sort of magic ingredient of out-of-distribution generalization is a topic benchmarks can take a crack at. And we’ve done a little bit of work on this. We have a chess puzzles benchmark that shows some interesting patterns in how models perform — they make sort of halting progress, presumably because labs care less about optimizing for this. But if you had a general-purpose reasoner that could solve super hard math problems, you should be able to work through a chess puzzle. You want this kind of benchmark to be moderately secret and not too high-profile, so that the labs don’t focus too much on it — ARC-AGI became a bit bench-maxed, somewhat.
For specific ideas, one that I happen to like — who knows if we’ll make anything of it — is trying to push more into physical-world tasks. There was this lovely little blog post of someone trying to get Claude to teach him how to make coffee, just by taking photos of where he was and asking it for instructions. And I think that’s interesting because you can imagine all sorts of impacts on the world if LLMs are good as brains — for robots, perhaps, but even just for humans navigating the world. They can provide all sorts of skill uplift if they can tell you how to, whatever, replace this machine part in your car or in a factory.
So I think we can just sort of start to look more broadly at what are the bottlenecks to all sorts of economic impacts. And there are probably some — what I’d say are regular old benchmarks — that probably can fit reasonably into that framework.
How AI changes benchmark building in practice [01:00:45]
Anson
How do you envision the benchmark building process when in a couple of years you have lots of AIs that are helping you speed up the process itself? What do you think that looks like?
Greg
Have I really drunk the Kool-Aid if I don’t have an off-the-cuff answer to what I’ll do with all my agents?
The software engineering style of this seems maybe a little more concrete to imagine?
Tom
It seems like there’s an abstraction ladder in interacting with a coding agent. At the bottom, you might say: in this particular function, factor out this particular thing into a helper function. And that’s basically like typing it yourself — it’s so specific it might just save a little bit of time versus doing it manually.
And actually, sometimes I do this for an instrumental reason, which is then the AI has in context that this has just been done. Whereas it’s a little bit more annoying to get it into the AI context if you do it manually. So that’s sort of the bottom. And then you can go up and up and up this abstraction ladder where the instructions you give the AI are more and more high-level.
I don’t think I really have a useful more concrete picture or prediction beyond that.
Greg
I think there is a bottleneck in some benchmark design around taste in tasks. I do feel like it would be a big unlock if AI systems had some of this taste, which I feel they don’t do a great job with today. Where, for example, if I say, “Give me examples of problems that fit the rubric for open problems” — I haven’t been impressed with what they turn up, and it’s a little bit of an unusual —
Tom
Yeah, they don’t have great taste for coming up with MirrorCode target programs. But there’s the fact that they know every single thing in computer science, in computing — so you can just ask them to keep generating more ideas and then pick based on your own taste.
And also, even during the development of this benchmark, Opus 4.5 and 4.6, I feel like they’re already better at coming up with suggestions that meet more of the criteria.
Greg
I mean — in case it’s not obvious — a couple of steps up the abstraction ladder would be: a human researcher sets up the framework for the benchmark, with plenty of assistance on coding whatever infrastructure is necessary. And then you come to the part where you have to fill out all the tasks.
Often you sort of start with that, to make sure there are some tasks. But you get to a point where you want 10 or a hundred of these things, and there’s some work to do to even come up with what they should be. You ask an AI system, and you can sort of trust that it will mostly come up with good ideas that are worth your time to engage with and quality-control. That’s a couple of steps up from: I came up with a task and now I’m going to get a lot of help from the AI to implement it. Or: I see what’s wrong with the current version, and I’m going to give it some feedback and have it take a turn on the code.
Take the chess puzzles benchmark: Gemini 3 Pro wrote all the code for it, but it was me looking at the output and saying, “Ah, these chess puzzles are lacking this feature.” Or, “Our search for chess puzzles with the characteristics we want isn’t turning much up. I think X, Y, Z is wrong. What do you suggest?” And it’s helpful and productive, but a human is a couple of layers up, obviously.
Building the whole thing from scratch? When do we just say, “AI, I would like a benchmark in this domain”? I don’t know. Presumably it’s on the path out there, but that does feel a couple of turns away. Call it six months to three years, conservatively — but I’d skew towards the later end of that.
Anson
I think this is interesting, also a little funny. “Hey, we need a benchmark for benchmark taste.” You can see if the AIs can themselves make the benchmarks.
Greg
Yeah, I do think some of our benchmarks have elements of taste baked into them — in these “don’t expect it to generalize too well” kinds of ways, but maybe with useful angles on it.
Like, even MirrorCode, some of the more complex programs, you need — call it architectural taste — to make it not fall apart. And we’ll see if the models have that for the harder ones. Or some of the open problems — you might need what a human would call taste for the harder problems. We’ll see in hindsight, I don’t know.
Anson
Okay, cool. I think this is a good place to end. Thank you both for coming on the podcast. It was a good chat.
Tom
Thanks.
Greg
Thanks, Anson. Thanks, Tom.