Assessing heterogeneity in METR's late 2025 developer productivity experiment

Update: Fixed exponentiation of estimated parameters.

Summary

I use data from METR's recent developer productivity experiment to assess possible heterogeneity in the effect of AI on time to complete a task. Relative to a sample-wide 6% speedup, I estimate a 12% speedup for tasks that developers predicted (prior to treatment assignment) would take substantially less time with AI than without, and a 25% speedup for the developer in the study with the highest estimated speedup.

In this post I'm going to focus on giving a descriptive view of what this data looks like and what we may be able to estimate based on it. In a later post I may give some opinions on possible takeaways from this in terms of actions people and organizations might take. I also focus on estimation rather than characterizing confidence or uncertainty.

Why care

  • METR interprets their relatively small overall speed-up to indicate bias due to selection on both developers and task. Understanding heterogeneity can assist with understanding how plausible this is, and may also help with designing future experiments.
  • Heterogeneity in the effect of AI on productivity may be strategically relevant to businesses and could also be relevant for planning for and developing policy around AI-induced labor market impacts.

METR's results

METR describes their results like this:

Our raw results show some evidence for speedup. Our early 2025 study found the use of AI causes tasks to take 19% longer, with a confidence interval between +2% and +39%. For the subset of the original developers who participated in the later study, we now estimate a speedup of -18% with a confidence interval between -38% and +9%. Among newly-recruited developers the estimated speedup is -4%, with a confidence interval between -15% and +9%.
 
However the true speedup could be much higher among the developers and tasks which are selected out of the experiment. Some developers self-report very high speedups, though as we documented in our earlier study those estimates can be quite unreliable.
 
Due to the severity of these selection effects, we are working on changes to the design of our study. Below, we provide further detail and describe our plans for other means of studying the impact of AI on developer productivity.

From what I can tell, METR estimates speedups by regressing log implementation time against log estimated time without AI (estimated by developers prior to random assignment of tasks to the AI vs no-AI condition) and an indicator for whether AI was allowed or disallowed for a given task. This makes sense since estimated time without AI appears to be well-correlated with implementation time, and so helps control for variance in implementation time that is driven by the task itself rather than the treatment:

[Figure: grid of plots showing implementation time against developer-estimated time without AI]

Fitting this regression to the full sample that METR makes available, I get a sample-wide estimated speedup of 6%.
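As a rough sketch of what this regression looks like, here is a version using statsmodels on synthetic data. This is not METR's actual data or code; all generated values (forecast distribution, noise level, the assumed 6% effect) are made up for illustration. The key detail, noted in the update at the top of this post, is that the AI coefficient lives on the log scale and must be exponentiated to recover a multiplicative effect on time.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the experiment data (all values hypothetical):
# one row per task, with a developer forecast and an AI/no-AI assignment.
rng = np.random.default_rng(0)
n = 200
est_no_ai = rng.lognormal(mean=4.0, sigma=1.0, size=n)  # forecast minutes without AI
ai = rng.integers(0, 2, size=n)                         # 1 = AI allowed
true_beta = np.log(0.94)                                # assumed ~6% speedup, log scale
log_time = 0.5 + 0.9 * np.log(est_no_ai) + true_beta * ai + rng.normal(0, 0.4, n)

df = pd.DataFrame({"log_time": log_time, "log_est": np.log(est_no_ai), "ai": ai})
fit = smf.ols("log_time ~ log_est + ai", data=df).fit()

# Exponentiate the log-scale AI coefficient to get a multiplicative effect
# on implementation time; speedup = 1 - exp(beta_ai).
speedup = 1 - np.exp(fit.params["ai"])
```

With the effect planted at 6%, the recovered `speedup` should land in that neighborhood, up to sampling noise.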

Heterogeneity across tasks

To assess heterogeneity across tasks, I use the estimates developers provided, before tasks were randomized to either condition, of how long each task would take with and without AI. We might hypothesize that tasks where developers estimate this gap to be larger would see a larger speedup. I initially tried using the raw difference between the AI and no-AI estimated times, but I was getting some odd results and wasn't sure whether outliers or some other issue was complicating matters. I ended up transforming the two estimates into an indicator for whether the no-AI estimate exceeded the AI estimate by at least 60 minutes. For use in the "heuristic estimate" section below I wanted an estimate that leans towards larger speedups, and some experimentation suggested that choosing a threshold that selects for larger differences in developer-estimated times yielded larger differences in the estimated effect of the AI condition.

I fit a regression model that adds a main effect for this indicator and the AI treatment × indicator interaction. This produced an estimated AI speedup of 5% when the no-AI estimated time did not exceed the AI estimate by at least 60 minutes, and a speedup of 12% when it did.
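A sketch of this interaction model, again on synthetic data with made-up effect sizes (I plant roughly the 5% and 12% subgroup speedups described above; `gap60` is my hypothetical name for the indicator):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
log_est = rng.normal(4.0, 1.0, n)   # log of forecast time without AI
ai = rng.integers(0, 2, n)          # 1 = AI allowed
gap60 = rng.integers(0, 2, n)       # 1 = no-AI forecast exceeds AI forecast by >= 60 min
# Assumed true effects: ~5% speedup for small-gap tasks, ~12% for large-gap ones.
beta_ai = np.where(gap60 == 1, np.log(0.88), np.log(0.95))
log_time = 0.5 + 0.9 * log_est + beta_ai * ai + rng.normal(0, 0.4, n)

df = pd.DataFrame({"log_time": log_time, "log_est": log_est, "ai": ai, "gap60": gap60})
fit = smf.ols("log_time ~ log_est + gap60 + ai + ai:gap60", data=df).fit()

# Subgroup speedups: the AI main effect for gap60 == 0, and the main effect
# plus the interaction for gap60 == 1, each exponentiated off the log scale.
speedup_small_gap = 1 - np.exp(fit.params["ai"])
speedup_large_gap = 1 - np.exp(fit.params["ai"] + fit.params["ai:gap60"])
```

The interaction coefficient captures how much the (log-scale) AI effect differs between the two task groups.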

Heterogeneity across developers

To assess heterogeneity across developers, I estimate a linear mixed effects model with fixed effects for the AI treatment assignment and log estimated time without AI, and developer-level random intercepts and random slopes for the AI treatment assignment (I don't include the indicator discussed in the previous section). I estimate the developer-level AI treatment effect as the sample-wide effect in this model (a 7% speedup) plus the developer-level random slope. These estimates look like this (negative values indicate speedups):

[Figure: histogram of estimated developer-level AI treatment effects]

The maximum estimated speedup in this sample based on the method above is 25%.
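A sketch of the mixed-effects step with statsmodels' `MixedLM`, on synthetic data. The number of developers, tasks per developer, and the spread of developer-specific effects are all assumptions; the point is the mechanics of combining the fixed AI effect with each developer's random slope and exponentiating.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic panel: developers with their own intercepts and AI slopes
# (all distributional choices here are hypothetical).
rng = np.random.default_rng(2)
n_dev, tasks_per_dev = 15, 30
rows = []
for d in range(n_dev):
    dev_slope = rng.normal(-0.07, 0.08)  # developer-specific log-scale AI effect
    dev_icpt = rng.normal(0.0, 0.3)
    for _ in range(tasks_per_dev):
        log_est = rng.normal(4.0, 1.0)
        ai = rng.integers(0, 2)
        log_time = (0.5 + dev_icpt + 0.9 * log_est
                    + dev_slope * ai + rng.normal(0, 0.4))
        rows.append((d, log_time, log_est, ai))
df = pd.DataFrame(rows, columns=["dev", "log_time", "log_est", "ai"])

# Random intercepts and random AI slopes grouped by developer.
fit = smf.mixedlm("log_time ~ log_est + ai", data=df,
                  groups=df["dev"], re_formula="~ai").fit()

# Developer-level effect = fixed AI slope + that developer's random slope,
# exponentiated back to a speedup on the original time scale.
dev_speedups = {d: 1 - np.exp(fit.fe_params["ai"] + re["ai"])
                for d, re in fit.random_effects.items()}
max_speedup = max(dev_speedups.values())
```

Note that the per-developer random slopes are shrunk toward zero by the mixed model, so the maximum here is more conservative than fitting each developer separately.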

Heuristic estimate accounting for selection

As a heuristic method for getting a sense of how much selection might impact these numbers, I tried the following. Keep in mind that this can't truly "correct" for any selection because we don't have data on the observations we might be missing. This is only intended as a rough attempt to understand how selection might be impacting these results.

METR says in their post that "an increased share of developers say they would not want to do 50% of their work without AI" and that "30% to 50% of developers told us that they were choosing not to submit some tasks because they did not want to do them without AI". Let's assume we are missing 50% of tasks/developers and that these are selected to have systematically higher effects of AI on productivity. We might think that observations excluded in this way are near the higher end of what is observed in the sample that we do have. To simulate this, I estimate a "synthetic" sample-wide effect by averaging the observed sample-wide effect with a version adjusted by the difference between the two effects estimated in the "tasks" section (averaged because we are assuming 50% of the sample is missing due to selection, i.e. a 0.5 weighting on both the observed and "missing" portions). Similarly, I estimate a synthetic developer-level effect by averaging the maximum developer-level effect with that maximum adjusted by the same task adjustment amount. I then estimate an overall synthetic effect by averaging the tasks-only synthetic effect and the developer-level synthetic effect. This gives an estimated 20% speedup.
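The arithmetic above is easy to lose track of, so here is one reading of it, with averaging done on the log scale (whether the original calculation averaged on the log or raw scale is my assumption; the numbers plugged in are the ones reported in this post, and the 50% missing share is the assumption stated above):

```python
import numpy as np

# Convert between a "speedup" s and its log-scale effect log(1 - s).
def to_log(speedup):
    return np.log(1 - speedup)

def to_speedup(log_effect):
    return 1 - np.exp(log_effect)

observed = to_log(0.06)                          # sample-wide speedup
task_adjustment = to_log(0.12) - to_log(0.05)    # gap between the two task subgroups
max_dev = to_log(0.25)                           # largest developer-level speedup

# 0.5/0.5 averaging of observed and adjusted ("missing") effects,
# reflecting the assumed 50% missing share.
synthetic_tasks = 0.5 * observed + 0.5 * (observed + task_adjustment)
synthetic_dev = 0.5 * max_dev + 0.5 * (max_dev + task_adjustment)
overall = to_speedup(0.5 * synthetic_tasks + 0.5 * synthetic_dev)
```

Run this way, `overall` comes out a little over 19%, consistent with the roughly 20% figure quoted above.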

I'll note that I intentionally use values that lean towards high speedups here in order to understand possible selection effects. None of these numbers should be taken as my personal best-guess or to imply that I am taking any particular position on what speedups are likely.