You Don't Need Autoresearch. You Need an Input, a Verifiable Output, and a Loop.

Karpathy shipped autoresearch and named a pattern everyone could already run. The real insight is that the loop works anywhere you can define a verifiable output. You can apply it today, in any field, without waiting for a framework.

05 Apr 2026 · 5 min read

Karpathy shipped autoresearch and GitHub exploded. My timeline turned into people posting overnight val_bpb curves and captioning them like they cured cancer. I pulled up the repo and read it: 630 lines of training code and a markdown file called program.md. That's the whole thing.

I sat with that for an evening and argued both sides of it. My brain went: wait, this is just a for loop with an LLM inside it. I ended up somewhere I didn't expect, because the loop is not the point. The pattern is the point, and once you see the pattern, you realize you can run it on almost anything.

what autoresearch actually is

Stripped of the framing, the loop is this:

while True:
    agent modifies train.py
    run for 5 minutes
    if val_bpb < best:
        keep the change
    else:
        revert

An LLM reads the training file, edits it, and you run the edited version for a few minutes. Check if the loss went down. Keep or revert, then go again. The "research agenda" is a markdown file that tells the agent what to try, what not to break, and what metric to care about. Then the agent loops until you run out of compute or patience.

So when someone says autoresearch, what they actually mean is: LLM as proposal function, greedy hill climbing over a single scalar metric, with a keep/revert gate. That is not a knock. That's the whole insight.
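Made concrete, that gate is a few lines. Below is a minimal sketch, with the agent replaced by a stub `propose` that just nudges a constant in a toy script; the temp file, the metric-parsing, and all the names here are mine for illustration, not the repo's actual code.

```python
import pathlib
import random
import shutil
import subprocess
import sys
import tempfile

def run_metric(path):
    """Run the script and parse the metric it prints (lower is better)."""
    out = subprocess.run([sys.executable, str(path)],
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def keep_revert_loop(script, propose, steps):
    """Greedy hill climb: propose an edit, evaluate, keep or revert."""
    best = run_metric(script)
    backup = script.with_suffix(".bak")
    for _ in range(steps):
        shutil.copy(script, backup)       # snapshot before the edit
        propose(script)                   # stand-in for the agent's edit
        score = run_metric(script)
        if score < best:
            best = score                  # keep the change
        else:
            shutil.copy(backup, script)   # revert
    return best

# Toy demo: the "training script" prints its own loss, and the stub
# propose() nudges a constant. A real agent would rewrite the logic.
random.seed(0)
script = pathlib.Path(tempfile.mkdtemp()) / "train.py"
script.write_text("x = 0.0\nprint(abs(x - 3.0))\n")

def propose(path):
    first, rest = path.read_text().split("\n", 1)
    x = float(first.split("=")[1])
    path.write_text(f"x = {x + random.uniform(-1, 1)}\n" + rest)

best = keep_revert_loop(script, propose, steps=25)
```

Swap the stub for a real agent call and the toy script for your actual `train.py`, and this is the whole mechanism.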

what I got wrong in my first reaction

My first instinct was to roll my eyes, because technically this is something any engineer who has used Claude Code or Codex for more than a weekend has done. "Hey, tweak this config, run the benchmark, revert if it got worse, try again." That's a Tuesday afternoon workflow. But that reaction was missing the actual contribution.

Karpathy named the pattern, gave it a minimal repo, and showed real results. Within 48 hours people were running it on marketing funnels, query expansion models, Shopify's internal codebases, SRE runbooks. Thousands of engineers already had the pieces, but almost none of them were running the loop.

That is the thing that is easy to underestimate. Half the value of any good project is removing the activation energy so people actually do the thing. React didn't invent components. Docker didn't invent containers. Karpathy didn't invent "let an agent modify code and check if the metric improved." He just made the pattern legible enough that people started running it, and that is worth something.

the real insight is the ingredients

Once you strip the repo down to what it actually needs, you get three ingredients:

  1. An input. Something the agent can modify. Code, a prompt, a config, a SQL query, a runbook, a retrieval pipeline, a copy draft.
  2. A verifiable output. A scalar metric. Something you can compute automatically and trust. Loss, accuracy, latency, conversion rate, retrieval precision, click-through, pass@k, anything.
  3. A loop. Propose a change. Evaluate. Keep or revert. Repeat.

That's the whole thing. And once you see it stated like that, you realize the loop does not care whether the input is train.py. It works on anything.
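Written out, the three ingredients become arguments to one function that knows nothing about what it is optimizing. A minimal sketch, with a made-up config-tuning demo standing in for the input and metric; every name here is mine.

```python
import math
import random

def hill_climb(state, propose, score, steps, higher_is_better=True):
    """Ingredient 3: propose a change, evaluate, keep or revert, repeat."""
    best = score(state)
    for _ in range(steps):
        candidate = propose(state)
        s = score(candidate)
        improved = s > best if higher_is_better else s < best
        if improved:
            state, best = candidate, s
    return state, best

# Toy demo: the "input" is a config dict, the metric a made-up score
# that peaks at lr = 3e-4. Swap in any input and any metric you trust.
random.seed(1)

def propose(cfg):
    return {**cfg, "lr": cfg["lr"] * random.choice([0.5, 0.8, 1.25, 2.0])}

def score(cfg):
    return -abs(math.log10(cfg["lr"]) - math.log10(3e-4))

best_cfg, best_score = hill_climb({"lr": 1e-2}, propose, score, steps=200)
```

Nothing in `hill_climb` knows it is tuning a learning rate. That's the point.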

the pattern applies anywhere

Think about your own stack for a second.

RAG systems. Input: the retrieval pipeline (chunk size, embedding model, top-k, reranker prompt, query rewriter). Verifiable output: recall@k against a labeled set of queries, or an LLM judge score. Let an agent loop on the pipeline overnight.
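The metric half of that is small enough to write directly. A hedged sketch of recall@k over a labeled query set; the doc ids and the `pipeline_output` dict are made up, standing in for whatever your retrieval pipeline returns.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the gold docs that show up in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Labeled set: query -> gold doc ids. pipeline_output stands in for
# whatever your pipeline returned on those queries this iteration.
labeled = {
    "refund policy": {"d1", "d2"},
    "api rate limits": {"d7"},
}
pipeline_output = {
    "refund policy": ["d1", "d9", "d4", "d2"],
    "api rate limits": ["d3", "d7", "d8"],
}
mean_recall = sum(
    recall_at_k(pipeline_output[q], gold, k=3) for q, gold in labeled.items()
) / len(labeled)
```

That `mean_recall` is the scalar the loop hill-climbs on while the agent edits chunk sizes, rerankers, and query rewriters.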

SQL query optimization. Input: the query. Verifiable output: execution time on a fixed dataset. Agent rewrites the query, runs EXPLAIN ANALYZE, keeps the faster version.
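A self-contained sketch of that gate, using sqlite3 so it runs anywhere; the table, the index, and the two query variants are invented for illustration. A real Postgres setup would lean on `EXPLAIN ANALYZE` instead of wall-clock time.

```python
import sqlite3
import time

# Fixed dataset: 50k rows with an index on v.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)")
con.executemany("INSERT INTO t (v) VALUES (?)",
                [(i % 100,) for i in range(50_000)])
con.execute("CREATE INDEX idx_v ON t (v)")

def timed(query):
    """Return (seconds, rows) for one execution of the query."""
    start = time.perf_counter()
    rows = con.execute(query).fetchall()
    return time.perf_counter() - start, rows

current = "SELECT COUNT(*) FROM t WHERE v + 0 = 7"   # the +0 defeats the index
proposal = "SELECT COUNT(*) FROM t WHERE v = 7"      # can use idx_v
t_cur, r_cur = timed(current)
t_new, r_new = timed(proposal)
if r_new == r_cur and t_new < t_cur:   # same answer and faster: keep it
    current = proposal
```

The `r_new == r_cur` check matters: a rewrite that returns different rows is not an optimization, it's a bug, and the gate has to catch it automatically.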

Prompt engineering. Input: a prompt template. Verifiable output: accuracy on a held-out eval set. Agent rewrites the prompt, measures, keeps if better.
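A sketch of that eval harness, with the model call stubbed out. `call_model`, the prompts, and the three-question eval set are all made up; the keep-if-better gate around them is the part that transfers.

```python
# Held-out eval set: (question, expected answer) pairs.
EVAL_SET = [("2+2", "4"), ("3*3", "9"), ("10-7", "3")]

def call_model(prompt, question):
    # Stub: pretend the model only does arithmetic when the prompt asks
    # for it. A real implementation would hit an LLM API here.
    return str(eval(question)) if "arithmetic" in prompt else "?"

def accuracy(prompt):
    """Ingredient 2: a verifiable output on the held-out set."""
    return sum(call_model(prompt, q) == a for q, a in EVAL_SET) / len(EVAL_SET)

prompts = [
    "Answer the question.",
    "You are an arithmetic assistant. Answer the question.",
]
best = max(prompts, key=accuracy)   # keep whichever prompt scores higher
```

In the real loop an agent generates the candidate prompts instead of a hand-written list, but the gate is identical.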

Marketing copy. Input: landing page copy. Verifiable output: A/B test conversion rate, or a proxy model scoring engagement. Agent rewrites the copy, measures, keeps.

Runbook tuning. Input: an incident response runbook. Verifiable output: resolution time on replayed incidents, or an LLM grading coverage. Agent edits the runbook, evals, keeps.

Data cleaning pipelines. Input: the cleaning script. Verifiable output: downstream model accuracy. Agent rewrites the cleaning logic, retrains, keeps if better.

None of these need autoresearch the repo. They need the pattern. And the pattern fits on a napkin.

where a for loop still wins, and you should use one

Let me be honest about the limits of this, because the excitement is running ahead of the reality.

If your search space is small, discrete, and you already know the knobs, a structured sweep beats an LLM agent. Write the for loop, use Optuna, or loop it in bash. It's faster, cheaper, deterministic, and it doesn't crash.

for wd_all in [True, False]:
    for init_scale in [0.5, 0.8, 1.0, 1.2, 1.5]:
        train(weight_decay_all=wd_all, init_scale=init_scale)

Ten runs. Forty minutes. Done. No agent needed.

The agent loop earns its keep in two cases. First, when you don't yet know what the knobs are, because the agent can read the code and propose knobs you had not thought to sweep. Second, when the modification is a joint edit across multiple parts of the system that a for loop cannot express. Restructure attention, change the positional encoding, adjust the optimizer's momentum, and retune the LR schedule all at once. That's the kind of change an agent can make coherently and a grid search cannot.

So the answer is not "agent loops are better than for loops." The answer is: they are different tools, and you need to know when each one wins. For most small, well-understood optimization problems, the for loop still wins. For open-ended exploration of an unfamiliar system, the agent loop earns its place.

the actual contribution

I came into this thinking the autoresearch repo was oversold, but I came out thinking I had underweighted the naming problem.

Everyone already had the ingredients: Claude Code, Codex, a training script, a metric, the ability to run a shell loop. Almost nobody was running it. Naming the pattern, shipping a minimal working example, and showing real numbers was what moved people from "I could do that" to "I am doing that." That is a legitimate contribution even when the mechanism is trivial.

It is also a reminder that you do not need to wait for the framework. You have the ingredients already. An input, a verifiable output, a loop. That is the entire toolkit.

my honest take

Autoresearch the repo is narrow: a training file, a markdown prompt, and a keep/revert loop. The mechanism is simple. The framing is what went viral.

But the pattern underneath it is not narrow at all. It works anywhere you can define a metric you trust, and that is most of engineering.

If you are sitting on a RAG pipeline that nobody has tuned properly, an SRE runbook that grew organically over two years, a landing page with no A/B harness, a SQL query that's slow but nobody wants to touch, a prompt that was written in ten minutes six months ago and has been running ever since, you already have the setup. You don't need a framework. You need to define the input, pick a verifiable output, and write the loop.

The real lesson from autoresearch is not "use this repo." It's "the loop is cheap, the pattern is universal, and you can run it tonight."

So go run it tonight.

🫡
