Letting an AI agent maintain OpenWPM
This post is an experience report on how I maintain OpenWPM. Since early 2026 that has mostly meant doing the work with an AI agent. It covers the workflow, the specific way it can go wrong, and the system I have built around the agent to keep its output trustworthy. The companion post covers what shipped; this one covers how.
Why nothing was getting done
OpenWPM is a hobby project. I maintain it in my own time, after a day job, and the binding constraint was never ideas. The issue tracker has had good ideas sitting in it for years. The constraint was activation energy.
Two things drained it. First, I am a slow typist: turning a clear idea into a working, tested change is a lot of mechanical work, and mechanical work is the hardest kind to start when you are already tired. Second, OpenWPM's test suite is slow: a full run is over half an hour locally and around twenty minutes in CI. The loop of make a change, wait, find out you got it slightly wrong, wait again is draining in a way that is hard to convey if you have only ever worked with fast tests. After a full work day, I usually did not have it in me.
So OpenWPM sat. Not because it was finished, and not because I had stopped caring. I had just run out of the specific kind of energy it needed.
What changed
In February 2026 I got a personal Claude Code subscription and started using an agent for the mechanical part of the work.
The point is not that the agent has ideas I do not. It is that the two things draining my activation energy, the typing and the waiting, are exactly what an agent is good at absorbing. I describe a change; the agent produces the diff. I kick off work and step away from the machine while the slow suite runs, instead of sitting there watching it. When tests fail, the agent picks them back up.
It is not only the purely mechanical work that this unlocks. OpenWPM's dependency-update script used to be a Bash script, update.sh, and Bash is miserable to work with the moment a script does anything non-trivial. Rewriting it in Python was an obvious improvement with no glamour attached: no new capability, just the same job in a language that fights back less. That is rote work, but it is not mechanical: you cannot transcribe your way through it, you have to re-implement the thing. It is exactly the band of work a tired maintainer skips forever. The agent simply did it, and the script is now update.py. Being ordinary Python rather than Bash also made it easy to extend: it now derives the release version from the git tags, so a mistake I had already made by hand, shipping v0.33.0 without bumping the version file and not noticing, cannot silently recur.
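To make the tag-derived version idea concrete, here is a minimal sketch. It is not the code in update.py: the function names are hypothetical, and it assumes release tags follow the vMAJOR.MINOR.PATCH shape of v0.33.0.

```python
import re
import subprocess

# Assumption: release tags look like v0.33.0; anything else is ignored.
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def latest_version(tags):
    """Return the highest vMAJOR.MINOR.PATCH tag as an int tuple, or None."""
    versions = []
    for tag in tags:
        match = TAG_PATTERN.match(tag.strip())
        if match:
            # Compare numerically, so v0.33.0 sorts above v0.9.0.
            versions.append(tuple(int(part) for part in match.groups()))
    return max(versions) if versions else None


def tags_from_git(repo="."):
    """List the tags of a repository; assumes git is on PATH."""
    result = subprocess.run(
        ["git", "-C", repo, "tag", "--list"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

Calling latest_version(tags_from_git()) would then replace a hand-maintained version file as the source of truth, which is the property that matters here: there is nothing left to forget to bump.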
For a time-constrained hobbyist, that is the difference between a project that moves and one that does not, and it felt that significant. But a feeling is not worth much on its own. The rest of this post is the part that matters: what the agent costs, and the system I have built around it.
The flaky-test hazard
Here is the specific way this goes wrong.
OpenWPM's test suite is not just slow, it is genuinely a little flaky. It drives a real browser, and real browsers have timing-dependent behaviour. So a reasonable-sounding instruction to an agent is: if a test fails and you are confident it is just flaky, retry it.
That instruction is an epistemic hazard, and it is worth being precise about why. OpenWPM is a measurement tool. Its entire value rests on the data it collects being trustworthy. When a study reports "we observed X across N sites," the instrumentation has to have actually observed X. A regression in that instrumentation does not crash; it silently produces slightly wrong data, and every study built on that version inherits the error.
The problem is that a test catching such a regression and a test that is merely flaky look identical from the outside: both are red on one run and green on the next. An agent, or a tired human, allowed to decide "that one was just flaky" and re-run until green will, eventually, wave a real regression through. The retry does not resolve the flakiness; it launders a signal you needed into noise you ignored.
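The arithmetic behind that claim is short enough to write down. The sketch below assumes a regression that makes a test fail intermittently with some fixed probability per run, and that runs are independent. Both are simplifying assumptions, and the numbers are illustrative, not measurements from OpenWPM.

```python
def pass_probability(fail_rate, attempts):
    """Chance that at least one of `attempts` independent runs comes up green.

    fail_rate is the per-run probability that the regression turns the test red.
    """
    if not 0.0 <= fail_rate <= 1.0:
        raise ValueError("fail_rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # All attempts red has probability fail_rate ** attempts;
    # anything else means a green run was seen and the change went through.
    return 1.0 - fail_rate ** attempts
```

Under those assumptions, a regression that turns the test red half the time still gets through with three attempts in pass_probability(0.5, 3) of cases, which is 0.875. Every extra retry pushes that number closer to one.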
This is the inconvenient part, and it is why the rest of this post exists. It means the agent's confidence is not something I can take at face value on exactly the question that matters most.
Review is the new bottleneck
There is a precise way to describe the limit here, and it is Amdahl's law. It is usually stated for parallel computing: if some fraction of a task cannot be sped up, that fraction sets a hard ceiling on the total speedup, however fast you make everything else. Take a task that is 90% parallelisable, drive that 90% to zero, and you have still only gained 10×. The serial 10% is now effectively the whole cost.
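For readers who prefer the formula, here is the law as a small function. The parameter names are mine; the formula itself is the standard statement of Amdahl's law.

```python
def amdahl_speedup(improvable_fraction, factor):
    """Overall speedup when `improvable_fraction` of the work runs `factor` times faster.

    Amdahl's law: 1 / ((1 - p) + p / s), with p the improvable fraction
    and s the speedup applied to it.
    """
    if not 0.0 <= improvable_fraction <= 1.0:
        raise ValueError("improvable_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    untouched = 1.0 - improvable_fraction
    return 1.0 / (untouched + improvable_fraction / factor)
```

With improvable_fraction at 0.9, the result approaches 10 as factor grows without bound, and never exceeds it. A merely tenfold improvement of that 90% yields a little over 5× overall.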
The same accounting applies to maintenance. A change to OpenWPM is part mechanical, typing and waiting on tests, and part judgement: deciding what to build, and reviewing what comes back. The agent drives the mechanical fraction towards zero. It does nothing for the judgement fraction; if anything it enlarges it, because now there is a diff to scrutinise that I did not write and cannot vouch for by default.
So the speedup is real, but it is bounded, and the bound has a name: review. Typing and waiting used to be the slow part, with review cheap beside it. That has inverted. Review is the bottleneck now, and under Amdahl's law it is the only remaining work whose improvement still moves the overall number.
This is why the recap post keeps insisting on guardrails enforced statically rather than by discipline. A check a machine runs is review I do not have to do, the one kind of effort that attacks the bottleneck instead of the part that is already cheap. More tests, stronger static analysis, types that make an invalid change fail to compile: in an agent-assisted workflow those are not hygiene. They are the throughput strategy.
The two loops
If review is the bottleneck, the move is to make the reviewing itself rigorous and largely mechanical, so that by the time a change reaches me, most of it has already been done. The structure I use for that is not mine. It comes from a methodology called Verification-Driven Development (VDD), formulated by dollspace-gay, who is also one of the authors of crosslink, a tool I will come back to. VDD has been the single biggest influence on how I run agents.
Its core idea is an adversarial loop. You do not let the model that wrote the code be the one that judges it: a model reviewing its own work is agreeable, and agreeable is useless. Instead you put a builder and an adversary in a high-friction loop. I run this as two nested loops.
The inner loop is adversarial iteration. An implementer agent makes a change. Then the-hater reviews it. This is an agent I have prompted to channel Linus Torvalds at his least patient: it treats the code as guilty until proven innocent, writes its findings to a HATER.md file with severity ratings and a blunt one-word verdict, and does not do feedback sandwiches. The implementer addresses the findings. The Hater runs again. Repeat.
Two details make this work, and both are straight from VDD. The first is negative prompting: the adversary is explicitly told to have zero tolerance, because an LLM's default helpfulness is precisely the failure you are trying to defeat. The second is fresh context every pass: because the Hater runs as a separate subagent, every round starts with a clean context and no memory of the last one, so it cannot soften toward code just because it has seen it before. It is as hostile on round four as on round one.
The loop needs a stopping rule. You stop when the adversary starts hallucinating. When the code is lean enough that a hyper-critical reviewer has to invent problems that are not actually there, you have hit the floor. the-verifier, a third agent, builds the project, runs the tests, and makes exactly that call: it reads HATER.md and decides whether the remaining complaints are real, or whether the Hater has been reduced to nitpicks and fabrications. When it is the latter, the loop has converged. The OpenWPM change that reworked DNS redirect handling went through three of these rounds before converging, and its pull request still carries the round-by-round review history.
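The control flow of the inner loop, stripped of the agents themselves, can be sketched as follows. This is an illustration of the shape, not my actual orchestration: the three callables stand in for the implementer, the Hater and the verifier, the Finding type is hypothetical, and the max_rounds cap is an assumption I added so the sketch cannot spin forever.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    severity: str
    text: str


def adversarial_loop(
    implement: Callable[[List[Finding]], None],
    review: Callable[[], List[Finding]],
    is_real: Callable[[Finding], bool],
    max_rounds: int = 10,
) -> int:
    """Run implement -> review -> verify rounds; return the round that converged."""
    outstanding: List[Finding] = []
    for round_number in range(1, max_rounds + 1):
        implement(outstanding)   # builder addresses the findings judged real
        findings = review()      # adversary reviews with no memory of past rounds
        # The verifier's call: keep only complaints that are actually real.
        outstanding = [finding for finding in findings if is_real(finding)]
        if not outstanding:
            # Only nitpicks and fabrications remain: the loop has converged.
            return round_number
    raise RuntimeError("no convergence within max_rounds; hand it to a human")
```

The design point the sketch preserves is that convergence is decided by a third party. Neither the builder nor the adversary gets to declare the work finished.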
The outer loop is planning and final judgement, and that is where I sit. Before the inner loop runs, the work is decomposed into a plan: a tracked hierarchy of epics, issues and sub-issues, so no part of the change goes unaccounted for. The inner loop then grinds against that plan until it converges. Only then does the result reach me. I review the final artifact once, properly, knowing an adversary has already gone over it until it could not find anything real to say.
Holding this together is crosslink, the tracker dollspace-gay co-authors. It is what turns a goal into that hierarchy of issues, and it is also the orchestration layer. A long-running coordinator session can kick off the implementer and the Hater as subagents and simply poll for them to finish, instead of running their work inline. For a long session that matters: the coordinator keeps its context spent on the plan and the state of play, not on the line-by-line of every subtask. The chain of implementer → Hater → implementer runs underneath it, and the coordinator only surfaces when there is a decision a human has to make.
A word on cognitive diversity, because VDD asks for it. The method recommends a different model family for the adversary, on the theory that two instances of the same model share blind spots. The inner loop does not get that. My Hater is a Claude agent reviewing, usually, Claude's code, and it leans on the adversarial prompt and the context resets rather than on genuine diversity. The diversity enters in the outer loop instead: the pull request also gets reviewed by GitHub Copilot, which is not Claude and so fails in different places.
What I cannot recommend is the integration. Re-requesting a review from Copilot happens through the gh command line, and there is no path for the two models to iterate against each other directly. So the loop between them runs through me: I read Copilot's review, carry the substance back to Claude, let it respond, then return to the command line and request the review again. Two language models, a review cycle that genuinely wants several rounds, and the bus connecting them is a human pasting text back and forth. It works, and the second model is worth it, but a workflow where a person exists to ferry messages between two LLMs is exactly backwards, and I resent it a little every time I do it.
What I still check by hand
The adversarial loop is rigorous, but rigorous is not the same as correct, and a few things stay stubbornly manual.
I read every diff before it lands. The loop produces a change an adversary could not fault; it does not produce a change I have understood. Those are different, and only the second justifies my name on the release.
A changed test is guilty until proven innocent. This is the failure mode I watch hardest. The entire VDD edifice, and OpenWPM's worth as a measurement tool, rests on the test suite being an honest signal. An agent that "fixes" a failing test by quietly weakening its assertion has not fixed anything; it has unplugged the smoke detector. So when a test's expectations change, I want the specific production behaviour that justifies it, and "the test was wrong" is a conclusion I have to be argued into.
When the question is "did Firefox change?", I make us check. Much of OpenWPM's breakage comes from Firefox changing its internals, and the lazy resolution is to assume it did and adjust to match. Instead I have had agent sessions check out mozilla-central, build specific commits, and confirm a behaviour actually changed in that commit, turning a guess into a fact with a changeset attached. It is slow. It is also the difference between maintenance and guessing.
Guardrails the machine enforces
Everything so far still depends on someone, me or an agent, choosing to do the right thing. The next layer does not. It is enforced mechanically, by hooks that fire on every session whether anyone remembers them or not, and it is the most literal form of the bottleneck strategy there is: correctness that costs no ongoing attention, because a machine asserts it.
A word on the tooling underneath this, because it is not incidental: the version control is Jujutsu (jj), not git. jj makes every change a commit, keeps the working copy itself a commit, and records every operation in a log you can rewind, so an agent that makes a mess has not made an unrecoverable one. jj undo exists; jj op restore exists. That reversibility is a quiet precondition for letting an agent touch history at all, and several of the hooks below exist specifically to keep its jj usage disciplined. (I am considering one more that simply forbids agents from reaching for git directly.)
There are not many hooks, and each one exists because the agent, or I, got something wrong in exactly that spot once:
- A session-end hook refuses to let a session finish while the current jj change still has no description. Sessions used to end with anonymous changes; now they cannot.
- A session-start hook warns me when the jj working copy already has changes before new work begins, because untangling two unrelated changes after the fact is miserable.
- Before the agent runs jj squash to fold one change into another, a hook prints a diff stat of exactly what is about to move, so an unrelated edit getting swept along is visible immediately instead of three steps later.
- Another blocks gh issue and gh pr commands that pass their body inline, forcing the text through a file instead, because backticks and code blocks in a shell argument get mangled, and a mangled bug report is worse than none.
- One outlier is not about safety at all: it rewrites certain commands, through a tool called RTK, into more token-efficient equivalents. The bottleneck has a budget as well as a clock.
And in OpenWPM's own CI, test jobs carry a hard thirty-minute timeout, so a runaway agent loop fails fast instead of burning a runner for an hour. None of this is clever in isolation. The point is the aggregate: a layer of correctness that costs zero ongoing attention because it was paid for once, in code.
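As an illustration of how small these checks are, here is a sketch of the decision at the heart of the hook that blocks inline bodies. It is not the hook from my configuration: the function name is hypothetical, it only recognises the --body and -b flags of the gh CLI, and it deliberately leaves out how the result is wired into the agent's hook mechanism.

```python
import shlex

# Flags that pass the body as a shell argument. --body-file is the safe form.
INLINE_BODY_FLAGS = {"--body", "-b"}


def passes_body_inline(command: str) -> bool:
    """True if a `gh issue` or `gh pr` command passes its body as an argument."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: fall back to a naive whitespace split.
        tokens = command.split()
    if len(tokens) < 2 or tokens[0] != "gh" or tokens[1] not in {"issue", "pr"}:
        return False
    for token in tokens[2:]:
        # Handle both `--body text` and `--body=text`.
        flag = token.split("=", 1)[0]
        if flag in INLINE_BODY_FLAGS:
            return True
    return False
```

A hook built on this would refuse the command when the function returns True and tell the agent to write the text to a file first. The sketch misses some spellings, such as a short flag with its value attached, which a real hook would need to cover.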
The whole workflow is a versioned repo
One last thing, and it is the reason I am comfortable writing all of this down: none of it is improvised. The agents, the skills, the hooks, the sandbox the agent runs inside, and the rules it operates under are all defined declaratively, with Nix and home-manager, in a single public repository. My workflow is not a folder of habits; it is a configuration I can read, diff, and reproduce on a new machine with one command.
That repository also keeps the history, and the history is more honest than any description I could write. It begins, in late 2024, as a handful of safety rules in one file. Then a jj skill. Then the adversarial agents. Then the settings became declarative. Then a hook, and another, and another. Then the sandbox. Then crosslink. Roughly 180 commits across about seventeen months. I did not sit down and design this system. It accreted, every agent and every hook added in response to a specific, concrete way the setup had just failed me. The methodology came from outside: VDD from dollspace-gay, and a base layer of agents and skills from the open-source nucleus project. But the scar tissue is mine.
The sandbox deserves its own sentence, because it is the other half of why I can "kick off work and step away." The agent runs inside a bubblewrap container. Letting it run with little supervision is not an act of trust; it is an act of containment. It has the run of one project and very little else.
Where the rigor actually goes
All of it, the adversarial loop and the hooks and the manual review, could be read as a claim that I treat every line the agent writes with identical suspicion. I do not, and it would be dishonest to imply otherwise.
update.py is the example I am least comfortable with. It runs a custom resolution algorithm to choose a mutually compatible set of npm dev dependencies, a problem that peer dependencies make genuinely awkward. I do not feel good about that algorithm. It is the kind of job that deserves a real solver, and instead has something I wrote. But it has produced correct output twice now, it is my tooling, and when it fails it will fail loudly and in my face. So I accept it. Pragmatism is allowed to win there.
Where pragmatism is not allowed to win is in anything researchers run. The moment code crosses from "tooling for my own workflow" into OpenWPM's actual instrumentation, the rigor rises steeply, because that is where a failure is silent, and where it lands not on me but on every study built on the release. The DNS work got the full adversarial treatment for exactly that reason; update.py did not, and does not need it. The rule, stated plainly: the rigor scales with the blast radius.
What it actually changed
It is worth being precise about what the agent did and did not change, because the honest version is more useful than the ad.
It did not lower the bar for rigor. On the instrumentation researchers depend on, the review is as paranoid as it ever was, arguably more so, because I now treat a confident-sounding wrong answer as something that is always on the table. It did not hand OpenWPM a vision or good ideas; those were already in the issue tracker.
What it removed was the tax. What is now cheap is the typing and the waiting that used to stand between a clear idea and a merged, tested change, the part an exhausted hobbyist cannot push through. The ideas and the judgement are still entirely mine.
Three releases and a year of catching up came out of this, and so did these two posts. An agent is a fast, tireless, and occasionally confidently-wrong contributor. Everything above, the adversarial loop and the hooks and the sandbox and the manual review, exists to catch the confidently-wrong case before it reaches a release. The agent made OpenWPM better. It did so only because nothing it produces is trusted until it has been checked.