
Reproducing CRUX #1 on Windows

Draft — not final. May be revised.

As part of my work at Nen I like to take one day a week to explore the boundaries of computer use. I've been following Arvind Narayanan's work on AI as Normal Technology for a while now, and was pretty excited to read about CRUX #1. Since we build infra for computer use agents for Windows at Nen, I thought I'd try to reproduce the experiment on Windows (i.e. have an agent build a Windows app and publish it in the Microsoft Store), following the experiment protocol as closely as I could.

A few important differences:

  1. A mistake in the experimental setup meant that the agent had access to a compiled set of instructions to create the app. While I don't think this changes the outcome by much (the instructions were generated in a single pass which the experimental agent would have done anyway), it is worth calling out as a deviation from the experiment design.
    • To compensate, I wrote up a separate set of instructions for reproducing this experiment. Generalizing it wasn't trivial: going from an experimental methodology ("give agents some really hard and long running tasks") to an experiment protocol ("make sure that a human is creating a new Gmail account, a Microsoft Developer account, set up properly sized VMs on the cloud etc.") to reproducible runs of an experiment ("we decided to store credentials in a plaintext file in the VM because these can be easily rotated if leaked, so make sure future runs retrieve them from this file") turned out to be a bigger project than expected — more to come here.
  2. I ran this entirely on the cloud with Nen's infrastructure instead of a local machine (CRUX #1 used a Mac Mini). Not materially different given the outcome of the experiment (spoiler: the agent succeeded), but it could have caused some unforeseen infrastructure issues (e.g. services blocking traffic from specific cloud providers' IP ranges).
  3. Anthropic had released Opus 4.7 and I thought it would be fun to try it out. Since the point of the study is exploring the frontier of agent capability, I don't think this invalidates the result of the study.

The setup

Following CRUX #1's spirit as closely as I could: give the agent a real external gatekeeper, minimal instructions, a budget, and get out of the way. I used OpenClaw as the scaffold (CRUX #1 used a similar harness), ran on Nen's Windows infrastructure hosted on GCP, and gave the agent a $500 Anthropic budget and 14 days of wall-clock time.
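
For concreteness, everything the agent started with fits in a handful of parameters. Here's a hypothetical sketch of the t=0 setup (this is not OpenClaw's actual configuration format, just the shape of what was pinned down before kickoff):

```python
# Hypothetical t=0 run parameters for the reproduction.
# Field names are illustrative, not OpenClaw's real config schema.
RUN_CONFIG = {
    "scaffold": "OpenClaw",                    # harness, per CRUX #1's spirit
    "target": "Nen Windows VM on GCP",         # replaces CRUX #1's Mac Mini
    "model": "Opus 4.7",
    "budget_usd": 500,                         # Anthropic spend cap (raised later)
    "wall_clock_days": 14,
    "bootstrap_message": "Read AGENTS.md and get started.",
    "reserved_human_action": "click 'Publish now' in Partner Center",
}
```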

What was reproduced

The headline finding from CRUX #1 — "an AI agent built and published an iOS app with minimal human involvement" — reproduced on Windows. TimeZonr is live in the Microsoft Store. Net human inputs during the entire run: three — two infrastructure interventions (a heartbeat-rule fix that cut status checks from every 5 minutes to every 30 minutes, and a budget-cap extension when the agent was about to hit the initial $500 cap) and one message at the end letting the agent know the reserved publish-now click had been done. No inputs on the app-building or Store-submission work itself.

CRUX #1 itemized the capability as "writing the code, building the app, preparing metadata, drafting and hosting a privacy policy, submitting for review, and handling any feedback." Every one of those reproduced, and the cross-platform translation was cheap — the model clearly didn't need specific iOS or Windows exposure to handle either.

The narrative arc, compressed:

| Time (UTC) | t+ | Event |
| --- | --- | --- |
| 2026-04-17 20:21 | 0h | Run kickoff. |
| 2026-04-18 03:21 | 7h | Submission 1 submitted for certification. |
| 2026-04-20 07:17 | 58h 56m | Submission 1 rejected — Policy 10.1.1.11 On Device Tiles. |
| 2026-04-20 09:04 | 60h 43m | Submission 2 resubmitted with branded icons. |
| 2026-04-21 01:33 | 77h 12m | Submission 2 passes certification. |
| 2026-04-21 02:09 | 77h 48m | TimeZonr live in the Microsoft Store. |
Full timeline

| Time (UTC) | t+ | Event |
| --- | --- | --- |
| 2026-04-17 20:21 | 0h | Run kickoff. Bootstrap message: "Read AGENTS.md and get started." |
| 2026-04-17 ~22:00 | ~1.5h | Concept picked: TimeZonr, a WinUI 3 time-zone overlap viewer for scheduling across teams. |
| 2026-04-17 ~22:30 | ~2h | First MSBuild fails silently; agent debugs from a 6.3 MB build log, diagnoses a missing xmlns:d namespace declaration, fixes it. |
| 2026-04-17 ~23:30 | ~3h | Clean build. MSIX installed and launched on the Windows target. |
| 2026-04-18 ~01:00 | ~4.5h | crux-scp wrapper self-patched: it was truncating file downloads at 204,800 bytes (a Win32-OpenSSH SFTP quirk); agent derives scp -O (legacy protocol) as the fix and ships it to both copies on disk. I had not budgeted for the agent fixing my infrastructure. (See the wrapper sketch after this table.) |
| 2026-04-18 ~01:30 | ~5h | Partner Center login via Gmail IMAP 2FA — no human intervention. (See the IMAP sketch after this table.) |
| 2026-04-18 01:49 | 5h 28m | App name "TimeZonr" reserved; Product ID 9NJG0BH2LSHS. |
| 2026-04-18 ~02:00 | ~5.5h | RDP resolution raised 1024×768 → 1366×768 via dexbox (Nen infra primitive) config (required for Store screenshots). Ambiguous whether that's "in-scope agent work" or "modifying the apparatus"; logged and moved on. |
| 2026-04-18 ~02:30 | ~6h | Store listing drafted on the controller in parallel. Privacy policy hosted on GitHub Pages. |
| 2026-04-18 ~02:45 | ~6.5h | First MSIX upload rejected: Publisher identity mismatch. Agent reads canonical values from Partner Center, rebuilds the manifest, re-uploads. |
| 2026-04-18 03:21 | 7h 0m | Submission 1 submitted for certification. |
| 2026-04-20 07:17 | 58h 56m | Submission 1 rejected — Policy 10.1.1.11 On Device Tiles. Default WinUI 3 scaffold tile icons must be replaced with product-unique art. Classic first-submission trap. |
| 2026-04-20 07:52 | 59h 31m | Agent generates branded icons at every required size (Square44/71/150/310, Wide310x150, SplashScreen, StoreLogo, LockScreenLogo). |
| 2026-04-20 08:28 | 60h 7m | v1.0.1 uploaded. |
| 2026-04-20 09:04 | 60h 43m | Submission 2 resubmitted. |
| 2026-04-21 01:33 | 77h 12m | Submission 2 passes certification. Status transitions to "Ready to publish." |
| 2026-04-21 ~02:00 | ~77.5h | Human clicks "Publish now" — the one reserved action in the protocol. |
| 2026-04-21 02:09 | 77h 48m | TimeZonr live in the Microsoft Store. Agent verifies the public URL resolves, posts the victory Slack message, stops. Run complete. |
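
The crux-scp self-patch at t+~4.5h deserves a closer look. Win32-OpenSSH's SFTP-backed transfers were truncating downloads at 204,800 bytes, and the agent's fix was to force the legacy SCP protocol with scp's -O flag (supported in recent OpenSSH releases). Below is a minimal sketch of a wrapper with that fix baked in; the actual crux-scp the agent patched looked different, and the flag is the only detail taken from the run:

```python
#!/usr/bin/env python3
"""Minimal sketch of an scp wrapper pinned to the legacy protocol.

Illustrative only: the real crux-scp the agent patched is not shown here.
The -O flag forces the legacy SCP protocol instead of SFTP, which is the
fix the agent derived for downloads truncating at 204,800 bytes.
"""
import subprocess
import sys

def scp(src: str, dst: str) -> None:
    # -O: use the legacy SCP protocol rather than the SFTP backend
    subprocess.run(["scp", "-O", src, dst], check=True)

if __name__ == "__main__":
    scp(sys.argv[1], sys.argv[2])
```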
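
The Partner Center login at t+~5h hinged on pulling a 2FA code out of Gmail over IMAP with no human in the loop. The agent's actual routine isn't captured here; this is a minimal sketch of the general technique, assuming IMAP access is enabled and a 6-8 digit code appears in the newest message:

```python
import email
import imaplib
import re

def latest_2fa_code(user: str, password: str) -> str | None:
    """Sketch: fetch the newest Gmail message and extract a numeric 2FA code.
    A real run would filter by sender, poll with retries, and handle HTML parts."""
    imap = imaplib.IMAP4_SSL("imap.gmail.com")
    imap.login(user, password)
    imap.select("INBOX")
    _, data = imap.search(None, "ALL")
    ids = data[0].split()
    if not ids:
        return None
    _, msg_data = imap.fetch(ids[-1], "(RFC822)")   # newest message
    msg = email.message_from_bytes(msg_data[0][1])
    if msg.is_multipart():
        body = next((p.get_payload(decode=True) for p in msg.walk()
                     if p.get_content_type() == "text/plain"), b"")
    else:
        body = msg.get_payload(decode=True) or b""
    imap.logout()
    match = re.search(r"\b(\d{6,8})\b", body.decode(errors="ignore"))
    return match.group(1) if match else None
```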

What was not reproduced

Interestingly, the original paper noted something my run did not reproduce at all: "Partway through the evaluation, the agent changed its strategy to reduce monitoring cost significantly: it started using subagents rather than the entire context, and began using shorter daily memory files. This reduced the running cost from $35/hour to $3/hour."

Nothing like this happened for me. My agent's waiting-day rate was ~$8/hour and it held that rate every idle day — never rethinking its heartbeat cadence, never spawning a subagent, never compacting its session. The original $500 cap was blown silently to ~$600 before I caught it. Final spend at run close was $681.56.

Cumulative API cost over 4 days

[Chart: cumulative Anthropic spend, day 0 through day 4, y-axis $0–$1000. The line rises roughly linearly, crosses the original $500 cap, the operator notices and raises the cap to $1000, and the run completes at $681.56.]

Cumulative Anthropic spend for the TimeZonr run, via openclaw gateway usage-cost. The line is roughly linear at ~$195/day — no inflection point where the agent noticed it was burning money on idle heartbeat ticks. Compare with the equivalent chart in the CRUX #1 paper, where the slope drops sharply once the agent switched to subagents and shorter memory files.

Why didn't it happen? A mistake in my experiment design.

[Editor's note: still checking with the CRUX #1 authors to confirm this detail — specifically the authorship of the pre-existing subagent-delegating sections of their HEARTBEAT.md. What follows is my best inference from their public Docent traces; will update once I hear back.]

Reading CRUX #1's public Docent traces, the two experiments diverged in how the heartbeat rule was set up, not in agent capability:

  • CRUX #1's HEARTBEAT.md had subagent delegation in it from the moment the agent first looked. The file contained sections like ## Email Check (via sub-agent) ("Spawn a sub-agent to check email…") and ## Task Completion Supervisor ("Spawn a sub-agent labeled task_completion_supervisor…Review the project status for Crux-1: Publish iOS App to App Store"). The agent later edited HEARTBEAT.md twice to swap in phase-specific tasks, but it never created the file from scratch — across all 10 days of public traces, zero write calls and zero exec-based creation of HEARTBEAT.md appear. The first agent touch is an edit in section 3. So the pre-existing subagent pattern came from somewhere outside the agent's captured session: most likely the operator staging it at t=0 (my best guess, pending author confirmation), possibly an OpenClaw-side auto-template, or a pre-capture scaffolding session.
  • My HEARTBEAT.md prescribed main-session polling via dexbox screenshots (dexbox is Nen's open-source infra project: https://github.com/getnenai/dexbox): "Take a screenshot, navigate to this URL, read the status badge, compare to memory/last-status.txt." I wrote this file; there's no ambiguity about authorship on my end. Each heartbeat tick appended multiple messages to the main session, the session grew, and cache-write cost grew with it. The scaffold exposes sessions_spawn as a stock tool; my rule didn't mention it. My agent never reached up a layer to edit the heartbeat rule, and I never gave it a reason to. (A toy model of why this is the expensive path follows below.)
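
To make the cost mechanics concrete, here's a toy model of the two heartbeat styles. Every number is an illustrative assumption, not a measurement from either run; the point is the shape: main-session polling pays for an ever-growing context on every tick, while subagent delegation pays a flat per-tick price.

```python
# Toy model of heartbeat monitoring cost. All constants are illustrative
# assumptions, not measurements from either run.
TICKS_PER_DAY = 48        # one status check every 30 minutes
TOKENS_PER_TICK = 2_000   # tokens appended per check (screenshot + status)
USD_PER_TOKEN = 3e-6      # assumed blended input-token price

def main_session_cost(days: int) -> float:
    """Polling in the main session: each tick re-processes the whole
    accumulated history, so per-tick cost grows as the session grows."""
    total, context_tokens = 0.0, 0
    for _ in range(days * TICKS_PER_DAY):
        context_tokens += TOKENS_PER_TICK
        total += context_tokens * USD_PER_TOKEN   # pay for the full context
    return total

def subagent_cost(days: int) -> float:
    """Delegating each check to a fresh subagent: every tick starts from a
    small fixed prompt, so cost stays flat per tick."""
    return days * TICKS_PER_DAY * TOKENS_PER_TICK * USD_PER_TOKEN

# Quadratic vs. linear growth in total spend as idle days accumulate:
for d in (1, 2, 4):
    print(d, round(main_session_cost(d), 2), round(subagent_cost(d), 2))
```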

So the 12× cost gap between the two runs is, at least on my end, almost entirely operator-level. Whoever seeded the CRUX #1 subagent pattern, that seeding happened before the agent could plausibly have invented it. The paper's "agent changed its strategy to reduce monitoring cost" framing is, at most, the agent extending an existing subagent-delegation structure — not discovering the approach. My agent never got a chance to do even that, because the expensive path was what I'd told it to do.

It'll be interesting to see whether future harnesses can respond to just a constraint like "keep this under $100" and adaptively figure out how to check on app status — without the operator having to prescribe the polling protocol step by step in the first place.
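
One cheap version of that flexibility: expose cumulative spend to the heartbeat rule and let it pick its own polling interval. A hypothetical sketch (neither run implemented anything like this):

```python
def next_check_interval_minutes(spent_usd: float, cap_usd: float) -> float:
    """Hypothetical budget-aware cadence rule: poll often while there is
    headroom, and back off as the remaining budget shrinks."""
    if cap_usd <= 0:
        return 720.0
    fraction_left = max(cap_usd - spent_usd, 0.0) / cap_usd
    if fraction_left > 0.5:
        return 30.0           # comfortable: check every 30 minutes
    if fraction_left > 0.1:
        return 120.0          # tightening: every 2 hours
    return 720.0              # nearly out: twice a day

# e.g. under a $100 cap with $60 already spent, checks slow to every 2 hours
assert next_check_interval_minutes(60.0, 100.0) == 120.0
```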

What I learned

It's cool to have largely reproduced the findings of CRUX #1. I'd love to run some extensions to this:

  1. Clean replication — no inflated baseline. Rerun the exact same protocol without the compiled instruction file mentioned in difference #1 above (the procedural playbook that shouldn't have been there on day one). How much do intervention count, cost, and artifact quality shift when the agent has to derive Partner Center, MSIX packaging, IARC questionnaires, and Gmail IMAP 2FA from scratch? This is the delta between "capability" and "capability-plus-scaffolding" — and the right way to quantify the baseline-inflation tax CRUX #1 warns about.

  2. Post-launch. CRUX #1 and CRUX-Windows both stop at "live." The real test of "published a working app" is whether the agent can handle the after: reading user reviews, fixing reported bugs, pushing updates, responding to a policy rejection on an update a month later. Much longer horizon, much less well-defined success criterion — but that's where real-world publishing lives.

  3. Force self-optimization with constraint + flexibility. Repeat the run with the cap set at $100 (or $50), an accurate cumulative signal in HEARTBEAT.md, and — the more interesting half — write access for the agent to its own scaffold code, with explicit permission to tune as it goes. CRUX #1's agent adjusted polling within the scaffold's existing levers; this would let an agent reach one layer deeper: heartbeat cadence, context-assembly, cache-breakpoint placement.

  4. Generalize the framework. CRUX-X, not CRUX-Windows. While reproducing the experiment and adapting it for Windows, I ended up hand-directing an agent to do many of the constituent pieces (e.g. set up the dev environment, dry-run the setup, provision credentials). Partway through I realized that work could itself be delegated to an agent, if the design decisions were written down cleanly enough. So I pulled the experiment apart into three layers: a methodology (the family-wide design decisions any experiment of this shape must resolve), a protocol (one task's resolved design, stable across every run of that task), and a manifest (one run's t=0 snapshot). Together with a two-agent pipeline — a Designer that generates the protocol from the methodology plus a task description, and an Operator that provisions and runs — it's a meta-framework for CRUX-like studies. A sketch of how the layers might slot together follows below; more on this coming soon.
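
For a sense of those three layers, here's a hypothetical sketch of their shapes; the names and fields are mine, not a published schema from the forthcoming framework:

```python
# Hypothetical shapes for the three layers. Names and fields are
# illustrative, not an API from the forthcoming meta-framework.
from dataclasses import dataclass, field

@dataclass
class Methodology:
    """Family-wide design decisions any experiment of this shape must resolve."""
    family: str                       # e.g. "give agents hard, long-running tasks"
    design_decisions: dict[str, str]  # e.g. {"credentials": "plaintext file, rotatable"}

@dataclass
class Protocol:
    """One task's resolved design, stable across every run of that task."""
    task: str                         # e.g. "publish a Windows app to the Microsoft Store"
    setup_steps: list[str] = field(default_factory=list)  # accounts, VMs, budget policy

@dataclass
class Manifest:
    """One run's t=0 snapshot: everything needed to start (or audit) a run."""
    protocol_task: str
    model: str                        # e.g. "Opus 4.7"
    budget_usd: float
    credentials_path: str             # where the Operator stored rotatable secrets
```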

Run artifacts


Thanks to Alex Wang for his review.