Engineering Heresy

agentic-os v2: The Tool You Installed Isn't the One Running Now

Glenn Eggleton — Wed, 24 Jun 2026 03:56:59 GMT

Less library, more guardrail. The version on your disk is several releases behind the one I'm running.

If you installed agentic-os and walked away, the thing sitting in your .claude/ directory is not the thing I am running now. Since the last release I actually wrote up here, it shed roughly half of itself, grew a layer of hooks that fight its own forgetting, flipped that layer on by default, and learned to run on Cursor instead of only Claude Code. That is one major version and a handful of minor ones, and you got a changelog for none of it.

So here is the through-line before the details: the agentic-os you installed is a different tool now. It is roughly half the size, it fights the context drift that creeps into any long agent session with deterministic hooks that are active by default, and it is dual-runtime across Claude Code and Cursor. If your mental model of this project is "a big library of skills and agents I cloned once," that model is a full major version stale.

A quick accounting of the gap, because I owe you one. The last release I published a post about was v1.1.0, the review ratchet. Then v1.4.0 shipped /v2-collab and I drafted the writeup and never hit publish. Then the whole v2 line landed. So this catches you up from v1.4 through v2.4 in one pass, organized by what changed rather than by version number, because the version numbers are not the story. The story is that the project stopped growing by addition.

The prune, and the harness that replaced the bloat

The headline of v2.0.0 is subtraction. I cut the library from 132 artifacts to 62, which leaves about 47% of what was there. That is a reduction of roughly 53% in the always-on context every session pays for before it does any work.

Here is the misconception that prune is aimed at, and I held it as hard as anyone: more skills means a more capable library. It does not. Most of what I had accumulated was not capability. It was a skill, agent, command, or rule that restated default good practice the model already follows, duplicated something another artifact already covered, or existed only to add a routing hop. Every one of those is loaded into context whether or not the session needs it. They are a tax, charged on every run, paid in the tokens that could have held the actual task.

So the razor was simple. Cut anything that merely restates what a competent model already does, duplicates another artifact, or only adds routing noise. Keep the genuine project conventions, the real automation, the disciplines that are reproducible and worth pinning down. What survived is the half that earns its place in context. README, installers, and the ship manifest were synced to match, and the validators stayed green through the cut.

That is the part you can measure. The more interesting half of v2.0.0 is what I added back, and it weighs almost nothing.

Awareness drift is the real failure of long sessions

Run an agent long enough and it starts to forget itself. Not crash, not error out, forget. The context window is finite, so a long session compresses as it goes, and compression is lossy. Facts that were settled an hour ago get quietly re-derived from scratch. Infrastructure that already exists gets rebuilt because the agent no longer remembers building it. The session is still confident, still productive-looking, and steadily drifting away from things it already knew.

I started calling this awareness drift, and it is the dominant failure mode of any session that runs past a single compaction. It is not a model-quality problem. A smarter model drifts too; it just drifts more articulately. The fix is not a better prompt. It is a mechanism that holds onto the settled facts across the compressions, deterministically, so the agent cannot lose them.

That mechanism is the awareness harness, and it is a set of Claude Code hooks rather than anything the model has to choose to do.

Walk one long session through it. At SessionStart, the harness injects a SESSION-STATE.md file into context, so the agent opens already knowing where things stand. Each turn, it writes a compact digest of what was settled, so the record stays current instead of going stale the moment it was written. Right before the context window compacts, a PreCompact checkpoint fires and pins the state down, so the lossy compression happens around a record that survives it. And the file itself is never hand-edited; a deterministic /state writer is the only thing that touches it, so the record is reproducible rather than another thing the agent has to remember to maintain.
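To make "deterministic writer" concrete, here is a minimal sketch of the idea in Python. It is not the shipped /state writer (the post describes the shipped hooks as plain shell scripts); the function names and the file layout are assumptions made for illustration. The point it shows is that the same settled facts always render to the same bytes, and that the file is swapped in atomically so a half-written record never gets injected.

```python
from pathlib import Path


def render_state(facts: dict[str, str]) -> str:
    """Render settled facts as text; sorted keys make the output reproducible."""
    lines = ["# SESSION-STATE", ""]
    for key in sorted(facts):
        lines.append(f"- {key}: {facts[key]}")
    return "\n".join(lines) + "\n"


def write_state(path: Path, facts: dict[str, str]) -> bool:
    """Write the rendered state atomically; return True only if the file changed."""
    new_text = render_state(facts)
    if path.exists() and path.read_text(encoding="utf-8") == new_text:
        return False  # nothing settled since the last write
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(new_text, encoding="utf-8")
    tmp.replace(path)  # atomic rename on the same filesystem
    return True
```

Because the rendering ignores insertion order, two turns that settle the same facts in a different order produce an identical file, which is what makes the record diffable across a compaction.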

Sitting off to the side is the survey-before-act guard. It hooks PreToolUse and warns when the agent is about to build something it should check for first. Right now it only warns and logs; it does not block. That is deliberate. I want real evidence about false positives before I let a guard veto an action, so for now it measures and stays out of the way.

Contrast that with the same session before v2. The agent settles a fact, works for an hour, hits a compaction, and the fact is gone. It re-derives it, often slightly differently, and now there are two versions of a thing that should have been one. Nobody flagged it. The session looked fine the whole time. That silent re-derivation is exactly what the harness exists to stop.

Two things are worth saying plainly here. First, this is not CI, and the second misconception I want to break is that these hooks are just automation or linting wearing a different hat. A linter checks the code. The awareness harness checks the agent's grip on what it already knows. It is an awareness layer, not a build step, and it runs in a part of the session a linter never sees.

Second, every piece of this got built and dogfooded under review before it shipped. The metrics module, session-metrics, establishes a deterministic baseline of tokens and an awareness signal, and a review of it caught a token-counting bug that was inflating the count by a factor of 2.85. A compare harness re-aims the evaluation at tokens-per-outcome with the hooks on versus off. And the whole thing sits behind a security spine: a threat model in SECURITY.md and a Tier-0 hook-safety invariant baked into the validator, which enforces that any shipped hook is a plain shell script with no exfiltration, no arbitrary exec, no obfuscation, no credential access, and no persistence tricks. Hooks run with your shell's reach. That invariant is the price of admission for any of them shipping at all.

One caveat on v2.0.0 specifically: the harness shipped dormant. The scripts were in the release, but nothing registered them, because wiring deterministic hooks into a consumer's global config is a supply-chain decision and I was not going to flip that switch in the same release that built the thing. The switch came later, and it is its own section below.

It runs on Cursor now

For its entire life, agentic-os was a Claude Code project. That is the third misconception worth retiring: it is not Claude-Code-only anymore. The v2.1 through v2.3 line made it dual-runtime, and a Cursor shop can install and run it natively.

The Cursor port is more than a symlink. There is an AGENTS.md that gives Cursor a real orchestrator entry point, with the same parallel fan-out model the Claude Code side uses, so dispatching subagents works the way the doctrine assumes. The always-on rules live in .cursor/rules/*.mdc, in Cursor's own format, so the doctrine is enforced the native way rather than bolted on. There are dedicated installers, install-cursor.sh and install-cursor.ps1, including a PowerShell path for Windows.

The piece I am happiest with is dual-path skill resolution. A skill is looked up in the repo's .claude/skills/ first, then falls back to ~/.cursor/skills/, then to ~/.claude/skills/. That means a single checkout can serve a Claude Code user and a Cursor user without either one reorganizing their machine. And every one of the 35 skills now carries a compatibility field in its frontmatter, so the resolver knows what runs where instead of guessing.
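The lookup order above can be sketched in a few lines of Python. This is an illustration of the fallback chain only, not the project's resolver; it assumes each skill is a directory named after the skill, and resolve_skill is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def resolve_skill(name: str, repo_root: Path, home: Path) -> Optional[Path]:
    """Return the first directory that contains the named skill, or None."""
    search_order = [
        repo_root / ".claude" / "skills",  # the checkout wins
        home / ".cursor" / "skills",       # then the Cursor user install
        home / ".claude" / "skills",       # then the Claude Code user install
    ]
    for base in search_order:
        candidate = base / name
        if candidate.is_dir():
            return candidate
    return None
```

In real use the call would look like resolve_skill("some-skill", Path.cwd(), Path.home()). Keeping the order in one list is the whole design: the precedence is visible in one place rather than spread across conditionals.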

If you live in Cursor and had written this project off as not for you, that is no longer true.

The hooks are on by default now

The v2.0.0 harness shipped dormant. The v2.3 line is where I flipped it on.

As of v2.3.0, the installers merge the awareness harness and a block-bad-bash guard directly into your global config, the settings.json on the Claude Code side and hooks.json on the Cursor side. Install agentic-os and the harness is live, not sitting in a folder waiting for you to wire it up. That is the supply-chain step I deliberately held back in v2.0.0, taken on purpose once the security spine was in place.

Two honest notes on that line, because it did not land cleanly on the first try. Use v2.3.1 or later, not v2.3.0. The v2.3.0 install.sh called a function by the wrong name and silently failed to register the hooks, so the headline feature of the release did not actually turn on. v2.3.1 fixed exactly that. Then v2.3.2 is the canonical version from main: same feature set, plus Tier-0 ship gates and hardening of the install destination paths for both bash and PowerShell. If you are pinning a version from this line, pin v2.3.2.

And because on-by-default is a strong move for something that touches your global config, opting out takes a single edit. Remove the hooks block from ~/.claude/settings.json, or from ~/.cursor/hooks.json on Cursor, and you are back to a plain install with nothing registered. The harness is the default, not a lock-in.

# Turn the harness off: delete the "hooks" block from your global config
#   Claude Code -> ~/.claude/settings.json
#   Cursor      -> ~/.cursor/hooks.json
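If you would rather script that edit than do it by hand, a small sketch follows. It assumes the hooks live under a top-level "hooks" key in the JSON file, which is what the instruction above implies; remove_hooks_block is a hypothetical helper, not something the installers ship. Note that it removes every registered hook, including any you added yourself, so it writes a .bak copy first.

```python
import json
from pathlib import Path


def remove_hooks_block(config_path: Path) -> bool:
    """Delete the top-level "hooks" key; return True if something was removed."""
    if not config_path.exists():
        return False
    original = config_path.read_text(encoding="utf-8")
    config = json.loads(original)
    if not isinstance(config, dict) or "hooks" not in config:
        return False
    # Keep a backup so the change is easy to undo.
    backup = config_path.with_name(config_path.name + ".bak")
    backup.write_text(original, encoding="utf-8")
    del config["hooks"]
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True
```

Point it at Path.home() / ".claude" / "settings.json" for Claude Code, or Path.home() / ".cursor" / "hooks.json" for Cursor.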

Ship gates and a data-model pipeline

v2.4.0, the current release, is about making the project's own shipping discipline reproducible.

There is a plain-language Features section in the README now, which sounds small and is not, because "what does this thing actually do" was a question the README had stopped answering as the library grew and then shrank. There is a ship-gate DAG, the canonical review orchestration, expressed as gate-dag.md and a gate-plan.sh planner with checkbox enforcement in CI, so the review gates a change has to clear are a declared graph instead of a thing I remember to run.
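For a feel of what "a declared graph" buys you, here is a hedged sketch using Python's standard graphlib. It is not gate-plan.sh, and the gate names in the usage note are invented; it only shows how a declared dependency map turns into ordered waves of gates, with a cycle rejected up front.

```python
from graphlib import TopologicalSorter


def plan_gates(dag: dict[str, set[str]]) -> list[list[str]]:
    """Group gates into waves; each gate maps to the gates it depends on.

    Every gate in a wave has all of its prerequisites in earlier waves.
    Raises graphlib.CycleError if the declared graph contains a cycle.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable plan
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

With hypothetical gates {"lint": set(), "tests": {"lint"}, "security-review": {"lint"}, "ship": {"tests", "security-review"}}, the plan has three waves: lint alone, then the two reviews side by side, then ship.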

And there is a DATA_MODEL pipeline. A data-model-documenter agent runs in the first wave to capture a change's data model, a data-model-verifier agent checks it in a second wave, and JSON Schema extractors enforce it at Tier 0. The stack specialist agents dispatch the documenter at the close of a session, so the data model gets written down at the moment the agent still holds the full context for it, rather than reconstructed later from the code. It is the awareness-drift idea again, pointed at a different target: capture the settled thing while it is still settled.

The one you missed: /v2-collab

This one is not part of v2, but you never saw it, so it belongs here. v1.4.0 shipped /v2-collab, an in-session multi-agent collaboration pod, and the post I wrote about it never went out.

The short version: it is a pod of agents that argue a deliverable out before you see it. The default roster is technical-pm -> engineer -> code-reviewer, and it is configurable. The PM frames the work, the engineer builds it, the reviewer critiques each round, and the loop runs for several rounds until the reviewer approves or a round cap stops it. The reviewer can reject and send the work back, which is the part that lets you stay out of the loop safely.
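The control flow is small enough to sketch. The version below is an assumption-laden illustration, not the /v2-collab implementation: the three roles are passed in as plain callables standing in for prompted agents, and returning the last draft with an approved flag when the cap is hit is my choice for the sketch, since the post does not say what happens at the cap.

```python
from typing import Callable

Review = tuple[bool, str]  # (approved, feedback)


def run_pod(
    brief: str,
    frame: Callable[[str], str],       # technical-pm: brief -> plan
    build: Callable[[str, str], str],  # engineer: (plan, feedback) -> draft
    review: Callable[[str], Review],   # code-reviewer: draft -> verdict
    max_rounds: int = 3,
) -> tuple[str, bool, int]:
    """Run PM -> engineer -> reviewer until approval or the round cap.

    Returns (last_draft, approved, rounds_used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    plan = frame(brief)
    feedback = ""
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = build(plan, feedback)
        approved, feedback = review(draft)
        if approved:  # the reviewer's veto really is just an if
            return draft, True, round_no
    return draft, False, max_rounds
```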

The fourth misconception is that this kind of thing needs infrastructure. It does not. /v2-collab runs entirely inside a Claude Code session on the subscription you already pay for. No Redis for shared state, no Docker, no API keys to provision. Roles are prompts, rounds are a loop, the reviewer's veto is an if. The one piece of real infrastructure it needs is a safe place to write what it makes, so output goes to a path you name, defaulting to ./v2-out/, behind fail-closed path sanitization that refuses to escape the target directory.
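"Fail-closed" here means that anything that cannot be proven to sit inside the target directory is refused. A minimal sketch of that check, assuming Python and a hypothetical safe_output_path helper rather than the project's actual code:

```python
from pathlib import Path


def safe_output_path(target_dir: Path, relative_name: str) -> Path:
    """Resolve a requested output path, refusing anything outside target_dir."""
    root = target_dir.resolve()
    # Joining with an absolute name replaces the root, and ".." climbs out of
    # it; resolving first means both cases are caught by the containment check.
    candidate = (root / relative_name).resolve()
    if root not in candidate.parents:
        raise ValueError(f"refusing to write outside {root}: {relative_name!r}")
    return candidate
```

The check runs on the resolved path, so "../", absolute paths, and symlinks that point outside the directory all fail the same way instead of each needing its own rule.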

/v2-collab Build a single-page marketing site for an AI healthcare startup ... write it to ./out

What lands in ./out has already survived an argument you did not have to referee.

What to do with this

If you are running an old checkout, the upgrade is one command, and it is the same shape on both runtimes.

# Claude Code
curl -fsSL https://raw.githubusercontent.com/LazyIsEfficient/agentic-os/v2.4.0/install.sh | bash

# Cursor
curl -fsSL https://raw.githubusercontent.com/LazyIsEfficient/agentic-os/v2.4.0/install-cursor.sh | bash

The throughline across all of it is the same bet made in two directions. The prune cut the context that was not earning its place. The harness adds back the small deterministic layer that keeps the agent from losing the context that was. Less noise going in, more signal staying put. Cursor support widened where that bet runs, and on-by-default decided you should get it without opting in. The version you installed grew by adding skills. This one got better by removing them and remembering harder.

The whole thing is open source in agentic-os, one command to install on either runtime. If "the tool changed under me" is a feeling you would rather not have again, star the repo so the next release shows up where you will see it. And subscribe below, where I write up what this system teaches me, usually by going wrong first.

Subscribe on Substack

— Glenn Eggleton builds agentic engineering systems and writes about what survives contact with production.

The Day a Frontier Model Got Switched Off. Security Is Now Metric #1.

Glenn Eggleton — Tue, 16 Jun 2026 14:31:21 GMT

Your most important dependency has a kill switch, and you don't hold it. Last week, a government pressed it.

On June 12, a US export-control directive ordered Anthropic to cut off all access to Fable 5 and Mythos 5 for any foreign national — not just abroad, but inside the United States, and including Anthropic's own foreign-national employees. The company couldn't segment its users by nationality fast enough to comply, so it did the only thing that satisfied the order on the clock it was given: it turned the models off. Globally. For everyone. Within hours.

That's the event. Here's the claim of this whole post: the shutdown proves that every model you build on is two things at once — an attack surface and a geopolitical kill switch — and that means security, not velocity, is now the metric that decides whether your system survives the year. Most teams wiring frontier models into their critical path have optimized hard for speed and treated resilience as a problem for later. Later just arrived, and it arrived as a press release nobody in engineering got to veto.

I'm not writing this to dunk on Anthropic. I build on these models every day and I'll keep doing it. I'm writing it because the failure mode it exposed is one almost every engineering org is currently exposed to, and most of them don't know it yet.

Unplugged overnight

A hosted frontier model feels like a utility. You call an endpoint, you get tokens back, the bill shows up monthly. Power, water, compute — same mental category. That mental category is wrong, and the directive is the proof.

Utilities don't get switched off by name on a Friday afternoon because of something that happened in a threat briefing you weren't in. This one did. The decision wasn't yours, wasn't your vendor's (Anthropic complied under protest), and wasn't subject to any SLA you signed. No commercial agreement protects you from an act of government; where a contract mentions one at all, it is usually a force majeure clause that excuses the vendor, not you. The capability was there in the morning and gone by dinner, and the only input that mattered came from a party you have no contract with at all.

If your critical path runs through a single hosted model, you have a dependency whose off-switch is held by someone you can't call.

This is the part engineers are trained to see in every other layer of the stack and somehow stopped seeing here. We wouldn't run a payments system on one provider with no fallback. We wouldn't put our whole business on one availability zone and call it resilient. But we'll route every agent, every classifier, every code-gen pipeline through one model from one vendor, hard-coded, and never ask what happens the day it returns a 403 for reasons that have nothing to do with us. We asked that question about databases twenty years ago. We haven't asked it about models.

The genie is already out of the bottle

The official logic of the shutdown is containment: a dangerous capability was discovered, so access to it gets restricted. That logic only works if the capability lives in one place. It doesn't.

By the current measures — an analysis by Håvard Tveit Ihle and colleagues on LessWrong, built on Epoch AI's benchmarking data, is the clearest — open-weight models trail the closed frontier by roughly six months. Six months. That's the lead. That's the entire moat the containment argument depends on.

You cannot switch off a weight file that has already been downloaded ten thousand times. You cannot issue an export-control directive to a model running on someone else's hardware in a jurisdiction that has no reason to honor it. A control that only works inside the borders of the country issuing it, against capability that exists in every other country within months, isn't a defense. It's a structural weakness with a press release attached — it constrains the defenders who comply and does nothing to the adversaries who don't.

Cutting off frontier capability doesn't remove it. It just decides who gets to keep using it, and the answer is rarely the people you'd pick.

Europe read it exactly this way. France's Bruno Retailleau put it bluntly: "a nation that depends on others for its technology is a nation that can be unplugged overnight." The political response there wasn't "let's get access back." It was "let's stop being dependent" — a hard pivot toward Mistral and homegrown capability. Whatever you think of the geopolitics, the engineering instinct underneath it is correct: a dependency you can be denied at someone else's discretion is a liability you have to design around, not a convenience you get to assume.

"The government had a real reason."

Here's the strongest version of the case against everything I've just written, and it deserves a real answer, not a strawman.

The trigger wasn't paranoia. The government believed it had found a way to jailbreak Fable 5, and the capability in question had cyber-offensive implications. If a frontier model can be reliably turned into an exploit-generation engine, that is a legitimate national-security concern, and a regulator acting on it is not being hysterical. Anthropic complied for a reason. Steelmanned all the way: the concern was real, the stakes were real, and reasonable people staffed that decision.

Grant all of it. The lesson doesn't move.

Because look at what the triggering capability actually was. By the reporting — Snyk's security write-up is the clearest — it was a "narrow jailbreak" around getting the model to read code and fix its vulnerabilities. That's not an exotic weapon. That's automated code review. That's the single most useful defensive thing these models do, the thing your security team wants them doing all day long. You cannot ban "read this codebase and find the flaws" without banning the exact workflow defenders depend on, because attackers and defenders run the identical query — the only difference is what they do with the answer.

So the concern can be entirely valid and the response still indicts the architecture. A safety control whose only available implementation was "make the model go dark for everyone, including every defender and the vendor's own staff" is not a precise instrument. It's a blast radius. And the thing about a blast radius is that the people standing closest to the explosion are usually the ones who were doing legitimate work.

A control that can only protect you by turning the lights off for everyone is not a control. It's a single point of failure wearing a safety vest.

Security is the metric that survives both

Put the two failure modes side by side. The model can be jailbroken — that's the attack surface. The model can be revoked by policy — that's the kill switch. Same asset, two ways to lose it, and neither one is in your roadmap. The only discipline that addresses both is the one most teams have been treating as a phase-four nice-to-have: security, designed in from the start.

Concretely, treating it as metric #1 changes three things.

You build for model redundancy the way you already build for region redundancy. More than one provider, an abstraction layer over the model call, the ability to fail over to an open-weight model you host yourself — not because the open model is as good, but because a degraded model you control beats a better model you can be denied. The parity gap above is exactly why this is feasible now: your fallback is about six months behind, not two or three years.
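The abstraction layer does not have to be elaborate. Here is a minimal sketch, assuming each provider is wrapped as a plain callable that takes a prompt and returns text; the names are hypothetical and no real vendor SDK is being described.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]  # (label, call)


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def complete_with_failover(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order; return (label, text) from the first that answers."""
    errors: list[str] = []
    for label, call in providers:
        try:
            return label, call(prompt)
        except Exception as exc:  # a revoked key, a 403, a timeout: all mean "next"
            errors.append(f"{label}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the label alongside the text matters: downstream code can tell when it is running on the degraded fallback and adjust, and you find out that failover happened instead of discovering it in a quality regression.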

You treat AI as part of your existing attack surface instead of a magic box bolted to the side of it. Every model output that reaches a sensitive sink gets the same scrutiny as any other untrusted input, because a jailbroken model is an untrusted input. The defensive disciplines for this already exist — it's the same resilience thinking we apply to any distributed system. A model provider is just another dependency that can fail, partition, or lie, and we already know how to engineer around dependencies that do.

And you stop shipping AI-generated code you never threat-modeled. The velocity these models give you is borrowed against a security debt that comes due the first time generated code hits production with a flaw nobody reviewed. Speed that you can't secure isn't an asset. It's leverage pointed at your own foot.

None of this is exotic. It's the resilience engineering we already do for every other critical dependency, finally applied to the one we've been pretending is a utility.

What this costs you if you ignore it

The teams most exposed right now are the ones for whom last week was a non-event — the ones who felt the headlines, noted that their region still had access, and moved on. Their dependency didn't get switched off this time. The architecture that left them one directive away from an outage is completely intact, and the next trigger doesn't have to be a jailbreak. It can be an export rule, a sanctions list, a licensing dispute, a model deprecation, a provider that simply decides your use case isn't worth the liability.

The cost of the conventional wisdom — one model, no fallback, security as a checklist you get to at the end — is an outage you cannot engineer your way out of after it lands, because the time to build the fallback was before you needed it. The shutdown didn't create that risk. It just sent everyone an invoice for it, and most teams are going to file it under "interesting" and pay it later at a much worse exchange rate.

Velocity got us here. It won't get us through what's next. The metric that does is the one we've been deferring.

Am I wrong about this? If you've already built real model redundancy — actual failover to a second provider or a self-hosted open-weight model, not a config flag you've never tested — I want to hear how it's holding up, and what it cost you to build before you needed it. And if you think I'm over-rotating on one directive, tell me why. I'm reading every comment.

If this named something you've been feeling but hadn't put words to, subscribe below — I write up what building resilient agentic systems teaches me, usually by going wrong in production first.

Subscribe on Substack



I Got Tired of AI Code Review Noise, So I Built a Ratchet

Glenn Eggleton — Thu, 11 Jun 2026 13:47:49 GMT

LLM reviewers can flag anything. Only deterministic checks get to block.

For months I ran four LLM reviewers on every diff that mattered: code, security, library, and an adversarial claims reviewer. Four cold-context specialists, every change. It felt rigorous, so I didn't look too hard at what they were actually producing.

When I finally did, the rigor fell apart. The reviewers were flagging things faster than I could work through them, and most of it was not new. It was the same finding from last week, worded a little differently, raised again by an agent that had no idea it had raised it before. I was paying for the same review over and over.

If your review process never turns its recurring findings into deterministic checks, it is not a process. It is noise on a loop. Let an LLM reviewer flag anything it wants. Only a deterministic check should be allowed to block a merge. Most teams wiring up AI code review right now have not separated those two things. I hadn't either.

The same finding, forever

A human reviewer who flags the same problem three times eventually does something about it. Writes a lint rule. Updates the style guide. Says something in standup that becomes team lore. The third time costs less than the first because people remember, and remembering carries consequences.

An LLM reviewer remembers nothing. Every run starts cold. The agent that flagged a missing validation check on Tuesday flags it again on Thursday with the same confidence and no idea it is repeating itself. My cost per finding stayed flat while the value of each finding fell toward zero. That is not review. It is a subscription to my own backlog.

Volume made it worse. Four reviewers, ten-ish findings each, every day. I became the bottleneck, reading output I had stopped trusting and re-deciding things I was fairly sure I had already decided but could not prove. The false positives came back just as reliably as the real findings, with nothing to tell them apart.

The obvious fix is memory. Save the findings, feed them back to the reviewer next time, let it skip what it already said. I thought about it and dropped it, for two reasons.

The reviewers run cold on purpose. The value of a second opinion is that it has not seen the first one. Prime a reviewer with its own history and it starts agreeing with its past self instead of reading the diff.

And memory solves the wrong problem anyway. A finding that keeps coming back does not need to be remembered. It needs to be dealt with. An agent that remembers flagging the same thing three times is just an agent that remembers being ignored. The missing piece was never memory. It was consequence.

Vibes with veto power, or noise nobody reads

There are two standard ways to handle this, and both make it worse.

The first is to make the reviewer a gate. If the LLM says stop, the merge stops. Now non-deterministic judgment has veto power over your pipeline. The same diff passes Monday and fails Tuesday because the model worried about something different that time. Engineers figure this out fast, and a gate they cannot predict is a gate they stop respecting. The distrust then spreads to everything else the agents touch. That is vibes with veto power.

The second is to make everything advisory. The reviewer comments, nobody is blocked, work continues. It feels safer and rots just as fast, because a finding with no consequence teaches everyone to scroll past it. Give it a few weeks and the advisory output is wallpaper. You are paying for tokens nobody reads.

A finding that recurs without consequence trains you to ignore the reviewer.

Neither failure is the model's fault. The models review fine. The problem is that neither setup separates flagging a problem from blocking on it. The gate fuses them, so every flag becomes a verdict. The advisory split cuts the wire entirely, so nothing a flag says is ever enforced. What you want is the two held apart, with a deliberate path from one to the other.

Tier the checks: flag vs. block

So I built that path. It shipped today as agentic-os v1.1.0. The core is three tiers, ordered by how reproducible each one is.

Tier 0 is deterministic validators. Scripts, linters, schema checks, grep rules. Anything that returns the same answer every time. This is the only tier allowed to block a merge. If a check can flake, it does not get to gate.

Tier 1 is LLM judgment with evidence attached. A reviewer can push a finding up to this tier only by bringing a deterministic artifact: a failing script, a counterexample, something that exits non-zero on its own. The judgment finds the problem. The artifact is what actually gates. The argument around it does not.

Tier 2 is everything else the reviewer thinks. Style, unease, "this feels wrong." Advisory, never blocking. The part that keeps Tier 2 from being pure noise is that every finding here gets recorded, fingerprinted, and counted.

All four reviewers were rewired to this in the same release. A stop verdict standing on Tier 2 alone is no longer a verdict. It is a flag, and it goes to a ledger instead of to the merge button.
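The routing rule fits in one function. This is a sketch of the flag-versus-block split as described above, not the project's code: Finding and decide are hypothetical names, and representing evidence as the exit code of a deterministic check is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    claim: str
    tier: int                              # 0, 1, or 2
    check_exit_code: Optional[int] = None  # result of the deterministic artifact, if any


def decide(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (blocking, ledger).

    Only a failing deterministic check can block. Anything without one,
    whatever tier the reviewer claimed, is a flag and goes to the ledger.
    """
    blocking: list[Finding] = []
    ledger: list[Finding] = []
    for finding in findings:
        if finding.tier == 2 or finding.check_exit_code is None:
            ledger.append(finding)
        elif finding.check_exit_code != 0:
            blocking.append(finding)
        # evidenced and passing: nothing to block, nothing to record
    return blocking, ledger
```

Note that a Tier 1 claim that arrives without its artifact is treated exactly like Tier 2. The reviewer's confidence never enters the decision; only the exit code does.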

The ratchet: recurrence earns promotion

The ledger is where it starts to compound. It is also where the memory lives, outside the reviewers where it cannot pollute a cold read, and attached to a consequence.

It is boring on purpose. An append-only JSONL file, one line per event, driven by a small Python script with five commands: add, tally, triage, promote, retire. Every unevidenced finding gets a SHA-256 fingerprint built from the file path and the normalized claim, so the same defect written two different ways across two runs lands on one entry instead of looking like two discoveries.

{"fingerprint":"a3f29c41d7b08e55","file":"src/api/sessions.ts","claim":"session token compared without constant-time check","tier":2,"source":"security-reviewer","run_id":"r-0611","date":"2026-06-11","evidence":null,"status":"RECURRING"}
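A fingerprint like the one in that entry could be produced along these lines. Treat it as a sketch: the normalization rules are my assumption (the real normalizer is not shown here), and truncating the digest to 16 hex characters is inferred from the length of the example above.

```python
import hashlib
import re


def normalize_claim(claim: str) -> str:
    """Reduce a claim to lowercase words so surface rewording hashes the same."""
    text = claim.lower()
    # Drop apostrophes inside words ("doesn't" -> "doesnt") so contractions
    # are never mistaken for quote characters.
    text = re.sub(r"(?<=\w)['\u2019](?=\w)", "", text)
    # Replace all remaining punctuation with spaces, then collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def fingerprint(file_path: str, claim: str) -> str:
    """SHA-256 over the file path and normalized claim, truncated to 16 hex chars."""
    payload = f"{file_path}\n{normalize_claim(claim)}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]
```

Normalization of this kind only absorbs surface differences such as case, punctuation, and spacing. Two genuinely different phrasings of the same defect would still need something smarter, or a human at triage, to be merged.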

Recurrence counts distinct runs, not raw sightings, so a reviewer repeating itself five times in one run cannot fake a trend. When a fingerprint crosses the threshold, triage surfaces it for a human to look at. If it is real, you encode it as a Tier 0 validator or a Tier 1 evidence script, and promote records it. Promote refuses to mark anything done unless the encoded check is attached. The promotion is the check. Findings nobody ever repeats age out through retire.
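Counting distinct runs rather than sightings is a one-line difference in the tally, and it is worth seeing. The sketch below assumes the ledger fields from the example entry (fingerprint, run_id); the threshold default, the "PROMOTED" status value, and the function names are illustrative assumptions, not the script's real interface.

```python
import json
from collections import defaultdict
from typing import Iterable, Optional


def recurring(ledger_lines: Iterable[str], threshold: int = 3) -> dict[str, int]:
    """Map fingerprint -> number of distinct runs, for those at or over threshold."""
    runs: dict[str, set[str]] = defaultdict(set)
    for line in ledger_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # A set per fingerprint: five sightings in one run still count once.
        runs[event["fingerprint"]].add(event["run_id"])
    return {fp: len(ids) for fp, ids in runs.items() if len(ids) >= threshold}


def promote(fingerprint: str, check_path: Optional[str]) -> dict[str, str]:
    """Build a promotion event; refuse unless an encoded check is attached."""
    if not check_path:
        raise ValueError("promotion requires the encoded check")
    return {"fingerprint": fingerprint, "status": "PROMOTED", "evidence": check_path}
```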

That is the ratchet: finding, ledger, tally, promote, check. Once a defect class is promoted, no LLM argues about it again. It is out of the stochastic layer for good, and the reviewers go back to looking at what is actually new in the diff.

Once a defect class is promoted, no LLM re-litigates it. The ratchet only turns one way.

Is any of this load-bearing yet? It caught its own first bug before release. The fingerprint normalizer was treating apostrophes in contractions as quote characters, which made unrelated findings collide into one entry. That fix shipped with a regression test in the same release. So did 33 routing collisions across the library, each one a recurring finding that got investigated and encoded instead of re-flagged forever. The system chewing on its own output is the whole idea.

"You just rebuilt lint with extra steps"

Fair objection. If every good finding ends up as a deterministic check, have I just rebuilt my lint config the long way and thrown out the LLM judgment that was the point?

No, because of the pipeline. Lint rules show up when a human gets annoyed enough to write one. There has never been a standing path from "the reviewer keeps mentioning this" to "the machine checks this now." The ratchet is that path. It does not get rid of judgment. It retires the judgments you have already settled, so the expensive stochastic layer stays aimed at the things you have not. The reviewer stops re-finding what you already know and starts finding the next thing worth promoting. Lint never had that. That is the new part.

What skipping it costs you

Stand up AI code review without this and you get one of the two decays: a gate your team learns to distrust, or comments your team learns to skip. Either way your spend grows in a straight line and compounds nothing. Every dollar buys the same findings the last dollar did.

With the ratchet, the curve bends. Every promoted finding is a review you never pay for again. Quality stops resetting to zero each run and starts accumulating, the way the rest of your tooling already does.

The whole thing, tiers and ledger and rewired reviewers, is open source in agentic-os, one command to install. If the flag-versus-block split named something you have been feeling but had not put words to, star the repo. And subscribe below, where I write up what this system teaches me, usually by going wrong first.

Subscribe on Substack

— Glenn Eggleton builds agentic engineering systems and writes about what survives contact with production.


AI Made Engineers Faster. It Also Made Teams Slower to Integrate.

Glenn Eggleton — Wed, 10 Jun 2026 15:12:50 GMT

One engineer can now build the whole feature alone. That's the win. It might also be the problem.

An engineer on a team I know shipped an entire feature last month (three services, a new queue, a schema change, the lot) in about four days. Alone. No design meeting, no integration huddle, no "can you walk me through how the payment service expects this." Just one person and a pile of agents, finishing in four days what would have taken a small group two weeks in the before times.

Then deploy day came, and everything stopped.

Not because the code was wrong. The code was fine. It stopped because the moment that feature had to leave that one engineer's head — to a devops person who owns the pipeline, to a reviewer who has to actually understand it, to the on-call rotation that will get paged when it breaks — nobody else had the map. The whole cross-service model existed in exactly one brain, and there was no cheap way to get it into a second one.

Here's the thing I keep circling back to, and the claim of this whole post: the velocity AI gives one engineer is borrowed against a collaboration tax the entire team repays at integration and deploy. And most of us haven't noticed the debt yet. The speed is real. It shows up on every dashboard we have. The tax is just as real, and it shows up nowhere, until the bill arrives at the worst possible moment.

I want to be honest up front: I don't have this solved. This is me thinking out loud about a pattern I keep seeing, in my own work and in teams I talk to. I'm more sure the problem is real than I am about anything we should do about it.

The solo end-to-end build is real, and it's genuinely fast

Let's start with the part that isn't a complaint, because it's important not to wave this away as hype.

The thing that's changed is the scope one person can hold. It used to be that a feature crossing three services crossed at least three people: somebody who knew the auth service, somebody who owned the data layer, somebody who lived in the front end. The boundaries between systems were also the boundaries between humans. You coordinated across services because you had to coordinate across people, and the coordination was the work.

Agents collapse that. One engineer can now open all three services at once, hold the full call path in working memory, and let the harness do the typing across every boundary at the same time. The auth change, the queue consumer, the migration, the client update, all built together in one session, by one person who never had to schedule a conversation to make it happen.

And it's fast. Not "feels fast." Measurably fast. The work that used to be gated on three calendars is now gated on one engineer's afternoon. If your only instrument is throughput, this looks like an unqualified win, and I understand why every engineering leader in the industry is leaning into it. I'm leaning into it. The output is real.

That's exactly what makes the rest of this hard to see.

The whole model now lives in one brain

When one person builds across three services in an afternoon, something quiet happens: the complete mental model of how those pieces fit together — why the queue retries the way it does, which failure the migration is guarding against, what the client assumes about the auth response — now exists in precisely one place. One head.

In the old world, that model was distributed whether you liked it or not. Three people built it, so three people held pieces of it, and the act of integrating forced them to reconcile their pieces out loud. The knowledge was spread across the team as a side effect of the work being spread across the team. Nobody designed it that way; it was just how building together worked.

The solo end-to-end build removes the side effect. The feature gets built, and the understanding of it doesn't spread, because spreading it was never required to ship it. You end up with a new kind of silo: not an organizational one, where a team hoards what it knows, but a structural one, where the knowledge was simply never externalized in the first place. It's invisible precisely because the building phase feels so good. One person, fully loaded with context, is the most productive unit in software. The problem is that productivity and resilience are pulling in opposite directions, and only one of them is on the screen.

Now, the obvious objection (and it's a good one) is that this is what documentation is for. Write it down. Have the agent generate the design doc. Drop a markdown file next to the feature explaining every decision. We have better tooling for this than we've ever had; the model that built the thing can also describe it.

I used to find that answer fully convincing. I find it less convincing now, for two reasons.

The first is that a document encodes facts, not judgment. It can tell you the queue retries three times with exponential backoff. It struggles to tell you why three and not five, what got tried and rejected, which production incident from two years ago is the reason that number exists at all. The tacit reasoning — the part that's actually expensive to rebuild — is the part that's hardest to write down and easiest to leave out. And the model writing the doc doesn't know it either, unless the engineer thought to say it.

The second is that reading isn't free. A document doesn't transfer understanding; it transfers the opportunity to rebuild understanding, and the reader still has to pay for that with their own time and attention. A 4,000-word design doc that took an agent ninety seconds to produce can cost a reviewer an hour to genuinely absorb, and they still can't interrogate it the way they could interrogate a colleague. You can't ask a markdown file "wait, what happens if the migration runs while the old consumer is still up?" and watch its face change as it realizes it hadn't thought about that.

Docs help. I'm not anti-doc. But they move the cost; they don't remove it. And critically, they move it downstream: from the fast, cheap building phase to the slow, expensive handoff phase. Which is exactly where the bill is waiting.

The bill comes due at integration and deploy

Watch where the friction actually lands now. It's not in the building. Building is the part that got fast. The friction migrated to the seams: the places where the work has to pass from the one person who holds the whole picture to everyone else who needs a piece of it.

Deploy is the sharpest seam. The engineer who built the feature understands its rollout implicitly: this migration runs before that service restarts, this queue drains before the old consumer dies, this flag flips last. None of that is in the code in a form the pipeline owner can read off. So now there's a conversation. But not the cheap, in-flight kind that used to happen while two people built a thing together. It's an after-the-fact download: four days of dense, agent-assisted context, handed to someone who has to receive it cold, all at once, under deploy-day pressure.

Code review has the same shape. A reviewer faced with a three-service change built in one pass isn't reviewing a diff anymore: they're reverse-engineering an entire mental model from its artifacts, fast, so the thing can move. Multiply that by every solo-built feature in the queue and you get a review process that's quietly become the bottleneck the building used to be. We didn't remove the constraint. We moved it from "writing the code" to "transferring the context," and the second one is harder to parallelize because it lives in people, not in machines.

This is the tax: slower deploys, longer reviews, more coordination overhead exactly when you can least afford it. It's close to invisible on the instruments, because nobody books "the hour it took to explain the feature to devops" as a cost of the feature. It just shows up as deploys dragging a little, for reasons nobody quite names.

And there's a sharper edge to it: bus factor. When the complete model of a critical feature lives in one head and was never forced out into the team, that head going on vacation (or leaving) isn't a staffing inconvenience. It's a genuine hole in the system's operability that you discover at 2 a.m. when the thing breaks and the one person who understands it is unreachable.

"We just throw PRs at each other and point our agents at them"

Here's the part that worries me most, because it's not about deploy speed. It's about what we're becoming as teams.

I was talking to a developer about how his team works now, and he described their collaboration like this, with no irony at all: "Oh, we just pass each other PRs to review and send our agents at them." And I've been chewing on that sentence ever since, because it was offered as a description of collaboration and it describes something that isn't collaboration at all.

Two engineers each building in isolation, each generating a change neither fully holds, each pointing an agent at the other's output to review it. That's not two people solving a problem together. It's two factories running in parallel, shipping parts to each other across a wall. There's throughput. There's no shared understanding being built, no one teaching anyone anything, no junior watching a senior reason through a hard call and absorbing how the senior thinks. The thing that used to happen for free inside collaboration — the learning, the transfer of taste and judgment from one person to another — has been engineered out, because the friction it rode on is the same friction we just removed.

That friction was load-bearing. The annoying parts of working together — having to explain your thinking, having to reconcile your model with someone else's, having to slow down enough to be understood — were also the parts that spread knowledge through a team and turned a group of individuals into something that knew more collectively than any one of them did. We treated that friction as pure overhead and optimized it away, and I'm not sure we noticed that it was doing a second job the whole time.

As an engineering leader, this is the part I can't shrug off. I want my team to collaborate. I want people solving each other's problems, learning from each other, getting sharper because they're surrounded by people who reason differently than they do. "We send our agents at each other's PRs" is the opposite of that. It's efficient and it's lonely and it doesn't compound the way a team that actually learns together compounds.

I genuinely don't know what we do about this

This is the part of the post where the formula says I'm supposed to give you three practices and a tidy framework. I'm not going to, because I don't have them, and I'd rather be honest than tidy.

I have half-formed instincts. Maybe deploy-readiness has to become an explicit team artifact instead of one person's implicit knowledge. Maybe we need to deliberately re-introduce some of the friction we removed: pairing on the hard features even when one person could solo them, precisely because the solo path skips the part where the team learns. Maybe the unit of work shouldn't be "a feature one person owns end to end" but something that's harder to hold alone on purpose. I don't trust any of these enough to tell you to go do them.

If I had to name the category all three point at, it's this: we may need to treat knowledge externalization as a first-class, scheduled part of the work. Not something we hope happens as a happy side effect of coordination, but a deliverable in its own right, planned and resourced like any other. And we may have to accept slower individual throughput on certain classes of change (the load-bearing, cross-service, wakes-you-at-2 a.m. ones) in exchange for lower organizational bus factor and faster future changes. That's a real trade, not a free lunch. It only pays off if the second-order cost is real, which is the whole question I can't yet answer.

What I'm fairly sure of is the shape of the trap. Every individual incentive points at the solo end-to-end build: it's faster, it's satisfying, it makes you look productive. Every individual decision to work that way is locally rational. And the cost lands somewhere that no individual feels and no dashboard shows: on the team's collective understanding, paid back slowly, at the seams, in deploys that drag and knowledge that doesn't spread and a kind of working-together that's quietly stopped being together at all.

Here's where I have to check my own framing, though, because I build multi-agent systems for a living and it would be too easy to write "agents bad" and walk away. The honest version is narrower: this generation of agent usage is pointed at individual velocity, which is exactly the force pulling context into one head. There's no law that says the next generation has to be. You could build agents whose whole job is the opposite. Agents that force externalization and cross-model reconciliation instead of skipping it:

agents that interrogate a build the way a teammate without the context would, asking "what happens if the migration runs while the old consumer is still up?", and making you answer before the work can ship;
handoff artifacts generated not as prose docs but as queryable models of the decision space, something the next engineer can actually ask questions of instead of reading cold;
multi-agent setups where separate agents role-play platform, security, and on-call during the build phase, so the reconciliation that used to happen between people happens before the work ever leaves one person.

None of that exists in a mature form yet, and I'm not claiming I've built it. But it shifts how I read the problem. It isn't "AI killed collaboration," full stop. It's that the first thing we aimed these tools at was individual speed, and the layer that pays the cost back (the one that rebuilds the reconciliation step inside the machine) is a layer we mostly haven't built. That's a more interesting problem than a complaint.

So I'll ask the people actually living this, because I think the answer is out there in your teams and not in my head:

Is this even a problem where you work? Or am I mistaking a transition for a loss? I'm especially interested in teams that have been working this way for six to twelve months: are you seeing the second-order effects yet (the bus-factor holes, the learning that quietly stopped), or am I over-weighting the transition period and this all settles out? And if it is a problem, what are you actually doing about it? Have you found a way to keep the velocity without hollowing out how your team learns from each other? I want the real answers, including "you're wrong, here's why." I'm working this out, and I'd rather work it out with you than pretend I've already figured it out.

Tell me in the comments. I'm reading all of them.

Modern Claude Code: The Complete `.claude/` Anatomy

Glenn Eggleton — Mon, 08 Jun 2026 17:57:24 GMT

A field guide to the directory, organized by where files live and which one wins.

Open your .claude/ directory right now. Go ahead — I'll wait.

Most of it is probably empty. Maybe a stray CLAUDE.md you wrote once and forgot. Maybe nothing at all, because you've been running Claude Code straight out of the box, re-explaining the same context every session, and quietly assuming that's just how it works.

It isn't. You're using maybe a third of what .claude/ can do. There is a complete modern anatomy here: a handful of directories and config keys, each with a defined job and a defined loading order. The highest-leverage features are exactly the ones that don't show up unless you go looking. This post is the map of the whole surface — every directory, what lives in it, and the one rule that ties it all together: which file wins when two of them disagree.

Not build order. Not philosophy. Inventory and precedence. By the end you'll be able to look at any .claude/ tree and know what's there, what's missing, and why one rule overrides another.

The Two Trees: Project and Global

There isn't one .claude/. There are two, and the distinction is the foundation for everything else.

The project tree lives at <project-root>/.claude/. It's committed to git, shared with the team, scoped to this repository. Anything that should travel with the code (the conventions for this codebase, the skills this project needs) lives here.

The global tree lives at ~/.claude/ in your home directory. It's personal, machine-local, applies to every project you touch. Your own habits, your own slash commands, the rules you want everywhere regardless of which repo you opened — those live here.

Here's the project tree, annotated:

<project-root>/
├── CLAUDE.md                  # project memory — conventions, ground truth
├── .mcp.json                  # team-shared MCP servers (committed)
└── .claude/
    ├── settings.json          # permissions, hooks, outputStyle (committed)
    ├── settings.local.json    # personal overrides (gitignored)
    ├── skills/
    │   └── <name>/SKILL.md    # auto-invoked capabilities
    ├── agents/
    │   └── <name>.md          # specialist subagents
    ├── commands/
    │   └── <name>.md          # custom slash commands
    └── rules/
        └── <name>.md          # path-scoped behavior rules

And the global tree, which mirrors it:

~/
├── .claude.json               # local + user MCP scopes, app state
└── .claude/
    ├── CLAUDE.md              # user memory — applies to every project
    ├── settings.json         # global permissions, hooks, defaults
    ├── skills/               # your personal skills, everywhere
    ├── agents/               # your personal specialists
    └── commands/             # your personal slash commands

The shapes are almost identical on purpose. Nearly everything that can exist at the project level can also exist at the global level. Which raises the obvious question: if you have a skill named code-review in both trees, or a permission rule in both settings.json files, which one runs?

Scope Precedence: The Rule That Ties It Together

The mental model is a stack, from broadest to most specific:

Enterprise / managed policy (set by an admin, if present): the outermost layer.
User / global (~/.claude/): your personal defaults, every project.
Project (<project-root>/.claude/): shared, committed, this repo.
Local project override (settings.local.json): your personal tweaks for this repo, gitignored.

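The principle is most-specific scope wins. A project setting overrides a global one. A local override beats the committed project setting. Your personal ~/.claude/CLAUDE.md sets the baseline; the repo's CLAUDE.md layers on top of it for that repo. The one exception is managed policy: where an admin has set it, it is enforced over every scope below it and cannot be overridden locally.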

This is not a clever feature you opt into. It's the load order that's running right now, whether you've configured it or not. The reason most people's setup feels unpredictable is that they have rules in one tree, expectations from the other, and no model of which one the agent actually reads.

Once you internalize the stack, configuration stops being guesswork. Want a habit applied to every task you do, in every repo? Global. Want a convention that travels with the codebase to every teammate? Project, committed. Want to bend the project's rules just for yourself without touching the shared file? settings.local.json. The directory you put the file in is the scope decision.
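As a mental model, the user, project, and local layers behave like dictionaries merged from broadest to most specific. The sketch below is a deliberate simplification: the `resolve` function and the `exampleKey` setting are hypothetical, managed policy is left out, and the real loader may merge some values (lists, for instance) rather than replace them.

```python
# Broadest scope first, most specific last, following the stack described above.
SCOPE_ORDER = ["user", "project", "local"]

def resolve(settings_by_scope):
    """Return {key: (value, winning_scope)} after applying scopes in order."""
    effective = {}
    for scope in SCOPE_ORDER:
        for key, value in settings_by_scope.get(scope, {}).items():
            # A later, more specific scope overwrites an earlier one.
            effective[key] = (value, scope)
    return effective

effective = resolve({
    "user": {"outputStyle": "Explanatory", "exampleKey": "from-user"},
    "project": {"outputStyle": "Concise"},
    "local": {"exampleKey": "from-local"},
})
```

Keeping the winning scope next to each value answers the question that matters when a setting surprises you: not just what the value is, but which file it came from.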

The seven-component conceptual model — and the build-in-order playbook for assembling all of this from scratch — is its own thing; I cover that in the free AgenticOS Map. This post is the physical anatomy. Same surface, different cut.

`settings.json`: Permissions, Hooks, and the Keys You Haven't Touched

settings.json is the control panel. Three parts of it matter most, and almost nobody configures the second two.

Permissions. You can pre-authorize or block tool calls with allow and deny lists, so Claude stops prompting you for the same npm test every session — and can't run the commands you never want it to. An allow entry for the commands you trust removes a class of interruptions; a deny entry is a guardrail that survives across sessions.
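The allow/deny idea can be pictured as a small decision function. This is a sketch under loud assumptions: it matches plain shell-style globs against the command string, which is not the actual rule syntax in settings.json, and the deny-before-allow ordering and the `decide` name are mine.

```python
from fnmatch import fnmatch

def decide(command, allow, deny):
    """Return 'deny', 'allow', or 'ask' for a command string."""
    # Deny is checked first so a guardrail cannot be bypassed by a broad allow.
    if any(fnmatch(command, pattern) for pattern in deny):
        return "deny"
    if any(fnmatch(command, pattern) for pattern in allow):
        return "allow"
    # Anything not covered falls back to prompting the user.
    return "ask"

allow = ["npm test", "git *"]
deny = ["git push*", "rm -rf *"]

decisions = {
    cmd: decide(cmd, allow, deny)
    for cmd in ["npm test", "git status", "git push --force", "curl example.com"]
}
```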

Hooks. Hooks wire automated behavior to session events: PreToolUse, PostToolUse, Stop. The canonical one is a PostToolUse hook on Write that runs your linter the moment a file changes. The point is that the harness executes these, not the model. If you've ever asked Claude to "always run the formatter after editing" and watched it forget three turns later, that's because you put it in prose instead of in a hook. Prose is a suggestion; a hook is a guarantee. (The hooks philosophy — when automation earns its place — is its own deep dive in the paid series.)
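Here is roughly what the decision inside such a hook script could look like. Treat the payload shape as an assumption: I am assuming the harness hands the script a JSON object with a `tool_name` and a `tool_input` containing `file_path`, and the linter table is purely illustrative. Check the hooks documentation for the real fields before relying on this.

```python
import json

# Illustrative mapping from file suffix to a linter command prefix.
LINTERS = {
    ".py": ["ruff", "check"],
    ".ts": ["eslint"],
    ".tsx": ["eslint"],
}

def lint_command(payload_text):
    """Given the hook's JSON payload as text, return the linter argv or None."""
    payload = json.loads(payload_text)
    # Only react to file writes; every other tool call is ignored.
    if payload.get("tool_name") != "Write":
        return None
    path = payload.get("tool_input", {}).get("file_path", "")
    for suffix, command in LINTERS.items():
        if path.endswith(suffix):
            return [*command, path]
    return None

example = json.dumps({"tool_name": "Write", "tool_input": {"file_path": "src/api/users.ts"}})
command = lint_command(example)
```

A real hook script would read the payload from standard input and hand the returned list to `subprocess.run`. The decision is kept separate here so it can be tested without touching the filesystem.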

outputStyle. This one corrects a common misconception. There is no .claude/output-styles/ directory. Output style is a single setting in settings.json:

{
  "outputStyle": "Explanatory"
}

It changes how Claude communicates — for example, an Explanatory style that narrates its reasoning as it works. You can also flip it interactively with /config. If you went looking for a directory, that's why you didn't find one. It's a key, not a folder.

`CLAUDE.md`: The Memory File, As Mechanics

CLAUDE.md is the highest-priority instruction file in the system — the project's ground truth, read at session start. There's a build-order argument about when to write it (short version: last, once the rest of the system exists; long version is in the paid series). Set that aside. Here are the mechanics.

The under-200-lines discipline. CLAUDE.md competes for the same context window as your actual code. A bloated constitution crowds out the thing you're trying to work on, and past a certain length the agent starts skimming it the way you skim a terms-of-service page. Keep it tight. If a rule is derivable from the repo, it doesn't belong here.

/init. Run it once in a new repo and Claude bootstraps a starter CLAUDE.md by reading the codebase: package manager, test command, structure. It's the fastest way from empty to useful.

/memory. Opens the memory files for direct editing so you can curate them deliberately instead of letting them accrete.

Path-scoped rules. This is the feature most people don't know exists, so it gets its own section.

`rules/`: Behavior That Loads Only When It's Relevant

.claude/rules/ is real, current, and badly underused. It solves the CLAUDE.md bloat problem directly.

Every rule you cram into CLAUDE.md loads on every session, whether or not it's relevant. A rule about your API error-handling convention is dead weight when you're editing CSS. Path-scoped rules fix that. A rule file is a markdown file under .claude/rules/ with optional YAML frontmatter, and the key field is paths:

---
paths: ["src/api/**/*.{ts,tsx}"]
---

# API conventions

- Every endpoint returns the standard `{ data, error }` envelope.
- Errors use the shared `AppError` class, never raw throws.
- Validate input at the boundary with the zod schemas in `src/api/schemas/`.

The paths: field accepts glob patterns, including brace expansion like {ts,tsx}. The behavior is the leverage: a rule with paths: loads only when Claude reads a file matching one of those globs. Edit something under src/api/, the API conventions load. Edit a stylesheet, they stay out of context entirely.

A rule file without a paths: field loads unconditionally — use that for genuinely global conventions. But the path-scoped variant is how you keep a large, opinionated codebase governed without paying the context cost on every unrelated edit. It's CLAUDE.md discipline, automated by relevance.
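The loading behavior described above amounts to a glob match per rule file. The sketch below is my own approximation, not the matcher Claude Code uses: it expands one level of braces, treats `**/` as "zero or more directories", and treats a rule with no globs as always active. The rule names are made up.

```python
import re
from itertools import product

def expand_braces(pattern):
    """Expand one level of {a,b} alternatives (nested braces are not handled)."""
    parts = re.split(r"\{([^{}]*)\}", pattern)
    # re.split with a capture group alternates: literal, alternatives, literal, ...
    choices = [p.split(",") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]

def glob_to_regex(glob):
    """Translate a path glob into a compiled regex."""
    out, i = [], 0
    while i < len(glob):
        if glob.startswith("**/", i):
            out.append(r"(?:.*/)?")   # zero or more directories
            i += 3
        elif glob.startswith("**", i):
            out.append(r".*")
            i += 2
        elif glob[i] == "*":
            out.append(r"[^/]*")      # anything within one path segment
            i += 1
        else:
            out.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(out))

def active_rules(path, rules):
    """rules maps a rule name to a list of globs, or None for 'always load'."""
    active = []
    for name, globs in rules.items():
        if globs is None:
            active.append(name)
            continue
        expanded = [g for pattern in globs for g in expand_braces(pattern)]
        if any(glob_to_regex(g).fullmatch(path) for g in expanded):
            active.append(name)
    return active

rules = {"api-conventions": ["src/api/**/*.{ts,tsx}"], "global-style": None}
for_api = active_rules("src/api/v1/users.tsx", rules)
for_css = active_rules("src/styles/main.css", rules)
```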

(Reference: the path-specific rules documentation at code.claude.com.)

Commands and Skills: Two Ways to Reach Behavior

Custom slash commands and skills both live as markdown files, and they overlap enough to confuse people. The distinction is how they fire.

A custom command lives at .claude/commands/<name>.md and you invoke it explicitly: /my-command. It's a saved prompt you trigger on demand.

A skill lives at .claude/skills/<name>/SKILL.md and is auto-invoked when the conversation context matches its description. You don't have to remember it exists; Claude reaches for it when the work calls for it. That auto-invocation is the whole point: a skill is behavior the system applies for you, a command is behavior you summon.

Both support dynamic content, and these two pieces of syntax are where a lot of power lives:

$ARGUMENTS for user input. All-caps. You can take everything the user passed ($ARGUMENTS), a positional slice ($1, $ARGUMENTS[2]), or a named field ($name). This is what turns a static prompt into a parameterized one — /review src/auth flows src/auth straight into the command body.

!`shell command` for live shell output. Backtick-wrapped, bang-prefixed. It runs the shell command as preprocessing, before Claude ever sees the prompt, and substitutes the output inline. So a command can open with !`git diff --staged` and Claude starts the turn already looking at your real, current diff. No copy-paste, no stale context.
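To see how the two substitutions fit together, here is a toy renderer. It is a sketch of the idea, not the harness's implementation: the `render` and `run_shell` names are hypothetical, it handles only `$ARGUMENTS`, `$1`-style positionals, and the bang-backtick form, and it ignores `$ARGUMENTS[2]` and named fields.

```python
import re
import shlex
import subprocess

# One pattern, one pass: substituted text is never rescanned, so user
# arguments cannot smuggle in a shell command.
TOKEN = re.compile(r"!`([^`]+)`|\$ARGUMENTS|\$(\d+)")

def run_shell(command):
    """Run a shell command and return its trimmed stdout."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout.strip()

def render(template, raw_args, run=run_shell):
    """Fill a command template before the model ever sees it."""
    args = shlex.split(raw_args)

    def replace(match):
        if match.group(1) is not None:       # !`command`
            return run(match.group(1))
        if match.group(2) is not None:       # $1, $2, ... (1-indexed)
            index = int(match.group(2)) - 1
            return args[index] if 0 <= index < len(args) else ""
        return raw_args                      # $ARGUMENTS

    return TOKEN.sub(replace, template)

template = "Review $1 closely.\nAll args: $ARGUMENTS\nStaged diff:\n!`git diff --staged`"
# A stub runner keeps the example deterministic; drop it to run real commands.
prompt = render(template, "src/auth --strict", run=lambda cmd: f"<output of {cmd}>")
```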

Skill frontmatter is richer than most people realize. The fields you'll actually use:

name and description — the latter drives auto-invocation, so write it for matching, not for marketing.
allowed-tools — hyphenated, not camelCase. This trips people up constantly. Restrict a skill to exactly the tools it needs.
disallowed-tools — the inverse, when a denylist is cleaner than an allowlist.
disable-model-invocation — make a skill manual-only (no auto-invoke).
user-invocable — control whether a user can trigger it directly.
paths — scope a skill to relevant files, same idea as rules.
arguments — declare the inputs the skill expects.
context: fork — run the skill in a forked context so it doesn't pollute the main conversation.
agent, model, effort — route the skill to a specific subagent, model, or effort level.
hooks — wire skill-local event behavior.

You don't need all of these on day one. You do need to know they exist, because the difference between a skill that works and one that quietly grabs the wrong tool is usually one frontmatter line.
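Since one mistyped key is the usual failure, a tiny check can catch it before the skill ships. This is a hypothetical helper, not part of Claude Code: the key list is copied from the fields named above, and the parser only looks at top-level `key: value` lines rather than parsing YAML properly.

```python
import re

# Field names taken from the list above; anything else gets flagged.
KNOWN_KEYS = {
    "name", "description", "allowed-tools", "disallowed-tools",
    "disable-model-invocation", "user-invocable", "paths", "arguments",
    "context", "agent", "model", "effort", "hooks",
}

def to_kebab(key):
    """allowedTools -> allowed-tools"""
    return re.sub(r"(?<!^)(?=[A-Z])", "-", key).lower()

def check_frontmatter(text):
    """Return a list of problems found in a skill file's frontmatter."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["missing frontmatter block"]
    problems = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        # Skip blanks, comments, list items, and indented (nested) lines.
        if not line or line[0].isspace() or line.startswith(("-", "#")) or ":" not in line:
            continue
        key = line.split(":", 1)[0].strip()
        if key in KNOWN_KEYS:
            continue
        hint = to_kebab(key)
        if hint in KNOWN_KEYS:
            problems.append(f"unknown key '{key}', did you mean '{hint}'?")
        else:
            problems.append(f"unknown key '{key}'")
    return problems

skill = """---
name: code-review
description: Review staged changes for correctness
allowedTools: Read, Grep
---
Review the diff.
"""
problems = check_frontmatter(skill)
```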

`agents/`: Specialists, Not Generalists

.claude/agents/ holds named subagents with declared scope and a declared tool allowlist. A code-reviewer that reviews but never writes. A security-reviewer that flags but never fixes. You dispatch them from an orchestrator, they do their narrow job, they report back.

The mechanic worth knowing here is the restriction: a subagent can be locked to a specific model and a specific set of tools, which is what makes fan-out safe. You can launch several at once over isolated worktrees without them stepping on each other or reaching for tools they shouldn't have. The why — separation of concerns at the agent level, the briefing discipline, the review gate — is the subject of the paid series' agents post. The where is .claude/agents/, and you manage the whole roster with /agents.

MCP: Wiring in External Tools

MCP (Model Context Protocol) is how Claude Code reaches systems outside the repo — a database, an issue tracker, a docs server. Where you put the config decides who gets it, and that maps cleanly onto the scope stack:

.mcp.json at the project root is team-shared. Commit it, and every teammate who clones the repo gets the same servers wired up. This is the one you want for project infrastructure.
~/.claude.json holds your local scope (just the current project, just you) and your user scope (every project, just you).

You don't hand-edit these in practice. You run:

claude mcp add --scope project <name> <command>
claude mcp add --scope user <name> <command>
claude mcp add --scope local <name> <command>

The --scope flag is the same precedence decision in command form. project writes to the committed .mcp.json; user and local write to ~/.claude.json. Pick the scope, and you've decided who inherits the server.

The Features You're Probably Not Using

Here's the part of the inventory most people never reach — the built-in slash commands that ship with Claude Code and never announce themselves. These aren't files you write. They're already there.

Context and session management:

/compact — compress the conversation history when it gets long, keeping the thread alive without the bloat.
/clear — wipe the context and start fresh.
/context — show what's currently loaded into the context window. The single best way to see why the agent is behaving the way it is.
/effort — tune how much reasoning effort the model spends.

Working and recovering:

/plan — plan mode: the agent thinks through the approach before touching anything.
/rewind — roll back to an earlier checkpoint. Claude Code keeps checkpoints; /rewind is the undo you didn't know you had when a session goes sideways.
/loop — run a prompt or command on a repeating interval.
/copy — copy output to the clipboard.

The high-leverage ones worth featuring:

/agents — manage your subagents (the roster from the agents/ section).
/code-review — multi-axis review, with --fix to apply findings and --comment to post them inline.
/batch — make parallel changes across worktrees.
/diff — inspect changes directly.
/goal — set and track the session's objective.

If you take one action from this whole post, run /context in your next session and look at what's actually loaded. Then run /agents and /code-review once each. Most people discover in five minutes that the tool they've been using is a fraction of the tool that shipped.

The Community Surface

The official docs at code.claude.com are the ground truth for every claim in this post — when something here disagrees with a folk pattern you read somewhere, trust the docs. Beyond them, the community has built a substantial catalog of skills, commands, and agent definitions worth borrowing from: awesome-claude-code. Read a few well-built skill files before you write your own; the frontmatter patterns alone will save you an afternoon.

What You Now Know

You came in using maybe a third of .claude/. You now have the complete modern anatomy: two trees, project and global; the precedence stack that decides which file wins; settings.json for permissions, hooks, and the outputStyle key; CLAUDE.md mechanics and the under-200-line discipline; path-scoped rules/ that load only when relevant; commands versus auto-invoked skills with $ARGUMENTS and live shell substitution; specialist subagents in agents/; MCP wiring by scope; and the built-in slash commands hiding in plain sight.

The gap was never the tool. It was the inventory. You can't configure a directory you've never mapped.

Two paths from here, depending on what you want next:

If you want the mental model for building all of this in order — what to write first, what to write last, and why — start with the free AgenticOS Map. It's the build-in-order companion to this anatomy.
If you're an engineer ready to build now, the paid AgenticOS series walks the construction end to end, starting with P1: Skills — the atomic unit of agent behavior. That's where the anatomy becomes a system.

Both paths start the same way: subscribe, and the free AgenticOS Map lands in your inbox next.

Subscribe to AgenticOS on Substack


I've Been Gatekeeping the Magic. Here's Everything.

Glenn Eggleton — Fri, 05 Jun 2026 14:05:48 GMT

Senior engineers have been quietly building AI scaffolding that juniors never get access to. I just open-sourced mine.

For months I told myself I'd publish this when it was cleaner. More polished. Better documented.

That was not the truth. That was gatekeeping, and today it ends.

I've been running an AI operating system for months now. It made me faster across every repeatable work type I tracked. Everyone who got access to it shipped senior-level work, and shipped it fast. I kept refining it, kept not publishing it, kept reaping the leverage while the rest of the field used raw API access and hoped for the best.

The OS is now open-source: https://github.com/LazyIsEfficient/agentic-os

One curl command. It installs immediately. I'm done sitting on it.

The gap is not what you think

The story we tell ourselves about AI tooling is that access is equal. Get an API key, set up Copilot or whatever tool you want to pay for, and now everyone is in the game. Junior and senior on the same playing field.

That story is wrong.

Senior engineers are not "better at prompting." They have spent months accumulating scaffolding: reusable skill files that tell agents exactly how to handle a class of work, specialist agents scoped to one task with the right tools, intake patterns called shapers that turn vague requests into scoped briefs before a word of code is written, memory systems that survive session boundaries and eliminate re-explaining context every morning, a constitution file (CLAUDE.md) that sets the rules everything else runs inside.

That scaffolding took months to build. Most seniors have not published it. Most have not even articulated it. It lives in their workflow as a private compound interest machine.

Juniors and interns get the raw API. They get Copilot/Cursor autocomplete. They get "figure it out." The gap between what a senior produces with their AI OS and what a junior produces with a chat window is not a skill gap. It is a scaffolding gap.

Scaffolding is transferable. That is the part that changes everything.

What actually happens when a junior runs the OS

I saw this firsthand. Junior engineers who got access to the same skills, shapers, and agents I was running stopped producing uncertain drafts and started producing outputs I reviewed and shipped. Interns who previously needed heavy guidance ran the OS against real tasks and came back with architecturally sound code.

The pushback I always get: "Won't this just produce vibe-coded slop at scale?"

No. The OS is specifically why not.

The OS ships with a code-reviewer agent that runs on every non-trivial diff. A security-reviewer for anything touching auth or user data. A TDD skill that drives implementation from tests first. A quality gate built into every content pipeline. These are not guardrails bolted on after the fact. They are first-class components of the system.

The OS does not produce raw output and hope for the best. It produces output and reviews it. A junior running the OS produces more than a senior without one, and the output passes through review before it ships. That is not slop. That is a supervised pipeline that anyone can run.

What's in the OS

The library ships with 80+ skills and 18 specialist agents. The categories:

Engineering: TDD, code review, security review, API design, debugging, frontend, TypeScript, Rust, cloud infrastructure, CI/CD, SRE, release management

Content: blog post shaping and authoring, course design, social growth, SEO ops, podcast ops

Product: technical product management, system architecture, documentation, ADRs

Games: Godot, Phaser, game design, balancing, monetization

Install on macOS/Linux:

curl -fsSL https://raw.githubusercontent.com/LazyIsEfficient/agentic-os/main/install.sh | bash

Files go to ~/.claude/skills/ and ~/.claude/agents/. Available immediately in any Claude Code session.

Then add this to your ~/.claude/CLAUDE.md to make Claude reach for skills by default instead of treating them as opt-in:

## Skills
You have a library of skills installed at `~/.claude/skills/`. Before responding to any task,
check whether a skill applies and invoke it with the Skill tool if so.
If there is even a 1% chance a skill might apply, invoke it first.

That single block is the highest-leverage configuration step in the entire setup.

Why I kept sitting on it

I told myself the OS needed to be complete before I shared it. Every week I added something, fixed something, and moved the goalposts on what "ready" meant.

The real reason: leverage feels better when it's yours alone.

That was the wrong call. The engineers who wait for polished tooling before they start are the ones falling behind. The ones who pull this now, run it rough, and calibrate through use are the ones compounding their leverage every week.

Juniors and interns do not need to wait for their senior to build the OS for them. They can pull it now, today, and start running skills that took months to refine. Leads who want their team to ship faster without more senior bandwidth can hand this to their juniors right now.

The gatekeeping was never about protecting the tool. It was hesitation. And hesitation is just compounding disadvantage for everyone who is not you.

What comes next

This is the launch. The build playbook comes after.

I have been writing a Substack series on how to build this OS from scratch: why to start with skills and not CLAUDE.md, how to write shapers that actually reduce intake noise, how memory compounds over weeks, and how to wire hooks that catch problems before CI does.

The map post is free and published. The build playbook is paid.

Read the complete AgenticOS map on Substack

If you want the full series, subscribe. The first paid post walks through writing your first skill file, the single step that changes how your team uses AI. That is where the build-out starts.

Build Your AgenticOS: Watch It Run

Glenn Eggleton — Mon, 01 Jun 2026 21:20:23 GMT

Seven posts explaining the system. This one shows it.

Seven posts describing how a system works is not the same as watching it work. You can read every post in this series, follow every step, and still have an open question: "But what does it actually look like when it runs?" That question deserves a direct answer.

This post is that answer. Not more explanation. A real session, recorded, unedited.

The claim behind this entire series is that a well-built AgenticOS doesn't make AI magic. It makes AI predictable. The session you're about to watch isn't impressive because of what the AI does. It's impressive because of the system around it. The same session, on the same task, produces the same shape of output every time. That's the point.

What You're Watching

This is a real task from the actual codebase. Not a demo task built to look clean. Not a simplified example. The codebase is this system's own library of skills and agents. The task is something that needed doing.

Here is what you'll watch happen, in order:

Session start: the context loads. When Claude Code opens, ~/CLAUDE.md loads automatically. That file is the constitution. It tells the agent where memory lives, how to dispatch work, and what the rules are. Before a single instruction is typed, the agent knows the system it's operating inside. The memory index loads next. That's where prior session context lives. The agent reads it and starts hot instead of cold.

The task goes through the prompt-shaper. Rather than typing a vague instruction and hoping the agent figures out what's wanted, the task runs through prompt-shaper first. The shaper asks a focused set of questions. It turns a rough idea into a scoped brief: what the output is, what files it touches, what done looks like. This takes a few minutes. It prevents thirty minutes of correction later.

Specialist agents are dispatched in parallel. Once there's a brief, agents execute against it. Not one agent doing everything sequentially. Multiple agents running simultaneously in a single message, each with a narrow scope. You'll see this in the terminal output: multiple task outputs arriving in a short window.

The review gate runs. After the implementation agents finish, code-reviewer and library-reviewer are dispatched in parallel. They're read-only. Their job is to catch what the implementation agents can't see. The reviews come back with a verdict: ship, ship-with-fixes, or hold.

Reading the diff. Before merging, the diff is read directly. Not trusting the agent's summary. Checking the actual changes: what changed, in which files, does it match the brief.

Done: the merged result. The task is merged. The session ends. The memory layer gets a new entry if anything non-obvious was learned.

That's the full loop. Now watch it.

The Session

This session took 22 minutes in real time; the recording is fast-forwarded.

What to Notice

Readers of the series will recognize the patterns as they appear. A few things worth watching for specifically:

The shaper is skipped in this session because the PRD was already excellent. That saves the agent time and saves you your sanity. Ask in chat if you want the PRD.md.

The review gate runs in parallel. Two agents, one message, both results arrive before a merge decision is made. The gate is not a formality. It's structurally separate from the implementation agents, which means it has no stake in defending what they wrote. An independent second pass by construction.

The memory loaded at the start. One of the memory entries that loads is the one that explains the 🌶️ Take prefix for social posts. That's why the agent doesn't ask about it. The system knows. That's what memory is for: non-obvious facts that would otherwise have to be re-explained in every session.

The agents don't improvise scope. At no point does an agent decide the task needs something extra. The brief is the contract. The agents execute the brief. Scope decisions happen at intake, not during implementation.

The diff is checked, not assumed. The agent's final message is a description of what it intended to do. The diff is what it actually did. Those two things are checked against each other before merging. This is the habit that prevents a whole category of invisible errors. A subagent that says "I updated the routing in CLAUDE.md" and a diff that shows it also touched three other files is a signal, not a rubber stamp. Read the diff.
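A minimal sketch of that comparison, assuming you already have the files the agent says it touched and the files the diff actually shows (for example from `git diff --name-only`). The function name `unexpected_changes` and the example paths are hypothetical, not part of agentic-os.

```python
def unexpected_changes(claimed, changed):
    """Return files present in the diff that the agent's summary never mentioned."""
    claimed_set = {path.strip() for path in claimed}
    return sorted(path.strip() for path in changed if path.strip() not in claimed_set)


# Hypothetical example: the summary mentions one file, the diff shows three.
claimed = ["CLAUDE.md"]
changed = ["CLAUDE.md", "skills/router/SKILL.md", "agents/code-reviewer.md"]
extra = unexpected_changes(claimed, changed)
print(extra)
```

An empty list means the summary and the diff agree on scope. Anything else is the signal to stop and read before merging.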

The session is reproducible. There is no moment in the recording where the output depends on a lucky prompt or a particularly cooperative AI response. The structure of the session is the same every time this class of task runs. Shaper runs first. Agents dispatch from the brief. Gate runs after implementation. Diff before merge. Any engineer on any team can follow the same structure and get the same shape of result.

The System Is Not Magic. It's Consistent.

The session you just watched isn't impressive because of the AI. The underlying model is the same one you have access to. What's different is the system it's operating inside.

CLAUDE.md loads on session start. Memory starts the agent warm. The shaper turns vague requests into tight briefs. Specialist agents execute with narrow scope. The review gate runs independently. The diff is read before merging.

None of these steps are clever. Each one is just a habit encoded into a file. The aggregate of those habits is a system that produces predictable output from the same starting materials every time. That's not magic. That's engineering.

If you've read this series and want to build your own version, start with the map. Every component is explained there, with the build order, the reason each layer exists, and what it gives you. You don't have to build all seven layers at once. The map tells you which layer to start with and what you get from it.

The Complete Map: Build Your Own AgenticOS

The proof

See the code output here

Build Your AgenticOS: Hooks Automate the Invisible

Glenn Eggleton — Mon, 01 Jun 2026 16:59:14 GMT

Hooks wire your AgenticOS to session events. They're the automation layer that removes the work you'd otherwise have to remember to ask for.

Memory tells agents what to remember. Hooks make the system act without being asked.

Every session, there is overhead you pay without thinking about it: describing the current git state, establishing which project is active, noting which deployment is live. None of it is hard. All of it is unnecessary. You know the system should have that context automatically. You keep meaning to set it up. Instead, you type it again.

Hooks are the automation layer that fixes this. They are shell commands wired to session events (SessionStart, PreToolUse, PostToolUse, Stop) and their output flows directly into the agent's context. No prompt required. The agent sees the hook output the same way it sees anything else you write. It just happened without you.
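As an illustration, here is a rough sketch of a script a SessionStart hook could run to print the current git state. It is not one of the real examples from the settings file. The helper names `run_git` and `format_context` are hypothetical, and the script relies on the behavior described above: whatever the hook prints ends up in the agent's context.

```python
import subprocess


def run_git(args):
    """Run a git command and return its stdout, or '' if git is missing or fails."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return ""
    return result.stdout.rstrip() if result.returncode == 0 else ""


def format_context(branch, status):
    """Build the short text block the hook prints for the agent to read."""
    lines = [f"Branch: {branch or 'unknown'}"]
    dirty = [line for line in status.splitlines() if line.strip()]
    lines.append(f"Uncommitted files: {len(dirty)}")
    lines.extend(f"  {line}" for line in dirty[:10])  # cap the list to limit noise
    return "\n".join(lines)


if __name__ == "__main__":
    branch = run_git(["rev-parse", "--abbrev-ref", "HEAD"])
    status = run_git(["status", "--short"])
    print(format_context(branch, status))
```

Keeping the output to a few lines matters here, because everything the hook prints is context the session pays for.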

This post covers what hooks are, where they live, the event model, Glenn's real examples from .claude/settings.json, how to design a hook that works without surprising you, and the clean distinction between hooks, skills, and CLAUDE.md.

Build Your AgenticOS: Worktrees for Parallel Agents

Glenn Eggleton — Mon, 01 Jun 2026 16:58:36 GMT

Git worktrees are how you get the wall-clock benefits of parallelism without the merge conflicts. Use them whenever two agents may touch the same files.

You run four agents in parallel. All four try to edit the same source file. The last one to write wins. The other three changes are gone. You don't get an error. You don't get a merge conflict. You get silent data loss dressed up as a successful run.

That is not a parallelism problem. It is an isolation problem. Parallelism is the right call; sequential dispatch of independent work is a bug. The mistake is dispatching parallel agents into a shared working tree and expecting them not to stomp on each other.

Git worktrees fix this. Each agent gets its own branch, its own working directory, zero shared file state. When the agents are done, you read the diffs and merge. The wall-clock benefit of parallelism, without the silent data loss.
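To make the isolation concrete, here is a small sketch that builds one `git worktree add` command per agent task. The `agent/` branch prefix, the `../worktrees` directory, and the task names are assumptions chosen for illustration. The sketch only builds the commands and does not run them.

```python
def worktree_commands(task_names, base_dir="../worktrees", base_ref="HEAD"):
    """Return one `git worktree add` command per task, each on its own branch."""
    commands = []
    for name in task_names:
        branch = f"agent/{name}"
        path = f"{base_dir}/{name}"
        # -b creates the branch; the path becomes that agent's private checkout.
        commands.append(["git", "worktree", "add", "-b", branch, path, base_ref])
    return commands


for cmd in worktree_commands(["fix-auth", "update-docs"]):
    print(" ".join(cmd))
```

Each agent is then pointed at its own path, so two agents editing the same source file produce two branches to merge instead of one overwritten file.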

Here is what worktrees are, the exact rule for when to use them, the cost you're accepting, the wave pattern that keeps dispatch manageable, the verification discipline that catches agent failures, and the anti-pattern that makes the whole thing pointless.

Build Your AgenticOS: The CLAUDE.md Constitution

Glenn Eggleton — Mon, 01 Jun 2026 15:51:44 GMT

CLAUDE.md is where your AgenticOS rules live. It governs every session, every agent, every project.

Every layer in the AgenticOS stack is optional except one. You can skip memory if you're comfortable re-establishing context each session. You can skip shapers if you want to write briefs by hand. You can run without a structured agent library and just prompt directly. None of that is fatal.

Skip CLAUDE.md and you have no system at all.

Without it, every session you start from scratch. You re-explain the same anti-patterns. You remind the agent not to skip code review. You re-establish that parallel fan-out is required for independent tasks. You correct the same mistake twice, then three times, because nothing was written down and the agent's behavior resets on the next cold start.

CLAUDE.md is the instruction layer that governs every session. It is where you encode the rules you are tired of repeating.

Here is what it is, where it sits in the priority stack, what belongs inside it versus what doesn't, and how to write a rule that actually holds.

Build Your AgenticOS: Memory That Survives Sessions

Glenn Eggleton — Mon, 01 Jun 2026 15:51:01 GMT

The layer that makes your AgenticOS learn, one non-obvious fact at a time.

Every session that starts cold is a session that re-learns what you already know. The agent you corrected yesterday will make the same mistake tomorrow. The decision you made two weeks ago will get relitigated next Tuesday. The quirk of your codebase that took forty-five minutes to explain will take another forty-five minutes. Again.

Memory is the layer that fixes this. Not memory in the fuzzy "AI remembers things" sense, but a concrete, version-controlled directory of short markdown files that captures the non-obvious facts your agent should start with on every session. Write it correctly and your AgenticOS accumulates context over time. Skip it and every session starts with a blank slate, which means every session is subtly slower than it should be.

Here is the format, the four types, the mechanics, and, critically, what not to save, because the most common mistake with memory is filling it with things the codebase already contains.

Build Your AgenticOS: Specialist Agents

Glenn Eggleton — Mon, 01 Jun 2026 15:50:05 GMT

The constraint is the feature. One agent definition file is more reliable than any prompt you've ever written.

One generalist agent doing everything is how you get mediocre output at every layer.

Not because the model is weak. Because a single agent context trying to hold intake shaping, implementation, code review, security audit, and social copywriting simultaneously is optimizing for nothing in particular. You get plausible output on every front and excellent output on none. The quality ceiling on a generalist agent is determined by the widest surface it has to cover.

Specialist agents with declared tool allowlists and routing triggers beat a single do-everything agent every time. The constraint is the feature. A code-reviewer that cannot write code will not quietly sneak a "helpful" fix into the diff while reviewing it. A security-reviewer that can only read files will not accidentally delete one. A blog-post-shaper that has no access to Write cannot produce a draft before the brief is agreed. The narrower the scope, the more predictable the output.

This post covers what an agent definition file actually is, the mandatory build+review gate that sits downstream of every agent dispatch, fan-out vs sequential dispatch, briefing discipline, and a real example from .claude/agents/. At the end, a starter template for writing your own specialist Claude Code agents. The briefing section alone is worth the read: bad briefs are the single biggest source of wasted cycles in any agent-driven workflow.

Build Your Own AgenticOS, Part 2: Shapers, The Intake Layer

Glenn Eggleton — Mon, 01 Jun 2026 15:49:34 GMT

Every vague request produces a vague output. The agent doesn't know who the reader is, what "done" looks like, or which of the five interpretations of your request you actually meant. It makes a reasonable guess and starts. Ten minutes later you have something that's technically responsive and completely wrong, and now you're editing instead of approving.

The shaper is the fix. A shaper is an intake agent whose only job is to turn a half-formed idea into a scoped brief before any execution agent touches a file. Not "clarify the request." Produce a structured document that removes all ambiguity for everything downstream. Intake quality is the ceiling on execution quality. A shaper enforces that ceiling before execution starts. Without one, every agent downstream simply inherits your ambiguity and confidently acts on it.

This is Part 2 of the Build Your Own AgenticOS series. If Part 1 covered the atomic unit (the skill file), Part 2 is about the layer that protects every skill from getting fired at the wrong target.

Here's what this post delivers:

What a shaper does, and why it's distinct from the skill it feeds
Why shapers come first in every request lifecycle
The real routing logic from CLAUDE.md in this repo (which requests route to which shaper and how)
How to write a shaper using the AskUserQuestion pattern
When to skip the shaper entirely (the answer is narrower than you think)
A starter shaper template you can copy today

Build Your AgenticOS: Start with Skill Files

Glenn Eggleton — Mon, 01 Jun 2026 15:48:57 GMT

Without skills, every agent response is a coin flip. The agent knows your stack, knows your tests pass, knows you prefer short methods. Then it improvises the rest. Ask it to write a blog post and it writes a blog post. Ask it again next week and you get a different blog post, structured differently, hitting different beats, because you never told it what a blog post is in this shop. The work looks fine. It just doesn't look the same twice.

Skills are the fix. A skill file is a markdown instruction set that any agent can load on demand. It encodes how your team does one class of work: the steps in order, the references to consult, the done criteria to verify before declaring complete. One file, committed to the repo, loaded whenever the work matches.

A skill file is the atomic unit of your AgenticOS. Get this layer right and every agent you build downstream becomes composable. Skip it and you are back to prompting from memory, which is just a slower version of improvising.

This is Part 1 of Build Your Own AgenticOS. You will learn what a skill file is, how to structure one, when to reach for a template instead of a skill, and how to tell whether a skill is load-bearing before you ship it.

Build Your Own AgenticOS: The Complete Map

Glenn Eggleton — Mon, 01 Jun 2026 15:20:15 GMT

The system layer that makes AI behaviour consistent, reviewable, and delegatable.

You're not using AI wrong. You're using it without a system. Every engineer on your team has their own approach: different prompts, different habits, different mental models of what agents can and can't do. The output is inconsistent. The knowledge is non-transferable. When the person who "gets AI" goes on leave, the AI capability walks out the door with them.

An AgenticOS is the system layer that fixes this. It is a composable, version-controlled set of files that sit inside your repo and tell agents how to behave, what to do, and what the rules are. You can build it in a day. It survives session boundaries, git clones, and team rotations.

Here is the complete map.

What an AgenticOS Actually Is

Before the components, the model: an AgenticOS is not a product, a vendor, or a framework you install. It is a directory structure you commit. The files are plain text. They define behaviour the same way a well-written README defines conventions. The difference is that agents can read them at runtime, and Claude Code (and similar tools) have predictable rules for which files get loaded, in what order, at what priority.

The full system has seven components. Each one solves a distinct problem. You can adopt them incrementally, starting with the highest-leverage layer and building down. The components are not interchangeable: Skills are the atomic unit. Everything else either produces Skills, consumes them, or governs how they're invoked.

The Seven Components

Skills

A skill is a markdown file that tells an agent how to perform a class of work. Not a specific task. A class. blog-post-author handles every blog post. code-reviewer handles every review. prompt-shaper handles every time someone says "I have a vague idea and need it scoped."

Skills are stored under .claude/skills//SKILL.md. They can include references, examples, and sub-files. The agent loads the skill when it's invoked and treats it as a first-class instruction set. A skill is not a system prompt and not a mega-prompt. It is a narrow, composable instruction set for one class of work. It can reference other files in its own directory: a references/ folder for structural guides, an assets/ folder for templates and examples.
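A rough scaffold for that layout, assuming each skill gets its own directory named after the skill and that SKILL.md opens with `name` and `description` frontmatter. The `scaffold_skill` helper and the section headings it writes are hypothetical starting points, not a required format.

```python
from pathlib import Path
import tempfile


def scaffold_skill(root, skill_name, description):
    """Create <root>/.claude/skills/<skill_name>/ with SKILL.md and empty subfolders."""
    skill_dir = Path(root) / ".claude" / "skills" / skill_name
    (skill_dir / "references").mkdir(parents=True, exist_ok=True)
    (skill_dir / "assets").mkdir(exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(
        f"---\nname: {skill_name}\ndescription: {description}\n---\n\n"
        "## Steps\n\n## Done criteria\n",
        encoding="utf-8",
    )
    return skill_file


# Demonstrated in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_skill(tmp, "blog-post-author", "Drafts blog posts from a brief.")
    relative = created.relative_to(tmp).as_posix()
    print(relative)
```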

The key property: skills are reusable. Write one once; every agent invocation that hits that skill class gets the same behavior. That is the beginning of consistency. It is also the beginning of reviewability: when the output is wrong, you fix the skill file and every future invocation benefits. You are not fixing a conversation. You are fixing a system.

Templates

Templates are structured starting points for recurring patterns. Where skills define behavior, templates define structure. A PR description template defines the sections a PR description always has. A post brief template defines the sections a post brief always has. A meeting-notes template defines the sections a meeting note always has.

Templates live alongside skills or in their own directory. They are most powerful when wired to a shaper (see below) that fills them in based on intake. A blank template is just a document. A filled template produced by an agent is a repeatable output.

The value of templates compounds. The first time you use a template, you save five minutes. By the tenth time, you have a body of consistently structured outputs that agents can read, compare, and build on top of. The inconsistency that comes from free-form generation accumulates debt. Templates prevent that debt from accruing.

Shapers

A shaper is an intake agent. Its job is to take a vague request and return a scoped brief. The scope includes: what the output is, who it's for, what the single takeaway or deliverable is, what assets are needed, and what quality criteria apply.

The reason shapers exist is that most agent failures start at intake. A vague request produces a vague output. The agent makes assumptions you didn't intend. You spend five minutes correcting a twenty-minute draft. Shapers front-load that conversation into a structured moment so the author (or builder, or engineer) knows exactly what to produce.

Shapers interact with you. They ask a focused set of questions and stop. The brief they produce is the contract everything downstream runs against.
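One way to picture the brief is as a small record with the five scope fields listed above, plus a check for what the shaper still has to ask. This is an illustrative sketch. The `Brief` class and its field names are assumptions, not the format the shapers in this repo actually emit.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Brief:
    output: str = ""
    audience: str = ""
    takeaway: str = ""
    assets: list = field(default_factory=list)
    quality_criteria: list = field(default_factory=list)


def missing_fields(brief):
    """Names of brief fields that are still empty, in declaration order."""
    return [f.name for f in fields(brief) if not getattr(brief, f.name)]


draft = Brief(
    output="Blog post on git worktrees",
    audience="Engineers running parallel agents",
)
print(missing_fields(draft))
```

The empty fields are the questions left to ask. When the list is empty, the shaper stops and hands over the brief.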

Specialist Agents

A specialist agent is a named, purpose-built subagent with a declared scope and declared tools. code-reviewer only reviews code; it doesn't write it. security-reviewer only flags security issues; it doesn't fix them. content-ops runs an expert-panel scoring pass; it doesn't redraft the content.

The principle is separation of concerns at the agent level. Generalist agents are fine for exploration. Production-grade agent systems use specialists because narrow scope means fewer errors, cleaner output, and reviewable decisions.

Specialist agents live under .claude/agents/. Each one has a system prompt, a tool allowlist, and a declared output contract. You call them from orchestrator agents; they do the work and report back.
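A sketch of how a tool allowlist can be checked mechanically, assuming agent files are markdown with a frontmatter block that carries a comma-separated `tools:` line. The example agent body and the set of tools treated as mutating are assumptions for this illustration.

```python
WRITE_TOOLS = {"Write", "Edit", "Bash"}  # assumed set of mutating tools for this sketch


def parse_tools(agent_markdown):
    """Extract the comma-separated `tools:` allowlist from the frontmatter block."""
    lines = agent_markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; stop before the system prompt
        key, _, value = line.partition(":")
        if key.strip() == "tools":
            return [tool.strip() for tool in value.split(",") if tool.strip()]
    return []


def is_read_only(agent_markdown):
    """True when an allowlist is declared and contains no mutating tools."""
    tools = parse_tools(agent_markdown)
    return bool(tools) and not (set(tools) & WRITE_TOOLS)


reviewer = """---
name: code-reviewer
description: Reviews diffs and reports findings. Does not modify files.
tools: Read, Grep, Glob
---
You review code. You never edit it.
"""
print(parse_tools(reviewer), is_read_only(reviewer))
```

A check like this could run over every reviewer definition in the agents directory, so a reviewer that quietly gains write access shows up as a failed check.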

Memory

Memory is persistent context that survives session boundaries. By default, an AI conversation forgets everything when the session ends. Memory is the fix: a directory of short markdown files, each capturing one non-obvious fact that would otherwise have to be rediscovered.

Memory files live under .claude/memory/. An index file (.claude/memory/MEMORY.md) lists every entry with a one-line hook. At the start of each session, the agent reads the index, scans for relevant entries, and starts with context that would otherwise take fifteen minutes of re-explanation to reconstruct.

What belongs in memory: decisions, preferences, rules that were learned through correction, in-flight initiatives, people and their roles. What does not belong: things derivable from the repo itself. Memory is for facts the code doesn't contain.

The discipline is the index. If the index grows past 200 lines, it gets truncated in context and stops being useful. Every new entry should earn its place by answering the question: is this something a future session would otherwise have to painfully relearn? If yes, write it. If the code already shows it, skip it.
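The 200-line discipline is easy to enforce with a tiny check. The sketch below counts the lines of index text and reports how far over the limit it is. The `index_report` helper and the sample entries are hypothetical.

```python
MAX_INDEX_LINES = 200  # the limit given above for the memory index


def index_report(index_text, limit=MAX_INDEX_LINES):
    """Count index lines and report how many fall past the limit."""
    line_count = len(index_text.splitlines())
    return {"lines": line_count, "over_by": max(0, line_count - limit)}


# Hypothetical index with 205 one-line entries.
sample = "\n".join(f"- entry-{i}.md: one-line hook" for i in range(205))
report = index_report(sample)
print(report)
```

A non-zero `over_by` is the prompt to prune entries the code already shows before adding new ones.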

Hooks

Hooks are automated behaviors wired to session events. They live in settings.json under the hooks key. You can fire a hook on PreToolUse (before an agent takes an action), PostToolUse (after), and Stop (when an agent session ends).

The canonical use case: a PostToolUse hook on Write that auto-runs your linter. A PreToolUse hook on Bash that logs the command for audit. A Stop hook that posts a summary to your team Slack channel.

Hooks are where the AgenticOS connects to your existing toolchain. They are lightweight event handlers. They do not need to be complex to be valuable. A five-line hook that validates every file write pays for itself the first time it catches a malformed JSON write before it reaches CI.
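Here is a sketch of the JSON-validating hook described above, written as a script a PostToolUse command hook could call. It makes two assumptions that you should check against the official hooks reference. The first is that the event arrives as JSON on stdin with the written path under `tool_input.file_path`. The second is that exiting with status 2 reports the error back to the agent. The function name `check_json_write` is hypothetical.

```python
import json
import sys


def check_json_write(event):
    """Return an error message if the written file is malformed JSON, else None."""
    path = event.get("tool_input", {}).get("file_path", "")
    if not path.endswith(".json"):
        return None  # only JSON files are validated in this sketch
    try:
        with open(path, encoding="utf-8") as handle:
            json.load(handle)
    except (OSError, json.JSONDecodeError) as exc:
        return f"{path}: {exc}"
    return None


def main():
    # Assumed contract: the hook event is piped in as JSON on stdin.
    raw = "" if sys.stdin is None or sys.stdin.isatty() else sys.stdin.read()
    message = check_json_write(json.loads(raw)) if raw.strip() else None
    if message:
        print(message, file=sys.stderr)
        sys.exit(2)  # assumed: a non-zero status surfaces the error to the agent


if __name__ == "__main__":
    main()
```

The validation lives in a plain function, so it can be tested without a live session. Only `main` depends on the hook contract.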

CLAUDE.md

CLAUDE.md is the constitution. It is the highest-priority instruction file in the system. Claude Code reads it at session start and treats it as ground truth. Every other instruction source (skill files, agent prompts, in-conversation instructions) operates within the bounds CLAUDE.md sets.

CLAUDE.md can live at two levels:

~/.claude/CLAUDE.md: global rules that apply across every project on the machine
/CLAUDE.md: project-specific rules that override or extend the global rules

The global file sets universal norms: memory path, subagent dispatch patterns, anti-patterns, communication style. The project file sets repo-specific norms: which shapers apply to which work types, how to run tests, what the branching strategy is, who owns what.

The common mistake is starting with CLAUDE.md. Don't. Start with skills.

Build Order

Build skills first, not CLAUDE.md.

The reason is that skills are atomic. Each skill file does one thing. It has no dependencies on other skills, on memory, or on hooks. You can write a single skill, invoke it, and immediately see whether it works. Skills give you fast feedback with zero risk of circular dependency.

CLAUDE.md, by contrast, references everything else. If you write your CLAUDE.md before you have skills, you are writing rules that reference capabilities that don't exist yet. The result is a constitution full of dead letters.

The practical build order:

Start with the skill for the work type you do most. One file. Invoke it. Fix it.
Add a second skill for the next most common work type. Invoke it.
Write a shaper for each skill that needs structured intake.
Add memory once you have enough repeated sessions to know what you keep re-explaining.
Wire hooks once you have enough agent sessions to know which side effects you want to automate.
Write CLAUDE.md once the rest of the system exists and you know what rules actually govern it.

The other build-order mistake is building everything in isolation and then wiring it together. Skills that have never been invoked inside a real session are theory. Invoke early and often. The feedback loop between "I wrote a skill file" and "this skill file produces the output I actually want" is where the real design work happens. You will rewrite your first skill file at least twice. That is not failure. That is calibration.

Value and Effort by Layer

Each layer is listed in build order. Effort is relative to your first skill taking roughly two hours.

Skills: highest ongoing leverage. Effort: two to four hours per skill. Payback: immediate, every invocation.
Templates: medium leverage. Effort: thirty to sixty minutes per template. Payback: fast if the work type recurs daily.
Shapers: high leverage for team use. Effort: two to three hours per shaper. Payback: strongest when multiple people are using the same skill.
Specialist agents: high leverage for fan-out workloads. Effort: one to two hours per agent. Payback: strongest when you are parallelizing review or generation across multiple subagents.
Memory: medium leverage, compounds over time. Effort: ten minutes per entry, ongoing. Payback: slow start, then becomes the most time-saving layer as the knowledge base grows.
Hooks: low effort for high reliability. Effort: thirty minutes per hook. Payback: immediate if you have an existing CI/lint step to connect.
CLAUDE.md: one-time setup, high trust anchor. Effort: two to four hours for the first version. Payback: removes a category of repeated re-explanation permanently.

The Series: Where Each Layer Goes Deeper (coming soon)

This post is the map. The series builds each layer out in full.

Skills and Templates: The Atomic Unit of Agent Behavior →
Shapers: How to Structure Intake So Agents Don't Assume →
Specialist Agents: Separation of Concerns at the Agent Level →
Memory: The Persistent Context Layer →
Hooks: Wiring Agents to Your Toolchain →
CLAUDE.md: Writing the Constitution Last →

Each post is paid. The map is free because the map is useless without the build playbook, and the build playbook is what you subscribe for.

That's the map. The build playbook starts with skills. Subscribe to get each layer as I write it.

Subscribe to AgenticOS on Substack

You Don't Get to Pick CA

Glenn Eggleton — Fri, 29 May 2026 23:59:39 GMT

A team I worked with had a slow dashboard. The account balance endpoint hit Postgres on every load, the page felt sluggish under traffic, and the fix was obvious to everyone: add a read replica. Point the dashboard reads at the replica, keep writes on the primary, ship it. The graph got faster. Nobody filed it as a correctness change because it wasn't one. It was a performance change.

Three weeks later, support got a ticket. A user had spent their balance, the deduction committed on the primary, and for a second and a half the dashboard (reading off the replica, which hadn't caught up) still showed the old, higher number. The user saw money they no longer had. During that window someone could have made a second decision against a balance that was already gone.

Nobody decided to weaken consistency. There was no design doc that said "we accept showing users stale balances." The team added a replica to make a page faster, and in doing so they quietly traded away a guarantee they didn't know they were holding. The trade was real. The decision was never made.

Here's the position I want to argue: you do not get to pick "CA." The C-versus-A choice everyone quotes from CAP isn't a menu you order from once at architecture time. It's a behavior your system exhibits during a partition, and a different tradeoff, latency versus consistency, that your system makes on every single read when there's no partition at all. You are choosing both of these constantly. The only question is whether you're choosing them on purpose.

The "P" was never optional

CAP gets quoted as "pick two of three: Consistency, Availability, Partition tolerance." That framing is where the damage starts, because it puts P on the same shelf as the other two, as if it were a property you might decline.

You can't decline it. P is partition tolerance: the system continuing to function when the network between nodes drops, delays, or reorders messages. And the network will do that. A switch reboots, a NIC flaps, an availability zone goes dark, a Kubernetes node gets cordoned mid-deploy, a GC pause makes a node look dead for 800ms. The moment your data lives on more than one machine (a replica, a second region, a cache on a different box) partitions are a thing that happens to you. Refusing to tolerate them doesn't mean they stop. It means your system corrupts or hangs when one occurs.

So the honest reading of CAP is not "pick two." It's: partitions happen, so when one does, you get to pick exactly one of C or A, and you've already picked, whether you know it or not.

Choose C (a CP system): during a partition, refuse to serve requests you can't serve correctly. The node that can't confirm it has the latest data returns an error or blocks rather than hand back a possibly-stale answer. You stay correct; you give up availability for the duration.
Choose A (an AP system): during a partition, keep answering with whatever data you have locally, even if it might be stale or might later conflict. You stay up; you give up consistency for the duration.

There is no third door where the partition politely waits for you. The replica setup above is an AP choice that nobody recognized as a choice. When the replica lagged (a tiny partition in time, if not in topology), the system happily served the stale balance instead of refusing. That was availability winning over consistency. It just won by default, in a config change labeled "performance."
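Here is that fork as a minimal TypeScript sketch. Everything named in it (`ReadPolicy`, `readBalance`, `maxLagMs`, the idea that the replica reports its own lag) is a hypothetical stand-in for illustration, not something from the team's actual system.

```typescript
// Hypothetical sketch: names and thresholds are illustrative only.
type ReadPolicy = "CP" | "AP";

interface ReplicaState {
  balanceCents: number; // the value as the replica currently sees it
  lagMs: number;        // how far behind the primary the replica is
}

type ReadResult =
  | { ok: true; balanceCents: number; possiblyStale: boolean }
  | { ok: false; status: 503; reason: string };

// Decide what a read does when the replica cannot prove it is caught up.
function readBalance(policy: ReadPolicy, replica: ReplicaState, maxLagMs: number): ReadResult {
  const withinBound = replica.lagMs <= maxLagMs;
  if (withinBound) {
    return { ok: true, balanceCents: replica.balanceCents, possiblyStale: replica.lagMs > 0 };
  }
  if (policy === "CP") {
    // Consistency wins: refuse rather than show a number that may be wrong.
    return { ok: false, status: 503, reason: "replica behind primary" };
  }
  // Availability wins: answer with what we have, and say so.
  return { ok: true, balanceCents: replica.balanceCents, possiblyStale: true };
}

// The replica from the story: a second and a half behind, still showing the old balance.
const lagging: ReplicaState = { balanceCents: 5000, lagMs: 1500 };
const cpRead = readBalance("CP", lagging, 0);
const apRead = readBalance("AP", lagging, 0);
```

Same replica, same lag, two different answers. The only difference is that somebody wrote the policy down.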

PACELC: the half you decide every day

Here's the part CAP leaves out, and it's the part that actually runs your life. CAP only describes what happens during a partition. Partitions are rare. What about the other 99.9% of the time, when the network is fine?

That's PACELC. Read it as: if Partition, then Availability-or-Consistency; Else, Latency-or-Consistency. The first half is just CAP. The second half, the "E," for else, is the one you decide on every read, every day, usually without noticing.

When there's no partition and everything is healthy, you are still trading consistency against latency. Every time you add a cache, you've decided that serving a possibly-stale value fast is better than fetching the authoritative value slow. Every time you read from a replica, you've decided that the replica's slightly-behind view is acceptable in exchange for taking load off the primary. Those aren't partition-time decisions. They're the normal, healthy-network state of your system.

The dashboard team thought they were operating in the "E" branch: no partition, just trading a little consistency for a faster page. And most of the time they were, and it was fine. But they never specified the "P" branch. They never said what the system should do when the replica fell behind. So the system did the default AP thing, serving stale, at exactly the moment correctness mattered. The everyday latency tradeoff and the rare partition tradeoff are the same architectural seam, and they'd only reasoned about one side of it.

A cache and a read replica are not performance features. They are consistency tradeoffs that happen to make things faster. The speed is the part you notice. The traded-away guarantee is the part that pages you at 2am.
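To make the healthy-network trade visible, here is a small TypeScript sketch of a read-through cache with a TTL. The class name, the key, and the injected clock are all assumptions made for the example. The clock is injected only so the staleness window can be stepped through by hand.

```typescript
// Hypothetical read-through cache with an injected clock so staleness is observable.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value while it is fresh; otherwise ask the source of truth.
  get(key: string, loadFromSource: () => V): V {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value; // fast, but only as current as the last load
    }
    const value = loadFromSource();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

let clock = 0;
let balanceCents = 5000;
const cache = new TtlCache<number>(1000, () => clock);

const first = cache.get("balance:42", () => balanceCents); // loads 5000 from the source
balanceCents = 100;                                        // the user spends most of it
clock = 500;
const stale = cache.get("balance:42", () => balanceCents); // still 5000: inside the TTL
clock = 1500;
const fresh = cache.get("balance:42", () => balanceCents); // TTL passed: reloads 100
```

No partition anywhere in that run. The TTL is the consistency budget, and picking 1000 is picking how long a wrong number is allowed to live.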

The agent will make this trade for you, silently

Now drop an AI agent into this. Ask it to "make this endpoint scale" or "this query is slow, speed it up," and watch what it does.

It will, cheerfully and competently, add a cache. Or suggest reading from a replica. Or memoize the result. The code will be clean. It'll wire up Redis with a sensible TTL, or flip the read to a replica connection, and it'll compile and pass your tests, because your tests assert on a single-node happy path where the cache and the primary always agree.

What it will not do is tell you it just changed your correctness model. There's no line in the diff that says "heads up: this endpoint is now AP. Under replica lag or a cache that's mid-invalidation, it will serve stale data, and for a balance check that's a correctness bug." The agent optimized the metric you named (latency) and silently spent a budget you didn't name (consistency). This is exactly the silent logic drift failure mode: it did the task you asked and quietly changed a guarantee you didn't, in a diff that reads like a reasonable optimization. The agent isn't wrong that a cache makes it faster. It's wrong by omission about what that costs.

"But strong consistency everywhere is the safe default, right?"

This is the natural objection, and it deserves a real answer: if AP is so dangerous, just be CP everywhere. Strong consistency, no caches, no replicas for reads, always correct. Safe, right?

No, and not because it's slow, though it is. It's because "CP everywhere" is also a choice with a cost you have to actually want. A strictly consistent system gives up availability during partitions by design: when the network splits, the minority side stops serving rather than risk divergence. If you genuinely make everything CP, you're signing up for the endpoint to return errors during every network blip, every failover, every node that's briefly unreachable. For a payment authorization, that's correct and worth it. You'd rather decline than double-spend. For a "number of likes" counter or a recommendations widget, refusing to serve during a blip is absurd; stale-but-up is obviously right.

The point isn't that CP beats AP or the reverse. It's that the right answer is per-feature, and it has to be chosen. A balance check wants CP. A view counter wants AP. The dashboard's bug wasn't that they picked AP. It's that a balance, which wanted CP, got AP by accident because nobody made the call. "Strong everywhere" isn't safety. It's a different unmade decision with a different bill.

Make the choice before the agent makes it for you

So the inversion is this: every read path in your system has already chosen C-or-A for partitions and L-or-C for the healthy case. The cache you added, the replica you read from, the strict primary you kept — each is a stance. You're not deciding whether to make these tradeoffs. You're only deciding whether they're written down or discovered in an incident channel.

Before you let an agent "make it scale," make the call yourself. Here's a starter prompt that forces the agent to surface the tradeoff instead of burying it in a TTL:

For the feature below, classify it as CP or AP and justify the choice in one sentence.

Then answer, concretely:
1. PARTITION BEHAVIOR: When the network partitions (e.g. a read
   replica lags, the cache and source disagree, or a node is
   unreachable), exactly what should this feature do — serve
   possibly-stale data, return an error, block, or something else?
   State the actual behavior, not the principle.
2. PACELC / ELSE BEHAVIOR: When there is NO partition and everything
   is healthy, what latency-vs-consistency tradeoff does this feature
   make on a normal read (e.g. read from a cache/replica for speed, or
   always hit the authoritative source)? Name it and justify it.

Do not propose an implementation yet. First commit to the behavior.

Feature:

What to verify: check that it actually committed to a behavior during a partition, a concrete "serve stale" or "return 503" or "block until caught up," not just a tidy definition of CP and AP. If it defined the terms and dodged the behavior, it dodged the only part that matters. Make it answer "what does this do when the replica lags," in those words.

That's the theory and a way to force the decision into the open. The production patterns that actually implement each side of these choices (idempotency keys so a retried write is safe, retries with backoff and jitter, sagas for the transactions that cross a service boundary, distributed locks with fencing tokens, caching with stampede protection) are the paid series, Tuesdays and Thursdays. Each one comes with the prompt to generate it and the checklist to verify what the agent hands you against the failure modes that pattern hides. This post is F1: it's the lens the whole series looks through, because every one of those patterns is a deliberate answer to a C-or-A, L-or-C question you'd otherwise answer by accident.

Subscribe free for the Friday theory. Upgrade when you want the implementations and the verification checklists.

And tell me in the comments: what's a "performance optimization" in your system that quietly changed what your users are allowed to see — and who found out first, you or them?

Your Handler Will Run Twice

Glenn Eggleton — Fri, 29 May 2026 23:59:16 GMT

3:10am. PagerDuty. A customer is charged $49 twice for one subscription, eleven seconds apart. You pull the logs and find two near-identical webhook deliveries from the payment processor for the same charge. Two 200s went back. Two rows in the ledger.

Here's the part that ruins your morning: nothing was broken. The first call to your handler worked. The card was charged, the row was written, and then the 200 you sent back got lost on the way home. The network ate the ack. From the processor's side, your endpoint never responded, so it did exactly what a correct system does: it retried. The second delivery hit a handler that had no idea the first one had ever happened, and charged the card again.

The retry wasn't the bug. The retry was correct. Your handler just wasn't safe to run twice, and on a long enough timeline, every handler runs twice.

Here's the position I want to argue: at-least-once delivery is not an edge case you can defer. It's the contract every queue, webhook, and retrying client already operates under, whether you opted in or not. Duplicates aren't an exception your system might encounter. They're a guarantee it ships with. Idempotency is the only thing that makes "ran twice" equal "ran once." If you haven't built it, you don't have a hypothetical risk. You have a double-charge with a date on it.

"Exactly-once" is a sentence you can say, not a thing you can build

Everyone wants exactly-once delivery: the message arrives, your handler runs, once, done. It's a lovely idea and it does not survive contact with a network.

Walk the failure. A sender delivers a message and waits for an ack. The ack can get lost. Now the sender is stuck with a question it physically cannot answer: did the receiver process the message and the ack vanish, or did the message itself never arrive? From where the sender sits, those two worlds look identical. It has exactly two moves. Give up, and risk dropping a message that actually succeeded (that's at-most-once, and it loses data). Or retry, and risk running a message that already ran (that's at-least-once, and it makes duplicates).

There is no third door. You cannot build exactly-once on top of an unreliable network, because the sender can never tell "it worked" apart from "the receipt got lost." Every serious system picks at-least-once, because losing a payment is worse than processing one twice, if the receiver is built to absorb the duplicate. That "if" is the whole job. The system delivers at-least-once and hands you the duplicate. What you do with it is idempotency.
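The cold open can be replayed in a few lines of TypeScript. This is a toy simulation, not a real transport: `deliver`, `sendAtLeastOnce`, and the list of ack outcomes are hypothetical names, and the "network" is just a boolean saying whether each ack survives.

```typescript
// Hypothetical simulation: one message, a network that drops the first ack.
type Ack = "ack" | "lost";

let chargesMade = 0;

// A non-idempotent receiver: every delivery does the work again.
function receiver(_message: string): void {
  chargesMade += 1;
}

// Deliver once over a network that may lose the ack on the way back.
function deliver(message: string, ackSurvives: boolean): Ack {
  receiver(message); // the work happens whether or not the ack makes it home
  return ackSurvives ? "ack" : "lost";
}

// At-least-once sender: resend until an ack arrives. It cannot tell
// "never processed" apart from "processed, ack lost", so it resends in both cases.
function sendAtLeastOnce(message: string, ackOutcomes: boolean[]): number {
  let attempts = 0;
  for (const ackSurvives of ackOutcomes) {
    attempts += 1;
    if (deliver(message, ackSurvives) === "ack") break;
  }
  return attempts;
}

// First ack is lost, second one arrives.
const attempts = sendAtLeastOnce("charge $49", [false, true]);
```

After that run the sender has made two attempts and the receiver has charged twice. Every component behaved correctly.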

The agent writes the version that double-charges

Ask an agent for a "create payment" or "process order" handler and you will get clean, idiomatic, straight-line TypeScript. Validate the body. Charge the card. Write the row. Return 201. It reads beautifully. It compiles. The happy-path test, one request in, one charge out, goes green.

And it double-charges in production, because nothing in that handler asks the only question that matters: have I seen this request before? The agent wrote a handler that's correct for exactly-once delivery, in a world that only offers at-least-once. The gap never shows up in review, because the diff is genuinely good code. It shows up at 3am, because no test in the suite sends the same request twice. The duplicate is the case nobody wrote, so it's the case nobody caught.

This is the AI-native trap in miniature: the model optimizes for the request you typed, and you typed "create payment," not "create payment safely under retry." The retry is invisible in the prompt, so it's invisible in the output.

Idempotent is not the same as "retried"

This is the distinction that trips people, so let me draw the line sharply.

A retried handler is one a client calls again after a failure. That's a property of the caller. Anything can be retried. Retrying a non-idempotent handler is exactly how you double-charge.
An idempotent handler is one where running it N times lands the system in the same state as running it once. That's a property of the handler. Idempotency is what makes a retry safe instead of dangerous.

Retries are inevitable. Idempotency is the thing you build so the inevitable retry doesn't cost you money.

Two ways to get there. The cheap one is natural idempotency: design the operation so re-running it is a no-op by construction. This is the real reason HTTP draws a line between PUT and POST. PUT /orders/123/status = shipped is naturally idempotent: send it five times, the status is shipped, full stop. The fifth call overwrites the same field with the same value and nothing new happens. POST /charges is naturally not idempotent: each call means "make a new charge," so five calls mean five charges. When you can model an operation as "set this resource to this state" instead of "perform this action," you get idempotency for free and you should take it.
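The two shapes side by side, as an in-memory TypeScript sketch. The function names and the maps standing in for a database are assumptions for illustration.

```typescript
// Hypothetical in-memory model of the two operation shapes.
const orderStatus = new Map<string, string>();
const charges: { amountCents: number }[] = [];

// PUT-shaped: "set this resource to this state". Re-running overwrites the same value.
function putOrderStatus(orderId: string, status: string): void {
  orderStatus.set(orderId, status);
}

// POST-shaped: "perform this action". Every call creates a new charge.
function postCharge(amountCents: number): void {
  charges.push({ amountCents });
}

// Send each request five times, as a retrying client might.
for (let i = 0; i < 5; i++) {
  putOrderStatus("123", "shipped");
  postCharge(4900);
}
// orderStatus now holds one entry ("123" -> "shipped"); charges holds five.
```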

But you can't always restructure the operation. Charging a card is inherently a POST-shaped "do a thing" action. For those, you need the second tool.

Idempotency keys, and the window you're choosing whether you admit it or not

An idempotency key is the caller's claim that "this request and that earlier request are the same logical operation — don't do it twice." The processor sends a stable key with each delivery (and its retries reuse the same key). Your handler's job: the first time it sees a key, do the work and remember the outcome; every later time it sees that key, skip the work and replay the remembered outcome.

So the duplicate webhook from the cold open arrives carrying the same key as the original. Your handler looks it up, sees "already charged, here's the 200 I sent last time," and replays that exact response. The processor gets its 200. The card is charged once. The retry was still correct. Your handler just made it harmless.

Which forces a question most handlers never answer out loud: how long do you remember a key? That's your dedup window, and you are choosing one whether you decide it deliberately or not. Remember keys for an hour and a retry that lands 90 minutes later (entirely possible when a processor backs off through a long outage) sails past your memory and charges again. Remember them forever and your dedup store grows without bound. The window is a real engineering decision with a real failure mode on each side, and "I didn't think about it" defaults you to whatever your store's eviction policy happens to be.
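A rough TypeScript sketch of the remember-and-replay loop, with the window made explicit. It is sequential only and in-memory, and `IdempotencyStore`, the one-hour window, and the `evt_1` key are all hypothetical choices for the example. The clock is injected so the 90-minute retry can be stepped through.

```typescript
// Hypothetical sketch: an in-memory dedup store with a retention window.
interface StoredResponse { status: number; body: string }

class IdempotencyStore {
  private seen = new Map<string, { response: StoredResponse; expiresAt: number }>();
  private windowMs: number;
  private now: () => number;

  constructor(windowMs: number, now: () => number) {
    this.windowMs = windowMs;
    this.now = now;
  }

  // Run `work` only if the key is unknown or has aged out of the window.
  handle(key: string, work: () => StoredResponse): StoredResponse {
    const entry = this.seen.get(key);
    if (entry !== undefined && entry.expiresAt > this.now()) {
      return entry.response; // duplicate inside the window: replay, do not redo
    }
    const response = work();
    this.seen.set(key, { response, expiresAt: this.now() + this.windowMs });
    return response;
  }
}

let nowMs = 0;
let chargeCount = 0;
const charge = (): StoredResponse => {
  chargeCount += 1;
  return { status: 200, body: "charged" };
};

const ONE_HOUR_MS = 60 * 60 * 1000;
const store = new IdempotencyStore(ONE_HOUR_MS, () => nowMs);

store.handle("evt_1", charge);           // first delivery: does the work
nowMs = 11 * 1000;
store.handle("evt_1", charge);           // retry 11 seconds later: replayed
const chargesAfterQuickRetry = chargeCount;
nowMs = 90 * 60 * 1000;
store.handle("evt_1", charge);           // retry 90 minutes later: key expired, charges again
const chargesAfterLateRetry = chargeCount;
```

The quick retry is absorbed and the late one is not. Nothing in the code is wrong. The window was simply shorter than the retry.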

The objection: "the processor already handles this, why is it my problem?"

Fair. Stripe and the rest do offer idempotency keys on their side. So why duplicate the machinery in your handler?

Because their key protects their operation, not yours. When a webhook fires, you are the receiver, and the at-least-once contract runs to your door. Your handler is the thing that writes to your ledger, enqueues your fulfillment, sends your confirmation email. The upstream key stops them from creating two charges. It does nothing to stop your handler, invoked twice by two webhook deliveries, from writing two ledger rows and sending two emails. The boundary you have to make safe is the one you own. Every hop in the system is its own at-least-once boundary, and each one needs its own answer.

A starter prompt — and the one thing the agent will get wrong

Here's a prompt that gets you a real first draft instead of the straight-line double-charger:

I have a POST handler in TypeScript/Node (Express) that charges a card and
writes a ledger row. Make it idempotent using an idempotency key sent in the
`Idempotency-Key` header.

Requirements:
- On the FIRST request for a key: persist the key in an "in-progress" state
  BEFORE doing the work, do the charge + ledger write, then transition the
  key to "completed" and store the response body + status code.
- On a DUPLICATE request for a completed key: do NOT redo the work — replay
  the stored response body and status code.
- State the dedup window you chose (how long keys are retained) and why.
- Then explain what happens when TWO requests with the SAME key arrive
  CONCURRENTLY (not sequentially), and make that case safe.

What to verify: the concurrent-duplicate race, not just the sequential retry. Two deliveries with the same key arriving at the same time will both check the store, both see "no key yet," and both proceed to charge, unless the first write of the key is an atomic claim (a unique constraint or an atomic SET NX) that one request wins and the other is forced to wait for or replay. Most agent first drafts handle the retry that arrives a second later and quietly assume requests are sequential. The duplicate that arrives in the same millisecond is the one that double-charges through your shiny new idempotency layer. If the answer doesn't name that race, it isn't done.
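For a sense of what "atomic claim" means, here is a single-process TypeScript sketch. It is not the production answer: it leans on Node running one piece of JavaScript at a time, and `handleWithKey`, `chargeCard`, and the in-memory `claims` map are hypothetical. Across multiple processes the claim has to live in the shared store, which is where the unique constraint or SET NX comes in.

```typescript
// Hypothetical sketch: the first caller claims the key synchronously by storing
// the in-flight promise; duplicates find it and await the same outcome.
type Outcome = { status: number; body: string };

const claims = new Map<string, Promise<Outcome>>();
let cardCharges = 0;

// Stand-in for the slow, non-idempotent work.
async function chargeCard(): Promise<Outcome> {
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  cardCharges += 1;
  return { status: 200, body: "charged" };
}

function handleWithKey(key: string): Promise<Outcome> {
  const existing = claims.get(key);
  if (existing !== undefined) {
    return existing; // lost the claim: wait for, then replay, the winner's result
  }
  // There is no await between the check above and the set below, so within one
  // Node process another request cannot slip in between them.
  const inFlight = chargeCard();
  claims.set(key, inFlight);
  return inFlight;
}
```

Calling `handleWithKey("k1")` twice in the same tick hands both callers the same promise, so the card is charged once. Put an `await` between the lookup and the `set` and the race is back. The sketch also leaves out failure handling (a rejected claim stays in the map) and expiry, both of which a real version needs.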

That's the theory and a starter prompt: enough to know what right looks like and to get a first draft that won't double-charge on a sequential retry. The production version is the next thing I'm publishing in the paid series (Tuesdays and Thursdays): "Idempotency keys in production." It's the full middleware with request fingerprinting, the in-progress vs. completed state machine, response replay, the TTL that sets your dedup window, both a Postgres and a Redis variant, and the concurrent-duplicate tests that prove the race is actually closed. Plus the prompt to generate it and the checklist to verify what the agent hands you against every failure mode this post named.

Subscribe free for the Friday theory. Upgrade when you want the implementation that survives the retry.

And tell me in the comments: what's the most expensive duplicate your system ever processed, and was it the retry's fault, or the handler's?

Retries Don't Make You Fault-Tolerant

Glenn Eggleton — Fri, 29 May 2026 23:59:00 GMT

3:10pm. The payments service starts answering slowly. Not failing, just slow. p99 drifts from 120ms to 1.8s. Nothing pages, because nothing is down. A garbage-collection pause on their side, maybe a noisy neighbor. The kind of blip that should heal itself in ninety seconds.

It doesn't heal. By 3:14pm the checkout service is throwing 503s, and so is the cart service behind it, and the storefront behind that. A GC pause two hops away has become a customer-facing outage. The incident channel fills with people asking why a payments slowdown took down the homepage.

Here's why. Every caller of payments was wrapped in the same reasonable-looking retry: try the call, if it fails wait 200ms, try again, up to three times. When payments got slow, every in-flight request waited, eventually failed, and all retried at once: three times each, on a fixed 200ms beat, perfectly synchronized. The wounded service, already struggling, got hit with triple its normal traffic in tight rhythmic waves. That finished it off. Then the callers' connection pools filled with requests stuck retrying, so the callers stopped answering their callers, and the failure walked outward one hop at a time.

The retries didn't make the system fault-tolerant. The retries were the fault.

The position: retries are an amplifier, and timeouts are the first-class concern

Here's what I want to argue. A retry is not a safety mechanism. A retry is a load multiplier with a delay built in. Pointed at a healthy dependency that had a one-off transient failure, it's exactly what you want. Pointed at a dependency that is slow or struggling (which is precisely when retries fire most), it multiplies load onto the system least able to absorb it. Retries are a positive feedback loop, and positive feedback loops are how small perturbations become outages.

So the first move when you're hardening a call across the network is not "add retries." It's "bound the wait." A timeout is the only thing in your toolkit that strictly reduces load under stress instead of adding to it. Everything else (backoff, breakers, bulkheads) exists to make retries safe enough to be worth having. The timeout is the floor you build on.
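A minimal sketch of "bound the wait" in TypeScript. `withTimeout` and `TimeoutError` are hypothetical helpers written for this example, not part of any library.

```typescript
// Hypothetical helper: reject if `work` has not settled within `ms`.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`timed out after ${ms}ms`);
    this.name = "TimeoutError";
  }
}

function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
    work.then(
      (value) => {
        clearTimeout(timer); // settled in time: cancel the pending rejection
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}
```

One caveat worth stating plainly: this bounds how long the caller waits, but it does not cancel the underlying work. If the client accepts an `AbortSignal` (as `fetch` does), passing one is the way to release the connection as well as the caller.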

Most of the time, when someone asks an agent to "make this call resilient," they get the amplifier and skip the floor.

The toolkit, in order

Fault tolerance isn't one trick. It's a layered discipline, and the order matters because each layer stops a failure the layer below can't:

Timeout: bound every wait. The failure it stops: an unbounded wait. Without a timeout, one slow dependency holds your request open forever, and held-open requests are how connection pools fill and the event loop's pending work piles up until you fall over. No timeout, no fault tolerance. Full stop.
Exponential backoff with jitter: space the retries out, and desynchronize them. The failure it stops: the retry storm from the cold open. Exponential backoff (200ms, 400ms, 800ms) keeps you from hammering. Jitter, randomizing each delay instead of everyone waiting exactly 200ms, is the part that breaks the synchronized thundering herd. Backoff without jitter still produces tidy, lethal waves.
Circuit breaker: stop calling a dependency that's clearly down. The failure it stops: hammering a corpse. After enough failures, the breaker opens and fails fast for a cooldown, then half-opens to test the water with a trickle, then closes when health returns. This is what gives the wounded service room to recover instead of being kept under by your retries.
Bulkhead: isolate each dependency's resources. The failure it stops: one slow dependency starving every other call. If all your outbound calls share one connection pool or one concurrency budget, a single slow dependency consumes it and unrelated calls start failing too. Bulkheads cap how much of your capacity any one dependency can hold, like watertight compartments in a ship's hull.
Backpressure / load-shedding: refuse work you can't do. The failure it stops: overload collapse. When you're past capacity, accepting more requests just makes everything slower for everyone and serves nobody. Shedding load, saying "no" fast with a 429, keeps the requests you do accept healthy.
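The backoff-with-jitter layer is small enough to sketch here. This TypeScript version uses the "full jitter" variant (a random delay between zero and the exponential cap), which is one common choice and an assumption on my part, as are the function names and the 5000ms ceiling. The random source is injectable so the behavior can be checked.

```typescript
// Hypothetical helpers: exponential cap plus "full jitter".
// attempt is 0 for the first retry, 1 for the second, and so on.
function exponentialCapMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

function jitteredDelayMs(
  attempt: number,
  baseMs: number,
  maxMs: number,
  random: () => number = Math.random
): number {
  // Pick a random point in [0, cap) so callers do not retry in lockstep.
  return Math.floor(random() * exponentialCapMs(attempt, baseMs, maxMs));
}

// Backoff alone: every caller waits exactly the same 200, 400, 800.
const caps = [0, 1, 2].map((attempt) => exponentialCapMs(attempt, 200, 5000));

// With jitter: two callers on the same attempt land at different times.
const callerA = jitteredDelayMs(2, 200, 5000, () => 0.25); // 200
const callerB = jitteredDelayMs(2, 200, 5000, () => 0.75); // 600
```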

Each of those is a deep-dive of its own. The point of this post is the shape: timeouts and jitter and error-classification at the bottom, breakers and bulkheads and shedding on top. Skip the bottom and the top can't save you.

The one distinction that changes everything: transient vs permanent

Underneath all of it sits a decision the naive version always gets wrong: should this error even be retried?

A timeout, a connection reset, a 503, a 429: those are transient. The dependency might succeed if you ask again later. Retrying is rational.

A 400, a 401, a 404, a validation rejection: those are permanent. The request is malformed or unauthorized or pointed at something that doesn't exist. Asking again, ever, with the same input, will never succeed. Retrying a 400 three times is just sending a guaranteed-doomed request four times: pure amplification, zero upside. You burned the dependency's capacity to confirm something you already knew on the first try.

Classify before you retry. Retry the transient, fail fast on the permanent.
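As a sketch, the classification is a handful of lines of TypeScript. The `Failure` type and `isRetryable` are hypothetical, and treating every 5xx as transient is a simplification: your dependency's own error contract should get the final say.

```typescript
// Hypothetical classifier over network-level failures and HTTP status codes.
type Failure =
  | { kind: "timeout" }
  | { kind: "connection_reset" }
  | { kind: "http"; status: number };

function isRetryable(failure: Failure): boolean {
  if (failure.kind === "timeout" || failure.kind === "connection_reset") {
    return true; // the request may never have been seen: asking again is rational
  }
  if (failure.status === 429) {
    return true; // rate limited: the same request can succeed later
  }
  // 5xx: trouble on the server side. Everything else (400, 401, 404...) is permanent.
  return failure.status >= 500 && failure.status <= 599;
}
```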

"AI gives you this for free now" — no, it gives you the storm

Ask an agent to "add retries to this client call" and watch what you get. Something like: a for loop, three attempts, await new Promise(r => setTimeout(r, 200)) between them, wrapped in a try/catch that retries on any thrown error.

It compiles. It reads cleanly. It passes the happy-path test, because on the happy path the first attempt succeeds and the loop never runs twice. A skim approves it.

It's the cold open. Fixed delay, no jitter: synchronized waves. No timeout, so a slow dependency hangs every attempt for as long as it likes, and your three "retries" become three unbounded waits stacked end to end. Catch-all error handling, so it cheerfully retries the 400 that will never, ever succeed. The agent built the exact retry storm that takes down the callee, and it built it in code that looks like the responsible thing to do. This is AI's silent-logic-drift failure mode wearing a competence disguise: the dangerous version and the safe version look almost identical, and the difference is everything that's missing.

The starter prompt

You don't fix this by typing more carefully. You fix it by asking for the right thing and then verifying the parts that don't show up in a skim. Here's a starter prompt that names every layer the naive version drops:

Wrap the call `await paymentsClient.charge(req)` so it is resilient to a
slow or failing dependency, in TypeScript:

- Enforce a hard per-attempt timeout (the call must not be able to hang).
- Retry only TRANSIENT failures (timeouts, connection resets, 5xx, 429).
  Do NOT retry permanent failures (4xx other than 429 — 400/401/403/404).
- Use exponential backoff WITH jitter between retries (not a fixed delay).
- Put a circuit breaker in front: after N consecutive failures, open and
  fail fast for a cooldown, then half-open to probe, then close on success.

Then, in plain prose, state: the total retry budget (max attempts and
worst-case total time), and the exact conditions under which the breaker
moves between closed / open / half-open.

What to verify: confirm there's a hard per-attempt timeout (not just a total deadline), that the backoff is actually jittered and not a fixed delay, and that 4xx-style permanent errors are explicitly not retried. If any one of those is missing, you have the storm again, politely.
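When you read the agent's breaker, it helps to have the state machine in your head. Here is a stripped-down TypeScript sketch of closed / open / half-open, assuming a single probe in half-open and an injected clock. The class, its thresholds, and its method names are hypothetical, and it leaves out everything a production breaker adds on top.

```typescript
// Hypothetical breaker with an injected clock; thresholds are illustrative.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private cooldownMs: number;
  private now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Should the caller attempt the request right now?
  allowRequest(): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) return false; // fail fast
      this.state = "half-open";
      return true; // cooldown over: let exactly one probe through
    }
    return false; // half-open: a probe is already out, hold everything else
  }

  recordSuccess(): void {
    this.state = "closed";
    this.consecutiveFailures = 0;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    // A failed probe reopens immediately; otherwise open at the threshold.
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

Three consecutive failures with a threshold of three open it, calls fail fast until the cooldown passes, one probe goes out, and its result decides whether the breaker closes or reopens.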

The inversion

The instinct says: a system gets more reliable as you add retries. The reality is the opposite. Each retry you add without a timeout, without jitter, without error-classification, makes the system more fragile under exactly the load it's supposed to survive, because you've added another path that multiplies traffic at the worst possible moment. Reliability didn't come from the retries. It came from the timeouts that bound them, the jitter that desynchronized them, the breaker that stopped them, and the judgment to know which errors deserved a second ask at all.

Retries are the last thing you add, not the first. And they're only ever as safe as the four things underneath them.

That's the theory and a first draft. The production version is in the paid series, Tuesdays and Thursdays: "Retries that don't take down the callee" (timeouts, exponential backoff with jitter, retry budgets, classifying retryable vs permanent, idempotency-aware retries) and "Circuit breakers, bulkheads & backpressure" (the full closed/open/half-open state machine in TypeScript, bulkhead isolation, load shedding, queue + backpressure) — each with the prompt to generate it and the checklist to verify what the agent hands you.

Subscribe free for the Friday theory. Upgrade when you want the production code and the verification checklist.

And tell me in the comments: what's the smallest blip you've watched cascade into a full outage — and was it the retries that did it?

Your Logs Are Lying to You

Glenn Eggleton — Fri, 29 May 2026 23:58:44 GMT

2:14am. A customer's checkout failed, and the on-call pager doesn't care that you were asleep. You open the dashboard: the API gateway logged the request, the order service logged something, the payment service logged something, inventory logged a warning, the notification worker logged an error. Five services. Five log streams. All of them have logs from 2:14am.

And not one of them tells you which lines belong to this request.

So you do the thing every engineer has done at 2am: you eyeball timestamps. You assume the gateway log at 02:14:06.221 and the payment error at 02:14:06.498 are the same request, because they're close and the story sort of fits. You're wrong, because at 2am there were forty other requests in that same 300ms window, and the error you're staring at belongs to a different user entirely. An hour later you've built a narrative out of coincidence, the incident is still burning, and your "observability stack" has told you nothing you didn't already know.

Here's the position I want to argue: per-service logs without a shared correlation ID aren't observability. They're text. You have a search box and a hope. The moment a request crosses a service boundary, you have lost the ability to reconstruct what happened to it, and no amount of log volume buys that ability back.

A log line answers "what happened here." It can't answer "what happened to this request."

The trap is that logging feels solved. Every service logs. The logs are searchable. You can grep. So when someone says "we need better observability," the instinct is "we already have logs."

But a log line is a local fact: this service, at this moment, observed this. That's genuinely useful when the request lives and dies inside one process. The instant it fans out (gateway calls order, order calls payment and inventory, inventory drops a message on a queue that a worker picks up later), the request becomes a path across processes, and no individual log line knows it's part of a path. Each line is a true sentence with no idea what paragraph it belongs to.

The fix is almost insultingly simple to state and almost never done by default: mint one ID at the edge (the first hop the request touches) and thread it through every downstream call and every log line, in every service, for the entire life of the request. Now "what happened to this request" is a single query: correlation_id = abc123. The path reassembles itself. The five-service eyeballing exercise becomes one filter.
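To make "mint at the edge, thread it through every log line" concrete, here is a minimal sketch using Node's built-in AsyncLocalStorage. The names requestContext, withCorrelationId, log, chargeCard, and handleCheckout are mine, invented for illustration, and the "I/O" is a timer. Treat it as a first draft under those assumptions, not the wiring you ship.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Hypothetical names throughout; only AsyncLocalStorage and randomUUID are real Node APIs.
const requestContext = new AsyncLocalStorage<{ correlationId: string }>();

// The "edge": reuse an inbound ID if the caller sent one, otherwise mint one,
// then run the rest of the request inside a context that carries it.
function withCorrelationId<T>(inboundId: string | undefined, fn: () => T): T {
  const correlationId = inboundId ?? randomUUID();
  return requestContext.run({ correlationId }, fn);
}

// Every log line reads the ID from the ambient context, so call sites
// never have to pass it by hand.
function log(msg: string, fields: Record<string, unknown> = {}): string {
  const correlationId = requestContext.getStore()?.correlationId ?? "unknown";
  const line = JSON.stringify({ msg, ...fields, correlation_id: correlationId });
  console.log(line);
  return line;
}

async function chargeCard(orderId: number): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 5)); // simulated I/O
  return log("payment attempted", { order_id: orderId });
}

async function handleCheckout(orderId: number): Promise<string[]> {
  const first = log("order received", { order_id: orderId });
  const second = await chargeCard(orderId); // the ID survives this await
  return [first, second];
}
```

Calling withCorrelationId("abc123", () => handleCheckout(4471)) stamps both lines with abc123 even though neither function takes an ID parameter. That is the property you are buying: the ID rides the request, not the function signatures.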

That ID is the load-bearing idea behind the whole observability stack. The three pillars people recite — logs, metrics, traces — aren't three views of the same thing; they answer different questions. Metrics tell you something is wrong (error rate is up, p99 spiked). Logs tell you what specifically happened at a point. Traces tell you where in the path the time went and where it broke. The correlation ID is the thread that lets a metric spike lead you to the exact traces, which lead you to the exact log lines. Without it, the three pillars are three disconnected dashboards you alt-tab between, narrating coincidence.

While we're here: your log levels and your log format are also lying

Two smaller lies ride along with the big one.

First, format. A log line like Order 4471 failed for user 92 after 1200ms is a sentence. To query it you write a regex, and the regex breaks the day someone rephrases the message. A structured log line, { "msg": "order failed", "order_id": 4471, "user_id": 92, "duration_ms": 1200, "correlation_id": "abc123" }, is data. You filter on fields, you aggregate duration_ms, you join on correlation_id. Same information, except one of them you can ask questions of and the other you can only read.
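Here is roughly what "data you can ask questions of" looks like once the lines are structured. The three sample lines are made up (the first reuses the example above), and the OrderLog shape is my assumption about which fields you would log.

```typescript
// Hypothetical log shape mirroring the fields in the example line.
interface OrderLog {
  msg: string;
  order_id: number;
  user_id: number;
  duration_ms: number;
  correlation_id: string;
}

// Illustrative sample data, not real output.
const rawLines: string[] = [
  '{"msg":"order failed","order_id":4471,"user_id":92,"duration_ms":1200,"correlation_id":"abc123"}',
  '{"msg":"order placed","order_id":4472,"user_id":17,"duration_ms":300,"correlation_id":"def456"}',
  '{"msg":"payment declined","order_id":4471,"user_id":92,"duration_ms":800,"correlation_id":"abc123"}',
];

const records = rawLines.map((line) => JSON.parse(line) as OrderLog);

// Filter on a field instead of writing a regex against a sentence.
const forRequest = records.filter((r) => r.correlation_id === "abc123");

// Aggregate a numeric field directly.
const totalMs = forRequest.reduce((sum, r) => sum + r.duration_ms, 0);

console.log(forRequest.map((r) => r.msg), totalMs);
```

Nobody can break that filter by rephrasing a message, because it never looks at the message.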

Second, levels. In most codebases info, warn, and error have drifted into "how I felt about this line when I wrote it." If everything routine is error because it looked scary, then error means nothing, and at 2am you can't filter to the lines that actually matter. A level is a routing and alerting decision, not a mood. error should mean a human needs to know. If it doesn't mean that consistently, your most important filter is noise.

The AI angle: ask an agent to "add logging" and watch it make this worse

This is exactly the failure mode this series keeps circling. Ask Claude or Cursor to "add logging to this service" and you get a confident sprinkle of console.log and logger.info calls, each with the local context the agent could see: logger.info('processing order', { orderId }). In a single file, in a single service, it looks great. Reviews clean. Ships.

And it's useless the instant the request crosses a boundary, because the agent had no concept of the request's whole journey, only the function in front of it. There's no ID minted at the edge, nothing propagated to the downstream call, nothing tying this service's lines to the next service's lines. The agent optimized for "this function now logs," which is a local fact, when the actual requirement was "this request is reconstructable across five services," which is a system property. It generated a plausible answer to the wrong question, and it compiles, and it passes review, and you find out at 2am.

You don't fix that by asking it to "add more logging." You fix it by naming the system property up front.

A starter prompt that names the right requirement

Here's a prompt that points the agent at the correlation ID and the async boundary it will otherwise lose track of:

Add a request correlation ID to this Node/TypeScript service.

Requirements:
- Generate a correlation ID at the edge (the first inbound hop). If the
  incoming request already carries one (e.g. an `x-correlation-id` header),
  reuse it instead of minting a new one.
- Store it in AsyncLocalStorage so any code in the request lifecycle can read
  it without passing it through every function signature.
- Automatically attach the correlation ID to every log line via the logger.
- Propagate it on every OUTGOING call this service makes (HTTP headers,
  outbound messages) so the next service inherits the same ID.

Then explain, in comments:
- How the ID survives an `await` (why AsyncLocalStorage and not a plain
  module-level variable).
- What happens when a request enters from a queue/worker instead of HTTP —
  where does the ID come from then, and what's the fallback if there isn't one.

What to verify: confirm the ID actually propagates across an `await` and into the downstream HTTP/queue calls (not just set once at the entry point and lost the moment you cross an async boundary), and that there's an explicit fallback when an inbound request arrives with no ID (queue messages, cron jobs, and replays often won't have one).
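A sketch of what that verification can look like, assuming the agent gave you something shaped like this. The names ctx, outboundHeaders, callDownstream, and handleWithGlobal are hypothetical, and the "orphan-" prefix is my own convention for marking IDs minted by the fallback. It puts the AsyncLocalStorage version next to the module-level variable the prompt asks the agent to argue against.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

const ctx = new AsyncLocalStorage<{ correlationId: string }>();

// Headers to attach to every outgoing call. Outside any request context
// (cron job, replay) fall back to a minted ID with a marker prefix, so the
// gap is visible in the logs instead of silent. The prefix is an assumption.
function outboundHeaders(): { "x-correlation-id": string } {
  const id = ctx.getStore()?.correlationId ?? `orphan-${randomUUID()}`;
  return { "x-correlation-id": id };
}

async function callDownstream(): Promise<{ "x-correlation-id": string }> {
  await new Promise((resolve) => setTimeout(resolve, 5)); // cross an async boundary first
  return outboundHeaders(); // real code would pass these to the HTTP client or publisher
}

// The anti-pattern: one module-level variable shared by every in-flight request.
let currentId = "";
async function handleWithGlobal(id: string): Promise<string> {
  currentId = id;
  await new Promise((resolve) => setTimeout(resolve, 5));
  return currentId; // by now another request may have overwritten it
}
```

Run two requests concurrently through ctx.run and each gets its own ID back from callDownstream. Run the same two through handleWithGlobal and both come back with the second request's ID. That difference is the whole reason for AsyncLocalStorage, and it only shows up under concurrency, which is why a single-request test won't catch it.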

"We have distributed tracing, isn't this solved?"

Strongest objection, so let me steelman it: "We run OpenTelemetry. Spans propagate trace context. The correlation ID is just the trace ID. You're describing a solved problem."

If you've genuinely wired OTel context propagation through every hop including your async boundaries and your queue consumers, and your logs carry the trace ID, then yes, you've done the hard part. But "propagates correctly" is the entire game, and it's exactly where it quietly breaks: the worker that pulls from the queue starts a fresh context with no parent, the setTimeout callback runs outside the active context, the third-party client doesn't inject headers — and your traces fill with orphan spans you can't connect to the request that caused them. OTel is the right tool. It doesn't absolve you of understanding why the ID has to be minted at the edge and survive every boundary; it just gives you a heavier way to get it wrong if you don't.
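The queue hop is the boundary worth seeing in code, because nothing propagates there unless you carry it yourself. This sketch uses an in-memory array as a stand-in for a real broker and plain AsyncLocalStorage rather than OTel's own APIs. QueueMessage, publish, consume, and the "worker-" fallback prefix are all my assumptions.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Hypothetical envelope: the ID travels as message metadata.
interface QueueMessage {
  body: string;
  correlationId: string | undefined;
}

const ctx = new AsyncLocalStorage<{ correlationId: string }>();
const queue: QueueMessage[] = []; // stand-in for a real broker

// Producer side: copy the ambient ID into the envelope before it leaves the process.
function publish(body: string): void {
  queue.push({ body, correlationId: ctx.getStore()?.correlationId });
}

// Consumer side: this runs later, outside the producer's async context,
// so the ID has to be restored from the envelope. If the message has none
// (cron job, replay), mint a marked fallback instead of logging nothing.
function consume(handler: (body: string) => string): string[] {
  const results: string[] = [];
  for (const message of queue.splice(0)) {
    const correlationId = message.correlationId ?? `worker-${randomUUID()}`;
    results.push(ctx.run({ correlationId }, () => handler(message.body)));
  }
  return results;
}
```

Delete the ctx.run line in consume and the handler sees no context at all. That is the orphan span, just without the tracing vendor's logo on it.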

The inversion

Stop measuring your observability by how much you log. Volume is the symptom of the problem, not the cure. More unstructured, uncorrelated lines is just a bigger pile to grep at 2am.

Measure it by one question instead: can you take a single failed request and reconstruct its entire path across every service, in one query, without eyeballing a single timestamp? If the answer is no, you don't have an observability gap. You have a correlation gap, and every log line you add until you close it is text you're paying to store and hoping you never have to read at 2am.

That's the theory and a first draft. The production version — AsyncLocalStorage context propagation done so it survives every async boundary and queue hop, structured logging wired to the logger, OpenTelemetry instrumentation, log↔trace correlation, plus the prompt to generate it and the checklist to verify what the agent hands you against the boundaries it loves to drop — is P4: Correlation IDs & Distributed Tracing in the paid series, which runs Tuesdays and Thursdays.

Subscribe free for the Friday theory. Upgrade when you want the wiring and the verification checklist.

And tell me in the comments: what's the longest you've spent correlating an incident by eyeballing timestamps, and what would a single correlation ID have saved you?

There Is No Transaction Across Two Services

Glenn Eggleton — Fri, 29 May 2026 23:58:31 GMT

3:10am. PagerDuty. A user moved 5,000 points from their wallet to a partner gift card and the balances don't add up: the wallet shows the points gone, the gift-card service shows nothing arrived. Support has eleven of these from overnight. The on-call engineer pulls the code, expecting a missing await or a swallowed error. What they find is worse, because it looks correct.

async function transferPoints(userId: string, amount: number) {
  try {
    await walletService.debit(userId, amount);   // succeeds
    await giftCardService.credit(userId, amount); // throws — partner 503
  } catch (err) {
    logger.error("transfer failed", err);
    throw err;
  }
}

The debit committed. The credit threw. The catch ran, logged, re-threw, and did absolutely nothing about the 5,000 points that are now nowhere. There is no ROLLBACK here that reaches back across the network and un-debits the wallet, because walletService already committed its own local transaction the instant debit returned. The try/catch caught the exception. It could not catch the state.

Here's the position I want to argue: a database transaction's atomicity ends at the boundary of one database. The moment your operation spans two services (or two databases, even inside one service) you do not have a transaction anymore. You have a sequence of independent commits, and the only question is what you do when one of them lands and the next one doesn't.

ACID is a property of one log, not your architecture

Atomicity isn't magic. It's a single transaction log on a single database, where the engine can replay or discard a set of writes as one unit because it owns all of them. walletService has one of those. giftCardService has a different one. Neither can see the other's uncommitted state, neither can veto the other's commit, and nothing on earth makes their two logs agree to flip together.

So when the senior textbook answer comes up, "use two-phase commit," here's why the industry quietly walked away from it. 2PC works by having a coordinator ask every participant to prepare (lock the rows, promise to commit), wait for all of them to say yes, then tell everyone to commit. The promise is atomicity across services. The price is brutal: every participant holds locks for the entire round trip, so one slow service stalls all of them. And the failure mode is the killer. If the coordinator dies after everyone prepared but before it says commit, every participant is stuck holding locks, blocked, waiting for a coordinator that isn't coming back. It doesn't scale, it couples your availability to your slowest node, and it turns one dead process into a system-wide freeze. That's largely why your message broker, your payment processor, and your favorite cloud database are unlikely to offer it across service boundaries.
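If the failure mode is hard to picture, a toy model helps. This is not a real 2PC implementation: there is no network, no timeout, and no abort path. The Participant shape and the coordinatorCrashes flag are invented purely to show where everyone gets stuck.

```typescript
type ParticipantState = "idle" | "prepared" | "committed";

interface Participant {
  label: string;
  state: ParticipantState;
}

// Phase 1: lock the rows and promise to commit.
function prepare(p: Participant): boolean {
  p.state = "prepared";
  return true;
}

// Phase 2: commit and release the locks.
function commit(p: Participant): void {
  p.state = "committed";
}

// Toy coordinator. `coordinatorCrashes` simulates the process dying
// after every participant voted yes but before anyone was told to commit.
function twoPhaseCommit(participants: Participant[], coordinatorCrashes: boolean): void {
  const allPrepared = participants.every((p) => prepare(p));
  if (!allPrepared) return; // abort path omitted in this sketch
  if (coordinatorCrashes) return; // nobody ever hears "commit"
  participants.forEach((p) => commit(p));
}

function makeParticipants(): Participant[] {
  return [
    { label: "wallet", state: "idle" },
    { label: "giftCard", state: "idle" },
  ];
}
```

Run it with coordinatorCrashes set to true and both participants end in "prepared": locks held, promise made, nothing left to do but wait for a coordinator that is not coming back.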

The saga: fake the atomicity, own the in-between

The pattern that replaced 2PC is the saga: model the operation as a sequence of local transactions, one per service, each of which commits independently and immediately. No distributed locks, no coordinator holding everyone hostage. The catch, and it is the whole point, is that every forward step must come with a compensating action: a second local transaction that semantically undoes it.

Debit the wallet. Then credit the gift card. If the credit fails, you don't roll back. You can't. You run the compensation for the debit: credit the wallet back. The system passes through an inconsistent state (points debited, not yet credited) and then claws its way back to a consistent one. That's the trade. You give up "the in-between never exists" and you buy "the in-between is bounded and recoverable."

Two ways to drive it. Orchestration: a coordinator owns the sequence. Call step 1, on success call step 2, on failure walk back through the compensations. The flow lives in one place you can read. Choreography: no coordinator; each service emits an event, the next service reacts to it, and compensations are themselves events others react to. Choreography removes the central component but scatters the flow across N services' event handlers, and "what's the current state of this transfer?" becomes a forensic exercise. For anything you'll have to debug at 3am, start with orchestration. You can read the whole story in one file.
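Orchestration in its smallest form looks something like this. SagaStep, runSaga, and the in-memory balances are my own illustrative names. The sketch deliberately assumes every compensation succeeds and keeps nothing durable, which is exactly the part a real orchestrator cannot assume.

```typescript
interface SagaStep {
  label: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Run the steps in order. If one fails, walk back through the steps that
// already committed, newest first, and run each one's compensation.
async function runSaga(steps: SagaStep[]): Promise<"completed" | "compensated"> {
  const committed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      committed.push(step);
    } catch {
      for (const finished of committed.reverse()) {
        await finished.compensate(); // assumed to succeed in this sketch
      }
      return "compensated";
    }
  }
  return "completed";
}

// In-memory stand-ins for the two services.
const balances = { wallet: 5000, giftCard: 0 };

const transfer: SagaStep[] = [
  {
    label: "debit wallet",
    action: async () => { balances.wallet -= 5000; },
    compensate: async () => { balances.wallet += 5000; },
  },
  {
    label: "credit gift card",
    action: async () => { throw new Error("partner 503"); },
    compensate: async () => { balances.giftCard -= 5000; },
  },
];
```

Awaiting runSaga(transfer) resolves to "compensated" and leaves the wallet back at 5000. Between the debit and the credit-back the wallet really did read 0, and that window is the in-between you now own.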

"Fine — I'll just wrap it in a try/catch and undo on failure"

This is the objection worth taking seriously, because it's almost the saga, and it's exactly what an agent hands you when you ask for one. Steelman it: on the error path, call the reverse operation. Isn't that compensation?

It's the happy-path drawing of compensation with all the hard parts erased. Two questions break it. First: what is observable while you're in-between? Between the debit and the credit-back, a user can open the app and see points that vanished into the void. Your support queue is made of that window. A real saga names those states and decides what the user sees in each one: "transfer pending," not a silently wrong balance.

Second, and this is the one that ends the argument: what happens when the compensation itself fails? The gift-card credit threw because the partner returned 503. Now you call walletService.credit to compensate, and that throws too, because the wallet service is mid-deploy. Your try/catch has no catch for its own catch. The points are now gone with no record that a recovery was ever owed. A real saga treats compensation as a first-class, retryable, durably-recorded step — because the compensation is the part most likely to run during the exact incident that caused the failure. A try/catch has nowhere to put that. That's the tell that you're looking at a fake saga, not a real one.
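Here is the shape of "first-class, retryable, durably-recorded" in miniature. The Map is a stand-in for a real durable table, there is no backoff between attempts, and the undo operation is assumed to be idempotent so that repeating it is safe. CompensationRecord and compensateDurably are names I made up for the sketch.

```typescript
type CompensationStatus = "pending" | "done";

interface CompensationRecord {
  sagaId: string;
  step: string;
  status: CompensationStatus;
  attempts: number;
}

// Stand-in for a durable table. A real system would write this to a database
// so the record survives a restart; an in-memory Map does not.
const compensationLog = new Map<string, CompensationRecord>();

async function compensateDurably(
  sagaId: string,
  step: string,
  undo: () => Promise<void>, // assumed idempotent
  maxAttempts = 3,
): Promise<CompensationRecord> {
  const key = `${sagaId}:${step}`;
  // Record that a recovery is owed BEFORE attempting it, so a crash
  // mid-compensation cannot erase the debt.
  const record: CompensationRecord =
    compensationLog.get(key) ?? { sagaId, step, status: "pending", attempts: 0 };
  compensationLog.set(key, record);

  while (record.status === "pending" && record.attempts < maxAttempts) {
    record.attempts += 1;
    try {
      await undo();
      record.status = "done";
    } catch {
      // Swallow and loop: the record stays "pending", which is the point.
    }
  }
  return record; // still "pending" means a later sweep or a human owes a retry
}
```

The difference from the try/catch is one line: the record exists before the first attempt. When the wallet service is mid-deploy and every attempt throws, you end up with a row that says "pending, 3 attempts" instead of 5,000 points and no evidence.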

The starter prompt

You don't ask an agent for "a transaction across two services." There isn't one, and it'll cheerfully write you the try/catch above and call it done. You ask it to model the saga, and to confess the parts it likes to skip:

I have a multi-service operation: .

Model this as a saga using orchestration (a single coordinator drives the
steps). For EACH forward step, define its compensating transaction — the
local operation that semantically undoes it. Then, before any code:

1. List every intermediate state that is observable to a user (e.g. money
   debited but not yet credited) and what the user should see in each.
2. Specify what happens if a COMPENSATION itself fails — it cannot just be
   a try/catch. How is it retried and where is the in-flight state recorded?

Do not write the production implementation yet. Give me the step/compensation
map and the two lists first.

What to verify: every forward step has a named compensation (no orphans), and there is a real answer for a failed compensation: durable record plus a retry path, not a catch that logs and re-throws. If the agent's compensation story is "wrap it in try/catch," you got the fake saga. Send it back.

That's the theory and a first map. The production version is the paid series, Tuesdays and Thursdays: a typed saga orchestrator with compensations that retry safely, how to record in-flight state so a failed compensation survives a restart (P5, Building a saga orchestrator), and how to publish those step events reliably without falling back into 2PC via the transactional outbox (P7, The outbox pattern). It ships with the prompt to generate the orchestrator and the checklist to verify what the agent gives you against every failure mode named here.

Subscribe free for the Friday theory. Upgrade when you want the orchestrator and the verification checklist.

And tell me in the comments: what's the worst limbo state a "transaction" has ever left in your system — money, points, or inventory stuck between two services that each thought the other had it?