The first topic to clear the expert gate — what changed over the weekend, and the night it actually mattered
Five days ago I asked whether anything would ever clear the expert gate. Over the weekend I built the machinery to find out. Over the next two days, six sub-areas on the same topic crossed it — the first topic to be fully expert across every one of its parts. This is the post about the work between those two moments, and the nightly that proved the machinery does more than promote on demand.
Five days ago I closed out the journal entry about the first proficient promotions with a hedge:
at least one sub-area, somewhere, has to eventually start clearing the expert gate. If none do across the next several weeks, the graduation path I wired in didn't actually graduate anything; it just made me feel like I'd built one.
The first one crossed on Sunday. By Monday night, all six sub-areas on the same topic had — the first topic to be fully expert across every one of its parts.
This post is about the work that made it possible, what the audit found, and the night the new machinery did something I hadn't asked it to do.
Saturday — the cleanup that wasn't supposed to be a story
Saturday I'd planned a routine system audit. About four hours of work expected — sweep the substrate for contamination, check the predicate vocabulary, prune scope drift. The kind of thing that doesn't make a journal entry.
It expanded into nine separate code fixes and 5,679 polluted triples removed. The big one: a predicate canonicalization gap was silently classifying most failure-mode triples as fragments when they actually had complete trigger + diagnosis + fix loops. The lowercase/uppercase mismatch meant the loop classifier was failing the membership check on every relation that had been canonicalized to uppercase form.
The visible effect: a stage that was supposed to measure how deep a sub-area's failure-mode coverage went had been measuring it about a third of what it actually was. Several sub-areas were stuck at cumulative-stage 3 not because they hadn't built the substrate but because the substrate was being silently underweighted.
I noticed because I went looking for a different bug — why isn't python's type_system advancing? — and the answer turned out to live in a different module entirely.
I rolled the fix in and let the nightly rebuild substrate counts from scratch. Half a dozen sub-areas snapped to higher stages that night purely from accurate measurement. Nobody promoted, but the conditions for promotion were now real.
Sunday — the expert pipeline got built, and it produced a promotion
The plan for Sunday was modest: design the expert-tier exercise shape, hand-curate the seed corpus for one sub-area, smoke-test the pipeline. Two or three honest data points. No promotion expected.
The new shape is different from anything the lower tiers do. Instead of "write this function" the prompt is "a user came to you with this request" — a real-world-shaped ask, intentionally requiring interpretation. The sub-area has to surface clarifying questions, identify the actual unit of work underneath, and produce a plan. Then a separate judge scores it across five dimensions calibrated for that shape.
The novel part is the second reviewer. A single judge scoring its own work has a known blindspot — it tends to rationalize the choices it would have made itself. So the expert shape uses what I'd been calling the designated skeptic: an independent second judge whose only job is to push back on the first judge's verdict and look for the flaws it might have rationalized away.
I picked the skeptic deliberately for difference, not for raw quality — same skill range as the primary judge, built on different foundations, with different blindspots. The pair operates under a conservative aggregator: when they agree, weighted average; when they disagree, take the lower score and flag the dimension. The point isn't to find the right answer between two opinions — it's to refuse to promote when there's an honest disagreement about quality.
The first sub-area I aimed it at was writing.summarization. It promoted to expert that evening — the first sub-area at that tier on any topic.
I went to bed expecting that to be the headline. It wasn't.
Monday — what I'd actually built was a multiplier
Monday morning I started by diagnosing why marketing and python untiered sub-areas weren't picking up. The picker had a structural bug: sub-areas close to mastery got rotation priority, which made sense in isolation, but blocked sub-areas — ones near the threshold but unable to actually advance — were monopolizing the cycles. They couldn't move forward but they kept being picked.
I shipped a small fix that demotes a sub-area out of the priority lane after a few stuck picks in a row, freeing the cycles for sub-areas that could actually use them. At the same time I biased the planner toward the gating stage — the one stage holding a sub-area back — so when a stuck sub-area finally did get a turn, the work concentrated on the blocker instead of being spread across all five stages.
Then I went back to the expert pipeline and ran it against the remaining five writing sub-areas. The dual-judge architecture caught real problems on the way — design_decision_justification came in just below the gate on one session because the skeptic correctly identified an over-engineering pattern the primary judge would have rationalized away. By late evening, all six writing sub-areas had crossed into expert — the first topic to be fully expert across the board.
The thing I want to be careful about: those promotions came faster than I'd budgeted. Six in two days isn't suspicious on its own — the gate is a sustained depth bar across multiple sessions, which is real but not impossible for a deep-enough sub-area to clear if the dual-judge is honest about it. But I wrote up the failure modes the gate currently doesn't catch — tag-uniform sessions, per-dimension floor leaks, adversary-concern clustering — and they're going into the next iteration. The gate that promoted these is the right gate to find out whether a topic can clear it. It's not strict enough to be the long-term standard.
I'll know whether these promotions hold when something tries to use one of them in production for a task it hadn't already practiced on. That test starts now.
The nightly that did something I hadn't asked for
The most interesting part wasn't the promotions. It was the nightly that fired Monday night with the new picker fix and the new planner bias both active for the first time.
I took a baseline snapshot before the restart: 34 untiered sub-areas on marketing and python, of which 23 were stuck at cumulative-stage 3 because their failure-mode coverage wasn't growing. They'd been frozen at that depth for weeks.
By Tuesday morning's diff:
- 24 of the 34 had been picked overnight
- 23 of those had passed an exercise
- 24 of them had grown their stage-4 (failure-mode) coverage — the thing they'd been stuck on
- 20 had actually advanced cumulative-stage
- 9 had crossed into the competent tier
The picker fix was working — the stuck-band-0 sub-areas were getting demoted, with markers on campaign_planning, ab_testing_design, dataclasses_typed, static_type_checking. The planner bias was working — sub-areas that hadn't grown their failure-mode coverage in weeks were producing 30x and 60x gains in a single nightly (web_analytics: 84 → 234.6; static_type_checking: 3.6 → 114.1; funnel_diagnosis: 0 → 60.7).
I'd shipped two changes targeting different parts of the same problem, expecting them to compound. They did, but more cleanly than I'd guessed. Most of the work that needed to happen got done in one night. The remaining stuck sub-areas — about four or five — appear to need a different fix entirely. They have enough passes but no failure-loop substrate to grow from, and the planner can request fresh failure scenarios but the well runs dry for some sub-areas faster than others. The next piece — failure-loop synthesis from existing substrate — is queued for this week.
Where things stand
Six expert sub-areas, all on the writing topic. The first topic to fully clear a tier gate. Twelve sub-areas at proficient. Thirty-five at competent. The ladder isn't theoretical anymore — it's the thing the system is climbing.
The dual-judge architecture works. The picker fix works. The planner bias works. The substrate cleanup from Saturday made all of them more honest. The Master-tier exercise shape — the next gate, the one that asks the system to navigate a vague request and translate its reasoning across two audiences — is designed, the seed corpus for writing's summarization is hand-curated, and the V0 build is finishing as I write this.
What I'm not going to claim
I'm aware of three risks worth naming.
The first is calibration. The gate that produced six expert promotions in two days might be too generous. The next-iteration hardening — tag-diversity guards, per-dimension floor checks, adversary-concern clustering — is documented and will ship alongside the Master V0 work. If any of those reveal that one of the current expert sub-areas was promoted on a narrow band of prompts it had memorized, I'll write about that.
The second is the dual-judge itself. The skeptic has been catching real over-engineering and missing-detail patterns I wanted it to catch. But it's been softer in places than I'd hoped — particularly on the dual-explanation step that's coming with Master tier. If the dual-judge stops finding things, that's either because there's nothing left to find or because the adversary stopped looking. I have no way to tell those apart from the inside.
The third is durability. The whole point of the expert tier is that an expert sub-area should be able to handle a request it hadn't practiced on. Until one does, under a real user's vague phrasing, expert is just "passed the gate." I'm not going to claim the topic is meaningfully expert until something downstream uses one of those sub-areas in a way the seed corpus didn't anticipate, and it holds up.
The check on all three of these is the Master-tier evaluation — vague prompts, four reviewers, a separate audience-fidelity dimension. Once it's wired, the first sub-area to face it will be writing.summarization. If it can't translate its own reasoning across registers, expert is paper.
The line from last week still applies: I'll write about it when there's signal worth reporting. There's a lot more signal coming.
Compiled June 9, 2026, while the nightly that produced the 24-sub-area advance is winding down and the Master V0 build is waiting on the morning restart.
- 2026-06-09First topic to fully clear the expert gate
All six sub-areas on the writing topic crossed from proficient to expert in 48 hours — the first topic where every sub-area is at the expert tier. The dual-judge architecture — a primary judge paired with an independent adversary — caught real flaws on the way; the conservative aggregator pulled scores down where the adversary disagreed.
New journal entries delivered when they publish. No spam. Unsubscribe with one click.