The first proficient promotions — and what I added to the public dashboard to keep myself honest
Yesterday's post ended with "the system gets to make its case." Overnight, it made it. Nine sub-areas crossed from competent to proficient — the first promotions in 51 days. The story of why they happened now, what the audit found, and the new graph-health panels I shipped to the public site so the next promotion has nowhere to hide.
Yesterday I ended the journal with a line I half-meant as a hedge:
So for now, the system gets to make its case. I'll write about it when there's signal worth reporting.
Overnight, the system made its case.
Nine sub-areas crossed from competent to proficient — the first promotions to that tier in fifty-one days.
For most of those fifty-one days I was telling myself the ladder was real. It wasn't — not yet. Nothing had ever climbed it. A ladder that nothing climbs is a ladder you've built but never tested.
This post is about why those promotions happened now, what I found when I audited them, and the new panels I added to the public site to make sure I can't fool myself about the next set.
Why no one was getting promoted
Going into this week I had three suspicions, and a decision I wasn't proud of about all of them.
The first was the verifier — a separate model whose only job is to push back on the system's answers — had been flagging vague failure modes on almost every session. Roughly nine out of ten verification rounds came back with the same complaint, "this is missing important constraints," which is the kind of feedback that means nothing once it's universal. The second was that the only way a sub-area got deeper practice was to keep being randomly picked, and most weren't getting picked — the system kept generating new exercises against shallow material instead of harder ones against material it had already touched. The third was that the promotion gate treated every failure mode the verifier could flag as equally disqualifying, which meant "you missed a constraint" was killing promotions at the same rate as "you hallucinated an API that doesn't exist."
The decision I wasn't proud of: I'd been letting the system run for weeks against all three of these, hoping the ladder would prove itself organically. It wouldn't have. None of them were going to fix themselves.
So this week I shipped three changes.
Fix one: the verifier learned to mean it
I rewrote the verifier's prompt to demand specific evidence before flagging anything, and to default to "clean" under uncertainty rather than "flagged."
The old behavior was the worst kind of unhelpful — it generated plausible-sounding critiques without checking whether they applied. The new behavior is closer to a code reviewer who only raises issues they can point at.
Within a day the flag rate on the most-flagged failure mode dropped from roughly 90% of sessions to something noticeably lower. The verifier wasn't being more permissive; it was being more honest. Most of what it had been flagging hadn't actually been wrong.
Fix two: a way to deepen, not just broaden
I added a new mode of practice the system runs in the background, separate from the normal exercise loop. Its only job is to go deeper on sub-areas that already have some material — not to introduce new ones.
The mechanic is simple. When the deeper-practice job fires, it looks for sub-areas where the existing material is mostly shallow, and forces the next round of exercises to push the depth distribution up. The normal loop is still free to wander; this one specifically refuses to.
Before this existed, depth was a side effect of luck. Now it's a thing the system actively chases.
Fix three: the promotion gate learned to distinguish failure modes
This was the change that mattered most.
The verifier can flag eight different failure modes. Most of them are soft — the answer was real, just underspecified (things like "missed a constraint" or "brittle assumption"). These are corrigible criticisms; a slightly more careful answer would have addressed them.
The other three are hard: "hallucinated confidence," "cargo-cult syntax," "misunderstood the question." These aren't underspecification — they're wrongness.
The old gate didn't distinguish. Any flag at any rate above a low threshold blocked promotion.
The new gate splits them. If the sub-area's depth — measured separately, by judging the substance of its exercises rather than counting flags — is above a meaningful floor, soft criticisms get tolerated. Hard criticisms never do. The reasoning is that a sub-area producing genuinely deep work but occasionally being told "you could have constrained that more" is exactly the kind of work a proficient-tier learner produces. Refusing to promote it for cosmetic flags was the gate punishing the wrong thing.
The next morning, nine sub-areas qualified for the first time.
What the audit found
I don't trust promotions just because the gate says so. The whole point of the verifier is that the system can confidently produce work that looks right but isn't. The whole point of promoting a sub-area is that it gets less supervision afterward. So before celebrating, I went through all nine.
Of the nine, four were rock-solid. Mean depth scores between 0.72 and 0.87 — the verifier's own deepest judgments — across a long enough sample to mean something. These sub-areas would have promoted under almost any reasonable gate.
Five were borderline-but-legitimate. Depth scores in the 0.63–0.70 band — above the floor but not by much. Every one of them had crossed the gate honestly; none had been pushed through by a quirk of the new soft-mode tolerance. But they're the ones I'll watch for regression.
No false positives — no sub-area that promoted on the strength of one good day, no sub-area where the gate got tricked by volume rather than substance. The fix worked the first night it had a chance to.
The bigger thing: a graduation path that didn't exist
There's a quieter change inside fix three that I haven't talked about yet.
Even with the gate now tolerating soft criticisms at the proficient tier, the next tier — expert — still won't. To ever cross that one, a sub-area has to genuinely start articulating constraints rather than just being allowed past the gate without them.
The verifier had been telling the system what it was missing for weeks. The system had never been shown those critiques while generating its next round of exercises. The feedback was being collected and then thrown into a folder.
I wired the feedback back in. Now when a sub-area's last few exercises were flagged for, say, missing constraints, the next exercise's prompt explicitly says: "these are the specific things prior runs missed; address them."
That single change is the graduation mechanism. Without it, proficient was the end of the road and expert was unreachable in principle. With it, there's actually a path. Whether the system walks it is now a question I can answer empirically over the next several weeks instead of just asserting in a journal post.
The public site, made harder to fool
The other thing I did today is the reason I'm writing this much detail.
I added three new panels to the public graph-health dashboard:
-
Hierarchy maturity — the percentage of triples sitting in scopes with no parent edge. It's around 5% right now; a couple of weeks ago, before last week's scope cleanup, it was 11–12%. Lower is better; this is the number that tracks whether new knowledge is being attached into the existing structure or floating loose.
-
Scope concentration — the share of all triples held by the largest 10, 50, and 100 scopes. A healthy graph distributes; a sick one concentrates. After last week's cleanup the top-10 share dropped from 51% to 45%, but I want to watch this every day, not look it up by hand.
-
Strict cross-domain bridges — the percentage of triples where both the subject and the object span multiple domains. The existing cross-domain entity metric counted any entity that appeared in two or more domains; this is the stricter version. It only goes up when the graph is genuinely weaving topics together, not when one popular concept happens to get reused.
I'm putting these on the public site for a specific reason. If I keep these numbers in private dashboards, I will quietly stop looking at them the week they stop trending the way I want. Putting them where anyone can see is a forcing function — first against myself, then against anyone I eventually tell about this project.
This is the same instinct as yesterday's post. The most dangerous failure modes don't crash. They keep running and hand you the wrong answer. The defense is to put the right answer where you can't avoid noticing when it changes.
Where things stand
Twelve sub-areas at the proficient tier — the first nine from that night (four with confidence, five worth watching) plus three more that have crossed since. The graduation mechanism is wired in for the first time. The verifier has stopped crying wolf. The promotion gate distinguishes underspecification from wrongness. And three new graph-health panels are now on the public dashboard so the next time any of this drifts I'll see it before I can talk myself out of it.
What I'm not going to claim
The interesting honesty test isn't this week — it's the next several.
A few things have to hold for these promotions to mean something a month from now. The promoted sub-areas need to keep performing at this depth rather than regressing as the system's attention spreads thinner. The verifier's lower flag rate has to be correct — fewer flags because there's less to flag, not because the verifier has gone soft and stopped trying. And at least one sub-area, somewhere, has to eventually start clearing the expert gate. If none do across the next several weeks, the graduation path I wired in didn't actually graduate anything; it just made me feel like I'd built one.
I'll know within a few weeks whether any of that's holding, and I'll write about it when there's signal worth reporting. The line from yesterday still applies.
Compiled June 4, 2026, the morning after the first nine promotions.
New journal entries delivered when they publish. No spam. Unsubscribe with one click.