The first month — how Finch got here
Backdated devlog of the first 34 days, from an empty repo to a substrate doing real work. Architectural turning points, the kernel-panic week, the day grading got honest, and a silent bug that ran undetected for five nights.
I started this codebase on April 14, 2026, with an empty README and an empty requirements list. Five days later, the first real modules landed: memory, retrieval, pattern extraction, learning. The premise wasn't novel; plenty of people have tried building AI systems around persistent memory, local deployment, and workflow adaptation. What was untested was my particular bet: that a specific architecture, not scale, would be the lever for reaching a system with long-term memory and persistent learning. Low latency. Runs on consumer-grade hardware. Gets better at my workflow the more I use it. Doesn't reset every turn.
To be clear: the goal isn't to beat frontier AI tools. I use them daily and they're remarkable at edge cases I'll never match. The goal is something different — a system with a stable voice, a coherent long-term memory, and the property that being corrected once makes it better forever after. Not a wrapper around someone else's model. A ground-up architecture.
What follows is the backdated devlog of the first 34 days. Some weeks read tighter than others — early on the daily artifact was code, not story. Later weeks have more narrative because the system grew enough to surprise me.
Late April — building the substrate
The first two weeks were about getting the basic loop working — Q&A pair storage, retrieval, a pipeline that pulls reusable patterns out of those pairs, and a knowledge graph that turns the same pairs into entities and relationships. Well-trodden territory; none of that part is novel.
The mental model that drove the rest: I don't want another chat-with-an-LLM product. I already use those daily. I want a system that learns from being corrected, gets better at my workflow over time, and doesn't lose context every turn.
I use the word substrate throughout this log to mean the system's accumulated working memory: stored corrections, extracted skills, graph relationships, retrieval pathways, and learned patterns that persist across sessions.
Every time the system gets something wrong — whether by failing on its own, being told, or finding better information — the correction is supposed to become part of the substrate: retrievable, usable across queries, but earned rather than silently overwritten.
By the end of April I had the loop, plus a way to grade outputs against domain-specific rubrics. I added per-task metadata that captured how something was hard — not just whether it eventually passed. How many attempts. What kept tripping. How the difficulty curved. That metadata turned out to matter a lot later.
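A minimal sketch of what that per-task metadata could look like. This is not the project's actual schema: `TaskDifficulty`, its field names, the 0.0 to 1.0 score scale, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TaskDifficulty:
    """Hypothetical record of how a task was hard, not just whether it passed."""
    task_id: str
    passed: bool
    # One rubric score per attempt, in order (assumed scale: 0.0 to 1.0).
    attempt_scores: list[float] = field(default_factory=list)
    # Free-form labels for whatever kept tripping the system.
    stumbling_points: list[str] = field(default_factory=list)

    @property
    def attempts(self) -> int:
        return len(self.attempt_scores)

    def difficulty_curve(self) -> list[float]:
        """Score change between consecutive attempts."""
        s = self.attempt_scores
        return [round(b - a, 3) for a, b in zip(s, s[1:])]


# Made-up example values, purely illustrative.
record = TaskDifficulty(
    task_id="example-task",
    passed=True,
    attempt_scores=[0.4, 0.55, 0.9],
    stumbling_points=["kept misreading the prompt"],
)
print(record.attempts, record.difficulty_curve())
```

Two tasks that both end in a pass can look very different through this lens: one attempt with no stumbling points versus three attempts with a late jump.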
Early May — schema honesty
Mostly quiet structural work. I deferred a schema upgrade I'd been planning because the substrate wasn't dense enough yet to justify the extra cost. I also made a small but important tweak to how the system chose what to work on next: sub-areas close to mastery now ranked ahead of further-out ones. That fixed a quiet bug where the system kept choosing the safer, easier work.
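The ranking tweak can be pictured as a sort on distance to a mastery threshold. This is a sketch under stated assumptions: `SubArea`, the `score` field, and the `MASTERY_THRESHOLD` value are hypothetical, and the real selection logic may weigh other factors.

```python
from dataclasses import dataclass

MASTERY_THRESHOLD = 0.9  # assumed cutoff, not a value from the project


@dataclass
class SubArea:
    name: str
    score: float  # assumed 0.0 to 1.0 progress estimate


def next_up(sub_areas: list[SubArea]) -> list[SubArea]:
    """Rank unfinished sub-areas, closest to the threshold first."""
    open_areas = [a for a in sub_areas if a.score < MASTERY_THRESHOLD]
    return sorted(open_areas, key=lambda a: MASTERY_THRESHOLD - a.score)


areas = [SubArea("a", 0.3), SubArea("b", 0.85), SubArea("c", 0.95), SubArea("d", 0.6)]
print([a.name for a in next_up(areas)])
```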
Then I added evaluation dimensions that had been missing. Writing was getting graded on grammar and clarity, but not on "does the reader come away able to act?" Analysis was getting graded on coherence, but not on "does this change what the decision-maker would do?" Without these, the system was rewarding well-formed text that nobody could use.
Around this time, Finch got faintly snarky. When I asked the same question for the third or fourth time, it noticed. Not a bad sign — it meant the memory layer was actually working — but something I'd need to dial later alongside voice and tone work.
I also shipped a routing layer that decides, per question, whether to run the full teacher chain or skip straight to retrieval when the system already knows the area well.
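In its simplest form, a routing decision like this reduces to a threshold on how well the area is already covered. The sketch below assumes a single `familiarity` estimate and an illustrative threshold; `Route` and `choose_route` are hypothetical names, and the real router may use richer signals.

```python
from enum import Enum


class Route(Enum):
    RETRIEVAL_ONLY = "retrieval_only"
    FULL_TEACHER_CHAIN = "full_teacher_chain"


def choose_route(familiarity: float, threshold: float = 0.8) -> Route:
    """Skip the teacher chain only when the area is already well known.

    `familiarity` is an assumed 0.0 to 1.0 estimate of how well the
    substrate covers the question's sub-area; the threshold is illustrative.
    """
    if familiarity >= threshold:
        return Route.RETRIEVAL_ONLY
    return Route.FULL_TEACHER_CHAIN
```

Defaulting to the full chain when familiarity is low keeps the cheap path from hiding gaps in areas the system has barely seen.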
May 19 — the day grading got honest
This day became the architectural turning point.
I'd built a tutoring mode earlier — multi-step projects, attempt each step, escalate to hints if stuck, fall back to a reference solution. Each step carries metadata about how the system got there: from its own substrate, from an external search, from a hint, from being shown the answer, or not at all. That tells me something rubric grading can't: where the system is actually weak.
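The escalation and its provenance tag can be sketched as a loop that tries each kind of help in turn and records which one produced a passing answer. The five provenance values come from the description above; the escalation order, the `run_step` function, and its callback signatures are assumptions for illustration.

```python
from enum import Enum
from typing import Callable, Optional


class Source(Enum):
    SUBSTRATE = "substrate"
    EXTERNAL_SEARCH = "external_search"
    HINT = "hint"
    SHOWN_ANSWER = "shown_answer"
    NOT_AT_ALL = "not_at_all"


def run_step(
    attempt: Callable[[Source], Optional[str]],
    check: Callable[[str], bool],
    reference_solution: Optional[str] = None,
) -> tuple[Source, Optional[str]]:
    """Try each kind of help in order; record which one produced a pass."""
    for source in (Source.SUBSTRATE, Source.EXTERNAL_SEARCH, Source.HINT):
        answer = attempt(source)
        if answer is not None and check(answer):
            return source, answer
    # Nothing worked: fall back to the reference solution if one exists.
    if reference_solution is not None:
        return Source.SHOWN_ANSWER, reference_solution
    return Source.NOT_AT_ALL, None
```

Aggregating the returned `Source` values per sub-area shows where answers come from the substrate and where they only arrive after a hint or a shown answer.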
Then came the deeper insight:
“A passing rubric score is a thermometer, not a furnace.”
A correct answer reached through faulty reasoning scores the same as a genuinely understood answer. It shouldn’t.
So I built a second-pass classifier that runs after rubric passes. For each step it asks: was the reasoning intact, weak, or absent? And if there was a problem, what shape did it take? I gave it a vocabulary of common failure modes — things like "right answer reached through wrong reasoning," "used syntax that looks right but isn't," and "asserted false things with confidence."
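The classifier's output can be pictured as a small record attached to each step. The sketch below covers only the data shape and one possible way to turn verdicts into a depth figure; the classification itself is not shown. The three reasoning levels and the failure-mode wording come from the text above, while the class names and the `depth` formula are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Reasoning(Enum):
    INTACT = "intact"
    WEAK = "weak"
    ABSENT = "absent"


class FailureMode(Enum):
    RIGHT_ANSWER_WRONG_REASONING = "right answer reached through wrong reasoning"
    PLAUSIBLE_BUT_INVALID_SYNTAX = "used syntax that looks right but isn't"
    CONFIDENT_FALSE_ASSERTION = "asserted false things with confidence"


@dataclass
class StepVerdict:
    rubric_passed: bool
    reasoning: Optional[Reasoning] = None  # only set when the rubric passed
    failure_mode: Optional[FailureMode] = None

    def counts_toward_depth(self) -> bool:
        """A rubric pass only counts when the reasoning behind it held up."""
        return self.rubric_passed and self.reasoning is Reasoning.INTACT


def depth(verdicts: list[StepVerdict]) -> float:
    """Assumed formula: fraction of steps with a pass backed by intact reasoning."""
    if not verdicts:
        return 0.0
    return sum(v.counts_toward_depth() for v in verdicts) / len(verdicts)
```

Under this assumed formula, a three-step task where every step passes the rubric but two show absent reasoning scores a depth of one third, even though rubric-only grading would call it a clean pass.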
Once I started running this, the picture changed dramatically.
May 20 morning — the honest rename
After seven baseline tutoring sessions across sub-areas I'd previously marked "mastered," the depth data was clear: everything I'd labeled mastered was actually landing at a depth of 25–55%.
The rubric had been measuring isolated pass rates. Multi-step depth verification exposed what isolated tests couldn't.
The principle that drove what I did next:
If something can pass A, it might still be useful for C. But usefulness doesn't mean A is actually true. I'd rather take longer to ship than ship something that just gets by.
So I renamed the tiers honestly:
- competent — passes the rubric
- proficient — multi-step composition plus sustained depth within one sub-area
- expert — same plus diversity plus no corrections needed
- mastered — integration with neighboring sub-areas
- domain certified — every sub-area in a topic at mastered, plus a topic gateway
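The ladder above can be read as a chain of gates where each rung requires every rung below it. The following is a sketch of that reading, not the project's code: the boolean inputs stand in for whatever checks actually back each rung, and `UNRANKED` is an added placeholder for sub-areas that have not yet passed the rubric.

```python
from enum import IntEnum


class Tier(IntEnum):
    UNRANKED = 0  # assumed placeholder, not one of the named tiers
    COMPETENT = 1
    PROFICIENT = 2
    EXPERT = 3
    MASTERED = 4


def tier_for(
    passes_rubric: bool,
    sustained_depth: bool,
    diverse_no_corrections: bool,
    integrates_neighbors: bool,
) -> Tier:
    """Climb rung by rung; stop at the first requirement that fails."""
    checks = [passes_rubric, sustained_depth, diverse_no_corrections, integrates_neighbors]
    tier = Tier.UNRANKED
    for rung, ok in zip(list(Tier)[1:], checks):
        if not ok:
            break
        tier = rung
    return tier


def domain_certified(sub_area_tiers: list[Tier], gateway_passed: bool) -> bool:
    """Topic level: every sub-area mastered, plus the topic gateway."""
    return bool(sub_area_tiers) and gateway_passed and all(
        t is Tier.MASTERED for t in sub_area_tiers
    )
```

Because the loop stops at the first failed check, a sub-area cannot skip a rung: strong diversity without sustained depth still lands at competent.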
After the migration ran, every one of the 19 previously-"mastered" sub-areas became competent. Zero proficient or higher. That’s the honest state of the system. It’s also the roadmap.
May 22 morning — early movement in the new ladder
This morning I woke up to an overnight session that ran about an hour longer than expected and produced the first real movement under the new mastery system.
Writing completed the competent rung, and analysis promoted three additional sub-areas to competent overnight. By noon, analysis had already added another during the daytime cycles.
That mattered for two reasons.
First: Finch is still struggling with higher-difficulty Python work. That's expected, and honestly desirable — the harder the domain, the more pressure the architecture applies to retrieval, reasoning integrity, and multi-step depth. Easy passes don't teach the system much; the difficult failures are where the actual learning lives.
Second: analysis was the most recently added domain, yet it immediately began climbing faster than older areas. Within four days it had already promoted eight sub-areas to competent. That’s an encouraging sign that the substrate and transfer architecture are starting to compound instead of behaving like isolated learning tracks.
The important thing is not the labels themselves. “Competent” is still a relatively low rung compared to where the system needs to go. What matters is that the progression now appears tied to deeper verification instead of isolated rubric passes.
What I learned that's worth saying out loud
A few things from the month that I want to remember.
Most architectural commitments are basically unfalsifiable until behavior contradicts them. The depth-skeptic (the second-pass reasoning classifier) exists because I caught the system passing rubric tests while showing absent reasoning on two of three steps in the same task. I could have shipped capability claims weeks earlier than I did, but the depth data said the system wasn't there yet. Renaming the tiers instead of pretending otherwise cost some marketing optics but preserved the integrity of the measurements. That tradeoff has paid for itself several times since.
The other lesson — and this one keeps coming up — is that silent failures are worse than loud ones. A worker-health bug skipped a major phase for five nights while the surface appeared healthy; the cure is observability that makes "what just happened?" answerable in seconds rather than detective work. The hardest part of building this kind of system isn't making it look intelligent. It's making the measurements honest enough to know when it isn't.
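One small form of that observability is a check that compares the phases a nightly run was supposed to execute against the ones that actually reported completion. This is a sketch only: the function names and phase labels are hypothetical, and it does not describe the actual fix for the worker-health bug.

```python
def missing_phases(expected: list[str], completed: set[str]) -> list[str]:
    """Return expected phases that never reported completion, in order."""
    return [p for p in expected if p not in completed]


def nightly_health(expected: list[str], completed: set[str]) -> str:
    """Fail loudly: a skipped phase makes the whole run unhealthy."""
    missing = missing_phases(expected, completed)
    if missing:
        return "UNHEALTHY: skipped " + ", ".join(missing)
    return "healthy"


# Hypothetical phase names for illustration.
print(nightly_health(["retrieve", "tutor", "grade"], {"retrieve", "grade"}))
```

The point of the design is that "healthy" has to be earned by every expected phase reporting in, rather than being the default when nothing crashed.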
I'm going to keep building this in public. The thing the site shows has to be what's actually true about the system, or there's no point in showing anything at all.
What's next
In honest priority order:
- Test-graded mastery. Output goals can declare deterministic graders instead of relying entirely on LLM rubrics.
- More public-dataset sources. SQL, devops, security, AI, Linux.
- Capstone sessions for the as-yet-empty mastered tier.
- The public site as a living feed. Daily digests, per-day deltas, milestone tracking, setbacks, and live telemetry.
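For the test-graded item in the list above, one possible shape is an output goal that carries its own deterministic checks, with the LLM rubric used only when none are declared. `OutputGoal`, `grade`, and the fallback policy are assumptions; the planned design may combine the two graders differently.

```python
from dataclasses import dataclass, field
from typing import Callable

Check = Callable[[str], bool]


@dataclass
class OutputGoal:
    name: str
    # Deterministic checks the goal declares for itself; may be empty.
    graders: list[Check] = field(default_factory=list)


def grade(goal: OutputGoal, output: str, llm_rubric: Check) -> tuple[bool, str]:
    """Prefer declared deterministic graders; fall back to the rubric."""
    if goal.graders:
        return all(g(output) for g in goal.graders), "deterministic"
    return llm_rubric(output), "llm_rubric"
```

Returning the grader kind alongside the verdict keeps it visible which passes were verified deterministically and which still rest on a rubric.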
The discipline is to keep all of these gated on real signal. Don’t ship mastered until capstones can verify it. Don’t ship domain certified until the gateway exists.
That’s the deal I’m making with anyone following the project.
Compiled May 22, 2026 from notes, slot history, file timestamps, and memory.