Walden — spec-driven development vs vibe coding

From vibe coding to software that counts

If I called my mother tonight to ask what she’s making for dinner, I wouldn’t be surprised anymore by an answer like: «Nothing much, I’m building a brand-new memory system for agents — comment AGENT and get my definitive guide right away».

No, this is not some dystopian Black Mirror future. It’s June 2026!

Because that’s the point: the “how hard can it be, I’ll knock it out in five minutes” delusion has had us in its grip for a while now, and sadly we have to take this abuse even from people who have never written a single line of code. I spot them instantly: they have the face of someone who has discovered the truth, proudly wearing the Dunning-Kruger t-shirt. Poor souls, they’re just vibe coding.

F E R O C I O U S, yes, read it spaced out, because it sounds better. Slower. That’s what I become when I run into these mythological creatures.

Karpathy — not exactly the village idiot — says that vibe coding is that art where we sit down at the machine and follow the flow in natural language, passively accepting whatever the technology proposes. Well: between that mode and the software we ship to production there’s an abyss, because we, as professionals, actually ship it to production. And today I’m going to tell you what kind of abyss it is.

You know that moment when you switch on active noise cancellation and everything around you suddenly gets clearer and quieter? That’s what my team and I try to do at work: we try to cut through the incredible noise that rains down on us every single day — “MCP is dead, the 4B model that kills the frontier models, LLM Wiki murders RAG…” And I could go on for hours. That’s exactly the hard part: cutting out the noise and making room for the things that work.

My own pattern is always the same: the fear of something, combined with the unreasonable effort I put into studying it, eventually leads me to build tools.

The effort, in these cases, is not optional. It’s the only thing that counts.

So let’s start there. From my anxieties for 2026, and from the effort of these past months.

My two anxieties for 2026

The first — and I mean it, I really do — is the perpetual delay of GTA 6. Rockstar announced it in 2022, spent two billion dollars on it, it will be the masterpiece of modern software engineering, and I’ve already warned my bosses that when it ships I’ll vanish for two days, to savor the game that catapulted me into the video game industry — and into software altogether. The fear is that Rockstar will just keep postponing it…

The second one is serious: I’m afraid of flying. I’ve carried it with me ever since I had the thrill of flying the most dangerous air route in the world to reach the Himalayas — and right there I understood I’d be afraid forever.

My last trip was FOSDEM, in Brussels. Camped out all day in the AI dev room, and on the way back, sitting at the gate with my hands starting to sweat, I had the thought that made them sweat twice as much: in 2050 we’ll board that plane knowing that its software, almost certainly, will have been written by a machine.

And if the new source code is human language, how am I supposed to sit at ten thousand meters and feel calm?

So I started digging into the world of safety-critical software. And I discovered that the FADEC exists. I knew absolutely nothing about it.

Full Authority

FADEC stands for Full Authority Digital Engine Control: a box that lives inside the aircraft’s engines and governs everything that happens in there. It is an insanely complex, safety-critical, redundant event system (every engine carries two), and Full Authority means exactly what it sounds like: there is no manual override, and pilots cannot bypass it, not even through mechanical controls.

You trust it, and you fly.

So, the question: how on earth are the requirements for something like this written? Because getting a requirement wrong on a system like this means the plane goes down.

It works like this: every single line of FADEC code traces back to an explicit requirement, and every requirement is tested and signed off by an engineer. There is a signature. The certification standard is called DO-178C, and if that rigor in tying the specifications to the code doesn’t exist, you don’t get certified and the plane doesn’t fly.

It doesn’t fly. Period.

Pretty much the opposite of how I write software every day :D .

Implicit requirements are ticking time bombs: «I told you that button was supposed to open that pop-up», «what do you mean you didn’t know? It’s obvious». On a CRUD webapp these mistakes can slide. On software like this they cannot exist, and the times they did happen, it ended very badly.

The new source code

There’s a conceptualization by Andrej Karpathy (who waves at us from his multi-million-dollar yacht wearing an Anthropic t-shirt) that frames this moment perfectly.

Software 1.0: we struggled to learn languages and to build frameworks on top of them, and that was how we had always written code.

Software 2.0: neural networks which, by ingesting enormous amounts of data, ferried us toward 3.0. And in the software 3.0 era the source code is no longer the one we’re used to: it’s natural language.

When natural language becomes the source code, the vulnerability is no longer in the code. It’s in human language.

Think about it. When we walk into a café — «a coffee, please» — we’re confident the person on the other side fully gets the context, because human language carries a huge amount of non-verbal signal, and we take for granted that the listener fills in the gaps. Human language is incredibly ambiguous: that’s why it’s fascinating, and that’s why we argue over it. But today we talk to machines with that very language. It’s the difference between tossing over an unstructured JSON payload hoping the receiver guesses the types, and using a rigorously typed schema. Except now the unstructured payload is our sentences.

And the numbers are these: roughly 56% of software defects originate in the specifications. Not in the code. These systems now write thousands of lines in hours, and the METR curve says the length of the tasks an agent can complete on its own doubles roughly every seven months: today fifty-minute tasks, soon week-long tasks. The bottleneck has moved: it’s no longer how fast we generate code, it’s specifying exactly what we want the machine to do.
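To put a rough number on “soon”, here is a back-of-the-envelope sketch of that doubling curve. The function name is hypothetical, and reading “week-long” as a 40-hour working week is my assumption, not something the METR data states.

```go
package main

import (
	"fmt"
	"math"
)

// MonthsToReach returns how many months of steady doubling it takes for the
// task horizon to grow from `from` to `to` (both in minutes), given one
// doubling every `doublingMonths` months.
func MonthsToReach(from, to, doublingMonths float64) float64 {
	return math.Log2(to/from) * doublingMonths
}

func main() {
	const fiftyMinutes = 50.0
	const workWeek = 40 * 60.0 // assumption: "week-long" means 40 working hours

	fmt.Printf("%.0f months\n", MonthsToReach(fiftyMinutes, workWeek, 7))
	// prints "39 months"
}
```

Under those assumptions the gap is a bit more than three years, which is not much time to learn how to specify work of that size.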

More speed, less context: and this, honestly, filled me with anxiety. Because I’ve seen them with my own eyes, the twenty-, thirty-thousand-line pull requests generated with no spec-driven approach, without the craft of sitting down with the team and describing exactly what we want to build. That’s where cognitive debt is born: we’re no longer in control, not so much of what is written (twenty thousand lines are inhuman to read anyway) but of how the software is behaving.

And if the premise is agents that will work for hours and then for weeks, tell me: who sleeps soundly knowing they’ve left an agent working alone for six hours?

Not me. And when something scares me, you know by now how that ends.

When, who, what

When I started developing I was a kid — eight years old — and all that mattered to me was sitting at the machine and watching things happen. There was a translation from human language to machine language, the machine did things, and that was good enough for me. Then this became my profession, and I understood the importance of a well-written requirement: it’s the foundation of our industry, it’s what allows us to sell software.

So, what is an explicit requirement? It answers three questions, always the same ones.

When — under which condition or event the behavior is triggered.
Who or what — the system, the user, a specific component.
Does what — the expected behavior, observable and verifiable.

Three simple questions. If even one of the three is missing, that’s not a requirement: it’s an aspiration. It sounds obvious put like that — then go read your company’s Jira tickets. «Loading must be fast.» «The interface must be intuitive.» Fast is relative: under what latency? With how many users? Intuitive for whom? And on error? Eh, we’ll see. Behind words that vague hide enormous architectural decisions that someone will make at the worst possible moment: under pressure, with production down.
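The three questions can be treated as three mandatory fields. What follows is a minimal sketch, with a hypothetical `Requirement` type that is not taken from any tool, just to show how a ticket like «Loading must be fast» fails the check.

```go
package main

import (
	"fmt"
	"strings"
)

// Requirement is a hypothetical shape for the three questions.
type Requirement struct {
	When     string // condition or event that triggers the behavior
	Who      string // the system, the user, or a specific component
	DoesWhat string // the expected behavior, observable and verifiable
}

// Missing lists the questions the requirement leaves unanswered.
func (r Requirement) Missing() []string {
	var gaps []string
	if strings.TrimSpace(r.When) == "" {
		gaps = append(gaps, "when")
	}
	if strings.TrimSpace(r.Who) == "" {
		gaps = append(gaps, "who")
	}
	if strings.TrimSpace(r.DoesWhat) == "" {
		gaps = append(gaps, "does what")
	}
	return gaps
}

func main() {
	vague := Requirement{DoesWhat: "loading must be fast"}
	explicit := Requirement{
		When:     "the user opens the task list",
		Who:      "the API",
		DoesWhat: "responds within 300 ms",
	}
	fmt.Println(vague.Missing())    // [when who]
	fmt.Println(explicit.Missing()) // []
}
```

A check like this only proves the fields are filled in; whether «fast» has become a number is still a human call.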

A linter for natural language

And here comes EARS, which by the way has a beautiful origin story.

We’re inside Rolls-Royce, in Derby, England. Alistair Mavin and his team (with Wilkinson, Harwood and Novak) start studying the requirements manuals of engine controls: years of requirements written in natural English, years of things gone wrong. And in that field, “things gone wrong” means disasters.

What do they find? They identify eight ways of writing the same requirement badly, eight well-known patterns repeating everywhere: ambiguity («the system should be fast», top of the list of things we still read today), vagueness («handle errors correctly», which means nothing), complexity (sentences with four nested subordinate clauses), implicit assumptions (usage conditions never declared; for me, the worst one), omissions (missing edge cases and fallbacks), duplication, multiple requirements compressed into a single sentence, and undefined acronyms.

In 2009 they publish the paper introducing EARS, the Easy Approach to Requirements Syntax, the antidote to all these problems: the English language stripped to the minimum, constrained to a handful of templates, to write unambiguous requirements that the technical department can understand and trace.

When I discovered it I thought: this thing is genius, but it’s under-pushed, under-sponsored. And what existed of it in the AI world was, trivially, skills — «write in EARS format, act like a requirements engineer» — whose output, however, was not deterministic. It couldn’t be: agents, by definition, aren’t.

Park that thought: we’ll come back to it shortly.

The analogy that works best for me: EARS is a linter for natural language. A linter analyzes code and catches errors before execution. EARS does the same with language: it catches ambiguities before they become software.

Six forms, one rule

Let’s see them, each with a FADEC example and one from our dev life.

Ubiquitous: the invariant, the always-true, in any state or event.
The system SHALL <behavior>.
FADEC: the fuel/air ratio never exceeds certified limits, under any operating condition.
Dev life: every API response is JSON, with Content-Type: application/json.

Event-driven: fires after an event.
WHEN <trigger>, the system SHALL <response>.
FADEC: WHEN the sensor detects a redline threshold exceedance, cut fuel injection within twenty milliseconds. No ambiguity here: it doesn’t exist.
Dev life: WHEN the user clicks Submit, validate the form and show inline errors.

State-driven: the loop, active as long as a condition persists.
WHILE <precondition>, the system SHALL <response>.
FADEC: WHILE the engine is in takeoff mode, keep N1 within 0.5% of the certified target.
Dev life: WHILE the HTTP request is in flight, show the loading indicator. It sounds like a nuance, but it’s what separates a spinner that closes properly from one that spins forever.

Optional: depends on the presence of a feature.
WHERE <feature>, the system SHALL <response>.
FADEC: WHERE thrust reversers are installed, enable deployment only with wheels on the ground.
Dev life: WHERE 2FA is enabled, ask for the OTP after the password.

Unwanted behavior: error handling, the most catastrophic thing in our industry.
IF <trigger>, THEN the system SHALL <safe response>.
FADEC: IF a fuel flow sensor returns an out-of-range value, THEN switch to the backup and send an alert.
Dev life: IF the database doesn’t respond within five seconds, THEN return a 503 with a standard message.

Complex: the trickiest, state and event together.
WHILE <precondition>, WHEN <trigger>, the system SHALL <response>.
FADEC: WHILE the engine is in cruise, WHEN EGT exceeds 850°, reduce fuel flow by 8% and notify the FMS.
Dev life: WHILE the user is authenticated as admin, WHEN they open Users, show full data with edit options.

Side by side, the forms share one thing: zero ambiguity. And writing requirements this way lets us produce a design, and then tasks, that aren’t ambiguous either.
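Because the six forms are templates, a sentence can be checked against them mechanically. The sketch below is a deliberately naive classifier built on regular expressions. It assumes uppercase keywords and a literal "the <subject> SHALL" clause, and every name in it is illustrative; a real parser would be stricter.

```go
package main

import (
	"fmt"
	"regexp"
)

// form pairs an EARS form name with a regular expression for its template.
type form struct {
	name string
	re   *regexp.Regexp
}

// Order matters: the complex form must be tried before the state-driven one,
// because both start with WHILE.
var forms = []form{
	{"complex", regexp.MustCompile(`^WHILE .+, WHEN .+, the .+ SHALL .+$`)},
	{"event-driven", regexp.MustCompile(`^WHEN .+, the .+ SHALL .+$`)},
	{"state-driven", regexp.MustCompile(`^WHILE .+, the .+ SHALL .+$`)},
	{"optional", regexp.MustCompile(`^WHERE .+, the .+ SHALL .+$`)},
	{"unwanted behavior", regexp.MustCompile(`^IF .+, THEN the .+ SHALL .+$`)},
	{"ubiquitous", regexp.MustCompile(`^The .+ SHALL .+$`)},
}

// Classify returns the first form whose template matches the sentence.
// The boolean is false when no form matches.
func Classify(sentence string) (string, bool) {
	for _, f := range forms {
		if f.re.MatchString(sentence) {
			return f.name, true
		}
	}
	return "", false
}

func main() {
	for _, s := range []string{
		"WHEN the user clicks Submit, the system SHALL validate the form",
		"WHILE the user is authenticated as admin, WHEN they open Users, the system SHALL show full data",
		"Loading must be fast.",
	} {
		name, ok := Classify(s)
		fmt.Printf("%-17q %v\n", name, ok)
	}
}
```

The last sentence matches nothing, which is exactly the signal we want: it has no trigger, no subject and no verifiable response.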

But look closely at the fifth one, because real-world documents fall in love with the happy path and ignore the disasters — and that single omission causes most production outages. Without an explicit IF/THEN, an AI facing an error can invent an infinite retry loop and take everything down.

And then there’s the golden rule, the one to take home even if you sleep through the rest: if you can’t write your requirement in one of these forms, the requirement isn’t clear enough yet. With one merciless footnote: one behavior per acceptance criterion. The conjunction «and» is banned. You cannot write «the system must save the data and send the email». Ever. Because six months from now a test fails, and you won’t know whether it was the database or the SMTP. Two independent requirements, instant diagnosis.
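The «and» ban is easy to approximate in code. The sketch below is a crude heuristic with a hypothetical function name: it flags any standalone "and" in the clause after SHALL, so it will also reject harmless phrases such as "read and write access". That strictness is a choice, not a law.

```go
package main

import (
	"fmt"
	"strings"
)

// SingleBehavior reports whether the clause after SHALL describes exactly one
// behavior. Any standalone "and" in that clause counts as a second behavior.
// Sentences without a SHALL clause are rejected as well.
func SingleBehavior(criterion string) bool {
	_, response, found := strings.Cut(criterion, " SHALL ")
	if !found {
		return false
	}
	for _, word := range strings.Fields(strings.ToLower(response)) {
		if strings.Trim(word, ".,;:") == "and" {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(SingleBehavior("WHEN the user saves, the system SHALL save the data and send the email")) // false
	fmt.Println(SingleBehavior("WHEN the user saves, the system SHALL persist the data"))                  // true
	fmt.Println(SingleBehavior("WHEN the data is persisted, the system SHALL send the email"))             // true
}
```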

In the age where we’re all vibe coding, this stuff is gold: dictating a standard for agents that will work for hours and days is the foundation the new, solid software has to stand on.

I went to the woods

OK, the theory is beautiful. But when I went looking at the spec-driven frameworks out there today, I found a weakness they all share: they let themselves be dragged along by something we’ve all been subjected to. The mass adoption of these tools was handed down by the industry while hiding the real cost of inference — and not having to worry about cost gave us systems that produce tons of markdown nobody will ever read. «Yeah, but you don’t read the markdown anyway», they tell me. True. But I’m convinced of one thing: tokens will be the new kilowatt-hour. So systems must be designed not to be wasteful.

Walden was born exactly here — and I told you, my pattern is always the same: the fear of flying led me to the FADEC, the fear of agents left alone for hours led me to build a tool. The goal: make spec-driven incredibly lean. But deterministic.

The name, though — that one isn’t anxiety. I was raised on bread and Thoreau — that’s why it’s called Walden. Walden; or, Life in the Woods holds the line I use as a compass: «I went to the woods because I wished to live deliberately, to front only the essential facts of life». My grandfather taught me that principle before I had words for it: do fewer things, but do them with full attention. Software, by nature, does the exact opposite.

Technically it’s a spec-driven delivery kernel: a Go CLI, zero dependencies, open source under Apache-2.0. Between an idea and production there’s an ocean: we start from the idea, «build me a todo app», hand it to the agent, and what comes out cannot go to production, because in between, for the last forty years, we’ve always done a lot of other things.

Walden introduces a mandatory gate on every phase: Requirements, Design, Tasks, Execute. The skill teaches the model how to use the CLI and drafts the requirements in EARS format; but since the model isn’t deterministic, we have no guarantee the draft truly conforms, and here the CLI steps in, deterministically validating it, with a parser, against those templates. Requirements valid? On to design, where every requirement is exploded into its acceptance criteria. Design approved? Tasks, where every leaf cites the criteria IDs (R1.AC1, R1.AC2…) and carries its verification commands. Only at the very end, the code. It’s the CLI-plus-LLM pattern that’s becoming a first-class citizen, just like MCP before it.
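The link between tasks and criteria IDs is what makes the chain checkable. As an illustration only (this is not Walden’s code, and the `Task` type and `Untraced` function are hypothetical), a traceability check could look like this: every criterion must be cited by at least one task, and no task may cite a criterion that doesn’t exist.

```go
package main

import (
	"fmt"
	"sort"
)

// Task is a hypothetical leaf task: it cites acceptance criteria by ID and
// carries the commands that verify it.
type Task struct {
	Title    string
	Criteria []string // IDs such as "R1.AC1"
	Verify   []string // verification commands
}

// Untraced returns the criteria no task cites (uncovered) and the cited IDs
// that match no known criterion (unknown), both sorted.
func Untraced(criteria []string, tasks []Task) (uncovered, unknown []string) {
	cited := make(map[string]bool, len(criteria))
	for _, id := range criteria {
		cited[id] = false // known, not yet cited
	}
	for _, t := range tasks {
		for _, id := range t.Criteria {
			if _, known := cited[id]; !known {
				unknown = append(unknown, id)
				continue
			}
			cited[id] = true
		}
	}
	for id, ok := range cited {
		if !ok {
			uncovered = append(uncovered, id)
		}
	}
	sort.Strings(uncovered)
	sort.Strings(unknown)
	return uncovered, unknown
}

func main() {
	criteria := []string{"R1.AC1", "R1.AC2", "R2.AC1"}
	tasks := []Task{
		{Title: "create task endpoint", Criteria: []string{"R1.AC1"}, Verify: []string{"go test ./..."}},
		{Title: "list tasks endpoint", Criteria: []string{"R2.AC1", "R9.AC1"}, Verify: []string{"go test ./..."}},
	}
	uncovered, unknown := Untraced(criteria, tasks)
	fmt.Println(uncovered, unknown) // [R1.AC2] [R9.AC1]
}
```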

No more «I thought it was obvious».

Intentional friction

And what if I change my mind halfway through? We’ve reached the tasks and I say: no, no relational database, I want NoSQL. The whole chain goes stale. Meaning? You redo the requirements, re-approve them, redo the design for the part you invalidated, and redo the tasks. Consistency is guaranteed by a chain of timestamps: every downstream document, when approved, records the upstream document’s approved_at — and if anything changes upstream, nothing matches downstream anymore. The system blocks until a walden reconcile realigns the chain.

This one — I took it from Apple — is a feature, not a bug :) .

And here’s the point I will underline in blood: the one thing we must not do is accept everything the model says, always, no matter what. That amounts to structured vibe coding. All these approval phases are deliberately mechanical, they add friction on purpose: they exist to make the human accountable for the requirements and the designs they signed. Because our job will no longer be sitting there writing the code: it will be going back to gathering around a table with our colleagues, deciding the specs, writing the design and the tasks — ideally with the biggest model we have, because that’s the hard part — and then letting the tasks run.

Human in the Reasoning

There’s another thing I had to dismantle: good old human in the loop. If agents are going to work for weeks, human in the loop means you see what was produced after a week. That’s just not realistic — especially in the requirements, design and tasks phases. From a 2023 Peking University paper comes the right concept: Human in the Reasoning — the human inside the flow of reasoning, not at the end of it.

In Walden it’s called the Decision Checkpoint Protocol, and its core is a bifurcation test: while drafting, if the agent hits a decision that strays too far from the goal, it stops and asks the human a question. The example I always use: I need to get to Rome. If the agent takes the Rome–Florence highway, that’s too far from the goal: it stops and asks «are you really sure we should take this road?». If the detour is seven hundred meters, it decides on its own and keeps going. An explicit marker stays in the document, [decision: <the question>], and, if you tell it to decide by itself, a record of what it chose stays there too, in black and white.
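Since the marker has a fixed shape, open decisions can be pulled out of a draft mechanically. The sketch below assumes only the [decision: <the question>] syntax quoted above; the function name and the idea of listing the questions before approval are my illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// decisionRe matches markers of the form [decision: <the question>].
var decisionRe = regexp.MustCompile(`\[decision:\s*([^\]]+)\]`)

// OpenDecisions returns the questions found in decision markers, in order.
func OpenDecisions(draft string) []string {
	var questions []string
	for _, m := range decisionRe.FindAllStringSubmatch(draft, -1) {
		questions = append(questions, strings.TrimSpace(m[1]))
	}
	return questions
}

func main() {
	draft := `R3: task priority
[decision: how do we model task priority?]
R4: categories
[decision: are categories a CRUD entity?]`

	for _, q := range OpenDecisions(draft) {
		fmt.Println("ask the human:", q)
	}
}
```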

And then there’s memory, because if I answer a checkpoint and the next round the agent asks me the same question again, I’ve merely relocated the frustration. Cognitive debt must be turned into cognitive capital, and Walden does it with two files that live inside git, in the .walden directory. The Constitution: the hard rules of the software — tech stack, conventions, architectural decisions — things that technically don’t change, but can evolve with the system. And the Lessons: the registry of the agent’s mistakes, structured in three parts — the trigger (what happened), the lesson (the pattern to avoid), the guardrail (what to check next time) — which the skill re-reads before any non-trivial work. The system learns from the mistakes it made and tries not to repeat them.
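The three-part structure of a lesson maps naturally onto a record. This is a sketch of the idea only: the `Lesson` type and `Checklist` function are hypothetical, and the real file format in the .walden directory may look nothing like it.

```go
package main

import (
	"fmt"
	"strings"
)

// Lesson mirrors the three parts described above.
type Lesson struct {
	Trigger   string // what happened
	Lesson    string // the pattern to avoid
	Guardrail string // what to check next time
}

// Checklist turns the guardrails into items to re-read before non-trivial
// work, skipping lessons that have no guardrail yet.
func Checklist(lessons []Lesson) []string {
	var items []string
	for _, l := range lessons {
		if strings.TrimSpace(l.Guardrail) == "" {
			continue
		}
		items = append(items, "[ ] "+l.Guardrail)
	}
	return items
}

func main() {
	lessons := []Lesson{
		{
			Trigger:   "migration ran against the wrong SQLite file",
			Lesson:    "never hardcode the database path",
			Guardrail: "confirm the database path comes from configuration",
		},
		{Trigger: "flaky test on CI", Lesson: "avoid sleeping in tests"},
	}
	for _, item := range Checklist(lessons) {
		fmt.Println(item)
	}
	// prints "[ ] confirm the database path comes from configuration"
}
```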

And I’m not the only one saying this. Erik Schluntz, from Anthropic, in Vibe coding in prod says things I could sign line by line: the twenty-two-thousand-line merges survive in production because the planning came first, with verifiable checkpoints — not «is the code right?» but «does the system behave correctly?». Meanwhile Karpathy has joined Anthropic, saying verbatim that he’s there to fix agents: I expect that within a few months standards will rain down on us — for orchestrators, for memories — just like they did with MCP and skills. And thank goodness: because right now we’re all burning tokens rewriting the same thing.

Does It Work?

Why open the README with yet another benchmark table? Doubting Thomas should be your spirit guide in this jungle of vibe-coded open source!

TRY IT YOURSELF.

git clone https://github.com/andrearaponi/walden.git
cd walden
./setup.sh

The script builds the binary, installs it into ~/.local/bin/walden and offers you the skill for Claude Code, Codex or Copilot (at work we use it with Copilot: it’s agnostic — today we use one coding agent, tomorrow we’ll use another). Then it goes like this:

/walden Let's build a personal, single-user todo app: CRUD on tasks, categories,
due date, priority, mark as completed. Golang 1.26, TDD fail-fast, SQLite,
OpenAPI with RFC 7807 problem details, frontend embedded in Go with React and
Tailwind. Authentication out of scope.

The skill kicks in, finds the binary, initializes the repo with a Constitution and Lessons to fill in, and the loop you know by now begins: requirements in EARS, CLI validation, bifurcation test («how do we model task priority?» — enum; «categories?» — CRUD entity), approval, design, validation again, tasks with their verification commands. The result: 11 requirements, 59 well-formed acceptance criteria — 33 event-driven, 16 unwanted behavior, 10 ubiquitous — zero warnings. The model rarely gets it wrong; when it does, the CLI raises the error, and the error tells you how to fix it. It’s a parser, after all.

Do the planning with the strongest reasoning model you have, because that’s the critical part. Execution, no: when the agent runs execute it reads the requirement, the design, the task and the verification — implementation becomes a mere translation into code, and a weaker model can do it, even a local one if you can afford a beefy machine.

It works.

And bugs? I’m the bold-opinions guy: if the fix changes neither requirements nor design — a broken listener on a button — no ceremony, just fix it. If fixing it means touching the requirements, the bug was never in the code :) .

Oh, and Walden is actually designed to be orchestrated: one agent that prepares the context, an executor that implements, a third one doing code review. But that’s material for another two, maybe three, articles. That’s all there is — no enterprise behemoth. Literally a kernel.

No airplane has ever been vibe-coded

What to take home, in the order that counts. First: the bottleneck is no longer writing code, it’s knowing exactly what we want to build. Second: EARS is the linter for natural language — six forms, and they’re enough. Third: our craft is shifting from writing code to designing and verifying systems — speed belongs to the machine, logical precision is still on us. Fourth: cognitive debt becomes cognitive capital only if you give it structure — Constitution, Lessons, guardrails.

Vibe coding — sitting at the machine and passively accepting the flow — is incompatible with systems that must stay standing over time. You need requirements, you need EARS, you need gates and checkpoints. You need structured cognitive effort. Otherwise we’ll just keep building — faster and faster — paper infrastructure.

Luckily, no airplane has ever been vibe-coded. And I hope your next project won’t be either.

Walden is young and the roadmap is long: if you have criticism, doubts or suggestions, feel free to reach out.

I went to the woods.
