Specs: The Tests You Didn't Write

There’s something circular about the way we write tests. We think of a scenario that could break, write a test for it, watch it pass, and ship. Then a bug shows up in production, and it’s almost always a scenario nobody thought to test for in the first place.

This bothered me when I first started writing tests, and it still does. Tests only catch the bugs you already thought of, so the bugs that reach production are, almost by construction, the ones nobody wrote a test for. The test only gets added after the fact, once the damage is done.

I’ve been trying something different. Instead of writing test scripts, I write product specs in markdown and let an AI agent run them.

What’s in the spec/ directory?

The setup is three files and a features folder:

spec/
  seed.md
  running.md
  writing.md
  features/
    invitations.md
    amenities.md
    communications.md
    ...

The feature files describe what the product does. The other three files are the infrastructure that makes those descriptions executable. That’s it. No test framework, no assertion library, no page objects. Just markdown that an agent can read and act on.

seed.md: the world

Before the agent can test anything, it needs a world to test against. seed.md defines that world.

The app I’m building manages gated residential communities. Think of it like a homeowners association tool: residents invite guests, guards scan QR codes at the gate, admins broadcast announcements, people book shared amenities. The seed defines one pre-built community with a few members:

## Seeded community

### Members

| User     | Role          |
| -------- | ------------- |
| Camila   | Admin         |
| Franco   | Guard         |
| Julieta  | Resident      |

### Amenities

| Name         | Requires approval |
| ------------ | ----------------- |
| Tennis court | No                |
| Pool         | Yes               |
| BBQ area     | No                |

Members, roles, amenities, units. Everything a feature spec might need as a starting point. The key rule: seed data is immutable. Tests can read it, but if a test needs to mutate something, it creates its own data and cleans up afterward. This means every spec can run independently against a fresh seed, in any order, without one spec’s side effects breaking another.

There’s also a set of unassigned users for specs that test account creation. The onboarding spec uses some of them to create a new community from scratch. The member management spec uses another one as an invitation target. No user is shared across specs in a way that could create conflicts.

Feature specs: product descriptions, not test scripts

Here’s where it gets interesting. A feature spec doesn’t read like a test. It reads like a product document that happens to be precise enough to execute.

From the invitations spec:

## Goal

Residents can invite guests to their community. A guest receives
a public link to fill in their details and get a QR code. The
guard at the gate scans the code to verify and record entry.
The resident is notified when the guest completes their details
and when they arrive.

That’s a product statement. It describes what the feature does and why it matters, not “verify that the invitation button works.” The spec then walks through the workflow from each persona’s perspective:

### 1. Resident creates a temporary invitation

1. Log in as the resident and open the create invitation page.
   Expected: the form loads with the resident's unit pre-selected.
2. Choose a temporary invitation and set a valid date range.
   Expected: the invitation is created and a shareable link appears.

### 2. Guest completes their details

1. Open the invitation link in a fresh browser (no login required).
   Expected: a public form loads asking for name and ID number.
2. Submit the guest details.
   Expected: a QR code appears for the guest to show at the gate.

### 3. Guard scans the QR code

1. Log in as the guard and open the guard interface.
2. Scan or enter the QR token.
   Expected: the guest's name, ID, unit, and host are displayed.
3. Record the guest's entry.
   Expected: the access log updates with the timestamp.

Three personas, one flow, each step grounded in what a real person would do. But here’s the part that matters most. Every spec ends with something like this:

## Cross-persona expectations

- Revoked invitations are immediately invalid for guest
  completion and guard scanning.
- The resident is notified when the guest submits their
  details and again when the guard records entry.
- The guard cannot see resident-only features like amenity
  booking or invitation creation.

These are invariants for the agent to probe, not scripted assertions. “Revoked invitations are immediately invalid” doesn’t say what to click or what error message to expect; it describes a property that should hold, and the agent figures out how to test it.

running.md: teaching the agent to be a tester

The feature specs describe what to test. running.md describes how to test. The agent has access to a real browser through agent-browser, so a run starts with me telling it something like “run the spec for invitations.” The agent reads running.md to understand the testing discipline it should follow:

## Execution flow

1. Pick the target feature file in spec/features/.
2. Read spec/seed.md to resolve persona assignments.
3. Execute the scenario in persona order, keeping each
   browser context isolated.
4. After each major step, verify the expected result for
   the acting persona and any affected downstream persona.
5. Probe at least the obvious nearby non-happy flows
   before moving on.
6. Log every deviation immediately.
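Steps 1 and 2 are the only mechanical part of that flow, so they are easy to sketch. The snippet below is a hypothetical preflight helper, not part of the setup described here: it assumes the spec/ layout shown earlier, and the names context_files and missing_files are made up for illustration. It lists the files the agent would read for one feature and reports any that are absent.

```python
from pathlib import Path

# Assumption: the spec/ layout shown earlier in the post.
SPEC_DIR = Path("spec")


def context_files(feature: str, spec_dir: Path = SPEC_DIR) -> list[Path]:
    """Return the files an agent would read for one feature run, in order."""
    return [
        spec_dir / "running.md",                   # testing discipline
        spec_dir / "seed.md",                      # persona assignments (step 2)
        spec_dir / "features" / f"{feature}.md",   # target feature (step 1)
    ]


def missing_files(paths: list[Path]) -> list[Path]:
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not p.is_file()]


if __name__ == "__main__":
    files = context_files("invitations")
    for path in files:
        print(path)
    for path in missing_files(files):
        print(f"missing: {path}")
```

Everything after step 2 is the agent’s job, which is the point: the only thing worth scripting is checking that the inputs exist.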

Step 5 is where this diverges from traditional testing. “Probe at least the obvious nearby non-happy flows” is an open-ended instruction that pushes the agent past a fixed script and into thinking about what could go wrong in the neighborhood of whatever it just tested. The resident just created an invitation successfully? What happens if the guest tries to submit the form twice? What if the guard scans an expired QR code? What if a resident tries to access the guard interface?

The file also establishes the testing discipline: use the browser as a real user would, and don’t inspect source code to figure out the next step. If the UI path is unclear, that’s a finding, not a reason to go read the implementation.

When the agent finds something, it logs the finding with a type and a severity:

### [short title]
- Type: `bug` | `ux` | `product-gap`
- Severity: `high` | `medium` | `low`
- Expected: [what should happen]
- Observed: [what actually happens]

Three categories instead of a single pass/fail. A bug is something broken. A ux issue is something that works but feels wrong. A product-gap is something that should exist but doesn’t. That last one is interesting because no traditional test suite has a category for “this feature is missing.” An agent operating from a product spec can notice that a workflow dead-ends, that there’s no way to get from A to B without leaving the interface, that a confirmation message is confusing. Things a human tester would flag but that a Playwright script would never catch.
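If you wanted to post-process the findings log, the format above maps onto a small data structure. This is a minimal sketch under my own assumptions: the Finding class and its to_markdown method are hypothetical names, and the example finding borrows the blank-name case mentioned later in this post, with a severity picked purely for illustration.

```python
from dataclasses import dataclass
from typing import Literal, get_args

FindingType = Literal["bug", "ux", "product-gap"]
Severity = Literal["high", "medium", "low"]


@dataclass(frozen=True)
class Finding:
    title: str
    type: FindingType
    severity: Severity
    expected: str
    observed: str

    def __post_init__(self) -> None:
        # Literal types are not enforced at runtime, so validate explicitly.
        if self.type not in get_args(FindingType):
            raise ValueError(f"unknown finding type: {self.type!r}")
        if self.severity not in get_args(Severity):
            raise ValueError(f"unknown severity: {self.severity!r}")

    def to_markdown(self) -> str:
        """Render the finding in the log format from running.md."""
        return "\n".join([
            f"### {self.title}",
            f"- Type: `{self.type}`",
            f"- Severity: `{self.severity}`",
            f"- Expected: {self.expected}",
            f"- Observed: {self.observed}",
        ])


if __name__ == "__main__":
    # Illustrative finding; the severity is an assumption.
    print(Finding(
        title="Guest form accepts a blank name",
        type="bug",
        severity="medium",
        expected="the form rejects an empty name field",
        observed="the form submits and a QR code is issued",
    ).to_markdown())
```

Keeping the type as a closed set of three values is what lets you later filter the log for, say, only the product gaps.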

writing.md: making specs composable

The third infrastructure file, writing.md, codifies how to write new specs so they stay consistent and composable. The key principles:

Reference roles, not people. Specs say “the seeded admin” and “the seeded resident,” not “Camila” or “Julieta.” The runner resolves names from seed.md. If you swap a user in the seed, every spec adapts automatically.
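In practice the agent does this resolution itself by reading seed.md, but the lookup is simple enough to sketch. The snippet below is an illustration under stated assumptions, not the actual runner: parse_table and resolve_role are hypothetical helpers, and the table is the Members excerpt from the seed shown earlier.

```python
# Members table copied from the seed.md excerpt earlier in the post.
SEED_MEMBERS = """\
| User     | Role          |
| -------- | ------------- |
| Camila   | Admin         |
| Franco   | Guard         |
| Julieta  | Resident      |
"""


def parse_table(markdown: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited markdown table into a list of row dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.strip().splitlines()
    ]
    header, _separator, *body = rows
    return [dict(zip(header, row)) for row in body]


def resolve_role(role: str, members: list[dict[str, str]]) -> str:
    """Map a phrase like 'the seeded resident' (role='resident') to a user."""
    matches = [m["User"] for m in members if m["Role"].lower() == role.lower()]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one seeded {role!r}, found {len(matches)}"
        )
    return matches[0]


if __name__ == "__main__":
    members = parse_table(SEED_MEMBERS)
    print(resolve_role("resident", members))
```

Because specs only ever name the role, changing the User column in the seed changes who every spec logs in as, with no edits to the specs themselves.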

Seed data is immutable. If a spec needs to create an invitation, have it scanned, and then revoke it, that’s fine. The spec creates its own invitation rather than mutating seed data. This is what makes specs independently runnable.

Specs read as product specifications. The writing guide is explicit about this: “declarative statements about cause and consequence, not test scripts.” The goal section describes what the feature does, not what the test verifies. This framing matters because it changes what the agent looks for: a test script asks whether a button works, while a product spec asks whether the feature accomplishes what it’s supposed to.

What does this actually catch?

The honest answer is that this catches different things than a test suite, not necessarily more. A well-written integration test will always be more reliable for regression testing. It runs in CI, it’s deterministic, it fails loudly. Specs don’t replace that.

What specs catch are the things that never make it into a test suite in the first place. The edge case where the guard scans a QR code that the resident revoked five minutes ago and the error message just says “invalid.” The flow where a resident books an amenity and then the admin deletes it, and the resident’s booking page breaks. The invitation form that accepts a blank name field.

These are the bugs that live in the gaps between the features you thought about and the interactions you didn’t. A test suite can’t cover them because you’d need to know about them first. An agent working from a product spec can stumble into them the same way a real user would, by trying something slightly off the happy path and seeing what happens.

That’s the shift I’m interested in. Tests are assertions about behavior you already know exists; a spec is closer to a travel guide that an agent uses to wander around parts of the product you haven’t thought to check, and sometimes it finds things nobody put on the map.

Where this gets interesting

The specs are just markdown. They change as the product changes. When I add a feature, I write a spec for it the same way I’d write a product brief. The spec is useful on its own as documentation, as onboarding material, as a reference for what the product actually does. The fact that an agent can also execute it is almost a side effect.

And because the agent is reading natural language, not parsing a test DSL, you can express things that don’t fit neatly into assertions. “Each amenity card shows meaningful information including the next bookable time” is a real line from a spec. How would you write a Playwright assertion for “meaningful information”? You probably wouldn’t. But an agent looking at the page can tell you whether the card is helpful or empty.

This isn’t something I run in CI. There’s no green checkmark or red X. I run specs after finishing a feature, after changing an existing one, or sometimes just on a quiet afternoon when I want to find rough edges to polish. The output is a log of findings I read through, not a gate that blocks a deploy.

I’m still early with this. The agent misses things, it sometimes gets confused by multi-step flows, and the runs take longer than a test suite. But the things it finds tend to be the kind of bugs and rough edges that would otherwise only surface when a real person uses the product, which is the category of problems we’ve always been worst at catching.