Building BrainGrid with BrainGrid: Spec-Driven Development with Claude Code

How we ship features in under an hour — from half-baked idea to tested, deployed code — using spec-driven development with BrainGrid and Claude Code.

Tyler Wells
15 min read

#48 Minutes

Here's what a typical feature build looks like for us:

10:02  /specify "Add time range filter to credits health dashboard"
10:03  → REQ-287 created with 8 acceptance criteria

10:03  /build REQ-287
10:03  → Branch: feature/REQ-287-time-range-filter, 5 tasks linked
10:04  → Task 1: Add date range picker component (in progress)
10:11  → Task 1 complete, validation passed
10:11  → Task 2: Update API endpoint with date params (in progress)
10:18  → Task 2 complete, validation passed
       ...
10:42  → All 5 tasks complete, yarn validate:fix passes

10:42  → PR created: "feat: add time range filter to credits health dashboard (#1287)"
10:43  → braingrid requirement review
10:44  → 8/8 acceptance criteria validated against PR diff
10:45  → Code review approved, merged to dev

10:46  → Agent writes test spec: time-range-filter.md
10:47  → agent-browser opens dev deployment
10:48  → Database seeded with test data for known date ranges
10:48  → agent-browser navigates to dashboard, selects "Last 7 days"
10:49  → agent-browser snapshots table, verifies filtered results
10:50  → Database query confirms filter matches actual data
10:50  → Test spec updated: status: PASSED

48 minutes. Idea to tested, merged feature. The spec took 60 seconds. The human reviewed diffs, approved the PR, and made one adjustment to a component's padding. The test ran against the live dev deployment — not a mock, not a stub — with real data verified at the database level.

Every feature in BrainGrid is built this way. Not as a marketing exercise — because it's the fastest workflow we've found. This post explains how it works.

#The Problem

Vibe-coding works until you merge it, deploy it, and realize you forgot the error state. Or the loading state. Or what happens when the user has zero credits and clicks the button anyway.

You didn't think about those cases because you're moving fast — and that's fine for prototypes. But the AI didn't think about them either, because you never told it to. It built exactly what you described: the happy path. Everything else is missing.

The fix isn't to slow down or stop using AI. It's to give the AI a requirement that already covers the edge cases, error handling, and loading states — so you don't have to. The AI that writes your spec thinks like a product engineer. It asks the questions you'd skip. The AI that implements the spec executes like a production software engineer, because that's what a professional requirement demands.

That's the entire philosophy: start with requirements, not code.

#The Workflow: Four Commands

#1. /specify — Turn an idea into a requirement

1/specify "Trial users should see upgrade prompt instead of buy credits when out of credits"

AI refines your one-liner into a structured requirement. Here's what REQ-375 looked like after /specify:

REQ-375: Trial users should see upgrade prompt instead of buy credits

Problem:
  Trial organizations see the same credit exhaustion messaging as paid orgs,
  offering "Top-up credits" — but trial users can't purchase credit packs.

Solution:
  Detect trial status, display trial-appropriate messaging with only the
  upgrade CTA. Paid orgs continue seeing both options.

Components to modify:
  - out-of-credits-banner-wrapper.tsx (fetch subscription status)
  - out-of-credits-banner.tsx (conditional render by trial status)
  - low-credits-banner.tsx (agent overlay variant)

Acceptance Criteria:
  ✓ Trial org + 0 credits → top banner shows "Your trial has run out
    of credits" with only "Upgrade plan" button
  ✓ Trial org + 0 credits → agent overlay shows "Upgrade now" only
  ✓ Paid org + 0 credits → shows both "Top-up credits" and "Upgrade plan"
  ✓ Trial org + 1-50 credits → standard low credits message (not trial-specific)
  ✓ Loading state → show paid behavior until trial status confirmed
  ✓ Error state → fall back to paid behavior, log error
  ✓ Status changes reflect consistently across both banners

That's condensed. The full requirement also included a data fetching strategy (React Query with 5-minute stale time), props interfaces, error/loading state specifications, and a message variation table mapping every condition to its banner variant, message copy, and CTA buttons.

A well-structured requirement has these components:

  • Problem statement — what's broken or missing, in user-facing terms
  • Solution summary — the approach, not the implementation
  • Scope — which files/components are affected (so the AI doesn't wander)
  • Acceptance criteria — testable given/when/then conditions that define "done"
  • Edge cases and error handling — loading states, failures, boundary conditions
  • Out of scope — what this requirement deliberately doesn't cover

The AI generates all of this from a single sentence. You type one line, the AI writes the full spec — problem statement, acceptance criteria, edge cases, scope — and you review it. Most of the time it's 80% right. You fix the 20% and move on; how often specs need real correction, and what that looks like, is covered in "When It Breaks" below.

#2. /breakdown — Turn the spec into tasks

/breakdown REQ-375

This is more than "split the work into chunks." The AI assembles context from three sources: the full requirement (acceptance criteria, edge cases, technical decisions), your codebase structure (repository analysis, file tree, existing patterns), and related documentation. It then generates atomic implementation tasks — each scoped to a single concern — with explicit dependencies between them. The AI knows which files exist, which hooks and components are already in your codebase, and how they're structured.

Here are the actual tasks generated for REQ-375:

TASK-1: Create useSubscriptionStatus hook
  → New hook: src/hooks/use-subscription-status.ts
  → Fetch from /api/organizations/[orgId]/subscription
  → React Query: cache key ['subscription-status', orgId], staleTime 5min
  → Return { isTrialSubscription, isLoading, error }
  → On error: default isTrialSubscription to false (safe fallback)

TASK-2: Update out-of-credits-banner with trial support
  → File: src/components/out-of-credits-banner/out-of-credits-banner.tsx
  → Add isTrialSubscription: boolean prop
  → Trial + 0 credits: render "Your trial has run out of credits" + "Upgrade plan" only
  → Paid + 0 credits: keep both "Top-up credits" and "Upgrade plan" (existing behavior)

TASK-3: Update out-of-credits-banner-wrapper
  → File: src/components/out-of-credits-banner/out-of-credits-banner-wrapper.tsx
  → Call useSubscriptionStatus hook
  → Pass isTrialSubscription to banner component
  → While loading: default to paid behavior (no flicker)

TASK-4: Update low-credits-banner with trial support
  → File: src/components/agent/agent-pane/low-credits-banner.tsx
  → Trial + 0 credits: show "Upgrade now" only
  → Trial + 1-50 credits: standard "You have only X credits left" (not trial-specific)

TASK-5: Run validation
  → yarn validate:fix (type-check + lint + format + test)

These aren't vague tickets. They're prompts — each one tells the agent exactly which file to modify, which pattern to follow, which prop to add, and what the expected behavior should be. The spec already made the design decisions (cache key, stale time, fallback behavior), so the tasks are pure execution.
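
To make that concrete, here is roughly what TASK-1 could come out as. A minimal sketch, assuming React Query v5 and a plain fetch call; the response shape and exact file contents are assumptions for illustration, not our actual code.

// src/hooks/use-subscription-status.ts (hypothetical sketch, not the real file)
import { useQuery } from '@tanstack/react-query';

// Assumed response shape from /api/organizations/[orgId]/subscription
interface SubscriptionStatusResponse {
  isTrialSubscription: boolean;
}

export function useSubscriptionStatus(orgId: string) {
  const { data, isLoading, error } = useQuery({
    // Cache key and stale time come straight from the task description
    queryKey: ['subscription-status', orgId],
    queryFn: async (): Promise<SubscriptionStatusResponse> => {
      const res = await fetch(`/api/organizations/${orgId}/subscription`);
      if (!res.ok) throw new Error(`Subscription fetch failed: ${res.status}`);
      return res.json();
    },
    staleTime: 5 * 60 * 1000, // 5 minutes
  });

  return {
    // On error or while loading, default to false so paid behavior is the safe fallback
    isTrialSubscription: data?.isTrialSubscription ?? false,
    isLoading,
    error,
  };
}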

#3. /build — Start implementing

/build REQ-375

This runs a four-step flow:

  1. Fetches the build plan from BrainGrid with requirement details and the full task array
  2. Creates a feature branch — feature/REQ-375-trial-upgrade-prompt — and associates it in BrainGrid so everything is linked
  3. Creates and links tasks in Claude Code, connecting each to BrainGrid so status syncs automatically
  4. Starts implementing the first task immediately — no "shall I proceed?" prompts

The agent picks up tasks sequentially — implements the code, runs yarn validate:fix, and if validation passes, marks the task complete and moves to the next. If validation fails, it reads the error, fixes the issue, and re-runs before moving on. You watch in real time and course-correct when needed.

You can steer focus by appending instructions: /build REQ-375 start with the data fetching hook — Claude adjusts task priority accordingly.
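
For a sense of what the agent actually writes during /build, here is a simplified, hypothetical sketch of the TASK-2 change. The markup is a plain-HTML stand-in and the props are invented; only the conditional logic mirrors the task description.

// out-of-credits-banner.tsx (simplified, hypothetical sketch of the TASK-2 change)
interface OutOfCreditsBannerProps {
  isTrialSubscription: boolean;
  onUpgrade: () => void;
  onTopUp: () => void;
}

export function OutOfCreditsBanner({
  isTrialSubscription,
  onUpgrade,
  onTopUp,
}: OutOfCreditsBannerProps) {
  if (isTrialSubscription) {
    // Trial + 0 credits: trial-specific copy, upgrade CTA only
    return (
      <div role="alert">
        <p>Your trial has run out of credits</p>
        <button onClick={onUpgrade}>Upgrade plan</button>
      </div>
    );
  }

  // Paid + 0 credits: existing behavior, both CTAs
  return (
    <div role="alert">
      <p>You are out of credits</p>
      <button onClick={onTopUp}>Top-up credits</button>
      <button onClick={onUpgrade}>Upgrade plan</button>
    </div>
  );
}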

#4. braingrid requirement review — Validate acceptance criteria against the PR

braingrid requirement review

This is where the spec earns its keep. The command auto-detects the requirement from the branch name and the PR number from git, then uses AI to perform the review. It fetches the PR diff from GitHub and the full requirement with acceptance criteria from BrainGrid, then the AI reasons about whether each criterion is satisfied by the actual code changes — tracing criteria to specific lines in the diff:

Reviewing PR #1288 against REQ-375...

Acceptance Criteria:
  ✅ Trial org + 0 credits → banner shows "Upgrade plan" only
     → out-of-credits-banner.tsx:42 — conditional render on isTrialSubscription
  ✅ Paid org + 0 credits → shows both buttons
     → out-of-credits-banner.tsx:38 — default branch renders both CTAs
  ✅ Loading state → paid behavior until status confirmed
     → out-of-credits-banner-wrapper.tsx:18 — isTrialSubscription defaults to false
  ✅ Error state → fallback to paid behavior
     → out-of-credits-banner-wrapper.tsx:22 — catch block sets isTrialSubscription = false
  ...

7/7 acceptance criteria validated.

Did the implementation miss an edge case? Is there a criterion with no corresponding code change? You find out before the PR merges, not after users report a bug.

Here's what it looks like when a criterion fails — say the agent implemented the top banner but forgot to update the agent overlay:

Reviewing PR #1288 against REQ-375...

Acceptance Criteria:
  ✅ Trial org + 0 credits → banner shows "Upgrade plan" only
     → out-of-credits-banner.tsx:42 — conditional render on isTrialSubscription
  ❌ Trial org + 0 credits → agent overlay shows "Upgrade now" only
     → low-credits-banner.tsx — no trial-specific conditional found,
       still renders standard "Top-up credits" CTA for all org types
  ✅ Paid org + 0 credits → shows both buttons
     → out-of-credits-banner.tsx:38 — default branch renders both CTAs
  ...

6/7 acceptance criteria validated. 1 failed.

The AI traces each criterion to specific lines in the diff. It's not grepping for keywords — it's reasoning about whether the code changes actually satisfy the criterion. It catches semantic gaps that compile and lint clean: a component that handles the trial state but renders the wrong CTA text, or an error fallback that works correctly but doesn't match the criterion's specified behavior. What it can't catch are runtime logic errors where the code reads correctly but behaves wrong — "the conditional exists but the boolean is inverted." That's exactly what the next layer is for.
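
A contrived example of that failure mode, to make it concrete: the code below type-checks, lints clean, and contains the conditional a criterion asks for, yet behaves exactly backwards.

// Hypothetical illustration: the conditional "exists" and the types check out,
// but the branches are inverted, so trial users would get the paid banner.
function bannerVariant(isTrialSubscription: boolean): 'trial' | 'paid' {
  return isTrialSubscription ? 'paid' : 'trial'; // should be 'trial' : 'paid'
}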

#AI-Driven Testing

After the PR merges, testing follows the same pattern: AI writes the test, AI runs the test, humans review. This is the layer that catches what code review can't — behavioral bugs in the running application.

The agent writes a markdown test spec, then executes it by driving a real browser against the deployed app while verifying data at the database level:

Test: Credits Top-Up with Stripe
──────────────────────────────────────────────────
Setup:  Query initial balance → 1,998 credits
Step 1: agent-browser → navigate to /settings/billing
Step 2: agent-browser → click "Top-up credits", select $10 / 1,000 credits
Step 3: agent-browser → fill Stripe test card, click Pay
Step 4: agent-browser → wait for redirect, verify success message
Verify: Query final balance → 2,998 credits ✅
──────────────────────────────────────────────────
Result: PASSED
  - Credits added to ORGANIZATION account (not USER)
  - Transaction type: OVERAGE_PURCHASE
  - Credits expire after 1 year
  - Stripe event payload verified in events table

agent-browser drives the actual user flow — clicking buttons, filling forms, navigating pages. MCP servers give the agent direct database access for setup and verification. The agent reads the spec, executes each step against the running app, and appends actual results including event payloads and database state changes it observed.

The defense is layered: human spec review catches wrong requirements, requirement review catches criterion-to-code gaps, and browser tests catch behavioral bugs in the live app. The remaining gap — a spec and a test that both miss the same thing — is the same gap human engineering has. The difference is that every layer runs automatically.

#When It Breaks

The happy path is nice. Here's what happens when things go wrong.

The spec itself is wrong. This is the one failure mode no amount of automation catches — because every downstream layer executes the spec faithfully. If the spec says "show a modal" and you meant inline editing, the implementation will be correct according to the wrong spec. That's why /specify walks you through clarifying questions before generating the requirement, and why you review the spec before /breakdown. The human is the checkpoint. If you rubber-stamp a bad spec, everything downstream is on you.

The AI misunderstands the spec. Different from above — this is when /specify generates acceptance criteria that don't match your intent and you catch it. We catch bad specs about 20% of the time — the AI assumed a modal when we wanted inline editing, missed an auth requirement, or scoped too broadly. Editing a spec takes seconds. Debugging a wrong implementation takes an hour.

A task fails validation. The agent runs yarn validate:fix after every task. If types break or tests fail, it reads the error, fixes the code, and re-validates before marking the task complete. You see this happening in real time. If it gets stuck in a loop, you intervene — but that's rare because the task description already specified which patterns to follow.

requirement review flags a gap. A criterion shows no corresponding code change. The agent either missed it or decided it was out of scope. You see exactly which criterion failed and can either implement it or mark it as intentionally deferred.

The test fails. agent-browser snapshots the DOM and the agent reads the actual state. "Expected 'Upgrade plan' button, found 'Top-up credits' button." The error is usually obvious from the snapshot. The agent can fix the code and re-run, or you can investigate manually. Element refs (@e1, @e2) change after every page interaction, so the agent re-snapshots after each step — stale refs are the most common failure mode and the tooling handles it.

#Patterns Worth Stealing

These apply to any AI-assisted development workflow, with or without BrainGrid:

  1. Specify before building. A few minutes of upfront clarity saves an hour of rework. This is the single highest-leverage thing you can do.
  2. Tasks are prompts. Write task descriptions as if you're prompting an AI — because you are. Include file paths, patterns, and APIs.
  3. Automate the things humans forget. Task status sync, validation, branch naming conventions. If the developer has to remember to do it, they won't.
  4. Validate against the spec, not just the code. Code review catches bugs. Spec review catches missed requirements. Browser tests catch behavioral regressions. You need all three.
  5. The agent should just start. After an explicit build command, don't ask "shall I proceed?" The user already expressed intent.
  6. Test against real infrastructure. Database queries, browser interactions, deployed endpoints. Mocks hide bugs.
  7. Memory compounds. Store learnings persistently. Your future self (and your agents) will thank you.
  8. Handle errors reactively. Don't pre-check if every tool is installed. Run it and handle failure if it occurs.

#Try It in 5 Minutes

You don't need the full setup to start. The core loop — specify, break down, build — works with just three things:

# 1. Install the CLI
npm install -g @braingrid/cli

# 2. Create an account and authenticate
#    → Sign up at app.braingrid.ai, then:
braingrid login

# 3. Initialize in your project
braingrid init

# 4. Open Claude Code, then type:
/specify "your feature idea here"

# 5. Break it into tasks
/breakdown REQ-XXX

# 6. Start building
/build REQ-XXX

That's it. No MCP servers, no hooks, no browser automation. Add those later when you want database verification, automated task sync, or AI-driven testing. Start with the spec.

#Under the Hood

The workflow runs on Claude Code with a few extensions. Everything above works out of the box. Everything below powers the full experience but is optional.

Skills are markdown files that teach Claude domain knowledge. They load automatically when relevant context is detected — you don't invoke them manually. We use braingrid-cli (the spec-driven workflow), agent-browser (E2E testing), frontend-design (UI that doesn't look AI-generated), and memory (persistent learnings via mem0).

Hooks run shell scripts in response to tool calls. Our most important hook: when Claude marks a task complete, a PostToolUse hook syncs the status to BrainGrid automatically. You never manually update a task tracker.
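
The shape of such a hook, as a hedged sketch: Claude Code pipes a JSON description of the tool call to the hook on stdin, but the tool we match on, the payload fields we read, and the braingrid subcommand shown here are assumptions standing in for whatever the real sync uses.

// post-tool-use-sync.ts (hypothetical sketch of a PostToolUse hook handler)
import { readFileSync } from 'node:fs';
import { execFileSync } from 'node:child_process';

interface Todo {
  content: string;
  status: 'pending' | 'in_progress' | 'completed';
}

interface PostToolUsePayload {
  tool_name: string;
  tool_input?: { todos?: Todo[] };
}

// Read the tool-call payload Claude Code writes to the hook's stdin
const payload: PostToolUsePayload = JSON.parse(readFileSync(0, 'utf8'));

if (payload.tool_name === 'TodoWrite') {
  const completed = payload.tool_input?.todos?.filter((t) => t.status === 'completed') ?? [];
  for (const todo of completed) {
    // Pull a task ID like "TASK-2" out of the todo text, then sync it upstream
    const match = todo.content.match(/TASK-\d+/);
    if (!match) continue;
    // Hypothetical CLI call; substitute whatever sync command your tracker provides
    execFileSync('braingrid', ['task', 'update', match[0], '--status', 'completed'], {
      stdio: 'inherit',
    });
  }
}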

MCP servers give Claude authenticated access to external services. Supabase for database queries — you control the scope (read-write on dev, read-only on production, or production off entirely — your call). Axiom for production logs (read-only). Playwright for browser automation. mem0 for persistent memory. When Claude needs to verify a migration, it queries the dev database directly. When it needs to test a flow, it drives a browser against the staging deployment. No context switching.

Persistent memory via mem0 stores learnings across sessions: "AuthKit login: wait 1500ms after email submit for password field." "Credit expiration cron uses billing_anniversary_day, not created_at." Before starting work, the agent searches memory. After discovering something reusable, it stores it. This compounds — month-old debugging insights surface exactly when needed.


About the Author

Tyler Wells is the Co-founder & CTO of BrainGrid, where we're building the future of AI-assisted software development. With over 25 years of experience in distributed systems and developer tools, Tyler focuses on making complex technology accessible to engineering teams.

Want to discuss AI coding workflows or share your experiences? Find me on X or connect on LinkedIn.
