The Design System That Keeps AI-Generated UI Consistent

Let me describe what happens when you tell an AI to "build a settings page" without giving it a design system.

You get a beautiful settings page. Clean layout. Nice spacing. Modern button styles. Looks great in isolation.

Then you navigate to the dashboard that was generated last week. Different button radius. Different font weight for headings. Slightly different shade of blue. The padding on cards is 24px here but 20px on the settings page. The empty state has a different illustration style than the one on the contacts page.

Each page, viewed alone, looks professional. Viewed together, they look like five different apps wearing the same logo.

This is the fundamental problem with AI-generated UI. Language models are good at generating plausible components. They're bad at generating consistent components, because consistency requires remembering decisions made in other files, in other sessions, sometimes weeks ago. And unless you structure things deliberately, those decisions aren't written down anywhere. They exist only implicitly, scattered across the codebase.

The Design System First policy

On March 13, I made this a standing order across all 26 agents: always update the design system before updating the app.

The rule applies to colors, typography, spacing, component variants, states (hover, active, disabled, empty, loading, error), animations, shadows, border radii, icons, and any visual pattern. If it touches how something looks, the design system gets updated first.

Here's the workflow in practice.

Say I need a new notification badge component for Wheelhouse, my project management app. Before Forge writes a single line of React code in the Wheelhouse app, it first opens the design system repository. The design system is a standalone project, separate from any app. It contains HTML files that demonstrate every component in every state, using design tokens defined in a central CSS file.

Forge creates the notification badge in the design system. It defines the tokens: --wh-badge-bg, --wh-badge-text, --wh-badge-radius. It builds the HTML example showing the badge in all sizes and states. It verifies the component looks right in isolation.

Only then does Forge implement the badge in the actual Wheelhouse app, referencing the design system as the source of truth. The app's badge component uses the same tokens. The same spacing. The same colors. Not approximately the same. Exactly the same, because they're pulling from the same token definitions.
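As a rough sketch of what "pulling from the same token definitions" can look like in code: the three token names below come from the badge example above, while the values and the helper toCssBlock are hypothetical placeholders, not Wheelhouse's real ones.

```typescript
// Token names are from the article; the values are placeholders for illustration.
const badgeTokens: Record<string, string> = {
  "--wh-badge-bg": "#1D4ED8",
  "--wh-badge-text": "#FFFFFF",
  "--wh-badge-radius": "6px",
};

// Hypothetical helper: serialize a token map into a :root block for the central CSS file.
function toCssBlock(tokens: Record<string, string>): string {
  const lines = Object.keys(tokens).map((key) => `  ${key}: ${tokens[key]};`);
  return [":root {", ...lines, "}"].join("\n");
}

// The design system example and the app component both reference tokens by name only,
// so neither one ever contains a raw color or radius value.
const badgeStyle = {
  background: "var(--wh-badge-bg)",
  color: "var(--wh-badge-text)",
  borderRadius: "var(--wh-badge-radius)",
};
```

Because the raw values live in one place, changing the badge radius means editing one token, and both the HTML example and the app component pick it up.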

Why this matters more for AI than for humans

Human developers build up intuition. A frontend engineer who's worked on a project for six months remembers that the primary blue is #0A2342 and the card padding is 16px. They carry that context in their head. They might still make inconsistencies, but the baseline consistency comes from memory and habit.

AI doesn't have that. Every session starts fresh. When Forge opens a new session to build a feature, it doesn't remember the session where it built the last feature. It reads files. Whatever is in those files is the entirety of what Forge knows about your design.

If the design system is incomplete, Forge will fill in the gaps with plausible defaults. And "plausible" means "looks reasonable in isolation but doesn't match what exists." This is how you end up with three different gray values for borders, two different hover transition speeds, and button corner radii that range from 4px to 8px across your app.

A well-maintained design system eliminates this problem. Forge reads the token file. The token file says --wh-radius-md: 6px. Forge uses 6px. No guessing. No "what did I use last time?" There is no last time for an AI. There's only what the files say right now.
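To make "Forge reads the token file" concrete, here is a minimal sketch of turning a CSS token file into a lookup table. The function name parseTokens is my own, and the parsing is deliberately naive (one regular expression, no comment or nested-value handling), so treat it as an illustration rather than a real CSS parser.

```typescript
// Naive sketch: collect every custom property declaration ("--name: value;") into a map.
function parseTokens(css: string): Record<string, string> {
  const tokens: Record<string, string> = {};
  const pattern = /(--[a-z0-9-]+)\s*:\s*([^;]+);/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(css)) !== null) {
    tokens[match[1]] = match[2].trim();
  }
  return tokens;
}

// Example token file contents. --wh-radius-md: 6px is the value quoted in the article.
const tokenFile = [":root {", "  --wh-radius-md: 6px;", "}"].join("\n");
const radius = parseTokens(tokenFile)["--wh-radius-md"];
```

With a table like this, "what radius do we use?" is a lookup, not a guess.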

The 67 to 100 story

I built an audit tool that scores a design system on a 100-point scale across 11 dimensions: UI quality, UX quality, mobile experience, responsive design, design system consistency, code quality, type safety, security, performance, accessibility, and standardization.

When I first ran this audit on the Wheelhouse design system, it scored 67 out of 100.

That felt bad. But the breakdown was useful. The design system had solid foundations. The color tokens were well-defined. Typography was consistent. But it was missing dark mode coverage for roughly 40% of the components. Several interactive states were undefined. Empty states and loading states were either missing or inconsistent. The token naming convention had drifted, with some tokens using --wh-brand-* (deprecated) and others using the current --wh-* prefix.

The fix was methodical. Forge and I ran iterative audit-fix cycles. Each cycle followed the same pattern: run the audit, identify the lowest-scoring dimension, fix the issues in the design system (not in the app), re-run the audit. Repeat.
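The cycle described above can be sketched as a small loop. The audit and fix callbacks are hypothetical stand-ins for the real audit tool and for the manual fixing work, and the per-dimension scoring shape is an assumption for illustration.

```typescript
type DimensionScores = Record<string, number>;

// Return the dimension with the lowest score; ties resolve to the first one listed.
function lowestDimension(scores: DimensionScores): string {
  const keys = Object.keys(scores);
  if (keys.length === 0) throw new Error("no dimensions to score");
  let worst = keys[0];
  for (const key of keys) {
    if (scores[key] < scores[worst]) worst = key;
  }
  return worst;
}

// Run audit -> fix lowest dimension -> re-audit, until every dimension reaches
// perfectScore or maxRounds is exhausted. Returns the number of fix rounds performed.
function runAuditFixCycles(
  audit: () => DimensionScores,
  fix: (dimension: string) => void,
  perfectScore: number,
  maxRounds: number
): number {
  let rounds = 0;
  while (rounds < maxRounds) {
    const scores = audit();
    const worst = lowestDimension(scores);
    if (scores[worst] >= perfectScore) break; // nothing left to fix
    fix(worst); // the fix lands in the design system, not in the app
    rounds += 1;
  }
  return rounds;
}
```

The maxRounds cap is there so a dimension that never improves can't keep the loop running forever.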

Here's what each round addressed.

Round one targeted token consistency. We deleted all deprecated --wh-brand-* tokens and replaced them with the canonical --wh-* equivalents. This touched every HTML file in the design system. 68 files total. No creativity involved. Just find-and-replace with verification that the visual output didn't change.
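A find-and-replace like round one might be sketched as follows. The mapping entries are invented for illustration, since the article only says that --wh-brand-* tokens were replaced by --wh-* equivalents. Anything without a mapping is left untouched and reported, so a human can decide what it should become.

```typescript
// Hypothetical mapping from deprecated names to canonical ones.
const deprecatedToCanonical: Record<string, string> = {
  "--wh-brand-primary": "--wh-primary",
  "--wh-brand-accent": "--wh-accent",
};

interface MigrationResult {
  output: string;
  unmapped: string[]; // deprecated tokens that had no known replacement
}

function migrateTokens(source: string, mapping: Record<string, string>): MigrationResult {
  const unmapped: string[] = [];
  const output = source.replace(/--wh-brand-[a-z0-9-]+/g, (token) => {
    const replacement = mapping[token];
    if (replacement === undefined) {
      if (unmapped.indexOf(token) === -1) unmapped.push(token);
      return token; // leave it in place for manual review
    }
    return replacement;
  });
  return { output, unmapped };
}
```

The "verification that the visual output didn't change" step is not covered here. That still means opening the HTML file and looking at it.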

Round two tackled dark mode. The design system had a light theme that covered everything and a dark theme that covered about 60% of components. We added dark mode variants for every remaining component. This meant defining dark-specific token values for backgrounds, borders, text colors, and accent colors. The key insight: dark mode isn't just inverting colors. It's a separate set of intentional choices for every surface.
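One way to measure dark mode coverage is to compare the two themes' token sets directly. This is a minimal sketch. The token names and values below are hypothetical, and the real audit may well check coverage per component rather than per token.

```typescript
type ThemeTokens = Record<string, string>;

// Return every token the light theme defines that the dark theme does not.
function missingInDark(light: ThemeTokens, dark: ThemeTokens): string[] {
  return Object.keys(light).filter((token) => !(token in dark));
}

// Placeholder themes. Note that the dark values are chosen, not computed by inverting light.
const lightTheme: ThemeTokens = {
  "--wh-surface": "#FFFFFF",
  "--wh-border": "#E2E8F0",
  "--wh-text-primary": "#0F172A",
};
const darkTheme: ThemeTokens = {
  "--wh-surface": "#0F172A",
  "--wh-text-primary": "#F8FAFC",
};

const darkGaps = missingInDark(lightTheme, darkTheme);
```

An empty result is the goal: every surface has an intentional dark value.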

Round three was states. Every interactive component needed hover, focus, active, and disabled states. Every data component needed empty, loading, and error states. This round produced the most visual assets because we were building patterns that hadn't existed before. For each state, we created a standalone HTML example that demonstrated the exact appearance.
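The state requirements above translate naturally into a checklist. In this sketch the two state lists are taken from the paragraph above, and the ComponentSpec shape is an assumption about how a component inventory might be recorded.

```typescript
const INTERACTIVE_STATES = ["hover", "focus", "active", "disabled"];
const DATA_STATES = ["empty", "loading", "error"];

// Hypothetical inventory record for one design system component.
interface ComponentSpec {
  id: string;
  kind: "interactive" | "data";
  documentedStates: string[];
}

// Return the required states that have no documented example yet.
function missingStates(component: ComponentSpec): string[] {
  const required = component.kind === "interactive" ? INTERACTIVE_STATES : DATA_STATES;
  return required.filter((state) => component.documentedStates.indexOf(state) === -1);
}

const dataTable: ComponentSpec = {
  id: "data-table",
  kind: "data",
  documentedStates: ["loading"],
};
const tableGaps = missingStates(dataTable);
```

Each name in the result is one standalone HTML example still waiting to be built.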

Round four was accessibility. Contrast ratios, focus indicators, keyboard navigation, ARIA labels. The audit tool checked WCAG AA compliance for every color combination. Several of our lighter gray values on white backgrounds failed. We adjusted them, then verified the visual design still worked.
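The contrast check is the most mechanical part of this round. The sketch below follows the WCAG 2.x definitions of relative luminance and contrast ratio, with the AA thresholds of 4.5:1 for normal text and 3:1 for large text. It assumes six-digit hex colors only, and it is not the audit tool's actual code.

```typescript
// Convert one 8-bit sRGB channel to its linear value (WCAG 2.x relative luminance).
function linearize(channel: number): number {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a six-digit hex color such as "#0A2342".
function relativeLuminance(hex: string): number {
  const value = hex.replace("#", "");
  const r = parseInt(value.slice(0, 2), 16);
  const g = parseInt(value.slice(2, 4), 16);
  const b = parseInt(value.slice(4, 6), 16);
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

// Contrast ratio between two colors, from 1 (identical) to 21 (black on white).
function contrastRatio(foreground: string, background: string): number {
  const l1 = relativeLuminance(foreground);
  const l2 = relativeLuminance(background);
  return (Math.max(l1, l2) + 0.05) / (Math.min(l1, l2) + 0.05);
}

// WCAG AA: at least 4.5:1 for normal-size text, 3:1 for large text.
function passesAA(foreground: string, background: string, largeText = false): boolean {
  return contrastRatio(foreground, background) >= (largeText ? 3 : 4.5);
}
```

A light gray such as #CCCCCC on white fails this check by a wide margin, which is the kind of combination this round had to adjust.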

Round five was the polish pass. Transition speeds standardized to two values (150ms for micro-interactions, 300ms for layout shifts). Shadow tokens consolidated from seven variants to four. Border widths normalized. By this point, the score was in the high 90s, and the remaining points came from minor standardization issues.

The final score: 100 out of 100. All 68 HTML files. Zero deprecated tokens. Full dark and light coverage for every component.

What a design system audit actually looks like

People ask about the scoring. Here's how it works in practice.

The audit tool reads every file in the design system. For each dimension, it checks specific, measurable criteria. Design system consistency, for example, checks whether every color used in the HTML files maps to a defined token. If a file uses color: #333 instead of var(--wh-text-primary), that's a deduction. If a spacing value is padding: 16px instead of var(--wh-space-4), that's a deduction.
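A consistency check of this kind can be approximated with a few regular expressions. This is a simplified sketch, not the real audit tool: it flags any declaration whose value contains a raw hex color or a raw px length, and it skips token definitions, since those are the one place raw values belong.

```typescript
interface Violation {
  line: number; // 1-based line number within the scanned text
  property: string;
  value: string;
}

const RAW_HEX = /#[0-9a-fA-F]{3,8}\b/;
const RAW_PX = /\b\d+(\.\d+)?px\b/;

function findHardCodedValues(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    const declaration = /([a-zA-Z0-9-]+)\s*:\s*([^;{}]+);/g;
    let match: RegExpExecArray | null;
    while ((match = declaration.exec(text)) !== null) {
      const property = match[1];
      const value = match[2].trim();
      // Custom property definitions (--wh-*: ...) are allowed to hold raw values.
      if (property.indexOf("--") === 0) continue;
      if (RAW_HEX.test(value) || RAW_PX.test(value)) {
        violations.push({ line: index + 1, property, value });
      }
    }
  });
  return violations;
}
```

Each violation would be one deduction. How many points a deduction costs is a scoring decision this sketch doesn't attempt to model.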

UI quality checks component completeness. Does every button have all states? Does every form input show validation states? Is there a consistent visual hierarchy?

Mobile experience checks responsive behavior at standard breakpoints. Does the component stack correctly at 375px wide? Do touch targets meet the 44x44px minimum?

The audit isn't subjective. It's a checklist applied to code. An AI can run it because it's looking at measurable properties, not making aesthetic judgments. That's intentional. I don't want the audit to depend on taste. I want it to depend on rules.

How Forge uses the design system

When Forge gets a task like "add a calendar view to Wheelhouse," here's what actually happens.

First, Forge reads the design system token file to understand the visual vocabulary. What colors are available. What spacing scale is defined. What the typography hierarchy looks like. This gives Forge the palette it can work with, and more importantly, the palette it cannot deviate from.

Then Forge checks existing design system components for patterns. Is there a card component? What are its specs? Is there a data table? How does it handle empty states? Forge isn't just looking at tokens. It's looking at the established patterns for how those tokens get composed into components.

If the calendar view needs a new component that doesn't exist in the design system, Forge creates it there first. A date cell component, maybe. A week header row. These get built in the design system with all states and variants, using only defined tokens.

Finally, Forge implements the calendar in the actual app, pulling from the design system as reference. The app code mirrors the design system patterns. Same token references. Same state handling. Same responsive breakpoints.

The result is that a component built today by Forge in one session looks like it belongs with a component built two weeks ago by Forge in a different session. Because they both followed the same reference.

The tokens

Everything comes back to tokens. Here's what the token system looks like for PortLink, another of my projects.

Colors are defined with semantic names, not raw values. The navy primary is #0A2342. The accent blue is #00A8E8. The muted text is #56708C. The background is #F8FAFC. These hex values appear exactly once, in the token definition file. Everywhere else, they're referenced as variables.

PortLink uses a --pl- prefix for its tokens. Wheelhouse uses --wh-. This prevents collision when I'm working on both projects in the same session. Forge sees --pl-accent and knows it's PortLink's accent color. No ambiguity.
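The prefix convention also makes cross-project leakage easy to detect mechanically. Here is a small sketch, using a helper name of my own (foreignTokens), that lists any var() reference in a file that doesn't carry the expected project prefix.

```typescript
// Return each distinct token referenced via var(...) that lacks the allowed prefix.
function foreignTokens(source: string, allowedPrefix: string): string[] {
  const foreign: string[] = [];
  const pattern = /var\(\s*(--[a-z0-9-]+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const token = match[1];
    if (token.indexOf(allowedPrefix) !== 0 && foreign.indexOf(token) === -1) {
      foreign.push(token);
    }
  }
  return foreign;
}

// A PortLink stylesheet that accidentally references a Wheelhouse token.
// --pl-accent is from the article; --wh-surface is a made-up name.
const portLinkCss = "a { color: var(--pl-accent); background: var(--wh-surface); }";
const leaks = foreignTokens(portLinkCss, "--pl-");
```

Here the result contains --wh-surface, the Wheelhouse token that wandered into a PortLink file.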

Spacing follows a scale in pixels: 4, 8, 12, 16, 24, 32, 48, 64. These map, in order, to the tokens --wh-space-1 through --wh-space-8. When Forge needs to add padding to a component, it picks from this scale. It doesn't invent a value. The scale constrains the choices, which is exactly the point.
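The spacing scale is small enough to express directly. This sketch assumes the tokens are numbered by position in the scale, which matches the earlier pairing of 16px with --wh-space-4. The snapping helper is my own addition, showing how an off-scale value could be pulled back onto the scale.

```typescript
const SPACING_SCALE_PX = [4, 8, 12, 16, 24, 32, 48, 64];

// Build the token map: --wh-space-1 is 4px, and so on up to --wh-space-8 at 64px.
function spacingTokens(): Record<string, string> {
  const tokens: Record<string, string> = {};
  SPACING_SCALE_PX.forEach((px, index) => {
    tokens[`--wh-space-${index + 1}`] = `${px}px`;
  });
  return tokens;
}

// Snap an arbitrary pixel value to the nearest step on the scale and return it
// as a var() reference. Ties resolve to the smaller step.
function nearestSpacingToken(px: number): string {
  let bestIndex = 0;
  SPACING_SCALE_PX.forEach((step, index) => {
    if (Math.abs(step - px) < Math.abs(SPACING_SCALE_PX[bestIndex] - px)) {
      bestIndex = index;
    }
  });
  return `var(--wh-space-${bestIndex + 1})`;
}
```

So a stray 22px becomes var(--wh-space-5), which is 24px, rather than a ninth spacing value.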

Typography defines a hierarchy: display headings, section headings, body text, small text, caption text. Each with a specific font size, line height, and weight. When Forge creates a new component with text, it maps each text element to a tier in the hierarchy. There's no "hmm, 14px looks about right." There's "this is body text, therefore it uses --wh-text-body-size."
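Mapping text elements to tiers can be expressed as a type, so that an off-hierarchy choice won't even compile. Only --wh-text-body-size appears in the article. The other token names below are assumed to follow the same pattern, and the tier names are shorthand for the five tiers listed above.

```typescript
// Shorthand names for the five tiers: display headings, section headings,
// body text, small text, caption text.
type TextTier = "display" | "section" | "body" | "small" | "caption";

interface TextStyle {
  fontSize: string;
  lineHeight: string;
  fontWeight: string;
}

// Every text element picks a tier, and the tier decides size, line height, and weight.
// Token names other than --wh-text-body-size are assumptions.
function textStyleFor(tier: TextTier): TextStyle {
  return {
    fontSize: `var(--wh-text-${tier}-size)`,
    lineHeight: `var(--wh-text-${tier}-line-height)`,
    fontWeight: `var(--wh-text-${tier}-weight)`,
  };
}

const bodyStyle = textStyleFor("body");
```

There is no way to ask this function for "14px". The only question it accepts is which tier the text belongs to.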

What I got wrong at first

I'll be honest about the early mistakes.

Initially, the design system lived inside the app repository. Same repo, just a /design-system directory. This was convenient but caused problems. Forge would skip the design system step and just edit the app directly, because the files were right there. Why go to /design-system/buttons.html when you can just edit /src/components/Button.tsx? The incentive structure was wrong.

Moving the design system to its own repository changed the dynamic. It's a separate project with its own deploy pipeline (Netlify, auto-published). When Forge needs to reference the design system, it reads files from a different repo path. The physical separation enforces the mental separation.

Another mistake: I started with too few tokens. I defined colors and fonts but not spacing, shadows, or transitions. Forge would use the defined tokens for colors, then invent its own values for everything else. The inconsistency migrated from colors (fixed) to spacing (not fixed). The lesson is that tokens need to cover everything that matters visually. If you can see it, it should have a token.

I also underestimated the importance of states. My first design system showed every component in its default state. That's maybe 30% of what a component actually looks like. The hover state, the disabled state, the loading state, the error state, the empty state: these are all visual configurations that need to be defined. When Forge builds a data table and the user has no data, what should it show? If the design system doesn't answer that question, Forge will make something up. And what Forge makes up in one session won't match what it makes up in another.

The standing order

The Design System First policy is now a standing order across all 26 agents. Nyx broadcasts it. Synapse verifies it's in every agent's MEMORY.md during the weekly audit. It's the kind of rule that has to be universal to work, because the moment one agent starts skipping the design system step, drift creeps back in.

The rule in full: (1) Update the design system first. (2) Verify the change in the design system in isolation. (3) Apply the same change in the app using the design system as reference. (4) Never patch the app directly and skip the design system.

Step 2 is the one people skip. Verifying the design system change in isolation means opening the HTML file in a browser and confirming it's right before touching the app. It feels like an extra step. It is an extra step. It's also the step that catches problems before they multiply. If the token value is wrong, you catch it once in the design system instead of discovering it later in the app, where it's mixed in with business logic and harder to isolate.

When AI writes your frontend, the design system isn't optional. It's how you get coherence across sessions, across agents, across weeks of development. Without it, every AI-generated component is a beautiful island. With it, those islands connect into something that actually looks like a product.

The score went from 67 to 100. But the score isn't the point. The point is that when I open any page of any app, it looks like the same app. The components agree with each other. The spacing is predictable. The colors make sense. Not because an AI has great taste, but because an AI follows rules. The design system is those rules.