A 99.5 and a security hole
EventRipple is an event management SaaS. It runs on React, Vite, TypeScript, Supabase, and Stripe. It has a design system, internationalization in English and Norwegian, a real domain at eventripple.app, and actual users.
Almost every line of code in the application was written by an AI agent named Forge, running on Claude Opus 4.6.
On March 14th, I ran a full code audit across 11 dimensions. EventRipple scored 99.5 out of 100. That's not a number I'm inflating. It's the actual result. Visual quality, responsive design, accessibility, type safety, performance, standardization, security. Near-perfect across the board.
And yet, right now, there are five database tables with Row Level Security policies set to USING(true), which means anyone with the anonymous API key can read every row. Customer PII is technically exposed. It's a GDPR violation waiting to happen.
That's the honest picture of AI-generated production code. The highs are genuinely impressive. The gaps are real.
How we got here
EventRipple started as a product idea in my OpenClaw setup. I have 26 AI agents organized in Slack, each with a specific role. Forge is the lead engineer, running on Opus 4.6 because architecture and code generation need the strongest reasoning model I can get.
When I decided to build EventRipple, I didn't sit down and write code myself. I described what I wanted to Nyx, my orchestrator agent, and Nyx delegated the engineering work to Forge. Forge wrote the components, the API routes, the database schema, the Stripe integration, the entire design system. Sentinel, my code review agent, reviewed the pull requests before they merged.
The development happened over about six weeks. Not continuously. Forge would get a spec, write the code, submit a PR. Sentinel would review. I'd look at the result, provide feedback, and the cycle would repeat. It felt less like using a coding tool and more like managing a junior developer who happens to write code very fast and never gets tired.
The codebase grew to dozens of components, multiple page routes, a full authentication flow, an offer and counter-offer system, guest management, date polling, and an event creation wizard. All generated by AI.
What the audit actually checked
I didn't just look at the code and say "looks good." The audit ran against 11 specific dimensions, each scored from 0 to 10:
Visual polish. Does it look like a real product or a hackathon demo? Forge followed the design system tokens for colors, spacing, typography, border radius. The visual consistency is better than most human-built MVPs I've seen, honestly.
Responsive design. EventRipple works on mobile, tablet, and desktop. The components adapt properly. There are still some edge cases on the mobile design system page that need an audit-fix loop, but the core app is solid.
Accessibility. WCAG AA compliance. Proper heading hierarchy, color contrast ratios, focus management. This is where a design-system-first approach pays off. When the tokens define accessible contrast ratios, every component that uses those tokens inherits accessibility for free.
Type safety. Zero TypeScript errors. Every prop is typed, every API response has an interface. This is one area where AI-generated code actually has an advantage. Forge doesn't get lazy about types the way human developers do after a long day.
Performance. Vite bundles are reasonable. No unnecessary re-renders. Lazy loading where it makes sense. The app loads fast on a real connection.
Standardization. Consistent naming conventions, consistent file structure, consistent component patterns. When one agent writes all the code, you get consistency almost by accident. There's no "different developer, different style" problem.
Security. This is where the 0.5-point deduction came from. The RLS policies.
Code organization, test coverage, documentation, build health, internationalization. All checked. All scored high.
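The accessibility dimension is worth a concrete illustration. The claim is that components inherit accessible contrast from tokens, so the check happens once at the token level. Here's a minimal sketch of that kind of token-level WCAG check; the token names and values are hypothetical, not EventRipple's actual design system:

```typescript
// Hypothetical tokens; EventRipple's real token names and values are not shown here.
const tokens = {
  colorTextPrimary: "#1a1a1a",
  colorSurface: "#ffffff",
};

// WCAG relative luminance for an sRGB hex color like "#rrggbb".
function luminance(hex: string): number {
  const [r, g, b] = [1, 3, 5].map((i) => {
    const c = parseInt(hex.slice(i, i + 2), 16) / 255;
    return c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4;
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(fg: string, bg: string): number {
  const [hi, lo] = [luminance(fg), luminance(bg)].sort((a, b) => b - a);
  return (hi + 0.05) / (lo + 0.05);
}

// WCAG AA requires at least 4.5:1 for normal-size text.
const meetsAA = contrastRatio(tokens.colorTextPrimary, tokens.colorSurface) >= 4.5;
```

If this check passes for every text/surface token pair, every component built from those tokens clears the contrast requirement without a per-component audit.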
The jump from 69/100 on March 9th to 99.5/100 on March 14th happened through a focused quality sprint. Forge went through seven improvement phases in a single session: splitting oversized files, converting hardcoded values to CSS variables, accessibility fixes, type safety cleanup.
How Forge actually writes code
People ask me this a lot, and I think they expect the answer to be "I paste a prompt into ChatGPT and copy the output." It's nothing like that.
Forge has a workspace with memory files, project context, and standing orders. When it starts a session, it reads its MEMORY.md (long-term knowledge), the project's context.md (current state, recent decisions, open issues), and any relevant daily notes. It knows the codebase before it writes a line.
There's a strict rule in my setup: always update the design system before touching the app. Forge follows this. When a new component is needed, the design system spec gets written first with tokens, variants, states, and responsive breakpoints documented. Then the component gets built in the app using those tokens. This prevents drift between what the design system says and what the app actually does.
Forge writes in TypeScript. It types everything. It follows the file naming conventions and folder structure already established in the project. When it creates a new component, it looks at how existing components are structured and matches the pattern.
After writing code, Forge submits a PR to the GitHub repo. Sentinel, the code review agent, picks it up. Sentinel checks for security issues, accessibility problems, type errors, and deviation from the design system. If something fails review, it goes back to Forge with specific feedback.
The code then gets pushed to main, Netlify auto-deploys, and EventRipple updates live.
This loop runs without me writing a single line of code. My job is specifying what to build, reviewing the output, and making architectural decisions when there's a fork in the road.
The RLS bug and what it tells you
Let me talk about the elephant in the room. Five tables in the Supabase database have their Row Level Security policies set to allow public reads. The tables are: accepted_offers, offer_views, offer_counteroffers, offer_versions, and offer_comments. There's also guest_date_poll_votes with anonymous insert and update, which means vote stuffing is theoretically possible.
How did this happen? When Forge set up the initial database schema, the RLS policies were created to be permissive during development. The idea was to lock them down before launch. That lockdown hasn't happened yet.
This is the kind of thing that a human developer would also defer and potentially forget about. The difference is that a human developer might notice it during a routine security review. Forge flagged it in the project documentation. It's listed as a critical issue. But flagging isn't fixing. The fix requires manual SQL changes in the Supabase dashboard (we have a standing order that RLS policy changes are done manually, not by Forge), and I haven't prioritized it yet.
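The check that caught this is simple in principle: a policy whose USING expression is literally `true` grants full access to anyone holding the anon key. Here's a sketch of that kind of lint over exported policy definitions; the object shape is hypothetical (a real audit would read Postgres's policy catalog), but the detection logic is the whole trick:

```typescript
// Hypothetical shape for an exported RLS policy definition.
// A real audit would pull this from Postgres's policy catalog instead.
interface RlsPolicy {
  table: string;
  command: "SELECT" | "INSERT" | "UPDATE" | "DELETE";
  using: string; // the USING(...) expression as text
}

// USING(true) means the row filter always passes: every row is visible.
function isPermissive(policy: RlsPolicy): boolean {
  return policy.using.replace(/\s+/g, "").toLowerCase() === "true";
}

// Returns human-readable labels for every policy that needs locking down.
function flagPermissive(policies: RlsPolicy[]): string[] {
  return policies.filter(isPermissive).map((p) => `${p.table} (${p.command})`);
}
```

Run against the five tables above, every one of them would be flagged; a policy like `auth.uid() = owner_id` would pass.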
There's also a hardcoded fallback URL in the code. The PUBLIC_SITE_URL environment variable isn't set in production, so it falls back to https://app.eventflow.com. That's a competitor's domain. An AI wrote that fallback by pulling from training data, and nobody caught it until the audit.
These are the kinds of bugs that AI generates. Not syntax errors or logic flaws. Configuration oversights, security policy gaps, training data leakage into production values. The code compiles, the app works, the tests pass (well, there are zero tests, which is its own problem), but the deployment configuration has landmines.
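One cheap defense against the fallback class of landmine is to refuse to fall back at all. A sketch, assuming the `PUBLIC_SITE_URL` variable from above; the `getSiteUrl` helper itself is hypothetical, not EventRipple's actual code:

```typescript
// Sketch: fail fast instead of silently falling back to a hardcoded domain.
// PUBLIC_SITE_URL is the variable discussed above; getSiteUrl is hypothetical.
function getSiteUrl(env: Record<string, string | undefined>): string {
  const url = env.PUBLIC_SITE_URL;
  if (!url) {
    // A missing value should be a loud build/deploy error, not a quiet
    // fallback that might point at the wrong domain entirely.
    throw new Error("PUBLIC_SITE_URL is not set");
  }
  return url;
}
```

A throw here would have broken the deploy the first time the variable was missing, which is exactly what you want from a configuration bug: noise, not a plausible-looking wrong value.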
What production-ready actually means
The 99.5/100 score tells you the code quality is high. The unfixed RLS bug tells you code quality isn't the same as production-readiness.
Production-ready means: security policies are locked down, environment variables are properly configured, there's monitoring, there are tests, there's a runbook for when things go wrong. EventRipple has high code quality and an incomplete operational story.
I think this is the gap that most people miss when they talk about AI-generated code. The code itself can be excellent. The surrounding infrastructure, the ops work, the security review, the testing, the deployment pipeline, the incident response plan. That's where human judgment still matters.
Forge can write a React component that's better structured than what most developers would produce on their first attempt. It can't decide what the RLS policy should be for a specific business context. It can't weigh the tradeoff between strict security and developer velocity during early development. It can suggest, and it does suggest. But the decision is mine.
The zero-tests problem
EventRipple has zero automated tests. This is embarrassing to admit in a post about production-grade code. But it's the truth, and I think it's instructive.
Forge can write tests. It's good at it. My Budget app, built by the same agent, has 79 passing tests in Vitest. The difference is that I prioritized tests for Budget because it handles financial data. For EventRipple, I prioritized features and design quality.
That's a human decision failure, not an AI failure. I told Forge what to build, and tests weren't in the spec. The agent didn't push back and say "hey, we should write tests for this." It could have, if I'd configured it to. But I didn't set up that guardrail.
This is a pattern I've noticed: AI agents do exactly what you configure them to do. If your process includes mandatory test coverage, they'll write tests. If it doesn't, they won't volunteer. The responsibility for process quality sits with the human who designs the system.
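For what it's worth, the guardrail doesn't need to be elaborate to be worth mandating. A first test can be a single assertion per promised behavior; in the real project this would live in a Vitest spec, but the shape is the same with plain Node assertions. The helper here is hypothetical, not actual EventRipple code:

```typescript
import { strictEqual } from "node:assert";

// Hypothetical helper under test; the real EventRipple code is not shown.
function formatEventDate(iso: string): string {
  return new Date(iso).toISOString().slice(0, 10);
}

// The smallest useful guardrail: one assertion per behavior the spec promises.
// In the real project this would be a Vitest spec file instead.
strictEqual(formatEventDate("2026-03-14T10:00:00Z"), "2026-03-14");
```

If "every spec includes tests like this" had been a standing order, Forge would have written them the same way it writes everything else in the spec.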
Lessons after shipping AI-generated code
First: the design-system-first approach is non-negotiable. When Forge has a design system to follow, the output is consistent, accessible, and visually coherent. When it doesn't, it improvises, and AI improvisation in UI design produces generic-looking interfaces.
Second: code review by a separate agent catches real issues. Sentinel has flagged type inconsistencies, missing error handling, and accessibility violations that Forge missed. Two agents are better than one for code quality, the same way two humans would be.
Third: the audit score matters, but don't confuse it with deployment readiness. A high audit score means the codebase is clean. It doesn't mean the product is ready for users. Those are different things with different checklists.
Fourth: AI-generated code has a specific failure mode. It's not the obvious bugs. It's the subtle configuration errors, the training data artifacts, the security policies that work during development but aren't appropriate for production. You need a human reviewing these specific categories.
Fifth: zero tests is always wrong. I should have included testing requirements in every spec I gave to Forge. Next project, I will.
Where this is heading
EventRipple is live. The code is clean. The RLS bug needs fixing, and it's at the top of the priority list. Tests need to be written. The mobile design system needs another pass.
But the fact that I can point to a real SaaS application, running in production, serving users, scoring 99.5 on a code audit, with nearly every line written by an AI agent on Opus 4.6, tells me something about where software development is going.
It's not that AI replaces developers. It's that the role of the developer shifts. I didn't write code for EventRipple. I designed the system, made the decisions, reviewed the output, and caught the things the AI missed. That's still real work. It's just different work.
The 99.5 is real. The RLS bug is also real. Both things can be true at the same time, and understanding that tension is the key to using AI-generated code in production.