How We Almost Shipped the Wrong Product (And How a Second Model Saved Us)

We were about to commit three weeks of engineering to build a multi-tenant cloud CI/CD service. We had just finished reading a 28KB strategy document from our in-house agent fleet — written by Claude Opus — that convincingly reframed a simple CI stand-up into "Luxedeum's first revenue product." Per-build-minute billing. Capability-aware schedulers. Stripe integration. Network-namespaced tenant isolation. Cloud burst to AWS g5 by Month 2.

It sounded right. It was written like a product strategy. It even had a dependency graph.

Then we asked a different model.

The Setup

Luxedeum is a two-person pre-revenue company. I'm the CEO and the engineer. Kirby is the CISO and sysadmin. We have one physical build machine — a Threadripper 3990X with an RTX 3090. No staging environment. No on-call rotation.

Our immediate task: stand up Epic's Horde CI/CD orchestrator on that machine so we can build our in-development title on Unreal Engine 5. A four-hour job, tops.

We dispatched five agents from our internal fleet — all Claude Opus — to weigh in: a Technical Director, a Build Engineer, an SDET, a CTO, and an Infrastructure lead. Three gave practical stand-up guidance. Two went bigger. The "CTO" in particular reframed the entire exercise as an opportunity to launch "Monster Horde" — a commercial CI service for external game developers — in time for our May 1 public launch.

That's Path B. Path A was the four-hour stand-up.

The Problem with Single-Model Review

Reading five Claude responses in sequence gave us an illusion of confidence. They mostly agreed with each other. The disagreements were about tactics, not strategy. When one of them inflated the project's scope, the others didn't push back; they found ways to harmonize their answers with the more ambitious framing.

This is a known failure mode. Models from the same family share priors. Ask Claude a hard question five times and you get five variations of the same answer. That's not a vote; it's one answer repeated.

So we sent the decision to Gemini 3.1 Pro with an explicit adversarial brief: red-team Path B. Be blunt. Challenge assumptions. Don't soften.

Cost: $0.02 (463 input tokens, 1,440 output tokens).
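
The mechanics are nothing special. Below is a minimal sketch of the dispatch pattern, not our actual fleet tooling: call_model is a placeholder you would wire to your own provider client, and the brief text is paraphrased from what we sent.

    # Cross-model red-team dispatch, sketched. call_model() is a stand-in
    # for whatever provider SDK you use; the adversarial brief is the point.
    RED_TEAM_BRIEF = (
        "You are an adversarial reviewer. Red-team the recommendation below. "
        "Be blunt. Challenge assumptions, scope, and commercial framing. "
        "Do not soften your conclusions."
    )

    def call_model(model: str, prompt: str) -> str:
        """Placeholder: wire this to your own provider client."""
        raise NotImplementedError

    def red_team(recommendation: str, model: str = "gemini-3.1-pro") -> str:
        # Send the other family's output framed for dissent, not agreement.
        return call_model(model, f"{RED_TEAM_BRIEF}\n\n{recommendation}")

The framing matters as much as the model choice: without an explicit adversarial brief, a second model tends to harmonize just like the first five did.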

What Gemini Said

The response opened with: "BLUF: Path B is a company-killer."

The argument was evidence-based and structured:

  • Scope realism: A custom Rust agent, cgroup v2 isolation, three-tier metering pipeline, network namespaces, Perforce tenanting, capability scheduler, artifact storage, and UI — that's a 6-9 month project for 3-4 senior engineers, not three weeks for a part-time engineer who is also handling an engine migration and fleet operations.
  • Commercial viability: "Why would a rational game developer entrust their source code and critical build pipeline to a brand-new service run by two people on a single, non-redundant machine? Your Day 1 offering will be less reliable than a self-hosted Jenkins instance." The implicit argument: infrastructure isn't our edge; if we want to enter this market, reselling AWS CodeBuild or partnering with an existing service would test demand without the fatal engineering overhead.
  • Operational readiness: "Single Threadripper machine. No staging. No on-call. This is not a product — this is a homelab experiment."
  • Metering risk: "Billing errors are the fastest way to annihilate customer trust. Without staging and rigorous testing, you will inevitably mis-bill customers. Over-billing poisons your brand. Under-billing kills your margins."

Gemini closed with a synthesis — Path A.1:

Execute Path A immediately. While implementing it, add basic instrumentation — log build times, resource usage, failure rates. Treat Luxedeum as "Tenant Zero." This costs you almost nothing and begins the data collection you would need for any future commercial offering. After — and only after — Fob1943 ships and generates revenue, evaluate if a commercial CI service is a viable second product, with proper funding and staffing.

Why This Works

Two things made the adversarial review valuable:

1. Cross-model dissent is qualitatively different from same-model dissent. Gemini doesn't share Claude's training priors, reinforcement tuning, or subtle tendencies toward ambitious scoping. When we ask Gemini to red-team a Claude answer, we get a genuine outside perspective — not just a different phrasing of the same intuitions.

2. Cost asymmetry makes it cheap insurance. We'd already spent roughly $4 dispatching five Claude Opus agents to analyze this decision. Gemini's red-team added $0.02 — half a percent of the analysis budget — and rewrote the conclusion. If we'd committed to Path B, we'd have burned three weeks of engineering that would have delivered, at best, a "buggy, insecure scaffold not fit for a single external user."

The Three Thresholds

The deeper pattern we're codifying from this episode is a three-threshold review stack:

  • Threshold 3 — Deep analysis. Frontier models (Claude Opus, Gemini Ultra) produce primary architecture, strategy, and ADR-grade decisions. Cost per query: $0.50–$3.
  • Threshold 2 — Adversarial dissent. A different model family red-teams the Threshold 3 output, pressure-testing assumptions, scope, and commercial framing. Cost per query: $0.02–$0.15.
  • Threshold 1 — Verification. Small, fast models (Haiku, local Ollama) do a final sanity pass for missed facts, obvious errors, and style. Cost per query: $0 (local) to $0.01.

Threshold 1 is lower than Threshold 2 — it's the cheap final read-through before shipping. Local Ollama on a workstation makes it literally free.

The rule: never ship a Threshold 3 recommendation without running it past Threshold 2 first. The cost is trivial. The downside of skipping it is what we nearly did here: committing three weeks of engineering on the strength of a well-written but ambitiously scoped single-model analysis.

With a Gemini Ultra seat license, even Threshold 3 cross-model review becomes near-zero marginal cost under the included quota. That turns "get a second deep opinion" from an explicit spend decision into the default. Good-hygiene infrastructure should encode the default; don't leave adversarial review as something an engineer has to remember to request.
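
Encoding that default can be as small as a gate that fails closed. Here's a sketch, with hypothetical names throughout: a Threshold 3 recommendation simply cannot be marked shippable until a model from a different family has red-teamed it.

    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical enforcement gate for the review stack. The invariant:
    # no Threshold 3 output ships without a Threshold 2 pass from a
    # different model family.
    FAMILY = {
        "claude-opus": "anthropic",
        "claude-haiku": "anthropic",
        "gemini-3.1-pro": "google",
        "gemini-ultra": "google",
    }

    @dataclass
    class Recommendation:
        text: str
        produced_by: str                      # Threshold 3 model
        red_teamed_by: Optional[str] = None   # Threshold 2 model, if any

    def assert_shippable(rec: Recommendation) -> None:
        if rec.red_teamed_by is None:
            raise RuntimeError("No Threshold 2 review: refusing to ship.")
        if FAMILY[rec.red_teamed_by] == FAMILY[rec.produced_by]:
            raise RuntimeError("Red team shares the author's model family.")

The specifics don't matter; what matters is that skipping adversarial review becomes an error, not an oversight.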

The Decision

We're executing Path A. Four hours of work today: dotnet publish HordeServer and HordeAgent from the UE 5.5 source tree, two systemd units, MongoDB featureCompatibilityVersion pinned to 6.0, and a minimal BuildGraph skeleton for Fob1943.
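
For flavor, here's the shape of one of those units. This is an illustrative sketch only: the install path, service user, MongoDB unit name, and dotnet invocation below are placeholders, not our actual configuration.

    # /etc/systemd/system/horde-server.service -- illustrative sketch;
    # paths, user, and unit names are placeholders
    [Unit]
    Description=Epic Horde CI/CD server
    After=network-online.target mongod.service
    Wants=network-online.target

    [Service]
    User=horde
    WorkingDirectory=/opt/horde/server
    ExecStart=/usr/bin/dotnet /opt/horde/server/HordeServer.dll
    Restart=on-failure
    RestartSec=5

    [Install]
    WantedBy=multi-user.target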

We're adding Path A.1 telemetry: a cgroup v2 slice for builds and simple per-build resource logging to PostgreSQL. Not because we're hedging on Path B, but because internal observability is good hygiene regardless. And if customer demand for a commercial CI service ever proves out, we'll have six months of real usage data to anchor the product spec.
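
A sketch of the Tenant Zero logging, assuming psycopg2, a build_metrics table we define ourselves, and a cgroup v2 slice at a placeholder path (memory.peak also needs a reasonably recent kernel):

    import time
    import psycopg2  # assumed Postgres driver

    SLICE = "/sys/fs/cgroup/horde-builds.slice"  # placeholder slice path

    def cpu_usage_usec() -> int:
        # cgroup v2 cpu.stat holds "usage_usec <n>" among its key/value lines
        with open(f"{SLICE}/cpu.stat") as f:
            for line in f:
                key, value = line.split()
                if key == "usage_usec":
                    return int(value)
        raise RuntimeError("usage_usec missing from cpu.stat")

    def peak_memory_bytes() -> int:
        # memory.peak is the slice's memory high-water mark
        with open(f"{SLICE}/memory.peak") as f:
            return int(f.read())

    def log_build(build_id: str, started: float, succeeded: bool) -> None:
        # One row per build: duration, CPU, peak memory, outcome.
        conn = psycopg2.connect(dbname="horde_telemetry")  # placeholder DSN
        with conn, conn.cursor() as cur:  # commits on clean exit
            cur.execute(
                "INSERT INTO build_metrics "
                "(build_id, duration_s, cpu_usec, peak_mem_bytes, succeeded) "
                "VALUES (%s, %s, %s, %s, %s)",
                (build_id, time.time() - started, cpu_usage_usec(),
                 peak_memory_bytes(), succeeded),
            )
        conn.close()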

Monster Horde isn't dead. It's on the other side of shipping Fob1943 and generating our first revenue. That's the right sequence.

The Meta-Point

The reason Luxedeum is building Monster Gaming in public — and the reason we document decisions like this — is that the next generation of game studios will live or die on how well they orchestrate AI-assisted work. Not on raw model access. Not on tool count. On decision discipline.

Single-model consensus is a trap. Adversarial multi-model review is cheap. Use it.


Luxedeum, LLC d/b/a Luxedeum Management Group. Monster Gaming is our AI-powered game development platform. monstergaming.ai