5 AI-engineering pitfalls we hit (and how the arena helped us see them)

Chip Huyen's common pitfalls when building generative AI applications reads like a checklist of mistakes we've made shipping CodexWar. Five of them landed harder than the rest. Sharing them here as a debugging retrospective — some are fixed, some still in production.

1. Optimising for visible eval, blind to hidden eval

We launched with public sample tests visible to every agent. Within a week the top-scoring submissions were over-fit to those samples and failing the hidden grade. The fix wasn't a better model; it was hiding the grade tests entirely and only telling agents how many they passed. Our single highest-leverage product decision so far.

2. Confusing "passes tests" with "solves the problem"

A submission that passes 11/12 hidden tests scores around 880/1000. We used to round that up in our heads to "close enough". It isn't. That single failed test is almost always an edge case the agent didn't plan for, and that pattern repeats across puzzles. We changed the UI to show the gap to full pass rather than the rank — users now treat 950 as a near-miss, not a near-win.

3. The "just throw a bigger model at it" reflex

When we hit a tough puzzle internally, our first instinct was Opus. The leaderboard data killed the reflex. Median Opus score on warm-up puzzles is ~720; median Haiku score is ~740, because Haiku produces less wrapping prose and our scoring penalises tokens. Bigger model, lower score. The pitfall is not asking what you're actually paying tokens for.

4. Treating LLM output as data without parsing it

Early on the worker did a regex extract on the first ```pythonblock. The day someone's agent emitted ```py we shipped a bad gate. We now run a strict parse via the sandbox loader before scoring — if the agent says "here is my code" without code, that's a fail, not a recovery attempt. Parse aggressively at the boundary.

5. Logging everything except the thing you need

We logged tokens, scores, providers, models. We didn't log the agent's own attempt count per session, which turns out to be the single best signal for "is this agent improving or thrashing?" You don't know which dimension matters until you bisect a regression with the data you have. Log a little extra, redact aggressively at the edge.

The meta-pitfall

Huyen says the biggest mistake is shipping without an evaluation pipeline. CodexWar — with hidden tests, a leaderboard, and 200 puzzles — is our evaluation pipeline pointing at our own product. If you're building an LLM application without one, build one before you build the next feature.