RedlineBench evaluates AI models on a single, objective task: reviewing a real residential architectural drawing set and finding coordination errors, specification conflicts, code concerns, and omissions — the kind that generate RFIs and change orders.
Net score after penalties for incorrect findings. Maximum possible: points.
| Rank | Model | Net Score | Score / Max | % Recall | Incorrect (Penalty) | Items Flagged | Cost / run |
|---|---|---|---|---|---|---|---|
Each bar shows that model's score as a percentage of the category's total possible points — making it easy to compare models across categories of different sizes.
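The normalization behind the bars can be sketched in a few lines. This is an illustrative assumption about the computation (simple per-category totals), not the benchmark's published code:

```python
def category_percentage(scores_by_category, max_by_category):
    """Express each category's raw score as a percentage of that
    category's total possible points, so categories of different
    sizes plot on the same 0-100 scale."""
    return {cat: 100.0 * scores_by_category[cat] / max_by_category[cat]
            for cat in max_by_category}
```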
Every scored issue across all models. Hover a cell for the issue description and score.
Four representative issues illustrating the range of difficulty and what AI models tend to find — or miss.
The specifications call for Andersen 400 Series windows, while the window schedule lists Marvin Essential (Ultrex) products. These are completely different manufacturers with different rough-opening dimensions, frame profiles, and weather-sealing details. Every window detail in the A4.10–A4.13 series would need to be reconciled before procurement.
The Level 2 guest suite sits directly above the garage. IRC requires 5/8" Type X gypsum on the garage ceiling when habitable space is above. Floor assembly F-1 — the floor between these levels — specifies standard 1/2" gypsum board. This is a code-required life safety item, not a judgment call. Two models found it vaguely; two missed it entirely.
On enlarged plan A2.01A, keynote tag "13" appears on the drawing with no corresponding entry in the keynote legend. A contractor encountering this tag in the field cannot determine what it refers to. This is concrete and objectively verifiable — yet 6 of 7 models missed it entirely. Only Sonnet flagged something vague in this area (0.5 pts). Models largely skip annotation auditing.
The stair is L-shaped with 10 risers in the east run and 8 risers in the south run — 18 total, yielding compliant ~6-11/16" risers. Four models misread one run as the entire stair and reported impossible or irreconcilable geometry. Confident misreading of a drawing is penalized more harshly than a missed finding, reflecting the real cost of acting on wrong information.
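The riser arithmetic is easy to verify. A minimal sanity check, assuming a total rise of 120-3/8" (a hypothetical figure consistent with the ~6-11/16" risers reported; the actual floor-to-floor dimension is not given in the write-up):

```python
# Hypothetical total rise chosen to match the reported ~6-11/16" risers.
TOTAL_RISE_IN = 120.375

risers = 10 + 8                         # east run + south run = 18 total
riser_height = TOTAL_RISE_IN / risers   # 6.6875" = 6-11/16"

# IRC R311.7.5.1 caps riser height at 7-3/4"; 6-11/16" is compliant.
assert riser_height <= 7.75
```

Counting only one run (10 risers) against the full rise would imply ~12" risers, which is the kind of "impossible geometry" the four models reported.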
A repeatable, controlled evaluation using a single real project drawing set with known issues.
A real residential project drawing set (34 sheets) and specifications (11 pages) were seeded with known issues across 7 categories — including clear errors, coordination failures, code concerns, and ambiguous conditions that warrant professional review. The test set is not published to preserve benchmark integrity.
Each model receives the same prompt: an experienced-architect framing with 7 review perspectives, guidance on how to structure findings, and instructions to rate confidence. Models receive both PDFs simultaneously. No hints about seeded issues are provided.
Each response is scored against the benchmark answer key at the issue level. Every finding is classified as correct, vague, missed, incorrect, or neutral. Neutral observations — reasonable but out-of-scope or debatable — carry no score impact in either direction.
The −1 penalty is intentional: a wrong, confident conclusion is more damaging than a missed finding, and reflects the real cost of acting on misinformation.
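Putting the scoring rules together, a minimal sketch of the per-response tally. The point values for correct and vague findings are assumptions inferred from the write-up (e.g. the 0.5-pt vague flag noted in the missing-keynote example), not a published rubric:

```python
# Assumed per-finding values: only the -1 incorrect penalty is stated
# explicitly; correct/vague weights are illustrative.
VALUE = {"correct": 1.0, "vague": 0.5, "incorrect": -1.0,
         "missed": 0.0, "neutral": 0.0}

def net_score(classifications):
    """Sum per-finding values: correct earns full credit, vague partial,
    incorrect costs -1, and missed/neutral contribute nothing."""
    return sum(VALUE[c] for c in classifications)

def recall(classifications, total_seeded_issues):
    """Fraction of seeded issues a model surfaced (correct or vague)."""
    found = sum(1 for c in classifications if c in ("correct", "vague"))
    return found / total_seeded_issues
```

Note that `net_score` can go down as a model flags more items: two correct findings plus three confident misreadings nets −1, which is the behavior the penalty is designed to produce.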