Architectural AI Benchmark

Can AI read construction documents?

RedlineBench evaluates AI models on a single, objective task: reviewing a real residential architectural drawing set and finding coordination errors, specification conflicts, code concerns, and omissions — the kind that generate RFIs and change orders.

Leaderboard

Net score after penalties for incorrect findings, shown against the maximum possible points.

Scoring: 1 pt for each correctly identified issue · 0.5 pt for a vague or incomplete finding · −1 pt for each factually wrong claim made with confidence. Debatable or out-of-scope observations carry no score impact.
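
As a worked example with hypothetical counts: a model that correctly identifies 20 issues, makes 4 vague findings, and asserts 3 incorrect claims with confidence nets 20 × 1 + 4 × 0.5 − 3 × 1 = 19 points.
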
Columns: Rank · Model · Net Score · Score / Max · % Recall · Incorrect (Penalty) · Items Flagged · Cost / run

Score vs. Cost
Each model plotted by recall percentage and cost per run. Cost axis is logarithmic. Hover a point for exact values.
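
A minimal sketch of how such a plot could be reproduced offline; the model names and values below are made up, not benchmark results:

import matplotlib.pyplot as plt

# Hypothetical (cost per run in USD, recall %) pairs -- illustrative only.
models = {"Model A": (0.05, 42.0), "Model B": (0.60, 61.5), "Model C": (4.20, 74.0)}

fig, ax = plt.subplots()
for name, (cost, recall) in models.items():
    ax.scatter(cost, recall)
    ax.annotate(name, (cost, recall), textcoords="offset points", xytext=(5, 5))

ax.set_xscale("log")  # logarithmic cost axis, as in the chart
ax.set_xlabel("Cost per run (USD, log scale)")
ax.set_ylabel("Recall (%)")
plt.show()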

Performance by Category

Each bar shows that model's score as a percentage of the category's total possible points — making it easy to compare models across categories of different sizes.

Category Recall — % of Maximum
Score earned ÷ category maximum, per model. Hover for raw points.
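
The per-bar percentage is just score earned divided by the category maximum; a one-function sketch (names are illustrative):

def category_pct(earned: float, category_max: float) -> float:
    # Score earned / category maximum, expressed as a percentage.
    return 100.0 * earned / category_max

category_pct(3.5, 5.0)  # -> 70.0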

All Issues

Every scored issue across all models. Hover a cell for the issue description and score.

Found correctly (1 pt)
Found, vague (0.5 pt)
Missed

Issue Spotlights

Four representative issues illustrating the range of difficulty and what AI models tend to find — or miss.

C1 · Spec-to-Drawing · 6 of 7 found it

Window manufacturer conflict

The specifications call for Andersen 400 Series windows, while the window schedule lists Marvin Essential (Ultrex) products. These are completely different manufacturers with different rough-opening dimensions, frame profiles, and weather-sealing details. Every window detail in the A4.10–A4.13 series would need to be reconciled before procurement.

Found by: Opus · GPT-5.4 Pro · Sonnet · Gemini Flash · GPT-5 Mini · Gemini 3.1 Pro. Haiku missed this.
E7 · Code / Life Safety · Top 3 found it

Missing Type X gypsum above garage

The Level 2 guest suite sits directly above the garage. The IRC requires 5/8" Type X gypsum on the garage ceiling when habitable space is above. Floor assembly F-1 — the floor between these levels — specifies standard 1/2" gypsum board. This is a code-required life safety item, not a judgment call. Three models found it vaguely; one missed it entirely.

Found (1 pt): Opus · GPT-5.4 Pro · Gemini 3.1 Pro. Vague (0.5): Haiku · GPT-5 Mini · Sonnet. Gemini Flash missed this.
G3 · Completeness · Missed by 6 of 7

Orphaned keynote tag "13"

On enlarged plan A2.01A, keynote tag "13" appears on the drawing with no corresponding entry in the keynote legend. A contractor encountering this tag in the field cannot determine what it refers to. This is concrete and objectively verifiable — yet 6 of 7 models missed it entirely. Only Sonnet flagged something vague in this area (0.5 pt). Models largely skip annotation auditing.

Vague (0.5): Sonnet. All other models missed this.
I8 · Incorrect Finding · 4 of 7 penalized

Stair geometry misread

The stair is L-shaped with 10 risers in the east run and 8 risers in the south run — 18 total, yielding compliant ~6-11/16" risers. Four models misread one run as the entire stair and reported impossible or irreconcilable geometry. Confident misreading of a drawing is penalized more harshly than a missed finding, reflecting the real cost of acting on wrong information.

Penalized (−1 each): GPT-5.4 Pro · Sonnet · Gemini Flash · GPT-5 Mini
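
The arithmetic is straightforward once both runs are counted. A sketch, assuming a floor-to-floor rise derived from the figures above (18 risers at ~6-11/16" each) and the IRC R311.7.5.1 maximum riser height of 7-3/4":

risers = 10 + 8                      # east run + south run = 18 total
riser_height = 6 + 11 / 16           # ~6.6875 in, per the drawing
total_rise = risers * riser_height   # ~120.375 in floor to floor
assert riser_height <= 7.75          # IRC R311.7.5.1 maximum riser height
# Misreading one run as the whole stair (10 risers for ~120 in of rise)
# implies ~12 in risers: the "impossible geometry" the models reported.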

Methodology

A repeatable, controlled evaluation using a single real project drawing set with known issues.

1. Test Set

A real residential project drawing set (34 sheets) and specifications (11 pages) were seeded with known issues across 7 categories — including clear errors, coordination failures, code concerns, and ambiguous conditions that warrant professional review. The test set is not published to preserve benchmark integrity.

2. Prompt

Each model receives the same prompt: an experienced-architect framing with 7 review perspectives, guidance on how to structure findings, and instructions to rate confidence. Models receive both PDFs simultaneously. No hints about seeded issues are provided.
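
The exact prompt text is not published; purely as an illustration, a structured finding of the kind the prompt asks for might look like this (all field names are hypothetical):

# Hypothetical shape of a single structured finding.
finding = {
    "sheet": "A2.01A",            # where the issue was observed
    "category": "Completeness",   # one of the 7 review perspectives
    "description": "Keynote tag 13 has no entry in the keynote legend.",
    "confidence": "high",         # models are instructed to rate confidence
}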

3. Scoring

Each response is scored against the benchmark answer key at the issue level. Every finding is classified as correct, vague, missed, incorrect, or neutral. Neutral observations — reasonable but out-of-scope or debatable — carry no score impact in either direction.
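
A minimal sketch of that aggregation, using the rubric weights below; the labels follow the five classifications named above, while the matching procedure itself is not published:

WEIGHTS = {"correct": 1.0, "vague": 0.5, "missed": 0.0, "incorrect": -1.0, "neutral": 0.0}

def net_score(classifications: list[str]) -> float:
    # One label per answer-key issue, plus one per extra finding the
    # model volunteered; neutral observations never move the score.
    return sum(WEIGHTS[c] for c in classifications)

net_score(["correct", "correct", "vague", "incorrect", "neutral"])  # -> 1.5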

Scoring Rubric

1.0 Found and correctly described
0.5 Found but vague or incomplete
0.0 Missed entirely
−1.0 Incorrect confident finding

The −1 penalty is intentional: a wrong, confident conclusion is more damaging than a missed finding, and reflects the real cost of acting on misinformation.

Issue Categories