harsh-critic Performance: Baseline → v1 → v2

Legend: Baseline (OMC critic); harsh-critic v1; harsh-critic v2 "Even Harsher" (targets)

Composite Score

Overall Review Quality

Composite aggregates true-positive rate, coverage, perspective breadth, and evidence quality into a single benchmark score.
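The page does not say how the four components are weighted. As a minimal sketch, assuming an unweighted mean of components expressed as fractions in [0, 1] (the field names, the `composite_score` helper, and the equal weights are all hypothetical), the aggregation could look like this:

```python
from dataclasses import dataclass


@dataclass
class ReviewMetrics:
    # Hypothetical field names; each component is a fraction in [0, 1].
    true_positive_rate: float
    coverage: float
    perspective_breadth: float
    evidence_quality: float


def composite_score(m: ReviewMetrics, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted mean of the four components (equal weights are an assumption)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    parts = (
        m.true_positive_rate,
        m.coverage,
        m.perspective_breadth,
        m.evidence_quality,
    )
    return sum(w * p for w, p in zip(weights, parts))
```

With a different weighting the same inputs would yield a different composite, so scores are only comparable when computed under one scheme.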

Label	Reviewer	Composite score	Delta
Baseline	OMC critic	15.1%	reference
v1	harsh-critic v1	22.1%	+8.3 pp (sonnet)
v2 actual	harsh-critic v2 + Realist Check	58.6%	+43.5 pp (opus)

Primary Metrics

Key Performance Indicators

All metrics are higher-is-better except False Positive Rate (shown separately below).

⚠ False Positive Rate: lower is better. On Opus, v2's FPR rose to 78.7%, meaning the model generates more spurious findings alongside its much higher detection rate. This is treated as an acceptable tradeoff because the gains in detection (51.8% true positive rate, 62.5% missing coverage) outweigh the added noise. The Realist Check and Self-Audit phases help calibrate severity on the findings that matter.

Precision

False Positive Rate (Lower is Better)

FPR measures how often the reviewer flags items that are not genuine issues. v2 on Opus trades higher FPR for dramatically better detection. The Self-Audit and Realist Check phases help calibrate what survives into the final report.
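To make the metric concrete, here is a small sketch that reads "how often the reviewer flags items that are not genuine issues" as the share of flagged items that have no match in a ground-truth set. That reading, the set-based matching, and the function name are assumptions; the benchmark's actual scoring may differ.

```python
def false_positive_rate(flagged: set[str], genuine: set[str]) -> float:
    """Share of flagged items that are not genuine issues (assumed definition)."""
    if not flagged:
        return 0.0  # nothing flagged, so nothing spurious
    spurious = flagged - genuine
    return len(spurious) / len(flagged)


# Hypothetical example: four findings flagged, only one matches a known issue.
flagged = {"sql-injection", "unused-import", "race-condition", "magic-number"}
genuine = {"sql-injection", "missing-rollback-plan"}
print(false_positive_rate(flagged, genuine))  # 0.75
```

Under this definition a reviewer can raise both its true positive rate and its FPR at the same time by flagging more items overall, which matches the tradeoff described above.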

Domain Breakdown

Score by Review Domain

v2 on Opus improved all three domains. Code showed the largest absolute gain (+51.7 pp), while Plan, the weakest domain in v1, showed the strongest relative improvement.

Domain	Baseline	v2 actual	Delta
💻 Code	9.2%	60.8%	+51.7 pp
🔬 Analysis	25.6%	60.1%	+34.5 pp
📋 Plan	14.2%	55.4%	+41.2 pp
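The Delta column is a difference in percentage points between the v2 score and the baseline. The sketch below recomputes it from the rounded values in the table and also reports the ratio of v2 to baseline. The `gains` helper is illustrative only. Per-domain v1 scores are not listed here, so gains relative to v1 cannot be recomputed this way.

```python
# Baseline and v2 scores (percent), copied from the table above.
scores = {
    "Code": (9.2, 60.8),
    "Analysis": (25.6, 60.1),
    "Plan": (14.2, 55.4),
}


def gains(baseline: float, v2: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, v2-to-baseline ratio)."""
    return round(v2 - baseline, 1), round(v2 / baseline, 2)


for domain, (baseline, v2) in scores.items():
    delta_pp, ratio = gains(baseline, v2)
    print(f"{domain}: +{delta_pp} pp, {ratio}x baseline")
```

For Code, the rounded inputs give +51.6 pp rather than the reported +51.7 pp. This is presumably because the table's delta was computed from unrounded scores, though the source does not confirm it.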

Changelog

What's New in v2 + Realist Check

v2 added plan-specific protocols and self-audit. The Realist Check adds pragmatic severity calibration with mandatory mitigation rationale for every downgrade.
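The "mandatory mitigation rationale for every downgrade" rule can be sketched as a guard on severity changes. The severity labels, their ordering, and the `recalibrate` function are hypothetical; the source states only that a downgrade must come with a rationale.

```python
# Hypothetical severity scale, ordered from least to most severe.
SEVERITIES = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def recalibrate(original: str, proposed: str, mitigation_rationale: str = "") -> str:
    """Return the new severity; refuse a downgrade that lacks a rationale."""
    is_downgrade = SEVERITIES.index(proposed) < SEVERITIES.index(original)
    if is_downgrade and not mitigation_rationale.strip():
        raise ValueError("downgrade requires a mitigation rationale")
    return proposed


# A downgrade with a stated mitigation is accepted.
print(recalibrate("CRITICAL", "HIGH", "feature flag limits exposure to internal users"))
```

Upgrades and unchanged severities pass through without a rationale in this sketch; only downgrades are gated.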

🗺

Plan-Specific Investigation Protocol

Pre-mortem analysis, assumption enumeration, and ambiguity mapping. Tailored phase checklist for plan review distinct from code review.

Plan domain

👥

Plan-Specific Perspectives

Three new lenses: executor (implementation feasibility), stakeholder (business risk), and skeptic (worst-case reasoning).

Plan domain

🧠

Metacognitive Self-Audit

Confidence gating before each finding is committed. The reviewer must justify its certainty, which is intended to reduce false positives.

Precision
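One way to picture confidence gating is as a filter that drops any finding lacking a justification or falling below a confidence threshold. This is a sketch only: the `Finding` fields, the numeric confidence scale, and the 0.7 threshold are assumptions, not details from the harsh-critic prompt.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    summary: str
    confidence: float    # reviewer's self-rated certainty in [0, 1] (assumed scale)
    justification: str   # why the reviewer believes the finding is real


def self_audit(findings: list[Finding], min_confidence: float = 0.7) -> list[Finding]:
    """Keep only findings that are justified and meet the confidence threshold."""
    return [
        f for f in findings
        if f.justification.strip() and f.confidence >= min_confidence
    ]
```

A stricter threshold trades recall for precision, so the gate's setting directly affects both the true positive rate and the FPR.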

⚡

Adaptive Harshness Escalation

Starts at THOROUGH mode, escalates to ADVERSARIAL when the target looks clean — prevents rubber-stamping on deceptively tidy artifacts.

All domains
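The escalation rule can be sketched as a small state transition. The source names the two modes but does not define "looks clean"; the finding-count threshold below is a hypothetical stand-in for that judgment.

```python
from enum import Enum


class Mode(Enum):
    THOROUGH = "THOROUGH"
    ADVERSARIAL = "ADVERSARIAL"


def next_mode(current: Mode, findings_so_far: int, clean_threshold: int = 2) -> Mode:
    """Escalate when a THOROUGH pass turns up suspiciously few findings.

    `clean_threshold` is an assumed proxy for "the target looks clean".
    """
    if current is Mode.THOROUGH and findings_so_far < clean_threshold:
        return Mode.ADVERSARIAL
    return current
```

In this sketch the transition is one-way: once in ADVERSARIAL mode the reviewer stays there for the rest of the pass.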

📌

Plan Evidence Redefinition

Evidence for plan findings now accepts section/assumption references instead of requiring file:line citations, which were inapplicable to documents.

Plan domain
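As an illustration of domain-dependent evidence rules, the following sketch validates citations with two regular expressions. The exact reference formats ("Section 3.2", "Assumption A4", `path:line`) and the choice to keep accepting file:line citations for plans are assumptions made for the example.

```python
import re

# Assumed citation shapes, e.g. "src/auth.py:42" or "src/auth.py:42-57".
FILE_LINE = re.compile(r"^[\w./-]+:\d+(-\d+)?$")
# Assumed plan references, e.g. "Section 3.2" or "Assumption A4".
PLAN_REF = re.compile(r"^(section|assumption)\s+\S.*$", re.IGNORECASE)


def evidence_is_valid(domain: str, evidence: str) -> bool:
    """Check that a finding's evidence fits the citation style for its domain."""
    evidence = evidence.strip()
    if domain == "plan":
        # Plans are documents, so section/assumption references are enough.
        return bool(PLAN_REF.match(evidence) or FILE_LINE.match(evidence))
    return bool(FILE_LINE.match(evidence))
```

For example, `evidence_is_valid("plan", "Section 3.2")` passes, while the same string is rejected for the code domain, where a file:line citation is still expected.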

🎯

Authority Framing

Persona anchored as senior engineering lead, not a generic critic — improves tone calibration and reduces hedge inflation that diluted v1 scores.

All domains