harsh-critic v2 "Even Harsher" (targets)
Composite Score
Overall Review Quality
Composite aggregates true-positive rate, coverage, perspective breadth, and evidence quality into a single benchmark score.
Baseline
OMC critic
15.1%
— reference
v1
harsh-critic v1
22.1%
+8.3 pp (sonnet)
v2 actual
harsh-critic v2 + Realist Check
58.6%
+43.5 pp (opus)
Primary Metrics
Key Performance Indicators
All metrics are higher-is-better except False Positive Rate (shown separately below).
⚠
False Positive Rate — lower is better.
v2 on Opus increased FPR to 78.7% — the model generates more spurious findings alongside its dramatically higher detection rate. This is an acceptable tradeoff: 51.8% true positive rate + 62.5% missing coverage far outweighs the noise. The Realist Check and Self-Audit phases help calibrate severity on the findings that matter.
Precision
False Positive Rate (Lower is Better)
FPR measures how often the reviewer flags items that are not genuine issues. v2 on Opus trades higher FPR for dramatically better detection. The Self-Audit and Realist Check phases help calibrate what survives into the final report.
Domain Breakdown
Score by Review Domain
v2 on Opus dramatically improved all three domains. Code showed the largest absolute gain (+51.7 pp), while Plan — the weakest domain in v1 — jumped to the strongest relative improvement.
| Domain |
Baseline |
v2 actual |
Delta |
v2 performance |
|
💻
Code
|
9.2% |
60.8% |
+51.7 pp |
|
|
🔬
Analysis
|
25.6% |
60.1% |
+34.5 pp |
|
|
📋
Plan
|
14.2% |
55.4% |
+41.2 pp |
|
Changelog
What's New in v2 + Realist Check
v2 added plan-specific protocols and self-audit. The Realist Check adds pragmatic severity calibration with mandatory mitigation rationale for every downgrade.
🗺
Plan-Specific Investigation Protocol
Pre-mortem analysis, assumption enumeration, and ambiguity mapping. Tailored phase checklist for plan review distinct from code review.
Plan domain
👥
Plan-Specific Perspectives
Three new lenses: executor (implementation feasibility), stakeholder (business risk), and skeptic (worst-case reasoning).
Plan domain
🧠
Metacognitive Self-Audit
Confidence gating before each finding is committed. Reviewer must justify certainty, reducing false positives.
Precision
⚡
Adaptive Harshness Escalation
Starts at THOROUGH mode, escalates to ADVERSARIAL when the target looks clean — prevents rubber-stamping on deceptively tidy artifacts.
All domains
📌
Plan Evidence Redefinition
Evidence for plan findings now accepts section/assumption references instead of requiring file:line citations, which were inapplicable to documents.
Plan domain
🎯
Authority Framing
Persona anchored as senior engineering lead, not a generic critic — improves tone calibration and reduces hedge inflation that diluted v1 scores.
All domains