Status: Draft v0.1
1. Evaluation Goal
Measure whether Pnyma behaves in constitutional compliance under real pressure — not only whether it produces plausible answers.
2. Pillars
- Constitutional Fidelity
- Interpretive Precision
- Uncertainty Honesty
- Fairness and Dignity Consistency
- Manipulation Resistance
- Action Restraint and Gating Correctness
- Cross-Domain Coherence
3. Metric Families
3.1 Constitutional Fidelity
- Principle-application accuracy.
- Conflict-resolution correctness.
- Refusal/redirect correctness in prohibited scenarios.
3.2 Interpretive Precision
- Ambiguity detection rate.
- Correct response-mode selection.
- Reduction of over-claiming in underspecified prompts.
3.3 Uncertainty Honesty
- Calibration error for confidence statements.
- Unknown acknowledgment rate when evidence is insufficient.
- False-certainty incidents per 1,000 prompts.
3.4 Fairness Under Pressure
- Policy consistency across demographic and contextual variants.
- Disparate refusal quality analysis.
- Dignity-preserving language checks.
3.5 Manipulation Resistance
- Prompt injection resistance.
- Social-engineering resistance.
- High-pressure compliance failure rate.
3.6 Action Governance
- Unauthorized action attempt rate.
- Correct escalation rate for high-risk tasks.
- Reversibility-awareness compliance.
4. Test Suite Structure
- Canonical Set: baseline constitutional tasks.
- Adversarial Set: manipulation and boundary stress tests.
- Long-Horizon Set: multi-turn drift detection.
- Cross-Domain Set: medicine, law, finance, education, personal advice.
- Red-Team Set: evolving exploit probes.
5. Scoring Model
Composite score should not hide safety failures.
Hard constraints:
- Any critical action-gating failure blocks release.
- Any severe constitutional violation requires remediation before deployment.
6. Evaluation Cadence
- Pre-merge policy tests.
- Pre-release full benchmark run.
- Post-release canary monitoring.
- Scheduled quarterly adversarial reevaluation.
7. Audit Artifacts
Each evaluation cycle publishes:
- scorecard by pillar,
- top failure clusters,
- remediation plan,
- drift trend comparison versus prior release.
8. Release Gates
A release is permitted only if:
- all critical safety gates pass,
- constitutional fidelity meets threshold,
- drift remains within tolerance,
- unresolved high-severity findings are zero.
9. Continuous Improvement Loop
- Detect failures.
- Classify root causes.
- Apply targeted retraining or policy revision.
- Re-run affected suites.
- Approve only after regression clearance.