Architecture

Evaluation & Benchmarking Framework

Status: Draft v0.1

1. Evaluation Goal

Measure whether Pnyma behaves in constitutional compliance under real pressure — not only whether it produces plausible answers.

2. Pillars

  1. Constitutional Fidelity
  2. Interpretive Precision
  3. Uncertainty Honesty
  4. Fairness and Dignity Consistency
  5. Manipulation Resistance
  6. Action Restraint and Gating Correctness
  7. Cross-Domain Coherence

3. Metric Families

3.1 Constitutional Fidelity

  • Principle-application accuracy.
  • Conflict-resolution correctness.
  • Refusal/redirect correctness in prohibited scenarios.

3.2 Interpretive Precision

  • Ambiguity detection rate.
  • Correct response-mode selection.
  • Reduction of over-claiming in underspecified prompts.

3.3 Uncertainty Honesty

  • Calibration error for confidence statements.
  • Unknown acknowledgment rate when evidence is insufficient.
  • False-certainty incidents per 1,000 prompts.

3.4 Fairness Under Pressure

  • Policy consistency across demographic and contextual variants.
  • Disparate refusal quality analysis.
  • Dignity-preserving language checks.

3.5 Manipulation Resistance

  • Prompt injection resistance.
  • Social-engineering resistance.
  • High-pressure compliance failure rate.

3.6 Action Governance

  • Unauthorized action attempt rate.
  • Correct escalation rate for high-risk tasks.
  • Reversibility-awareness compliance.

4. Test Suite Structure

  • Canonical Set: baseline constitutional tasks.
  • Adversarial Set: manipulation and boundary stress tests.
  • Long-Horizon Set: multi-turn drift detection.
  • Cross-Domain Set: medicine, law, finance, education, personal advice.
  • Red-Team Set: evolving exploit probes.

5. Scoring Model

Composite score should not hide safety failures.

Hard constraints:

  • Any critical action-gating failure blocks release.
  • Any severe constitutional violation requires remediation before deployment.

6. Evaluation Cadence

  • Pre-merge policy tests.
  • Pre-release full benchmark run.
  • Post-release canary monitoring.
  • Scheduled quarterly adversarial reevaluation.

7. Audit Artifacts

Each evaluation cycle publishes:

  • scorecard by pillar,
  • top failure clusters,
  • remediation plan,
  • drift trend comparison versus prior release.

8. Release Gates

A release is permitted only if:

  • all critical safety gates pass,
  • constitutional fidelity meets threshold,
  • drift remains within tolerance,
  • unresolved high-severity findings are zero.

9. Continuous Improvement Loop

  1. Detect failures.
  2. Classify root causes.
  3. Apply targeted retraining or policy revision.
  4. Re-run affected suites.
  5. Approve only after regression clearance.