Architecture

Safety Case & Risk Model

Status: Draft v0.1

1. Safety Case Claim

Pnyma is safe enough for staged deployment only when constitutional governance, action gating, and audit controls demonstrably reduce the risk of high-severity harm to within the acceptable threshold defined for each maturity stage.

2. Safety Argument Structure

Claim A

Constitutional controls reduce misaligned output behavior.

Claim B

Action governance prevents unauthorized or unsafe execution.

Claim C

Auditability and drift controls maintain long-term safety integrity.

Each claim must be supported by test evidence and operational monitoring.

3. Threat Model

Primary threat classes:

  • Prompt manipulation and jailbreak attempts.
  • Adversarial user coercion.
  • Policy evasion through ambiguity.
  • Tool-chain abuse via indirect prompts.
  • Constitutional drift over iterative updates.
  • Over-trust due to persuasive but uncertain outputs.

4. Failure Modes

  1. False certainty in high-impact contexts.
  2. Incorrect refusal (over-refusal or under-refusal).
  3. Unsafe operational detail leakage.
  4. Unauthorized tool execution.
  5. Memory misuse or sensitive retention.
  6. Inconsistent fairness under pressure.

5. Mitigation Controls

  • Constitutional rule engine with precedence logic.
  • Multi-pass guard checks for sensitive domains.
  • Mandatory uncertainty disclosure thresholds.
  • Strict action authorization gates.
  • Bounded memory with retention policy enforcement.
  • Continuous adversarial evaluation.
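
To make the first control more concrete, the following is a minimal sketch of how precedence logic in a rule engine might resolve conflicting rules. The `Rule` type, the `resolve` function, the verdict strings, and the convention that a lower number means higher precedence are all illustrative assumptions, not part of the Pnyma specification.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    """A hypothetical constitutional rule (names and fields are assumptions)."""
    name: str
    precedence: int                   # assumed convention: lower number wins
    applies: Callable[[str], bool]    # predicate over the request text
    verdict: str                      # e.g. "refuse", "caveat", "allow"


def resolve(rules: List[Rule], request_text: str, default: str = "allow") -> str:
    """Return the verdict of the highest-precedence rule that applies."""
    matching = [rule for rule in rules if rule.applies(request_text)]
    if not matching:
        # No rule matched, so fall back to the caller-supplied default.
        return default
    return min(matching, key=lambda rule: rule.precedence).verdict


# Illustrative rules only; real constitutional rules would be far richer.
example_rules = [
    Rule("general_helpfulness", 10, lambda text: True, "allow"),
    Rule("no_weapon_details", 1, lambda text: "weapon" in text.lower(), "refuse"),
]
```

With these example rules, a request mentioning a weapon matches both rules, and the lower-numbered safety rule takes priority over the general helpfulness rule.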

6. Action-Gating Logic

Action requests require:

  1. verified user intent,
  2. scope validation,
  3. risk classification,
  4. reversibility analysis,
  5. explicit permission check,
  6. post-action audit record.

A failure at any gate results in denial or escalation.
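
The gate sequence above can be sketched as a short evaluation function. This is a hedged illustration only: the `ActionRequest` fields, the risk labels, and in particular the mapping from each failed gate to deny or escalate are assumptions, since this document does not specify which failures deny and which escalate. The audit record is written after the decision, whatever the outcome.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ESCALATE = "escalate"


@dataclass
class ActionRequest:
    """Hypothetical request shape; every field name is an assumption."""
    user_intent_verified: bool
    in_scope: bool
    risk_level: str            # assumed labels: "low", "medium", "high"
    reversible: bool
    permission_granted: bool


def evaluate(request: ActionRequest, audit_log: List[dict]) -> Decision:
    """Run gates 1 to 5 in order, then append the audit record (gate 6)."""
    # (gate name, passed?, outcome on failure). The outcome column is an
    # assumed policy: risk and reversibility failures go to a human.
    gates = [
        ("verified_intent", request.user_intent_verified, Decision.DENY),
        ("scope_validation", request.in_scope, Decision.DENY),
        ("risk_classification", request.risk_level in ("low", "medium"), Decision.ESCALATE),
        ("reversibility", request.reversible, Decision.ESCALATE),
        ("explicit_permission", request.permission_granted, Decision.DENY),
    ]
    decision = Decision.ALLOW
    failed_gate: Optional[str] = None
    for name, passed, on_failure in gates:
        if not passed:
            decision, failed_gate = on_failure, name
            break  # the first failing gate determines the outcome
    audit_log.append({"decision": decision.value, "failed_gate": failed_gate})
    return decision
```

For example, a request that passes every check except reversibility would be escalated rather than executed, and the audit log would name `reversibility` as the failed gate.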

7. Deployment Maturity Stages

Stage 1 — Reasoning Only

  • No external actions.
  • Low-risk informational tasks.

Stage 2 — Guided Assistance

  • Recommendations and drafting only.
  • No autonomous execution.

Stage 3 — Bounded Memory

  • Retention in limited domains with strict controls.

Stage 4 — Permissioned Action

  • Narrow tool execution in reversible environments.

Stage 5 — Arbitration Core

  • Multi-agent oversight responsibilities with human governance.

Progression requires passing stage-specific safety gates.
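
The stage ladder and its progression rule could be represented as follows. This is a sketch under stated assumptions: the names `Stage`, `next_stage`, and `tool_execution_allowed` are hypothetical, stages are assumed to be cumulative, and advancing one stage at a time is an assumed policy rather than something this document states.

```python
from enum import IntEnum
from typing import Set


class Stage(IntEnum):
    REASONING_ONLY = 1
    GUIDED_ASSISTANCE = 2
    BOUNDED_MEMORY = 3
    PERMISSIONED_ACTION = 4
    ARBITRATION_CORE = 5


def next_stage(current: Stage, gates_passed: Set[Stage]) -> Stage:
    """Advance by at most one stage, and only if that stage's gate passed.

    `gates_passed` holds the stages whose stage-specific safety gate has
    been passed (an assumed representation).
    """
    if current is Stage.ARBITRATION_CORE:
        return current  # already at the final stage
    candidate = Stage(current + 1)
    return candidate if candidate in gates_passed else current


def tool_execution_allowed(stage: Stage) -> bool:
    """Tool execution first appears at Stage 4 (Permissioned Action)."""
    return stage >= Stage.PERMISSIONED_ACTION
```

Under this sketch, passing the Stage 3 gate while still at Stage 1 does not allow a jump: the system stays at Stage 1 until the Stage 2 gate is also passed.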

8. Allowed vs. Forbidden Deployments

Allowed examples:

  • educational assistance,
  • high-trust drafting,
  • bounded decision support with human oversight.

Forbidden examples (until controls are proven):

  • unrestricted autonomous transactions,
  • unsupervised critical infrastructure control,
  • hidden background agent execution.

9. Constitutional Drift Prevention

  • Immutable policy baselines.
  • Drift score monitoring per release.
  • Forced rollback on critical regression.
  • Governance sign-off for constitutional amendments.
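
One possible way to operationalize drift monitoring is sketched below. It assumes, purely for illustration, that drift is measured as the fraction of baseline probe prompts whose verdict changed in a candidate release, that a subset of probes is marked critical, and that a 5% drift budget applies. None of these definitions, names, or thresholds come from this document.

```python
from typing import Dict, Set


def drift_score(baseline: Dict[str, str], candidate: Dict[str, str]) -> float:
    """Fraction of baseline probes whose verdict differs in the candidate."""
    if not baseline:
        return 0.0
    changed = sum(
        1 for probe, verdict in baseline.items() if candidate.get(probe) != verdict
    )
    return changed / len(baseline)


def release_decision(
    baseline: Dict[str, str],
    candidate: Dict[str, str],
    critical_probes: Set[str],
    max_drift: float = 0.05,   # assumed threshold, not from the spec
) -> str:
    """Return "rollback", "needs_signoff", or "release"."""
    # Any changed verdict on a critical probe counts as a critical regression.
    critical_regression = any(
        candidate.get(probe) != baseline[probe]
        for probe in critical_probes
        if probe in baseline
    )
    if critical_regression:
        return "rollback"
    if drift_score(baseline, candidate) > max_drift:
        return "needs_signoff"
    return "release"
```

In this sketch the baseline is treated as read-only input; a real system would presumably store it in a form that cannot be modified without governance sign-off.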

10. Evidence Requirements

For each release:

  • evaluation report,
  • red-team findings,
  • unresolved risk register,
  • mitigation status,
  • deployment scope recommendation.
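
The per-release evidence list lends itself to a simple completeness check. The following sketch assumes each evidence item is tracked as a reference string (for example a document path); the `ReleaseEvidence` type and `missing_evidence` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReleaseEvidence:
    """One field per required evidence item; None or "" means not supplied."""
    evaluation_report: Optional[str] = None
    red_team_findings: Optional[str] = None
    unresolved_risk_register: Optional[str] = None
    mitigation_status: Optional[str] = None
    deployment_scope_recommendation: Optional[str] = None


def missing_evidence(evidence: ReleaseEvidence) -> List[str]:
    """List the names of evidence items that have not been supplied."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]
```

A release pipeline could refuse to proceed while `missing_evidence` returns a non-empty list.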