Failure Modes & Safe Defaults

Governance systems must define behavior when components fail. The choice between fail-closed (block operations) and fail-open (allow operations) depends on your threat model and operational requirements.

What failure modes exist?

The governance boundary can fail in several ways, each requiring defined behavior:

Boundary Unavailable

The governance boundary process is down, crashed, or unreachable. Operations cannot be evaluated.

Policy Load Failure

The policy artifact cannot be loaded, parsed, or verified, so the boundary cannot determine how to evaluate operations.

Measurement Failure

Subject measurement cannot be completed, for example because the hash cannot be computed, a file is missing, or the measurement times out.

Receipt Persistence Failure

The receipt cannot be written to the chain, for example because storage is full, an I/O error occurs, or the signing key is unavailable.

Time Attestation Failure

The timestamping authority (TSA) is unreachable or returns an error, so a trusted timestamp cannot be obtained for the receipt.

What is fail-closed?

Fail-closed blocks operations when the governance system cannot evaluate them. This prioritizes security over availability.

When to Use Fail-Closed

  • High-risk operations (financial, safety-critical)
  • Regulatory requirements mandate an audit trail
  • Untrusted execution environment
  • Zero-tolerance security posture

Tradeoffs

  • Availability impact during failures
  • Denial-of-service risk if boundary attacked
  • Cascading failures possible
  • Requires high-availability boundary
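
As an illustration of the fail-closed pattern, here is a minimal Python sketch. The names `BoundaryUnavailable`, `guarded_fail_closed`, and the two stand-in evaluators are hypothetical and not part of any documented API. The only point is that a missing verdict is treated as a denial.

```python
from typing import Callable


class BoundaryUnavailable(Exception):
    """Hypothetical error: the governance boundary could not be reached."""


def guarded_fail_closed(evaluate: Callable[[str], bool], operation: str) -> bool:
    """Return True only if the boundary explicitly allowed the operation."""
    try:
        return evaluate(operation)
    except BoundaryUnavailable:
        # No verdict could be obtained, so the operation is blocked.
        return False


def unreachable_boundary(operation: str) -> bool:
    """Stand-in evaluator that simulates a crashed boundary."""
    raise BoundaryUnavailable("boundary process is down")


def allow_all(operation: str) -> bool:
    """Stand-in evaluator that simulates a healthy, permissive boundary."""
    return True
```

With these stand-ins, `guarded_fail_closed(unreachable_boundary, "transfer")` returns `False`, while the same call with `allow_all` returns `True`.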

What is fail-open?

Fail-open allows operations to proceed when governance cannot evaluate them. This prioritizes availability over security.

When to Use Fail-Open

  • Availability-critical systems
  • Governance is advisory, not mandatory
  • Trusted execution environment
  • Compensating controls exist

Tradeoffs

  • Ungoverned operations during failure
  • Audit trail gaps
  • Potential compliance issues
  • Must track DEGRADED state
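
The fail-open counterpart might look like the following Python sketch. `Outcome`, `guarded_fail_open`, the `gap_log` list, and the `"GOVERNED"` label are assumptions made for illustration; only the `UNGOVERNED` marker comes from this page. The sketch lets the operation through but records the gap so it can be audited later.

```python
from dataclasses import dataclass
from typing import Callable, List


class BoundaryUnavailable(Exception):
    """Hypothetical error: the governance boundary could not be reached."""


@dataclass
class Outcome:
    allowed: bool
    state: str  # "GOVERNED" (assumed label) or "UNGOVERNED"


def guarded_fail_open(
    evaluate: Callable[[str], bool], operation: str, gap_log: List[str]
) -> Outcome:
    """Allow the operation when no verdict is available, and record the gap."""
    try:
        return Outcome(allowed=evaluate(operation), state="GOVERNED")
    except BoundaryUnavailable:
        # Proceed without a verdict, but leave a trace for the audit trail.
        gap_log.append(operation)
        return Outcome(allowed=True, state="UNGOVERNED")


def unreachable_boundary(operation: str) -> bool:
    """Stand-in evaluator that simulates a crashed boundary."""
    raise BoundaryUnavailable("boundary process is down")
```

An explicit denial from a reachable boundary is still honored; only the absence of a verdict is treated as permission.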

How do I configure failure behavior?

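The policy artifact defines failure behavior per failure mode. The `//` comments in the fragment below are annotations only; remove them if the artifact is parsed as strict JSON, which does not permit comments.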

"failure_policy": {
  "boundary_unavailable": "FAIL_CLOSED",
  "policy_load_failure": "FAIL_CLOSED",
  "measurement_failure": "BLOCK_AND_ALERT",
  "receipt_persistence_failure": "CONTINUE_DEGRADED",
  "time_attestation_failure": "CONTINUE_DEGRADED",

  "degraded_behavior": {
    "max_operations": 100,         // Max ungoverned ops
    "max_duration_seconds": 300,   // Max degraded time
    "on_limit_reached": "FAIL_CLOSED"
  },

  "recovery": {
    "auto_retry_interval_ms": 5000,
    "max_retries": 3,
    "on_recovery": "EMIT_RECOVERY_RECEIPT"
  }
}
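
One way the `degraded_behavior` limits could be enforced is sketched below in Python. The `DegradedBudget` class, its `admit` method, and the injectable `clock` are hypothetical. The exact semantics are also an assumption: here the duration window starts at the first degraded operation, and the operation count is checked before each new operation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DegradedBudget:
    """Tracks the limits taken from the "degraded_behavior" section."""

    max_operations: int = 100
    max_duration_seconds: float = 300.0
    on_limit_reached: str = "FAIL_CLOSED"
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    started_at: Optional[float] = None
    operations: int = 0

    def admit(self) -> str:
        """Return "CONTINUE_DEGRADED" while within budget, else on_limit_reached."""
        now = self.clock()
        if self.started_at is None:
            # Assumption: the first degraded operation starts the window.
            self.started_at = now
        elapsed = now - self.started_at
        over_count = self.operations >= self.max_operations
        over_time = elapsed > self.max_duration_seconds
        if over_count or over_time:
            return self.on_limit_reached
        self.operations += 1
        return "CONTINUE_DEGRADED"
```

Using a monotonic clock rather than wall-clock time keeps the duration limit from being affected by system clock adjustments.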

What are the degraded states?

When operating in degraded mode, the system should clearly mark the condition:

  • DEGRADED_LOCAL: TSA unavailable, using local timestamps
  • DEGRADED_MEASUREMENT: Partial measurement, some files inaccessible
  • DEGRADED_PERSISTENCE: Receipts buffered in memory, not yet persisted
  • UNGOVERNED: Boundary bypassed, no evaluation occurred
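
A sketch of how these markers might be attached to receipts, in Python. The four state names come from the list above; the `make_receipt` helper and the receipt field names (`degraded`, `degraded_state`, `reason`) are hypothetical and do not describe a documented receipt schema.

```python
from enum import Enum
from typing import Optional


class DegradedState(Enum):
    DEGRADED_LOCAL = "TSA unavailable, using local timestamps"
    DEGRADED_MEASUREMENT = "Partial measurement, some files inaccessible"
    DEGRADED_PERSISTENCE = "Receipts buffered in memory, not yet persisted"
    UNGOVERNED = "Boundary bypassed, no evaluation occurred"


def make_receipt(operation: str, state: Optional[DegradedState] = None) -> dict:
    """Build a receipt dict, marking the degraded condition when there is one."""
    receipt = {"operation": operation, "degraded": state is not None}
    if state is not None:
        receipt["degraded_state"] = state.name  # machine-readable marker
        receipt["reason"] = state.value  # human-readable explanation
    return receipt
```

Marking the condition on every receipt, rather than only logging it, keeps the degraded period visible to anyone who later inspects the chain.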

Frequently asked questions

Can I mix fail-open and fail-closed?

Yes. Configure different behaviors for different failure modes. For example: fail-closed for policy load failures, but continue-degraded for TSA failures.
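
As a rough Python sketch, a mixed configuration can be treated as a lookup from failure mode to action. The mapping mirrors the values shown in the configuration section; the `action_for` helper, and its choice to fall back to `FAIL_CLOSED` for a failure mode that is not configured, are assumptions rather than documented behavior.

```python
FAILURE_POLICY = {
    "boundary_unavailable": "FAIL_CLOSED",
    "policy_load_failure": "FAIL_CLOSED",
    "measurement_failure": "BLOCK_AND_ALERT",
    "receipt_persistence_failure": "CONTINUE_DEGRADED",
    "time_attestation_failure": "CONTINUE_DEGRADED",
}


def action_for(failure_mode: str, policy: dict = FAILURE_POLICY) -> str:
    """Look up the configured action for a failure mode.

    Assumption: an unconfigured failure mode falls back to FAIL_CLOSED,
    so an unexpected failure never silently allows operations.
    """
    return policy.get(failure_mode, "FAIL_CLOSED")
```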

How do auditors know about degraded periods?

The evidence bundle includes degraded state markers in receipts. The verifier outputs PASS_WITH_CAVEATS and lists the degraded periods with their reason codes.
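
To make this concrete, the following Python sketch shows how a verifier could derive that result from receipt markers. It assumes each receipt is a dict with a hypothetical `seq` number and an optional `degraded_state` field, and it uses `PASS` as an assumed name for the clean verdict. Signature and chain verification are out of scope for this sketch.

```python
from typing import List, Optional, Tuple


def verify(receipts: List[dict]) -> Tuple[str, List[dict]]:
    """Return (verdict, degraded_periods) for receipts ordered by sequence.

    Consecutive receipts with the same degraded state are merged into
    one period, identified by its reason code.
    """
    periods: List[dict] = []
    current: Optional[dict] = None
    for receipt in receipts:
        state = receipt.get("degraded_state")
        if state is None:
            current = None  # a fully governed receipt closes any open period
            continue
        if current is not None and current["reason_code"] == state:
            current["end"] = receipt["seq"]  # extend the open period
        else:
            current = {
                "reason_code": state,
                "start": receipt["seq"],
                "end": receipt["seq"],
            }
            periods.append(current)
    verdict = "PASS_WITH_CAVEATS" if periods else "PASS"
    return verdict, periods
```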

What if the boundary recovers mid-operation?

Emit a RECOVERY receipt marking the transition. Operations started during degraded mode remain degraded; new operations are fully governed.
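
The rule above can be sketched in Python by capturing the governance state when an operation starts rather than when it finishes. The `Boundary` class, its methods, and the receipt fields are hypothetical; only the RECOVERY receipt and the "state at start" rule come from the answer above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Boundary:
    degraded_state: Optional[str] = None
    receipts: List[dict] = field(default_factory=list)

    def begin(self, operation: str) -> dict:
        """Start an operation and pin the state that applies at start time."""
        return {"operation": operation, "degraded_state": self.degraded_state}

    def finish(self, started: dict) -> None:
        """Record the operation with the state captured at start."""
        self.receipts.append({"type": "OPERATION", **started})

    def recover(self) -> None:
        """Mark the transition back to fully governed operation."""
        if self.degraded_state is not None:
            self.receipts.append(
                {"type": "RECOVERY", "recovered_from": self.degraded_state}
            )
            self.degraded_state = None
```

An operation begun before `recover()` keeps its degraded marker even if it finishes afterwards, while any operation begun after recovery carries no marker.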