Security & Governance

Strict boundaries and safety guarantees for autonomous infrastructure operations.

The AI Safety Model

CortexOps enforces a strict separation of concerns between AI-generated recommendations and the actions taken against the cluster.

AI = Recommendation

The LLM analyzes telemetry and proposes a root cause and a suggested fix. It has zero direct access to the Kubernetes API.

OPA + Temporal = Action

The Temporal workflow engine receives the proposal, evaluates it against OPA policies, requests human approval, and only then executes the change.
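The gate ordering above can be sketched in Python. This is a minimal illustration under stated assumptions, not CortexOps code: `Proposal`, `Rejected`, `run_remediation`, and the three callables are hypothetical names, and the policy and approval checks are plain functions standing in for OPA evaluation and the human-approval step. What it shows is structural: the LLM's proposal is inert data, and only the workflow layer is handed the `execute` capability.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Proposal:
    """LLM output: plain, immutable data with no Kubernetes client attached."""
    root_cause: str
    action: str      # hypothetical action name, e.g. "rollout_restart"
    namespace: str
    target: str


class Rejected(Exception):
    """Raised when a proposal fails a gate and must not be executed."""


def run_remediation(
    proposal: Proposal,
    policy_allows: Callable[[Proposal], bool],   # stand-in for OPA evaluation
    human_approves: Callable[[Proposal], bool],  # stand-in for the approval step
    execute: Callable[[Proposal], str],          # only the workflow holds this
) -> str:
    # Gate 1: policy evaluation happens before anyone is asked to approve.
    if not policy_allows(proposal):
        raise Rejected(f"policy denied {proposal.action} in {proposal.namespace}")
    # Gate 2: explicit human approval.
    if not human_approves(proposal):
        raise Rejected("operator did not approve")
    # Both gates passed: the workflow, not the LLM, performs the mutation.
    return execute(proposal)


if __name__ == "__main__":
    p = Proposal("OOMKilled pods", "rollout_restart", "payments", "deployment/api")
    print(run_remediation(
        p,
        policy_allows=lambda x: x.namespace != "kube-system",
        human_approves=lambda x: True,
        execute=lambda x: f"executed {x.action} on {x.target}",
    ))
```

Because a denied proposal raises before `execute` is ever called, a failed gate can never produce a partial mutation.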

Deterministic Execution

Correlation heuristics and remediation actions are 100% deterministic. Given the same set of telemetry events, CortexOps will always produce the exact same incident grouping and trigger the same workflow path. There are no probabilistic models in the critical execution path.
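As an illustration of order-independent correlation, the sketch below groups events by namespace, workload, and a fixed time window, then derives each incident ID from a hash of its sorted event IDs. The grouping key, `WINDOW_SECONDS`, `Event`, and `group_incidents` are hypothetical assumptions; the actual CortexOps heuristics are not described here. The sketch demonstrates the property claimed above: shuffling the input yields identical incidents.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SECONDS = 300  # hypothetical correlation window


@dataclass(frozen=True)
class Event:
    event_id: str
    namespace: str
    workload: str
    timestamp: int  # Unix seconds


def group_incidents(events: list[Event]) -> dict[str, list[str]]:
    """Group events into incidents; the result depends only on the set of events."""
    buckets: dict[tuple[str, str, int], list[str]] = defaultdict(list)
    for e in events:
        # Fixed key that ignores arrival order: same workload, same time window.
        buckets[(e.namespace, e.workload, e.timestamp // WINDOW_SECONDS)].append(e.event_id)

    incidents: dict[str, list[str]] = {}
    for key in sorted(buckets):        # deterministic iteration order
        ids = sorted(buckets[key])     # deterministic membership order
        digest = hashlib.sha256("\n".join(ids).encode()).hexdigest()[:12]
        incidents[f"inc-{digest}"] = ids  # ID derived only from content
    return incidents
```

Fixed windows trade some grouping quality (a burst that straddles a window boundary splits into two incidents) for determinism that is easy to verify.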

Replay Safety

If a network partition causes NATS JetStream to redeliver an event, or if a Temporal worker crashes and is restarted, the system guarantees that mutations are not applied twice. Idempotency keys and deterministic state machines prevent unintended side effects.
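A minimal sketch of idempotency keys, assuming each key is derived from the incident ID plus a canonical JSON encoding of the action. `idempotency_key` and `IdempotentExecutor` are hypothetical names, and the in-memory dictionary is a stand-in: a real implementation would persist the key and its outcome durably and atomically, so the record survives the very crash it protects against. A redelivered or retried request finds its key already recorded and returns the stored result instead of mutating the cluster again.

```python
from __future__ import annotations

import hashlib
import json


def idempotency_key(incident_id: str, action: dict) -> str:
    """Stable key: same incident + same action (in any key order) -> same key."""
    canonical = json.dumps(
        {"incident": incident_id, "action": action},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()


class IdempotentExecutor:
    """Hypothetical executor; the dict stands in for a durable, atomic store."""

    def __init__(self) -> None:
        self._outcomes: dict[str, str] = {}
        self.mutations = 0  # counts real side effects, for illustration

    def apply(self, incident_id: str, action: dict) -> str:
        key = idempotency_key(incident_id, action)
        if key in self._outcomes:
            # Redelivered event or retried activity: replay the recorded result.
            return self._outcomes[key]
        self.mutations += 1  # the actual cluster mutation would happen here
        outcome = f"{action['op']} {action['target']}"
        self._outcomes[key] = outcome
        return outcome
```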

Policy Enforcement

Every `Execute()` call must pass Open Policy Agent (OPA) validation. Policies are written in Rego and distributed across the cluster. Examples include blocking all operations in the `kube-system` namespace and requiring two-person approval for StatefulSet scale-downs.
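The two example policies can be expressed as decision logic. The sketch below is Python written purely for illustration: it is not Rego and does not call OPA. The input fields (`namespace`, `kind`, `operation`, `current_replicas`, `desired_replicas`), the `Decision` shape, the function names, and the single-approver default are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass

PROTECTED_NAMESPACES = {"kube-system"}


@dataclass(frozen=True)
class Decision:
    allow: bool
    required_approvals: int
    reason: str


def evaluate(inp: dict) -> Decision:
    """Python stand-in for the two example rules (the real rules are Rego in OPA)."""
    if inp["namespace"] in PROTECTED_NAMESPACES:
        return Decision(False, 0, f"operations in {inp['namespace']} are blocked")
    # Short-circuit keeps replica fields optional for non-scale operations.
    scale_down = (
        inp["kind"] == "StatefulSet"
        and inp["operation"] == "scale"
        and inp["desired_replicas"] < inp["current_replicas"]
    )
    if scale_down:
        return Decision(True, 2, "StatefulSet scale-down needs two approvers")
    return Decision(True, 1, "default: one approver (assumed)")


def may_execute(inp: dict, approvers: set[str]) -> bool:
    """Execute only if policy allows and enough distinct people approved."""
    decision = evaluate(inp)
    return decision.allow and len(approvers) >= decision.required_approvals
```

Counting approvers as a set means the same person approving twice does not satisfy a two-person rule.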

Rollback Guarantees

Remediation is not fire-and-forget. The workflow pauses in a `VERIFYING` state to analyze post-patch telemetry. If the system does not stabilize, a pre-calculated compensation transaction is executed to roll back the cluster to its previous state.
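The verify-then-compensate flow follows the compensating-transaction (saga) pattern. In the sketch below, the compensation is captured before the patch is applied, and the workflow rolls back unless post-patch telemetry reports a run of consecutive healthy samples. The dictionary-as-cluster model, `remediate`, `required_healthy`, and the consecutive-sample stability criterion are hypothetical; the source does not specify how stabilization is measured, and a real Temporal workflow would wait on timers and activities rather than iterate over a list.

```python
from __future__ import annotations

from enum import Enum


class State(Enum):
    APPLYING = "APPLYING"
    VERIFYING = "VERIFYING"
    SUCCEEDED = "SUCCEEDED"
    ROLLED_BACK = "ROLLED_BACK"


def remediate(
    cluster: dict[str, int],
    key: str,
    new_value: int,
    health_samples: list[bool],
    required_healthy: int = 3,  # hypothetical stability criterion
) -> list[State]:
    """Apply a patch, verify it, and run the pre-computed compensation on failure."""
    trail = [State.APPLYING]
    # Pre-calculate the compensation BEFORE mutating anything.
    compensation = (key, cluster[key])
    cluster[key] = new_value

    trail.append(State.VERIFYING)
    streak = 0
    for healthy in health_samples:  # post-patch telemetry, one sample per tick
        streak = streak + 1 if healthy else 0
        if streak >= required_healthy:
            trail.append(State.SUCCEEDED)
            return trail

    # Never stabilized: execute the compensation to restore the prior state.
    undo_key, previous = compensation
    cluster[undo_key] = previous
    trail.append(State.ROLLED_BACK)
    return trail
```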