A resilient, event-driven control plane built for operational correctness.
CortexOps operates strictly outside the critical path of your application traffic. It functions as an out-of-band control plane listening to operational telemetry.
The Collector service watches the Kubernetes API for events, metrics, and state changes. These are normalized into a standard protobuf schema and published to NATS JetStream. Concurrently, the Topology Engine maintains an in-memory graph of all resources to instantly calculate blast radius.
An in-memory graph ensures sub-millisecond queries during a cascading failure. Normalizing via NATS ensures that the correlation engine receives ordered, replayable event streams, decoupling ingestion from processing.
When an incident is correlated and a root cause is proposed, the system orchestrates a deterministic remediation workflow using Temporal.
Every workflow state is durably persisted. If the remediation worker crashes during execution, Temporal resumes the exact step upon restart. OPA policies guarantee that no forbidden mutations occur.
If a patch fails or post-execution telemetry degrades, the workflow automatically transitions to the ROLLING_BACK state. If rollback fails, the incident escalates to PagerDuty.
CortexOps guarantees exactly-once processing semantics during network partitions or pod failures.
In distributed systems, failures are inevitable. If NATS redelivers an event due to a timeout, the Correlation Engine must deterministically drop the duplicate using the Audit DB to prevent triggering duplicate remediation workflows.