The design rationale and tradeoffs behind CortexOps.
Why Microservices?
Failure isolation. If the RCA Engine runs out of memory, the Correlation Engine keeps processing events and still emits generic alerts, just without root-cause detail.
Independent scaling. The Collector scales linearly with K8s cluster size, while the Topology Service scales by cluster complexity.
Clear service ownership. Each service exposes a Protobuf-defined API, which forces well-defined boundaries between services.
Why NATS JetStream?
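Replay safety. The Correlation Engine can crash, restart, and resume from the last sequence it acknowledged, because a durable consumer tracks its position in the stream.
Event durability. With file storage, JetStream persists telemetry to disk, and a replicated stream acknowledges a publish only after a quorum of replicas has stored it. Acknowledged messages survive broker restarts and the loss of a minority of replicas; during a network partition, publishers must retry until they receive an acknowledgement.
Decoupled services. Collectors publish without knowing about Correlators. The broker buffers bursts and acts as a shock absorber between producers and consumers.
Delivery guarantees. JetStream delivers at-least-once by default. Exactly-once processing additionally relies on publisher-supplied message IDs (the Nats-Msg-Id header), which let the server discard duplicate publishes within the stream's deduplication window, and on consumers confirming their acknowledgements.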
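
The replay and deduplication behaviour described in this section can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the NATS client API: `ToyStream`, `publish`, `fetch_pending`, and `ack` are hypothetical names, and the deduplication window is modelled as an unbounded set rather than JetStream's time-bounded window.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStream:
    """In-memory stand-in for a durable stream (hypothetical, not the NATS API)."""
    messages: list = field(default_factory=list)  # list index + 1 == stream sequence
    seen_ids: set = field(default_factory=set)    # publisher-supplied message IDs
    acked: dict = field(default_factory=dict)     # consumer name -> last acked sequence

    def publish(self, msg_id: str, payload: str) -> bool:
        """Store the message unless its ID was already seen; return True if stored."""
        if msg_id in self.seen_ids:
            return False  # duplicate publish, e.g. a retry after a lost ack
        self.seen_ids.add(msg_id)
        self.messages.append(payload)
        return True

    def fetch_pending(self, consumer: str) -> list:
        """Return (sequence, payload) pairs after the consumer's last acked sequence."""
        start = self.acked.get(consumer, 0)
        return [(seq, p) for seq, p in enumerate(self.messages, start=1) if seq > start]

    def ack(self, consumer: str, seq: int) -> None:
        """Record progress so a restarted consumer resumes after this sequence."""
        self.acked[consumer] = max(seq, self.acked.get(consumer, 0))


stream = ToyStream()
stream.publish("evt-1", "pod OOMKilled")
stream.publish("evt-2", "node NotReady")
stream.publish("evt-2", "node NotReady")  # retried publish, discarded as a duplicate
stream.publish("evt-3", "deploy rolled back")

# The consumer handles the first message, acks it, then "crashes".
first_seq, _ = stream.fetch_pending("correlator")[0]
stream.ack("correlator", first_seq)

# After a restart, only unacknowledged messages are delivered again.
replayed = stream.fetch_pending("correlator")
```

The design point is that progress lives in the stream, not in the consumer process, so a restart needs no local checkpoint.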
Why Temporal?
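Durable execution. Workflow state lives in Temporal's persisted event history, not in a worker's memory. If a pod dies during a 10-minute wait for human approval, another worker picks up the workflow and resumes it from the last recorded event.
Workflow replay. Temporal rebuilds workflow state by replaying its event history against the current code. If a code change makes the replay diverge from the recorded history, Temporal reports a non-determinism error instead of continuing silently, so changes to running workflows must go through its versioning mechanisms.
State recovery. Declarative retry policies, plus compensation logic written as ordinary workflow code, replace a large amount of bespoke distributed-systems code.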
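
To make the replay idea concrete, here is a toy model of durable execution in plain Python. It is not the Temporal SDK: `run_workflow`, `NonDeterminismError`, and the step names are invented for illustration. It shows only two things: completed steps are read back from recorded history instead of being re-executed, and a code change that alters the step order is detected on replay.

```python
class NonDeterminismError(Exception):
    """Raised when replayed code asks for a step the history did not record."""


def run_workflow(steps, history, execute):
    """Run `steps` in order, reusing results already recorded in `history`.

    steps:   list of step names the workflow code wants to run
    history: list of (step_name, result) pairs persisted by earlier attempts
    execute: callable that performs a step for real and returns its result
    """
    results = []
    for index, name in enumerate(steps):
        if index < len(history):
            recorded_name, recorded_result = history[index]
            if recorded_name != name:
                raise NonDeterminismError(
                    f"history has {recorded_name!r} at position {index}, code wants {name!r}"
                )
            results.append(recorded_result)  # replay: the side effect is not repeated
        else:
            result = execute(name)
            history.append((name, result))   # record the step before moving on
            results.append(result)
    return results


executed = []  # tracks which steps actually ran their side effect


def execute(name):
    executed.append(name)
    return f"{name}:done"


steps = ["snapshot_db", "await_approval", "restart_service"]
history = []

# First attempt: a worker crash is simulated by running only the first two steps.
run_workflow(steps[:2], history, execute)

# Second attempt on a "fresh worker": the first two steps come from history.
results = run_workflow(steps, history, execute)

# A code change that reorders steps no longer matches the recorded history.
try:
    run_workflow(["await_approval", "snapshot_db"], history, execute)
    diverged = False
except NonDeterminismError:
    diverged = True
```

Each step's side effect runs once even though the workflow function ran more than once, which is the property that makes a long human-approval wait safe across pod restarts.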
Why Heuristics over ML for Correlation?
Deterministic behavior. In an operational control plane, you must be able to trace exactly *why* two events were correlated.
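Auditability. A weighted sum over TraceID matches, time proximity, and topology depth can be broken down term by term in an incident review.
Predictable outcomes. The same inputs always produce the same correlation score. A probabilistic model can return confident but wrong groupings that are hard to explain; a heuristic can also be wrong, but the error traces back to a specific rule or weight.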
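
A weighted-sum score of the kind described in this section might look like the sketch below. The weights, the 60-second window, the depth limit, and the reading of "topology depth" as a gap in hop count between two events are all assumptions made for illustration; none of these values come from CortexOps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    trace_id: str
    timestamp: float     # seconds since epoch
    topology_depth: int  # assumed meaning: hops from a reference service


# Illustrative weights and limits (assumptions, not CortexOps values).
W_TRACE, W_TIME, W_TOPOLOGY = 0.5, 0.3, 0.2
TIME_WINDOW_S = 60.0
MAX_DEPTH_GAP = 4


def correlation_score(a: Event, b: Event) -> dict:
    """Return each weighted term plus the total, so the result can be audited."""
    trace = 1.0 if a.trace_id and a.trace_id == b.trace_id else 0.0
    time_gap = abs(a.timestamp - b.timestamp)
    time = max(0.0, 1.0 - time_gap / TIME_WINDOW_S)        # 1.0 when simultaneous
    depth_gap = abs(a.topology_depth - b.topology_depth)
    topology = max(0.0, 1.0 - depth_gap / MAX_DEPTH_GAP)   # 1.0 at the same depth
    terms = {
        "trace": W_TRACE * trace,
        "time": W_TIME * time,
        "topology": W_TOPOLOGY * topology,
    }
    terms["total"] = sum(terms.values())
    return terms


a = Event(trace_id="t1", timestamp=100.0, topology_depth=1)
b = Event(trace_id="t1", timestamp=130.0, topology_depth=2)
score = correlation_score(a, b)
```

Returning the individual terms alongside the total is what makes the decision reviewable: an incident review can see how much of the score came from the trace match versus time or topology.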
Why Advisory AI?
Safety. AI models are not deterministic enough to hold the keys to production infrastructure.
Governance. By decoupling the recommendation (LLM) from the action (Temporal/OPA), we enforce safety programmatically.
Human accountability. The AI acts like a Staff Engineer looking over your shoulder: it advises, but a human or a deterministic policy makes the final call.
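
The split between recommendation and action can be sketched as a small gate. Everything here is hypothetical: `Recommendation`, `policy_allows`, `decide`, and the allow-list stand in for the LLM output, an OPA policy decision, and the Temporal approval step respectively. This variant requires both a policy pass and a named human approver, which is stricter than the "human or policy" wording above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recommendation:
    action: str     # e.g. "restart_deployment"
    target: str     # e.g. "checkout-api"
    rationale: str  # free text from the model; recorded, never executed


# Hypothetical allow-list standing in for a policy engine decision.
ALLOWED_ACTIONS = {"restart_deployment", "scale_up"}
PROTECTED_TARGETS = {"payments-db"}


def policy_allows(rec: Recommendation) -> bool:
    """Deterministic check: the same recommendation always gets the same answer."""
    return rec.action in ALLOWED_ACTIONS and rec.target not in PROTECTED_TARGETS


def decide(rec: Recommendation, approved_by: Optional[str]) -> str:
    """Return 'execute' only when policy passes and a named human has approved."""
    if not policy_allows(rec):
        return "rejected_by_policy"
    if not approved_by:
        return "awaiting_approval"
    return "execute"


rec = Recommendation("restart_deployment", "checkout-api", "OOM crash loop")
pending = decide(rec, approved_by=None)
approved = decide(rec, approved_by="alice")
```

The model's rationale text never reaches the execution path; only the structured `action` and `target` fields are evaluated, and only by deterministic code.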