First-class operational procedures for managing CortexOps in production.
Telemetry Flow: Monitor ingestion rates via the `cortexops_events_ingested_total` metric in Prometheus. Backpressure alerts indicate downstream NATS constraints.
Dashboard Metrics: The Grafana dashboard surfaces Correlation Latency, Workflow Success Rates, and LLM Token Usage.
Workflow States: Monitored via the Temporal UI. Look for workflows stuck in `APPROVAL_PENDING` or looping in `DRY_RUNNING`.
Runtime Visibility: Use the Diagnostics API (`/debug/healthz`) to verify sub-component availability.
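For the Telemetry Flow item above, the ingestion rate is the per-second increase of the `cortexops_events_ingested_total` counter. In Prometheus this is typically queried with the `rate()` function over a time window. The sketch below shows the same idea on raw samples, assuming they arrive as `(timestamp, value)` pairs. The function name `ingestion_rate` is hypothetical and not part of CortexOps, and the reset handling is a simplification of what Prometheus does.

```python
from typing import Sequence, Tuple


def ingestion_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Approximate events per second from counter samples.

    samples: (unix_timestamp_seconds, counter_value) pairs, oldest first.
    A drop in the counter value is treated as a reset (for example, a
    process restart), so the value after the drop counts as new increase.
    """
    if len(samples) < 2:
        return 0.0

    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # Normal case: add the delta. Reset case: counter restarted from zero.
        increase += (cur - prev) if cur >= prev else cur

    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    # 600 events over 60 seconds
    print(ingestion_rate([(0, 100), (30, 400), (60, 700)]))
```

A sustained fall in this rate while sources are still emitting is one signal worth checking alongside the backpressure alerts.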
Telemetry surges occur during massive cascading outages. CortexOps auto-scales the Correlation Engine in response. If NATS hits its memory limits, it drops older telemetry in favor of fresh signals.
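The drop-oldest behavior can be illustrated with a small in-memory buffer. This is only a sketch of the retention idea, not the NATS mechanism itself: it bounds the buffer by event count rather than memory, and the class name `DropOldestBuffer` is hypothetical.

```python
from collections import deque
from typing import Any, List


class DropOldestBuffer:
    """Bounded buffer that evicts the oldest event when full."""

    def __init__(self, max_events: int) -> None:
        # deque with maxlen silently discards from the left on overflow.
        self._events: deque = deque(maxlen=max_events)
        self.dropped = 0  # count of evicted events, useful for alerting

    def publish(self, event: Any) -> None:
        if len(self._events) == self._events.maxlen:
            self.dropped += 1
        self._events.append(event)

    def drain(self) -> List[Any]:
        """Return buffered events oldest first and empty the buffer."""
        items = list(self._events)
        self._events.clear()
        return items


if __name__ == "__main__":
    buf = DropOldestBuffer(max_events=3)
    for i in range(5):
        buf.publish(i)
    print(buf.drain(), buf.dropped)
```

Tracking the dropped count matters operationally: during an outage, silent eviction means correlation runs on a partial history, so the number of discarded events should be visible somewhere.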
If the LLM provider times out, CortexOps falls back to deterministic heuristic outputs. No workflows are blocked, but advisories become generic.
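A minimal sketch of this fallback pattern follows. It assumes the LLM client raises `TimeoutError` when the provider does not respond in time. The names `advise`, `llm_call`, and `heuristic` are hypothetical and are not CortexOps APIs.

```python
from typing import Any, Callable, Dict

Incident = Dict[str, Any]


def advise(
    incident: Incident,
    llm_call: Callable[[Incident], str],
    heuristic: Callable[[Incident], str],
) -> Dict[str, str]:
    """Return an advisory, falling back to a deterministic heuristic.

    Assumption: llm_call raises TimeoutError on provider timeout.
    The 'source' field records which path produced the advisory.
    """
    try:
        return {"source": "llm", "advisory": llm_call(incident)}
    except TimeoutError:
        # The workflow is not blocked; it continues with a generic advisory.
        return {"source": "heuristic", "advisory": heuristic(incident)}


if __name__ == "__main__":
    def timed_out(_incident: Incident) -> str:
        raise TimeoutError("provider timed out")

    print(advise({"service": "db"}, timed_out, lambda i: f"Check recent deploys for {i['service']}"))
```

Recording the source of each advisory makes it easy to spot periods where operators were receiving generic output only.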
Workflows stuck in remediation usually indicate missing K8s RBAC permissions for the worker. Check Temporal activity logs for `Unauthorized` errors.
Dry-Run Execution: All mutating actions are executed in dry-run mode against the API server first to validate syntax and permissions.
Rollback Handling: If post-remediation metrics do not stabilize within the configured window, the workflow executes a compensation transaction to revert the change.
Human Approval Workflows: High-risk operations (e.g., database restarts) pause execution and send an interactive Slack message for approval. Approval requests time out after 15 minutes.
OPA Denials: Rejected operations log a policy violation event in the Audit DB and terminate immediately.
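The Rollback Handling behavior described above amounts to applying a change, polling for stability within a window, and running a compensating action if the window expires. The sketch below shows that control flow in plain Python under stated assumptions: all function and parameter names are hypothetical, and a real implementation would presumably live in a Temporal workflow rather than a blocking loop.

```python
import time
from typing import Callable


def remediate_with_rollback(
    apply: Callable[[], None],
    revert: Callable[[], None],
    is_stable: Callable[[], bool],
    window_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Apply a change and revert it if metrics do not stabilize in time.

    clock and sleep are injectable so the logic can be tested without waiting.
    """
    apply()
    deadline = clock() + window_s

    while True:
        if is_stable():
            return "stabilized"
        if clock() >= deadline:
            break
        sleep(poll_s)

    # Stabilization window expired: run the compensating action.
    revert()
    return "rolled_back"
```

Injecting the clock keeps the stabilization window configurable and lets the timeout path be exercised in tests without real delays.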
Temporal is the brain of our remediation orchestration. If it fails, all autonomous actions stall.
Check `docker compose logs postgres`. Ensure migrations have completed.
Run `tctl workflow terminate -w [workflow_id]` to forcefully halt a workflow.
Chaos Testing: Run `make demo-failure` to simulate NATS outages, pod deletions, and rollback scenarios. Validate that exactly-once semantics hold.
Backup & Restore: Take daily `pg_dump` backups of PostgreSQL for Temporal state and Audit history. Qdrant volumes should be snapshotted weekly.
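As a sketch of the daily backup step, the helper below builds a `pg_dump` invocation with a date-stamped output file. The database name, backup directory, and function name are assumptions for illustration. Connection options (host, user, credentials) are omitted and would need to match the deployment.

```python
from datetime import date
from typing import List


def pg_dump_command(database: str, backup_dir: str, day: date) -> List[str]:
    """Build a pg_dump command that writes a custom-format archive.

    The custom format is compressed and restorable with pg_restore.
    database and backup_dir are placeholders, not CortexOps defaults.
    """
    outfile = f"{backup_dir}/{database}-{day.isoformat()}.dump"
    return ["pg_dump", "--format=custom", f"--file={outfile}", database]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) from a scheduled job to execute it.
    print(" ".join(pg_dump_command("temporal", "/backups", date.today())))
```

Date-stamped filenames keep each daily dump distinct, which simplifies retention and makes it clear which backup a restore test used.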