Operator Chaos¶
Chaos engineering for Kubernetes operators.
Test reconciliation semantics, not just pod restarts.
Why Operator Chaos?¶
Existing chaos tools (Krkn, Litmus, Chaos Mesh) test infrastructure resilience: kill a pod, verify it comes back. But Kubernetes operators manage complex resource graphs — Deployments, Services, ConfigMaps, CRDs — where the real question is:
"When something breaks, does the operator put everything back the way it should be?"
Operator Chaos answers this by testing reconciliation: verifying operators restore resources to their intended state after operator-semantic faults like CRD mutation, config drift, and RBAC revocation.
How It Compares to Other Chaos Tools¶
| | Operator Chaos | Krkn | LitmusChaos | Chaos Mesh |
|---|---|---|---|---|
| Focus | Operator reconciliation logic | Cluster/infrastructure resilience | Application and infrastructure resilience | Kubernetes-native fault injection |
| Core question | "Did the operator restore the correct state?" | "Does the cluster survive this failure?" | "Does the application recover?" | "How does the system behave under fault?" |
| Fault types | 20 types across 4 categories: pod/network lifecycle, webhook/RBAC/config drift, resource ownership, controller-runtime client faults | Infrastructure: node kill, network chaos, etcd split, zone outage, CPU/memory hog | Mixed: pod kill, node drain, disk fill, HTTP chaos, cloud provider faults | Kubernetes-native: pod kill, network delay/loss, IO stress, time skew, JVM faults |
| Verdict model | Resilient / Degraded / Failed with recovery time | Pass / Fail based on cluster health | Pass / Fail based on probe checks | Status-based (experiment CR conditions) |
| Operator awareness | Declarative knowledge models describe operator topology (deployments, webhooks, RBAC, CRDs) | No declarative operator topology model | No declarative operator topology model | No declarative operator topology model |
| Best for | Operator developers validating reconciliation correctness | SRE teams validating cluster resilience | Platform teams testing application resilience at scale | Teams needing fine-grained Kubernetes fault injection |
These tools are complementary, not competing. Krkn, Litmus, and Chaos Mesh test whether the platform and applications survive infrastructure failures. Operator Chaos tests whether the operator's reconciliation logic correctly restores managed resources after operator-level faults. A pod-kill test in Krkn checks if Kubernetes reschedules the pod. A pod-kill test in Operator Chaos checks if the operator re-reconciles all the resources that pod was managing.
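To make that difference concrete, here is a minimal sketch of an operator-level recovery check written directly against controller-runtime. It assumes a hypothetical operator that owns a `demo-config` ConfigMap in the `demo-ns` namespace; the test deletes it as a config-drift fault and then polls until the operator restores it:

```go
package chaos_test

import (
	"context"
	"reflect"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func TestOperatorRestoresConfigMap(t *testing.T) {
	cfg, err := ctrl.GetConfig() // kubeconfig from KUBECONFIG or ~/.kube/config
	if err != nil {
		t.Fatalf("load kubeconfig: %v", err)
	}
	c, err := client.New(cfg, client.Options{})
	if err != nil {
		t.Fatalf("build client: %v", err)
	}

	ctx := context.Background()
	key := client.ObjectKey{Namespace: "demo-ns", Name: "demo-config"}

	// Baseline: the managed ConfigMap must exist before we break anything.
	baseline := &corev1.ConfigMap{}
	if err := c.Get(ctx, key, baseline); err != nil {
		t.Fatalf("baseline check failed: %v", err)
	}

	// Inject the fault: delete a resource the operator owns (config drift).
	if err := c.Delete(ctx, baseline); err != nil {
		t.Fatalf("inject fault: %v", err)
	}

	// Observe recovery: the operator should re-reconcile and re-create it.
	err = wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			restored := &corev1.ConfigMap{}
			if err := c.Get(ctx, key, restored); err != nil {
				if apierrors.IsNotFound(err) {
					return false, nil // not restored yet; keep polling
				}
				return false, err
			}
			return reflect.DeepEqual(restored.Data, baseline.Data), nil
		})
	if err != nil {
		t.Fatalf("operator did not restore the ConfigMap: %v", err)
	}
}
```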
How It Works¶
```mermaid
flowchart LR
    A["Define<br/>Experiment"] --> B["Verify<br/>Baseline"]
    B --> C["Inject<br/>Fault"]
    C --> D["Observe<br/>Recovery"]
    D --> E{"Render<br/>Verdict"}
    E -->|recovered| R["Resilient"]
    E -->|partial| G["Degraded"]
    E -->|not recovered| F["Failed"]
    style A fill:#bbdefb,stroke:#1565c0
    style B fill:#ce93d8,stroke:#6a1b9a
    style C fill:#ffcc80,stroke:#e65100
    style D fill:#a5d6a7,stroke:#2e7d32
    style E fill:#b0bec5,stroke:#37474f
    style R fill:#a5d6a7,stroke:#2e7d32
    style G fill:#ffcc80,stroke:#e65100
    style F fill:#ef9a9a,stroke:#c62828
```
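In code, that lifecycle is a short loop through the phases in the diagram. A minimal sketch; the `Phases` struct and verdict strings here are illustrative, not the tool's real internals:

```go
package chaos

import (
	"context"
	"time"
)

// Phases of one experiment; each function is supplied by the
// experiment definition.
type Phases struct {
	VerifyBaseline func(ctx context.Context) error
	InjectFault    func(ctx context.Context) error
	CheckRecovery  func(ctx context.Context) (recovered, partial bool)
}

// Run walks the lifecycle and renders a verdict.
func Run(ctx context.Context, p Phases, timeout time.Duration) string {
	if err := p.VerifyBaseline(ctx); err != nil {
		return "Inconclusive" // baseline never held; nothing to measure
	}
	if err := p.InjectFault(ctx); err != nil {
		return "Inconclusive"
	}
	deadline := time.Now().Add(timeout)
	sawPartial := false
	for time.Now().Before(deadline) {
		recovered, partial := p.CheckRecovery(ctx)
		if recovered {
			return "Resilient" // full recovery within the timeout
		}
		sawPartial = sawPartial || partial
		time.Sleep(2 * time.Second)
	}
	if sawPartial {
		return "Degraded" // partial recovery, with deviations
	}
	return "Failed" // no recovery before the timeout
}
```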
Testing Fidelity¶
Operator Chaos is a test harness, not a fixed-fidelity tool. The fidelity of your chaos tests depends on the environment you point it at:
| Environment | Fidelity | What You Learn |
|---|---|---|
| Fake client (fuzz mode) | Unit-level | Reconciler logic handles faults correctly |
| `kind` / `minikube` | Integration | Operator recovers resources on a real API server |
| Staging OpenShift | System | Operator works with real RBAC, webhooks, network policies |
| Production-like OCP | Production | Operator handles real workloads under real constraints |
The tool itself is lightweight (single static binary, ~20MB container image). What changes is the target: same experiments, same verdicts, different confidence levels. Start with fuzz tests during development, graduate to live cluster tests for release qualification.
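Because a recovery check like the one shown earlier only needs a controller-runtime `client.Client`, the fidelity choice reduces to how that client is built. A sketch, where `newTarget` is a hypothetical helper (`fake.NewClientBuilder` is controller-runtime's actual fake-client API):

```go
package chaos

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
)

// newTarget returns a client for the chosen fidelity level.
func newTarget(live bool) (client.Client, error) {
	if !live {
		// Unit fidelity: in-memory fake API server, no cluster needed.
		seed := &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Namespace: "demo-ns", Name: "demo-config"},
			Data:       map[string]string{"key": "value"},
		}
		return fake.NewClientBuilder().WithObjects(seed).Build(), nil
	}
	// Integration fidelity and up: point at kind, staging, or prod-like OCP.
	cfg, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	return client.New(cfg, client.Options{})
}
```

The same experiment function can then run unchanged against every row of the table above.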
Offline vs Live Capabilities¶
Many operator-chaos commands work without any cluster connection:
| Command | Cluster Required? | What It Does |
|---|---|---|
| `operator-chaos validate` | No | Validates experiment and knowledge YAML syntax |
| `operator-chaos types` | No | Lists all available injection types |
| `operator-chaos init` | No | Scaffolds new experiment files |
| `operator-chaos preflight --local` | No | Validates knowledge YAML structure without a cluster |
| `operator-chaos run` | Yes | Executes experiments against a live cluster |
| `operator-chaos suite` | Yes | Runs experiment suites against a live cluster |
| `operator-chaos preflight` (no `--local`) | Yes | Checks that declared resources exist on the cluster |
| `operator-chaos clean` | Yes | Removes leftover chaos artifacts from the cluster |
This means you can validate experiments, lint knowledge models, and scaffold new tests entirely offline, in CI without a cluster, or during development before you have access to a test environment.
Four Usage Modes¶
| Mode | What It Tests | Cluster? | When to Use |
|---|---|---|---|
| CLI Experiments | Full operator recovery on a live cluster | Yes | Pre-release validation, CI/CD |
| SDK Middleware | Operator behavior under API-level faults | Yes (or fake client) | Integration tests |
| Fuzz Testing | Reconciler correctness under random faults | No | Development, unit tests, CI |
| Upgrade Testing | Structural changes between operator versions | Yes | Release qualification, upgrade testing |
- **CLI Experiments**: Run structured chaos experiments against a live cluster, orchestrating the full lifecycle: steady state, inject, observe, evaluate.
- **SDK Middleware**: Wrap a controller-runtime client with fault injection; no changes to your reconciler code are needed (see the sketch after this list).
- **Fuzz Testing**: Test reconciler correctness under random faults. No cluster needed; uses the fake client.
- **Upgrade Testing**: Auto-generate chaos experiments from version diffs to test CRD schema changes, resource ownership shifts, and dependency mutations.
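A minimal sketch of the middleware idea: since `client.Client` is an interface, a wrapper can embed the real client, override selected methods, and fail them at a configurable rate. `FaultClient` and its fields are illustrative, not the SDK's actual types:

```go
package chaos

import (
	"context"
	"errors"
	"math/rand"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// FaultClient sits in front of a real (or fake) client and injects
// failures; every method not overridden passes through to the delegate.
type FaultClient struct {
	client.Client         // delegate for all non-overridden methods
	FailRate      float64 // probability that a write call fails
	Rand          *rand.Rand
}

// Update fails with a synthetic conflict at the configured rate and
// passes through otherwise, exercising the reconciler's retry paths.
func (f *FaultClient) Update(ctx context.Context, obj client.Object, opts ...client.UpdateOption) error {
	if f.Rand.Float64() < f.FailRate {
		gr := schema.GroupResource{Resource: "objects"} // placeholder resource
		return apierrors.NewConflict(gr, obj.GetName(), errors.New("injected fault"))
	}
	return f.Client.Update(ctx, obj, opts...)
}
```

Handing `&FaultClient{Client: c, FailRate: 0.2, Rand: rand.New(rand.NewSource(1))}` to the reconciler is the only wiring change; wrapping controller-runtime's fake client instead gives the no-cluster fuzz mode from the table.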
Verdicts¶
Every experiment produces a structured verdict:
| Verdict | Meaning |
|---|---|
| Resilient | Operator restored all resources correctly within the timeout |
| Degraded | Operator recovered but with deviations from expected state |
| Failed | Operator did not recover within the timeout |
| Inconclusive | Baseline check failed; the experiment could not run |
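The verdict carries evidence alongside the label. One plausible shape in Go (field names are assumptions, not the tool's actual output schema):

```go
package chaos

import (
	"fmt"
	"time"
)

type Outcome string

const (
	Resilient    Outcome = "Resilient"
	Degraded     Outcome = "Degraded"
	Failed       Outcome = "Failed"
	Inconclusive Outcome = "Inconclusive"
)

// Verdict pairs the outcome with the evidence behind it.
type Verdict struct {
	Outcome      Outcome
	RecoveryTime time.Duration // how long until resources matched again
	Deviations   []string      // what still differed (Degraded / Failed)
}

func (v Verdict) String() string {
	return fmt.Sprintf("%s (recovered in %s, %d deviations)",
		v.Outcome, v.RecoveryTime, len(v.Deviations))
}
```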