Prometheus Operator Failure Modes¶
Coverage¶
| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| PodKill | low | prometheus-operator/pod-kill.yaml | Killing the prometheus-operator pod triggers a Deployment rollout. |
| NetworkPartition | medium | prometheus-operator/network-partition.yaml | Isolating the operator from the API server stalls Prometheus reconciliation. |
| LabelStomping | high | prometheus-operator/label-stomping.yaml | Overwriting a label on the operator Deployment. |
| QuotaExhaustion | high | prometheus-operator/quota-exhaustion.yaml | Applying zero-limit ResourceQuota prevents new pod creation. |
| RBACRevoke | high | prometheus-operator/rbac-revoke.yaml | Revoking ClusterRoleBinding blocks Prometheus resource management. |
| DeploymentScaleZero | high | prometheus-operator/deployment-scale-zero.yaml | Scaling operator to zero replicas. |
Experiment Details¶
prometheus-operator¶
prometheus-operator-pod-kill¶
- Type: PodKill
- Danger Level: low
- Component: prometheus-operator
Killing the prometheus-operator pod triggers a Deployment rollout. Running Prometheus and Alertmanager instances continue serving metrics. Reconciliation of ServiceMonitor and PrometheusRule changes stalls until the operator recovers.
prometheus-operator-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: prometheus-operator
Isolating the operator from the API server stalls Prometheus configuration reconciliation. Existing Prometheus instances continue scraping and serving queries. After the partition is lifted, the operator reconnects and processes backlogged events.
prometheus-operator-label-stomping¶
- Type: LabelStomping
- Danger Level: high
- Component: prometheus-operator
Overwriting a label on the operator Deployment tests whether cluster-monitoring-operator detects and restores the label.
prometheus-operator-quota-exhaustion¶
- Type: QuotaExhaustion
- Danger Level: high
- Component: prometheus-operator
Applying a zero-limit ResourceQuota prevents new pod creation. The chaos framework removes the quota via TTL-based cleanup.
prometheus-operator-rbac-revoke¶
- Type: RBACRevoke
- Danger Level: high
- Component: prometheus-operator
Revoking the operator's ClusterRoleBinding blocks Prometheus and Alertmanager resource management. After rollback, reconciliation resumes.
prometheus-operator-deployment-scale-zero¶
- Type: DeploymentScaleZero
- Danger Level: high
- Component: prometheus-operator
Scaling the operator Deployment to zero replicas. cluster-monitoring-operator restores the replica count.