# Prometheus Operator Custom Experiments
This page provides templates and guidance for writing custom chaos experiments targeting the Prometheus Operator.
## Component Overview

The Prometheus Operator has one component in the `openshift-monitoring` namespace:

- prometheus-operator: Manages Prometheus, Alertmanager, and ThanosRuler instances and their configuration via custom resources (Prometheus, ServiceMonitor, PrometheusRule, Alertmanager)
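Before writing an experiment, it can help to confirm the target deployment and the labels you plan to select on; the label used in the template below is an assumption and may differ between OpenShift versions. A minimal check:

```bash
# Inspect the prometheus-operator deployment and its labels
# (label values may differ between OpenShift versions)
kubectl -n openshift-monitoring get deployment prometheus-operator --show-labels

# Confirm which pods the label selector in the template would match
kubectl -n openshift-monitoring get pods -l app.kubernetes.io/name=prometheus-operator
```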
## Example Template

### prometheus-operator
```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: prometheus-operator-custom
spec:
  target:
    operator: prometheus-operator
    component: prometheus-operator
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: prometheus-operator
        namespace: openshift-monitoring
        conditionType: Available
        timeout: "60s"
  injection:
    type: PodKill  # Change to desired injection type
    parameters:
      labelSelector: app.kubernetes.io/name=prometheus-operator
      ttl: "120s"
  hypothesis:
    description: >-
      Describe the expected behavior after fault injection.
    recoveryTimeout: 120s
```
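The injection block can be swapped for other injection types. As an illustrative sketch only, here is what a variant using the DeploymentScaleZero type mentioned under Design Considerations might look like; the parameter names are assumptions and are not confirmed by this page:

```yaml
# Illustrative sketch only: parameter names under this injection type are assumptions
injection:
  type: DeploymentScaleZero   # scale the operator deployment to zero replicas
  parameters:
    namespace: openshift-monitoring
    name: prometheus-operator
    ttl: "120s"
```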
## Running Custom Experiments

- Save your experiment YAML to a file
- Run: `chaos-cli run --experiment <file>`
- Check results: `chaos-cli results --latest`
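Putting the steps together, a minimal end-to-end run might look like the following; the filename is arbitrary, and `chaos-cli` is assumed to be on your PATH and authenticated against the cluster:

```bash
# Save the experiment shown above (filename is arbitrary)
vi prometheus-operator-custom.yaml

# Run the experiment, then inspect the most recent result
chaos-cli run --experiment prometheus-operator-custom.yaml
chaos-cli results --latest
```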
## Design Considerations
- cluster-monitoring-operator: The Prometheus Operator is managed by the cluster-monitoring-operator, which reconciles it aggressively. Most disruptions self-heal, including DeploymentScaleZero.
- Monitoring data plane: Disrupting the Prometheus Operator does not affect running Prometheus instances. Metrics scraping and alerting continue. Only configuration reconciliation (new ServiceMonitors, PrometheusRules) stalls.
- openshift-monitoring namespace: This namespace has strict RBAC and may require elevated permissions to apply chaos experiments. Ensure your chaos service account has the necessary permissions (see the sketch after this list).
- Platform component: Be cautious with experiments in production clusters, as disrupting monitoring can mask other issues.
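As a rough example of the kind of grant involved, the Role and RoleBinding below would allow a chaos service account to delete pods and scale the operator deployment in `openshift-monitoring`. The service account name, its namespace, and the exact verbs are assumptions and depend on your chaos tooling and the injection types you use:

```yaml
# Sketch only: service account name, namespace, and verbs are assumptions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-prometheus-operator
  namespace: openshift-monitoring
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]          # needed for PodKill
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "list", "update", "patch"] # needed for scale-based injections
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-prometheus-operator
  namespace: openshift-monitoring
subjects:
  - kind: ServiceAccount
    name: chaos-runner          # assumed service account name
    namespace: chaos-testing    # assumed namespace where the chaos tool runs
roleRef:
  kind: Role
  name: chaos-prometheus-operator
  apiGroup: rbac.authorization.k8s.io
```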