
Strimzi Failure Modes

Coverage

| Injection Type | Danger Level | Experiment | Description |
| --- | --- | --- | --- |
| PodKill | low | cluster-operator/pod-kill.yaml | Killing the cluster-operator pod causes its ReplicaSet to recreate it. Kafka clusters remain running. |
| NetworkPartition | medium | cluster-operator/network-partition.yaml | Isolating the cluster-operator from the API server stalls Kafka cluster reconciliation. |
| LabelStomping | high | cluster-operator/label-stomping.yaml | Overwriting a label on the cluster-operator Deployment tests OLM label reconciliation. |
| QuotaExhaustion | high | cluster-operator/quota-exhaustion.yaml | Applying a zero-limit ResourceQuota prevents new pod creation. |
| RBACRevoke | high | cluster-operator/rbac-revoke.yaml | Revoking the cluster-operator's ClusterRoleBinding blocks Kafka resource reconciliation. |
| DeploymentScaleZero | high | cluster-operator/deployment-scale-zero.yaml | Scaling the Deployment to zero replicas removes the operator pod. OLM does not restore the replica count. |
| LeaderElectionDisrupt | medium | cluster-operator/leader-election-disrupt.yaml | Deleting the leader election Lease forces re-election. |
| ConfigDrift | high | cluster-operator/config-drift.yaml | Corrupting the operator configuration tests self-healing. |

Experiment Details

cluster-operator

cluster-operator-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: cluster-operator

Killing the Strimzi cluster-operator pod causes the Deployment's ReplicaSet to create a replacement pod; no new rollout revision is involved. Existing Kafka clusters remain operational because their pods run independently of the operator pod.
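To judge whether the operator has come back after the kill, one option is to inspect the Deployment status. The sketch below is a hypothetical helper, `deployment_recovered`, that works on a Deployment object as returned by `kubectl get deployment -o json`; its readiness rule is an assumption loosely modeled on the checks `kubectl rollout status` performs, not the chaos framework's own verdict logic.

```python
def deployment_recovered(deployment: dict) -> bool:
    """Return True when every desired replica is updated and available."""
    meta = deployment.get("metadata", {})
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)  # Kubernetes defaults spec.replicas to 1

    # Status is only meaningful once the controller has observed the latest spec.
    if status.get("observedGeneration", 0) < meta.get("generation", 0):
        return False

    updated = status.get("updatedReplicas", 0)
    available = status.get("availableReplicas", 0)
    total = status.get("replicas", 0)
    # All desired replicas updated, no pods left over from older ReplicaSets,
    # and every replica available.
    return updated == desired and total == updated and available == desired


if __name__ == "__main__":
    healthy = {
        "metadata": {"generation": 3},
        "spec": {"replicas": 1},
        "status": {"observedGeneration": 3, "replicas": 1,
                   "updatedReplicas": 1, "availableReplicas": 1},
    }
    print(deployment_recovered(healthy))  # True
```

A poller would call this repeatedly until it returns True or a timeout expires; the timeout value is left to the experiment.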


cluster-operator-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: cluster-operator

Isolating the cluster-operator from the API server stalls Kafka cluster reconciliation. Existing clusters continue serving traffic. After the partition is lifted, the operator reconnects and processes backlogged reconciliation events.
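How the framework implements the partition is not shown here. One way to approximate it is a Kubernetes NetworkPolicy that denies all egress from the operator pod, assuming the cluster's CNI enforces NetworkPolicy. This is coarser than an API-server-only partition: it also cuts the operator off from DNS and from the Kafka brokers. The pod label `name: strimzi-cluster-operator`, the namespace `kafka`, and the policy name are assumptions; check them against your Deployment's selector.

```python
import json


def deny_egress_policy(namespace: str, pod_labels: dict,
                       name: str = "chaos-isolate-operator") -> dict:
    """Build a NetworkPolicy that drops all egress from matching pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": dict(pod_labels)},
            # Declaring Egress with an empty rule list denies all outbound traffic.
            "policyTypes": ["Egress"],
            "egress": [],
        },
    }


if __name__ == "__main__":
    policy = deny_egress_policy("kafka", {"name": "strimzi-cluster-operator"})
    print(json.dumps(policy, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`; deleting the policy lifts the partition.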


cluster-operator-label-stomping

  • Type: LabelStomping
  • Danger Level: high
  • Component: cluster-operator

Overwriting a label on the cluster-operator Deployment tests whether OLM detects and restores the label.
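A label overwrite maps naturally onto a JSON merge patch (RFC 7386), which changes one label and leaves the others alone. The sketch below builds such a patch, applies it locally with a small RFC 7386 implementation, and checks whether the label has been restored. The label key `olm.owner` and the value `example-csv` are illustrative assumptions; the experiment file may target a different label.

```python
import json


def apply_merge_patch(target, patch):
    """Apply a JSON merge patch (RFC 7386) to a JSON-like value."""
    if not isinstance(patch, dict):
        return patch  # a non-object patch replaces the target wholesale
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null means "delete this member"
        else:
            result[key] = apply_merge_patch(result.get(key), value)
    return result


def label_stomp_patch(key: str, value: str) -> dict:
    """Merge-patch body that overwrites a single metadata label."""
    return {"metadata": {"labels": {key: value}}}


def label_restored(original: dict, current: dict, key: str) -> bool:
    """True if label `key` on `current` matches its value on `original`."""
    def get(obj):
        return obj.get("metadata", {}).get("labels", {}).get(key)
    return get(current) == get(original)


if __name__ == "__main__":
    deployment = {"metadata": {"labels": {"olm.owner": "example-csv", "app": "strimzi"}}}
    patch = label_stomp_patch("olm.owner", "chaos-stomped")
    stomped = apply_merge_patch(deployment, patch)
    print(json.dumps(patch))  # {"metadata": {"labels": {"olm.owner": "chaos-stomped"}}}
    print(stomped["metadata"]["labels"]["app"])  # strimzi (other labels untouched)
    print(label_restored(deployment, stomped, "olm.owner"))  # False
```

Against a live cluster, the same body can be sent with `kubectl patch deployment <name> --type merge -p '<json>'`.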


cluster-operator-quota-exhaustion

  • Type: QuotaExhaustion
  • Danger Level: high
  • Component: cluster-operator

Applying a zero-limit ResourceQuota to the namespace prevents new pod creation. The chaos framework removes the quota via TTL-based cleanup.
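The sketch below shows one possible zero-limit quota and a TTL check. It assumes "zero-limit" means a hard limit of `pods: "0"`, and that the TTL is measured from the quota's `creationTimestamp`; the quota name and the TTL value are hypothetical. Because quota is enforced at admission, pods that already exist keep running, but a replacement operator pod would be rejected.

```python
from datetime import datetime, timedelta, timezone


def zero_pod_quota(namespace: str, name: str = "chaos-zero-pods") -> dict:
    """ResourceQuota manifest that allows zero pods in the namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        # Enforced at admission time: running pods are unaffected,
        # but any new pod in the namespace is rejected.
        "spec": {"hard": {"pods": "0"}},
    }


def quota_expired(quota: dict, ttl: timedelta, now: datetime) -> bool:
    """True once `ttl` has elapsed since the quota's creationTimestamp."""
    created = datetime.strptime(
        quota["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return now >= created + ttl
```

A cleanup loop would list quotas created by the experiment and delete those for which `quota_expired` returns True.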


cluster-operator-rbac-revoke

  • Type: RBACRevoke
  • Danger Level: high
  • Component: cluster-operator

Revoking the cluster-operator's ClusterRoleBinding blocks Kafka resource reconciliation. The pod remains running but cannot manage Kafka clusters. After the binding is restored, reconciliation resumes.
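Rolling back a deleted binding requires a backup that can be re-created. An object fetched from the API server carries server-assigned metadata, and the API server rejects a create request that sets `resourceVersion`. The sketch below strips those fields from a saved copy. The binding, ServiceAccount, and ClusterRole names used in the test data are placeholders; OLM-managed installs may use generated names.

```python
import copy

# Fields the API server assigns; sending resourceVersion on create is rejected,
# and the others are regenerated for the new object anyway.
SERVER_ASSIGNED = ("resourceVersion", "uid", "creationTimestamp",
                   "managedFields", "generation", "selfLink")


def restorable_copy(obj: dict) -> dict:
    """Return a deep copy of a fetched object that is safe to re-create."""
    clean = copy.deepcopy(obj)
    meta = clean.setdefault("metadata", {})
    for field in SERVER_ASSIGNED:
        meta.pop(field, None)
    clean.pop("status", None)  # status is never accepted on create
    return clean
```

The cleaned object can be serialized with `json.dumps` and passed to `kubectl create -f -` to restore the binding.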


cluster-operator-deployment-scale-zero

  • Type: DeploymentScaleZero
  • Danger Level: high
  • Component: cluster-operator

This experiment scales the cluster-operator Deployment to zero replicas. OLM does not automatically restore the replica count, so the result is recorded as a Degraded finding.
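The injection and its rollback both reduce to setting `spec.replicas`. The sketch below builds the merge-patch body and compares the replica count after the observation window with the pre-experiment value; a False result corresponds to the Degraded outcome described above. The helper names are hypothetical, and the sketch does not reproduce the framework's own scoring.

```python
def scale_patch(replicas: int) -> dict:
    """Merge-patch body that sets a Deployment's replica count."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return {"spec": {"replicas": replicas}}


def replicas_restored(original: dict, current: dict) -> bool:
    """True if the Deployment is back at its pre-experiment replica count."""
    want = original.get("spec", {}).get("replicas", 1)  # API default is 1
    return current.get("spec", {}).get("replicas", 1) == want
```

The same effect is available from the CLI with `kubectl scale deployment <name> --replicas=0`; rollback applies `scale_patch` with the recorded original count.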


cluster-operator-leader-election-disrupt

  • Type: LeaderElectionDisrupt
  • Danger Level: medium
  • Component: cluster-operator

Deleting the leader election Lease forces the cluster-operator to re-acquire leadership. During re-election, reconciliation is temporarily paused.
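A `coordination.k8s.io/v1` Lease records the current holder in `spec.holderIdentity`, along with `spec.renewTime` and `spec.leaseDurationSeconds`. The sketch below is a simplified, client-side reading of those fields: a missing or expired Lease has no holder, which is the window in which reconciliation pauses. Real leader-election clients apply their own timing rules, so treat this as an approximation. The Lease name to inspect (possibly `strimzi-cluster-operator`) is an assumption to verify in your install.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(ts: str) -> datetime:
    """Parse a Kubernetes Time or MicroTime string (UTC, 'Z' suffix)."""
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            return datetime.strptime(ts, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp: {ts}")


def lease_holder(lease: dict, now: datetime):
    """Return the holder identity if the Lease is still valid, else None."""
    spec = lease.get("spec", {})
    holder = spec.get("holderIdentity")
    renew = spec.get("renewTime")
    duration = spec.get("leaseDurationSeconds")
    if not holder or not renew or duration is None:
        return None  # no recorded holder: leadership is up for grabs
    if now >= parse_k8s_time(renew) + timedelta(seconds=duration):
        return None  # lease expired without renewal
    return holder
```

The input can come from `kubectl get lease <name> -n <namespace> -o json`.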


cluster-operator-config-drift

  • Type: ConfigDrift
  • Danger Level: high
  • Component: cluster-operator

Corrupting the operator's configuration tests whether the operator, or the controller that manages its Deployment, detects the drift and restores the correct configuration values.
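Detecting drift amounts to diffing the expected configuration against what is deployed. The sketch below assumes the corrupted configuration is the operator container's environment variables (the Strimzi cluster operator reads settings such as `STRIMZI_FULL_RECONCILIATION_INTERVAL_MS` from its environment); the example values in the test are illustrative, not documented defaults.

```python
def env_to_dict(env: list) -> dict:
    """Map env var names to their literal `value`, or `valueFrom` if no literal."""
    return {e["name"]: e.get("value", e.get("valueFrom")) for e in env}


def config_drift(desired: list, actual: list) -> dict:
    """Return {name: (desired, actual)} for every variable that differs."""
    d, a = env_to_dict(desired), env_to_dict(actual)
    return {
        name: (d.get(name), a.get(name))
        for name in sorted(d.keys() | a.keys())
        if d.get(name) != a.get(name)  # missing on one side shows up as None
    }
```

An empty result after the recovery window means the configuration was restored; any remaining entries show which values stayed corrupted.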