Skip to content

Spark Operator Failure Modes

Coverage

Injection Type Danger Experiment Description
PodKill low controller/pod-kill.yaml Killing the controller pod triggers a Deployment rollout.
NetworkPartition medium controller/network-partition.yaml Isolating the controller from the API server. Controller becomes permanently non-functional.
LabelStomping high controller/label-stomping.yaml Overwriting a label on the controller Deployment.
QuotaExhaustion high controller/quota-exhaustion.yaml Applying zero-limit ResourceQuota prevents new pod creation.
RBACRevoke high controller/rbac-revoke.yaml Revoking ClusterRoleBinding blocks SparkApplication reconciliation.
DeploymentScaleZero high controller/deployment-scale-zero.yaml Scaling controller to zero replicas. Not restored automatically.
PodKill low webhook/pod-kill.yaml Killing the webhook pod triggers a Deployment rollout.
NetworkPartition medium webhook/network-partition.yaml Isolating the webhook from the API server.
LabelStomping high webhook/label-stomping.yaml Overwriting a label on the webhook Deployment.
QuotaExhaustion high webhook/quota-exhaustion.yaml Applying zero-limit ResourceQuota prevents new pod creation.
DeploymentScaleZero high webhook/deployment-scale-zero.yaml Scaling webhook to zero replicas. Not restored automatically.
ConfigDrift high webhook/webhook-cert-corrupt.yaml Corrupting the webhook TLS certificate.

Experiment Details

controller

controller-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: controller

Killing the Spark operator controller pod triggers a Deployment rollout that recreates it. Running SparkApplications are not affected.


controller-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: controller

Isolating the controller from the API server causes the controller to become permanently non-functional. This is a verified Failed finding: the controller does not recover connectivity after the partition is lifted without a pod restart.


controller-label-stomping

  • Type: LabelStomping
  • Danger Level: high
  • Component: controller

Overwriting a label on the controller Deployment tests whether the managing controller detects and restores the label.


controller-quota-exhaustion

  • Type: QuotaExhaustion
  • Danger Level: high
  • Component: controller

Applying a zero-limit ResourceQuota prevents new pod creation. The chaos framework removes the quota via TTL-based cleanup.


controller-rbac-revoke

  • Type: RBACRevoke
  • Danger Level: high
  • Component: controller

Revoking the controller's ClusterRoleBinding blocks SparkApplication reconciliation. The pod remains running but cannot manage resources. After rollback, reconciliation resumes.


controller-deployment-scale-zero

  • Type: DeploymentScaleZero
  • Danger Level: high
  • Component: controller

Scaling the controller Deployment to zero replicas. Since Spark operator is Helm/kustomize-managed, there is no controller to restore replicas automatically.


webhook

webhook-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: webhook

Killing the webhook pod triggers a Deployment rollout. SparkApplication validation is temporarily unavailable.


webhook-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: webhook

Isolating the webhook from the API server blocks SparkApplication admission. Recovery occurs after the partition is lifted.


webhook-label-stomping

  • Type: LabelStomping
  • Danger Level: high
  • Component: webhook

Overwriting a label on the webhook Deployment tests label reconciliation.


webhook-quota-exhaustion

  • Type: QuotaExhaustion
  • Danger Level: high
  • Component: webhook

Applying a zero-limit ResourceQuota prevents new webhook pod creation.


webhook-deployment-scale-zero

  • Type: DeploymentScaleZero
  • Danger Level: high
  • Component: webhook

Scaling the webhook Deployment to zero replicas. Not restored automatically.


webhook-cert-corrupt

  • Type: ConfigDrift
  • Danger Level: high
  • Component: webhook

Corrupting the webhook TLS certificate tests whether the operator or cert-manager can regenerate it.