Spark Operator Validation Results

Test Platform

  • Platform: OpenShift 4.20
  • Spark Operator Deployment: Helm/kustomize-managed
  • Test Date: 2026-05

Results

| Experiment                         | Component  | Injection             | Verdict   | Notes                                                  |
|------------------------------------|------------|-----------------------|-----------|--------------------------------------------------------|
| controller/pod-kill                | controller | PodKill               | Resilient |                                                        |
| controller/network-partition       | controller | NetworkPartition      | Failed    | Controller permanently non-functional after partition  |
| controller/label-stomping          | controller | LabelStomping         | Resilient |                                                        |
| controller/deployment-scale-zero   | controller | DeploymentScaleZero   | Degraded  | No controller to restore replicas                      |
| controller/leader-election-disrupt | controller | LeaderElectionDisrupt | Resilient |                                                        |
| controller/crashloop-inject        | controller | CrashLoopInject       | Resilient |                                                        |
| controller/image-corrupt           | controller | ImageCorrupt          | Resilient |                                                        |
| controller/pdb-block               | controller | PDBBlock              | Resilient |                                                        |
| webhook/pod-kill                   | webhook    | PodKill               | Resilient |                                                        |
| webhook/deployment-scale-zero      | webhook    | DeploymentScaleZero   | Degraded  | No controller to restore replicas                      |
| webhook/resource-deletion-service  | webhook    | ResourceDeletion      | Resilient | Service recreated in ~10s                              |
| webhook/webhook-disrupt            | webhook    | WebhookDisrupt        | Resilient |                                                        |

Key Findings

NetworkPartition: Controller Permanently Non-Functional

This is the most critical finding. After a network partition is applied and then lifted, the Spark operator controller does not re-establish its API server watch connections. The controller process keeps running but stops reconciling indefinitely; only a manual pod restart restores it. This points to missing reconnection logic in the controller's informer/watch setup.
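
The exact injection mechanism is not recorded here, but the behavior can be approximated and verified by hand. The sketch below assumes the namespace spark-operator and the label app.kubernetes.io/name=spark-operator for the controller pod; both are hypothetical and will differ between installations.

```sh
NS=spark-operator   # assumed namespace

# 1. Partition the controller: a NetworkPolicy that selects the pod and
#    lists both policy types with no rules blocks all of its traffic.
cat <<'EOF' | kubectl -n spark-operator apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: partition-spark-controller
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: spark-operator   # assumed controller label
  policyTypes: ["Ingress", "Egress"]
EOF

# 2. Lift the partition after a few minutes.
kubectl -n "$NS" delete networkpolicy partition-spark-controller

# 3. Submit a test SparkApplication; if it never progresses past its
#    initial state, the controller's watches were not re-established.

# 4. Current workaround: restart the controller pod so its informers
#    are rebuilt on startup.
kubectl -n "$NS" delete pod -l app.kubernetes.io/name=spark-operator
```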

DeploymentScaleZero: No Automatic Recovery

Both the controller and webhook Deployments remain at zero replicas when scaled down. The Spark operator is Helm/kustomize-managed (not OLM), so there is no higher-level controller to detect and restore the replica count. Manual intervention is required.
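
Because nothing reconciles the Deployments back to their desired replica count, recovery is a manual scale-up. A minimal sketch, assuming hypothetical deployment names (spark-operator-controller and spark-operator-webhook) that depend on the Helm release or kustomize overlay:

```sh
NS=spark-operator   # assumed namespace

# Restore the replica count by hand; deployment names are assumptions
# and may differ in your cluster.
kubectl -n "$NS" scale deployment spark-operator-controller --replicas=1
kubectl -n "$NS" scale deployment spark-operator-webhook --replicas=1

# Block until both deployments report ready again.
kubectl -n "$NS" rollout status deployment/spark-operator-controller
kubectl -n "$NS" rollout status deployment/spark-operator-webhook
```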

Webhook Resilience

The webhook component recovered from every injected failure except DeploymentScaleZero, which requires manual intervention for the same reason as the controller. Unlike the controller, it recovers from network partitions correctly.
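
For the ResourceDeletion case in the results table, recreation of the webhook Service can be checked directly. A small verification sketch, with the Service name (spark-operator-webhook-svc) assumed rather than taken from the actual installation:

```sh
NS=spark-operator   # assumed namespace

# Delete the webhook Service and watch for the operator to recreate it
# (observed at roughly 10 seconds in this run).
kubectl -n "$NS" delete service spark-operator-webhook-svc
kubectl -n "$NS" get service spark-operator-webhook-svc -w
```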