Spark Operator Validation Results¶
Test Platform¶
- Platform: OpenShift 4.20
- Spark Operator Version: Helm/kustomize-managed
- Test Date: 2026-05
Results¶
| Experiment | Component | Injection | Verdict | Notes |
|---|---|---|---|---|
| controller/pod-kill | controller | PodKill | Resilient | |
| controller/network-partition | controller | NetworkPartition | Failed | Controller permanently non-functional after partition |
| controller/label-stomping | controller | LabelStomping | Resilient | |
| controller/deployment-scale-zero | controller | DeploymentScaleZero | Degraded | No controller to restore replicas |
| controller/leader-election-disrupt | controller | LeaderElectionDisrupt | Resilient | |
| controller/crashloop-inject | controller | CrashLoopInject | Resilient | |
| controller/image-corrupt | controller | ImageCorrupt | Resilient | |
| controller/pdb-block | controller | PDBBlock | Resilient | |
| webhook/pod-kill | webhook | PodKill | Resilient | |
| webhook/deployment-scale-zero | webhook | DeploymentScaleZero | Degraded | No controller to restore replicas |
| webhook/resource-deletion-service | webhook | ResourceDeletion | Resilient | Service recreated in ~10s |
| webhook/webhook-disrupt | webhook | WebhookDisrupt | Resilient |
Key Findings¶
NetworkPartition: Controller Permanently Non-Functional¶
The most critical finding. After a network partition is applied and lifted, the Spark operator controller does not re-establish its API server watch connections. The controller remains running but is permanently non-functional until the pod is manually restarted. This indicates missing reconnection logic in the controller's informer/watch setup.
DeploymentScaleZero: No Automatic Recovery¶
Both the controller and webhook Deployments remain at zero replicas when scaled down. The Spark operator is Helm/kustomize-managed (not OLM), so there is no higher-level controller to detect and restore the replica count. Manual intervention is required.
Webhook Resilience¶
The webhook component shows good resilience across most failure modes. Unlike the controller, it recovers from network partitions correctly.