Service Mesh Validation Results¶
22/22 Resilient
All Service Mesh experiments passed with sub-3 second recovery times. Tested on ROSA HyperShift 4.21.11 with OSSM 3.2.0 (Istio 1.27.3).
Test Environment¶
| Property | Value |
|---|---|
| Platform | ROSA HyperShift 4.21.11 |
| OSSM Version | 3.2.0 |
| Istio Version | 1.27.3 |
| Operator CSV | servicemeshoperator3.v3.2.0 |
| Test Date | 2026-05-07 |
Results¶
servicemesh-operator3¶
| Experiment | Verdict | Recovery Time | Cycles |
|---|---|---|---|
| pod-kill | Resilient | 1.8s | 1 |
| network-partition | Resilient | 2s | 1 |
| quota-exhaustion | Resilient | 2s | 1 |
| label-stomping | Resilient | 2s | 1 |
| rbac-revoke | Resilient | 1.5s | 1 |
istiod¶
| Experiment | Verdict | Recovery Time | Cycles |
|---|---|---|---|
| pod-kill | Resilient | 2s | 1 |
| network-partition | Resilient | 1s | 1 |
| quota-exhaustion | Resilient | 2s | 1 |
| label-stomping | Resilient | 2s | 1 |
| rbac-revoke | Resilient | 1.5s | 1 |
| webhook-disrupt | Resilient | 2.6s | 1 |
| sidecar-injector-disrupt | Resilient | 1.5s | 1 |
| crd-mutation | Resilient | 1.6s | 1 |
| finalizer-block | Resilient | 1.5s | 1 |
| config-drift | Resilient | 1.7s | 1 |
| webhook-latency | Resilient | 1.6s | 1 |
| ownerref-orphan | Resilient | 1.8s | 1 |
| crashloop-inject | Resilient | <3s | 1 |
| image-corrupt | Resilient | <3s | 1 |
| resource-deletion-service | Resilient | <3s | 1 |
| pdb-block | Resilient | <3s | 1 |
OLM Lifecycle¶
| Experiment | Verdict | Recovery Time | Cycles |
|---|---|---|---|
| olm-subscription-corrupt | Resilient | 1.5s | 1 |
Analysis¶
Service Mesh v3 (Sail Operator) shows excellent resilience across all fault categories:
Operator Layer: OLM restores the operator deployment, RBAC, and labels within 2 seconds. The operator uses a single replica, but the fast recovery via OLM CSV reconciliation means downtime is minimal.
Control Plane Layer: istiod recovers quickly from all faults. The Deployment controller handles pod-kill, the operator handles label restoration and webhook recovery, and OLM handles RBAC restoration. The Istio CR's Ready condition is maintained through all experiments.
Webhook Resilience: Both the validating webhook (/validate) and the mutating sidecar injector (/inject) are restored within 3 seconds after disruption. istiod actively monitors and re-registers its webhooks.
CRD Validation: The Istio CR has CEL validation on spec.version that prevents invalid values from being set. This blocked our initial CRDMutation attempt, which is actually a positive finding: the CRD schema prevents accidental misconfiguration at the admission layer.
Config Recovery: Both the Istio CR spec drift (CRDMutation) and ConfigMap corruption (ConfigDrift) were recovered in under 2 seconds. The Sail operator actively reconciles the full resource tree, not just the deployment.
Key Observation: Unlike some operators where RBAC revocation causes prolonged degradation (e.g., Kueue OLM taking ~3 minutes), Service Mesh recovers RBAC in 1.5 seconds. OLM's CSV reconciliation for the Service Mesh operator is highly efficient.