Knative Serving Validation Results
Test Platform
- Platform: ROSA HyperShift 4.21.11
- Knative Version: OpenShift Serverless 1.37.1
- Test Date: 2026-05-06
Results
| Experiment | Component | Injection | Verdict | Recovery Time | Reconcile Cycles |
|---|---|---|---|---|---|
| activator/pod-kill | activator | PodKill | Resilient | 916ms | 1 |
| activator/all-pods-kill | activator | PodKill | Resilient | 893ms | 1 |
| activator/network-partition | activator | NetworkPartition | Failed | N/A | N/A |
| activator/label-stomping | activator | LabelStomping | Resilient | 904ms | 1 |
| activator/quota-exhaustion | activator | QuotaExhaustion | Resilient | 934ms | 1 |
| activator/rbac-revoke | activator | RBACRevoke | Resilient | 1.3s | 1 |
| autoscaler/pod-kill | autoscaler | PodKill | Resilient | 928ms | 1 |
| autoscaler/all-pods-kill | autoscaler | PodKill | Resilient | 890ms | 1 |
| autoscaler/network-partition | autoscaler | NetworkPartition | Resilient | 907ms | 1 |
| autoscaler/label-stomping | autoscaler | LabelStomping | Resilient | 1.0s | 1 |
| autoscaler/quota-exhaustion | autoscaler | QuotaExhaustion | Resilient | 899ms | 1 |
| autoscaler-hpa/pod-kill | autoscaler-hpa | PodKill | Resilient | ~1s | 1 |
| autoscaler-hpa/network-partition | autoscaler-hpa | NetworkPartition | Resilient | ~1s | 1 |
| controller/pod-kill | controller | PodKill | Resilient | 917ms | 1 |
| controller/all-pods-kill | controller | PodKill | Resilient | ~1s | 1 |
| controller/network-partition | controller | NetworkPartition | Resilient | ~1s | 1 |
| controller/label-stomping | controller | LabelStomping | Resilient | 890ms | 1 |
| controller/quota-exhaustion | controller | QuotaExhaustion | Resilient | ~1s | 1 |
| controller/rbac-revoke | controller | RBACRevoke | Resilient | 1.4s | 1 |
| webhook/pod-kill | webhook | PodKill | Resilient | 895ms | 1 |
| webhook/all-pods-kill | webhook | PodKill | Resilient | ~1s | 1 |
| webhook/network-partition | webhook | NetworkPartition | Resilient | ~1s | 1 |
| webhook/cert-corrupt | webhook | ConfigDrift | Resilient | 879ms | 1 |
| webhook/label-stomping | webhook | LabelStomping | Resilient | ~1s | 1 |
| webhook/quota-exhaustion | webhook | QuotaExhaustion | Resilient | ~1s | 1 |
| webhook/webhook-disrupt | webhook | WebhookDisrupt | Resilient | 898ms | 1 |
| kourier-gateway/pod-kill | kourier-gateway | PodKill | Resilient | 893ms | 1 |
| kourier-gateway/all-pods-kill | kourier-gateway | PodKill | Resilient | 869ms | 1 |
| kourier-gateway/network-partition | kourier-gateway | NetworkPartition | Resilient | 893ms | 1 |
| kourier-gateway/label-stomping | kourier-gateway | LabelStomping | Resilient | ~1s | 1 |
| kourier-gateway/quota-exhaustion | kourier-gateway | QuotaExhaustion | Resilient | ~1s | 1 |
| net-kourier-controller/pod-kill | net-kourier-controller | PodKill | Resilient | ~1s | 1 |
| net-kourier-controller/all-pods-kill | net-kourier-controller | PodKill | Resilient | ~1s | 1 |
| net-kourier-controller/network-partition | net-kourier-controller | NetworkPartition | Resilient | ~1s | 1 |
| net-kourier-controller/rbac-revoke | net-kourier-controller | RBACRevoke | Resilient | 943ms | 1 |
Key Findings
activator/network-partition Failure
The activator network-partition experiment revealed a critical failure mode. After the activator pods are network-isolated and the partition is then lifted, their health probe endpoints return HTTP 500. The activator process does not recover from the isolation on its own; service is restored only once the liveness probe restarts the pods.
Root Cause: The activator's health check endpoint depends on a WebSocket connection to the autoscaler. When the network partition blocks this connection, the health check handler enters a failed state. After the partition lifts, the WebSocket connection does not automatically re-establish, and the health check continues returning HTTP 500.
Impact: While a network partition affects the activator, scale-from-zero traffic is blocked. After the network recovers, service does not resume until the liveness probe times out and the pods are restarted.
Mitigation: The activator should implement connection retry logic for the autoscaler WebSocket. Alternatively, the health check should differentiate between transient connectivity issues and permanent failures.
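The two mitigations above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the activator's actual code: `linkState`, `reconnect`, `flakyDial`, `noSleep`, and the grace-period and backoff values are all hypothetical, and the WebSocket dial to the autoscaler is replaced by a stub function.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// linkState records whether the autoscaler connection is up and, if not,
// since when it has been down. Hypothetical type, for illustration only.
type linkState struct {
	mu        sync.Mutex
	connected bool
	downSince time.Time
}

// set updates the state; only the first transition to "down" is timestamped.
func (s *linkState) set(connected bool, now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !connected && (s.connected || s.downSince.IsZero()) {
		s.downSince = now
	}
	s.connected = connected
}

// healthy treats an outage shorter than grace as transient and reports
// failure only once the connection has been down for at least grace.
func (s *linkState) healthy(now time.Time, grace time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.connected || now.Sub(s.downSince) < grace
}

// reconnect retries dial with exponential backoff capped at maxDelay and
// marks the link as up again as soon as one attempt succeeds.
func reconnect(dial func() error, s *linkState, base, maxDelay time.Duration,
	attempts int, sleep func(time.Duration)) error {
	delay := base
	for i := 0; i < attempts; i++ {
		if err := dial(); err == nil {
			s.set(true, time.Now())
			return nil
		}
		s.set(false, time.Now())
		sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return errors.New("autoscaler still unreachable after retries")
}

// flakyDial returns a stub dialer that fails its first `failures` calls.
// It stands in for opening the WebSocket to the autoscaler.
func flakyDial(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("autoscaler unreachable")
		}
		return nil
	}
}

// noSleep skips real waiting so the demo finishes instantly.
func noSleep(time.Duration) {}

func main() {
	state := &linkState{}
	start := time.Now()
	state.set(false, start) // partition begins

	fmt.Println(state.healthy(start.Add(2*time.Second), 10*time.Second))  // within grace
	fmt.Println(state.healthy(start.Add(30*time.Second), 10*time.Second)) // beyond grace

	// Partition lifted: the dial succeeds on the third attempt.
	err := reconnect(flakyDial(2), state, 100*time.Millisecond, 5*time.Second, 10, noSleep)
	fmt.Println(err, state.healthy(start.Add(31*time.Second), 10*time.Second))
}
```

The point of the design is that the health result is derived from current connection state rather than latched on the first failure, so a successful reconnect returns the probe to healthy without a pod restart. The sleep function is injected only so the backoff can be exercised without real delays.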
Overall Resilience
Knative Serving demonstrates excellent resilience across 34 of 35 experiments. The platform handles pod failures, RBAC disruptions, quota exhaustion, and configuration drift without manual intervention. Recovery is consistently fast: every recovered experiment was back within about 1 second (869 ms to 1.4 s), and each required only a single reconcile cycle.
The activator/network-partition failure is the only experiment in which the component did not recover on its own; it needed a liveness probe restart. This is a known architectural limitation of the activator's dependency on persistent WebSocket connections to the autoscaler.
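The summary figures can be re-derived from the Recovery Time column of the results table. The sketch below is one way to do that tally; `recoveryTimes`, `summary`, and `summarize` are names introduced here for illustration, and treating `~1s` as exactly one second is an assumption, since the table reports those values only approximately.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// recoveryTimes holds the Recovery Time column of the results table, in row
// order. "N/A" marks the failed activator/network-partition experiment.
var recoveryTimes = []string{
	"916ms", "893ms", "N/A", "904ms", "934ms", "1.3s", // activator
	"928ms", "890ms", "907ms", "1.0s", "899ms", // autoscaler
	"~1s", "~1s", // autoscaler-hpa
	"917ms", "~1s", "~1s", "890ms", "~1s", "1.4s", // controller
	"895ms", "~1s", "~1s", "879ms", "~1s", "~1s", "898ms", // webhook
	"893ms", "869ms", "893ms", "~1s", "~1s", // kourier-gateway
	"~1s", "~1s", "~1s", "943ms", // net-kourier-controller
}

// summary aggregates the column into the figures quoted in the text.
type summary struct {
	total, recovered, subSecond int
	fastest, slowest            time.Duration
}

// summarize parses each entry and tallies counts and the observed range.
func summarize(values []string) (summary, error) {
	s := summary{total: len(values)}
	for _, v := range values {
		if v == "N/A" {
			continue // no recovery was observed for this experiment
		}
		// Assumption: "~1s" is parsed as 1s, so it does not count as sub-second.
		d, err := time.ParseDuration(strings.TrimPrefix(v, "~"))
		if err != nil {
			return s, fmt.Errorf("bad recovery time %q: %w", v, err)
		}
		s.recovered++
		if d < time.Second {
			s.subSecond++
		}
		if s.fastest == 0 || d < s.fastest {
			s.fastest = d
		}
		if d > s.slowest {
			s.slowest = d
		}
	}
	return s, nil
}

func main() {
	s, err := summarize(recoveryTimes)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d/%d recovered, %d strictly under 1s, range %v to %v\n",
		s.recovered, s.total, s.subSecond, s.fastest, s.slowest)
}
```

Run against the table as published, this prints `34/35 recovered, 17 strictly under 1s, range 869ms to 1.4s`. Exactly half of the recovered experiments report an explicit sub-second value; the rest are recorded as roughly one second or slightly above.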