Knative Serving Validation Results

Test Platform

  • Platform: ROSA HyperShift 4.21.11
  • Knative Version: OpenShift Serverless 1.37.1
  • Test Date: 2026-05-06

Results

| Experiment | Component | Injection | Verdict | Recovery Time | Reconcile Cycles |
|---|---|---|---|---|---|
| activator/pod-kill | activator | PodKill | Resilient | 916ms | 1 |
| activator/all-pods-kill | activator | PodKill | Resilient | 893ms | 1 |
| activator/network-partition | activator | NetworkPartition | Failed | N/A | N/A |
| activator/label-stomping | activator | LabelStomping | Resilient | 904ms | 1 |
| activator/quota-exhaustion | activator | QuotaExhaustion | Resilient | 934ms | 1 |
| activator/rbac-revoke | activator | RBACRevoke | Resilient | 1.3s | 1 |
| autoscaler/pod-kill | autoscaler | PodKill | Resilient | 928ms | 1 |
| autoscaler/all-pods-kill | autoscaler | PodKill | Resilient | 890ms | 1 |
| autoscaler/network-partition | autoscaler | NetworkPartition | Resilient | 907ms | 1 |
| autoscaler/label-stomping | autoscaler | LabelStomping | Resilient | 1.0s | 1 |
| autoscaler/quota-exhaustion | autoscaler | QuotaExhaustion | Resilient | 899ms | 1 |
| autoscaler-hpa/pod-kill | autoscaler-hpa | PodKill | Resilient | ~1s | 1 |
| autoscaler-hpa/network-partition | autoscaler-hpa | NetworkPartition | Resilient | ~1s | 1 |
| controller/pod-kill | controller | PodKill | Resilient | 917ms | 1 |
| controller/all-pods-kill | controller | PodKill | Resilient | ~1s | 1 |
| controller/network-partition | controller | NetworkPartition | Resilient | ~1s | 1 |
| controller/label-stomping | controller | LabelStomping | Resilient | 890ms | 1 |
| controller/quota-exhaustion | controller | QuotaExhaustion | Resilient | ~1s | 1 |
| controller/rbac-revoke | controller | RBACRevoke | Resilient | 1.4s | 1 |
| webhook/pod-kill | webhook | PodKill | Resilient | 895ms | 1 |
| webhook/all-pods-kill | webhook | PodKill | Resilient | ~1s | 1 |
| webhook/network-partition | webhook | NetworkPartition | Resilient | ~1s | 1 |
| webhook/cert-corrupt | webhook | ConfigDrift | Resilient | 879ms | 1 |
| webhook/label-stomping | webhook | LabelStomping | Resilient | ~1s | 1 |
| webhook/quota-exhaustion | webhook | QuotaExhaustion | Resilient | ~1s | 1 |
| webhook/webhook-disrupt | webhook | WebhookDisrupt | Resilient | 898ms | 1 |
| kourier-gateway/pod-kill | kourier-gateway | PodKill | Resilient | 893ms | 1 |
| kourier-gateway/all-pods-kill | kourier-gateway | PodKill | Resilient | 869ms | 1 |
| kourier-gateway/network-partition | kourier-gateway | NetworkPartition | Resilient | 893ms | 1 |
| kourier-gateway/label-stomping | kourier-gateway | LabelStomping | Resilient | ~1s | 1 |
| kourier-gateway/quota-exhaustion | kourier-gateway | QuotaExhaustion | Resilient | ~1s | 1 |
| net-kourier-controller/pod-kill | net-kourier-controller | PodKill | Resilient | ~1s | 1 |
| net-kourier-controller/all-pods-kill | net-kourier-controller | PodKill | Resilient | ~1s | 1 |
| net-kourier-controller/network-partition | net-kourier-controller | NetworkPartition | Resilient | ~1s | 1 |
| net-kourier-controller/rbac-revoke | net-kourier-controller | RBACRevoke | Resilient | 943ms | 1 |

Key Findings

activator/network-partition Failure

The activator network partition experiment revealed a critical failure mode. When activator pods are network-isolated and the partition is then lifted, the pods keep returning HTTP 500 on their health probe endpoints. The activator process does not recover from network isolation on its own; service is restored only after the liveness probe fails and the kubelet restarts the pods.

Root Cause: The activator's health check endpoint depends on a WebSocket connection to the autoscaler. When the network partition blocks this connection, the health check handler enters a failed state. After the partition lifts, the WebSocket connection does not automatically re-establish, and the health check continues returning HTTP 500.

Impact: During network partitions affecting the activator, scale-from-zero traffic is blocked. After the network recovers, service does not resume until the liveness probe times out and the pods are restarted, so the outage extends beyond the partition itself.

Mitigation: The activator should implement connection retry logic for the autoscaler WebSocket. Alternatively, the health check should differentiate between transient connectivity issues and permanent failures.
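The two mitigations above can be sketched together in a few lines. This is a minimal illustration in Python, not the activator's actual code (the activator is written in Go). The names `AutoscalerLink`, `backoff_delays`, `grace_seconds`, and the `connect` callback are all hypothetical, and the choice to return 200 inside a grace window and 503 after it is an assumption about what "differentiating transient from permanent failures" could look like.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


def backoff_delays(base: float = 0.1, cap: float = 5.0, jitter: float = 0.2,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield exponentially growing delays, capped at `cap`, with +/- jitter."""
    rng = rng or random.Random()
    delay = base
    while True:
        spread = delay * jitter
        yield max(0.0, delay + rng.uniform(-spread, spread))
        delay = min(cap, delay * 2)


@dataclass
class AutoscalerLink:
    """Hypothetical stand-in for the activator's connection to the autoscaler."""
    connect: Callable[[], bool]            # assumed dial function; True on success
    grace_seconds: float = 30.0            # how long a disconnect counts as transient
    clock: Callable[[], float] = time.monotonic
    sleep: Callable[[float], None] = time.sleep
    connected: bool = False
    disconnected_at: Optional[float] = None

    def mark_disconnected(self) -> None:
        """Record the start of an outage; keep the earliest timestamp."""
        self.connected = False
        if self.disconnected_at is None:
            self.disconnected_at = self.clock()

    def reconnect(self, max_attempts: int = 10) -> bool:
        """Redial with backoff until the dial succeeds or attempts run out."""
        delays = backoff_delays()
        for _ in range(max_attempts):
            if self.connect():
                self.connected = True
                self.disconnected_at = None
                return True
            self.sleep(next(delays))
        return False

    def health_status(self) -> int:
        """Derive the probe status from current state on every call."""
        if self.connected:
            return 200
        if self.disconnected_at is None:
            return 503  # never connected
        elapsed = self.clock() - self.disconnected_at
        return 200 if elapsed <= self.grace_seconds else 503
```

The point of the design is that `health_status` holds no sticky failure flag: it is recomputed from the connection state each time, so a successful `reconnect` is immediately reflected in the probe result without a restart.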

Overall Resilience

Knative Serving proved resilient in 34 of 35 experiments. The platform handles pod failures, RBAC disruptions, quota exhaustion, and configuration drift without manual intervention. Recovery is consistently fast: every resilient experiment recovered in roughly 1.4 seconds or less (17 of the 34 are reported as under 1 second, the rest as about 1 second up to 1.4 seconds), and each required only a single reconcile cycle.
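The summary figures can be recomputed from the Recovery Time column of the results table. The short Python tally below is one way to do that; the `RECOVERY` list is copied from the table in row order, while `to_ms` is a hypothetical helper. Values reported as `~1s` are treated as approximately 1000 ms, which is an assumption, so the sub-second count should be read as a lower bound.

```python
import re
from typing import Optional

# Recovery Time column from the results table, in row order (one line per component).
RECOVERY = """
916ms 893ms N/A 904ms 934ms 1.3s
928ms 890ms 907ms 1.0s 899ms
~1s ~1s
917ms ~1s ~1s 890ms ~1s 1.4s
895ms ~1s ~1s 879ms ~1s ~1s 898ms
893ms 869ms 893ms ~1s ~1s
~1s ~1s ~1s 943ms
""".split()


def to_ms(text: str) -> Optional[int]:
    """Convert '916ms', '1.3s', or '~1s' to milliseconds; 'N/A' becomes None."""
    if text == "N/A":
        return None
    match = re.fullmatch(r"~?(\d+(?:\.\d+)?)(ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognized recovery time: {text!r}")
    value = float(match.group(1))
    return round(value if match.group(2) == "ms" else value * 1000)


recovered = [ms for ms in map(to_ms, RECOVERY) if ms is not None]
sub_second = sum(1 for ms in recovered if ms < 1000)

print(f"experiments: {len(RECOVERY)}")
print(f"recovered:   {len(recovered)}")
print(f"sub-second:  {sub_second}")
print(f"slowest:     {max(recovered)} ms")
```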

The activator/network-partition failure is the only experiment in which the component does not recover on its own and instead depends on a liveness probe restart. This is a known architectural limitation of the activator's dependency on persistent WebSocket connections to the autoscaler.