knative-serving Custom Experiments
This page provides templates and guidance for writing custom chaos experiments targeting Knative Serving components.
Component Overview
Knative Serving has 7 main components across two namespaces:
knative-serving namespace
- activator: Request buffering proxy for scale-from-zero (label: app=activator)
- autoscaler: Makes scaling decisions based on metrics (label: app=autoscaler)
- autoscaler-hpa: HPA-based autoscaler (label: app=autoscaler-hpa)
- controller: Main controller for Service/Route/Configuration resources (label: app=controller)
- webhook: Validation and mutation webhook (label: app=webhook)
knative-serving-ingress namespace
- kourier-gateway: Envoy-based ingress gateway (label: app=3scale-kourier-gateway)
- net-kourier-controller: Programs Envoy routes (label: app=net-kourier-controller)
Key Architectural Relationships
Understanding these relationships helps design meaningful experiments:
- activator → autoscaler WebSocket: The activator maintains a persistent WebSocket connection to the autoscaler for metric streaming. Network partitions that break this connection can cause health check failures.
- controller → leader election leases: The controller uses leader election (Lease resources in knative-serving). Experiments that disrupt API server connectivity or RBAC can trigger leader failover.
- net-kourier-controller → kourier-gateway Envoy config: The net-kourier-controller programs Envoy routes via ConfigMap updates. Disrupting it blocks new route creation but does not affect existing routes.
- webhook → cert-manager: The webhook's TLS certificates are managed by cert-manager. Certificate corruption triggers automatic regeneration.
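As one way to exercise the activator → autoscaler relationship, the sketch below adapts the template format used on this page to partition the autoscaler so that the activator's WebSocket connection is cut. The experiment name, the Deployment name autoscaler, and the placement of direction under parameters are assumptions made for illustration; check them against your cluster and the experiment schema before running.

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-autoscaler-partition   # hypothetical name
spec:
  target:
    operator: knative-serving
    component: autoscaler
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: autoscaler               # assumed to match the app=autoscaler label
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app=autoscaler
      direction: ingress               # assumed placement; blocks traffic into the autoscaler
    ttl: "120s"
  hypothesis:
    description: >-
      While the partition is active, the activator loses its WebSocket
      connection to the autoscaler. After the TTL expires, the connection
      is re-established and both Deployments report Available.
    recoveryTimeout: 120s
```

The hypothesis text is only a starting point; replace it with the behavior you actually expect to observe in your environment.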
Example Templates
activator
```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-activator-custom
spec:
  target:
    operator: knative-serving
    component: activator
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: activator
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill  # Change to desired injection type
    parameters:
      labelSelector: app=activator
    ttl: "120s"
  hypothesis:
    description: >-
      Describe the expected behavior after fault injection.
    recoveryTimeout: 120s
```
controller
```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-controller-custom
spec:
  target:
    operator: knative-serving
    component: controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: controller
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app=controller
      direction: ingress
    ttl: "120s"
  hypothesis:
    description: >-
      Describe the expected behavior after fault injection.
    recoveryTimeout: 120s
```
kourier-gateway
```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-kourier-custom
spec:
  target:
    operator: knative-serving
    component: kourier-gateway
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: 3scale-kourier-gateway
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=3scale-kourier-gateway
    count: 1
    ttl: "120s"
  hypothesis:
    description: >-
      Describe the expected behavior after fault injection.
    recoveryTimeout: 120s
```
Running Custom Experiments
- Save your experiment YAML to a file.
- Run: chaos-cli run --experiment <file>
- Check results: chaos-cli results --latest
Design Considerations
When designing custom experiments for Knative Serving:
- Test scale-from-zero behavior: Many experiments should validate that scale-from-zero continues working during and after faults.
- Measure inference latency: Knative Serving experiments should measure request latency and availability, not just component recovery.
- Consider multi-component faults: Knative Serving's distributed architecture means that simultaneous faults in activator + autoscaler or controller + webhook can have compounding effects.
- Validate route programming: For kourier-gateway and net-kourier-controller experiments, verify that new Knative Services can be created and routed after recovery.
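To make the route-programming consideration concrete, here is a minimal sketch of a net-kourier-controller experiment in the same template format. It assumes that net-kourier-controller is accepted as a target component and that the Deployment shares that name; the experiment name is hypothetical. The experiment itself only checks Deployment availability, so creating a new Knative Service after recovery and confirming it is routable remains a separate manual or scripted step.

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-net-kourier-controller-custom   # hypothetical name
spec:
  target:
    operator: knative-serving
    component: net-kourier-controller           # assumed to be a valid component name
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: net-kourier-controller            # assumed Deployment name
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=net-kourier-controller
    ttl: "120s"
  hypothesis:
    description: >-
      Existing routes keep serving traffic while net-kourier-controller
      restarts. New Knative Services created after recovery are programmed
      into the gateway and become routable.
    recoveryTimeout: 120s
```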