knative-serving Custom Experiments¶

This page provides templates and guidance for writing custom chaos experiments targeting Knative Serving components.

Component Overview¶

Knative Serving has 7 main components across two namespaces:

knative-serving namespace¶

activator: Request buffering proxy for scale-from-zero (label: app=activator)
autoscaler: Makes scaling decisions based on metrics (label: app=autoscaler)
autoscaler-hpa: HPA-based autoscaler (label: app=autoscaler-hpa)
controller: Main controller for Service/Route/Configuration resources (label: app=controller)
webhook: Validation and mutation webhook (label: app=webhook)

knative-serving-ingress namespace¶

kourier-gateway: Envoy-based ingress gateway (label: app=3scale-kourier-gateway)
net-kourier-controller: Programs Envoy routes (label: app=net-kourier-controller)

Key Architectural Relationships¶

Understanding these relationships helps design meaningful experiments:

activator → autoscaler WebSocket: The activator maintains a persistent WebSocket connection to the autoscaler for metric streaming. Network partitions that break this connection can cause health check failures.
controller → leader election leases: The controller uses leader election (Lease resources in knative-serving). Experiments that disrupt API server connectivity or RBAC can trigger leader failover.
net-kourier-controller → kourier-gateway Envoy config: The controller programs Envoy routes via ConfigMap updates. Disrupting the controller blocks new route creation but doesn't affect existing routes.
webhook → cert-manager: The webhook's TLS certificates are managed by cert-manager. Certificate corruption triggers automatic regeneration.

Example Templates¶

activator¶

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-activator-custom
spec:
  target:
    operator: knative-serving
    component: activator
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: activator
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill  # Change to desired injection type
    parameters:
      labelSelector: app=activator
    ttl: "120s"
  hypothesis:
    description: >-
      Describe the expected behavior after fault injection.
    recoveryTimeout: 120s

controller¶

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-controller-custom
spec:
  target:
    operator: knative-serving
    component: controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: controller
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app=controller
      direction: ingress
    ttl: "120s"
  hypothesis:
    description: >-
      Describe the expected behavior after fault injection.
    recoveryTimeout: 120s

kourier-gateway¶

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-kourier-custom
spec:
  target:
    operator: knative-serving
    component: kourier-gateway
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: 3scale-kourier-gateway
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=3scale-kourier-gateway
    count: 1
    ttl: "120s"
  hypothesis:
    description: >-
      Describe the expected behavior after fault injection.
    recoveryTimeout: 120s

Running Custom Experiments¶

Save your experiment YAML to a file
Run: chaos-cli run --experiment <file>
Check results: chaos-cli results --latest

Design Considerations¶

When designing custom experiments for Knative Serving:

Test scale-from-zero behavior: Many experiments should validate that scale-from-zero continues working during and after faults.
Measure inference latency: Knative Serving experiments should measure request latency and availability, not just component recovery.
Consider multi-component faults: Knative Serving's distributed architecture means that simultaneous faults in activator + autoscaler or controller + webhook can have compounding effects.
Validate route programming: For kourier-gateway and net-kourier-controller experiments, verify that new Knative Services can be created and routed after recovery.