knative-serving Failure Modes

Coverage

| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| PodKill | low | activator/pod-kill.yaml | Killing one activator pod should not affect inference traffic; the other replica handles requests. |
| PodKill | high | activator/all-pods-kill.yaml | Killing both activator pods simultaneously causes a complete traffic blackout for scale-from-zero and request buffering. |
| NetworkPartition | medium | activator/network-partition.yaml | Isolating activator pods from the network blocks all inference requests routed through the activator. |
| LabelStomping | medium | activator/label-stomping.yaml | Removing the app=activator label from pods disconnects them from the activator Service endpoints. |
| QuotaExhaustion | medium | activator/quota-exhaustion.yaml | A restrictive ResourceQuota in the knative-serving namespace prevents the activator from restarting after a failure. |
| RBACRevoke | high | activator/rbac-revoke.yaml | Revoking the activator's ClusterRoleBinding removes its ability to read Services, Endpoints, and other resources. |
| PodKill | medium | autoscaler/pod-kill.yaml | Killing a single autoscaler pod temporarily pauses scale decisions; the remaining replica takes over via leader election. |
| PodKill | high | autoscaler/all-pods-kill.yaml | Killing both autoscaler pods simultaneously stops all scale-to-zero and scale-up decisions. |
| NetworkPartition | medium | autoscaler/network-partition.yaml | Isolating the autoscaler blocks metric collection from activator pods and stalls scaling decisions. |
| LabelStomping | medium | autoscaler/label-stomping.yaml | Removing the app=autoscaler label disconnects pods from the autoscaler Service. |
| QuotaExhaustion | medium | autoscaler/quota-exhaustion.yaml | A restrictive ResourceQuota prevents the autoscaler from restarting. |
| PodKill | medium | autoscaler-hpa/pod-kill.yaml | Killing an autoscaler-hpa pod temporarily pauses HPA-based scaling; the other replica takes over via leader election. |
| NetworkPartition | medium | autoscaler-hpa/network-partition.yaml | Isolating the HPA autoscaler blocks its ability to receive metrics and update HPA resources. |
| PodKill | low | controller/pod-kill.yaml | Killing one Knative Serving controller pod should not affect existing inference services; the other replica takes over via leader election. |
| PodKill | high | controller/all-pods-kill.yaml | Killing both controller pods stops all Knative Service reconciliation; existing services continue running. |
| NetworkPartition | medium | controller/network-partition.yaml | Isolating the controller from the network prevents it from reading or updating Knative resources. |
| LabelStomping | medium | controller/label-stomping.yaml | Removing the app=controller label from pods should trigger the Deployment controller to recreate pods with correct labels. |
| QuotaExhaustion | medium | controller/quota-exhaustion.yaml | A restrictive ResourceQuota prevents the controller from restarting after a failure. |
| RBACRevoke | high | controller/rbac-revoke.yaml | Revoking the controller's admin ClusterRoleBinding removes its ability to manage Knative resources. |
| PodKill | low | webhook/pod-kill.yaml | Killing one Knative webhook pod should not block Knative Service operations; the other replica handles validation and mutation. |
| PodKill | high | webhook/all-pods-kill.yaml | Killing both webhook pods blocks all Knative Service creation and modification. |
| NetworkPartition | medium | webhook/network-partition.yaml | Isolating webhook pods makes the webhook Service unreachable; the API server gets timeouts when validating Knative resources. |
| ConfigDrift | high | webhook/cert-corrupt.yaml | Corrupting the Knative webhook TLS certificate should cause webhook validation to fail with TLS errors. |
| LabelStomping | medium | webhook/label-stomping.yaml | Removing the app=webhook label disconnects pods from webhook Service endpoints, making the webhook unreachable. |
| QuotaExhaustion | medium | webhook/quota-exhaustion.yaml | A restrictive ResourceQuota prevents the webhook from restarting. |
| WebhookDisrupt | high | webhook/webhook-disrupt.yaml | Changing the validating webhook's failurePolicy from Fail to Ignore bypasses all Knative resource validation. |
| PodKill | medium | kourier-gateway/pod-kill.yaml | Killing one Kourier gateway pod should shift traffic to the other replica of the Envoy-based ingress proxy. |
| PodKill | high | kourier-gateway/all-pods-kill.yaml | Killing both Kourier gateway pods simultaneously causes a complete inference traffic blackout. |
| NetworkPartition | medium | kourier-gateway/network-partition.yaml | Network-isolating the Kourier gateway pods blocks all inference traffic at the ingress layer. |
| LabelStomping | medium | kourier-gateway/label-stomping.yaml | Removing the app=3scale-kourier-gateway label disconnects gateway pods from the gateway Service. |
| QuotaExhaustion | medium | kourier-gateway/quota-exhaustion.yaml | A restrictive ResourceQuota prevents the Kourier gateway from restarting. |
| PodKill | medium | net-kourier-controller/pod-kill.yaml | Killing a net-kourier-controller pod temporarily pauses Envoy route programming; the other replica takes over. |
| PodKill | high | net-kourier-controller/all-pods-kill.yaml | Killing both net-kourier-controller pods stops all Envoy route programming; existing routes keep serving. |
| NetworkPartition | medium | net-kourier-controller/network-partition.yaml | Isolating the net-kourier-controller prevents it from receiving API server events and updating Envoy configuration. |
| RBACRevoke | high | net-kourier-controller/rbac-revoke.yaml | Revoking the net-kourier ClusterRoleBinding prevents the controller from reading Ingress resources and programming Envoy. |
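
All experiments in this catalog share the same ChaosExperiment shape: a steady-state check that must pass before injection, the fault injection itself with a TTL, a hypothesis with a recovery timeout, and a blast-radius guard. The abbreviated skeleton below is a sketch distilled from the full manifests later in this document; placeholder values are illustrative.

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: <experiment-name>
spec:
  tier: 1                          # tiers in this catalog range from 1 to 5
  target:
    operator: knative-serving
    component: <component>         # e.g. activator, autoscaler, webhook
    resource: Deployment/<name>
  steadyState:                     # must pass before the fault is injected
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: <name>
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill                  # or NetworkPartition, LabelStomping, ...
    parameters:
      labelSelector: app=<name>
    count: 1
    ttl: "120s"                    # the fault is reverted after this duration
  hypothesis:
    description: Expected behaviour during the fault and after recovery.
    recoveryTimeout: 120s
  blastRadius:                     # hard limits on what the experiment may touch
    maxPodsAffected: 1
    allowedNamespaces:
      - knative-serving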

Experiment Details

activator

knative-activator-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: activator

The activator is the request-buffering proxy that holds traffic during scale-from-zero. Killing one activator pod should not affect inference traffic since the other replica handles requests. The Deployment controller recreates the pod.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-activator-pod-kill
spec:
  tier: 1
  target:
    operator: knative-serving
    component: activator
    resource: Deployment/activator
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: activator
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=activator
    count: 1
    ttl: "120s"
  hypothesis:
    description: >-
      The activator is the request-buffering proxy that holds traffic
      during scale-from-zero. Killing one activator pod should not
      affect inference traffic since the other replica handles requests.
      The Deployment controller recreates the pod.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - knative-serving
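
The conditionTrue steady-state check used throughout these experiments passes when the target Deployment reports the Available condition as True. A trimmed sketch of the status fragment it inspects (timestamps and messages omitted):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: activator
  namespace: knative-serving
status:
  conditions:
    - type: Available
      status: "True"                 # the check requires this condition to be True
      reason: MinimumReplicasAvailable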

knative-activator-all-pods-kill

  • Type: PodKill
  • Danger Level: high
  • Component: activator

Killing both activator pods simultaneously causes a complete traffic blackout for scale-from-zero and request buffering. Active inference services with running pods may still serve directly, but any service in scale-to-zero state becomes unreachable. After the Deployment recreates both pods, traffic routing should resume.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-activator-all-pods-kill
spec:
  tier: 4
  target:
    operator: knative-serving
    component: activator
    resource: Deployment/activator
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: activator
        namespace: knative-serving
        conditionType: Available
    timeout: "60s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=activator
    count: 2
    ttl: "120s"
  hypothesis:
    description: >-
      Killing both activator pods simultaneously causes a complete
      traffic blackout for scale-from-zero and request buffering.
      Active inference services with running pods may still serve
      directly, but any service in scale-to-zero state becomes
      unreachable. After Deployment recreates both pods, traffic
      routing should resume.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-activator-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: activator

Isolating activator pods from the network blocks all inference requests routed through the activator. This simulates a network failure between the ingress layer and the activator. The autoscaler loses visibility into request counts, potentially causing erratic scaling decisions. After the partition is lifted, traffic routing should resume.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-activator-network-partition
spec:
  tier: 3
  target:
    operator: knative-serving
    component: activator
    resource: Deployment/activator
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: activator
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app=activator
      direction: ingress
    ttl: "60s"
  hypothesis:
    description: >-
      Isolating activator pods from the network blocks all inference
      requests routed through the activator. This simulates a network
      failure between the ingress layer and the activator. The
      autoscaler loses visibility into request counts, potentially
      causing erratic scaling decisions. After partition is lifted,
      traffic routing should resume.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving
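
The ingress-direction partition has roughly the effect of a deny-all-ingress NetworkPolicy scoped to the activator pods. The operator may implement the isolation differently (for example at the node or CNI level), but this sketch, which assumes a NetworkPolicy-enforcing CNI, conveys the intent:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-partition-activator    # illustrative name
  namespace: knative-serving
spec:
  podSelector:
    matchLabels:
      app: activator
  policyTypes:
    - Ingress
  # no ingress rules are listed, so all inbound traffic to matching pods is denied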

knative-activator-label-stomping

  • Type: LabelStomping
  • Danger Level: medium
  • Component: activator

Removing the app=activator label from pods disconnects them from the activator Service endpoints. Traffic can no longer reach the activator. The Deployment controller should recreate pods with correct labels or the existing pods should be re-labeled by the operator.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-activator-label-stomping
spec:
  tier: 3
  target:
    operator: knative-serving
    component: activator
    resource: Deployment/activator
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: activator
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: LabelStomping
    parameters:
      apiVersion: apps/v1
      kind: Deployment
      name: activator
      labelKey: app
      action: overwrite
    ttl: "120s"
  hypothesis:
    description: >-
      Removing the app=activator label from pods disconnects them from the
      activator Service endpoints. Traffic can no longer reach the activator.
      The Deployment controller should recreate pods with correct labels or
      the existing pods should be re-labeled by the operator.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving
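
The disconnect happens because the activator Service selects its endpoints by the app=activator label; pods missing the label drop out of the Endpoints object and stop receiving traffic. A minimal sketch of the relevant selector, assuming the default upstream activator-service name and an illustrative port mapping:

apiVersion: v1
kind: Service
metadata:
  name: activator-service            # default name in upstream Knative Serving
  namespace: knative-serving
spec:
  selector:
    app: activator                   # pods without this label drop out of the Endpoints
  ports:
    - name: http
      port: 80
      targetPort: 8012               # illustrative; upstream default HTTP port of the activator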

knative-activator-quota-exhaustion

  • Type: QuotaExhaustion
  • Danger Level: medium
  • Component: activator

A restrictive ResourceQuota in the knative-serving namespace prevents the activator from restarting after a failure. If the pods are evicted or crash, new pods cannot be scheduled until the quota is relaxed.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-activator-quota-exhaustion
spec:
  tier: 3
  target:
    operator: knative-serving
    component: activator
    resource: Deployment/activator
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: activator
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: QuotaExhaustion
    parameters:
      quotaName: chaos-quota-activator
      pods: "0"
      cpu: "0"
      memory: "0"
    ttl: "60s"
  hypothesis:
    description: >-
      A restrictive ResourceQuota in the knative-serving namespace prevents
      the activator from restarting after a failure. If the pods are evicted
      or crash, new pods cannot be scheduled until the quota is relaxed.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving
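
The QuotaExhaustion injection creates a ResourceQuota with zero hard limits in the target namespace, so any replacement activator pod is rejected at admission until the quota is removed. Based on the parameters above, the created object looks roughly like this:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: chaos-quota-activator
  namespace: knative-serving
spec:
  hard:
    pods: "0"                        # no replacement pod can be admitted
    cpu: "0"
    memory: "0"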

knative-activator-rbac-revoke

  • Type: RBACRevoke
  • Danger Level: high
  • Component: activator

Revoking the activator's ClusterRoleBinding removes its ability to read Services, Endpoints, and other resources. The activator cannot proxy traffic correctly without these permissions. After RBAC is restored, the activator should resume normal operation without a restart.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-activator-rbac-revoke
spec:
  tier: 4
  target:
    operator: knative-serving
    component: activator
    resource: Deployment/activator
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: activator
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: knative-serving-activator-cluster
      bindingType: ClusterRoleBinding
    ttl: "120s"
  hypothesis:
    description: >-
      Revoking the activator's ClusterRoleBinding removes its ability to
      read Services, Endpoints, and other resources. The activator can't
      proxy traffic correctly without these permissions. After RBAC is
      restored, the activator should resume normal operation without restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true

autoscaler

knative-autoscaler-pod-kill

  • Type: PodKill
  • Danger Level: medium
  • Component: autoscaler

Killing a single autoscaler pod temporarily pauses scale decisions. The remaining replica should take over via leader election. Workloads continue running but may not scale correctly during the brief outage.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-autoscaler-pod-kill
spec:
  tier: 3
  target:
    operator: knative-serving
    component: autoscaler
    resource: Deployment/autoscaler
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: autoscaler
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=autoscaler
    count: 1
    ttl: "120s"
  hypothesis:
    description: >-
      Killing a single autoscaler pod temporarily pauses scale decisions.
      The remaining replica should take over via leader election. Workloads
      continue running but may not scale correctly during the brief outage.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
    allowedNamespaces:
      - knative-serving
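
The takeover relies on Kubernetes leader election: Knative control-plane components hold Lease objects in the knative-serving namespace, and when the leading autoscaler pod dies the surviving replica acquires the lease. A pared-down sketch of such a Lease; the name and timing values are illustrative, not exact:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: autoscaler-bucket-00-of-01             # illustrative; real lease names vary by install
  namespace: knative-serving
spec:
  holderIdentity: autoscaler-5d87bc6d8f-xk2lp  # pod currently acting as leader (hypothetical)
  leaseDurationSeconds: 15                     # illustrative timing
  renewTime: "2024-01-01T00:00:00.000000Z"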

knative-autoscaler-all-pods-kill

  • Type: PodKill
  • Danger Level: high
  • Component: autoscaler

Killing both autoscaler pods simultaneously stops all scale-to-zero and scale-up decisions. Running inference pods continue serving but no new scaling events are processed. After recovery, the autoscaler reads current metrics and resumes scaling.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-autoscaler-all-pods-kill
spec:
  tier: 4
  target:
    operator: knative-serving
    component: autoscaler
    resource: Deployment/autoscaler
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: autoscaler
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=autoscaler
    count: 2
    ttl: "120s"
  hypothesis:
    description: >-
      Killing both autoscaler pods simultaneously stops all scale-to-zero
      and scale-up decisions. Running inference pods continue serving but
      no new scaling events are processed. After recovery, the autoscaler
      reads current metrics and resumes scaling.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-autoscaler-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: autoscaler

Isolating the autoscaler from the network blocks metric collection from activator pods and prevents scale decisions from being communicated. Running pods continue but autoscaling stalls. This is particularly interesting because the activator's health check depends on a WebSocket to the autoscaler.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-autoscaler-network-partition
spec:
  tier: 3
  target:
    operator: knative-serving
    component: autoscaler
    resource: Deployment/autoscaler
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: autoscaler
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      direction: ingress
      labelSelector: app=autoscaler
    ttl: "120s"
  hypothesis:
    description: >-
      Isolating the autoscaler from the network blocks metric collection
      from activator pods and prevents scale decisions from being communicated.
      Running pods continue but autoscaling stalls. This is particularly
      interesting because the activator's health check depends on a WebSocket
      to the autoscaler.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-autoscaler-label-stomping

  • Type: LabelStomping
  • Danger Level: medium
  • Component: autoscaler

Removing the app=autoscaler label disconnects pods from the autoscaler Service. The Deployment controller should recreate pods with the correct labels.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-autoscaler-label-stomping
spec:
  tier: 3
  target:
    operator: knative-serving
    component: autoscaler
    resource: Deployment/autoscaler
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: autoscaler
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: LabelStomping
    parameters:
      apiVersion: apps/v1
      kind: Deployment
      name: autoscaler
      labelKey: app
      action: overwrite
    ttl: "120s"
  hypothesis:
    description: >-
      Removing the app=autoscaler label disconnects pods from the autoscaler
      Service. The Deployment controller should recreate pods with the correct
      labels.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-autoscaler-quota-exhaustion

  • Type: QuotaExhaustion
  • Danger Level: medium
  • Component: autoscaler

A restrictive ResourceQuota prevents the autoscaler from restarting.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-autoscaler-quota-exhaustion
spec:
  tier: 3
  target:
    operator: knative-serving
    component: autoscaler
    resource: Deployment/autoscaler
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: autoscaler
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: QuotaExhaustion
    parameters:
      quotaName: chaos-quota-autoscaler
      pods: "0"
      cpu: "0"
      memory: "0"
    ttl: "60s"
  hypothesis:
    description: >-
      A restrictive ResourceQuota prevents the autoscaler from restarting.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

autoscaler-hpa

knative-autoscaler-hpa-pod-kill

  • Type: PodKill
  • Danger Level: medium
  • Component: autoscaler-hpa

Killing an autoscaler-hpa pod temporarily pauses HPA-based scaling. The other replica takes over via leader election.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-autoscaler-hpa-pod-kill
spec:
  tier: 3
  target:
    operator: knative-serving
    component: autoscaler-hpa
    resource: Deployment/autoscaler-hpa
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: autoscaler-hpa
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=autoscaler-hpa
    count: 1
    ttl: "120s"
  hypothesis:
    description: >-
      Killing an autoscaler-hpa pod temporarily pauses HPA-based scaling.
      The other replica takes over via leader election.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-autoscaler-hpa-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: autoscaler-hpa

Isolating the HPA autoscaler blocks its ability to receive metrics and update HPA resources. Existing HPA targets maintain their current replica count but cannot adjust.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-autoscaler-hpa-network-partition
spec:
  tier: 3
  target:
    operator: knative-serving
    component: autoscaler-hpa
    resource: Deployment/autoscaler-hpa
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: autoscaler-hpa
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      direction: ingress
      labelSelector: app=autoscaler-hpa
    ttl: "120s"
  hypothesis:
    description: >-
      Isolating the HPA autoscaler blocks its ability to receive metrics
      and update HPA resources. Existing HPA targets maintain their current
      replica count but cannot adjust.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

controller

knative-controller-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: controller

Killing one Knative Serving controller pod should not affect existing inference services since the other replica takes over via leader election. New Knative Service creation may briefly stall during the leader transition.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-controller-pod-kill
spec:
  tier: 1
  target:
    operator: knative-serving
    component: controller
    resource: Deployment/controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: controller
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=controller
    count: 1
    ttl: "120s"
  hypothesis:
    description: >-
      Killing one Knative Serving controller pod should not affect
      existing inference services since the other replica takes over
      leader election. New Knative Service creation may briefly stall
      during leader transition.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - knative-serving

knative-controller-all-pods-kill

  • Type: PodKill
  • Danger Level: high
  • Component: controller

Killing both controller pods stops all Knative Service reconciliation. Existing services continue running but new deployments, updates, and route changes are not processed. After recovery, leader election completes and reconciliation resumes.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-controller-all-pods-kill
spec:
  tier: 4
  target:
    operator: knative-serving
    component: controller
    resource: Deployment/controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: controller
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=controller
    count: 2
    ttl: "120s"
  hypothesis:
    description: >-
      Killing both controller pods stops all Knative Service reconciliation.
      Existing services continue running but new deployments, updates, and
      route changes are not processed. After recovery, leader election
      completes and reconciliation resumes.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving
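
A useful signal during this experiment is a Knative Service whose spec was changed just before or during the outage: the API server accepts the update (the webhook is still running), but no controller observes it, so status lags the spec. An abridged sketch of what that looks like on a hypothetical workload:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-inference             # hypothetical Knative Service
  namespace: default
  generation: 3                       # the spec has been updated...
status:
  observedGeneration: 2               # ...but no controller has observed the new generation
  conditions:
    - type: Ready
      status: "True"                  # stale; still describes the previously reconciled revision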

knative-controller-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: controller

Isolating the controller from the network prevents it from reading or updating Knative resources. Reconciliation stalls. Existing services continue running. After the partition lifts, the controller should resume reconciliation.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-controller-network-partition
spec:
  tier: 3
  target:
    operator: knative-serving
    component: controller
    resource: Deployment/controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: controller
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      direction: ingress
      labelSelector: app=controller
    ttl: "120s"
  hypothesis:
    description: >-
      Isolating the controller from the network prevents it from reading
      or updating Knative resources. Reconciliation stalls. Existing services
      continue running. After the partition lifts, the controller should
      resume reconciliation.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-controller-label-stomping

  • Type: LabelStomping
  • Danger Level: medium
  • Component: controller

Removing the app=controller label from pods detaches them from the Deployment's selector. The Deployment controller should recreate pods with correct labels.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-controller-label-stomping
spec:
  tier: 3
  target:
    operator: knative-serving
    component: controller
    resource: Deployment/controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: controller
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: LabelStomping
    parameters:
      apiVersion: apps/v1
      kind: Deployment
      name: controller
      labelKey: app
      action: overwrite
    ttl: "120s"
  hypothesis:
    description: >-
      Removing the app=controller label from pods detaches them from the
      Deployment's selector. The Deployment controller should recreate pods
      with correct labels.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-controller-quota-exhaustion

  • Type: QuotaExhaustion
  • Danger Level: medium
  • Component: controller

A restrictive ResourceQuota prevents the controller from restarting after failure.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-controller-quota-exhaustion
spec:
  tier: 3
  target:
    operator: knative-serving
    component: controller
    resource: Deployment/controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: controller
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: QuotaExhaustion
    parameters:
      quotaName: chaos-quota-controller
      pods: "0"
      cpu: "0"
      memory: "0"
    ttl: "60s"
  hypothesis:
    description: >-
      A restrictive ResourceQuota prevents the controller from restarting
      after failure.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-controller-rbac-revoke

  • Type: RBACRevoke
  • Danger Level: high
  • Component: controller

Revoking the controller's admin ClusterRoleBinding removes its ability to manage Knative resources (Services, Routes, Configurations, Revisions). All reconciliation fails with authorization errors. After RBAC is restored, reconciliation should resume.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-controller-rbac-revoke
spec:
  tier: 4
  target:
    operator: knative-serving
    component: controller
    resource: Deployment/controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: controller
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: knative-serving-controller-admin
      bindingType: ClusterRoleBinding
    ttl: "120s"
  hypothesis:
    description: >-
      Revoking the controller's admin ClusterRoleBinding removes its ability
      to manage Knative resources (Services, Routes, Configurations, Revisions).
      All reconciliation fails with authorization errors. After RBAC is
      restored, reconciliation should resume.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true

webhook

knative-webhook-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: webhook

Killing one Knative webhook pod should not block Knative Service operations since the other replica handles validation/mutation. Existing running services are unaffected.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-webhook-pod-kill
spec:
  tier: 1
  target:
    operator: knative-serving
    component: webhook
    resource: Deployment/webhook
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: webhook
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=webhook
    count: 1
    ttl: "120s"
  hypothesis:
    description: >-
      Killing one Knative webhook pod should not block Knative Service
      operations since the other replica handles validation/mutation.
      Existing running services are unaffected.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - knative-serving

knative-webhook-all-pods-kill

  • Type: PodKill
  • Danger Level: high
  • Component: webhook

Killing both webhook pods blocks all Knative Service creation and modification. The API server cannot validate or mutate Knative resources. Existing services are unaffected but cannot be updated.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-webhook-all-pods-kill
spec:
  tier: 5
  target:
    operator: knative-serving
    component: webhook
    resource: Deployment/webhook
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: webhook
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=webhook
    count: 2
    ttl: "120s"
  hypothesis:
    description: >-
      Killing both webhook pods blocks all Knative Service creation and
      modification. The API server cannot validate or mutate Knative resources.
      Existing services are unaffected but cannot be updated.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-webhook-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: webhook

Isolating webhook pods makes the webhook Service unreachable. The API server gets timeout errors when validating Knative resources. With failurePolicy=Fail (default), all Knative operations are blocked.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-webhook-network-partition
spec:
  tier: 3
  target:
    operator: knative-serving
    component: webhook
    resource: Deployment/webhook
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: webhook
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      direction: ingress
      labelSelector: app=webhook
    ttl: "120s"
  hypothesis:
    description: >-
      Isolating webhook pods makes the webhook Service unreachable. The API
      server gets timeout errors when validating Knative resources. With
      failurePolicy=Fail (default), all Knative operations are blocked.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-webhook-cert-corrupt

  • Type: ConfigDrift
  • Danger Level: high
  • Component: webhook

Corrupting the Knative webhook TLS certificate should cause webhook validation to fail with TLS errors. This blocks creation and modification of all Knative Service resources. Existing services continue running but cannot be updated. The webhook or cert-manager should detect and regenerate the certificate.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-webhook-cert-corrupt
spec:
  tier: 3
  target:
    operator: knative-serving
    component: webhook
    resource: Secret/webhook-certs
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: webhook
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: webhook-certs
      key: server-cert.pem
      value: "Y2hhb3MtY29ycnVwdGVk"
      resourceType: Secret
    ttl: "60s"
  hypothesis:
    description: >-
      Corrupting the Knative webhook TLS certificate should cause webhook
      validation to fail with TLS errors. This blocks creation and
      modification of all Knative Service resources. Existing services
      continue running but cannot be updated. The webhook or cert-manager
      should detect and regenerate the certificate.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving
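
The injected value Y2hhb3MtY29ycnVwdGVk is simply base64 for "chaos-corrupted", so the server-cert.pem key no longer holds a parseable certificate and the API server's TLS handshake with the webhook fails. A minimal sketch of the Secret after the patch; the untouched keys are assumed to keep their original contents:

apiVersion: v1
kind: Secret
metadata:
  name: webhook-certs
  namespace: knative-serving
type: Opaque
data:
  server-cert.pem: Y2hhb3MtY29ycnVwdGVk   # base64 of "chaos-corrupted", not a valid certificate
  # the remaining keys (serving key, CA certificate) are assumed to keep their original values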

knative-webhook-label-stomping

  • Type: LabelStomping
  • Danger Level: medium
  • Component: webhook

Removing the app=webhook label disconnects pods from webhook Service endpoints, making the webhook unreachable for API server calls.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-webhook-label-stomping
spec:
  tier: 3
  target:
    operator: knative-serving
    component: webhook
    resource: Deployment/webhook
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: webhook
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: LabelStomping
    parameters:
      apiVersion: apps/v1
      kind: Deployment
      name: webhook
      labelKey: app
      action: overwrite
    ttl: "120s"
  hypothesis:
    description: >-
      Removing the app=webhook label disconnects pods from webhook Service
      endpoints, making the webhook unreachable for API server calls.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-webhook-quota-exhaustion

  • Type: QuotaExhaustion
  • Danger Level: medium
  • Component: webhook

A restrictive ResourceQuota prevents the webhook from restarting.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-webhook-quota-exhaustion
spec:
  tier: 3
  target:
    operator: knative-serving
    component: webhook
    resource: Deployment/webhook
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: webhook
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: QuotaExhaustion
    parameters:
      quotaName: chaos-quota-webhook
      pods: "0"
      cpu: "0"
      memory: "0"
    ttl: "60s"
  hypothesis:
    description: >-
      A restrictive ResourceQuota prevents the webhook from restarting.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving

knative-webhook-webhook-disrupt

  • Type: WebhookDisrupt
  • Danger Level: high
  • Component: webhook

Changing the validating webhook's failurePolicy from Fail to Ignore bypasses all Knative resource validation. Invalid resources can be created. After restoration, validation resumes.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-webhook-webhook-disrupt
spec:
  tier: 4
  target:
    operator: knative-serving
    component: webhook
    resource: Deployment/webhook
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: webhook
        namespace: knative-serving
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: validation.webhook.serving.knative.dev
      action: setFailurePolicy
      value: Ignore
    ttl: "60s"
  hypothesis:
    description: >-
      Changing the validating webhook's failurePolicy from Fail to Ignore
      bypasses all Knative resource validation. Invalid resources can be
      created. After restoration, validation resumes.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
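
The disruption patches failurePolicy on the named ValidatingWebhookConfiguration, so admission requests that cannot be validated are silently allowed rather than rejected. A trimmed sketch of the object after the patch, with client config, rules, and any sibling webhooks omitted:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validation.webhook.serving.knative.dev
webhooks:
  - name: validation.webhook.serving.knative.dev
    failurePolicy: Ignore              # was Fail; failed or unreachable validation is now ignored
    # clientConfig, rules, sideEffects, and admissionReviewVersions omitted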

kourier-gateway

knative-kourier-gateway-pod-kill

  • Type: PodKill
  • Danger Level: medium
  • Component: kourier-gateway

The Kourier gateway is the Envoy-based ingress proxy for all Knative Serving traffic. Killing one gateway pod should shift traffic to the other replica. Brief request failures may occur during the transition. After pod restart, traffic routing resumes normally.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-kourier-gateway-pod-kill
spec:
  tier: 2
  target:
    operator: knative-serving
    component: kourier-gateway
    resource: Deployment/3scale-kourier-gateway
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: 3scale-kourier-gateway
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=3scale-kourier-gateway
    count: 1
    ttl: "120s"
  hypothesis:
    description: >-
      The Kourier gateway is the Envoy-based ingress proxy for all
      Knative Serving traffic. Killing one gateway pod should shift
      traffic to the other replica. Brief request failures may occur
      during the transition. After pod restart, traffic routing
      resumes normally.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
    allowedNamespaces:
      - knative-serving-ingress
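
External requests reach the gateway pods through a single Kourier Service in the knative-serving-ingress namespace, which load-balances across both replicas; that is why one surviving pod keeps traffic flowing. A reduced sketch of that Service, assuming default upstream naming and an abridged port list:

apiVersion: v1
kind: Service
metadata:
  name: kourier                        # default name for the Kourier ingress Service
  namespace: knative-serving-ingress
spec:
  type: LoadBalancer                   # may be ClusterIP when fronted by the OpenShift router
  selector:
    app: 3scale-kourier-gateway        # both gateway replicas back this Service
  ports:
    - name: http2
      port: 80
      targetPort: 8080                 # illustrative Envoy listener port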

knative-kourier-all-pods-kill

  • Type: PodKill
  • Danger Level: high
  • Component: kourier-gateway

Killing both Kourier gateway pods simultaneously causes a complete inference traffic blackout. No external requests can reach any Knative Service. This is the most impactful single fault for inference availability. The Deployment controller should recreate both pods and Envoy should reload its config from the control plane. During the outage, all inference requests fail.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-kourier-all-pods-kill
spec:
  tier: 5
  target:
    operator: knative-serving
    component: kourier-gateway
    resource: Deployment/3scale-kourier-gateway
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: 3scale-kourier-gateway
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "60s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=3scale-kourier-gateway
    count: 2
    ttl: "120s"
  hypothesis:
    description: >-
      Killing both Kourier gateway pods simultaneously causes a
      complete inference traffic blackout. No external requests can
      reach any Knative Service. This is the most impactful single
      fault for inference availability. The Deployment controller
      should recreate both pods and Envoy should reload its config
      from the control plane. During the outage, all inference
      requests fail.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving-ingress

knative-kourier-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: kourier-gateway

Network-isolating the Kourier gateway pods blocks all inference traffic at the ingress layer. Unlike pod kill, the pods remain running but cannot accept connections. This simulates a network failure between the OpenShift router and the Knative ingress. After the partition lifts, Envoy should resume serving traffic without needing a restart.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-kourier-network-partition
spec:
  tier: 3
  target:
    operator: knative-serving
    component: kourier-gateway
    resource: Deployment/3scale-kourier-gateway
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: 3scale-kourier-gateway
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app=3scale-kourier-gateway
      direction: ingress
    ttl: "60s"
  hypothesis:
    description: >-
      Network-isolating the Kourier gateway pods blocks all inference
      traffic at the ingress layer. Unlike pod kill, the pods remain
      running but cannot accept connections. This simulates a network
      failure between the OpenShift router and the Knative ingress.
      After the partition lifts, Envoy should resume serving traffic
      without needing a restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving-ingress

knative-kourier-gateway-label-stomping

  • Type: LabelStomping
  • Danger Level: medium
  • Component: kourier-gateway

Removing the app=3scale-kourier-gateway label disconnects gateway pods from the gateway Service. External traffic cannot reach any Knative Service.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-kourier-gateway-label-stomping
spec:
  tier: 3
  target:
    operator: knative-serving
    component: kourier-gateway
    resource: Deployment/3scale-kourier-gateway
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: 3scale-kourier-gateway
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "30s"
  injection:
    type: LabelStomping
    parameters:
      apiVersion: apps/v1
      kind: Deployment
      name: 3scale-kourier-gateway
      labelKey: app
      action: overwrite
    ttl: "120s"
  hypothesis:
    description: >-
      Removing the app=3scale-kourier-gateway label disconnects gateway pods
      from the gateway Service. External traffic cannot reach any Knative
      Service.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving-ingress

knative-kourier-gateway-quota-exhaustion

  • Type: QuotaExhaustion
  • Danger Level: medium
  • Component: kourier-gateway

A restrictive ResourceQuota prevents the Kourier gateway from restarting.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-kourier-gateway-quota-exhaustion
spec:
  tier: 3
  target:
    operator: knative-serving
    component: kourier-gateway
    resource: Deployment/3scale-kourier-gateway
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: 3scale-kourier-gateway
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "30s"
  injection:
    type: QuotaExhaustion
    parameters:
      quotaName: chaos-quota-kourier-gateway
      pods: "0"
      cpu: "0"
      memory: "0"
    ttl: "60s"
  hypothesis:
    description: >-
      A restrictive ResourceQuota prevents the kourier gateway from restarting.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving-ingress

net-kourier-controller

knative-net-kourier-controller-pod-kill

  • Type: PodKill
  • Danger Level: medium
  • Component: net-kourier-controller

Killing a net-kourier-controller pod temporarily pauses Envoy route programming. Existing routes continue working but new Knative Service routes are not programmed. The other replica takes over.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-net-kourier-controller-pod-kill
spec:
  tier: 3
  target:
    operator: knative-serving
    component: net-kourier-controller
    resource: Deployment/net-kourier-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: net-kourier-controller
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=net-kourier-controller
    count: 1
    ttl: "120s"
  hypothesis:
    description: >-
      Killing a net-kourier-controller pod temporarily pauses Envoy route
      programming. Existing routes continue working but new Knative Service
      routes are not programmed. The other replica takes over.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
    allowedNamespaces:
      - knative-serving-ingress
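
Route programming here means watching Knative's internal networking.internal.knative.dev Ingress objects and translating them into Envoy configuration for the Kourier gateways; while the controller is paused, new objects of this kind go un-programmed. An abridged, hypothetical example of such an Ingress (host, names, and split values are illustrative):

apiVersion: networking.internal.knative.dev/v1alpha1
kind: Ingress
metadata:
  name: example-inference              # created by the Knative controller for a Service; name hypothetical
  namespace: default
  annotations:
    networking.knative.dev/ingress.class: kourier.ingress.networking.knative.dev
spec:
  rules:
    - hosts:
        - example-inference.default.example.com      # hypothetical external host
      visibility: ExternalIP
      http:
        paths:
          - splits:
              - serviceName: example-inference-00001  # hypothetical Revision service
                serviceNamespace: default
                servicePort: 80
                percent: 100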

knative-net-kourier-controller-all-pods-kill

  • Type: PodKill
  • Danger Level: high
  • Component: net-kourier-controller

Killing both net-kourier-controller pods stops all Envoy route programming. Existing routes continue serving but no new routes are added. After recovery, the controller re-syncs and programs missing routes.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-net-kourier-controller-all-pods-kill
spec:
  tier: 4
  target:
    operator: knative-serving
    component: net-kourier-controller
    resource: Deployment/net-kourier-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: net-kourier-controller
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=net-kourier-controller
    count: 2
    ttl: "120s"
  hypothesis:
    description: >-
      Killing both net-kourier-controller pods stops all Envoy route
      programming. Existing routes continue serving but no new routes are
      added. After recovery, the controller re-syncs and programs missing
      routes.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving-ingress

knative-net-kourier-controller-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: net-kourier-controller

Isolating the net-kourier-controller prevents it from receiving API server events and updating Envoy configuration. Existing routes remain but new Knative Services cannot be exposed.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-net-kourier-controller-network-partition
spec:
  tier: 3
  target:
    operator: knative-serving
    component: net-kourier-controller
    resource: Deployment/net-kourier-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: net-kourier-controller
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      direction: ingress
      labelSelector: app=net-kourier-controller
    ttl: "120s"
  hypothesis:
    description: >-
      Isolating the net-kourier-controller prevents it from receiving API
      server events and updating Envoy configuration. Existing routes remain
      but new Knative Services cannot be exposed.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
    allowedNamespaces:
      - knative-serving-ingress

knative-net-kourier-controller-rbac-revoke

  • Type: RBACRevoke
  • Danger Level: high
  • Component: net-kourier-controller

Revoking the net-kourier ClusterRoleBinding prevents the controller from reading Ingress resources and programming Envoy. New routes cannot be created. After RBAC is restored, the controller should re-sync.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: knative-net-kourier-controller-rbac-revoke
spec:
  tier: 4
  target:
    operator: knative-serving
    component: net-kourier-controller
    resource: Deployment/net-kourier-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: net-kourier-controller
        namespace: knative-serving-ingress
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: net-kourier
      bindingType: ClusterRoleBinding
    ttl: "120s"
  hypothesis:
    description: >-
      Revoking the net-kourier ClusterRoleBinding prevents the controller
      from reading Ingress resources and programming Envoy. New routes cannot
      be created. After RBAC is restored, the controller should re-sync.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true