knative-serving Failure Modes¶
Coverage¶
| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| PodKill | low | activator/pod-kill.yaml | Killing one activator pod should not affect inference traffic since the other replica handles requests. |
| PodKill | high | activator/all-pods-kill.yaml | Killing both activator pods simultaneously causes a complete traffic blackout for scale-from-zero and request buffering. |
| NetworkPartition | medium | activator/network-partition.yaml | Isolating activator pods from the network blocks all inference requests routed through the activator. |
| LabelStomping | medium | activator/label-stomping.yaml | Removing the app=activator label from pods disconnects them from the activator Service endpoints. |
| QuotaExhaustion | medium | activator/quota-exhaustion.yaml | A restrictive ResourceQuota in the knative-serving namespace prevents the activator from restarting after a failure. |
| RBACRevoke | high | activator/rbac-revoke.yaml | Revoking the activator's ClusterRoleBinding removes its ability to read Services, Endpoints, and other resources. |
| PodKill | medium | autoscaler/pod-kill.yaml | Killing a single autoscaler pod temporarily pauses scaling decisions; the remaining replica should take over via leader election. |
| PodKill | high | autoscaler/all-pods-kill.yaml | Killing both autoscaler pods simultaneously stops all scale-to-zero and scale-up decisions. |
| NetworkPartition | medium | autoscaler/network-partition.yaml | Isolating the autoscaler from the network blocks metric collection from activator pods and prevents scaling decisions from being communicated. |
| LabelStomping | medium | autoscaler/label-stomping.yaml | Removing the app=autoscaler label disconnects pods from the autoscaler Service. |
| QuotaExhaustion | medium | autoscaler/quota-exhaustion.yaml | A restrictive ResourceQuota prevents the autoscaler from restarting. |
| PodKill | medium | autoscaler-hpa/pod-kill.yaml | Killing an autoscaler-hpa pod temporarily pauses HPA-based scaling; the other replica takes over via leader election. |
| NetworkPartition | medium | autoscaler-hpa/network-partition.yaml | Isolating the HPA autoscaler blocks its ability to receive metrics and update HPA resources. |
| PodKill | low | controller/pod-kill.yaml | Killing one Knative Serving controller pod should not affect existing inference services since the other replica takes over via leader election. |
| PodKill | high | controller/all-pods-kill.yaml | Killing both controller pods stops all Knative Service reconciliation; existing services continue running. |
| NetworkPartition | medium | controller/network-partition.yaml | Isolating the controller from the network prevents it from reading or updating Knative resources. |
| LabelStomping | medium | controller/label-stomping.yaml | Removing the app=controller label from pods breaks their match with the Deployment's label selector; the Deployment controller should recreate them with correct labels. |
| QuotaExhaustion | medium | controller/quota-exhaustion.yaml | A restrictive ResourceQuota prevents the controller from restarting after failure. |
| RBACRevoke | high | controller/rbac-revoke.yaml | Revoking the controller's admin ClusterRoleBinding removes its ability to manage Knative resources. |
| PodKill | low | webhook/pod-kill.yaml | Killing one Knative webhook pod should not block Knative Service operations since the other replica handles validation and mutation. |
| PodKill | high | webhook/all-pods-kill.yaml | Killing both webhook pods blocks all Knative Service creation and modification. |
| NetworkPartition | medium | webhook/network-partition.yaml | Isolating webhook pods makes the webhook Service unreachable; the API server gets timeout errors when validating Knative resources. |
| ConfigDrift | high | webhook/cert-corrupt.yaml | Corrupting the Knative webhook TLS certificate should cause webhook validation to fail with TLS errors. |
| LabelStomping | medium | webhook/label-stomping.yaml | Removing the app=webhook label disconnects pods from webhook Service endpoints, making the webhook unreachable for API server calls. |
| QuotaExhaustion | medium | webhook/quota-exhaustion.yaml | A restrictive ResourceQuota prevents the webhook from restarting. |
| WebhookDisrupt | high | webhook/webhook-disrupt.yaml | Changing the validating webhook's failurePolicy from Fail to Ignore bypasses all Knative resource validation. |
| PodKill | medium | kourier-gateway/pod-kill.yaml | The Kourier gateway is the Envoy-based ingress proxy for all Knative Serving traffic; killing one gateway pod should shift traffic to the other replica. |
| PodKill | high | kourier-gateway/all-pods-kill.yaml | Killing both Kourier gateway pods simultaneously causes a complete inference traffic blackout. |
| NetworkPartition | medium | kourier-gateway/network-partition.yaml | Network-isolating the Kourier gateway pods blocks all inference traffic at the ingress layer. |
| LabelStomping | medium | kourier-gateway/label-stomping.yaml | Removing the app=3scale-kourier-gateway label disconnects gateway pods from the gateway Service. |
| QuotaExhaustion | medium | kourier-gateway/quota-exhaustion.yaml | A restrictive ResourceQuota prevents the Kourier gateway from restarting. |
| PodKill | medium | net-kourier-controller/pod-kill.yaml | Killing a net-kourier-controller pod temporarily pauses Envoy route programming; the other replica takes over. |
| PodKill | high | net-kourier-controller/all-pods-kill.yaml | Killing both net-kourier-controller pods stops all Envoy route programming; existing routes continue serving. |
| NetworkPartition | medium | net-kourier-controller/network-partition.yaml | Isolating the net-kourier-controller prevents it from receiving API server events and updating Envoy configuration. |
| RBACRevoke | high | net-kourier-controller/rbac-revoke.yaml | Revoking the net-kourier ClusterRoleBinding prevents the controller from reading Ingress resources and programming Envoy. |
Experiment Details¶
activator¶
knative-activator-pod-kill¶
- Type: PodKill
- Danger Level: low
- Component: activator
The activator is the request-buffering proxy that holds traffic during scale-from-zero. Killing one activator pod should not affect inference traffic since the other replica handles requests. The Deployment controller recreates the pod.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-activator-pod-kill
spec:
tier: 1
target:
operator: knative-serving
component: activator
resource: Deployment/activator
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: activator
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app=activator
count: 1
ttl: "120s"
hypothesis:
description: >-
The activator is the request-buffering proxy that holds traffic
during scale-from-zero. Killing one activator pod should not
affect inference traffic since the other replica handles requests.
The Deployment controller recreates the pod.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- knative-serving
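The activator experiments are easiest to evaluate against a disposable inference workload that actually routes through the activator. Below is a minimal sketch of such a workload; the Service name, namespace, and image are placeholders, and min-scale "0" keeps the Revision eligible for scale-to-zero so requests traverse the activator.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: chaos-probe            # placeholder workload name
  namespace: default           # any namespace where Knative Services may run
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"   # allow scale-to-zero so traffic passes through the activator
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest   # placeholder image
          env:
            - name: TARGET
              value: chaos-probe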
knative-activator-all-pods-kill¶
- Type: PodKill
- Danger Level: high
- Component: activator
Killing both activator pods simultaneously causes a complete traffic blackout for scale-from-zero and request buffering. Active inference services with running pods may still serve directly, but any service in scale-to-zero state becomes unreachable. After the Deployment controller recreates both pods, traffic routing should resume.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-activator-all-pods-kill
spec:
tier: 4
target:
operator: knative-serving
component: activator
resource: Deployment/activator
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: activator
namespace: knative-serving
conditionType: Available
timeout: "60s"
injection:
type: PodKill
parameters:
labelSelector: app=activator
count: 2
ttl: "120s"
hypothesis:
description: >-
Killing both activator pods simultaneously causes a complete
traffic blackout for scale-from-zero and request buffering.
Active inference services with running pods may still serve
directly, but any service in scale-to-zero state becomes
unreachable. After the Deployment controller recreates both pods, traffic
routing should resume.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-activator-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: activator
Isolating activator pods from the network blocks all inference requests routed through the activator. This simulates a network failure between the ingress layer and the activator. The autoscaler loses visibility into request counts, potentially causing erratic scaling decisions. After partition is lifted, traffic routing should resume.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-activator-network-partition
spec:
tier: 3
target:
operator: knative-serving
component: activator
resource: Deployment/activator
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: activator
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: NetworkPartition
parameters:
labelSelector: app=activator
direction: ingress
ttl: "60s"
hypothesis:
description: >-
Isolating activator pods from the network blocks all inference
requests routed through the activator. This simulates a network
failure between the ingress layer and the activator. The
autoscaler loses visibility into request counts, potentially
causing erratic scaling decisions. After partition is lifted,
traffic routing should resume.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
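The injection above only declares direction: ingress; how the partition is enforced is up to the chaos operator. One plausible realization, shown purely for illustration, is a deny-all ingress NetworkPolicy scoped to the activator pods (the policy name is hypothetical).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-partition-activator   # hypothetical name
  namespace: knative-serving
spec:
  podSelector:
    matchLabels:
      app: activator
  policyTypes:
    - Ingress        # listing Ingress with no ingress rules denies all inbound traffic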
knative-activator-label-stomping¶
- Type: LabelStomping
- Danger Level: medium
- Component: activator
Removing the app=activator label from pods disconnects them from the activator Service endpoints. Traffic can no longer reach the activator. The Deployment controller should recreate pods with correct labels or the existing pods should be re-labeled by the operator.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-activator-label-stomping
spec:
tier: 3
target:
operator: knative-serving
component: activator
resource: Deployment/activator
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: activator
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: LabelStomping
parameters:
apiVersion: apps/v1
kind: Deployment
name: activator
labelKey: app
action: overwrite
ttl: "120s"
hypothesis:
description: >-
Removing the app=activator label from pods disconnects them from the
activator Service endpoints. Traffic can no longer reach the activator.
The Deployment controller should recreate pods with correct labels or
the existing pods should be re-labeled by the operator.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
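The failure mechanism here is the Service-to-pod label selector: once the app label no longer matches, the Endpoints object empties and nothing routes to the pods. A trimmed sketch of that relationship, assuming the upstream activator-service definition, looks like this (the ports shown are assumptions).
apiVersion: v1
kind: Service
metadata:
  name: activator-service
  namespace: knative-serving
spec:
  selector:
    app: activator       # stops matching once the pod label is overwritten
  ports:
    - name: http
      port: 80
      targetPort: 8012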
knative-activator-quota-exhaustion¶
- Type: QuotaExhaustion
- Danger Level: medium
- Component: activator
A restrictive ResourceQuota in the knative-serving namespace prevents the activator from restarting after a failure. If the pods are evicted or crash, new pods cannot be scheduled until the quota is relaxed.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-activator-quota-exhaustion
spec:
tier: 3
target:
operator: knative-serving
component: activator
resource: Deployment/activator
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: activator
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: QuotaExhaustion
parameters:
quotaName: chaos-quota-activator
pods: "0"
cpu: "0"
memory: "0"
ttl: "60s"
hypothesis:
description: >-
A restrictive ResourceQuota in the knative-serving namespace prevents
the activator from restarting after a failure. If the pods are evicted
or crash, new pods cannot be scheduled until the quota is relaxed.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
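From the parameters above (pods, cpu, and memory all set to "0"), the injected quota is roughly equivalent to the object below; with hard limits of zero, any replacement activator pod is rejected at admission until the quota is removed. Whether the operator maps the cpu and memory parameters to requests or limits is an assumption here.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: chaos-quota-activator
  namespace: knative-serving
spec:
  hard:
    pods: "0"              # no new pods can be created in the namespace
    requests.cpu: "0"      # assumed mapping of the cpu parameter
    requests.memory: "0"   # assumed mapping of the memory parameter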
knative-activator-rbac-revoke¶
- Type: RBACRevoke
- Danger Level: high
- Component: activator
Revoking the activator's ClusterRoleBinding removes its ability to read Services, Endpoints, and other resources. The activator can't proxy traffic correctly without these permissions. After RBAC is restored, the activator should resume normal operation without restart.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-activator-rbac-revoke
spec:
tier: 4
target:
operator: knative-serving
component: activator
resource: Deployment/activator
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: activator
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: RBACRevoke
dangerLevel: high
parameters:
bindingName: knative-serving-activator-cluster
bindingType: ClusterRoleBinding
ttl: "120s"
hypothesis:
description: >-
Revoking the activator's ClusterRoleBinding removes its ability to
read Services, Endpoints, and other resources. The activator can't
proxy traffic correctly without these permissions. After RBAC is
restored, the activator should resume normal operation without restart.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
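For reference, the binding removed and restored by this experiment has the general shape sketched below. The roleRef and subject are assumptions drawn from typical Knative Serving manifests; check the installed release for the exact definition.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: knative-serving-activator-cluster
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: knative-serving-activator-cluster   # assumed role name
subjects:
  - kind: ServiceAccount
    name: activator                          # assumed service account
    namespace: knative-serving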
autoscaler¶
knative-autoscaler-pod-kill¶
- Type: PodKill
- Danger Level: medium
- Component: autoscaler
Killing a single autoscaler pod temporarily pauses scale decisions. The remaining replica should take over via leader election. Workloads continue running but may not scale correctly during the brief outage.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-autoscaler-pod-kill
spec:
tier: 3
target:
operator: knative-serving
component: autoscaler
resource: Deployment/autoscaler
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: autoscaler
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app=autoscaler
count: 1
ttl: "120s"
hypothesis:
description: >-
Killing a single autoscaler pod temporarily pauses scale decisions.
The remaining replica should take over via leader election. Workloads
continue running but may not scale correctly during the brief outage.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-autoscaler-all-pods-kill¶
- Type: PodKill
- Danger Level: high
- Component: autoscaler
Killing both autoscaler pods simultaneously stops all scale-to-zero and scale-up decisions. Running inference pods continue serving but no new scaling events are processed. After recovery, the autoscaler reads current metrics and resumes scaling.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-autoscaler-all-pods-kill
spec:
tier: 4
target:
operator: knative-serving
component: autoscaler
resource: Deployment/autoscaler
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: autoscaler
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app=autoscaler
count: 2
ttl: "120s"
hypothesis:
description: >-
Killing both autoscaler pods simultaneously stops all scale-to-zero
and scale-up decisions. Running inference pods continue serving but
no new scaling events are processed. After recovery, the autoscaler
reads current metrics and resumes scaling.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-autoscaler-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: autoscaler
Isolating the autoscaler from the network blocks metric collection from activator pods and prevents scale decisions from being communicated. Running pods continue but autoscaling stalls. This is particularly interesting because the activator's health check depends on a WebSocket to the autoscaler.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-autoscaler-network-partition
spec:
tier: 3
target:
operator: knative-serving
component: autoscaler
resource: Deployment/autoscaler
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: autoscaler
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: NetworkPartition
parameters:
direction: ingress
labelSelector: app=autoscaler
ttl: "120s"
hypothesis:
description: >-
Isolating the autoscaler from the network blocks metric collection
from activator pods and prevents scale decisions from being communicated.
Running pods continue but autoscaling stalls. This is particularly
interesting because the activator's health check depends on a WebSocket
to the autoscaler.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-autoscaler-label-stomping¶
- Type: LabelStomping
- Danger Level: medium
- Component: autoscaler
Removing the app=autoscaler label disconnects pods from the autoscaler Service. The Deployment controller should recreate pods with the correct labels.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-autoscaler-label-stomping
spec:
tier: 3
target:
operator: knative-serving
component: autoscaler
resource: Deployment/autoscaler
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: autoscaler
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: LabelStomping
parameters:
apiVersion: apps/v1
kind: Deployment
name: autoscaler
labelKey: app
action: overwrite
ttl: "120s"
hypothesis:
description: >-
Removing the app=autoscaler label disconnects pods from the autoscaler
Service. The Deployment controller should recreate pods with the correct
labels.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-autoscaler-quota-exhaustion¶
- Type: QuotaExhaustion
- Danger Level: medium
- Component: autoscaler
A restrictive ResourceQuota prevents the autoscaler from restarting.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-autoscaler-quota-exhaustion
spec:
tier: 3
target:
operator: knative-serving
component: autoscaler
resource: Deployment/autoscaler
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: autoscaler
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: QuotaExhaustion
parameters:
quotaName: chaos-quota-autoscaler
pods: "0"
cpu: "0"
memory: "0"
ttl: "60s"
hypothesis:
description: >-
A restrictive ResourceQuota prevents the autoscaler from restarting.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
autoscaler-hpa¶
knative-autoscaler-hpa-pod-kill¶
- Type: PodKill
- Danger Level: medium
- Component: autoscaler-hpa
Killing an autoscaler-hpa pod temporarily pauses HPA-based scaling. The other replica takes over via leader election.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-autoscaler-hpa-pod-kill
spec:
tier: 3
target:
operator: knative-serving
component: autoscaler-hpa
resource: Deployment/autoscaler-hpa
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: autoscaler-hpa
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app=autoscaler-hpa
count: 1
ttl: "120s"
hypothesis:
description: >-
Killing an autoscaler-hpa pod temporarily pauses HPA-based scaling.
The other replica takes over via leader election.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-autoscaler-hpa-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: autoscaler-hpa
Isolating the HPA autoscaler blocks its ability to receive metrics and update HPA resources. Existing HPA targets maintain their current replica count but cannot adjust.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-autoscaler-hpa-network-partition
spec:
tier: 3
target:
operator: knative-serving
component: autoscaler-hpa
resource: Deployment/autoscaler-hpa
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: autoscaler-hpa
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: NetworkPartition
parameters:
direction: ingress
labelSelector: app=autoscaler-hpa
ttl: "120s"
hypothesis:
description: >-
Isolating the HPA autoscaler blocks its ability to receive metrics
and update HPA resources. Existing HPA targets maintain their current
replica count but cannot adjust.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
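The autoscaler-hpa experiments only affect Revisions that opt into the HPA autoscaler class; workloads using the default KPA class are scaled by the autoscaler Deployment instead. A Revision opts in through annotations like the sketch below (the workload name, image, and target value are illustrative).
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: cpu-scaled-probe        # illustrative workload name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: hpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: cpu
        autoscaling.knative.dev/target: "80"   # illustrative CPU target
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest   # placeholder image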
controller¶
knative-controller-pod-kill¶
- Type: PodKill
- Danger Level: low
- Component: controller
Killing one Knative Serving controller pod should not affect existing inference services since the other replica takes over leader election. New Knative Service creation may briefly stall during leader transition.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-controller-pod-kill
spec:
tier: 1
target:
operator: knative-serving
component: controller
resource: Deployment/controller
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: controller
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app=controller
count: 1
ttl: "120s"
hypothesis:
description: >-
Killing one Knative Serving controller pod should not affect
existing inference services since the other replica takes over
leader election. New Knative Service creation may briefly stall
during leader transition.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- knative-serving
knative-controller-all-pods-kill¶
- Type: PodKill
- Danger Level: high
- Component: controller
Killing both controller pods stops all Knative Service reconciliation. Existing services continue running but new deployments, updates, and route changes are not processed. After recovery, leader election completes and reconciliation resumes.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-controller-all-pods-kill
spec:
tier: 4
target:
operator: knative-serving
component: controller
resource: Deployment/controller
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: controller
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app=controller
count: 2
ttl: "120s"
hypothesis:
description: >-
Killing both controller pods stops all Knative Service reconciliation.
Existing services continue running but new deployments, updates, and
route changes are not processed. After recovery, leader election
completes and reconciliation resumes.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-controller-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: controller
Isolating the controller from the network prevents it from reading or updating Knative resources. Reconciliation stalls. Existing services continue running. After the partition lifts, the controller should resume reconciliation.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-controller-network-partition
spec:
tier: 3
target:
operator: knative-serving
component: controller
resource: Deployment/controller
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: controller
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: NetworkPartition
parameters:
direction: ingress
labelSelector: app=controller
ttl: "120s"
hypothesis:
description: >-
Isolating the controller from the network prevents it from reading
or updating Knative resources. Reconciliation stalls. Existing services
continue running. After the partition lifts, the controller should
resume reconciliation.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-controller-label-stomping¶
- Type: LabelStomping
- Danger Level: medium
- Component: controller
Removing the app=controller label from pods breaks their match with the Deployment's label selector. The Deployment controller should recreate pods with correct labels.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-controller-label-stomping
spec:
tier: 3
target:
operator: knative-serving
component: controller
resource: Deployment/controller
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: controller
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: LabelStomping
parameters:
apiVersion: apps/v1
kind: Deployment
name: controller
labelKey: app
action: overwrite
ttl: "120s"
hypothesis:
description: >-
Removing the app=controller label from pods breaks their match with the
Deployment's label selector. The Deployment controller should recreate pods
with correct labels.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-controller-quota-exhaustion¶
- Type: QuotaExhaustion
- Danger Level: medium
- Component: controller
A restrictive ResourceQuota prevents the controller from restarting after failure.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-controller-quota-exhaustion
spec:
tier: 3
target:
operator: knative-serving
component: controller
resource: Deployment/controller
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: controller
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: QuotaExhaustion
parameters:
quotaName: chaos-quota-controller
pods: "0"
cpu: "0"
memory: "0"
ttl: "60s"
hypothesis:
description: >-
A restrictive ResourceQuota prevents the controller from restarting
after failure.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-controller-rbac-revoke¶
- Type: RBACRevoke
- Danger Level: high
- Component: controller
Revoking the controller's admin ClusterRoleBinding removes its ability to manage Knative resources (Services, Routes, Configurations, Revisions). All reconciliation fails with authorization errors. After RBAC is restored, reconciliation should resume.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-controller-rbac-revoke
spec:
tier: 4
target:
operator: knative-serving
component: controller
resource: Deployment/controller
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: controller
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: RBACRevoke
dangerLevel: high
parameters:
bindingName: knative-serving-controller-admin
bindingType: ClusterRoleBinding
ttl: "120s"
hypothesis:
description: >-
Revoking the controller's admin ClusterRoleBinding removes its ability
to manage Knative resources (Services, Routes, Configurations, Revisions).
All reconciliation fails with authorization errors. After RBAC is
restored, reconciliation should resume.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
webhook¶
knative-webhook-pod-kill¶
- Type: PodKill
- Danger Level: low
- Component: webhook
Killing one Knative webhook pod should not block Knative Service operations since the other replica handles validation/mutation. Existing running services are unaffected.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-webhook-pod-kill
spec:
tier: 1
target:
operator: knative-serving
component: webhook
resource: Deployment/webhook
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: webhook
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app=webhook
count: 1
ttl: "120s"
hypothesis:
description: >-
Killing one Knative webhook pod should not block Knative Service
operations since the other replica handles validation/mutation.
Existing running services are unaffected.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- knative-serving
knative-webhook-all-pods-kill¶
- Type: PodKill
- Danger Level: high
- Component: webhook
Killing both webhook pods blocks all Knative Service creation and modification. The API server cannot validate or mutate Knative resources. Existing services are unaffected but cannot be updated.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-webhook-all-pods-kill
spec:
tier: 5
target:
operator: knative-serving
component: webhook
resource: Deployment/webhook
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: webhook
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app=webhook
count: 2
ttl: "120s"
hypothesis:
description: >-
Killing both webhook pods blocks all Knative Service creation and
modification. The API server cannot validate or mutate Knative resources.
Existing services are unaffected but cannot be updated.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-webhook-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: webhook
Isolating webhook pods makes the webhook Service unreachable. The API server gets timeout errors when validating Knative resources. With failurePolicy=Fail (default), all Knative operations are blocked.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-webhook-network-partition
spec:
tier: 3
target:
operator: knative-serving
component: webhook
resource: Deployment/webhook
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: webhook
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: NetworkPartition
parameters:
direction: ingress
labelSelector: app=webhook
ttl: "120s"
hypothesis:
description: >-
Isolating webhook pods makes the webhook Service unreachable. The API
server gets timeout errors when validating Knative resources. With
failurePolicy=Fail (default), all Knative operations are blocked.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-webhook-cert-corrupt¶
- Type: ConfigDrift
- Danger Level: high
- Component: webhook
Corrupting the Knative webhook TLS certificate should cause webhook validation to fail with TLS errors. This blocks creation and modification of all Knative Service resources. Existing services continue running but cannot be updated. The webhook or cert-manager should detect and regenerate the certificate.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-webhook-cert-corrupt
spec:
tier: 3
target:
operator: knative-serving
component: webhook
resource: Secret/webhook-certs
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: webhook
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: ConfigDrift
dangerLevel: high
parameters:
name: webhook-certs
key: server-cert.pem
value: "Y2hhb3MtY29ycnVwdGVk"
resourceType: Secret
ttl: "60s"
hypothesis:
description: >-
Corrupting the Knative webhook TLS certificate should cause webhook
validation to fail with TLS errors. This blocks creation and
modification of all Knative Service resources. Existing services
continue running but cannot be updated. The webhook or cert-manager
should detect and regenerate the certificate.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
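The injected value Y2hhb3MtY29ycnVwdGVk is simply the base64 encoding of the string "chaos-corrupted", so the experiment swaps the PEM-encoded certificate in the webhook-certs Secret for data no TLS client will accept. A manual equivalent of the drift, trimmed to the one key the experiment touches, would look like this.
# Manual equivalent of the injected drift (other keys such as the private key are left out).
apiVersion: v1
kind: Secret
metadata:
  name: webhook-certs
  namespace: knative-serving
type: Opaque
data:
  server-cert.pem: Y2hhb3MtY29ycnVwdGVk   # base64 for "chaos-corrupted", not a valid certificate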
knative-webhook-label-stomping¶
- Type: LabelStomping
- Danger Level: medium
- Component: webhook
Removing the app=webhook label disconnects pods from webhook Service endpoints, making the webhook unreachable for API server calls.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-webhook-label-stomping
spec:
tier: 3
target:
operator: knative-serving
component: webhook
resource: Deployment/webhook
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: webhook
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: LabelStomping
parameters:
apiVersion: apps/v1
kind: Deployment
name: webhook
labelKey: app
action: overwrite
ttl: "120s"
hypothesis:
description: >-
Removing the app=webhook label disconnects pods from webhook Service
endpoints, making the webhook unreachable for API server calls.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-webhook-quota-exhaustion¶
- Type: QuotaExhaustion
- Danger Level: medium
- Component: webhook
A restrictive ResourceQuota prevents the webhook from restarting.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-webhook-quota-exhaustion
spec:
tier: 3
target:
operator: knative-serving
component: webhook
resource: Deployment/webhook
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: webhook
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: QuotaExhaustion
parameters:
quotaName: chaos-quota-webhook
pods: "0"
cpu: "0"
memory: "0"
ttl: "60s"
hypothesis:
description: >-
A restrictive ResourceQuota prevents the webhook from restarting.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving
knative-webhook-webhook-disrupt¶
- Type: WebhookDisrupt
- Danger Level: high
- Component: webhook
Changing the validating webhook's failurePolicy from Fail to Ignore bypasses all Knative resource validation. Invalid resources can be created. After restoration, validation resumes.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-webhook-webhook-disrupt
spec:
tier: 4
target:
operator: knative-serving
component: webhook
resource: Deployment/webhook
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: webhook
namespace: knative-serving
conditionType: Available
timeout: "30s"
injection:
type: WebhookDisrupt
dangerLevel: high
parameters:
webhookName: validation.webhook.serving.knative.dev
action: setFailurePolicy
value: Ignore
ttl: "60s"
hypothesis:
description: >-
Changing the validating webhook's failurePolicy from Fail to Ignore
bypasses all Knative resource validation. Invalid resources can be
created. After restoration, validation resumes.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
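The experiment patches the ValidatingWebhookConfiguration named in webhookName. With failurePolicy set to Ignore, the API server admits Knative resources even when the webhook is slow, unreachable, or would have rejected them. A trimmed view of the patched object (rules and clientConfig omitted) looks roughly like this.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validation.webhook.serving.knative.dev
webhooks:
  - name: validation.webhook.serving.knative.dev
    failurePolicy: Ignore   # set by the injection; Fail is the value restored afterwards
    # clientConfig, rules, sideEffects, and admissionReviewVersions unchanged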
kourier-gateway¶
knative-kourier-gateway-pod-kill¶
- Type: PodKill
- Danger Level: medium
- Component: kourier-gateway
The Kourier gateway is the Envoy-based ingress proxy for all Knative Serving traffic. Killing one gateway pod should shift traffic to the other replica. Brief request failures may occur during the transition. After pod restart, traffic routing resumes normally.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-kourier-gateway-pod-kill
spec:
tier: 2
target:
operator: knative-serving
component: kourier-gateway
resource: Deployment/3scale-kourier-gateway
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: 3scale-kourier-gateway
namespace: knative-serving-ingress
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app=3scale-kourier-gateway
count: 1
ttl: "120s"
hypothesis:
description: >-
The Kourier gateway is the Envoy-based ingress proxy for all
Knative Serving traffic. Killing one gateway pod should shift
traffic to the other replica. Brief request failures may occur
during the transition. After pod restart, traffic routing
resumes normally.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
allowedNamespaces:
- knative-serving-ingress
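The kourier-gateway experiments assume Kourier is the active ingress implementation for Knative Serving. That wiring lives in the config-network ConfigMap, shown here trimmed to the relevant key; if a different ingress class is configured, these faults hit the wrong data path.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-network
  namespace: knative-serving
data:
  ingress-class: kourier.ingress.networking.knative.dev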
knative-kourier-all-pods-kill¶
- Type: PodKill
- Danger Level: high
- Component: kourier-gateway
Killing both Kourier gateway pods simultaneously causes a complete inference traffic blackout. No external requests can reach any Knative Service. This is the most impactful single fault for inference availability. The Deployment controller should recreate both pods and Envoy should reload its config from the control plane. During the outage, all inference requests fail.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-kourier-all-pods-kill
spec:
tier: 5
target:
operator: knative-serving
component: kourier-gateway
resource: Deployment/3scale-kourier-gateway
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: 3scale-kourier-gateway
namespace: knative-serving-ingress
conditionType: Available
timeout: "60s"
injection:
type: PodKill
parameters:
labelSelector: app=3scale-kourier-gateway
count: 2
ttl: "120s"
hypothesis:
description: >-
Killing both Kourier gateway pods simultaneously causes a
complete inference traffic blackout. No external requests can
reach any Knative Service. This is the most impactful single
fault for inference availability. The Deployment controller
should recreate both pods and Envoy should reload its config
from the control plane. During the outage, all inference
requests fail.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving-ingress
knative-kourier-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: kourier-gateway
Network-isolating the Kourier gateway pods blocks all inference traffic at the ingress layer. Unlike pod kill, the pods remain running but cannot accept connections. This simulates a network failure between the OpenShift router and the Knative ingress. After the partition lifts, Envoy should resume serving traffic without needing a restart.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-kourier-network-partition
spec:
tier: 3
target:
operator: knative-serving
component: kourier-gateway
resource: Deployment/3scale-kourier-gateway
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: 3scale-kourier-gateway
namespace: knative-serving-ingress
conditionType: Available
timeout: "30s"
injection:
type: NetworkPartition
parameters:
labelSelector: app=3scale-kourier-gateway
direction: ingress
ttl: "60s"
hypothesis:
description: >-
Network-isolating the Kourier gateway pods blocks all inference
traffic at the ingress layer. Unlike pod kill, the pods remain
running but cannot accept connections. This simulates a network
failure between the OpenShift router and the Knative ingress.
After the partition lifts, Envoy should resume serving traffic
without needing a restart.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving-ingress
knative-kourier-gateway-label-stomping¶
- Type: LabelStomping
- Danger Level: medium
- Component: kourier-gateway
Removing the app=3scale-kourier-gateway label disconnects gateway pods from the gateway Service. External traffic cannot reach any Knative Service.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-kourier-gateway-label-stomping
spec:
tier: 3
target:
operator: knative-serving
component: kourier-gateway
resource: Deployment/3scale-kourier-gateway
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: 3scale-kourier-gateway
namespace: knative-serving-ingress
conditionType: Available
timeout: "30s"
injection:
type: LabelStomping
parameters:
apiVersion: apps/v1
kind: Deployment
name: 3scale-kourier-gateway
labelKey: app
action: overwrite
ttl: "120s"
hypothesis:
description: >-
Removing the app=3scale-kourier-gateway label disconnects gateway pods
from the gateway Service. External traffic cannot reach any Knative
Service.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving-ingress
knative-kourier-gateway-quota-exhaustion¶
- Type: QuotaExhaustion
- Danger Level: medium
- Component: kourier-gateway
A restrictive ResourceQuota prevents the Kourier gateway from restarting.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-kourier-gateway-quota-exhaustion
spec:
tier: 3
target:
operator: knative-serving
component: kourier-gateway
resource: Deployment/3scale-kourier-gateway
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: 3scale-kourier-gateway
namespace: knative-serving-ingress
conditionType: Available
timeout: "30s"
injection:
type: QuotaExhaustion
parameters:
quotaName: chaos-quota-kourier-gateway
pods: "0"
cpu: "0"
memory: "0"
ttl: "60s"
hypothesis:
description: >-
A restrictive ResourceQuota prevents the Kourier gateway from restarting.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving-ingress
net-kourier-controller¶
knative-net-kourier-controller-pod-kill¶
- Type: PodKill
- Danger Level: medium
- Component: net-kourier-controller
Killing a net-kourier-controller pod temporarily pauses Envoy route programming. Existing routes continue working but new Knative Service routes are not programmed. The other replica takes over.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-net-kourier-controller-pod-kill
spec:
tier: 3
target:
operator: knative-serving
component: net-kourier-controller
resource: Deployment/net-kourier-controller
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: net-kourier-controller
namespace: knative-serving-ingress
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app=net-kourier-controller
count: 1
ttl: "120s"
hypothesis:
description: >-
Killing a net-kourier-controller pod temporarily pauses Envoy route
programming. Existing routes continue working but new Knative Service
routes are not programmed. The other replica takes over.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
allowedNamespaces:
- knative-serving-ingress
knative-net-kourier-controller-all-pods-kill¶
- Type: PodKill
- Danger Level: high
- Component: net-kourier-controller
Killing both net-kourier-controller pods stops all Envoy route programming. Existing routes continue serving but no new routes are added. After recovery, the controller re-syncs and programs missing routes.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-net-kourier-controller-all-pods-kill
spec:
tier: 4
target:
operator: knative-serving
component: net-kourier-controller
resource: Deployment/net-kourier-controller
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: net-kourier-controller
namespace: knative-serving-ingress
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app=net-kourier-controller
count: 2
ttl: "120s"
hypothesis:
description: >-
Killing both net-kourier-controller pods stops all Envoy route
programming. Existing routes continue serving but no new routes are
added. After recovery, the controller re-syncs and programs missing
routes.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving-ingress
knative-net-kourier-controller-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: net-kourier-controller
Isolating the net-kourier-controller prevents it from receiving API server events and updating Envoy configuration. Existing routes remain but new Knative Services cannot be exposed.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-net-kourier-controller-network-partition
spec:
tier: 3
target:
operator: knative-serving
component: net-kourier-controller
resource: Deployment/net-kourier-controller
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: net-kourier-controller
namespace: knative-serving-ingress
conditionType: Available
timeout: "30s"
injection:
type: NetworkPartition
parameters:
direction: ingress
labelSelector: app=net-kourier-controller
ttl: "120s"
hypothesis:
description: >-
Isolating the net-kourier-controller prevents it from receiving API
server events and updating Envoy configuration. Existing routes remain
but new Knative Services cannot be exposed.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true
allowedNamespaces:
- knative-serving-ingress
knative-net-kourier-controller-rbac-revoke¶
- Type: RBACRevoke
- Danger Level: high
- Component: net-kourier-controller
Revoking the net-kourier ClusterRoleBinding prevents the controller from reading Ingress resources and programming Envoy. New routes cannot be created. After RBAC is restored, the controller should re-sync.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: knative-net-kourier-controller-rbac-revoke
spec:
tier: 4
target:
operator: knative-serving
component: net-kourier-controller
resource: Deployment/net-kourier-controller
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: net-kourier-controller
namespace: knative-serving-ingress
conditionType: Available
timeout: "30s"
injection:
type: RBACRevoke
dangerLevel: high
parameters:
bindingName: net-kourier
bindingType: ClusterRoleBinding
ttl: "120s"
hypothesis:
description: >-
Revoking the net-kourier ClusterRoleBinding prevents the controller
from reading Ingress resources and programming Envoy. New routes cannot
be created. After RBAC is restored, the controller should re-sync.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 2
allowDangerous: true