cert-manager Failure Modes¶

Coverage¶

Injection Type	Danger	Experiment	Description
PodKill	low	controller/controller-pod-kill.yaml	Killing the cert-manager controller pod should cause the Deployment's ReplicaSet to...
NetworkPartition	medium	controller/controller-network-partition.yaml	Isolating the cert-manager controller from the API server should cause certificat...
LabelStomping	high	controller/label-stomping.yaml	When a label used for resource discovery is overwritten on the controller Deploy...
QuotaExhaustion	high	controller/quota-exhaustion.yaml	When a ResourceQuota with zero limits is applied to cert-manager, no new pods ca...
RBACRevoke	high	controller/rbac-revoke.yaml	Revoking the cert-manager controller's ClusterRoleBinding for issuers should cau...
PodKill	low	webhook/pod-kill.yaml	When a webhook pod is killed, the Deployment controller should restart it and th...
NetworkPartition	medium	webhook/network-partition.yaml	When webhook pods are network-isolated via a deny-all NetworkPolicy, the compone...
LabelStomping	high	webhook/label-stomping.yaml	When a label used for resource discovery is overwritten on the webhook Deploymen...
QuotaExhaustion	high	webhook/quota-exhaustion.yaml	When a ResourceQuota with zero limits is applied to cert-manager, no new pods ca...
ConfigDrift	high	webhook/webhook-cert-corrupt.yaml	When the webhook TLS certificate is corrupted, the webhook server should...
PodKill	low	cainjector/pod-kill.yaml	When a cainjector pod is killed, the Deployment controller should restart it and...
NetworkPartition	medium	cainjector/network-partition.yaml	When cainjector pods are network-isolated via a deny-all NetworkPolicy, the comp...
LabelStomping	high	cainjector/label-stomping.yaml	When a label used for resource discovery is overwritten on the cainjector Deploy...
QuotaExhaustion	high	cainjector/quota-exhaustion.yaml	When a ResourceQuota with zero limits is applied to cert-manager, no new pods ca...

Experiment Details¶

controller¶

cert-manager-controller-pod-kill¶

Type: PodKill
Danger Level: low
Component: cert-manager-controller

Killing the cert-manager controller pod should cause the Deployment's ReplicaSet to create a replacement pod. Existing certificates remain valid because they are stored as Secrets. Pending certificate requests will stall until the controller recovers.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: cert-manager-controller-pod-kill
spec:
  tier: 1
  target:
    operator: cert-manager
    component: cert-manager-controller
    resource: Deployment/cert-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager
        namespace: cert-manager
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app.kubernetes.io/name=cert-manager
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      Killing the cert-manager controller pod should cause the Deployment's
      ReplicaSet to create a replacement pod. Existing certificates remain
      valid because they are stored as Secrets. Pending certificate requests
      will stall until the controller recovers.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - cert-manager
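
The claim that pending requests stall while existing certificates stay valid can be observed with a throwaway Certificate created just before the injection. The following is a minimal sketch, assuming a self-signed Issuer is acceptable for the test. The resource names (`chaos-probe-issuer`, `chaos-probe`, `chaos-probe-tls`), the `default` namespace, and the DNS name are hypothetical and not part of the experiment.

```yaml
# Hypothetical probe resources; none of these names appear in the experiment.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: chaos-probe-issuer
  namespace: default
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: chaos-probe
  namespace: default
spec:
  secretName: chaos-probe-tls   # Secret the controller writes once issuance completes
  dnsNames:
    - chaos-probe.example.com
  issuerRef:
    name: chaos-probe-issuer
    kind: Issuer
```

If these are applied while the controller pod is down, the Certificate should not report a `Ready` condition of `True` until the replacement pod is running, while Secrets issued earlier stay in place.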

cert-manager-controller-network-partition¶

Type: NetworkPartition
Danger Level: medium
Component: cert-manager-controller

Isolating the cert-manager controller from the API server should cause certificate reconciliation to stall. Existing certificates and Secrets remain valid. After the partition is lifted, the controller should reconnect and process any backlogged certificate requests.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: cert-manager-controller-network-partition
spec:
  tier: 2
  target:
    operator: cert-manager
    component: cert-manager-controller
    resource: Deployment/cert-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager
        namespace: cert-manager
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app.kubernetes.io/name=cert-manager
      direction: ingress
    ttl: "120s"
  hypothesis:
    description: >-
      Isolating the cert-manager controller from the API server should cause
      certificate reconciliation to stall. Existing certificates and Secrets
      remain valid. After the partition is lifted, the controller should
      reconnect and process any backlogged certificate requests.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - cert-manager
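
Traffic from the controller to the API server is outbound from the pod's point of view, so a NetworkPolicy that enforces this hypothesis would need to restrict egress. The sketch below assumes the partition is implemented with a NetworkPolicy. The policy name is hypothetical, and how the framework interprets the `direction` parameter is not shown in this document.

```yaml
# Hypothetical policy; the name is illustrative only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-partition-controller
  namespace: cert-manager
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: cert-manager   # same selector as the experiment
  policyTypes:
    - Egress   # no egress rules are listed, so all outbound traffic is denied
```

A policy like this only takes effect on clusters whose network plugin enforces NetworkPolicy, and whether already-established connections are cut depends on that plugin.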

controller-label-stomping¶

Type: LabelStomping
Danger Level: high
Component: cert-manager-controller

When a label used for resource discovery is overwritten on the controller Deployment, the operator should detect the label drift and restore the correct label value.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: controller-label-stomping
spec:
  tier: 3
  target:
    operator: cert-manager
    component: cert-manager-controller
    resource: Deployment/cert-manager
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager
        namespace: cert-manager
    timeout: "30s"
  injection:
    type: LabelStomping
    dangerLevel: high
    parameters:
      apiVersion: apps/v1
      kind: Deployment
      name: cert-manager
      labelKey: app.kubernetes.io/name
      action: overwrite
    ttl: "300s"
  hypothesis:
    description: >-
      When a label used for resource discovery is overwritten on the
      controller Deployment, the operator should detect the label drift
      and restore the correct label value.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
    allowedNamespaces:
      - cert-manager

controller-quota-exhaustion¶

Type: QuotaExhaustion
Danger Level: high
Component: cert-manager-controller

When a ResourceQuota with zero limits is applied to the cert-manager namespace, no new pods can be created there. The controller should handle quota exhaustion gracefully. The chaos framework removes the quota via TTL-based cleanup.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: controller-quota-exhaustion
spec:
  tier: 5
  target:
    operator: cert-manager
    component: cert-manager-controller
    resource: ResourceQuota/chaos-quota-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager
        namespace: cert-manager
        conditionType: Available
    timeout: "30s"
  injection:
    type: QuotaExhaustion
    dangerLevel: high
    parameters:
      quotaName: chaos-quota-controller
      pods: "0"
      cpu: "0"
      memory: "0"
    ttl: "60s"
  hypothesis:
    description: >-
      When a ResourceQuota with zero limits is applied to cert-manager,
      no new pods can be created. The controller should handle quota
      exhaustion gracefully. The chaos framework removes the quota via
      TTL-based cleanup.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
    allowedNamespaces:
      - cert-manager
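
The injection parameters map naturally onto a core ResourceQuota object. The following is a sketch of what the framework presumably creates, assuming `cpu` and `memory` are applied as hard limits on resource requests. The exact mapping is an assumption, not something this document confirms.

```yaml
# Assumed shape of the injected quota; field mapping is a guess.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: chaos-quota-controller
  namespace: cert-manager
spec:
  hard:
    pods: "0"              # rejects every new pod in the namespace
    requests.cpu: "0"
    requests.memory: "0"
```

ResourceQuota is enforced at admission time, so pods that are already running are not evicted. The `Available` steady-state check would therefore be expected to keep passing unless a pod has to be replaced during the 60-second TTL.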

cert-manager-rbac-revoke¶

Type: RBACRevoke
Danger Level: high
Component: cert-manager-controller

Revoking the cert-manager controller's ClusterRoleBinding for issuers should cause certificate issuance to fail with RBAC errors. The controller pod remains running but cannot reconcile Issuer resources. After rollback, the ClusterRoleBinding is restored and pending issuance resumes.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: cert-manager-rbac-revoke
spec:
  tier: 4
  target:
    operator: cert-manager
    component: cert-manager-controller
    resource: ClusterRoleBinding/cert-manager-controller-issuers
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager
        namespace: cert-manager
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: cert-manager-controller-issuers
      bindingType: ClusterRoleBinding
    ttl: "120s"
  hypothesis:
    description: >-
      Revoking the cert-manager controller's ClusterRoleBinding for issuers
      should cause certificate issuance to fail with RBAC errors. The controller
      pod remains running but cannot reconcile Issuer resources. After rollback,
      the ClusterRoleBinding is restored and pending issuance resumes.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
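
For reference, the binding that must exist again after rollback would look roughly like the sketch below. The `roleRef` and `subjects` values are assumptions based on the naming used by a default Helm installation of cert-manager. Compare them with the binding in the target cluster before relying on them.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cert-manager-controller-issuers
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cert-manager-controller-issuers   # assumed: same name as the binding
subjects:
  - kind: ServiceAccount
    name: cert-manager                    # assumed controller ServiceAccount
    namespace: cert-manager
```

A ClusterRoleBinding is cluster-scoped, which is consistent with this experiment's `blastRadius` listing no `allowedNamespaces`.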

webhook¶

webhook-pod-kill¶

Type: PodKill
Danger Level: low
Component: webhook

When a webhook pod is killed, the Deployment controller should restart it and the component should recover within the recovery timeout.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: webhook-pod-kill
spec:
  tier: 1
  target:
    operator: cert-manager
    component: webhook
    resource: Deployment/cert-manager-webhook
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager-webhook
        namespace: cert-manager
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app.kubernetes.io/name=webhook
    ttl: "30s"
  hypothesis:
    description: >-
      When a webhook pod is killed, the Deployment controller should
      restart it and the component should recover within the recovery timeout.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - cert-manager
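
The `conditionTrue` steady-state check presumably passes when the named condition on the target object has status `"True"`. The fragment below sketches the relevant part of a healthy Deployment's status. How the framework evaluates it is an assumption.

```yaml
# Fragment of Deployment/cert-manager-webhook status in steady state.
status:
  conditions:
    - type: Available
      status: "True"
      reason: MinimumReplicasAvailable
      message: Deployment has minimum availability.
```

With a single replica, this condition typically turns `"False"` between the pod kill and the replacement pod becoming ready, which is the window that `recoveryTimeout` bounds.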

webhook-network-partition¶

Type: NetworkPartition
Danger Level: medium
Component: webhook

When webhook pods are network-isolated via a deny-all NetworkPolicy, the component should detect the partition and recover once connectivity is restored.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: webhook-network-partition
spec:
  tier: 2
  target:
    operator: cert-manager
    component: webhook
    resource: Deployment/cert-manager-webhook
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager-webhook
        namespace: cert-manager
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app.kubernetes.io/name=webhook
    ttl: "60s"
  hypothesis:
    description: >-
      When webhook pods are network-isolated via a deny-all NetworkPolicy,
      the component should detect the partition and recover once connectivity
      is restored.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - cert-manager
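
The deny-all NetworkPolicy mentioned in the hypothesis could take the following form. This is a sketch: the policy name is hypothetical, and the object the framework actually creates may differ.

```yaml
# Hypothetical deny-all policy for the webhook pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-partition-webhook
  namespace: cert-manager
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: webhook   # same selector as the experiment
  policyTypes:   # both types listed with no rules: all traffic is denied
    - Ingress
    - Egress
```

Deleting the policy restores connectivity, which is the point at which the recovery described in the hypothesis should begin.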

webhook-label-stomping¶

Type: LabelStomping
Danger Level: high
Component: webhook

When a label used for resource discovery is overwritten on the webhook Deployment, the operator should detect the label drift and restore the correct label value.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: webhook-label-stomping
spec:
  tier: 3
  target:
    operator: cert-manager
    component: webhook
    resource: Deployment/cert-manager-webhook
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager-webhook
        namespace: cert-manager
    timeout: "30s"
  injection:
    type: LabelStomping
    dangerLevel: high
    parameters:
      apiVersion: apps/v1
      kind: Deployment
      name: cert-manager-webhook
      labelKey: app.kubernetes.io/name
      action: overwrite
    ttl: "300s"
  hypothesis:
    description: >-
      When a label used for resource discovery is overwritten on the
      webhook Deployment, the operator should detect the label drift
      and restore the correct label value.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
    allowedNamespaces:
      - cert-manager

webhook-quota-exhaustion¶

Type: QuotaExhaustion
Danger Level: high
Component: webhook

When a ResourceQuota with zero limits is applied to the cert-manager namespace, no new pods can be created there. The webhook should handle quota exhaustion gracefully. The chaos framework removes the quota via TTL-based cleanup.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: webhook-quota-exhaustion
spec:
  tier: 5
  target:
    operator: cert-manager
    component: webhook
    resource: ResourceQuota/chaos-quota-webhook
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager-webhook
        namespace: cert-manager
        conditionType: Available
    timeout: "30s"
  injection:
    type: QuotaExhaustion
    dangerLevel: high
    parameters:
      quotaName: chaos-quota-webhook
      pods: "0"
      cpu: "0"
      memory: "0"
    ttl: "60s"
  hypothesis:
    description: >-
      When a ResourceQuota with zero limits is applied to cert-manager,
      no new pods can be created. The webhook should handle quota
      exhaustion gracefully. The chaos framework removes the quota via
      TTL-based cleanup.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
    allowedNamespaces:
      - cert-manager

webhook-webhook-cert-corrupt¶

Type: ConfigDrift
Danger Level: high
Component: webhook

When the webhook TLS certificate is corrupted, the webhook server should fail to serve and the controller should detect and regenerate the certificate.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: webhook-webhook-cert-corrupt
spec:
  tier: 2
  target:
    operator: cert-manager
    component: webhook
    resource: Secret/cert-manager-webhook-ca
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager-webhook
        namespace: cert-manager
        conditionType: Available
    timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: cert-manager-webhook-ca
      key: tls.crt
      value: "Y2hhb3MtY29ycnVwdGVk"
      resourceType: Secret
    ttl: "60s"
  hypothesis:
    description: >-
      When the webhook TLS certificate is corrupted, the webhook server
      should fail to serve and the controller should detect and
      regenerate the certificate.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
    allowedNamespaces:
      - cert-manager
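
The injected `value` is the base64 encoding of the ASCII string `chaos-corrupted`, which is not a PEM certificate. The fragment below sketches the Secret while the injection is active, assuming the framework writes the value verbatim into the Secret's `data` field.

```yaml
# Sketch of Secret/cert-manager-webhook-ca during the injection.
apiVersion: v1
kind: Secret
metadata:
  name: cert-manager-webhook-ca
  namespace: cert-manager
data:
  # Decodes to "chaos-corrupted"; any PEM parser should reject it.
  tls.crt: Y2hhb3MtY29ycnVwdGVk
  # Other keys in the Secret are assumed to be left untouched.
```

Recovery can then be judged by `tls.crt` once again decoding to a block that begins with `-----BEGIN CERTIFICATE-----`.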

cainjector¶

cainjector-pod-kill¶

Type: PodKill
Danger Level: low
Component: cainjector

When a cainjector pod is killed, the Deployment controller should restart it and the component should recover within the recovery timeout.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: cainjector-pod-kill
spec:
  tier: 1
  target:
    operator: cert-manager
    component: cainjector
    resource: Deployment/cert-manager-cainjector
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager-cainjector
        namespace: cert-manager
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app.kubernetes.io/name=cainjector
    ttl: "30s"
  hypothesis:
    description: >-
      When a cainjector pod is killed, the Deployment controller should
      restart it and the component should recover within the recovery timeout.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - cert-manager

cainjector-network-partition¶

Type: NetworkPartition
Danger Level: medium
Component: cainjector

When cainjector pods are network-isolated via a deny-all NetworkPolicy, the component should detect the partition and recover once connectivity is restored.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: cainjector-network-partition
spec:
  tier: 2
  target:
    operator: cert-manager
    component: cainjector
    resource: Deployment/cert-manager-cainjector
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager-cainjector
        namespace: cert-manager
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app.kubernetes.io/name=cainjector
    ttl: "60s"
  hypothesis:
    description: >-
      When cainjector pods are network-isolated via a deny-all NetworkPolicy,
      the component should detect the partition and recover once connectivity
      is restored.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - cert-manager

cainjector-label-stomping¶

Type: LabelStomping
Danger Level: high
Component: cainjector

When a label used for resource discovery is overwritten on the cainjector Deployment, the operator should detect the label drift and restore the correct label value.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: cainjector-label-stomping
spec:
  tier: 3
  target:
    operator: cert-manager
    component: cainjector
    resource: Deployment/cert-manager-cainjector
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager-cainjector
        namespace: cert-manager
    timeout: "30s"
  injection:
    type: LabelStomping
    dangerLevel: high
    parameters:
      apiVersion: apps/v1
      kind: Deployment
      name: cert-manager-cainjector
      labelKey: app.kubernetes.io/name
      action: overwrite
    ttl: "300s"
  hypothesis:
    description: >-
      When a label used for resource discovery is overwritten on the
      cainjector Deployment, the operator should detect the label drift
      and restore the correct label value.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
    allowedNamespaces:
      - cert-manager

cainjector-quota-exhaustion¶

Type: QuotaExhaustion
Danger Level: high
Component: cainjector

When a ResourceQuota with zero limits is applied to the cert-manager namespace, no new pods can be created there. The cainjector should handle quota exhaustion gracefully. The chaos framework removes the quota via TTL-based cleanup.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: cainjector-quota-exhaustion
spec:
  tier: 5
  target:
    operator: cert-manager
    component: cainjector
    resource: ResourceQuota/chaos-quota-cainjector
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: cert-manager-cainjector
        namespace: cert-manager
        conditionType: Available
    timeout: "30s"
  injection:
    type: QuotaExhaustion
    dangerLevel: high
    parameters:
      quotaName: chaos-quota-cainjector
      pods: "0"
      cpu: "0"
      memory: "0"
    ttl: "60s"
  hypothesis:
    description: >-
      When a ResourceQuota with zero limits is applied to cert-manager,
      no new pods can be created. The cainjector should handle quota
      exhaustion gracefully. The chaos framework removes the quota via
      TTL-based cleanup.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
    allowedNamespaces:
      - cert-manager