cert-manager Failure Modes¶
Coverage¶
| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| PodKill | low | controller/controller-pod-kill.yaml | Killing the cert-manager controller pod should trigger a Deployment rollout that... |
| NetworkPartition | medium | controller/controller-network-partition.yaml | Isolating the cert-manager controller from the API server should cause certificat... |
| LabelStomping | high | controller/label-stomping.yaml | When a label used for resource discovery is overwritten on the controller Deploy... |
| QuotaExhaustion | high | controller/quota-exhaustion.yaml | When a ResourceQuota with zero limits is applied to cert-manager, no new pods ca... |
| RBACRevoke | high | controller/rbac-revoke.yaml | Revoking the cert-manager controller's ClusterRoleBinding for issuers should cau... |
| PodKill | low | webhook/pod-kill.yaml | When a webhook pod is killed, the Deployment controller should restart it and th... |
| NetworkPartition | medium | webhook/network-partition.yaml | When webhook pods are network-isolated via a deny-all NetworkPolicy, the compone... |
| LabelStomping | high | webhook/label-stomping.yaml | When a label used for resource discovery is overwritten on the webhook Deploymen... |
| QuotaExhaustion | high | webhook/quota-exhaustion.yaml | When a ResourceQuota with zero limits is applied to cert-manager, no new pods ca... |
| ConfigDrift | high | webhook/webhook-cert-corrupt.yaml | When the webhook webhook TLS certificate is corrupted, the webhook server should... |
| PodKill | low | cainjector/pod-kill.yaml | When a cainjector pod is killed, the Deployment controller should restart it and... |
| NetworkPartition | medium | cainjector/network-partition.yaml | When cainjector pods are network-isolated via a deny-all NetworkPolicy, the comp... |
| LabelStomping | high | cainjector/label-stomping.yaml | When a label used for resource discovery is overwritten on the cainjector Deploy... |
| QuotaExhaustion | high | cainjector/quota-exhaustion.yaml | When a ResourceQuota with zero limits is applied to cert-manager, no new pods ca... |
Experiment Details¶
controller¶
cert-manager-controller-pod-kill¶
- Type: PodKill
- Danger Level: low
- Component: cert-manager-controller
Killing the cert-manager controller pod should trigger a Deployment rollout that recreates it. Existing certificates remain valid since they are stored as Secrets. Pending certificate requests will stall until the controller recovers.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: cert-manager-controller-pod-kill
spec:
tier: 1
target:
operator: cert-manager
component: cert-manager-controller
resource: Deployment/cert-manager
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: cert-manager
namespace: cert-manager
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app.kubernetes.io/name=cert-manager
count: 1
ttl: "300s"
hypothesis:
description: >-
Killing the cert-manager controller pod should trigger a Deployment
rollout that recreates it. Existing certificates remain valid since
they are stored as Secrets. Pending certificate requests will stall
until the controller recovers.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- cert-manager
cert-manager-controller-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: cert-manager-controller
Isolating the cert-manager controller from the API server should cause certificate reconciliation to stall. Existing certificates and Secrets remain valid. After the partition is lifted, the controller should reconnect and process any backlogged certificate requests.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: cert-manager-controller-network-partition
spec:
tier: 2
target:
operator: cert-manager
component: cert-manager-controller
resource: Deployment/cert-manager
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: cert-manager
namespace: cert-manager
conditionType: Available
timeout: "30s"
injection:
type: NetworkPartition
parameters:
labelSelector: app.kubernetes.io/name=cert-manager
direction: ingress
ttl: "120s"
hypothesis:
description: >-
Isolating the cert-manager controller from the API server should cause
certificate reconciliation to stall. Existing certificates and Secrets
remain valid. After the partition is lifted, the controller should
reconnect and process any backlogged certificate requests.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- cert-manager
controller-label-stomping¶
- Type: LabelStomping
- Danger Level: high
- Component: cert-manager-controller
When a label used for resource discovery is overwritten on the controller Deployment, the operator should detect the label drift and restore the correct label value.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: controller-label-stomping
spec:
tier: 3
target:
operator: cert-manager
component: cert-manager-controller
resource: Deployment/cert-manager
steadyState:
checks:
- type: resourceExists
apiVersion: apps/v1
kind: Deployment
name: cert-manager
namespace: cert-manager
timeout: "30s"
injection:
type: LabelStomping
dangerLevel: high
parameters:
apiVersion: apps/v1
kind: Deployment
name: cert-manager
labelKey: app.kubernetes.io/name
action: overwrite
ttl: "300s"
hypothesis:
description: >-
When a label used for resource discovery is overwritten on the
controller Deployment, the operator should detect the label drift
and restore the correct label value.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
allowedNamespaces:
- cert-manager
controller-quota-exhaustion¶
- Type: QuotaExhaustion
- Danger Level: high
- Component: cert-manager-controller
When a ResourceQuota with zero limits is applied to cert-manager, no new pods can be created. The controller should handle quota exhaustion gracefully. The chaos framework removes the quota via TTL-based cleanup.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: controller-quota-exhaustion
spec:
tier: 5
target:
operator: cert-manager
component: cert-manager-controller
resource: ResourceQuota/chaos-quota-controller
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: cert-manager
namespace: cert-manager
conditionType: Available
timeout: "30s"
injection:
type: QuotaExhaustion
dangerLevel: high
parameters:
quotaName: chaos-quota-controller
pods: "0"
cpu: "0"
memory: "0"
ttl: "60s"
hypothesis:
description: >-
When a ResourceQuota with zero limits is applied to cert-manager,
no new pods can be created. The controller should handle quota
exhaustion gracefully. The chaos framework removes the quota via
TTL-based cleanup.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
allowedNamespaces:
- cert-manager
cert-manager-rbac-revoke¶
- Type: RBACRevoke
- Danger Level: high
- Component: cert-manager-controller
Revoking the cert-manager controller's ClusterRoleBinding for issuers should cause certificate issuance to fail with RBAC errors. The controller pod remains running but cannot reconcile Issuer resources. After rollback, the ClusterRoleBinding is restored and pending issuance resumes.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: cert-manager-rbac-revoke
spec:
tier: 4
target:
operator: cert-manager
component: cert-manager-controller
resource: ClusterRoleBinding/cert-manager-controller-issuers
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: cert-manager
namespace: cert-manager
conditionType: Available
timeout: "30s"
injection:
type: RBACRevoke
dangerLevel: high
parameters:
bindingName: cert-manager-controller-issuers
bindingType: ClusterRoleBinding
ttl: "120s"
hypothesis:
description: >-
Revoking the cert-manager controller's ClusterRoleBinding for issuers
should cause certificate issuance to fail with RBAC errors. The controller
pod remains running but cannot reconcile Issuer resources. After rollback,
the ClusterRoleBinding is restored and pending issuance resumes.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
webhook¶
webhook-pod-kill¶
- Type: PodKill
- Danger Level: low
- Component: webhook
When a webhook pod is killed, the Deployment controller should restart it and the component should recover within the recovery timeout.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: webhook-pod-kill
spec:
tier: 1
target:
operator: cert-manager
component: webhook
resource: Deployment/cert-manager-webhook
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: cert-manager-webhook
namespace: cert-manager
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app.kubernetes.io/name=webhook
ttl: "30s"
hypothesis:
description: >-
When a webhook pod is killed, the Deployment controller should
restart it and the component should recover within the recovery timeout.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- cert-manager
webhook-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: webhook
When webhook pods are network-isolated via a deny-all NetworkPolicy, the component should detect the partition and recover once connectivity is restored.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: webhook-network-partition
spec:
tier: 2
target:
operator: cert-manager
component: webhook
resource: Deployment/cert-manager-webhook
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: cert-manager-webhook
namespace: cert-manager
conditionType: Available
timeout: "30s"
injection:
type: NetworkPartition
parameters:
labelSelector: app.kubernetes.io/name=webhook
ttl: "60s"
hypothesis:
description: >-
When webhook pods are network-isolated via a deny-all NetworkPolicy,
the component should detect the partition and recover once connectivity
is restored.
recoveryTimeout: 180s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- cert-manager
webhook-label-stomping¶
- Type: LabelStomping
- Danger Level: high
- Component: webhook
When a label used for resource discovery is overwritten on the webhook Deployment, the operator should detect the label drift and restore the correct label value.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: webhook-label-stomping
spec:
tier: 3
target:
operator: cert-manager
component: webhook
resource: Deployment/cert-manager-webhook
steadyState:
checks:
- type: resourceExists
apiVersion: apps/v1
kind: Deployment
name: cert-manager-webhook
namespace: cert-manager
timeout: "30s"
injection:
type: LabelStomping
dangerLevel: high
parameters:
apiVersion: apps/v1
kind: Deployment
name: cert-manager-webhook
labelKey: app.kubernetes.io/name
action: overwrite
ttl: "300s"
hypothesis:
description: >-
When a label used for resource discovery is overwritten on the
webhook Deployment, the operator should detect the label drift
and restore the correct label value.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
allowedNamespaces:
- cert-manager
webhook-quota-exhaustion¶
- Type: QuotaExhaustion
- Danger Level: high
- Component: webhook
When a ResourceQuota with zero limits is applied to cert-manager, no new pods can be created. The webhook should handle quota exhaustion gracefully. The chaos framework removes the quota via TTL-based cleanup.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: webhook-quota-exhaustion
spec:
tier: 5
target:
operator: cert-manager
component: webhook
resource: ResourceQuota/chaos-quota-webhook
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: cert-manager-webhook
namespace: cert-manager
conditionType: Available
timeout: "30s"
injection:
type: QuotaExhaustion
dangerLevel: high
parameters:
quotaName: chaos-quota-webhook
pods: "0"
cpu: "0"
memory: "0"
ttl: "60s"
hypothesis:
description: >-
When a ResourceQuota with zero limits is applied to cert-manager,
no new pods can be created. The webhook should handle quota
exhaustion gracefully. The chaos framework removes the quota via
TTL-based cleanup.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
allowedNamespaces:
- cert-manager
webhook-webhook-cert-corrupt¶
- Type: ConfigDrift
- Danger Level: high
- Component: webhook
When the webhook webhook TLS certificate is corrupted, the webhook server should fail to serve and the controller should detect and regenerate the certificate.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: webhook-webhook-cert-corrupt
spec:
tier: 2
target:
operator: cert-manager
component: webhook
resource: Secret/cert-manager-webhook-ca
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: cert-manager-webhook
namespace: cert-manager
conditionType: Available
timeout: "30s"
injection:
type: ConfigDrift
dangerLevel: high
parameters:
name: cert-manager-webhook-ca
key: tls.crt
value: "Y2hhb3MtY29ycnVwdGVk"
resourceType: Secret
ttl: "60s"
hypothesis:
description: >-
When the webhook webhook TLS certificate is corrupted, the
webhook server should fail to serve and the controller should
detect and regenerate the certificate.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
allowedNamespaces:
- cert-manager
cainjector¶
cainjector-pod-kill¶
- Type: PodKill
- Danger Level: low
- Component: cainjector
When a cainjector pod is killed, the Deployment controller should restart it and the component should recover within the recovery timeout.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: cainjector-pod-kill
spec:
tier: 1
target:
operator: cert-manager
component: cainjector
resource: Deployment/cert-manager-cainjector
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: cert-manager-cainjector
namespace: cert-manager
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: app.kubernetes.io/name=cainjector
ttl: "30s"
hypothesis:
description: >-
When a cainjector pod is killed, the Deployment controller should
restart it and the component should recover within the recovery timeout.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- cert-manager
cainjector-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: cainjector
When cainjector pods are network-isolated via a deny-all NetworkPolicy, the component should detect the partition and recover once connectivity is restored.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: cainjector-network-partition
spec:
tier: 2
target:
operator: cert-manager
component: cainjector
resource: Deployment/cert-manager-cainjector
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: cert-manager-cainjector
namespace: cert-manager
conditionType: Available
timeout: "30s"
injection:
type: NetworkPartition
parameters:
labelSelector: app.kubernetes.io/name=cainjector
ttl: "60s"
hypothesis:
description: >-
When cainjector pods are network-isolated via a deny-all NetworkPolicy,
the component should detect the partition and recover once connectivity
is restored.
recoveryTimeout: 180s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- cert-manager
cainjector-label-stomping¶
- Type: LabelStomping
- Danger Level: high
- Component: cainjector
When a label used for resource discovery is overwritten on the cainjector Deployment, the operator should detect the label drift and restore the correct label value.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: cainjector-label-stomping
spec:
tier: 3
target:
operator: cert-manager
component: cainjector
resource: Deployment/cert-manager-cainjector
steadyState:
checks:
- type: resourceExists
apiVersion: apps/v1
kind: Deployment
name: cert-manager-cainjector
namespace: cert-manager
timeout: "30s"
injection:
type: LabelStomping
dangerLevel: high
parameters:
apiVersion: apps/v1
kind: Deployment
name: cert-manager-cainjector
labelKey: app.kubernetes.io/name
action: overwrite
ttl: "300s"
hypothesis:
description: >-
When a label used for resource discovery is overwritten on the
cainjector Deployment, the operator should detect the label drift
and restore the correct label value.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
allowedNamespaces:
- cert-manager
cainjector-quota-exhaustion¶
- Type: QuotaExhaustion
- Danger Level: high
- Component: cainjector
When a ResourceQuota with zero limits is applied to cert-manager, no new pods can be created. The cainjector should handle quota exhaustion gracefully. The chaos framework removes the quota via TTL-based cleanup.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: cainjector-quota-exhaustion
spec:
tier: 5
target:
operator: cert-manager
component: cainjector
resource: ResourceQuota/chaos-quota-cainjector
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: cert-manager-cainjector
namespace: cert-manager
conditionType: Available
timeout: "30s"
injection:
type: QuotaExhaustion
dangerLevel: high
parameters:
quotaName: chaos-quota-cainjector
pods: "0"
cpu: "0"
memory: "0"
ttl: "60s"
hypothesis:
description: >-
When a ResourceQuota with zero limits is applied to cert-manager,
no new pods can be created. The cainjector should handle quota
exhaustion gracefully. The chaos framework removes the quota via
TTL-based cleanup.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
allowedNamespaces:
- cert-manager