Failure Modes Overview

This page summarizes every failure injection type available in Operator Chaos.

Quick Reference

| Type | Danger | Description |
|------|--------|-------------|
| ClientFault | Low | Injects errors, latency, or throttling into operator API calls via SDK integration. |
| ConfigDrift | Low | Modifies a key in a ConfigMap or Secret to test configuration reconciliation. |
| PodKill | Low | Force-deletes pods matching a label selector with zero grace period. |
| CRDMutation | Medium | Mutates a spec field on a custom resource instance to test reconciliation of CR state. |
| FinalizerBlock | Medium | Adds a stuck finalizer to a resource to test deletion handling and cleanup logic. |
| LabelStomping | Medium | Modifies or removes labels on operator-managed resources to test label-based reconciliation. |
| NetworkPartition | Medium | Creates a deny-all NetworkPolicy isolating pods matching a label selector from all ingress and egress traffic. |
| OwnerRefOrphan | Medium | Removes ownerReferences from operator-managed resources to test re-adoption logic. |
| QuotaExhaustion | Medium | Creates a restrictive ResourceQuota to test operator behavior under resource pressure. |
| CrashLoopInject | High | Patches a Deployment's container command to a nonexistent binary, causing CrashLoopBackOff. |
| DeploymentScaleZero | High | Scales a Deployment to zero replicas to test recovery and replica count reconciliation. |
| ImageCorrupt | High | Patches a Deployment's container image to an invalid registry, causing ImagePullBackOff. |
| LeaderElectionDisrupt | High | Deletes a Lease object to force leader re-election and test election resilience. |
| NamespaceDeletion | High | Deletes an entire namespace to test whether the operator recreates it and its managed resources. |
| PDBBlock | High | Creates a PodDisruptionBudget with maxUnavailable=0 to block all voluntary evictions. |
| RBACRevoke | High | Clears all subjects from a ClusterRoleBinding or RoleBinding to test RBAC resilience. |
| ResourceDeletion | High | Deletes an arbitrary namespaced resource to test whether the operator recreates it. |
| SecretDeletion | High | Deletes a Secret to test whether the operator detects the loss and recreates it. |
| WebhookDisrupt | High | Modifies failure policies on a ValidatingWebhookConfiguration to test webhook resilience. |
| WebhookLatency | High | Deploys a slow admission webhook to add latency to API server requests for specific resources. |
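
To make two of the descriptions above concrete, the following sketch builds the kind of objects that NetworkPartition and PDBBlock are described as creating. It assumes only the standard Kubernetes manifest shapes for `NetworkPolicy` and `PodDisruptionBudget`. The function names, object names, and labels are hypothetical and are not taken from Operator Chaos, whose generated objects may differ in detail.

```python
import json


def deny_all_network_policy(name, namespace, match_labels):
    """Build a NetworkPolicy manifest that isolates the selected pods.

    Declaring both policy types while supplying no ingress or egress
    rules denies all traffic to and from pods matching the selector.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": dict(match_labels)},
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def blocking_pdb(name, namespace, match_labels):
    """Build a PodDisruptionBudget that permits no voluntary evictions."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "maxUnavailable": 0,
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


if __name__ == "__main__":
    # Hypothetical target: pods labelled app=example-operator.
    labels = {"app": "example-operator"}
    print(json.dumps(deny_all_network_policy("chaos-isolate", "demo", labels), indent=2))
    print(json.dumps(blocking_pdb("chaos-block", "demo", labels), indent=2))
```

Both objects are purely additive, so deleting them is normally enough to undo the injection. That is one plausible reason these types read as less destructive than the deletion-based ones, although PDBBlock is still rated High in the table.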

Decision Tree

Which failure mode should I use?

```mermaid
graph TD
    A[What are you testing?] --> B{Pod lifecycle?}
    B -->|Yes| C[PodKill]
    A --> D{Network resilience?}
    D -->|Yes| E[NetworkPartition]
    A --> F{Config reconciliation?}
    F -->|Yes| G[ConfigDrift]
    A --> H{CR spec handling?}
    H -->|Yes| I[CRDMutation]
    A --> J{Webhook resilience?}
    J -->|Yes| K[WebhookDisrupt]
    A --> L{Permission handling?}
    L -->|Yes| M[RBACRevoke]
    A --> N{Deletion/cleanup?}
    N -->|Yes| O[FinalizerBlock]
    A --> P{API error handling?}
    P -->|Yes| Q[ClientFault]
    A --> R{Ownership/adoption?}
    R -->|Yes| S[OwnerRefOrphan]
    A --> T{Resource pressure?}
    T -->|Yes| U[QuotaExhaustion]
    A --> V{API latency?}
    V -->|Yes| W[WebhookLatency]
    A --> X{Secret resilience?}
    X -->|Yes| Y[SecretDeletion]
    A --> Z{Rollout handling?}
    Z -->|CrashLoopBackOff| AA[CrashLoopInject]
    Z -->|ImagePullBackOff| AB[ImageCorrupt]
    A --> AC{Scale enforcement?}
    AC -->|Yes| AD[DeploymentScaleZero]
    A --> AE{Eviction blocking?}
    AE -->|Yes| AF[PDBBlock]
    A --> AG{Leader election?}
    AG -->|Yes| AH[LeaderElectionDisrupt]
    A --> AI{Resource recreation?}
    AI -->|Yes| AJ[ResourceDeletion]
```
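
The decision tree can also be read as a plain lookup from a testing concern to a failure type. The sketch below mirrors the graph's branches as a dictionary. The `suggest` helper and the lowercase key spelling are illustrative assumptions, not part of Operator Chaos.

```python
# Concern -> failure type, copied from the decision tree's branches.
# "rollout handling" has two outcomes, keyed by the symptom to provoke.
DECISION_TREE = {
    "pod lifecycle": "PodKill",
    "network resilience": "NetworkPartition",
    "config reconciliation": "ConfigDrift",
    "cr spec handling": "CRDMutation",
    "webhook resilience": "WebhookDisrupt",
    "permission handling": "RBACRevoke",
    "deletion/cleanup": "FinalizerBlock",
    "api error handling": "ClientFault",
    "ownership/adoption": "OwnerRefOrphan",
    "resource pressure": "QuotaExhaustion",
    "api latency": "WebhookLatency",
    "secret resilience": "SecretDeletion",
    "rollout handling": {
        "CrashLoopBackOff": "CrashLoopInject",
        "ImagePullBackOff": "ImageCorrupt",
    },
    "scale enforcement": "DeploymentScaleZero",
    "eviction blocking": "PDBBlock",
    "leader election": "LeaderElectionDisrupt",
    "resource recreation": "ResourceDeletion",
}


def suggest(concern, symptom=None):
    """Return the failure type for a concern, or None if there is no branch.

    `symptom` is only consulted for concerns with several outcomes.
    """
    key = concern.strip().rstrip("?").lower()
    choice = DECISION_TREE.get(key)
    if isinstance(choice, dict):
        return choice.get(symptom)
    return choice


if __name__ == "__main__":
    print(suggest("Pod lifecycle?"))
    print(suggest("Rollout handling?", "ImagePullBackOff"))
```

LabelStomping and NamespaceDeletion have no branch in the tree, so this lookup never returns them. Consult the Quick Reference for those two types.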

Coverage by Component

Active Components (RHOAI 3.x / ODH)

Component CRDMut Client CfgDrift Finalizer LblStomp NsDel NetPart OwnerRef PodKill Quota RBAC WebhookD WebhookL SecDel ScaleZ LeaderE Crash ImgCorr ResDel PDB Total
dashboard - - - - - - - - - - - - - 7
data-science-pipelines - - - - - - - - - - - - - - - 5
feast - - - - - - - - - - - - - - - - 4
kserve - - - - - - - - - - 10
llamastack - - - - - - - - - - - - - - - - 4
model-registry - - - - - - - - - - - - - - 6
odh-model-controller - - - 17
opendatahub-operator - - - - - - - - - - - - - - - 5
ray - - - - - - - - - - - - - - - - 4
training-operator - - - - - - - - - - - - - - - - 4
trustyai - - - - - - - - - - - - - - - - - 3
workbenches - - - - - - - - - - - - - - - - 4
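
The Total column is easier to compare as a fraction of the 20 failure-type columns. The sketch below assumes that Total counts the failure types with at least one experiment for that component, and uses the totals from the Active Components table. The helper names are hypothetical.

```python
TOTAL_FAILURE_TYPES = 20  # number of failure-type columns in the table

# Totals copied from the Active Components table.
ACTIVE_TOTALS = {
    "dashboard": 7,
    "data-science-pipelines": 5,
    "feast": 4,
    "kserve": 10,
    "llamastack": 4,
    "model-registry": 6,
    "odh-model-controller": 17,
    "opendatahub-operator": 5,
    "ray": 4,
    "training-operator": 4,
    "trustyai": 3,
    "workbenches": 4,
}


def coverage_percent(covered, total=TOTAL_FAILURE_TYPES):
    """Share of failure types covered, as a percentage."""
    return 100.0 * covered / total


def least_covered(totals, n=3):
    """Return the n components with the lowest totals (ties broken by name)."""
    return sorted(totals, key=lambda name: (totals[name], name))[:n]


if __name__ == "__main__":
    for name in sorted(ACTIVE_TOTALS):
        print(f"{name}: {coverage_percent(ACTIVE_TOTALS[name]):.0f}%")
    print("least covered:", least_covered(ACTIVE_TOTALS))
```

Under that assumption, odh-model-controller covers 85% of the failure types and trustyai covers 15%.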

External Dependencies

Component CRDMut Client CfgDrift Finalizer LblStomp NsDel NetPart OwnerRef PodKill Quota RBAC WebhookD WebhookL SecDel ScaleZ LeaderE Crash ImgCorr ResDel PDB Total
cert-manager - - - - - - - - - - 10
knative-serving - - - - - - - - - 11
service-mesh - - - - - 15

Removed/Replaced (RHOAI 3.x)

Experiments for these components remain available for ODH or RHOAI 2.x testing.

Component CRDMut Client CfgDrift Finalizer LblStomp NsDel NetPart OwnerRef PodKill Quota RBAC WebhookD WebhookL SecDel ScaleZ LeaderE Crash ImgCorr ResDel PDB Total Status
codeflare - - - - - - - - - - - - - - - - 4 Removed in 3.0
kueue - - - - - - - - - - - - - - - 5 Replaced by RH Kueue
modelmesh - - - - - - - - - - - - - - - 5 Removed in 3.0