Skip to content

codeflare Failure Modes

Coverage

Injection Type Danger Experiment Description
ConfigDrift high config-drift.yaml When the codeflare operator configuration is corrupted, new cluster configuratio...
FinalizerBlock low finalizer-block.yaml When a stuck finalizer prevents the codeflare-operator Deployment from being del...
NetworkPartition medium network-partition.yaml When the codeflare-operator is network-partitioned from the API server, AppWrapp...
PodKill low pod-kill.yaml When the codeflare-operator pod is killed, existing Ray clusters remain unaffect...
RBACRevoke high rbac-revoke.yaml When the codeflare-operator ClusterRoleBinding subjects are revoked, the operato...
WebhookDisrupt high webhook-disrupt.yaml When the AppWrapper validating webhook failurePolicy is weakened from Fail to Ig...

Experiment Details

codeflare-config-drift

  • Type: ConfigDrift
  • Danger Level: high
  • Component: codeflare-operator-manager

When the codeflare operator configuration is corrupted, new cluster configurations receive wrong parameters. Existing Ray clusters remain unaffected. The operator should detect the drift and reconcile the correct configuration.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: codeflare-config-drift
spec:
  tier: 2
  target:
    operator: codeflare
    component: codeflare-operator-manager
    resource: ConfigMap/codeflare-operator-config
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: v1
        kind: ConfigMap
        name: codeflare-operator-config
        namespace: opendatahub
    timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: codeflare-operator-config
      key: config.yaml
      value: '{"ray":{"defaultClusterSize":"-1","workerImage":"invalid:broken"}}'
      resourceType: ConfigMap
    ttl: "300s"
  hypothesis:
    description: >-
      When the codeflare operator configuration is corrupted, new cluster
      configurations receive wrong parameters. Existing Ray clusters remain
      unaffected. The operator should detect the drift and reconcile the
      correct configuration.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true

codeflare-finalizer-block

  • Type: FinalizerBlock
  • Danger Level: low
  • Component: codeflare-operator-manager

When a stuck finalizer prevents the codeflare-operator Deployment from being deleted, the operator lifecycle should handle the Terminating state gracefully. The chaos framework removes the finalizer via TTL-based cleanup after 300s.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: codeflare-finalizer-block
spec:
  tier: 3
  target:
    operator: codeflare
    component: codeflare-operator-manager
    resource: Deployment/codeflare-operator-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: codeflare-operator-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: FinalizerBlock
    parameters:
      apiVersion: apps/v1
      kind: Deployment
      name: codeflare-operator-manager
      finalizer: chaos.operatorchaos.io/block-test
    ttl: "300s"
  hypothesis:
    description: >-
      When a stuck finalizer prevents the codeflare-operator Deployment
      from being deleted, the operator lifecycle should handle the Terminating
      state gracefully. The chaos framework removes the finalizer via
      TTL-based cleanup after 300s.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

codeflare-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: codeflare-operator-manager

When the codeflare-operator is network-partitioned from the API server, AppWrapper reconciliation stops. Existing Ray clusters continue running. Once the partition is removed, reconciliation resumes without manual intervention.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: codeflare-network-partition
spec:
  tier: 2
  target:
    operator: codeflare
    component: codeflare-operator-manager
    resource: Deployment/codeflare-operator-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: codeflare-operator-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=manager,app.kubernetes.io/name=codeflare-operator
    ttl: "300s"
  hypothesis:
    description: >-
      When the codeflare-operator is network-partitioned from the API
      server, AppWrapper reconciliation stops. Existing Ray clusters
      continue running. Once the partition is removed, reconciliation
      resumes without manual intervention.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

codeflare-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: codeflare-operator-manager

When the codeflare-operator pod is killed, existing Ray clusters remain unaffected. New AppWrapper submissions queue until the operator recovers within the recovery timeout.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: codeflare-pod-kill
spec:
  tier: 1
  target:
    operator: codeflare
    component: codeflare-operator-manager
    resource: Deployment/codeflare-operator-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: codeflare-operator-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=manager,app.kubernetes.io/name=codeflare-operator
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the codeflare-operator pod is killed, existing Ray clusters
      remain unaffected. New AppWrapper submissions queue until the
      operator recovers within the recovery timeout.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

codeflare-rbac-revoke

  • Type: RBACRevoke
  • Danger Level: high
  • Component: codeflare-operator-manager

When the codeflare-operator ClusterRoleBinding subjects are revoked, the operator can no longer manage AppWrapper resources. API calls return 403 errors. Once permissions are restored, normal operation resumes without restart.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: codeflare-rbac-revoke
spec:
  tier: 4
  target:
    operator: codeflare
    component: codeflare-operator-manager
    resource: ClusterRoleBinding/codeflare-operator-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: codeflare-operator-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: codeflare-operator-manager-rolebinding
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the codeflare-operator ClusterRoleBinding subjects are revoked,
      the operator can no longer manage AppWrapper resources. API calls
      return 403 errors. Once permissions are restored, normal operation
      resumes without restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true

codeflare-webhook-disrupt

  • Type: WebhookDisrupt
  • Danger Level: high
  • Component: codeflare-operator-manager

When the AppWrapper validating webhook failurePolicy is weakened from Fail to Ignore, invalid AppWrapper resources may be admitted to the cluster. The chaos framework restores the original failurePolicy via TTL-based cleanup after 60s.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: codeflare-webhook-disrupt
spec:
  tier: 4
  target:
    operator: codeflare
    component: codeflare-operator-manager
    resource: ValidatingWebhookConfiguration/vappwrapper.codeflare.dev
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: codeflare-operator-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: vappwrapper.codeflare.dev
      webhookType: validating
      action: setFailurePolicy
      value: Ignore
    ttl: "60s"
  hypothesis:
    description: >-
      When the AppWrapper validating webhook failurePolicy is weakened from
      Fail to Ignore, invalid AppWrapper resources may be admitted to the
      cluster. The chaos framework restores the original failurePolicy via
      TTL-based cleanup after 60s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true