ray Failure Modes

Coverage

Injection Type	Danger	Experiment	Description
FinalizerBlock	low	finalizer-block.yaml	When a stuck finalizer prevents a RayCluster from being deleted, the controller ...
NetworkPartition	medium	network-partition.yaml	When the ray-operator is network-partitioned from the API server, cluster scalin...
PodKill	low	pod-kill.yaml	When the ray-operator pod is killed, existing RayClusters keep running and servi...
RBACRevoke	high	rbac-revoke.yaml	When the ray-operator ClusterRoleBinding subjects are revoked, the controller ca...
WebhookDisrupt	high	webhook-disrupt.yaml	When the RayCluster validating webhook failurePolicy is weakened from Fail to Ig...
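
The danger levels in the table can act as a gate when deciding which experiments to run in a given environment. The following is a minimal sketch, assuming the ordering low < medium < high (the table does not state it explicitly); the helper name `experiments_up_to` is hypothetical and not part of the chaos framework.

```python
# Rows copied from the coverage table: (injection type, danger, experiment file).
COVERAGE = [
    ("FinalizerBlock", "low", "finalizer-block.yaml"),
    ("NetworkPartition", "medium", "network-partition.yaml"),
    ("PodKill", "low", "pod-kill.yaml"),
    ("RBACRevoke", "high", "rbac-revoke.yaml"),
    ("WebhookDisrupt", "high", "webhook-disrupt.yaml"),
]

# Assumed ordering of danger levels; not stated in the source.
DANGER_RANK = {"low": 0, "medium": 1, "high": 2}


def experiments_up_to(max_danger: str) -> list[str]:
    """Return experiment files whose danger level does not exceed max_danger."""
    limit = DANGER_RANK[max_danger]
    return [name for _, danger, name in COVERAGE if DANGER_RANK[danger] <= limit]


print(experiments_up_to("medium"))
```

Calling `experiments_up_to("high")` returns all five files, so a shared cluster might reasonably cap the run at "medium".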

Experiment Details

ray-finalizer-block

Type: FinalizerBlock
Danger Level: low
Component: ray-operator-controller-manager

When a stuck finalizer prevents a RayCluster from being deleted, the controller should handle the Terminating state gracefully and not leak associated head or worker pods. The chaos framework removes the finalizer via TTL-based cleanup after 300s.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: ray-finalizer-block
spec:
  tier: 3
  target:
    operator: ray
    component: ray-operator-controller-manager
    resource: RayCluster
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: ray-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: FinalizerBlock
    parameters:
      apiVersion: ray.io/v1
      kind: RayCluster
      name: test-raycluster
      finalizer: ray.io/finalizer
    ttl: "300s"
  hypothesis:
    description: >-
      When a stuck finalizer prevents a RayCluster from being deleted, the
      controller should handle the Terminating state gracefully and not leak
      associated head or worker pods. The chaos framework removes the
      finalizer via TTL-based cleanup after 300s.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
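
One way to check the "no leaked pods" part of this hypothesis is to inspect pod labels once the TTL cleanup has removed the finalizer and the RayCluster is gone. The sketch below works on already-decoded JSON (for example the `items` list from `kubectl get pods -n opendatahub -o json`). It assumes the operator labels head and worker pods with `ray.io/cluster=<cluster name>`, which should be verified against your installation. The function names and the sample pod names are hypothetical.

```python
def is_terminating(obj: dict) -> bool:
    """A Kubernetes object is Terminating once metadata.deletionTimestamp is set."""
    return bool(obj.get("metadata", {}).get("deletionTimestamp"))


def leaked_pods(pods: list[dict], cluster_name: str) -> list[str]:
    """Names of pods that still carry the label of the given RayCluster."""
    return [
        pod["metadata"]["name"]
        for pod in pods
        # Assumed label key; confirm it on a live head or worker pod.
        if pod["metadata"].get("labels", {}).get("ray.io/cluster") == cluster_name
    ]


# Hypothetical sample data shaped like Kubernetes API objects.
cluster = {
    "metadata": {
        "name": "test-raycluster",
        "deletionTimestamp": "2024-01-01T00:00:00Z",
        "finalizers": ["ray.io/finalizer"],
    }
}
pods = [
    {"metadata": {"name": "test-raycluster-head-abcde",
                  "labels": {"ray.io/cluster": "test-raycluster"}}},
    {"metadata": {"name": "unrelated-pod", "labels": {"app": "other"}}},
]

print(is_terminating(cluster))
print(leaked_pods(pods, "test-raycluster"))
```

While the finalizer is still in place, a non-empty result is expected. It only indicates a leak if pods remain after the RayCluster object has been deleted.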

ray-network-partition

Type: NetworkPartition
Danger Level: medium
Component: ray-operator-controller-manager

When the ray-operator is network-partitioned from the API server, cluster scaling and health monitoring stop. Existing RayClusters continue running workloads. Once the partition is removed, reconciliation resumes without manual intervention.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: ray-network-partition
spec:
  tier: 2
  target:
    operator: ray
    component: ray-operator-controller-manager
    resource: Deployment/ray-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: ray-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=kuberay-operator
    ttl: "300s"
  hypothesis:
    description: >-
      When the ray-operator is network-partitioned from the API server,
      cluster scaling and health monitoring stop. Existing RayClusters
      continue running workloads. Once the partition is removed,
      reconciliation resumes without manual intervention.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
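
The `labelSelector` above is an equality-based Kubernetes selector: comma-separated `key=value` terms that must all match. To preview which pods it would select, and whether that stays within `maxPodsAffected`, the matching can be sketched as follows. This is an illustration only. It handles plain `key=value` terms (not `==`, `!=`, or set-based expressions), and how the chaos framework itself evaluates the selector or enforces the blast radius is not described in this document. The function names and sample pods are hypothetical.

```python
def parse_selector(selector: str) -> dict[str, str]:
    """Split "k1=v1,k2=v2" into a dict; raises ValueError on a term without "="."""
    terms = (term.split("=", 1) for term in selector.split(",") if term.strip())
    return {key.strip(): value.strip() for key, value in terms}


def matches(labels: dict[str, str], selector: str) -> bool:
    """True if every selector term is present in labels with the same value."""
    return all(labels.get(k) == v for k, v in parse_selector(selector).items())


SELECTOR = "control-plane=controller-manager,app.kubernetes.io/name=kuberay-operator"

# Hypothetical pod label sets keyed by pod name.
pod_labels = {
    "operator-pod": {
        "control-plane": "controller-manager",
        "app.kubernetes.io/name": "kuberay-operator",
    },
    "other-controller": {"control-plane": "controller-manager"},
}

selected = [name for name, labels in pod_labels.items() if matches(labels, SELECTOR)]
print(selected)
print(len(selected) <= 1)  # compare against maxPodsAffected: 1
```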

ray-pod-kill

Type: PodKill
Danger Level: low
Component: ray-operator-controller-manager

When the ray-operator pod is killed, existing RayClusters keep running and serving workloads. New cluster requests queue until the controller recovers, which is expected within the recovery timeout.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: ray-pod-kill
spec:
  tier: 1
  target:
    operator: ray
    component: ray-operator-controller-manager
    resource: Deployment/ray-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: ray-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=kuberay-operator
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the ray-operator pod is killed, existing RayClusters keep
      running and serving workloads. New cluster requests queue until
      the controller recovers within the recovery timeout.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
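
The hypothesis passes only if the Deployment becomes Available again within `recoveryTimeout`. A minimal sketch of that comparison follows, assuming the timeout strings use Go-style units limited to hours, minutes, and seconds (the experiments in this document only use whole seconds). `parse_duration` and `recovered_in_time` are hypothetical helpers, not framework functions.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings such as "120s" or "1m30s" to whole seconds."""
    parts = re.findall(r"(\d+)([smh])", text)
    # Reject anything the pattern did not fully consume, e.g. "5ms" or "abc".
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def recovered_in_time(observed_seconds: float, recovery_timeout: str) -> bool:
    """True if the observed recovery time is within the configured timeout."""
    return observed_seconds <= parse_duration(recovery_timeout)


# Example: the operator pod was Available again 42.5 s after the kill.
print(recovered_in_time(42.5, "120s"))
```

The observed time would come from timestamps you record yourself, for example the moment of injection and the moment the `Available` condition returns to `True`.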

ray-rbac-revoke

Type: RBACRevoke
Danger Level: high
Component: ray-operator-controller-manager

When the ray-operator ClusterRoleBinding subjects are revoked, the controller can no longer manage RayCluster resources, and its API calls fail with 403 Forbidden errors. Once permissions are restored, normal operation resumes without a restart.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: ray-rbac-revoke
spec:
  tier: 4
  target:
    operator: ray
    component: ray-operator-controller-manager
    resource: ClusterRoleBinding/ray-operator-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: ray-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: ray-operator-manager-rolebinding
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the ray-operator ClusterRoleBinding subjects are revoked, the
      controller can no longer manage RayCluster resources. API calls
      return 403 errors. Once permissions are restored, normal operation
      resumes without restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
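
Two parts of this hypothesis can be checked from snapshots taken before and after the experiment: that the binding's `subjects` were restored, and that the operator pod recovered "without restart". The sketch below compares decoded JSON objects (for example from `kubectl get clusterrolebinding ... -o json` and `kubectl get pod ... -o json`). It relies only on the standard `subjects` and `status.containerStatuses[].restartCount` fields. The function names and the service account name in the sample data are hypothetical.

```python
def subject_set(binding: dict) -> set[tuple]:
    """Order-independent view of a binding's subjects (empty if revoked)."""
    return {
        (s.get("kind"), s.get("namespace"), s.get("name"))
        for s in binding.get("subjects") or []
    }


def subjects_restored(before: dict, after: dict) -> bool:
    """True if the binding has the same subjects as before the injection."""
    return subject_set(before) == subject_set(after)


def total_restarts(pod: dict) -> int:
    """Sum of restartCount over all containers reported in the pod status."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return sum(c.get("restartCount", 0) for c in statuses)


# Hypothetical snapshots; the service account name is made up for illustration.
before = {"subjects": [{"kind": "ServiceAccount", "namespace": "opendatahub",
                        "name": "example-operator-sa"}]}
during = {"subjects": None}
after = {"subjects": [{"kind": "ServiceAccount", "namespace": "opendatahub",
                       "name": "example-operator-sa"}]}
pod_before = {"status": {"containerStatuses": [{"restartCount": 0}]}}
pod_after = {"status": {"containerStatuses": [{"restartCount": 0}]}}

print(subjects_restored(before, during))
print(subjects_restored(before, after))
print(total_restarts(pod_after) == total_restarts(pod_before))
```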

ray-webhook-disrupt

Type: WebhookDisrupt
Danger Level: high
Component: ray-operator-controller-manager

When the RayCluster validating webhook failurePolicy is weakened from Fail to Ignore, invalid RayCluster resources may be admitted to the cluster. The chaos framework restores the original failurePolicy via TTL-based cleanup after 60s.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: ray-webhook-disrupt
spec:
  tier: 4
  target:
    operator: ray
    component: ray-operator-controller-manager
    resource: ValidatingWebhookConfiguration/vraycluster.ray.io
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: ray-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: vraycluster.ray.io
      webhookType: validating
      action: setFailurePolicy
      value: Ignore
    ttl: "60s"
  hypothesis:
    description: >-
      When the RayCluster validating webhook failurePolicy is weakened from
      Fail to Ignore, invalid RayCluster resources may be admitted to the
      cluster. The chaos framework restores the original failurePolicy via
      TTL-based cleanup after 60s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
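
To confirm that the TTL cleanup restored the original policy, the `failurePolicy` of each entry in the webhook configuration can be inspected before, during, and after the experiment. The sketch below reads a decoded ValidatingWebhookConfiguration object and relies only on the standard `webhooks[].name` and `webhooks[].failurePolicy` fields. The function names are hypothetical, and the sample objects are illustrative rather than captured from a cluster.

```python
def failure_policies(config: dict) -> dict[str, str]:
    """Map each webhook entry name to its failurePolicy."""
    # admissionregistration.k8s.io/v1 defaults failurePolicy to Fail when unset.
    return {
        hook["name"]: hook.get("failurePolicy", "Fail")
        for hook in config.get("webhooks", [])
    }


def weakened_webhooks(config: dict) -> list[str]:
    """Names of webhook entries that currently fail open (failurePolicy: Ignore)."""
    return [
        name
        for name, policy in failure_policies(config).items()
        if policy == "Ignore"
    ]


# Illustrative snapshots of the configuration during and after the injection.
during_injection = {
    "kind": "ValidatingWebhookConfiguration",
    "webhooks": [{"name": "vraycluster.ray.io", "failurePolicy": "Ignore"}],
}
after_cleanup = {
    "kind": "ValidatingWebhookConfiguration",
    "webhooks": [{"name": "vraycluster.ray.io", "failurePolicy": "Fail"}],
}

print(weakened_webhooks(during_injection))
print(weakened_webhooks(after_cleanup))
```

An empty list after the 60s TTL indicates the policy is back to `Fail`. Any RayCluster admitted during the window would still need to be reviewed separately, since restoring the policy does not revalidate existing objects.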