# training-operator Failure Modes

## Coverage
| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| FinalizerBlock | low | finalizer-block.yaml | When a stuck finalizer prevents a PyTorchJob from being deleted, the controller should handle the Terminating state gracefully and not leak associated worker pods or services. |
| NetworkPartition | medium | network-partition.yaml | When the training-operator is network-partitioned from the API server, job status stops updating but running worker pods continue training. |
| PodKill | low | pod-kill.yaml | When the training-operator pod is killed, running training jobs continue via worker pods. |
| RBACRevoke | high | rbac-revoke.yaml | When the training-operator ClusterRoleBinding subjects are revoked, the controller can no longer manage PyTorchJob resources. |
| WebhookDisrupt | high | webhook-disrupt.yaml | When the PyTorchJob validating webhook failurePolicy is weakened from Fail to Ignore, invalid PyTorchJob resources may be admitted to the cluster. |
## Experiment Details
### training-operator-finalizer-block
- Type: FinalizerBlock
- Danger Level: low
- Component: training-operator-controller-manager
When a stuck finalizer prevents a PyTorchJob from being deleted, the controller should handle the Terminating state gracefully and not leak associated worker pods or services. The chaos framework removes the finalizer via TTL-based cleanup after 300s.
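For context, an object blocked this way sits in Terminating: the API server sets `metadata.deletionTimestamp` when the delete is issued, but actual removal waits until the finalizers list is empty. A minimal sketch of the stuck PyTorchJob (the timestamp is illustrative, the namespace is assumed from the experiment's allowedNamespaces, and the name and finalizer come from the injection parameters):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: test-pytorchjob
  namespace: opendatahub
  # Set by the API server when the delete is issued (illustrative value).
  deletionTimestamp: "2025-01-01T00:00:00Z"
  finalizers:
    # While this entry remains, the object cannot be fully deleted.
    - training-operator.kubeflow.org/finalizer
```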
Experiment YAML
```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-finalizer-block
spec:
  tier: 3
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: PyTorchJob
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: FinalizerBlock
    parameters:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      name: test-pytorchjob
      finalizer: training-operator.kubeflow.org/finalizer
      ttl: "300s"
  hypothesis:
    description: >-
      When a stuck finalizer prevents a PyTorchJob from being deleted, the
      controller should handle the Terminating state gracefully and not leak
      associated worker pods or services. The chaos framework removes the
      finalizer via TTL-based cleanup after 300s.
  recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
```
### training-operator-network-partition
- Type: NetworkPartition
- Danger Level: medium
- Component: training-operator-controller-manager
When the training-operator is network-partitioned from the API server, job status stops updating but running worker pods continue training. Once the partition is removed, reconciliation resumes without manual intervention.
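The experiment spec does not name the partition mechanism, but one way to reproduce this failure mode by hand is a deny-all egress NetworkPolicy scoped to the same labels the injection selects on (a hypothetical sketch; it assumes the cluster's CNI enforces NetworkPolicy):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: partition-training-operator  # hypothetical name
  namespace: opendatahub
spec:
  # Same labels as the injection's labelSelector.
  podSelector:
    matchLabels:
      control-plane: controller-manager
      app.kubernetes.io/name: training-operator
  # Declaring Egress with no egress rules denies all outbound traffic,
  # cutting the controller off from the API server.
  policyTypes:
    - Egress
```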
Experiment YAML
```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-network-partition
spec:
  tier: 2
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: Deployment/training-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=training-operator
      ttl: "300s"
  hypothesis:
    description: >-
      When the training-operator is network-partitioned from the API server,
      job status stops updating but running worker pods continue training.
      Once the partition is removed, reconciliation resumes without manual
      intervention.
  recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
```
### training-operator-pod-kill
- Type: PodKill
- Danger Level: low
- Component: training-operator-controller-manager
When the training-operator pod is killed, running training jobs continue via worker pods. New job submissions queue until the controller recovers within the recovery timeout.
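The injection selects its victim by labels rather than by name, so it keeps working across restarts and redeploys. A sketch of the pod metadata that the injection's labelSelector (shown in the experiment YAML below) would match, with the labels taken from the selector and the other fields illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  # Created by the controller-manager Deployment (illustrative).
  generateName: training-operator-controller-manager-
  namespace: opendatahub
  labels:
    # Both labels must be present for the PodKill selector to match.
    control-plane: controller-manager
    app.kubernetes.io/name: training-operator
```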
Experiment YAML
```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-pod-kill
spec:
  tier: 1
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: Deployment/training-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=training-operator
      count: 1
      ttl: "300s"
  hypothesis:
    description: >-
      When the training-operator pod is killed, running training jobs
      continue via worker pods. New job submissions queue until the
      controller recovers within the recovery timeout.
  recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
```
### training-operator-rbac-revoke
- Type: RBACRevoke
- Danger Level: high
- Component: training-operator-controller-manager
When the training-operator ClusterRoleBinding subjects are revoked, the controller can no longer manage PyTorchJob resources. API calls return 403 errors. Once permissions are restored, normal operation resumes without restart.
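Revoking the subjects leaves the binding object in place but pointing at no identities, which is why recovery needs no restart: once the subjects are restored, the controller's existing ServiceAccount credentials work again. A sketch of the intact binding (the ClusterRole and ServiceAccount names are assumed; the experiment only names the binding):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: training-operator-manager-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: training-operator-manager-role  # assumed name
# The injection strips this list for the TTL window; with no subjects,
# the controller's API requests are rejected with 403 Forbidden.
subjects:
  - kind: ServiceAccount
    name: training-operator  # assumed name
    namespace: opendatahub
```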
Experiment YAML
```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-rbac-revoke
spec:
  tier: 4
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: ClusterRoleBinding/training-operator-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: training-operator-manager-rolebinding
      bindingType: ClusterRoleBinding
      ttl: "60s"
  hypothesis:
    description: >-
      When the training-operator ClusterRoleBinding subjects are revoked,
      the controller can no longer manage PyTorchJob resources. API calls
      return 403 errors. Once permissions are restored, normal operation
      resumes without restart.
  recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
```
### training-operator-webhook-disrupt
- Type: WebhookDisrupt
- Danger Level: high
- Component: training-operator-controller-manager
When the PyTorchJob validating webhook failurePolicy is weakened from Fail to Ignore, invalid PyTorchJob resources may be admitted to the cluster. The chaos framework restores the original failurePolicy via TTL-based cleanup after 60s.
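Concretely, the injection flips one field on the live webhook configuration. A trimmed sketch of the object during the injection window (clientConfig, rules, and the other required webhook fields are omitted for brevity):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: vpytorchjob.kubeflow.org
webhooks:
  - name: vpytorchjob.kubeflow.org
    # Flipped from Fail: webhook call failures no longer block admission,
    # so invalid PyTorchJobs can slip through while the webhook is down.
    failurePolicy: Ignore
```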
Experiment YAML
```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-webhook-disrupt
spec:
  tier: 4
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: ValidatingWebhookConfiguration/vpytorchjob.kubeflow.org
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: vpytorchjob.kubeflow.org
      webhookType: validating
      action: setFailurePolicy
      value: Ignore
      ttl: "60s"
  hypothesis:
    description: >-
      When the PyTorchJob validating webhook failurePolicy is weakened from
      Fail to Ignore, invalid PyTorchJob resources may be admitted to the
      cluster. The chaos framework restores the original failurePolicy via
      TTL-based cleanup after 60s.
  recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
```