data-science-pipelines Failure Modes

Coverage

  • ConfigDrift (high, config-drift.yaml): When the dspo-config ConfigMap is corrupted with invalid configuration, the operator should detect the misconfiguration and either fall back to defaults or surface clear error conditions rather than silently failing.
  • FinalizerBlock (low, finalizer-block.yaml): When a stuck finalizer prevents a DataSciencePipelinesApplication from being deleted, the DSPO should handle the Terminating state gracefully and not leak associated pipeline resources.
  • NetworkPartition (medium, network-partition.yaml): When the DSPO pod is network-partitioned from the API server, it should lose its leader lease and stop reconciling until the partition is removed.
  • PodKill (low, pod-kill.yaml): When the data-science-pipelines-operator pod is killed, Kubernetes should recreate it within the recovery timeout.
  • RBACRevoke (high, rbac-revoke.yaml): When the DSPO ClusterRoleBinding subjects are revoked, the operator should lose its ability to reconcile across namespaces and surface permission-denied errors until permissions are restored.
  • WebhookDisrupt (high, webhook-disrupt.yaml): When the pipeline version validating webhook failurePolicy is weakened from Fail to Ignore, invalid pipeline versions can bypass admission validation.

Experiment Details

data-science-pipelines-config-drift

  • Type: ConfigDrift
  • Danger Level: high
  • Component: data-science-pipelines-operator

When the dspo-config ConfigMap is corrupted with invalid configuration, the data-science-pipelines operator should detect the misconfiguration and either fall back to defaults or surface clear error conditions rather than silently failing.
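
The fallback behavior the hypothesis expects can be sketched as schema-checked config loading. This is a minimal illustration, not the real DSPO code: the DEFAULTS keys are hypothetical, while the injected payload matches the experiment's value parameter.

```python
# Sketch: validate an operator config payload and fall back to defaults when
# unknown keys appear. Key names are illustrative, not the real DSPO schema.
import json

DEFAULTS = {"ApiServer": {"Deploy": True}, "Images": {}}

def load_config(raw: str) -> tuple[dict, list[str]]:
    """Return (effective_config, errors); errors are surfaced, never swallowed."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return DEFAULTS, [f"unparseable config, using defaults: {exc}"]
    unknown = [k for k in parsed if k not in DEFAULTS]
    if unknown:
        # Surface a clear error condition instead of silently failing.
        return DEFAULTS, [f"unknown keys {unknown}, falling back to defaults"]
    return {**DEFAULTS, **parsed}, []

# The exact payload the experiment injects into ConfigMap/dspo-config:
cfg, errs = load_config('{"INVALID_KEY":"INVALID_VALUE"}')
```

The key point is that the corrupted payload yields both a usable config (the defaults) and a reportable error, which is what the steady-state checks and hypothesis look for.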

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: data-science-pipelines-config-drift
spec:
  tier: 2
  target:
    operator: data-science-pipelines
    component: data-science-pipelines-operator
    resource: ConfigMap/dspo-config
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: v1
        kind: ConfigMap
        name: dspo-config
        namespace: opendatahub
    timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: dspo-config
      key: config
      value: '{"INVALID_KEY":"INVALID_VALUE"}'
      resourceType: ConfigMap
    ttl: "300s"
  hypothesis:
    description: >-
      When the dspo-config ConfigMap is corrupted with invalid configuration,
      the data-science-pipelines operator should detect the misconfiguration
      and either fall back to defaults or surface clear error conditions
      rather than silently failing.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true

data-science-pipelines-finalizer-block

  • Type: FinalizerBlock
  • Danger Level: low
  • Component: data-science-pipelines-operator

When a stuck finalizer prevents a DataSciencePipelinesApplication from being deleted, the DSPO should handle the Terminating state gracefully, report the blocked deletion in its status, and not leak associated pipeline resources. The chaos framework removes the finalizer via TTL-based cleanup after 300s.
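
The Terminating semantics behind this experiment follow from how Kubernetes deletion works: an object with a deletion timestamp is only removed once its finalizers list is empty. A minimal model (simplified sketch, not client-go):

```python
# Simplified model of Kubernetes deletion: deletion is requested, but garbage
# collection only completes once the finalizers list is empty. The TTL-based
# cleanup in the experiment corresponds to clearing the finalizer here.
from dataclasses import dataclass, field

@dataclass
class Obj:
    name: str
    finalizers: list = field(default_factory=list)
    deletion_requested: bool = False

def delete(obj: Obj) -> bool:
    obj.deletion_requested = True
    return gc(obj)

def gc(obj: Obj) -> bool:
    """Return True only if the object can actually be removed."""
    return obj.deletion_requested and not obj.finalizers

dspa = Obj("test-dspa",
           ["datasciencepipelinesapplications.opendatahub.io/finalizer"])
stuck = delete(dspa)          # False: object hangs in Terminating
dspa.finalizers.clear()       # TTL cleanup removes the finalizer after 300s
removed = gc(dspa)            # True: deletion finally completes
```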

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: data-science-pipelines-finalizer-block
spec:
  tier: 3
  target:
    operator: data-science-pipelines
    component: data-science-pipelines-operator
    resource: DataSciencePipelinesApplication
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: data-science-pipelines-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: FinalizerBlock
    # IMPORTANT: "test-dspa" is a placeholder. Replace it with the name of an
    # actual DataSciencePipelinesApplication resource deployed in the target
    # namespace before running this experiment.
    parameters:
      apiVersion: datasciencepipelinesapplications.opendatahub.io/v1alpha1
      kind: DataSciencePipelinesApplication
      name: test-dspa
      finalizer: datasciencepipelinesapplications.opendatahub.io/finalizer
    ttl: "300s"
  hypothesis:
    description: >-
      When a stuck finalizer prevents a DataSciencePipelinesApplication
      from being deleted, the DSPO should handle the Terminating state
      gracefully, report the blocked deletion in its status, and not leak
      associated pipeline resources. The chaos framework removes the
      finalizer via TTL-based cleanup after 300s.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

data-science-pipelines-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: data-science-pipelines-operator

When the DSPO pod is network-partitioned from the API server, it should lose its leader lease and stop reconciling. Once the partition is removed, the operator should re-acquire the lease and resume normal operation without duplicate pipeline runs.
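
Leader election rests on lease renewal: a partitioned pod cannot reach the API server to renew, so its lease expires and it must stop reconciling. A minimal sketch of the expiry check (the 15s lease duration is an assumed controller-runtime-style default, not a value taken from the DSPO config):

```python
# Sketch of lease-based leader election: the holder stays leader only while
# its last renewal is fresher than the lease duration. A partitioned pod
# cannot renew, so it loses leadership once the lease expires.
def is_leader(last_renew: float, now: float, lease_duration: float = 15.0) -> bool:
    return (now - last_renew) < lease_duration

# Renewals flowing: still leader.
assert is_leader(last_renew=0.0, now=10.0)
# Partitioned for longer than the lease: leadership lost, reconciling stops.
assert not is_leader(last_renew=0.0, now=20.0)
```

Because a second replica (or the restarted pod) can only acquire the lease after expiry, at most one reconciler is active at a time, which is what rules out duplicate pipeline runs.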

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: data-science-pipelines-network-partition
spec:
  tier: 2
  target:
    operator: data-science-pipelines
    component: data-science-pipelines-operator
    resource: Deployment/data-science-pipelines-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: data-science-pipelines-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app.kubernetes.io/name=data-science-pipelines-operator
    ttl: "300s"
  hypothesis:
    description: >-
      When the DSPO pod is network-partitioned from the API server, it
      should lose its leader lease and stop reconciling. Once the partition
      is removed, the operator should re-acquire the lease and resume
      normal operation without duplicate pipeline runs.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

data-science-pipelines-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: data-science-pipelines-operator

When the data-science-pipelines-operator pod is killed, Kubernetes should recreate it within the recovery timeout. The operator should resume reconciling DataSciencePipelinesApplication resources without data loss or pipeline run interruption.
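
The recovery check amounts to polling the Deployment's Available condition until it flips true or recoveryTimeout elapses. A minimal sketch with a simulated probe (wait_for_recovery and deployment_available are illustrative names, not framework API):

```python
# Sketch: poll a readiness check until it succeeds or the timeout elapses,
# mirroring how a PodKill experiment judges whether Kubernetes recreated the pod.
import time

def wait_for_recovery(check, timeout_s: float, interval_s: float = 0.01) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False

# Simulated deployment that reports Available after a few polls.
state = {"polls": 0}
def deployment_available() -> bool:
    state["polls"] += 1
    return state["polls"] >= 3

recovered = wait_for_recovery(deployment_available, timeout_s=1.0)
```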

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: data-science-pipelines-pod-kill
spec:
  tier: 1
  target:
    operator: data-science-pipelines
    component: data-science-pipelines-operator
    resource: Deployment/data-science-pipelines-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: data-science-pipelines-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app.kubernetes.io/name=data-science-pipelines-operator
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the data-science-pipelines-operator pod is killed, Kubernetes
      should recreate it within the recovery timeout. The operator should
      resume reconciling DataSciencePipelinesApplication resources without
      data loss or pipeline run interruption.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

data-science-pipelines-rbac-revoke

  • Type: RBACRevoke
  • Danger Level: high
  • Component: data-science-pipelines-operator

When the DSPO ClusterRoleBinding subjects are revoked, the operator should lose its ability to reconcile DataSciencePipelinesApplication resources across namespaces and surface permission-denied errors. Once permissions are restored, reconciliation should resume.
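
A well-behaved controller treats permission-denied errors as retryable and reports them as a degraded condition rather than crashing, which is what allows reconciliation to resume once the binding is restored. A hedged sketch (the Forbidden exception and condition strings are illustrative, not DSPO internals):

```python
# Sketch: one reconcile attempt that surfaces a 403-style error as a reported,
# retryable condition. The controller requeues and retries; once RBAC is
# restored, the same loop succeeds without a restart.
class Forbidden(Exception):
    """Stand-in for an API-server 403 after the ClusterRoleBinding is revoked."""

def reconcile_once(api_call) -> str:
    try:
        api_call()
        return "Reconciled"
    except Forbidden as exc:
        # Surface the condition instead of crashing; the work item is requeued.
        return f"Degraded: {exc}"

def revoked():
    raise Forbidden("clusterrolebinding subjects revoked")

status_during = reconcile_once(revoked)       # permissions missing
status_after = reconcile_once(lambda: None)   # permissions restored
```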

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: data-science-pipelines-rbac-revoke
spec:
  tier: 4
  target:
    operator: data-science-pipelines
    component: data-science-pipelines-operator
    resource: ClusterRoleBinding/data-science-pipelines-operator-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: data-science-pipelines-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: data-science-pipelines-operator-manager-rolebinding
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the DSPO ClusterRoleBinding subjects are revoked, the operator
      should lose its ability to reconcile DataSciencePipelinesApplication
      resources across namespaces and surface permission-denied errors.
      Once permissions are restored, reconciliation should resume.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true

data-science-pipelines-webhook-disrupt

  • Type: WebhookDisrupt
  • Danger Level: high
  • Component: ds-pipelines-webhook

When the pipeline version validating webhook failurePolicy is weakened from Fail to Ignore, invalid pipeline versions can bypass admission validation. The chaos framework restores the original failurePolicy via TTL-based cleanup after 60s.
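
The failurePolicy semantics this experiment weakens reduce to a small decision table: with Fail, an unreachable or errored webhook blocks admission; with Ignore, the request is admitted without validation. A simplified model (admit is an illustration of the rule, not apiserver code):

```python
# Sketch of admission-webhook failurePolicy semantics. With failurePolicy=Fail,
# a webhook outage rejects the request; with Ignore, the request passes through
# unvalidated, which is exactly the gap this experiment opens for 60s.
def admit(webhook_reachable: bool, webhook_verdict: bool, failure_policy: str) -> bool:
    if not webhook_reachable:
        return failure_policy == "Ignore"
    return webhook_verdict

# Fail: a webhook outage blocks admission entirely.
assert admit(False, False, "Fail") is False
# Ignore: the same outage lets invalid pipeline versions slip through.
assert admit(False, False, "Ignore") is True
```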

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: data-science-pipelines-webhook-disrupt
spec:
  tier: 4
  target:
    operator: data-science-pipelines
    component: ds-pipelines-webhook
    resource: ValidatingWebhookConfiguration/validating.pipelineversions.pipelines.kubeflow.org
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: ds-pipelines-webhook
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: validating.pipelineversions.pipelines.kubeflow.org
      action: setFailurePolicy
      value: Ignore
    ttl: "60s"
  hypothesis:
    description: >-
      When the pipeline version validating webhook failurePolicy is
      weakened from Fail to Ignore, invalid pipeline versions can bypass
      admission validation. The chaos framework restores the original
      failurePolicy via TTL-based cleanup after 60s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true