Skip to content

ResourceDeletion

Danger Level: High

Deletes an arbitrary namespaced Kubernetes resource after backing it up, then restores it on revert if the operator has not recreated it.

Spec Fields

Field Type Required Default Description
apiVersion string Yes - API version of the resource (e.g., v1, apps/v1)
kind string Yes - Kind of the resource (e.g., Service, ConfigMap)
name string Yes - Name of the resource to delete
namespace string No experiment namespace Namespace of the target resource (defaults to the experiment's namespace)
ttl duration No 300s Auto-cleanup duration

How It Works

ResourceDeletion uses an unstructured client to get any namespaced Kubernetes resource by apiVersion/kind/name, serialize it to JSON, store the backup in a Secret (named chaos-backup-resource-<kind>-<name>), and delete the original. On revert, it checks whether the operator has already recreated the resource. If not, it restores from backup with server-managed fields cleared.

API calls: 1. Get the resource using unstructured client 2. Create (or Update) a backup Secret containing the serialized JSON 3. Delete the resource 4. On revert: check if the resource exists. If recreated by the operator, skip restore. If not, Create from backup (clearing UID, resourceVersion, managedFields, status, finalizers, generation). Then Delete the backup Secret.

Cleanup: If the operator recreated the resource, only the backup Secret is cleaned up. If the operator did not recreate it, the resource is restored from backup with all server-managed fields cleared to allow a clean creation.

Crash safety: The backup Secret persists after a crash. Use operator-chaos clean to find orphaned backups by the managed-by label and manually restore if needed.

Kind-specific safety checks:

  • Secret: deny-list blocks system-critical Secrets (pull-secret, SA tokens, system-critical prefixes)
  • Service: deny-list blocks critical Services (kubernetes, openshift-apiserver, dns-default, kube-dns)
  • ServiceAccount: deny-list blocks critical ServiceAccounts (default, deployer, builder, pipeline)
  • Cluster-scoped kinds: blocked entirely (Namespace, Node, ClusterRole, ClusterRoleBinding, CRD, PersistentVolume). ResourceDeletion only works with namespaced resources.
  • Protected namespaces (kube-system, openshift-*) are rejected

Disruption Rubric

Expected behavior on a healthy operator: The operator detects the missing resource (via watches or periodic reconciliation) and recreates it from its desired state. For Services, the operator should recreate the Service with the same selectors and ports. The Deployment should remain Available throughout if the deleted resource is not critical to pod health.

Contract violation indicators: - Operator does not recreate the deleted resource within recoveryTimeout (indicates the resource is not managed by reconciliation) - Operator recreates the resource but with incorrect spec (indicates drift between desired state and actual recreation logic) - Operator enters CrashLoopBackOff when the resource disappears (indicates fatal dependency with no error handling) - Dependent resources (e.g., Endpoints for a Service) are not recreated

Collateral damage risks: - Varies by kind. Service deletion breaks network connectivity to the operator but pods remain running. ConfigMap deletion may cause pod restarts if mounted as a volume. Secret deletion may break TLS. - For Service deletion specifically: existing TCP connections may persist (via conntrack), but new connections fail until the Service is recreated - On clusters with NetworkPolicies referencing the Service, deletion may break network rules

Recovery expectations: - Recovery time: 10-60 seconds for operator-reconciled resources. Services are typically recreated within one reconciliation cycle. - Reconcile cycles: 1-2 (detection, recreation) - What "recovered" means: resource exists with correct spec, and all dependent resources (Endpoints, etc.) are functional

Cross-Component Results

Component Experiment Danger Description
odh-model-controller odh-model-controller-resource-deletion-service high Deleting the metrics Service tests whether the operator recreates it. The Deployment remains Available since the Service is not critical to pod health.
kserve kserve-resource-deletion-service high Deleting the kserve-controller-manager metrics Service tests whether the operator recreates it. The Deployment remains Available.
knative-serving knative-serving-controller-resource-deletion-service high Deleting the knative-serving controller Service tests whether the operator recreates it. The Deployment remains Available.
cert-manager cert-manager-resource-deletion-service high Deleting the cert-manager Service tests whether the operator recreates it. The Deployment remains Available.
service-mesh istiod-resource-deletion-service high Deleting the istiod Service tests whether the operator recreates it. The Deployment remains Available.