LeaderElectionDisrupt

Danger Level: High

Backs up and deletes a Lease object to force leader re-election. On revert, restores the Lease only if no new one exists, then removes the backup.

Spec Fields

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| name | string | Yes | - | Name of the Lease to delete |
| ttl | duration | No | 300s | Auto-cleanup duration |
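
The defaults in the table can be illustrated with a small validation sketch. This is not the tool's actual parser: it assumes that `ttl` accepts Go-style duration strings with whole-number `h`, `m`, and `s` components, and the function names `parse_duration` and `normalize_spec` are hypothetical.

```python
import re
from datetime import timedelta

# Assumption: ttl uses Go-style durations such as "300s", "5m", or "1h30m".
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_duration(text: str) -> timedelta:
    """Parse a simple duration made of integer h/m/s components."""
    pos, total = 0, 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:  # unexpected characters between components
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):  # empty input or trailing garbage
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(seconds=total)


def normalize_spec(spec: dict) -> dict:
    """Apply the documented rules: name is required, ttl defaults to 300s."""
    name = spec.get("name")
    if not isinstance(name, str) or not name:
        raise ValueError("name is required and must be a non-empty string")
    return {"name": name, "ttl": parse_duration(spec.get("ttl", "300s"))}
```

Calling `normalize_spec({"name": "my-lease"})` yields a `ttl` of 300 seconds, matching the default in the table.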

How It Works

LeaderElectionDisrupt gets the target Lease, serializes it to JSON, stores the backup in a ConfigMap (named chaos-backup-lease-<name>), and deletes the Lease. This forces all candidates in the leader election to compete for a new Lease, triggering re-election.

API calls:
1. Get the target Lease
2. Create (or Update) a backup ConfigMap with the serialized Lease JSON
3. Delete the Lease to trigger re-election
4. On revert: check whether the Lease was recreated by the controller. If not, restore it from the backup. Then Delete the backup ConfigMap.
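
The sequence of calls can be sketched against an in-memory stand-in for the cluster. This is an illustration of the flow only, not the tool's implementation: `FakeCluster`, `inject`, `revert`, and the ConfigMap data key `lease.json` are all hypothetical names chosen for the example. Only the backup name pattern `chaos-backup-lease-<name>` comes from this page.

```python
import json


class FakeCluster:
    """In-memory stand-in for the Kubernetes API (hypothetical)."""

    def __init__(self):
        self.leases = {}      # Lease name -> Lease object (dict)
        self.configmaps = {}  # ConfigMap name -> data (dict of strings)


def backup_name(lease_name: str) -> str:
    return f"chaos-backup-lease-{lease_name}"


def inject(cluster: FakeCluster, lease_name: str) -> None:
    lease = cluster.leases[lease_name]  # 1. Get (KeyError if the Lease is missing)
    # 2. Create or Update the backup; "lease.json" is an assumed data key.
    cluster.configmaps[backup_name(lease_name)] = {"lease.json": json.dumps(lease)}
    del cluster.leases[lease_name]      # 3. Delete to trigger re-election


def revert(cluster: FakeCluster, lease_name: str) -> str:
    cm_name = backup_name(lease_name)
    backup = cluster.configmaps.get(cm_name)
    if backup is None:
        return "no-backup"
    if lease_name in cluster.leases:
        # The controller already re-elected: never overwrite the new Lease.
        outcome = "kept-new-lease"
    else:
        cluster.leases[lease_name] = json.loads(backup["lease.json"])
        outcome = "restored-from-backup"
    del cluster.configmaps[cm_name]     # 4. The backup is removed in both cases
    return outcome
```

In this sketch the backup is written before the delete, so a failure between steps 2 and 3 leaves the Lease untouched and only an extra ConfigMap behind.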

Cleanup: The revert intentionally does NOT restore a stale Lease if the controller has already re-elected. It only restores the Lease from backup if no new Lease exists, preventing conflicts with a legitimately re-elected leader. The backup ConfigMap is always cleaned up.

Crash safety: The backup ConfigMap persists after a crash. Use operator-chaos clean to find orphaned backups by the managed-by label. In most cases, the controller will have already re-created the Lease on its own.
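
As a rough illustration of how orphaned backups could be located, the sketch below filters a list of ConfigMap objects by label and name prefix. It does not reproduce what operator-chaos clean does internally. The label key and value in `MANAGED_BY_LABEL` are assumptions, since this page only says "the managed-by label", and `find_orphaned_backups` is a hypothetical helper.

```python
BACKUP_PREFIX = "chaos-backup-lease-"
# Assumption: the exact label key and value are not stated on this page.
MANAGED_BY_LABEL = ("app.kubernetes.io/managed-by", "operator-chaos")


def find_orphaned_backups(configmaps):
    """Return (configmap_name, lease_name) pairs for leftover Lease backups.

    `configmaps` is a list of ConfigMap objects as plain dicts, for example
    the "items" list from a JSON listing of ConfigMaps.
    """
    key, value = MANAGED_BY_LABEL
    orphans = []
    for cm in configmaps:
        meta = cm.get("metadata", {})
        labels = meta.get("labels") or {}
        name = meta.get("name", "")
        if labels.get(key) == value and name.startswith(BACKUP_PREFIX):
            orphans.append((name, name[len(BACKUP_PREFIX):]))
    return orphans
```

Matching on both the label and the name prefix avoids touching ConfigMaps that share the label but belong to other kinds of experiment.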

Disruption Rubric

Expected behavior on a healthy operator: The controller detects the missing Lease within one lease renewal interval (typically 2-10 seconds). All candidates attempt to acquire a new Lease. One wins the election and resumes reconciliation. The brief gap in reconciliation should not cause data loss or corruption.

Contract violation indicators:
- No new Lease is created within 60 seconds (indicates broken leader election configuration)
- Multiple controllers believe they are leader simultaneously (split-brain, indicates missing fencing)
- Reconciliation does not resume after re-election (indicates the new leader failed to initialize)
- Controller enters CrashLoopBackOff after losing the Lease (indicates fatal error handling on Lease loss)
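
The indicators above can be expressed as a simple checklist over observed results. This is a minimal sketch, assuming the observations have already been collected by some other means. The `Observation` type and its field names are hypothetical and not part of the tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """What a hypothetical observer recorded after the Lease was deleted."""
    seconds_to_new_lease: Optional[float]  # None if no new Lease ever appeared
    max_concurrent_leaders: int            # most pods claiming leadership at once
    reconciliation_resumed: bool
    crash_loop_backoff: bool


def contract_violations(obs: Observation, lease_deadline_s: float = 60.0) -> List[str]:
    """Map observations to the contract violation indicators listed above."""
    violations = []
    if obs.seconds_to_new_lease is None or obs.seconds_to_new_lease > lease_deadline_s:
        violations.append("no new Lease within deadline: broken leader election configuration")
    if obs.max_concurrent_leaders > 1:
        violations.append("split-brain: missing fencing")
    if not obs.reconciliation_resumed:
        violations.append("reconciliation did not resume: new leader failed to initialize")
    if obs.crash_loop_backoff:
        violations.append("CrashLoopBackOff: fatal error handling on Lease loss")
    return violations
```

An empty list means none of the listed indicators fired. It does not prove the operator is correct, only that this experiment found no violation.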

Collateral damage risks:
- Minimal for the Lease deletion itself. The disruption window is typically under 10 seconds.
- If the operator has in-flight writes when the Lease disappears, those writes may be duplicated by the new leader (at-least-once semantics).
- On single-replica Deployments, the same pod re-acquires the Lease, so the disruption is just the re-election delay.

Recovery expectations:
- Recovery time: 2-15 seconds for most controllers using client-go leader election with default settings (LeaseDuration=15s, RenewDeadline=10s, RetryPeriod=2s)
- Reconcile cycles: 1 (new leader starts reconciliation from current state)
- What "recovered" means: a new Lease exists with a valid holder, and reconciliation is actively processing
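
One way to make the definition of "recovered" concrete is sketched below. It assumes the Lease is available as a plain dict using the `coordination.k8s.io/v1` field names (`holderIdentity`, `acquireTime`, `renewTime`, `leaseDurationSeconds`), with timestamps serialized in UTC with a trailing `Z`. Treating "a reconcile finished at or after the Lease was acquired" as evidence of active processing is this example's own assumption, and `is_recovered` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def parse_micro_time(value: str) -> datetime:
    """Parse a UTC timestamp such as "2024-01-01T00:00:05.000000Z"."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_recovered(lease, last_reconcile, now) -> bool:
    """True if the Lease has a live holder and reconciliation ran since acquisition."""
    if lease is None:
        return False
    spec = lease.get("spec", {})
    holder = spec.get("holderIdentity")
    renew = spec.get("renewTime")
    duration = spec.get("leaseDurationSeconds")
    if not holder or renew is None or duration is None:
        return False
    # A holder is only valid while the Lease has not expired.
    if now >= parse_micro_time(renew) + timedelta(seconds=duration):
        return False
    acquired_at = parse_micro_time(spec.get("acquireTime") or renew)
    return last_reconcile is not None and last_reconcile >= acquired_at
```

A caller would poll this until it returns `True` and compare the elapsed time against the expected recovery window.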

Cross-Component Results

| Component | Experiment | Danger | Description |
|-----------|------------|--------|-------------|
| odh-model-controller | odh-model-controller-leader-election-disrupt | high | Deleting the leader Lease forces re-election. The controller re-acquires the Lease and resumes reconciliation within seconds. |
| kserve | kserve-leader-election-disrupt | high | Deleting the kserve-controller-manager Lease forces re-election. The controller re-acquires the Lease and resumes reconciliation within seconds. |
| cert-manager | cert-manager-leader-election-disrupt | high | Deleting the cert-manager Lease forces re-election. The controller re-acquires the Lease and resumes reconciliation within seconds. |
| knative-serving | knative-serving-controller-leader-election-disrupt | high | Deleting the knative-serving controller Lease forces re-election. The controller re-acquires the Lease and resumes reconciliation within seconds. |