LeaderElectionDisrupt

Danger Level: High

Deletes a Lease object to force leader re-election. On revert, it restores the Lease from backup only if no new Lease exists, then removes the backup.

Spec Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `name` | `string` | Yes | - | Name of the Lease to delete |
| `ttl` | `duration` | No | `300s` | Auto-cleanup duration |

How It Works

LeaderElectionDisrupt gets the target Lease, serializes it to JSON, stores the backup in a ConfigMap (named `chaos-backup-lease-<name>`), and deletes the Lease. This forces all candidates in the leader election to compete for a new Lease, triggering re-election.

API calls:

1. Get the target Lease
2. Create (or Update) a backup ConfigMap with the serialized Lease JSON
3. Delete the Lease to trigger re-election
4. On revert: check if the Lease was recreated by the controller. If not, restore from backup. Then Delete the backup ConfigMap.
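The injection half of these calls (steps 1 to 3) can be sketched against an in-memory stand-in for the cluster. This is an illustration only, not the tool's implementation: `FakeCluster`, `inject`, and the `lease.json` data key are hypothetical names, and only the `chaos-backup-lease-<name>` naming comes from the text above.

```python
import json


class FakeCluster:
    """In-memory stand-in for the Kubernetes API (hypothetical, for illustration)."""

    def __init__(self):
        self.leases = {}      # Lease name -> Lease object (plain dict)
        self.configmaps = {}  # ConfigMap name -> data (plain dict)


def backup_name(lease_name):
    # Naming scheme taken from the description above.
    return f"chaos-backup-lease-{lease_name}"


def inject(cluster, lease_name):
    # 1. Get the target Lease (raises KeyError if it does not exist).
    lease = cluster.leases[lease_name]
    # 2. Create or overwrite the backup ConfigMap with the serialized Lease.
    #    The "lease.json" key is an assumption made for this sketch.
    cluster.configmaps[backup_name(lease_name)] = {"lease.json": json.dumps(lease)}
    # 3. Delete the Lease so that every candidate has to compete again.
    del cluster.leases[lease_name]
```

Writing the backup before deleting means a failure between the two steps leaves the Lease intact and, at worst, an unused ConfigMap behind.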

Cleanup: The revert intentionally does NOT restore a stale Lease if the controller has already re-elected. It only restores the Lease from backup if no new Lease exists, preventing conflicts with a legitimately re-elected leader. The backup ConfigMap is always cleaned up.
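The revert decision described here can be written as a small function over plain dictionaries. It is a minimal sketch under assumptions: `revert`, the returned outcome strings, and the `lease.json` data key are hypothetical and are not taken from the tool's source.

```python
import json


def revert(leases, configmaps, lease_name):
    """Return 'restored', 'kept-new', or 'no-backup' (hypothetical outcome names)."""
    backup_key = f"chaos-backup-lease-{lease_name}"
    backup = configmaps.get(backup_key)
    if backup is None:
        return "no-backup"

    if lease_name in leases:
        # A Lease already exists, so a leader has been re-elected.
        # Leave it alone instead of overwriting it with stale data.
        outcome = "kept-new"
    else:
        # No Lease exists, so restore the saved copy. Against a real API
        # server, server-set metadata such as resourceVersion would have
        # to be dropped before the object can be created again.
        leases[lease_name] = json.loads(backup["lease.json"])
        outcome = "restored"

    # The backup is removed in both branches.
    del configmaps[backup_key]
    return outcome
```

Both branches end by deleting the backup, which matches the statement that the ConfigMap is always cleaned up.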

Crash safety: The backup ConfigMap persists after a crash. Run `operator-chaos clean` to find orphaned backups by the managed-by label. In most cases, the controller will already have recreated the Lease on its own.
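To illustrate what a label-based search for orphaned backups might look like, the sketch below filters a list of ConfigMap dictionaries. The label key `app.kubernetes.io/managed-by` and the value `operator-chaos` are assumptions, since the text names only "the managed-by label", and `find_orphaned_backups` is a hypothetical helper rather than part of the CLI.

```python
# Assumed label key and value; the text does not spell them out.
MANAGED_BY_LABEL = "app.kubernetes.io/managed-by"
MANAGED_BY_VALUE = "operator-chaos"
BACKUP_PREFIX = "chaos-backup-lease-"


def find_orphaned_backups(configmaps):
    """Return the sorted names of ConfigMaps that look like leftover Lease backups."""
    names = []
    for cm in configmaps:
        meta = cm.get("metadata", {})
        labels = meta.get("labels") or {}
        name = meta.get("name", "")
        # Require both the label and the backup name prefix.
        if labels.get(MANAGED_BY_LABEL) == MANAGED_BY_VALUE and name.startswith(BACKUP_PREFIX):
            names.append(name)
    return sorted(names)
```

Matching on both the label and the name prefix keeps the search from picking up unrelated ConfigMaps that happen to share one of the two.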

Disruption Rubric

Expected behavior on a healthy operator: The controller detects the missing Lease within one lease renewal interval (typically 2-10 seconds). All candidates attempt to acquire a new Lease. One wins the election and resumes reconciliation. The brief gap in reconciliation should not cause data loss or corruption.

Contract violation indicators:

- No new Lease is created within 60 seconds (indicates broken leader election configuration)
- Multiple controllers believe they are leader simultaneously (split-brain, indicates missing fencing)
- Reconciliation does not resume after re-election (indicates the new leader failed to initialize)
- Controller enters CrashLoopBackOff after losing the Lease (indicates fatal error handling on Lease loss)
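The four indicators above can be turned into a simple checklist over measurements collected during an experiment. This is a sketch under assumptions: the `Observation` fields and the `contract_violations` function are hypothetical, and how each value is measured is left open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """Hypothetical measurements taken after the Lease was deleted."""
    seconds_to_new_lease: Optional[float]  # None if no new Lease ever appeared
    self_reported_leaders: int             # controllers claiming leadership at once
    reconciles_after_election: int         # reconcile cycles seen after re-election
    crash_looping: bool                    # True if a pod is in CrashLoopBackOff


def contract_violations(obs: Observation, lease_deadline_s: float = 60.0) -> List[str]:
    """Return one entry per indicator that fired; an empty list means none did."""
    violations = []
    if obs.seconds_to_new_lease is None or obs.seconds_to_new_lease > lease_deadline_s:
        violations.append("no new Lease within deadline")
    if obs.self_reported_leaders > 1:
        violations.append("split-brain")
    if obs.reconciles_after_election == 0:
        violations.append("reconciliation did not resume")
    if obs.crash_looping:
        violations.append("CrashLoopBackOff after Lease loss")
    return violations
```

Returning a list rather than a single boolean keeps each indicator visible when more than one fires in the same run.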

Collateral damage risks:

- Minimal for the Lease deletion itself. The disruption window is typically under 10 seconds.
- If the operator has in-flight writes when the Lease disappears, those writes may be duplicated by the new leader (at-least-once semantics).
- On single-replica Deployments, the same pod re-acquires the Lease, so the disruption is just the re-election delay.

Recovery expectations:

- Recovery time: 2-15 seconds for most controllers using client-go leader election with default settings (LeaseDuration=15s, RenewDeadline=10s, RetryPeriod=2s)
- Reconcile cycles: 1 (new leader starts reconciliation from current state)
- What "recovered" means: a new Lease exists with a valid holder, and reconciliation is actively processing
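One possible reading of "recovered" can be expressed as a predicate over the Lease and a reconcile counter. The sketch assumes that "valid holder" means a non-empty `holderIdentity` whose `renewTime` is still within `leaseDurationSeconds`; that interpretation, the `is_recovered` name, and the already-parsed `renewTime` value are assumptions, while the field names follow the `coordination.k8s.io/v1` Lease spec.

```python
from datetime import datetime, timedelta, timezone


def is_recovered(lease, reconciles_since_disruption, now):
    """True if a fresh Lease with a holder exists and reconciliation has run."""
    if lease is None:
        return False
    spec = lease.get("spec", {})
    holder = spec.get("holderIdentity")
    renew_time = spec.get("renewTime")           # assumed to be a parsed datetime
    duration_s = spec.get("leaseDurationSeconds")
    if not holder or renew_time is None or duration_s is None:
        return False
    # The Lease counts as held only while the last renewal is within its duration.
    fresh = now - renew_time <= timedelta(seconds=duration_s)
    return fresh and reconciles_since_disruption >= 1
```

A usage example, with `now` fixed so the outcome does not depend on the wall clock:

```python
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
lease = {"spec": {"holderIdentity": "pod-b",
                  "renewTime": now - timedelta(seconds=3),
                  "leaseDurationSeconds": 15}}
print(is_recovered(lease, 1, now))
```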

Cross-Component Results

| Component | Experiment | Danger | Description |
|---|---|---|---|
| odh-model-controller | odh-model-controller-leader-election-disrupt | high | Deleting the leader Lease forces re-election. The controller re-acquires the Lease and resumes reconciliation within seconds. |
| kserve | kserve-leader-election-disrupt | high | Deleting the kserve-controller-manager Lease forces re-election. The controller re-acquires the Lease and resumes reconciliation within seconds. |
| cert-manager | cert-manager-leader-election-disrupt | high | Deleting the cert-manager Lease forces re-election. The controller re-acquires the Lease and resumes reconciliation within seconds. |
| knative-serving | knative-serving-controller-leader-election-disrupt | high | Deleting the knative-serving controller Lease forces re-election. The controller re-acquires the Lease and resumes reconciliation within seconds. |