Kueue Validation Results

Test Platform

Platform: ROSA HyperShift 4.20.11
Kueue Version: Red Hat Build of Kueue Operator (OLM-managed)
Test Date: 2026-05-06

Results

kueue-operator (Red Hat Build of Kueue)

All 11 kueue-operator experiments were validated on 2026-05-06. Every experiment received a Resilient verdict and recovered in a single reconcile cycle.

Experiment	Injection	Verdict	Recovery Time	Reconcile Cycles
crd-mutation	CRDMutation	Resilient	1.3s	1
label-stomping	LabelStomping	Resilient	1.3s	1
leader-lease-corrupt	CRDMutation	Resilient	0.9s	1
network-partition	NetworkPartition	Resilient	1.3s	1
olm-csv-owned-crd-corrupt	CRDMutation	Resilient	1.3s	1
olm-subscription-approval-flip	CRDMutation	Resilient	1.2s	1
olm-subscription-channel-corrupt	CRDMutation	Resilient	1.2s	1
ownerref-orphan	OwnerRefOrphan	Resilient	1.3s	1
pod-kill	PodKill	Resilient	1.8s	1
quota-exhaustion	QuotaExhaustion	Resilient	1.3s	1
rbac-revoke	RBACRevoke	Resilient	1.4s	1
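
The verdicts in the table can be checked mechanically rather than by eye. The following is a minimal Python sketch, assuming the rows are transcribed by hand from the table above; the `RESULTS` list and the variable names are illustrative and not part of any chaos-testing tool.

```python
from collections import Counter

# (experiment, injection type, verdict, reconcile cycles),
# transcribed from the results table above.
RESULTS = [
    ("crd-mutation", "CRDMutation", "Resilient", 1),
    ("label-stomping", "LabelStomping", "Resilient", 1),
    ("leader-lease-corrupt", "CRDMutation", "Resilient", 1),
    ("network-partition", "NetworkPartition", "Resilient", 1),
    ("olm-csv-owned-crd-corrupt", "CRDMutation", "Resilient", 1),
    ("olm-subscription-approval-flip", "CRDMutation", "Resilient", 1),
    ("olm-subscription-channel-corrupt", "CRDMutation", "Resilient", 1),
    ("ownerref-orphan", "OwnerRefOrphan", "Resilient", 1),
    ("pod-kill", "PodKill", "Resilient", 1),
    ("quota-exhaustion", "QuotaExhaustion", "Resilient", 1),
    ("rbac-revoke", "RBACRevoke", "Resilient", 1),
]

# Experiments whose verdict is anything other than "Resilient".
not_resilient = [name for name, _, verdict, _ in RESULTS if verdict != "Resilient"]

# Experiments that needed more than one reconcile cycle to recover.
multi_cycle = [name for name, _, _, cycles in RESULTS if cycles > 1]

# How many experiments exercise each injection type.
injection_counts = Counter(injection for _, injection, _, _ in RESULTS)

print(f"experiments: {len(RESULTS)}")
print(f"not resilient: {not_resilient}")
print(f"more than one reconcile cycle: {multi_cycle}")
for injection, count in sorted(injection_counts.items()):
    print(f"{injection}: {count}")
```

The per-injection counts show that five of the eleven experiments use the CRDMutation injection, so the other six injection types are each covered by a single experiment.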

kueue-operand

Not yet tested on this cluster. The kueue-operand experiments test workload admission resources (ClusterQueues, LocalQueues, ResourceFlavors, Workloads) and require a full Kueue deployment with test workloads.
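
To make the prerequisite concrete, here is a rough sketch of the kind of resources an operand test would need in place. It is written in Python and emits a JSON `List` that `kubectl apply -f -` could consume. Everything specific in it is an assumption for illustration: the `kueue.x-k8s.io/v1beta1` API version (check which version the installed operator serves), the resource names, the `kueue-chaos-test` namespace, the quotas, and the container image. It does not describe the actual experiment fixtures.

```python
import json

# Hypothetical names chosen for illustration only.
NAMESPACE = "kueue-chaos-test"
FLAVOR = "default-flavor"
CLUSTER_QUEUE = "test-cluster-queue"
LOCAL_QUEUE = "test-local-queue"
KUEUE_API = "kueue.x-k8s.io/v1beta1"  # assumption: verify against the cluster

resource_flavor = {
    "apiVersion": KUEUE_API,
    "kind": "ResourceFlavor",
    "metadata": {"name": FLAVOR},
}

cluster_queue = {
    "apiVersion": KUEUE_API,
    "kind": "ClusterQueue",
    "metadata": {"name": CLUSTER_QUEUE},
    "spec": {
        "namespaceSelector": {},  # empty selector matches all namespaces
        "resourceGroups": [{
            "coveredResources": ["cpu", "memory"],
            "flavors": [{
                "name": FLAVOR,
                "resources": [
                    {"name": "cpu", "nominalQuota": "4"},
                    {"name": "memory", "nominalQuota": "8Gi"},
                ],
            }],
        }],
    },
}

local_queue = {
    "apiVersion": KUEUE_API,
    "kind": "LocalQueue",
    "metadata": {"name": LOCAL_QUEUE, "namespace": NAMESPACE},
    "spec": {"clusterQueue": CLUSTER_QUEUE},
}

# A suspended Job labelled with the queue name; Kueue creates the
# corresponding Workload object and unsuspends the Job on admission.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        "name": "sample-job",
        "namespace": NAMESPACE,
        "labels": {"kueue.x-k8s.io/queue-name": LOCAL_QUEUE},
    },
    "spec": {
        "suspend": True,
        "template": {"spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "sleep",
                "image": "registry.access.redhat.com/ubi9/ubi-minimal",  # assumption
                "command": ["sleep", "30"],
                "resources": {"requests": {"cpu": "1", "memory": "200Mi"}},
            }],
        }},
    },
}

manifests = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [resource_flavor, cluster_queue, local_queue, job],
}
print(json.dumps(manifests, indent=2))
```

The Job is included because Workloads are normally derived from a queued job rather than written by hand, so a test workload is what gives the operand experiments something to admit.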

Legacy DSC-managed Kueue

Not yet tested on this cluster. The legacy experiments target ODH/RHOAI 2.x deployments where Kueue was managed as a DSC component in the opendatahub namespace.

Key Findings

Perfect Operator Resilience

The Red Hat Build of Kueue Operator received a Resilient verdict in all 11 operator-layer experiments. Every recovery took under 2 seconds, and 9 of the 11 took 1.3s or less. Every experiment completed in a single reconcile cycle, which indicates that the operator detects and repairs each injected fault on its first pass.

Sub-2s Recovery Window

Every operator-layer fault tested was recovered from in under 2 seconds:

Fastest: leader-lease-corrupt (0.9s)
Slowest: pod-kill (1.8s)
Median: 1.3s

This tight recovery window is critical for production environments where workload admission delays directly impact user experience.
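
The summary figures above follow directly from the recovery times in the results table. A short Python sketch that reproduces them, assuming the times are transcribed by hand (the `RECOVERY_SECONDS` mapping is an illustrative name):

```python
from statistics import median

# Recovery time in seconds per experiment, from the results table.
RECOVERY_SECONDS = {
    "crd-mutation": 1.3,
    "label-stomping": 1.3,
    "leader-lease-corrupt": 0.9,
    "network-partition": 1.3,
    "olm-csv-owned-crd-corrupt": 1.3,
    "olm-subscription-approval-flip": 1.2,
    "olm-subscription-channel-corrupt": 1.2,
    "ownerref-orphan": 1.3,
    "pod-kill": 1.8,
    "quota-exhaustion": 1.3,
    "rbac-revoke": 1.4,
}

# (name, seconds) pairs for the extremes.
fastest = min(RECOVERY_SECONDS.items(), key=lambda item: item[1])
slowest = max(RECOVERY_SECONDS.items(), key=lambda item: item[1])

# With 11 values the median is the 6th value in sorted order.
median_seconds = median(RECOVERY_SECONDS.values())

# Number of experiments at or below the median recovery time.
at_or_below_median = sum(1 for s in RECOVERY_SECONDS.values() if s <= median_seconds)

print(f"Fastest: {fastest[0]} ({fastest[1]}s)")
print(f"Slowest: {slowest[0]} ({slowest[1]}s)")
print(f"Median: {median_seconds}s")
print(f"At or below median: {at_or_below_median} of {len(RECOVERY_SECONDS)}")
```

Only rbac-revoke (1.4s) and pod-kill (1.8s) exceed the median, so the distribution is tightly clustered and pod-kill is the single clear outlier.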

OLM Integration Resilience

The operator handled OLM-specific faults gracefully:

CSV corruption: Operator deployment remained available despite invalid CSV metadata
Subscription corruption: Channel and approval changes did not affect the running operator
CRD mutation: API server confusion from CRD name corruption was transient

These results indicate that the OLM-managed deployment model tolerates the operator lifecycle faults tested here, which supports its use in production.

Deployment-Scale Recovery

The operator deployment runs with 2 replicas and uses leader election. The pod kill, network partition, and quota exhaustion experiments validated that:

Leader election completes within 1.8s after pod failure
Network isolation does not cause split-brain scenarios
Quota exhaustion is detected and recovered from gracefully

Outstanding Validation Work

The kueue-operand and legacy experiments remain untested. The operand experiments are particularly important as they validate workload admission resilience (the core Kueue functionality), not just operator-layer recovery.