kueue Validation Results

Test Platform

  • Platform: ROSA HyperShift 4.20.11
  • Kueue Version: Red Hat Build of Kueue Operator (OLM-managed)
  • Test Date: 2026-05-06

Results

kueue-operator (Red Hat Build of Kueue)

All 11 kueue-operator experiments were run on 2026-05-06. Every experiment received a Resilient verdict and recovered in a single reconcile cycle, demonstrating robust operator-layer recovery.

Experiment                       | Injection        | Verdict   | Recovery Time | Reconcile Cycles
---------------------------------|------------------|-----------|---------------|-----------------
crd-mutation                     | CRDMutation      | Resilient | 1.3s          | 1
label-stomping                   | LabelStomping    | Resilient | 1.3s          | 1
leader-lease-corrupt             | CRDMutation      | Resilient | 0.9s          | 1
network-partition                | NetworkPartition | Resilient | 1.3s          | 1
olm-csv-owned-crd-corrupt        | CRDMutation      | Resilient | 1.3s          | 1
olm-subscription-approval-flip   | CRDMutation      | Resilient | 1.2s          | 1
olm-subscription-channel-corrupt | CRDMutation      | Resilient | 1.2s          | 1
ownerref-orphan                  | OwnerRefOrphan   | Resilient | 1.3s          | 1
pod-kill                         | PodKill          | Resilient | 1.8s          | 1
quota-exhaustion                 | QuotaExhaustion  | Resilient | 1.3s          | 1
rbac-revoke                      | RBACRevoke       | Resilient | 1.4s          | 1

kueue-operand

Not yet tested on this cluster. The kueue-operand experiments test workload admission resources (ClusterQueues, LocalQueues, ResourceFlavors, Workloads) and require a full Kueue deployment with test workloads.

Legacy DSC-managed Kueue

Not yet tested on this cluster. The legacy experiments target ODH/RHOAI 2.x deployments where Kueue was managed as a DataScienceCluster (DSC) component in the opendatahub namespace.

Key Findings

Perfect Operator Resilience

The Red Hat Build of Kueue Operator received a Resilient verdict in all 11 operator-layer experiments. Every recovery took under 2 seconds, and 9 of the 11 experiments recovered in 1.3s or less. Every experiment completed in a single reconcile cycle, indicating efficient detection and recovery logic.

Sub-2s Recovery Window

Across all 11 experiments, the operator layer recovered from the injected fault in under 2 seconds:

  • Fastest: leader-lease-corrupt (0.9s)
  • Slowest: pod-kill (1.8s)
  • Median: 1.3s

This tight recovery window is critical for production environments where workload admission delays directly impact user experience.
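
The summary figures above can be reproduced from the recovery times in the results table. The following is a minimal sketch in Python; the dictionary and the `summarize` helper are illustrative names chosen for this example and are not part of any test tooling.

```python
from statistics import median

# Recovery times in seconds, copied from the kueue-operator results table.
RECOVERY_SECONDS = {
    "crd-mutation": 1.3,
    "label-stomping": 1.3,
    "leader-lease-corrupt": 0.9,
    "network-partition": 1.3,
    "olm-csv-owned-crd-corrupt": 1.3,
    "olm-subscription-approval-flip": 1.2,
    "olm-subscription-channel-corrupt": 1.2,
    "ownerref-orphan": 1.3,
    "pod-kill": 1.8,
    "quota-exhaustion": 1.3,
    "rbac-revoke": 1.4,
}


def summarize(times):
    """Return fastest, slowest, and median recovery for a name -> seconds map.

    Hypothetical helper written for this example only.
    """
    fastest = min(times, key=times.get)
    slowest = max(times, key=times.get)
    return {
        "count": len(times),
        "fastest": (fastest, times[fastest]),
        "slowest": (slowest, times[slowest]),
        "median": median(times.values()),
        # Number of experiments that recovered in 1.3s or less.
        "at_or_below_1_3s": sum(1 for t in times.values() if t <= 1.3),
        # True when every experiment recovered in under 2 seconds.
        "all_under_2s": all(t < 2.0 for t in times.values()),
    }


summary = summarize(RECOVERY_SECONDS)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Keeping the raw per-experiment times in one place makes it straightforward to recompute these figures when the operand and legacy experiments are added.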

OLM Integration Resilience

The operator handled OLM-specific faults gracefully:

  • CSV corruption: Operator deployment remained available despite invalid CSV metadata
  • Subscription corruption: Channel and approval changes did not affect the running operator
  • CRD mutation: API server confusion from CRD name corruption was transient

These results validate that the OLM-managed deployment model is production-ready and resilient to operator lifecycle faults.

Deployment-Scale Recovery

The operator deployment runs with 2 replicas and uses leader election. The pod-kill, network-partition, and quota-exhaustion experiments validated that:

  • Leader election completes within 1.8s after pod failure
  • Network isolation does not cause split-brain scenarios
  • Quota exhaustion is detected and recovered from gracefully

Outstanding Validation Work

The kueue-operand and legacy experiments remain untested. The operand experiments are particularly important because they validate workload admission resilience (the core Kueue functionality), not just operator-layer recovery.