
Deploying Red Hat AI Inference Server: Distributed Inference with llm-d (EA2)

Product: Red Hat AI Inference Server (RHAIIS)
Version: 3.4 EA2
Platforms: Azure Kubernetes Service (AKS), CoreWeave Kubernetes Service (CKS)


Executive Summary

This guide provides step-by-step instructions for deploying Distributed Inference with llm-d for Red Hat AI Inference Server using the RHAII Helm chart (rhai-on-xks-chart). The Helm chart deploys the RHAI operator and a cloud-specific manager, which together automatically provision all required infrastructure including cert-manager, Istio, and LeaderWorkerSet.

Key capabilities:

  • Single-command installation using Helm from OCI registry
  • Automatic infrastructure provisioning via the cloud manager
  • Intelligent request routing using the Endpoint Picker Processor (EPP)
  • Disaggregated serving with prefill-decode separation
  • Cache-aware routing for prefix KV cache optimization
  • Mutual TLS (mTLS) for secure pod-to-pod communication
  • Gateway API integration for standard Kubernetes ingress

Table of Contents

  1. Prerequisites
  2. Architecture Overview
  3. Installing the RHAII Operator
  4. Configuring the Inference Gateway
  5. Deploying an LLM Inference Service
  6. Verifying the Deployment
  7. Sample Manifests
  8. Troubleshooting
  9. Uninstall
  10. Appendix: Component Reference

1. Prerequisites

1.1 Kubernetes Cluster Requirements

Requirement            Specification
Kubernetes version     1.28 or later
Supported platforms    AKS, CKS (CoreWeave)
GPU nodes              NVIDIA A10, A100, or H100 (for GPU workloads)
NVIDIA device plugin   Installed and configured

1.2 Client Tools

Tool      Minimum Version   Purpose
kubectl   1.28+             Kubernetes CLI
helm      3.17+             Helm package manager
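
Verify the installed client tools meet these minimums:

kubectl version --client
helm version --short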

1.3 Registry Authentication

RHAIIS images are hosted on registry.redhat.io and quay.io/rhoai. Both registries require authentication.

Procedure:

  1. Create a pull secret file with credentials for all required registries:
# Red Hat registry (for vLLM and dependency operator images)
podman login registry.redhat.io --authfile ~/pull-secret.json

# quay.io (for RHOAI operator and KServe component images)
podman login quay.io --authfile ~/pull-secret.json
  2. Verify the pull secret covers all required registries:
cat ~/pull-secret.json | jq -r '.auths | keys[]'
# Should include: quay.io, registry.redhat.io
  3. Log in to the Helm OCI registry (required to pull the chart):
helm registry login quay.io

Note: Registry Service Accounts (from https://access.redhat.com/terms-based-registry/) do not expire and are recommended for production deployments.

1.4 GPU Node Pool Configuration

For GPU-accelerated inference, ensure your cluster has GPU nodes with the NVIDIA device plugin installed.

Azure Kubernetes Service (AKS):

For AKS cluster provisioning with GPU nodes, see the AKS Infrastructure Guide.

CoreWeave Kubernetes Service (CKS):

CoreWeave clusters include the NVIDIA device plugin by default. Select the appropriate GPU type when provisioning your cluster.

Verification:

kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe nodes | grep -A5 "nvidia.com/gpu"

2. Architecture Overview

The RHAII Helm chart deploys the RHAI operator and a cloud-specific manager. The cloud manager automatically provisions infrastructure dependencies.

2.1 Deployed Components

Component                    Namespace                  Description
RHAI Operator                redhat-ods-operator        Manages KServe controller and inference components
Cloud Manager                rhai-cloudmanager-system   Provisions infrastructure dependencies
KServe LLMISvc Controller    redhat-ods-applications    Manages LLMInferenceService lifecycle
cert-manager Operator        cert-manager-operator      Deploys and manages cert-manager
cert-manager                 cert-manager               TLS certificate management
Istio (Sail Operator)        istio-system               Gateway API implementation and mTLS
LWS Operator                 openshift-lws-operator     Multi-node inference support

2.2 Component Interaction

                  +----------------------+
  Client -------- |   Inference Gateway  |
                  |   (Istio / Envoy)    |
                  +----------+-----------+
                             |
                  +----------v-----------+
                  |   EPP Scheduler      |
                  |   (picks optimal     |
                  |    replica)          |
                  +----------+-----------+
                             |
                  +----------v-----------+
                  |   vLLM Pod (GPU)     |
                  |   (serves model)     |
                  +----------------------+

2.3 Bootstrap Sequence

The cloud manager orchestrates the following bootstrap sequence automatically:

Component                   Action
Cloud Manager               Starts provisioning dependencies
RHAI Operator               Waits for webhook certificate
cert-manager                Operator and controller start
Webhook certificate         Issued by cert-manager
RHAI Operator               Starts (certificate volume mounted)
Istio, LWS                  Operators start
KServe LLMISvc Controller   Deployed by RHAI Operator
All components              Running

Note: The RHAI operator pods display FailedMount warnings during the first 60-90 seconds. This is expected behavior while cert-manager starts and issues the webhook certificate.
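
To observe the bootstrap as it progresses, you can poll the dependency namespaces listed above (a convenience loop, not part of the installation):

for ns in cert-manager istio-system openshift-lws-operator \
          redhat-ods-operator redhat-ods-applications; do
  echo "== $ns ==" && kubectl get pods -n "$ns"
done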


3. Installing the RHAII Operator

For detailed Helm chart configuration options, see the RHAII Helm Chart README.

3.1 Create Values File

Create a rhoai-values.yaml with RHOAI EA2 image overrides:

azure:
  enabled: true
  cloudManager:
    image: quay.io/rhoai/odh-rhel9-operator:rhoai-3.4-ea.2

coreweave:
  enabled: false
  cloudManager:
    image: quay.io/rhoai/odh-rhel9-operator:rhoai-3.4-ea.2

rhaiOperator:
  image: quay.io/rhoai/odh-rhel9-operator:rhoai-3.4-ea.2
  relatedImages:
  - name: RELATED_IMAGE_ODH_KSERVE_AGENT_IMAGE
    value: quay.io/rhoai/odh-kserve-agent-rhel9:rhoai-3.4-ea.2
  - name: RELATED_IMAGE_ODH_KSERVE_CONTROLLER_IMAGE
    value: quay.io/rhoai/odh-kserve-controller-rhel9:rhoai-3.4-ea.2
  - name: RELATED_IMAGE_ODH_KSERVE_LLMISVC_CONTROLLER_IMAGE
    value: quay.io/rhoai/odh-kserve-llmisvc-controller-rhel9:rhoai-3.4-ea.2
  - name: RELATED_IMAGE_ODH_KSERVE_ROUTER_IMAGE
    value: quay.io/rhoai/odh-kserve-router-rhel9:rhoai-3.4-ea.2
  - name: RELATED_IMAGE_ODH_KSERVE_STORAGE_INITIALIZER_IMAGE
    value: quay.io/rhoai/odh-kserve-storage-initializer-rhel9:rhoai-3.4-ea.2
  - name: RELATED_IMAGE_RHAIIS_VLLM_CUDA_IMAGE
    value: registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.2
  - name: RELATED_IMAGE_RHAIIS_VLLM_ROCM_IMAGE
    value: registry.redhat.io/rhaii-early-access/vllm-rocm-rhel9:3.4.0-ea.2
  - name: RELATED_IMAGE_RHAIIS_VLLM_SPYRE_IMAGE
    value: registry.redhat.io/rhaii-early-access/vllm-spyre-rhel9:3.4.0-ea.2
  - name: RELATED_IMAGE_ODH_LLM_D_INFERENCE_SCHEDULER_IMAGE
    value: quay.io/rhoai/odh-llm-d-inference-scheduler-rhel9:rhoai-3.4-ea.2
  - name: RELATED_IMAGE_ODH_LLM_D_ROUTING_SIDECAR_IMAGE
    value: quay.io/rhoai/odh-llm-d-routing-sidecar-rhel9:rhoai-3.4-ea.2
  - name: RELATED_IMAGE_OSE_KUBE_RBAC_PROXY_IMAGE
    value: quay.io/rhoai/odh-kube-auth-proxy-rhel9:rhoai-3.4-ea.2
  - name: RELATED_IMAGE_ODH_LLM_D_KV_CACHE_IMAGE
    value: quay.io/rhoai/odh-llm-d-kv-cache-rhel9:rhoai-3.4-ea.2

Note: vLLM images use registry.redhat.io/rhaii-early-access/ with tag format 3.4.0-ea.2. All other RHOAI images use quay.io/rhoai/ with tag format rhoai-3.4-ea.2. These images will eventually be replaced with registry.redhat.io digest-pinned references.

For CoreWeave, set azure.enabled: false and coreweave.enabled: true.
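
The corresponding change in rhoai-values.yaml is simply to flip the two flags (the cloudManager image entries stay the same):

azure:
  enabled: false

coreweave:
  enabled: true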

3.2 Install on Azure Kubernetes Service

helm upgrade rhaii oci://quay.io/rhoai/rhai-on-xks-chart \
  --install --create-namespace \
  --namespace rhaii \
  -f rhoai-values.yaml \
  --set-file imagePullSecret.dockerConfigJson=$HOME/pull-secret.json

3.3 Install on CoreWeave Kubernetes Service

helm upgrade rhaii oci://quay.io/rhoai/rhai-on-xks-chart \
  --install --create-namespace \
  --namespace rhaii \
  -f rhoai-values.yaml \
  --set coreweave.enabled=true --set azure.enabled=false \
  --set-file imagePullSecret.dockerConfigJson=$HOME/pull-secret.json

Important: Do NOT use --wait. The chart uses post-install hook Jobs that need CRDs to register first, and the RHAI operator depends on cert-manager to start. Using --wait may cause the installation to time out.

Important: Always include --set-file imagePullSecret.dockerConfigJson=... in the initial install command. Running without it first and adding it later can cause image pull failures in dependency namespaces.

3.4 Verify Operator Deployment

Wait approximately 2 minutes for the bootstrap sequence to complete, then verify all components:

# RHAI Operator (3 replicas)
kubectl get pods -n redhat-ods-operator

# Cloud Manager
kubectl get pods -n rhai-cloudmanager-system

# KServe LLMISvc Controller
kubectl get pods -n redhat-ods-applications

# cert-manager
kubectl get pods -n cert-manager

# Istio
kubectl get pods -n istio-system

# LWS Operator
kubectl get pods -n openshift-lws-operator

All pods should show Running status with all containers ready.
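
A quick way to spot anything unhealthy across all namespaces (any row other than the header needs investigation):

kubectl get pods -A | grep -vE 'Running|Completed'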

Verify the RHOAI image versions:

# Operator image
kubectl get deploy rhai-operator -n redhat-ods-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# vLLM image env vars
kubectl get deploy rhai-operator -n redhat-ods-operator \
  -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' \
  | grep VLLM

4. Configuring the Inference Gateway

The inference gateway provides external access to LLMInferenceService endpoints via the Gateway API.

4.1 Set Up CA Bundle

Extract the CA certificate from cert-manager and create a ConfigMap for mTLS trust between inference components:

# Extract CA cert from cert-manager secret
kubectl get secret opendatahub-ca -n cert-manager \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/ca.crt

# Create CA bundle ConfigMap
kubectl create configmap rhaii-ca-bundle \
  --from-file=ca.crt=/tmp/ca.crt \
  -n redhat-ods-applications \
  --dry-run=client -o yaml | kubectl apply -f -

4.2 Create Gateway ConfigMap

Configure the gateway pod to mount the CA bundle for mTLS trust. The service annotation is AKS-specific — it switches the Azure Load Balancer health probe to TCP so it can reach the Istio gateway on port 80. Omit it on CoreWeave:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-gateway-config
  namespace: redhat-ods-applications
data:
  deployment: |
    spec:
      template:
        spec:
          volumes:
          - name: rhaii-ca-bundle
            configMap:
              name: rhaii-ca-bundle
          containers:
          - name: istio-proxy
            volumeMounts:
            - name: rhaii-ca-bundle
              mountPath: /var/run/secrets/opendatahub
              readOnly: true
  service: |
    metadata:
      annotations:
        service.beta.kubernetes.io/port_80_health-probe_protocol: tcp
EOF

4.3 Create Gateway

kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway  # DO NOT CHANGE THIS VALUE
  namespace: redhat-ods-applications
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: All
  infrastructure:
    labels:
      serving.kserve.io/gateway: kserve-ingress-gateway
    parametersRef:
      group: ""
      kind: ConfigMap
      name: inference-gateway-config
EOF

Important: The gateway must be named inference-gateway. This name is configured in the LLMInferenceServiceConfig templates and used by the controller when router.gateway: {} is empty.

4.4 Verify Gateway Deployment

kubectl get gateway -n redhat-ods-applications

Expected output:

NAME                CLASS   ADDRESS         PROGRAMMED   AGE
inference-gateway   istio   20.xx.xx.xx     True         1m

Verify the gateway pod is running:

kubectl get pods -n redhat-ods-applications \
  -l gateway.networking.k8s.io/gateway-name=inference-gateway
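
To confirm the CA bundle from Section 4.1 is mounted into the generated gateway deployment (assuming Istio labels the Deployment with the same gateway-name label as its pods):

kubectl get deploy -n redhat-ods-applications \
  -l gateway.networking.k8s.io/gateway-name=inference-gateway \
  -o jsonpath='{.items[*].spec.template.spec.volumes[*].name}'
# The output should include rhaii-ca-bundle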

Troubleshooting: If the gateway shows Programmed: False, check istiod logs: kubectl logs deploy/istiod -n istio-system | grep gateway. A common cause is a missing ConfigMap referenced by parametersRef.


5. Deploying an LLM Inference Service

5.1 Create the Application Namespace

export NAMESPACE=llm-inference
kubectl create namespace $NAMESPACE

5.2 Copy Pull Secret to Application Namespace

The rhaii-pull-secret is only created in chart-managed namespaces. Copy it to your application namespace:

kubectl get secret rhaii-pull-secret -n redhat-ods-applications -o json | \
  jq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp,
      .metadata.annotations, .metadata.labels, .metadata.ownerReferences) |
      .metadata.namespace = "'$NAMESPACE'"' | \
  kubectl apply -f -
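
Verify the copied secret exists in the application namespace:

kubectl get secret rhaii-pull-secret -n $NAMESPACE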

5.3 Deploy the LLMInferenceService

EA2 requires container name stubs in the scheduler template when providing imagePullSecrets. Without these stubs, the controller replaces the entire template and produces an empty containers list.

kubectl apply -n $NAMESPACE -f - <<'EOF'
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: single-gpu
spec:
  model:
    uri: hf://Qwen/Qwen3-0.6B
    name: Qwen/Qwen3-0.6B
  replicas: 1
  router:
    scheduler:
      template:
        imagePullSecrets:
        - name: rhaii-pull-secret
        containers:
        - name: main
        - name: tokenizer
    route: {}
    gateway: {}
  template:
    imagePullSecrets:
    - name: rhaii-pull-secret
    containers:
      - name: main
        resources:
          limits:
            cpu: '4'
            memory: 32Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: '2'
            memory: 16Gi
            nvidia.com/gpu: "1"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
            scheme: HTTPS
          initialDelaySeconds: 120
          periodSeconds: 30
          timeoutSeconds: 30
          failureThreshold: 5
EOF

5.4 Monitor Deployment Progress

kubectl get llmisvc -n $NAMESPACE -w

The service is ready when the READY column shows True. Model download and loading typically takes 3-5 minutes depending on network speed and model size.
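
Alternatively, block until the service is ready (assuming the READY column is backed by a Ready condition on the resource):

kubectl wait llmisvc/single-gpu -n $NAMESPACE \
  --for=condition=Ready --timeout=10m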


6. Verifying the Deployment

6.1 Check Service Status

kubectl get llmisvc -n $NAMESPACE

Expected output:

NAME         URL                                                  READY   AGE
single-gpu   http://20.xx.xx.xx/llm-inference/single-gpu          True    5m

6.2 Check Pod Status

kubectl get pods -n $NAMESPACE

All pods should show Running status:

NAME                                                          READY   STATUS    AGE
single-gpu-kserve-xxxxxxxxx-xxxxx                             1/1     Running   5m
single-gpu-kserve-router-scheduler-xxxxxxxxx-xxxxx            2/2     Running   5m

6.3 Test Inference

SERVICE_URL=$(kubectl get llmisvc single-gpu -n $NAMESPACE \
  -o jsonpath='{.status.url}')

curl -X POST "${SERVICE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is Kubernetes?"}],
    "max_tokens": 100
  }'
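
You can also list the models the endpoint exposes. Assuming the route forwards all /v1 paths, the OpenAI-compatible models endpoint served by vLLM should return Qwen/Qwen3-0.6B:

curl "${SERVICE_URL}/v1/models"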

7. Sample Manifests

Ready-to-use LLMInferenceService manifests are available in the llm-d-conformance-manifests repository (branch 3.4-ea2).

Available Manifests

Manifest                       Description                     GPUs   Features
single-gpu.yaml                Single GPU with EPP scheduler   1      Scheduler with container stubs
single-gpu-smoke.yaml          Minimal smoke test              1      Low resource requests
single-gpu-no-scheduler.yaml   K8s native routing (no EPP)     1      No scheduler
cache-aware.yaml               Prefix KV cache-aware routing   2      scheduler.config.inline with precise-prefix-cache-scorer
pd.yaml                        Prefill/Decode disaggregation   3+     NixlConnector KV transfer
moe.yaml                       MoE with DP/EP                  8      RDMA/RoCE, multi-node

Quick Deploy

# Clone the manifests
git clone -b 3.4-ea2 https://github.com/aneeshkp/llm-d-conformance-manifests.git
cd llm-d-conformance-manifests

# Deploy single GPU test
kubectl apply -n $NAMESPACE -f single-gpu.yaml

# Or deploy smoke test (lower resources)
kubectl apply -n $NAMESPACE -f single-gpu-smoke.yaml

EA2 Key Differences from EA1

Feature               EA1                   EA2
Pull secret name      redhat-pull-secret    rhaii-pull-secret
Scheduler config      verbose args[]        scheduler.config.inline
Cache scorer plugin   prefix-cache-scorer   precise-prefix-cache-scorer
Scheduler template    Not required          Container stubs required with imagePullSecrets

8. Troubleshooting

8.1 RHAI Operator Pods Stuck in ContainerCreating

Symptom: The rhai-operator pods remain in ContainerCreating state.

Cause: The operator mounts a webhook certificate secret that cert-manager has not yet issued. This is expected for the first 1-2 minutes of the deployment, while cert-manager starts and issues the certificate.

Resolution:

Wait for cert-manager to start and issue the certificate:

kubectl get certificate -n redhat-ods-operator

If the certificate does not appear after 5 minutes, check the cloud manager logs:

kubectl logs deployment/azure-cloud-manager-operator \
  -n rhai-cloudmanager-system --tail=30

8.2 Dependency Pods Show ImagePullBackOff

Symptom: Pods in openshift-lws-operator, cert-manager, or istio-system show ImagePullBackOff.

Cause: The pull secret credentials are invalid or don't cover the required registries.

Resolution:

Verify the pull secret works locally:

podman pull registry.redhat.io/ubi9/ubi-minimal --authfile ~/pull-secret.json

If the credentials are invalid, update the pull secret and re-run helm upgrade to push the updated secret to all namespaces. Then restart failing pods.
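
For example, reusing the values file and pull secret from Section 3 (add the CoreWeave flags from Section 3.3 if applicable, and adjust the namespace to wherever pods are failing):

helm upgrade rhaii oci://quay.io/rhoai/rhai-on-xks-chart \
  -n rhaii -f rhoai-values.yaml \
  --set-file imagePullSecret.dockerConfigJson=$HOME/pull-secret.json

# Recreate the failing pods so they pick up the updated secret
# (repeat for cert-manager and istio-system as needed)
kubectl delete pods -n openshift-lws-operator --all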

8.3 Gateway Shows Programmed: False

Symptom: kubectl get gateway -n redhat-ods-applications shows Programmed: False.

Cause: Missing ConfigMap referenced by parametersRef, or the CA bundle ConfigMap does not exist.

Resolution:

Check istiod logs:

kubectl logs deploy/istiod -n istio-system | grep gateway

Ensure both ConfigMaps exist:

kubectl get configmap inference-gateway-config rhaii-ca-bundle \
  -n redhat-ods-applications

8.4 Scheduler Deployment Fails with "containers: Required value"

Symptom: The LLMInferenceService shows SchedulerReconcileError and the controller logs show:

Deployment.apps "xxx-kserve-router-scheduler" is invalid: spec.template.spec.containers: Required value

Cause: The scheduler template in the LLMInferenceService spec has imagePullSecrets but no container name stubs. KServe replaces the entire template, resulting in empty containers.

Resolution:

Add container stubs to the scheduler template:

router:
  scheduler:
    template:
      imagePullSecrets:
      - name: rhaii-pull-secret
      containers:
      - name: main
      - name: tokenizer

See the conformance manifests for working examples.

8.5 LLMInferenceService Shows RefsInvalid

Symptom: The LLMInferenceService status shows RefsInvalid with message about non-existent gateway.

Cause: The gateway name does not match what the controller expects. When router.gateway: {} is empty, it defaults to inference-gateway in redhat-ods-applications.

Resolution:

Either create a gateway named inference-gateway (see Section 4), or specify the gateway explicitly:

router:
  gateway:
    refs:
    - name: my-gateway-name
      namespace: redhat-ods-applications

9. Uninstall

9.1 Delete LLM Inference Services

kubectl delete llmisvc --all -n llm-inference
kubectl delete namespace llm-inference

9.2 Uninstall the RHAII Operator

helm uninstall rhaii -n rhaii

CRDs are not removed on uninstall. To remove them manually:

kubectl delete crd kserves.components.platform.opendatahub.io
kubectl delete crd azurekubernetesengines.infrastructure.opendatahub.io
kubectl delete crd llminferenceservices.serving.kserve.io
kubectl delete crd llminferenceserviceconfigs.serving.kserve.io
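
Confirm that no related CRDs remain:

kubectl get crd | grep -E 'opendatahub|kserve' || echo "No matching CRDs found"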

Appendix: Component Reference

Namespaces

Namespace                  Owner           Description
rhaii                      Helm            Helm release metadata
redhat-ods-operator        RHAI Operator   Operator deployment and webhooks
redhat-ods-applications    RHAI Operator   KServe controller, inference gateway
rhai-cloudmanager-system   Helm            Cloud manager operator
cert-manager-operator      Cloud Manager   cert-manager operator deployment
cert-manager               Cloud Manager   cert-manager controller and webhooks
istio-system               Cloud Manager   Istio control plane
openshift-lws-operator     Cloud Manager   LeaderWorkerSet operator

API Versions

API                         Group                         Version    Status
LLMInferenceService         serving.kserve.io             v1alpha2   Alpha
LLMInferenceServiceConfig   serving.kserve.io             v1alpha2   Alpha
InferencePool               inference.networking.k8s.io   v1         GA
Gateway                     gateway.networking.k8s.io     v1         GA

Support

For assistance with Red Hat AI Inference Server deployments, contact Red Hat Support or consult the product documentation.

Additional Resources: