Deploying Red Hat AI Inference Server: Distributed Inference with llm-d

Product: Red Hat AI Inference Server (RHAIIS)
Version: 3.4
Platforms: Azure Kubernetes Service (AKS), CoreWeave Kubernetes Service (CKS)

Deploying on OpenShift? See the Deploying on OpenShift guide instead.


Executive Summary

This guide provides step-by-step instructions for deploying Distributed Inference with llm-d for Red Hat AI Inference Server. Distributed Inference with llm-d enables enterprise-grade Large Language Model (LLM) inference with features including:

  • Intelligent request routing using the Endpoint Picker Processor (EPP)
  • Disaggregated serving with prefill-decode separation for optimal throughput
  • Multi-node inference for large models using LeaderWorkerSet
  • Mutual TLS (mTLS) for secure communication between components
  • Gateway API integration for standard Kubernetes ingress

Table of Contents

  1. Prerequisites (includes Preflight Validation)
  2. Architecture Overview
  3. Deploying All Components
  4. Configuring the Inference Gateway
  5. Deploying an LLM Inference Service
  6. Verifying the Deployment
  7. Optional: Enabling Monitoring
  8. Optional: Enabling RHCL (API Gateway, Auth, Rate Limiting)
  9. Collecting Debug Information
  10. Troubleshooting
  Appendix: Component Versions

1. Prerequisites

1.1 Kubernetes Cluster Requirements

| Requirement | Specification |
|---|---|
| Kubernetes Version | 1.33 or later |
| Supported Platforms | AKS, CKS (CoreWeave) |
| GPU Nodes | Supported NVIDIA/AMD GPU nodes |
| GPU Device Plugin | Installed and configured (NVIDIA, AMD) |

1.2 Client Tools

Install the following tools on your workstation:

| Tool | Minimum Version | Purpose |
|---|---|---|
| kubectl | 1.33+ | Kubernetes CLI |
| helm | 3.17+ | Helm package manager |
| helmfile | 0.160+ | Declarative Helm deployments |
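
The minimum versions above cannot be compared as plain strings (for example, 3.9 sorts after 3.17 lexically but is the older release). A minimal sketch of a version check follows. The helper names `version_ge` and `check_tool` are hypothetical, and the sketch assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
# version_ge A B: succeeds when version A is at least version B.
# Relies on `sort -V` (version sort) to order dotted version strings.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME INSTALLED MINIMUM: prints a one-line verdict and
# returns non-zero when the installed version is older than MINIMUM.
check_tool() {
  local name=$1 installed=${2#v} minimum=$3   # drop a leading "v" (e.g. v1.33.2)
  if version_ge "$installed" "$minimum"; then
    echo "$name $installed: OK (minimum $minimum)"
  else
    echo "$name $installed: too old (minimum $minimum)"
    return 1
  fi
}
```

Pass the version string reported by each tool's own version command, for example `check_tool helm v3.17.1 3.17`.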

1.3 Red Hat Registry Authentication

RHAIIS images are hosted on registry.redhat.io and require authentication.

Procedure:

  1. Navigate to the Red Hat Registry Service Accounts page: https://access.redhat.com/terms-based-registry/

  2. Click New Service Account and create a new service account.

  3. Note the generated username (format: 12345678|account-name) and password.

  4. Authenticate with the registry:

podman login registry.redhat.io

Enter the service account username and password when prompted.

  5. Verify authentication:

# Verify access to Sail Operator image
podman pull registry.redhat.io/openshift-service-mesh/istio-sail-operator-bundle:3.2

# Verify access to RHAIIS vLLM image
podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:latest

Credentials are stored automatically in ~/.config/containers/auth.json after successful login.

Note: Registry Service Accounts do not expire and are recommended for production deployments.

1.4 GPU Node Pool Configuration

For GPU-accelerated inference, ensure your cluster has GPU nodes with the matching device plugin installed (NVIDIA or AMD). The verification commands in this section use the NVIDIA node labels.

Azure Kubernetes Service (AKS)

For AKS cluster provisioning with GPU nodes, see:

  • AKS Provisioning Scripts: automated cluster creation with GPU Operator
  • AKS Infrastructure Guide: manual setup instructions

CoreWeave Kubernetes Service (CKS)

CoreWeave clusters come with GPU nodes pre-configured. Select the appropriate GPU type when provisioning your cluster:

| GPU Type | Use Case |
|---|---|
| NVIDIA A100 80GB | Large models (70B+), high throughput |
| NVIDIA A100 40GB | Medium models (7B-30B) |
| NVIDIA H100 80GB | Maximum performance, largest models |

CoreWeave GPU nodes include the NVIDIA device plugin by default.

Verification:

kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe nodes | grep -A5 "nvidia.com/gpu"
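
To turn the verification commands above into a single number, the allocatable GPU count can be summed per node. This is a sketch only: the function names `sum_gpus` and `list_gpu_capacity` are hypothetical, and it assumes the NVIDIA resource name `nvidia.com/gpu` used elsewhere in this guide.

```shell
# sum_gpus: reads "NODE COUNT" lines on stdin and prints the total GPU count.
# Nodes without the resource show up as "<none>" in kubectl output and are skipped.
sum_gpus() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# list_gpu_capacity: prints one line per node with its allocatable
# nvidia.com/gpu count (the dot in the resource name must be escaped).
list_gpu_capacity() {
  kubectl get nodes --no-headers \
    -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Running `list_gpu_capacity | sum_gpus` prints the total; a result of 0 means no node currently advertises GPUs to the scheduler.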

Preflight Validation

Run the preflight validation checks to verify that your cluster is properly configured:

# Build the validation container
cd validation && make container

# Run preflight checks against your cluster
make run

The preflight tool automatically detects your cloud provider and validates:

| Check | When it passes |
|---|---|
| Cloud provider | Cluster is reachable and provider detected (pre-deployment) |
| Instance type | Supported GPU instance types are present (pre-deployment) |
| GPU availability | GPU drivers and node labels found (pre-deployment) |
| cert-manager CRDs | After make deploy-all |
| Sail Operator CRDs | After make deploy-all |
| LWS Operator CRDs | After make deploy-all |
| KServe CRDs | After make deploy-all |

Tip: Run before deploying to verify cluster readiness (cloud provider, GPU, instance types). Run again after deployment to confirm all CRDs are installed. See Section 6.4 for full post-deployment validation.

See the Preflight Validation README for configuration options and standalone usage.


2. Architecture Overview

Distributed Inference with llm-d on Red Hat AI Inference Server consists of the following components:

| Component | Version | Description |
|---|---|---|
| cert-manager | 1.15.2 | Manages TLS certificates for mTLS between components |
| Istio (Sail Operator) | 3.2.1 / 1.27.x | Provides Gateway API implementation for inference routing |
| LeaderWorkerSet (LWS) | 1.0 | Enables multi-node inference for large models |
| KServe Controller | 0.15 (chart 3.4.0-ea.1) | Manages LLMInferenceService lifecycle |
| Gateway API | 1.4.0 | Routes external traffic to inference endpoints (also compatible with 1.3.0+) |

Component Interaction

graph LR
    Client --> Gateway["Gateway<br/>(Istio)"]

    subgraph Kubernetes Cluster
        Gateway --> EPP["EPP<br/>Scheduler"]
        EPP --> vLLM["vLLM Pods<br/>(Model)"]
        cm["cert-manager"] -. mTLS .-> EPP
        cm["cert-manager"] -. mTLS .-> vLLM
    end

3. Deploying All Components

3.1 Clone the Deployment Repository

git clone https://github.com/opendatahub-io/rhaii-on-xks.git
cd rhaii-on-xks

3.2 Deploy

Deploy cert-manager, Istio (Sail Operator), LeaderWorkerSet, and KServe:

make deploy-all

Note: To deploy components individually, use make deploy-cert-manager, make deploy-istio, make deploy-lws, and make deploy-kserve.

3.3 Verify Infrastructure Deployment

make status

Expected output:

=== Deployment Status ===
cert-manager-operator:
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-operator-xxxxxxxxx-xxxxx      1/1     Running   0          5m

cert-manager:
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-xxxxxxxxx-xxxxx               1/1     Running   0          5m
cert-manager-cainjector-xxxxxxxxx-xxxxx    1/1     Running   0          5m
cert-manager-webhook-xxxxxxxxx-xxxxx       1/1     Running   0          5m

istio:
NAME                                       READY   STATUS    RESTARTS   AGE
istiod-xxxxxxxxx-xxxxx                     1/1     Running   0          5m

lws-operator:
NAME                                       READY   STATUS    RESTARTS   AGE
lws-controller-manager-xxxxxxxxx-xxxxx     1/1     Running   0          5m

=== API Versions ===
InferencePool API: v1 (inference.networking.k8s.io)
Istio version: v1.27.5

TLS Certificates: The default configuration uses a self-signed CA for internal mTLS between inference components (router, scheduler, vLLM). This is sufficient for most deployments because the certificates are used only for pod-to-pod communication within the cluster. If your organization requires certificates issued by a corporate PKI, replace the opendatahub-selfsigned-issuer with a cert-manager ClusterIssuer backed by your CA (e.g., Vault, AWS ACM PCA, or an external PKI). See the KServe Chart README - cert-manager PKI Setup for details.

The KServe chart version is configured in values.yaml (kserveChartVersion). See the KServe Chart README for chart details and cert-manager PKI prerequisites.
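
For the corporate PKI case mentioned above, the simplest cert-manager option is a CA-type ClusterIssuer that signs with a key pair stored in a Secret. The sketch below only renders such a manifest. The function name and the example names `corp-ca-issuer` and `corp-ca-key-pair` are hypothetical, and how the KServe chart is pointed at a different issuer is covered by the KServe Chart README, not by this sketch.

```shell
# render_ca_cluster_issuer NAME SECRET: prints a cert-manager ClusterIssuer
# manifest that signs certificates with the CA key pair stored in SECRET.
# NAME and SECRET are placeholders supplied by the caller.
render_ca_cluster_issuer() {
  local name=$1 secret=$2
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ${name}
spec:
  ca:
    secretName: ${secret}
EOF
}
```

A possible use is `render_ca_cluster_issuer corp-ca-issuer corp-ca-key-pair | kubectl apply -f -`. For a ClusterIssuer, cert-manager looks up the Secret (with `tls.crt` and `tls.key`) in its cluster resource namespace, which defaults to the namespace cert-manager runs in.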


4. Configuring the Inference Gateway

4.1 Create the Gateway

Run the gateway setup script:

./scripts/setup-gateway.sh

This script:

  1. Copies the CA bundle from cert-manager to the opendatahub namespace
  2. Creates an Istio Gateway with the CA bundle mounted for mTLS
  3. Configures the Gateway pod with registry authentication

4.2 Verify Gateway Deployment

kubectl get gateway -n opendatahub

Expected output:

NAME                CLASS   ADDRESS         PROGRAMMED   AGE
inference-gateway   istio   20.xx.xx.xx     True         1m

Verify the Gateway pod is running:

kubectl get pods -n opendatahub -l gateway.networking.k8s.io/gateway-name=inference-gateway

4.3 AKS: Fix Load Balancer Health Probe

On AKS, external traffic to the inference gateway on port 80 may time out because the Azure Load Balancer uses an HTTP health probe that fails against the Istio gateway. The fix is to switch the probe to TCP, which setup-gateway.sh does automatically on AKS.

If you need to apply it manually (e.g., after recreating the Gateway):

kubectl annotate svc inference-gateway-istio -n opendatahub \
  "service.beta.kubernetes.io/port_80_health-probe_protocol=tcp" \
  --overwrite

Note: The port number in the annotation must match the Gateway listener port (80 here, as configured in setup-gateway.sh). If the Gateway is deleted and recreated without re-running setup-gateway.sh, the annotation will be lost and must be reapplied. See Azure LB Health Probe Workaround for full details.
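
Because the port number is embedded in the annotation key, a listener on a different port needs a different key. A small sketch that builds the key=value pair from the port (the function name `probe_annotation` is hypothetical):

```shell
# probe_annotation PORT: prints the Azure annotation that switches the load
# balancer health probe for PORT to TCP, in key=value form for `kubectl annotate`.
probe_annotation() {
  local port=$1
  case $port in
    ''|*[!0-9]*) echo "port must be a number: '$port'" >&2; return 1 ;;
  esac
  echo "service.beta.kubernetes.io/port_${port}_health-probe_protocol=tcp"
}
```

For example, `kubectl annotate svc inference-gateway-istio -n opendatahub "$(probe_annotation 80)" --overwrite` reproduces the manual command above.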


5. Deploying an LLM Inference Service

5.1 Create the Application Namespace

export NAMESPACE=llm-inference
kubectl create namespace $NAMESPACE

5.2 Configure Registry Authentication

Copy the pull secret to your application namespace:

kubectl get secret redhat-pull-secret -n istio-system -o json | \
  jq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.annotations, .metadata.labels) | .metadata.namespace = "'$NAMESPACE'"' | \
  kubectl create -f -

Configure the default ServiceAccount:

kubectl patch serviceaccount default -n $NAMESPACE \
  -p '{"imagePullSecrets": [{"name": "redhat-pull-secret"}]}'

5.3 Deploy the LLMInferenceService

Create the LLMInferenceService resource:

kubectl apply -n $NAMESPACE -f - <<'EOF'
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen2-7b-instruct
spec:
  model:
    name: Qwen/Qwen2.5-7B-Instruct
    uri: hf://Qwen/Qwen2.5-7B-Instruct
  replicas: 1
  router:
    gateway: {}
    route: {}
    scheduler: {}
  template:
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
    containers:
    - name: main
      resources:
        limits:
          cpu: "4"
          memory: 32Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "2"
          memory: 16Gi
          nvidia.com/gpu: "1"
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
          scheme: HTTPS
        initialDelaySeconds: 120
        periodSeconds: 30
        timeoutSeconds: 30
        failureThreshold: 5
EOF

5.4 Monitor Deployment Progress

Watch the LLMInferenceService status:

kubectl get llmisvc -n $NAMESPACE -w

The service is ready when the READY column shows True.
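
In scripts, watching with `-w` is awkward because it never exits. One alternative is a polling loop with a timeout, sketched below. The function name `wait_for_true` is hypothetical, and the jsonpath in the usage comment assumes the resource exposes a status condition of type `Ready`, which this guide does not confirm.

```shell
# wait_for_true TIMEOUT INTERVAL CMD...: re-runs CMD every INTERVAL seconds
# until it prints exactly "True"; gives up after TIMEOUT seconds.
wait_for_true() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  while true; do
    if [ "$("$@" 2>/dev/null)" = "True" ]; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Usage (assumes a "Ready" condition on the LLMInferenceService status):
#   wait_for_true 900 10 kubectl get llmisvc qwen2-7b-instruct -n "$NAMESPACE" \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```

A generous timeout is advisable, since the first start includes downloading the model weights.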


6. Verifying the Deployment

6.1 Check Service Status

kubectl get llmisvc -n $NAMESPACE

Expected output:

NAME                READY   URL                                    AGE
qwen2-7b-instruct   True    http://20.xx.xx.xx/llm-inference/...   5m

6.2 Check Pod Status

kubectl get pods -n $NAMESPACE

All pods should show Running status with 1/1 or 2/2 ready containers.

6.3 Test Inference

Retrieve the service URL:

SERVICE_URL=$(kubectl get llmisvc qwen2-7b-instruct -n $NAMESPACE -o jsonpath='{.status.url}')
echo $SERVICE_URL

Send a test request:

curl -X POST "${SERVICE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "What is Kubernetes?"}],
    "max_tokens": 100
  }'

6.4 Run Preflight Validation

Run the full validation suite to confirm all components are properly installed:

cd validation && make container && make run

All checks should show PASSED:

cloud_provider   PASSED
instance_type    PASSED
gpu_availability PASSED
crd_certmanager  PASSED
crd_sailoperator PASSED
crd_lwsoperator  PASSED
crd_kserve       PASSED

If any checks fail, review the suggested actions in the output. See the Preflight Validation README for configuration options.
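
To gate a pipeline on the results, the check lines can be filtered for anything other than PASSED. This sketch assumes the tool's output uses the two-column "check status" layout shown above; the function name `failed_checks` is hypothetical.

```shell
# failed_checks: reads "CHECK STATUS" lines on stdin, prints the name of every
# check whose status is not PASSED, and returns non-zero if any were printed.
failed_checks() {
  awk 'NF >= 2 && $2 != "PASSED" { print $1; bad = 1 } END { exit bad }'
}
```

Feed it only the result lines, for example from a saved copy of the output: `failed_checks < preflight-results.txt` (a hypothetical file name).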


7. Optional: Enabling Monitoring

Monitoring is disabled by default. Enable it if you need:

  • Grafana dashboards for inference metrics
  • Workload Variant Autoscaler (WVA) for auto-scaling

7.1 Prerequisites

Install Prometheus with ServiceMonitor/PodMonitor CRD support. See the Monitoring Setup Guide for platform-specific instructions.

7.2 Enable Monitoring in KServe

kubectl set env deployment/kserve-controller-manager \
  -n opendatahub \
  LLMISVC_MONITORING_DISABLED=false

When enabled, KServe automatically creates PodMonitor resources for vLLM pods.

7.3 Verify

# Check PodMonitors created by KServe
kubectl get podmonitors -n <llmisvc-namespace>

8. Optional: Enabling RHCL (API Gateway, Auth, Rate Limiting)

Red Hat Connectivity Link (RHCL) provides API gateway authentication, authorization, and rate limiting for inference endpoints. It deploys the Kuadrant operator stack (Kuadrant, Authorino, Limitador) and is disabled by default.

8.1 Enable RHCL

Edit values.yaml to enable:

rhclOperator:
  enabled: true
  operators:
    dns:
      enabled: false  # Set to true if you need DNS policy management

8.2 Deploy

Deploy RHCL after the baseline stack is running:

cd charts/rhcl && helmfile apply

This will:

  1. Create namespaces (kuadrant-operators, kuadrant-system)
  2. Deploy 3 operators (Kuadrant, Authorino, Limitador)
  3. Install 14 CRDs
  4. Create a Kuadrant instance and wait for it to be ready

Note: RHCL requires cert-manager and Istio (sail-operator) to be deployed first. Run make deploy-all before enabling RHCL.

Note: When deploying standalone via cd charts/rhcl && helmfile apply, the chart reads its own charts/rhcl/values.yaml directly. The root values.yaml rhclOperator settings only apply when deploying via the root helmfile (helmfile apply from the repo root).

8.3 Verify

# Check operators
kubectl get pods -n kuadrant-operators

# Check Kuadrant instance
kubectl get kuadrant -n kuadrant-system

# Check sub-operator instances
kubectl get authorino,limitador -n kuadrant-system

All operators should show 1/1 Running and the Kuadrant CR should show Ready.

8.4 Creating Policies

After RHCL is deployed, you can create policies targeting your Gateways and HTTPRoutes:

AuthPolicy (API key authentication on an HTTPRoute):

apiVersion: kuadrant.io/v1
kind: AuthPolicy
metadata:
  name: my-auth
  namespace: my-app
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: my-route
  rules:
    authentication:
      api-key-auth:
        apiKey:
          selector: {}
        credentials:
          queryString:
            name: apikey

RateLimitPolicy (rate limiting on a Gateway):

apiVersion: kuadrant.io/v1
kind: RateLimitPolicy
metadata:
  name: my-ratelimit
  namespace: my-gateway-ns
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: my-gateway
  limits:
    default-limits:
      rates:
        - limit: 10
          window: 10s
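
One way to observe the RateLimitPolicy above is to send more than 10 requests inside a single 10-second window and tally the status codes; requests over the limit are expected to be rejected with HTTP 429. The sketch below uses hypothetical function names, and the URL is a placeholder for a route behind the rate-limited Gateway.

```shell
# count_codes: reads one HTTP status code per line and prints "CODE COUNT" pairs.
count_codes() {
  awk 'NF { n[$1]++ } END { for (c in n) print c, n[c] }' | sort
}

# probe_rate_limit URL N: sends N GET requests back to back and prints
# only the HTTP status code of each response.
probe_rate_limit() {
  local url=$1 n=$2 i
  for i in $(seq 1 "$n"); do
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
  done
}
```

For example, `probe_rate_limit "http://<gateway-address>/" 15 | count_codes` should show roughly 10 accepted responses and the remainder as 429, provided all requests land in the same window.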

8.5 Running Integration Tests

./test/conformance/verify-rhcl-deployment.sh

This deploys a test Gateway, HTTPRoute, AuthPolicy, and RateLimitPolicy, validates enforcement, and cleans up.

8.6 Disabling RHCL

Set rhclOperator.enabled: false in values.yaml, or uninstall:

cd charts/rhcl && helmfile destroy

For detailed configuration, see charts/rhcl/README.md.


9. Collecting Debug Information

If you encounter issues during or after deployment, collect diagnostic data for troubleshooting:

./scripts/collect-debug-info.sh

This produces a directory containing logs, resource status, certificate info, and warning events for all components. To package for sharing with Red Hat support:

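# Run from /tmp so the shell expands the glob against the debug directories there;
# the archive is written to /tmp/rhaii-debug.tar.gz
cd /tmp && tar -czf rhaii-debug.tar.gz rhaii-on-xks-debug-*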

What is collected:

| Category | Details |
|---|---|
| Cluster | Kubernetes version, node info, Helm releases |
| cert-manager | Operator/controller/webhook logs, certificates, issuers |
| Istio | Sail operator/istiod logs, gateways, HTTPRoutes, InferencePools |
| LWS | Operator logs, LeaderWorkerSets, webhook configs |
| KServe | Controller logs, LLMInferenceServices, gateway status |
| Events | Warning/error events from all operator namespaces |

See the full guide: Collecting Debug Information


10. Troubleshooting

10.1 Controller Pod Stuck in ContainerCreating

Symptom: The kserve-controller-manager pod remains in ContainerCreating state.

Cause: The webhook certificate has not been issued by cert-manager.

Resolution:

Verify the cert-manager PKI resources are applied (the KServe chart expects opendatahub-ca-issuer ClusterIssuer):

kubectl get clusterissuer opendatahub-ca-issuer
kubectl get certificate -n cert-manager

# If missing, re-run the deployment
make deploy-kserve

10.2 Gateway Pod Shows ErrImagePull

Symptom: The Gateway pod fails with ErrImagePull or ImagePullBackOff.

Cause: The Gateway ServiceAccount does not have registry authentication configured.

Resolution:

kubectl get secret redhat-pull-secret -n istio-system -o json | \
  jq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.annotations, .metadata.labels) | .metadata.namespace = "opendatahub"' | \
  kubectl create -f -

kubectl patch sa inference-gateway-istio -n opendatahub \
  -p '{"imagePullSecrets": [{"name": "redhat-pull-secret"}]}'

kubectl delete pod -n opendatahub -l gateway.networking.k8s.io/gateway-name=inference-gateway

10.3 LLMInferenceService Pod Shows FailedScheduling

Symptom: The inference pod shows FailedScheduling with message "Insufficient nvidia.com/gpu".

Cause: No GPU nodes are available or the pod lacks required tolerations.

Resolution:

  1. Verify GPU nodes are available:

    kubectl get nodes -l nvidia.com/gpu.present=true
    

  2. Check node taints:

    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.taints}{"\n"}{end}'
    

  3. Add matching tolerations to the LLMInferenceService spec (see Section 5.3).

10.4 Webhook Validation Errors During Deployment

Symptom: Deployment fails with "no endpoints available for service" webhook errors.

Cause: Webhooks are registered before the controller is ready.

Resolution:

# Delete stale webhooks
kubectl delete validatingwebhookconfiguration \
  llminferenceservice.serving.kserve.io \
  llminferenceserviceconfig.serving.kserve.io \
  --ignore-not-found

# Re-deploy KServe
make deploy-kserve

Appendix: Component Versions

| Component | Version | Container Image |
|---|---|---|
| cert-manager Operator | 1.15.2 | registry.redhat.io/cert-manager/cert-manager-operator-rhel9 |
| Sail Operator (Istio) | 3.2.1 | registry.redhat.io/openshift-service-mesh/istio-sail-operator-bundle:3.2 |
| Istio | 1.27.x | Dynamic resolution via v1.27-latest |
| LeaderWorkerSet | 1.0 | registry.k8s.io/lws/lws-controller |
| KServe Controller | 0.15 (chart 3.4.0-ea.1) | registry.redhat.io (via charts/kserve/) |
| Gateway API | 1.4.0 | Also compatible with 1.3.0+ |
| vLLM (CUDA) | 3.4.0-ea.1 | registry.redhat.io/rhaiis/vllm-cuda-rhel9 |
| vLLM (ROCm) | 3.4.0-ea.1 | registry.redhat.io/rhaiis/vllm-rocm-rhel9 |

API Versions

| API | Group | Version | Status |
|---|---|---|---|
| InferencePool | inference.networking.k8s.io | v1 | GA |
| InferenceModel | inference.networking.x-k8s.io | v1alpha2 | Alpha |
| LLMInferenceService | serving.kserve.io | v1alpha1 | Alpha |
| Gateway | gateway.networking.k8s.io | v1 | GA |

Support

For assistance with Red Hat AI Inference Server deployments, contact Red Hat Support or consult the product documentation.

Additional Resources: