Deploying Red Hat AI Inference Server: Distributed Inference with llm-d on OpenShift

Product: Red Hat AI Inference Server (RHAIIS)
Version: 3.4
Platform: OpenShift 4.19+

Overview

This guide covers deploying the inference-only stack (KServe + distributed inference with llm-d) on OpenShift using OLM (Operator Lifecycle Manager) via the odh-gitops Helm chart. The chart installs only the model serving capabilities, without the rest of the RHOAI/ODH platform.

1. Prerequisites

1.1 Cluster Requirements

Requirement	Specification
OpenShift Version	4.19 or later
GPU Nodes	Supported NVIDIA/AMD GPU nodes
GPU Operator	NVIDIA GPU Operator on OpenShift installed
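
Before installing anything, it can help to confirm that the cluster actually exposes schedulable GPUs. The following is a minimal sketch, assuming NVIDIA GPUs advertised under the `nvidia.com/gpu` resource name (the same name used in the LLMInferenceService example in section 6). The function names are illustrative and not part of any tool.

```shell
# Print the allocatable nvidia.com/gpu count of every node, one per line.
# Nodes without GPUs produce an empty line.
list_gpu_allocatable() {
  oc get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# Sum the numbers read from stdin; empty lines count as zero.
sum_lines() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Total schedulable GPUs across the cluster.
total_gpus() {
  list_gpu_allocatable | sum_lines
}
```

Calling `total_gpus` prints a single number. A result of 0 suggests that the GPU Operator has not yet labeled the nodes or that no GPU nodes are present.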

1.2 Client Tools

Tool	Minimum Version	Purpose
`oc` or `kubectl`	1.33+	Kubernetes / OpenShift CLI
`helm`	3.17+	Helm package manager

1.3 Permissions

Cluster admin permissions are required to install OLM operators and create cluster-scoped resources.
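
The tool and permission prerequisites can be checked up front with a short preflight. This is a sketch only: the function names are hypothetical, and the wildcard `oc auth can-i` check is used as a proxy for cluster admin rather than an exact test of the cluster-admin role.

```shell
# Print every tool from the argument list that is not on PATH.
missing_tools() {
  for tool_name in "$@"; do
    command -v "$tool_name" >/dev/null 2>&1 || printf '%s\n' "$tool_name"
  done
}

# Succeed only if the logged-in user may perform any verb on any resource
# in all namespaces, used here as a proxy for cluster admin.
is_cluster_admin() {
  [ "$(oc auth can-i '*' '*' --all-namespaces 2>/dev/null)" = "yes" ]
}
```

For example, `missing_tools oc helm` prints nothing when both tools are installed, and `is_cluster_admin || echo "insufficient permissions"` reports a problem before any Helm run starts.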

2. What Gets Installed

The inference-only stack installs a minimal set of operators via OLM:

Operator	Purpose	Namespace
cert-manager	Certificate management and TLS provisioning	`cert-manager-operator`
Leader Worker Set	Distributed inference workflows	`openshift-lws-operator`
Red Hat Connectivity Link (RHCL)	API management (Kuadrant/Authorino)	`kuadrant-system`
ODH/RHOAI Operator	KServe controller and LLMInferenceService lifecycle	`redhat-ods-applications`

The Helm chart also creates: - DSCInitialization (DSCI) with monitoring disabled - DataScienceCluster (DSC) with only KServe set to Managed - GatewayClass (openshift-default) for OpenShift's gateway controller

All other platform components (AI Pipelines, Dashboard, Feast, Kueue, Model Registry, Ray, Trainer, Training Operator, TrustyAI, Workbenches, MLflow, LlamaStack) are set to Removed.

3. Installation

3.1 Clone the Repository

git clone https://github.com/opendatahub-io/odh-gitops.git
cd odh-gitops

3.2 Install Operators (First Helm Run)

The first run creates the OLM installation resources (Namespace, OperatorGroup, and Subscription). Custom resources (CRs) are skipped because their CRDs do not exist yet.

helm upgrade --install rhoai ./chart \
  -f docs/examples/values-inference-only.yaml \
  -n opendatahub-gitops --create-namespace

3.3 Wait for CRDs

Wait for the operators to install and register their CRDs:

kubectl wait --for=condition=Established \
  crd/leaderworkersetoperators.operator.openshift.io --timeout=300s

kubectl wait --for=condition=Established \
  crd/kuadrants.kuadrant.io --timeout=300s

kubectl wait --for=condition=Established \
  crd/datascienceclusters.datasciencecluster.opendatahub.io --timeout=300s

kubectl wait --for=condition=Established \
  crd/dscinitializations.dscinitialization.opendatahub.io --timeout=300s
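
Note that `kubectl wait` exits with a NotFound error when the CRD object has not been created yet, which can happen shortly after the first Helm run. A possible workaround, sketched below, is to poll until the CRD exists and only then wait for the Established condition. The function name and its parameters are illustrative assumptions.

```shell
# wait_for_crd NAME [TIMEOUT_SECONDS] [POLL_SECONDS]
# Phase 1 polls until the CRD object exists; phase 2 waits for Established.
wait_for_crd() {
  crd_name=$1
  crd_timeout=${2:-300}
  crd_poll=${3:-5}
  crd_elapsed=0

  until kubectl get crd "$crd_name" >/dev/null 2>&1; do
    if [ "$crd_elapsed" -ge "$crd_timeout" ]; then
      echo "timed out waiting for CRD $crd_name to appear" >&2
      return 1
    fi
    sleep "$crd_poll"
    crd_elapsed=$((crd_elapsed + crd_poll))
  done

  kubectl wait --for=condition=Established "crd/$crd_name" \
    --timeout="${crd_timeout}s"
}
```

It would be called once per CRD listed above, for example `wait_for_crd kuadrants.kuadrant.io 300`.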

3.4 Create CRs (Second Helm Run)

Now that the CRDs exist, the second run creates the custom resources (DSCInitialization, DataScienceCluster, Kuadrant, LeaderWorkerSetOperator, etc.):

helm upgrade --install rhoai ./chart \
  -f docs/examples/values-inference-only.yaml \
  -n opendatahub-gitops

4. Enabling Authorino TLS

Warning: This step is required for KServe to function correctly. It must be run after the Kuadrant operator creates the Authorino resource.

KUSTOMIZE_MODE=false ./scripts/prepare-authorino-tls.sh

This script:
1. Waits for the Authorino service to be created
2. Annotates the service to trigger TLS certificate generation
3. Waits for the TLS certificate secret
4. Patches the Authorino CR to enable TLS
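
For readers who want to see what those four steps might look like by hand, here is a rough sketch. It is not the script itself. The service and secret names are taken from the troubleshooting section of this guide; the annotation key (OpenShift service CA) and the `certSecretRef` field of the Authorino CR are assumptions, and the script in the repository remains the source of truth.

```shell
# Run a command until it succeeds, at most TRIES times, DELAY seconds apart.
retry() {
  retry_tries=$1
  retry_delay=$2
  shift 2
  until "$@" >/dev/null 2>&1; do
    retry_tries=$((retry_tries - 1))
    [ "$retry_tries" -gt 0 ] || return 1
    sleep "$retry_delay"
  done
}

# Manual approximation of the four steps (assumed names and fields).
enable_authorino_tls() {
  tls_ns=${1:-kuadrant-system}
  tls_svc=authorino-authorino-authorization
  tls_secret=authorino-server-cert

  # 1. Wait for the Authorino service to be created.
  retry 60 5 kubectl get svc "$tls_svc" -n "$tls_ns" || return 1
  # 2. Ask the OpenShift service CA to issue a serving certificate.
  kubectl annotate svc "$tls_svc" -n "$tls_ns" --overwrite \
    "service.beta.openshift.io/serving-cert-secret-name=$tls_secret" || return 1
  # 3. Wait for the certificate secret to appear.
  retry 60 5 kubectl get secret "$tls_secret" -n "$tls_ns" || return 1
  # 4. Point the Authorino listener at the secret and enable TLS.
  kubectl patch authorino authorino -n "$tls_ns" --type=merge -p \
    "{\"spec\":{\"listener\":{\"tls\":{\"enabled\":true,\"certSecretRef\":{\"name\":\"$tls_secret\"}}}}}"
}
```

In normal use the provided script is preferable; the sketch is mainly useful for understanding which objects to inspect when the script fails.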

5. Verification

5.1 Check Operator CSVs

Verify that all operators are installed and in the Succeeded phase:

oc get csv -A | grep -E "(cert-manager|leader-worker|rhcl|opendatahub|rhods)"
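
Scanning the grep output by eye gets tedious on clusters with many namespaces. As a sketch, the phase of each CSV can be extracted with a JSONPath template and filtered so that only CSVs that are not Succeeded are printed. The function names are hypothetical.

```shell
# List every CSV as "namespace<TAB>name<TAB>phase".
list_csv_phases() {
  oc get csv -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Read that listing on stdin and print the CSVs that are not Succeeded.
unready_csvs() {
  awk -F '\t' '$3 != "Succeeded" { print $1 "/" $2 ": " $3 }'
}
```

Piping `list_csv_phases` through the same `grep -E` pattern as above and then through `unready_csvs` prints nothing when every matching operator has finished installing.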

5.2 Check Authorino TLS

oc get authorino authorino -n kuadrant-system \
  -o jsonpath='{.spec.listener.tls}'

5.3 Check DataScienceCluster Status

oc get datasciencecluster -o jsonpath='{.items[0].status.phase}'

5.4 Check KServe Pods

oc get pods -n redhat-ods-applications

Expected output:

NAME                                         READY   STATUS    RESTARTS   AGE
kserve-controller-manager-xxxxxxxxx-xxxxx    1/1     Running   0          5m
odh-model-controller-xxxxxxxxx-xxxxx         1/1     Running   0          5m
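
The same check can be automated by parsing the pod listing. The sketch below assumes the default column layout of `oc get pods --no-headers` (NAME, READY, STATUS, ...) and uses a hypothetical function name.

```shell
# Read `oc get pods --no-headers` output on stdin and print the name of every
# pod that is neither fully ready and Running nor Completed.
pods_not_ready() {
  awk '
    $3 == "Completed" { next }
    {
      split($2, ready, "/")
      if (ready[1] != ready[2] || $3 != "Running") print $1
    }
  '
}
```

For example, `oc get pods -n redhat-ods-applications --no-headers | pods_not_ready` prints nothing when both controllers are healthy.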

5.5 Comprehensive Verification

Use the provided verification script for a full check of all operator subscriptions and pod readiness:

./scripts/verify-dependencies.sh

6. Deploying an LLM Inference Service

6.1 Create the Application Namespace

export NAMESPACE=llm-inference
oc new-project "$NAMESPACE"

6.2 Deploy the LLMInferenceService

Create the LLMInferenceService resource:

oc apply -n $NAMESPACE -f - <<'EOF'
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen2-7b-instruct
spec:
  model:
    name: Qwen/Qwen2.5-7B-Instruct
    uri: hf://Qwen/Qwen2.5-7B-Instruct
  replicas: 1
  router:
    gateway: {}
    route: {}
    scheduler: {}
  template:
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
    containers:
    - name: main
      resources:
        limits:
          cpu: "4"
          memory: 32Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "2"
          memory: 16Gi
          nvidia.com/gpu: "1"
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
          scheme: HTTPS
        initialDelaySeconds: 120
        periodSeconds: 30
        timeoutSeconds: 30
        failureThreshold: 5
EOF

6.3 Monitor Deployment Progress

Watch the LLMInferenceService status:

oc get llmisvc -n $NAMESPACE -w

The service is ready when the READY column shows True.
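
In scripts, blocking until readiness is often more convenient than watching. The sketch below assumes that the READY column reflects a standard `Ready` status condition on the resource, which this guide does not state explicitly; the function name and the default timeout are illustrative.

```shell
# Block until the named LLMInferenceService reports Ready, or time out.
# Assumes the resource publishes a standard "Ready" status condition.
wait_llmisvc_ready() {
  svc_name=$1
  svc_ns=$2
  svc_timeout=${3:-900s}
  oc wait --for=condition=Ready "llmisvc/$svc_name" \
    -n "$svc_ns" --timeout="$svc_timeout"
}
```

A call such as `wait_llmisvc_ready qwen2-7b-instruct "$NAMESPACE"` returns non-zero if the service does not become ready in time. Model download and loading can take several minutes, so a generous timeout is advisable.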

6.4 Test Inference

Retrieve the service URL:

SERVICE_URL=$(oc get llmisvc qwen2-7b-instruct -n "$NAMESPACE" -o jsonpath='{.status.url}')
echo "$SERVICE_URL"

Send a test request:

curl -X POST "${SERVICE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "What is Kubernetes?"}],
    "max_tokens": 100
  }'
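
For automated checks, the request above can be reduced to a pass/fail smoke test that looks only at the HTTP status code. This is a sketch with hypothetical function names; it sends a fixed one-token prompt and assumes the endpoint needs no authentication header.

```shell
# Send a one-token chat request and print only the HTTP status code.
chat_http_code() {
  chat_url=$1
  chat_model=$2
  curl -s -o /dev/null -w '%{http_code}' \
    -X POST "${chat_url}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${chat_model}\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"max_tokens\": 1}"
}

# Succeed only when the endpoint answers 200.
smoke_test() {
  [ "$(chat_http_code "$1" "$2")" = "200" ]
}
```

For example, `smoke_test "$SERVICE_URL" "Qwen/Qwen2.5-7B-Instruct" && echo ok` prints `ok` only when the model answers successfully.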

7. Troubleshooting

7.1 CRs Not Being Created

If the custom resources (DataScienceCluster, Kuadrant, LeaderWorkerSetOperator) are not created after the Helm install:

Verify the CRDs exist:

kubectl get crd datascienceclusters.datasciencecluster.opendatahub.io
kubectl get crd kuadrants.kuadrant.io

Then run helm upgrade again. CRs are skipped until their CRDs exist:

helm upgrade --install rhoai ./chart \
  -f docs/examples/values-inference-only.yaml \
  -n opendatahub-gitops

7.2 Authorino TLS Issues

Check the service annotation:

kubectl get svc authorino-authorino-authorization -n kuadrant-system \
  -o jsonpath='{.metadata.annotations}'

Check the TLS secret:

kubectl get secret authorino-server-cert -n kuadrant-system

Verify Authorino CR has TLS enabled:

kubectl get authorino authorino -n kuadrant-system \
  -o jsonpath='{.spec.listener.tls}'

If the secret does not exist, re-run:

KUSTOMIZE_MODE=false ./scripts/prepare-authorino-tls.sh

7.3 Dependencies Not Being Installed

If a dependency operator is not installed:

- Verify that the component requiring the dependency has managementState: Managed (not Removed).
- Check that the dependency is not explicitly set to false in the component's dependencies.
- Verify that the top-level dependencies.<name>.enabled value is not set to false.

Additional Resources

- odh-gitops Repository: Full GitOps deployment for RHOAI/ODH
- Inference Only Stack Guide: Example Helm values and installation details
- Main Deployment Guide: Deploying on AKS, CoreWeave, and other Kubernetes platforms