Observability Dashboard

This document covers the observability stack for the MaaS Platform, including metrics collection, monitoring, and visualization.

Prerequisites

Before deploying the observability stack, ensure the following platform prerequisites are configured. Without these, metrics pipelines for dashboards and showback will not function.

User Workload Monitoring

User Workload Monitoring must be enabled for Prometheus to scrape metrics from MaaS components.

Required for metrics collection

Without User Workload Monitoring enabled, ServiceMonitors deployed by MaaS will not be processed and no metrics will be collected.

Step 1: Create or update the cluster-monitoring-config ConfigMap

The ConfigMap must exist in the openshift-monitoring namespace (not the MaaS application namespace). Cluster admin permissions are required.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

Apply with:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF

Step 2: Verify User Workload Monitoring is active

# Check that prometheus-user-workload pods are running
kubectl get pods -n openshift-user-workload-monitoring

# Expected output: prometheus-user-workload-0 and prometheus-user-workload-1 in Running state

If pods are not present, wait a few minutes for the monitoring operator to reconcile.
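
The Step 2 check can also be scripted. The following is a minimal Python sketch, assuming the pod list was captured with kubectl get pods -n openshift-user-workload-monitoring -o json. The function name pods_not_running and the sample payload are illustrative and not part of MaaS.

```python
import json

def pods_not_running(pod_list_json: str, prefix: str = "prometheus-user-workload") -> list[str]:
    """Return names of pods matching `prefix` whose phase is not Running."""
    items = json.loads(pod_list_json).get("items", [])
    return [
        item["metadata"]["name"]
        for item in items
        if item["metadata"]["name"].startswith(prefix)
        and item.get("status", {}).get("phase") != "Running"
    ]

# Illustrative payload shaped like `kubectl get pods -o json` output (made-up data).
sample = json.dumps({"items": [
    {"metadata": {"name": "prometheus-user-workload-0"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "prometheus-user-workload-1"}, "status": {"phase": "Pending"}},
]})
print(pods_not_running(sample))
```

An empty result means no matching pod is in a non-Running phase. It does not prove that the pods exist, so the pod count should be checked separately.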

Kuadrant Observability

The Kuadrant CR must have spec.observability.enable set to true so that the operator creates the PodMonitor that scrapes Limitador metrics.

Required for rate-limiting and usage metrics

Without Kuadrant observability enabled, metrics like authorized_hits, authorized_calls, and limited_calls will not be scraped into Prometheus. Dashboards showing token consumption and rate limiting will have no data.

Step 1: Enable observability on the Kuadrant CR

kubectl patch kuadrant kuadrant -n kuadrant-system --type=merge \
  -p '{"spec":{"observability":{"enable":true}}}'

Or edit the Kuadrant CR directly:

apiVersion: kuadrant.io/v1beta1
kind: Kuadrant
metadata:
  name: kuadrant
  namespace: kuadrant-system
spec:
  observability:
    enable: true  # Required for PodMonitor creation

Step 2: Verify the PodMonitor exists

# Check that the Kuadrant-created PodMonitor exists
kubectl get podmonitor -n kuadrant-system

# Expected: kuadrant-limitador-monitor should be listed

Step 3: Verify Limitador metrics are being scraped

# Query Prometheus for Limitador metrics
curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
  "https://thanos-querier-openshift-monitoring.<cluster>/api/v1/query?query=limitador_up"

# Should return data with limitador_up = 1
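
To automate the check, the JSON returned by the query endpoint can be inspected programmatically. This is a minimal sketch that assumes the standard Prometheus HTTP API instant-query response shape. The function name limitador_is_up and the sample payload are illustrative.

```python
import json

def limitador_is_up(query_response: str) -> bool:
    """True if the response holds at least one sample and every sample equals 1."""
    body = json.loads(query_response)
    if body.get("status") != "success":
        return False
    samples = body.get("data", {}).get("result", [])
    # Each instant-vector sample is {"metric": {...}, "value": [timestamp, "string value"]}.
    return bool(samples) and all(float(s["value"][1]) == 1.0 for s in samples)

# Made-up response body for illustration.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "limitador_up"}, "value": [1700000000.0, "1"]},
    ]},
})
print(limitador_is_up(sample))
```

An empty result list is treated as a failure, since it usually means the metric is not being scraped at all.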

Prerequisites Summary

Prerequisite | Namespace | How to Enable | Verification
User Workload Monitoring | openshift-monitoring | Create cluster-monitoring-config ConfigMap with enableUserWorkload: true | kubectl get pods -n openshift-user-workload-monitoring shows running Prometheus pods
Kuadrant Observability | kuadrant-system | Set spec.observability.enable: true on Kuadrant CR | kubectl get podmonitor -n kuadrant-system shows kuadrant-limitador-monitor

Cluster Admin Permissions

Configuring User Workload Monitoring requires cluster admin permissions to create ConfigMaps in the openshift-monitoring namespace. If you don't have these permissions, contact your cluster administrator.

RHOAI Dashboard Observability Tab

The RHOAI Dashboard includes a built-in Observability tab that displays Perses-based dashboards for platform monitoring. This is separate from the MaaS-specific Grafana dashboards described later in this document.

The following must be in place for the Observability tab to work:

  • Cluster Observability Operator (COO) and OpenTelemetry Operator — install both from OperatorHub
  • DSCI monitoring.metrics — see Platform Setup for DSCI configuration
  • observabilityDashboard: true on OdhDashboardConfig — see Feature Flags

Quick verification:

kubectl get csv -A | grep -E 'cluster-observability|opentelemetry'
kubectl get dscinitialization default-dsci -o jsonpath='{.spec.monitoring}' | jq .
kubectl get pods -n redhat-ods-monitoring | grep perses

For the full setup procedure, see Managing observability (RHOAI 3.4).

Overview

As part of Dev Preview, the MaaS Platform includes a basic observability stack that provides insight into system performance, usage patterns, and operational health.

Note

The observability stack will be enhanced in future releases.

The observability stack consists of:

  • Limitador: Rate limiting service that exposes usage and rate-limit metrics (with labels from TelemetryPolicy)
  • Authorino: Authentication/authorization service that exposes auth evaluation metrics (auth_server_*)
  • Istio Telemetry: Adds a subscription label to gateway latency metrics, enabling per-subscription latency tracking (P50/P95/P99)
  • vLLM / llm-d / Simulator: Expose inference metrics (TTFT, ITL, queue depth, token throughput, KV-cache usage); llm-d also exposes EPP routing metrics
  • Prometheus: Metrics collection and storage (uses OpenShift platform Prometheus)
  • ServiceMonitors: Deployed to configure Prometheus metric scraping
  • Visualization: Grafana dashboards (see Grafana documentation)

Component Metrics Status

Component | Exposes Metrics? | Scraped into Prometheus? | In Dashboards?
Limitador | Yes (/metrics) | Yes (Kuadrant PodMonitor or MaaS ServiceMonitor) | Yes: 16 panels use authorized_hits, authorized_calls, limited_calls, limitador_up
Authorino | Yes (/metrics + /server-metrics) | Yes: /metrics via Kuadrant operator; /server-metrics via MaaS authorino-server-metrics ServiceMonitor | Yes: Auth Evaluation Latency (P50/P95/P99), Auth Success/Deny Rate, plus pod-up check
Istio Gateway | Yes (Envoy /stats/prometheus) | Yes (istio-gateway-metrics ServiceMonitor) | Yes: latency histograms, request counts, error rates
maas-api | No (returns 404 on /metrics) | No | Only pod-up check via kube_pod_status_phase
vLLM / llm-d / Simulator | Yes (vLLM metrics on /metrics port 8000; llm-d EPP metrics on port 9090) | Yes: vLLM metrics via kserve-llm-models ServiceMonitor; EPP metrics require separate scrape config | Yes: TTFT, ITL, queue depth, latency, tokens, cache, prompt/generation ratio, queue wait time (EPP metrics not yet in MaaS dashboards)

maas-api Metrics Gap

The maas-api Go service does not expose a /metrics endpoint. Metrics such as API key creation rate, token issuance rate, model discovery latency, and request handler durations are not available in Prometheus. Adding Prometheus instrumentation (e.g. promhttp handler + application-specific counters/histograms) to the Go service is a recommended future improvement.

Installation

There are two ways to enable deployment-based observability:

  1. Operator-managed (recommended): Enable via Tenant CR
  2. Kustomize-based: Deploy manifests directly

Option 1: Operator-Managed Telemetry

When using the ODH/RHOAI operator, telemetry can be enabled via the Tenant CR (self-bootstrapped by maas-controller in the models-as-a-service namespace):

apiVersion: maas.opendatahub.io/v1alpha1
kind: Tenant
metadata:
  name: default-tenant
  namespace: models-as-a-service
spec:
  telemetry:
    enabled: true  # Enable TelemetryPolicy and Istio Telemetry
    metrics:
      captureOrganization: true
      captureUser: false      # Disabled by default (GDPR)
      captureGroup: false     # High cardinality
      captureModelUsage: true

Or patch an existing CR:

kubectl patch tenant default-tenant -n models-as-a-service --type=merge \
  -p '{"spec":{"telemetry":{"enabled":true}}}'

What the Tenant reconciler creates when telemetry.enabled: true:

Resource | Namespace | Purpose
TelemetryPolicy (maas-telemetry) | Gateway namespace | Adds user, subscription, and model labels to Limitador usage metrics
Istio Telemetry (latency-per-subscription) | Gateway namespace | Adds subscription label to gateway latency metrics

Prerequisites for Operator-Managed Telemetry

The Tenant reconciler telemetry feature requires:

  • OpenShift Service Mesh (Istio) 2.4+ — for Istio Telemetry CRD
  • Kuadrant/RHCL — for TelemetryPolicy CRD and AuthPolicy header injection
  • Gateway deployed — Telemetry targets the gateway via selector

The Tenant reconciler checks for CRD availability before creating resources. If a CRD is not present, that resource is silently skipped.

AuthPolicy Header Dependency

The Istio Telemetry reads the subscription value from the X-MaaS-Subscription header, which must be injected by AuthPolicy:

response:
  success:
    headers:
      X-MaaS-Subscription:
        plain:
          expression: 'auth.metadata.apiKeyValidation.subscription'

Without this header injection, the subscription label on latency metrics will be empty.

Verify the feature is working:

# Check Istio Telemetry was created
kubectl get telemetry -n openshift-ingress latency-per-subscription

# Query Prometheus for subscription label
curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
  "https://thanos-querier-openshift-monitoring.<cluster>/api/v1/label/subscription/values"

Option 2: Kustomize-Based Installation

Development/Testing Only

This is not the standard customer install. The production path is operator-managed via ODH/RHOAI (Option 1 above).

These Kustomize entrypoints exist so the team can install, iterate, and test observability without a full operator-driven deployment.

Kustomize Entrypoint Map

The observability stack is defined across multiple Kustomize directories. Each can be built and applied independently:

Entrypoint (kustomize build target) | What it deploys | Operator-owned equivalent
deployment/base/observability/ | TelemetryPolicy + Istio Telemetry (conditional ServiceMonitors applied only via script) | Operator installs TelemetryPolicy as part of the MaaS stack; Kuadrant operator owns ServiceMonitors when spec.observability.enable: true
deployment/components/observability/grafana/ | GrafanaDashboard CRs (Platform Admin, AI Engineer) | Operator does not manage Grafana dashboards; same CRs are used in both paths
deployment/components/observability/prometheus/ | Standalone Prometheus + RBAC + ServiceMonitors in llm-observability namespace | OpenShift User Workload Monitoring (built-in Prometheus); the operator path relies on this instead
deployment/components/observability/observability/ | Aggregator that pulls in prometheus/ above | Same as above; this is a convenience wrapper
deployment/components/observability/observability/dashboards/ | Perses PersesDashboard + Prometheus datasource | No operator equivalent; Perses is optional in both paths

Base observability resources (deployment/base/observability/):

Resource | Purpose
TelemetryPolicy (gateway-telemetry-policy.yaml) | Adds user, subscription, and model labels to Limitador metrics. The model label (from responseBodyJSON) is available on authorized_hits; authorized_calls and limited_calls carry user and subscription.
Istio Telemetry (istio-gateway-telemetry.yaml) | Adds subscription label to gateway latency (istio_request_duration_milliseconds_bucket) via the X-MaaS-Subscription header injected by the controller-generated AuthPolicy. Enables per-subscription latency tracking (P50/P95/P99).

Conditional resources (not in kustomization.yaml, applied only by install-observability.sh):

Resource | Applied when…
limitador-servicemonitor.yaml | No Kuadrant PodMonitor scraping Limitador /metrics
authorino-server-metrics-servicemonitor.yaml | No existing monitor scraping Authorino /server-metrics
istio-gateway-service.yaml | Gateway deployment exists in openshift-ingress
istio-gateway-servicemonitor.yaml | Gateway deployment exists (paired with the Service above)

Dry-Run Verification

Validate that every kustomization builds cleanly without applying to a cluster:

# Base telemetry (TelemetryPolicy + Istio Telemetry)
kustomize build deployment/base/observability \
  | kubectl apply --dry-run=client -f -

# Grafana dashboards
kustomize build deployment/components/observability/grafana \
  | kubectl apply --dry-run=client -f -

# Standalone Prometheus stack
kustomize build deployment/components/observability/prometheus \
  | kubectl apply --dry-run=client -f -

# Prometheus aggregator (same content as prometheus/ above)
kustomize build deployment/components/observability/observability \
  | kubectl apply --dry-run=client -f -

# Perses dashboards
kustomize build deployment/components/observability/observability/dashboards \
  | kubectl apply --dry-run=client -f -

CI Validation

CI runs scripts/ci/validate-manifests.sh on every PR that touches deployment/**. The script runs kustomize build against every kustomization.yaml file. Files that declare kind: Component are skipped, because they can only be built when included through a parent's components: field. Directories whose kustomization.yaml declares kind: Kustomization build standalone and are validated by the CI dry-run.
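
The skip rule can be illustrated with a short Python sketch. This is not the contents of validate-manifests.sh. The function name standalone_build_targets is hypothetical, and treating a file without an explicit kind as a Kustomization is an assumption.

```python
import re
import tempfile
from pathlib import Path

KIND_RE = re.compile(r"^kind:\s*(\S+)", re.MULTILINE)

def standalone_build_targets(root: Path) -> list[Path]:
    """Return directories whose kustomization.yaml can be built on its own."""
    targets = []
    for path in sorted(root.rglob("kustomization.yaml")):
        match = KIND_RE.search(path.read_text())
        # Assumption: a file without an explicit kind is treated as a Kustomization.
        kind = match.group(1) if match else "Kustomization"
        if kind != "Component":
            targets.append(path.parent)
    return targets

# Demonstration on a throwaway directory tree (directory names are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, kind in (("base", "Kustomization"), ("component", "Component")):
        (root / name).mkdir()
        (root / name / "kustomization.yaml").write_text(f"kind: {kind}\nresources: []\n")
    demo_targets = [p.name for p in standalone_build_targets(root)]
print(demo_targets)
```

Each returned directory would then be passed to kustomize build, while Component directories are exercised only through the parents that include them.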

Deploy to a Cluster

Quick deployment (recommended):

# Deploys base telemetry + conditional ServiceMonitors
./scripts/observability/install-observability.sh [--namespace NAMESPACE]

Manual deployment (step-by-step):

# 1. Base telemetry (requires Gateway + AuthPolicy to exist first)
kustomize build deployment/base/observability | kubectl apply -f -

# 2. Conditional ServiceMonitors (Limitador, Authorino, Gateway, LLM models)
#    Use the install script — it detects existing Kuadrant monitors to avoid duplicates:
./scripts/observability/install-observability.sh

# 3. Grafana dashboards (discovers Grafana instance cluster-wide)
./scripts/observability/install-grafana-dashboards.sh

# 4. (Optional) Standalone Prometheus — only if NOT using OpenShift User Workload Monitoring
kustomize build deployment/components/observability/observability | kubectl apply -f -

# 5. (Optional) Perses dashboards
kustomize build deployment/components/observability/observability/dashboards | kubectl apply -f -

The full deployment script applies the observability stack automatically:

./scripts/deploy.sh

Prerequisites

  • Tools: kubectl, kustomize, jq, yq must be installed
  • Cluster state: Gateway, AuthPolicy (gateway-auth-policy), and subscription selection must be deployed first. The AuthPolicy injects X-MaaS-Subscription, which Istio Telemetry reads to label latency by subscription. Without it, the subscription label on gateway latency will be empty.
  • Namespace: Use --namespace if your MaaS API is deployed to a namespace other than maas-api (e.g. --namespace opendatahub)

Operator vs Kustomize Drift Reference

When updating manifests in deployment/base/observability/ or deployment/components/observability/, check whether the operator path produces equivalent resources. Drift between the two is expected in some areas (e.g., the standalone Prometheus stack has no operator equivalent) but should be tracked for the telemetry CRs and ServiceMonitors.

Resource | Kustomize source | Operator creates? | Notes
TelemetryPolicy | base/observability/gateway-telemetry-policy.yaml | Yes | Keep in sync; labels must match
Istio Telemetry | base/observability/istio-gateway-telemetry.yaml | Yes | Keep in sync; header extraction must match
Limitador ServiceMonitor | base/observability/limitador-servicemonitor.yaml | Kuadrant PodMonitor when observability.enable: true | Script skips ours if Kuadrant's exists
Authorino /server-metrics | base/observability/authorino-server-metrics-servicemonitor.yaml | Not yet (Kuadrant only scrapes /metrics) | MaaS supplements the operator gap
Istio Gateway ServiceMonitor | base/observability/istio-gateway-servicemonitor.yaml | No | MaaS-only; applied conditionally
Grafana Dashboards | components/observability/grafana/ | No | Same CRs used in both paths
Standalone Prometheus | components/observability/prometheus/ | No (uses OpenShift UWM instead) | Dev/test only; not for production
Perses Dashboards | components/observability/observability/dashboards/ | No | Optional in both paths

Keeping Docs Accurate

When you change any YAML under deployment/base/observability/ or deployment/components/observability/:

  1. Verify the build still passes: kustomize build <dir> | kubectl apply --dry-run=client -f -
  2. Update the entrypoint map table above if you add, remove, or rename an entrypoint
  3. Update this document if the change affects user-facing instructions (metrics, ServiceMonitor behavior, dashboard content)

Metrics Collection

Limitador Metrics

Limitador exposes the following Prometheus metrics (verified against Limitador source code):

Core Limitador Metrics

Metric | Type | Labels | Description
limitador_up | Gauge | - | Limitador is running (1 = up)
datastore_partitioned | Gauge | - | Limitador is partitioned from backing datastore (0 = healthy)
datastore_latency | Histogram | - | Latency to the underlying counter datastore

MaaS Usage Metrics (Limitador + TelemetryPolicy)

When Kuadrant TelemetryPolicy and TokenRateLimitPolicy are applied, Limitador exposes these counters with custom labels injected by the wasm-shim from auth context and the model response body. These are the primary metrics for usage dashboards and chargeback:

Metric | Type | Labels | Description
authorized_hits | Counter | user, subscription, model, limitador_namespace | Total tokens consumed per request (from usage.total_tokens in the model response; input + output combined). The model label is extracted via responseBodyJSON("/model").
authorized_calls | Counter | user, subscription, limitador_namespace | Requests allowed (not rate-limited).
limited_calls | Counter | user, subscription, limitador_namespace | Requests denied due to token rate limits.

model label availability

The model label is currently available only on authorized_hits. The authorized_calls and limited_calls metrics carry user and subscription labels but not model, due to how the wasm-shim constructs the CEL evaluation context for these counters. This is a known upstream limitation tracked for improvement in Kuadrant.

Gateway latency is labeled by subscription only via Istio Telemetry (see Per-Subscription Latency Tracking); per-user latency is not exposed on the gateway histogram to keep cardinality bounded.

Authorino Metrics

Authorino exposes metrics on two separate endpoints:

Endpoint | Metrics | Scraped?
/metrics | Controller-runtime (reconcile counts, workqueue depth) | Yes (authorino-operator-monitor, provided by Kuadrant)
/server-metrics | Auth evaluation metrics (see below) | Yes (authorino-server-metrics, deployed by MaaS install-observability.sh)

Auth server metrics (exposed on /server-metrics, port 8080):

Metric | Type | Labels | Description
auth_server_authconfig_total | Counter | namespace, authconfig | Total AuthConfig evaluations
auth_server_authconfig_duration_seconds | Histogram | namespace, authconfig | Auth evaluation latency
auth_server_authconfig_response_status | Counter | namespace, authconfig, status | Auth response status per AuthConfig (OK, denied, etc.)
auth_server_response_status | Counter | status | Aggregate auth response status across all AuthConfigs
grpc_server_handled_total | Counter | grpc_method, grpc_code | gRPC requests handled
grpc_server_handling_seconds | Histogram | grpc_method | gRPC request latency
grpc_server_msg_received_total | Counter | grpc_method | gRPC messages received
grpc_server_msg_sent_total | Counter | grpc_method | gRPC messages sent
grpc_server_started_total | Counter | grpc_method | gRPC requests started

MaaS ServiceMonitor

The Kuadrant-provided authorino-operator-monitor only scrapes /metrics (controller-runtime stats). MaaS deploys an additional authorino-server-metrics ServiceMonitor to scrape /server-metrics for auth evaluation metrics. This is deployed automatically by install-observability.sh.

Lazily registered metrics

Authorino upstream documents additional per-evaluator metrics (auth_server_evaluator_total, auth_server_evaluator_duration_seconds, auth_server_evaluator_cancelled, auth_server_evaluator_denied). These are lazily registered and only appear when specific evaluator types (e.g. OPA, HTTP authorization) are triggered. The MaaS AuthPolicy uses kubernetesTokenReview, which does not emit these metrics. They are not listed in the table above because they are not present in a standard MaaS deployment.

vLLM / Model Server Metrics

MaaS supports three model serving backends that expose Prometheus metrics on /metrics (port 8000), scraped by the kserve-llm-models ServiceMonitor:

  • vLLM (current stable) — full-featured LLM inference server
  • llm-d — llm-d inference platform (runs vLLM as backend + EPP routing layer)
  • llm-d-inference-sim (v0.8.2) — lightweight simulator for testing without GPUs

Supported versions:

Backend Minimum Version Sample Manifests
vLLM v0.7.x stable
llm-d v0.1.x
llm-d-inference-sim v0.8.2 docs/samples/models/simulator/

vLLM Metrics (port 8000)

All three backends expose vllm:-prefixed metrics. The table below shows which metrics each backend provides.

Metric Type Simulator vLLM llm-d Description
vllm:num_requests_running Gauge Y Y Y Requests currently being processed
vllm:num_requests_waiting Gauge Y Y Y Requests queued waiting for processing
vllm:e2e_request_latency_seconds Histogram Y Y Y End-to-end inference latency
vllm:time_to_first_token_seconds Histogram Y Y Y Time to First Token (TTFT)
vllm:request_prompt_tokens Histogram Y Y Y Per-request prompt token counts (_sum gives cumulative total)
vllm:request_generation_tokens Histogram Y Y Y Per-request generation token counts (_sum gives cumulative total)
vllm:inter_token_latency_seconds Histogram Y Y Y Inter-Token Latency (ITL)
vllm:kv_cache_usage_perc Gauge Y Y Y KV-cache usage (0-1)
vllm:prompt_tokens_total Counter Y Y Y Total prompt tokens processed
vllm:generation_tokens_total Counter Y Y Y Total generation tokens processed
vllm:request_queue_time_seconds Histogram Y Y Time requests wait in queue before processing (vLLM/llm-d only)
vllm:request_success_total Counter Y Y Y Successful requests (_total suffix added by prometheus_client)
vllm:request_prefill_time_seconds Histogram Y Y Y Time spent in prefill (prompt processing) phase
vllm:request_decode_time_seconds Histogram Y Y Y Time spent in decode (token generation) phase
vllm:request_inference_time_seconds Histogram Y Total inference time (simulator-specific)
vllm:request_params_max_tokens Histogram Y Distribution of max_tokens request parameter
vllm:max_num_generation_tokens Histogram Y Max generation tokens per request
vllm:lora_requests_info Gauge Y LoRA adapter request info
vllm:cache_config_info Gauge Y Cache configuration info (simulator-specific)
vllm:time_per_output_token_seconds Histogram Y Legacy ITL name (kept by simulator for backward compat; not used by dashboards)

Simulator metric alignment

As of v0.7.1 (still true in v0.8.x), the simulator fully aligns with current vLLM metric names (kv_cache_usage_perc, inter_token_latency_seconds, prompt_tokens_total, generation_tokens_total). Older simulator versions (v0.6.x) used different names (gpu_cache_usage_perc, time_per_output_token_seconds) and are no longer supported by MaaS dashboards. The simulator also exposes additional metrics not used by MaaS dashboards (e.g. request_inference_time_seconds, request_params_max_tokens).

Lazily registered metrics

Some vLLM/simulator metrics are lazily registered: they only appear in /metrics output after the first event that triggers them. For example, request_queue_time_seconds (on real vLLM) only appears after a request actually queues (when max-num-seqs is exceeded). Similarly, histograms such as e2e_request_latency_seconds only appear after the first inference request completes. Dashboard panels will show "No Data" until sufficient traffic has been generated. This is normal Prometheus client behavior, not a configuration issue.

Counter _total suffix

vLLM code defines counters as vllm:prompt_tokens and vllm:generation_tokens, but the Python prometheus_client library appends _total when exposing metrics. The actual scraped metric names in Prometheus are vllm:prompt_tokens_total and vllm:generation_tokens_total. The llm-d official dashboard confirms this by using the _total form.
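
The naming rule can be modeled in a few lines. The sketch below only mimics the suffix behavior described above and does not import prometheus_client. The helper name exposed_counter_name is hypothetical.

```python
def exposed_counter_name(declared_name: str) -> str:
    """Name under which a counter declared as `declared_name` is scraped."""
    suffix = "_total"
    # The suffix is appended only once, even if the declared name already ends with it.
    if declared_name.endswith(suffix):
        declared_name = declared_name[: -len(suffix)]
    return declared_name + suffix

for name in ("vllm:prompt_tokens", "vllm:generation_tokens"):
    print(name, "->", exposed_counter_name(name))
```

When writing PromQL against vLLM counters, use the name this rule produces, not the name declared in the vLLM source.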

llm-d EPP (Endpoint Picker) Metrics

When using llm-d, the inference gateway's Endpoint Picker (EPP) exposes additional routing and scheduling metrics on a separate port (9090). These are complementary to vLLM metrics and require a separate ServiceMonitor:

Metric | Type | Description
inference_model_request_total | Counter | Total inference requests per model
inference_model_request_error_total | Counter | Total errored requests per model
inference_model_request_duration_seconds | Histogram | Request duration through the EPP
inference_model_input_tokens | Counter | Input tokens routed per model
inference_model_output_tokens | Counter | Output tokens routed per model
inference_model_running_requests | Gauge | Currently running requests per model
inference_pool_average_kv_cache_utilization | Gauge | Average KV-cache utilization across the pool
inference_pool_average_queue_size | Gauge | Average queue size across the pool
inference_pool_ready_pods | Gauge | Number of ready pods in the inference pool

EPP metrics not yet in MaaS dashboards

EPP metrics are not currently scraped or visualized by MaaS. When deploying llm-d with the EPP, refer to the llm-d monitoring docs and the inference gateway dashboard for EPP-specific visualization.

Input/Output Token Split

vLLM metrics provide input vs output token breakdown per model (vllm:prompt_tokens_total / vllm:generation_tokens_total counters, or vllm:request_prompt_tokens / vllm:request_generation_tokens histograms). However, these do not carry user or subscription labels. For per-user billing with input/output split, upstream changes to the Kuadrant wasm-shim are required (see Known Limitations).

Dashboard Metric Queries

Dashboard panels use the histogram _sum series as their primary data source. All queries work across vLLM, llm-d, and llm-d-inference-sim v0.8.2:

Panel | PromQL metric
Tokens (1h) | request_prompt_tokens_sum + request_generation_tokens_sum
Token Throughput | rate(request_prompt_tokens_sum), rate(request_generation_tokens_sum)
Prompt/Gen Ratio | rate(request_prompt_tokens_sum) / total
ITL | inter_token_latency_seconds_bucket
KV Cache | kv_cache_usage_perc
Queue Wait Time | request_queue_time_seconds_bucket (vLLM/llm-d only)
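
Panels built on _bucket series, such as ITL, derive percentiles from cumulative bucket counts. The sketch below is a simplified Python rendering of the linear interpolation that PromQL's histogram_quantile() applies to classic histograms. The bucket bounds and counts are made up for illustration, and edge cases handled by Prometheus (for example negative bounds) are omitted.

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan

# Made-up cumulative counts for an inter-token latency histogram (seconds).
itl_buckets = [(0.01, 20), (0.05, 80), (0.1, 95), (math.inf, 100)]
for q in (0.5, 0.95, 0.99):
    print(f"P{int(q * 100)}: {histogram_quantile(q, itl_buckets):.3f}s")
```

Because the estimate is interpolated within a bucket, its precision is limited by the bucket boundaries the backend exposes.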

See the vLLM metrics documentation for the full vLLM metric list and deprecation policy, and the llm-d monitoring documentation for llm-d-specific setup.

ServiceMonitor Configuration

ServiceMonitors are deployed by install-observability.sh to configure OpenShift's Prometheus to discover and scrape metrics from MaaS components.

Automatically Deployed:

  • Istio Gateway: Scrapes Envoy metrics from the MaaS gateway in openshift-ingress (deployed if the gateway exists)
  • KServe LLM Models: Scrapes vLLM metrics from model pods in the llm namespace (deployed if the llm namespace exists)

Conditionally Deployed (auto-detected by install-observability.sh):

  • Limitador (limitador-servicemonitor.yaml): Scrapes rate-limiting metrics from Limitador pods in kuadrant-system. Skipped when Kuadrant's own PodMonitor is already present. When Kuadrant CR has spec.observability.enable: true, the operator creates its own kuadrant-limitador-monitor PodMonitor that scrapes the same Limitador pod. Deploying both would cause duplicate metrics.
  • Authorino Server Metrics (authorino-server-metrics-servicemonitor.yaml): Scrapes auth evaluation metrics from Authorino's /server-metrics endpoint in kuadrant-system. Skipped if a Kuadrant-provided monitor already scrapes /server-metrics. This collects auth_server_authconfig_duration_seconds, auth_server_authconfig_response_status, and other auth server metrics that are not scraped by the Kuadrant-provided authorino-operator-monitor (which only covers /metrics for controller-runtime stats).
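
The auto-detection described above amounts to a small decision table. The following Python sketch expresses it using the manifest names from this document. It is an illustration under assumed inputs, not the logic of install-observability.sh, and the function name monitors_to_apply is hypothetical.

```python
def monitors_to_apply(existing_podmonitors: set[str],
                      server_metrics_scraped: bool,
                      gateway_exists: bool) -> list[str]:
    """Return the conditional manifests that would be applied for a cluster state."""
    manifests = []
    # Skip the MaaS Limitador monitor when Kuadrant already scrapes the same pod.
    if "kuadrant-limitador-monitor" not in existing_podmonitors:
        manifests.append("limitador-servicemonitor.yaml")
    # Skip when some other monitor already scrapes Authorino /server-metrics.
    if not server_metrics_scraped:
        manifests.append("authorino-server-metrics-servicemonitor.yaml")
    # The gateway Service and ServiceMonitor are applied as a pair.
    if gateway_exists:
        manifests += ["istio-gateway-service.yaml", "istio-gateway-servicemonitor.yaml"]
    return manifests

# Cluster with Kuadrant observability enabled and a deployed gateway.
print(monitors_to_apply({"kuadrant-limitador-monitor"}, False, True))
```

Checking for the existing monitor before applying is what prevents the same Limitador pod from being scraped twice.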

Already Provided by Kuadrant (when observability.enable: true):

  • Limitador PodMonitor (kuadrant-limitador-monitor): Created by the Kuadrant operator
  • Authorino Operator Monitor (authorino-operator-monitor): Scrapes Authorino controller metrics from /metrics only

Authorino Metrics Coverage

The Kuadrant-provided authorino-operator-monitor only scrapes /metrics (controller-runtime stats). The MaaS authorino-server-metrics ServiceMonitor supplements this by scraping /server-metrics for auth evaluation metrics (auth_server_authconfig_duration_seconds, auth_server_authconfig_response_status, etc.). The install-observability.sh script auto-detects whether a Kuadrant-provided monitor already scrapes /server-metrics and skips deploying the MaaS ServiceMonitor to avoid duplicates. See Authorino Observability for details.

High Availability for MaaS Metrics

For production deployments where metric persistence across pod restarts and scaling events is critical, configure Limitador to use Redis as its storage backend.

Why High Availability Matters

By default, Limitador stores rate-limiting counters in memory, which means:

  • All hit counts are lost when pods restart
  • Metrics reset when pods are rescheduled or scaled down
  • No persistence across cluster maintenance or updates

Setting Up Persistent Metrics

To enable persistent metric counts, refer to the detailed guide:

Configuring Redis storage for rate limiting

This Red Hat documentation provides:

  • Step-by-step Redis configuration for OpenShift
  • Secret management for Redis credentials
  • Limitador custom resource updates
  • Production-ready setup instructions

For local development and testing, you can also use our Limitador Persistence guide which includes a basic Redis setup script that works with any Kubernetes cluster.

Visualization

For dashboard visualization options, see:

Included Dashboards

MaaS includes two Grafana dashboards for different personas:

Platform Admin Dashboard

Provides a comprehensive view of system health, usage across all users, and resource allocation:

Section | Metrics
Component Health | Limitador up, Authorino pods, MaaS API pods, Gateway pods, Firing Alerts
Key Metrics | Total Tokens, Total Requests, Token Rate, Request Rate, Inference Success Rate, Active Users, P50 Response Latency, Rate Limit Ratio
Auth Evaluation | Auth Evaluation Latency (P50/P95/P99), Auth Success/Deny Rate
Traffic Analysis | Token/Request Rate by Model, Error Rates (4xx excl. 429, 5xx, 429 Rate Limited), Token/Request Rate by Subscription, P95 Latency
Error Breakdown | Rate Limited Requests, Unauthorized Requests
Model Metrics | vLLM queue depth, inference latency, KV cache usage, token throughput, prompt vs generation token ratio, queue wait time, TTFT, ITL
Top Users | By token usage, by declined requests
Detailed Breakdown | Token Rate by User, Request Volume by User & Subscription
Resource Allocation | CPU/Memory/GPU per model pod

Template Variables

The Platform Admin dashboard uses Grafana template variables for namespace filtering instead of hardcoded values. This allows the dashboard to adapt to different deployment configurations:

Variable | Default | Description
$datasource | prometheus | Prometheus datasource
$maas_namespace | auto-detected | MaaS API namespace (auto-detected from kube_pod_info{pod=~"maas-api.*"})
$kuadrant_namespace | kuadrant-system | Kuadrant components namespace
$gateway_namespace | openshift-ingress | Istio/Gateway namespace
$llm_namespace | llm | LLM model pods namespace
$model | All | Filter by model name

To customize for your environment, change the variable values in Grafana's dashboard settings (gear icon → Variables).

AI Engineer Dashboard

Personal usage view for individual developers:

Section | Metrics
Usage Summary | My Total Tokens, My Total Requests, Token Rate, Request Rate, Rate Limit Ratio, Inference Success Rate
Usage Trends | Token Usage by Model, Usage Trends (tokens vs rate limited)
Detailed Analysis | Token Volume by Model, Rate Limited by Subscription

Inference Success Rate

Both dashboards use rate() on vLLM counters (request_success_total, e2e_request_latency_seconds_count) instead of raw counter values. This handles pod restarts correctly: each counter resets independently when its pod restarts, so dividing raw counter values produces incorrect results. When no traffic is present, rate()/rate() produces NaN; the dashboards use ((ratio) >= 0) OR vector(1) to filter NaN and default to 100% (healthy) when no traffic exists.
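
Both behaviors can be reproduced outside Prometheus. The Python sketch below is a simplified model: increase() here detects counter resets but skips the extrapolation that PromQL performs, and success_ratio() mirrors the NaN filter with a default of 1. The sample values are made up.

```python
import math

def increase(samples: list[float]) -> float:
    """Reset-aware increase of a counter over consecutive samples."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the pod restarted and the counter began again from zero.
        total += cur - prev if cur >= prev else cur
    return total

def success_ratio(success_increase: float, total_increase: float) -> float:
    """Success ratio that defaults to 1.0 (healthy) when there was no traffic."""
    ratio = success_increase / total_increase if total_increase else math.nan
    # NaN >= 0 is False, so the no-traffic case falls through to the default.
    return ratio if ratio >= 0 else 1.0

# The counter restarts between the second and third sample.
print(increase([10, 15, 3, 8]))
print(success_ratio(9.0, 10.0), success_ratio(0.0, 0.0))
```

Dividing the last raw values directly would ignore the restart, which is why the dashboards work from rates.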

Tokens vs Requests

Both dashboards show token consumption (authorized_hits) for billing/cost tracking and request counts (authorized_calls) for capacity planning. Blue panels indicate request metrics; green panels indicate token metrics.

Per-User Token Billing

The Platform Admin dashboard shows token consumption aggregated by subscription and model for system-level visibility. Per-user token consumption for billing is available via:

  • AI Engineer dashboard: Individual users see their own token usage
  • Prometheus API: Query sum by (user) (increase(authorized_hits[24h])) for billing periods
  • RFE: A dedicated /maas-api/v1/usage chargeback API endpoint is recommended for production billing workflows
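As a rough illustration of the Prometheus API option, the sketch below builds an instant-query URL for the billing expression and turns a response body into a per-user mapping. It relies only on the standard Prometheus HTTP API (/api/v1/query). The base URL, the function names, and the sample response are assumptions made up for this example; a real cluster will normally also require authentication (for example a bearer token), which is omitted here.

```python
import json
from typing import Dict
from urllib.parse import urlencode

# Billing expression from the list above: tokens per user over the last 24h.
BILLING_QUERY = "sum by (user) (increase(authorized_hits[24h]))"


def build_query_url(base_url: str, query: str = BILLING_QUERY) -> str:
    """Return the instant-query URL for the given Prometheus base URL."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def tokens_per_user(response_body: str) -> Dict[str, float]:
    """Extract {user: tokens} from an instant-query (vector) response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    usage: Dict[str, float] = {}
    for sample in payload["data"]["result"]:
        _timestamp, value = sample["value"]  # value is a string in the API
        usage[sample["metric"].get("user", "")] = float(value)
    return usage


# Hypothetical endpoint and hand-written response, for illustration only.
url = build_query_url("https://prometheus.example.com")
sample_response = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"user": "alice"}, "value": [1700000000, "1200"]},
        {"metric": {"user": "bob"}, "value": [1700000000, "350.5"]},
    ]},
})
print(url)
print(tokens_per_user(sample_response))
```

Because increase() extrapolates over the window, the returned values can be fractional even though tokens are counted as integers.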

Prerequisites

  • Grafana must be installed (for example via your observability team's process, a centralized instance, or the Grafana Operator). The dashboard helper does not install Grafana; it only deploys MaaS dashboard definitions. It never fails: if no Grafana instance or more than one is found, it prints a warning and exits without error.
  • Ensure the Grafana instance has label app=grafana so MaaS dashboard definitions attach.
  • Configure a Prometheus or Thanos datasource in Grafana; the MaaS dashboards use the default Prometheus datasource.

Deploying Dashboards

Monitoring is installed by install-observability.sh. Dashboards are installed by a separate helper that discovers Grafana cluster-wide:

./scripts/observability/install-grafana-dashboards.sh

Behavior: The script scans for Grafana CRs cluster-wide. If exactly one instance is found, it deploys the dashboards to that instance's namespace and prints a success message. If none or several are found, it prints a warning (listing the instances when there are several) and exits without error. Use flags to target a specific instance:

./scripts/observability/install-grafana-dashboards.sh --grafana-namespace maas-api
./scripts/observability/install-grafana-dashboards.sh --grafana-label app=grafana

To deploy only the dashboard manifests manually (same namespace as your Grafana):

kustomize build deployment/components/observability/grafana | \
  sed "s/namespace: maas-api/namespace: <your-namespace>/g" | \
  kubectl apply -f -

Sample Dashboard JSON

A sample dashboard JSON file is available for manual import.

To import into Grafana:

  1. Go to Grafana → Dashboards → Import
  2. Upload the JSON file or paste content
  3. Select your Prometheus datasource

Key Metrics Reference

Token and Request Metrics

| Metric | Description | Labels |
|---|---|---|
| authorized_hits | Total tokens consumed (input + output combined, from usage.total_tokens in model responses) | user, subscription, model |
| authorized_calls | Total requests allowed | user, subscription |
| limited_calls | Total requests rate-limited | user, subscription |

When to use which metric

  • Billing/Cost: Use authorized_hits - represents actual token consumption, with model label for per-model breakdown
  • API Usage: Use authorized_calls - represents number of API calls (per user, per subscription)
  • Rate Limiting: Use limited_calls - shows quota violations (per user, per subscription)

Total tokens only (input/output split not yet available)

Token consumption is reported as total tokens (prompt + completion) per request. The pipeline reads usage.total_tokens from the model response via Kuadrant's TokenRateLimitPolicy. Separate input-token (prompt_tokens) and output-token (completion_tokens) counters are not yet available at the gateway level; this would require upstream changes in the Kuadrant wasm-shim to send separate hits_addend values for each token type. Chargeback and usage tracking per user, per subscription, and per model are supported using authorized_hits.

Latency Metrics

| Metric | Description | Labels |
|---|---|---|
| istio_request_duration_milliseconds_bucket | Gateway-level latency histogram | destination_service_name, subscription |
| vllm:e2e_request_latency_seconds | Model inference latency | model_name |

Per-Subscription Latency Tracking

The MaaS Platform uses an Istio Telemetry resource to add a subscription dimension to gateway latency metrics. This enables tracking request latency per subscription (e.g. free, premium, enterprise). Gateway latency is labeled by subscription only (not by user) to keep metric cardinality bounded and to align with latency-by-subscription requirements (e.g. P50/P95/P99 per subscription). Per-user metrics remain available from Limitador (authorized_hits, authorized_calls, limited_calls).

How it works:

  1. The gateway-auth-policy injects the X-MaaS-Subscription header from the resolved subscription
  2. The Istio Telemetry resource extracts this header and adds it as a subscription label to the REQUEST_DURATION metric
  3. Prometheus scrapes these metrics from the Istio gateway

Configuration (deployment/base/observability/istio-gateway-telemetry.yaml):

apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: latency-per-subscription
  namespace: openshift-ingress
spec:
  selector:
    matchLabels:
      gateway.networking.k8s.io/gateway-name: maas-default-gateway
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: REQUEST_DURATION
        mode: CLIENT_AND_SERVER
      tagOverrides:
        subscription:
          operation: UPSERT
          value: request.headers["x-maas-subscription"]

Security

The X-MaaS-Subscription header should be injected server-side by AuthPolicy. Ensure your AuthPolicy injects this header from the subscription selection (not client input) for accurate metrics attribution.

Common Queries

Token-based queries (billing/cost):

# Total tokens consumed per user (cumulative counter value; use increase() over a window for a billing period)
sum by (user) (authorized_hits)

# Token consumption rate per model (tokens/sec)
sum by (model) (rate(authorized_hits[5m]))

# Top 10 users by tokens consumed
topk(10, sum by (user) (authorized_hits))

# Token consumption by subscription
sum by (subscription) (authorized_hits)

Request-based queries (capacity/usage):

# Total requests per user
sum by (user) (authorized_calls)

# Request rate per subscription (requests/sec)
sum by (subscription) (rate(authorized_calls[5m]))

# Top 10 users by request count
topk(10, sum by (user) (authorized_calls))

Inference success rate (system health — did requests that reached the model succeed?):

# Inference success rate using rate() to handle counter resets correctly
# The >= 0 filter removes NaN (0/0 when no traffic), falling back to vector(1) = 100%
((sum(rate(vllm:request_success_total[5m])) / sum(rate(vllm:e2e_request_latency_seconds_count[5m]))) >= 0) OR vector(1)

Rate limiting metrics (capacity planning — are users exceeding their quotas?):

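# Rate limit ratio (fraction of requests rejected by rate limiting, 0 to 1)
(sum(limited_calls) / (sum(authorized_calls) + sum(limited_calls))) OR vector(0)

# Rate limit ratio by subscription
# Note: vector(0) carries no labels, so when the left side returns labeled
# per-subscription series the result also contains one extra unlabeled 0 series.
(sum by (subscription) (limited_calls) / (sum by (subscription) (authorized_calls) + sum by (subscription) (limited_calls))) OR vector(0)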

# Rate limit violations per second by subscription
sum by (subscription) (rate(limited_calls[5m]))

# Users hitting rate limits most
topk(10, sum by (user) (limited_calls))
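The rate limit ratio queries above compute limited / (authorized + limited). The Python sketch below works through that arithmetic with made-up counter values. The function and variable names are hypothetical, and returning 0.0 for an empty total only stands in for the OR vector(0) fallback that applies when no series exist yet.

```python
from typing import Dict, Tuple


def rate_limit_ratio(authorized_calls: float, limited_calls: float) -> float:
    """Fraction of requests rejected: limited / (authorized + limited)."""
    total = authorized_calls + limited_calls
    if total == 0:
        # Stands in for the OR vector(0) fallback when no series exist yet.
        return 0.0
    return limited_calls / total


# Made-up counter values: subscription -> (authorized_calls, limited_calls)
sample_counters: Dict[str, Tuple[float, float]] = {
    "free": (900.0, 100.0),
    "premium": (4000.0, 0.0),
}

per_subscription = {
    subscription: rate_limit_ratio(authorized, limited)
    for subscription, (authorized, limited) in sample_counters.items()
}
overall = rate_limit_ratio(
    sum(authorized for authorized, _ in sample_counters.values()),
    sum(limited for _, limited in sample_counters.values()),
)
print(per_subscription)  # free: 100 of 1000 requests rejected
print(overall)           # 100 of 5000 requests rejected overall
```

Multiply the result by 100 to display it as a percentage.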

Latency metrics (per-subscription SLA tracking):

# P99 latency per subscription
histogram_quantile(0.99, sum by (subscription, le) (rate(istio_request_duration_milliseconds_bucket{subscription!=""}[5m])))

# P50 latency per subscription
histogram_quantile(0.5, sum by (subscription, le) (rate(istio_request_duration_milliseconds_bucket{subscription!=""}[5m])))

Latency queries:

# P99 latency by service
histogram_quantile(0.99, sum by (destination_service_name, le) (rate(istio_request_duration_milliseconds_bucket[5m])))

# P50 (median) latency
histogram_quantile(0.5, sum by (le) (rate(istio_request_duration_milliseconds_bucket[5m])))

Filtering by subscription

For per-subscription latency queries, use subscription!="" to exclude requests where the X-MaaS-Subscription header was not injected. The Limitador counters authorized_hits and authorized_calls need no such filter, because they only count requests that were allowed through.
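The latency queries above rely on histogram_quantile(), which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The Python sketch below is a simplified model of that calculation for classic (le-labeled) histograms, not Prometheus's actual implementation. The bucket bounds and counts are made up for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative buckets.

    buckets holds (le upper bound, cumulative count or rate) pairs and must
    include the +Inf bucket, as classic Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Made-up gateway latency buckets in milliseconds: (le, cumulative count)
sample_buckets = [(50.0, 60.0), (100.0, 90.0), (250.0, 99.0), (500.0, 100.0), (math.inf, 100.0)]
p50 = histogram_quantile(0.5, sample_buckets)   # interpolated inside the 0-50 ms bucket
p95 = histogram_quantile(0.95, sample_buckets)  # interpolated inside the 100-250 ms bucket
print(round(p50, 2), round(p95, 2))
```

Because the estimate is interpolated, its accuracy depends on how finely the bucket boundaries cover the latency range of interest.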

Maintenance

Grafana Datasource Token Rotation

The Grafana datasource uses a ServiceAccount token to authenticate with Prometheus. This token is valid for 30 days and must be rotated periodically.

To rotate the token:

# Delete the existing datasource and create a new one (or rotate the token per your Grafana setup).
# To re-deploy only MaaS dashboard definitions: ./scripts/observability/install-grafana-dashboards.sh

Production Recommendation

For production deployments, consider automating token rotation using a CronJob or external secrets operator to avoid dashboard outages.
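An automated rotation job first has to decide whether the current token is close to expiry. The Python sketch below shows one way to do that, assuming the ServiceAccount token is a JWT that carries an exp claim (tokens without one are not handled). The function names and the five-day margin are hypothetical, and the payload is decoded without verifying the signature, so this is only suitable for inspecting your own token.

```python
import base64
import json
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def token_expiry(jwt_token: str) -> int:
    """Return the exp claim (Unix seconds) from a JWT, without verifying it."""
    payload_b64 = jwt_token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return int(claims["exp"])


def needs_rotation(jwt_token: str, now: Optional[float] = None, margin_days: int = 5) -> bool:
    """True when the token expires within margin_days (or has already expired)."""
    current = time.time() if now is None else now
    return token_expiry(jwt_token) - current < margin_days * SECONDS_PER_DAY


def fake_token(exp: int) -> str:
    """Build an unsigned, illustration-only JWT with the given exp claim."""
    def encode(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{encode({'alg': 'none'})}.{encode({'exp': exp})}."


issued_at = 1_700_000_000
token = fake_token(issued_at + 30 * SECONDS_PER_DAY)  # 30-day validity
print(needs_rotation(token, now=issued_at))                         # fresh token
print(needs_rotation(token, now=issued_at + 26 * SECONDS_PER_DAY))  # 4 days left
```

A CronJob could run such a check daily and trigger the rotation procedure for your Grafana setup only when it reports that rotation is due.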

Known Limitations

Currently Blocked Features

Some features require upstream changes and are currently blocked:

| Feature | Blocker | Workaround |
|---|---|---|
| model label on authorized_calls / limited_calls | Kuadrant wasm-shim does not pass responseBodyJSON context for these counters | Use authorized_hits for per-model breakdown; authorized_calls/limited_calls support per-user and per-subscription |
| Input/output token split | Kuadrant TokenRateLimitPolicy sends a single hits_addend (total tokens); no mechanism for separate prompt/completion counters | Total tokens available via authorized_hits; the response body contains usage.prompt_tokens and usage.completion_tokens but the wasm-shim does not split them |
| Input/output token breakdown per user | vLLM does not label its own metrics with user | Total tokens per user available via authorized_hits{user="..."}; vLLM prompt/generation token metrics are per-model only |
| Rate-limited requests not in Istio metrics | When the Kuadrant WASM plugin rejects a request (429), it calls sendLocalReply() which short-circuits the Envoy filter chain. These requests appear in Limitador metrics (limited_calls) but may not appear in Istio gateway metrics. | Use limited_calls from Limitador for rate-limiting visibility (has correct subscription and user labels). |
| Kuadrant policy health metrics | kuadrant_policies_enforced, kuadrant_policies_total etc. are defined in Kuadrant dev but not yet shipped in RHCL 1.x | Enable observability.enable: true on the Kuadrant CR; the ServiceMonitors are created but policy-specific gauges will appear in a future operator release |
| Authorino auth server metrics (upstream) | The Kuadrant-provided authorino-operator-monitor only scrapes /metrics (controller-runtime); /server-metrics is not scraped by the upstream operator | Resolved by MaaS: The authorino-server-metrics ServiceMonitor (deployed by install-observability.sh) scrapes /server-metrics. Auth evaluation latency and success/deny rate are visualized in the Platform Admin dashboard. |
| maas-api application metrics | The maas-api Go service does not expose a /metrics endpoint | No workaround available. Metrics such as API key creation rate, token issuance rate, model discovery latency, and handler durations require adding Prometheus instrumentation to the Go service (e.g. promhttp handler, custom counters/histograms). |
| PromQL "name does not end in _total" warnings | Limitador metrics (authorized_hits, authorized_calls, limited_calls) and Authorino's auth_server_authconfig_response_status are counters but do not follow the Prometheus naming convention of ending in _total. When rate() is applied, Prometheus generates a warning that Grafana displays on panels. This is Grafana issue #84636 (open). | The warnings are cosmetic and do not affect data correctness. All dashboard queries correctly apply rate() or increase() to these counters. The metric names are defined by upstream Kuadrant (Limitador) and Authorino; renaming requires upstream changes. |

Total Tokens vs Token Breakdown

Total token consumption per user is available via authorized_hits{user="..."}. The blocked feature is the input/output split (prompt vs generation tokens) at the gateway level, which requires the wasm-shim to send two separate counter updates to Limitador.

Available Per-User and Per-Subscription Metrics

| Feature | Metric | Label |
|---|---|---|
| Latency per subscription | istio_request_duration_milliseconds_bucket | subscription |
| Token consumption per user | authorized_hits | user |
| Token consumption per subscription | authorized_hits | subscription |
| Token consumption per model | authorized_hits | model |
| Requests per user | authorized_calls | user |
| Requests per subscription | authorized_calls | subscription |
| Rate limited per user | limited_calls | user |
| Rate limited per subscription | limited_calls | subscription |

Requirements Alignment

| Requirement | Status | Notes |
|---|---|---|
| Usage dashboards (token consumption per user, per subscription, per model) | Met | Grafana dashboard + authorized_hits with user, subscription, model; Prometheus scrapes Limitador /metrics. |
| Latency by subscription (P50/P95/P99) | Met | istio_request_duration_milliseconds_bucket with subscription label; subscription-only avoids unbounded cardinality. |
| Request tracking (per user, per subscription) | Met | authorized_calls with user and subscription labels; limited_calls for rate-limit violations. |
| Export for chargeback (CSV/API) | Not provided (RFE) | Per-user token data exists in Prometheus (authorized_hits{user="..."}) but no dedicated billing API or export endpoint is implemented. RFE recommendation: Add /maas-api/v1/usage endpoint that queries Prometheus and returns per-user, per-subscription, per-model token consumption in CSV/JSON for finance and chargeback systems. |
| Input/output token split | Not available | Only total tokens (authorized_hits); separate input and output counters require upstream Kuadrant wasm-shim changes to send split hits_addend values. |
| model label on request/rate-limit counters | Partial | model available on authorized_hits only; requires upstream Kuadrant fix to propagate responseBodyJSON context to authorized_calls/limited_calls counters. |
| Policy enforcement health | Future | Kuadrant operator metrics (kuadrant_policies_enforced, kuadrant_ready, etc.) defined upstream but not yet shipped in RHCL 1.x; limitador_up and datastore_partitioned are available now. |
| Auth evaluation metrics | Met | Authorino /server-metrics is scraped by the authorino-server-metrics ServiceMonitor. Auth evaluation latency (P50/P95/P99) and success/deny rate are available in the Platform Admin dashboard. |
| maas-api application metrics | Not available (gap) | The maas-api Go service does not expose /metrics. API key creation rate, token issuance rate, and handler latency are not observable. Requires adding Prometheus instrumentation to the Go service. |