Observability Dashboard

This document covers the observability stack for the MaaS Platform, including metrics collection, monitoring, and visualization.

Prerequisites

Before deploying the observability stack, ensure the following platform prerequisites are configured. Without these, metrics pipelines for dashboards and showback will not function.

User Workload Monitoring

User Workload Monitoring must be enabled for Prometheus to scrape metrics from MaaS components.

Required for metrics collection

Without User Workload Monitoring enabled, ServiceMonitors deployed by MaaS will not be processed and no metrics will be collected.

Step 1: Create or update the cluster-monitoring-config ConfigMap

The ConfigMap must exist in the openshift-monitoring namespace (not the MaaS application namespace). Cluster admin permissions are required.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

Apply with:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF

Step 2: Verify User Workload Monitoring is active

# Check that prometheus-user-workload pods are running
kubectl get pods -n openshift-user-workload-monitoring

# Expected output: prometheus-user-workload-0 and prometheus-user-workload-1 in Running state

If pods are not present, wait a few minutes for the monitoring operator to reconcile.

Kuadrant Observability

The Kuadrant CR must have observability.enabled set to true for the operator to create the necessary PodMonitor that scrapes Limitador metrics.

Required for rate-limiting and usage metrics

Without Kuadrant observability enabled, metrics like authorized_hits, authorized_calls, and limited_calls will not be scraped into Prometheus. Dashboards showing token consumption and rate limiting will have no data.

Step 1: Enable observability on the Kuadrant CR

kubectl patch kuadrant kuadrant -n kuadrant-system --type=merge \
  -p '{"spec":{"observability":{"enable":true}}}'

Or edit the Kuadrant CR directly:

apiVersion: kuadrant.io/v1beta1
kind: Kuadrant
metadata:
  name: kuadrant
  namespace: kuadrant-system
spec:
  observability:
    enable: true  # Required for PodMonitor creation

Step 2: Verify the PodMonitor exists

# Check that the Kuadrant-created PodMonitor exists
kubectl get podmonitor -n kuadrant-system

# Expected: kuadrant-limitador-monitor should be listed

Step 3: Verify Limitador metrics are being scraped

# Query Prometheus for Limitador metrics
curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
  "https://thanos-querier-openshift-monitoring.<cluster>/api/v1/query?query=limitador_up"

# Should return data with limitador_up = 1

Prerequisites Summary

Prerequisite	Namespace	How to Enable	Verification
User Workload Monitoring	`openshift-monitoring`	Create `cluster-monitoring-config` ConfigMap with `enableUserWorkload: true`	`kubectl get pods -n openshift-user-workload-monitoring` shows running Prometheus pods
Kuadrant Observability	`kuadrant-system`	Set `spec.observability.enable: true` on Kuadrant CR	`kubectl get podmonitor -n kuadrant-system` shows `kuadrant-limitador-monitor`

Cluster Admin Permissions

Configuring User Workload Monitoring requires cluster admin permissions to create ConfigMaps in the openshift-monitoring namespace. If you don't have these permissions, contact your cluster administrator.

RHOAI Dashboard Observability Tab

The RHOAI Dashboard includes a built-in Observability tab that displays Perses-based dashboards for platform monitoring. This is separate from the MaaS-specific Grafana dashboards described later in this document.

The following must be in place for the Observability tab to work:

Cluster Observability Operator (COO) and OpenTelemetry Operator — install both from OperatorHub
DSCI monitoring.metrics — see Platform Setup for DSCI configuration
observabilityDashboard: true on OdhDashboardConfig — see Feature Flags

Quick verification:

kubectl get csv -A | grep -E 'cluster-observability|opentelemetry'
kubectl get dsciinitialization default-dsci -o jsonpath='{.spec.monitoring}' | jq .
kubectl get pods -n redhat-ods-monitoring | grep perses

For the full setup procedure, see Managing observability (RHOAI 3.4).

Overview

As part of Dev Preview, MaaS Platform includes a basic observability stack that provides insights into system performance, usage patterns, and operational health.

Note

The observability stack will be enhanced in future releases.

The observability stack consists of:

Limitador: Rate limiting service that exposes usage and rate-limit metrics (with labels from TelemetryPolicy)
Authorino: Authentication/authorization service that exposes auth evaluation metrics (auth_server_*)
Istio Telemetry: Adds subscription to gateway latency metrics for per-subscription latency (P50/P95/P99)
vLLM / llm-d / Simulator: Expose inference metrics (TTFT, ITL, queue depth, token throughput, KV-cache usage); llm-d also exposes EPP routing metrics
Prometheus: Metrics collection and storage (uses OpenShift platform Prometheus)
ServiceMonitors: Deployed to configure Prometheus metric scraping
Visualization: Grafana dashboards (see Grafana documentation)

Component Metrics Status

Component	Exposes Metrics?	Scraped into Prometheus?	In Dashboards?
Limitador	Yes (`/metrics`)	Yes (Kuadrant PodMonitor or MaaS ServiceMonitor)	Yes — 16 panels use `authorized_hits`, `authorized_calls`, `limited_calls`, `limitador_up`
Authorino	Yes (`/metrics` + `/server-metrics`)	Yes — `/metrics` via Kuadrant operator; `/server-metrics` via MaaS `authorino-server-metrics` ServiceMonitor	Yes — Auth Evaluation Latency (P50/P95/P99), Auth Success/Deny Rate, plus pod-up check
Istio Gateway	Yes (Envoy `/stats/prometheus`)	Yes (`istio-gateway-metrics` ServiceMonitor)	Yes — latency histograms, request counts, error rates
maas-api	No — returns 404 on `/metrics`	No	Only pod-up check via `kube_pod_status_phase`
vLLM / llm-d / Simulator	Yes (vLLM metrics on `/metrics` port 8000; llm-d EPP metrics on port 9090)	Yes — vLLM metrics via `kserve-llm-models` ServiceMonitor; EPP metrics require separate scrape config	Yes — TTFT, ITL, queue depth, latency, tokens, cache, prompt/generation ratio, queue wait time (EPP metrics not yet in MaaS dashboards)

maas-api Metrics Gap

The maas-api Go service does not expose a /metrics endpoint. Metrics such as API key creation rate, token issuance rate, model discovery latency, and request handler durations are not available in Prometheus. Adding Prometheus instrumentation (e.g. promhttp handler + application-specific counters/histograms) to the Go service is a recommended future improvement.

Installation

There are two ways to enable deployment-based observability:

Operator-managed (recommended): Enable via Tenant CR
Kustomize-based: Deploy manifests directly

Option 1: Operator-Managed Telemetry

When using the ODH/RHOAI operator, telemetry can be enabled via the Tenant CR (self-bootstrapped by maas-controller in the models-as-a-service namespace):

apiVersion: maas.opendatahub.io/v1alpha1
kind: Tenant
metadata:
  name: default-tenant
  namespace: models-as-a-service
spec:
  telemetry:
    enabled: true  # Enable TelemetryPolicy and Istio Telemetry
    metrics:
      captureOrganization: true
      captureUser: false      # Disabled by default (GDPR)
      captureGroup: false     # High cardinality
      captureModelUsage: true

Or patch an existing CR:

kubectl patch tenant default-tenant -n models-as-a-service --type=merge \
  -p '{"spec":{"telemetry":{"enabled":true}}}'

What the Tenant reconciler creates when telemetry.enabled: true:

Resource	Namespace	Purpose
TelemetryPolicy (`maas-telemetry`)	Gateway namespace	Adds `user`, `subscription`, `model` labels to Limitador usage metrics
Istio Telemetry (`latency-per-subscription`)	Gateway namespace	Adds `subscription` label to gateway latency metrics

Prerequisites for Operator-Managed Telemetry

The Tenant reconciler telemetry feature requires:

OpenShift Service Mesh (Istio) 2.4+ — for Istio Telemetry CRD
Kuadrant/RHCL — for TelemetryPolicy CRD and AuthPolicy header injection
Gateway deployed — Telemetry targets the gateway via selector

The Tenant reconciler checks for CRD availability before creating resources. If a CRD is not present, that resource is silently skipped.

AuthPolicy Header Dependency

The Istio Telemetry reads the subscription value from the X-MaaS-Subscription header, which must be injected by AuthPolicy:

response:
  success:
    headers:
      X-MaaS-Subscription:
        plain:
          expression: 'auth.metadata.apiKeyValidation.subscription'

Without this header injection, the subscription label on latency metrics will be empty.

Verify the feature is working:

# Check Istio Telemetry was created
kubectl get telemetry -n openshift-ingress latency-per-subscription

# Query Prometheus for subscription label
curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
  "https://thanos-querier-openshift-monitoring.<cluster>/api/v1/label/subscription/values"

Option 2: Kustomize-Based Installation

Development/Testing Only

This is not the standard customer install. The production path is operator-managed via ODH/RHOAI (Option 1 above).

These Kustomize entrypoints exist so the team can install, iterate, and test observability without a full operator-driven deployment.

Kustomize Entrypoint Map

The observability stack is defined across multiple Kustomize directories. Each can be built and applied independently:

Entrypoint (kustomize build target)	What it deploys	Operator-owned equivalent
`deployment/base/observability/`	TelemetryPolicy + Istio Telemetry (conditional ServiceMonitors applied only via script)	Operator installs TelemetryPolicy as part of the MaaS stack; Kuadrant operator owns ServiceMonitors when `spec.observability.enable: true`
`deployment/components/observability/grafana/`	GrafanaDashboard CRs (Platform Admin, AI Engineer)	Operator does not manage Grafana dashboards; same CRs are used in both paths
`deployment/components/observability/prometheus/`	Standalone Prometheus + RBAC + ServiceMonitors in `llm-observability` namespace	OpenShift User Workload Monitoring (built-in Prometheus) — operator path relies on this instead
`deployment/components/observability/observability/`	Aggregator that pulls in `prometheus/` above	Same as above — this is a convenience wrapper
`deployment/components/observability/observability/dashboards/`	Perses PersesDashboard + Prometheus datasource	No operator equivalent — Perses is optional in both paths

Base observability resources (deployment/base/observability/):

Resource	Purpose
TelemetryPolicy (`gateway-telemetry-policy.yaml`)	Adds `user`, `subscription`, and `model` labels to Limitador metrics. The `model` label (from `responseBodyJSON`) is available on `authorized_hits`; `authorized_calls` and `limited_calls` carry `user` and `subscription`.
Istio Telemetry (`istio-gateway-telemetry.yaml`)	Adds `subscription` label to gateway latency (`istio_request_duration_milliseconds_bucket`) via the `X-MaaS-Subscription` header injected by the controller-generated AuthPolicy. Enables per-subscription latency tracking (P50/P95/P99).

Conditional resources (not in kustomization.yaml, applied only by install-observability.sh):

Resource	Applied when…
`limitador-servicemonitor.yaml`	No Kuadrant PodMonitor scraping Limitador `/metrics`
`authorino-server-metrics-servicemonitor.yaml`	No existing monitor scraping Authorino `/server-metrics`
`istio-gateway-service.yaml`	Gateway deployment exists in `openshift-ingress`
`istio-gateway-servicemonitor.yaml`	Gateway deployment exists (paired with the Service above)

Dry-Run Verification

Validate that every kustomization builds cleanly without applying to a cluster:

# Base telemetry (TelemetryPolicy + Istio Telemetry)
kustomize build deployment/base/observability \
  | kubectl apply --dry-run=client -f -

# Grafana dashboards
kustomize build deployment/components/observability/grafana \
  | kubectl apply --dry-run=client -f -

# Standalone Prometheus stack
kustomize build deployment/components/observability/prometheus \
  | kubectl apply --dry-run=client -f -

# Prometheus aggregator (same content as prometheus/ above)
kustomize build deployment/components/observability/observability \
  | kubectl apply --dry-run=client -f -

# Perses dashboards
kustomize build deployment/components/observability/observability/dashboards \
  | kubectl apply --dry-run=client -f -

CI Validation

CI runs scripts/ci/validate-manifests.sh on every PR that touches deployment/**, which runs kustomize build against all kustomization.yaml files. Files with kind: Component are skipped (they must be included via a parent's components: field), but directories whose kustomization.yaml declares kind: Kustomization can build standalone and are validated by CI dry-run.

Deploy to a Cluster

Quick deployment (recommended):

# Deploys base telemetry + conditional ServiceMonitors
./scripts/observability/install-observability.sh [--namespace NAMESPACE]

Manual deployment (step-by-step):

# 1. Base telemetry (requires Gateway + AuthPolicy to exist first)
kustomize build deployment/base/observability | kubectl apply -f -

# 2. Conditional ServiceMonitors (Limitador, Authorino, Gateway, LLM models)
#    Use the install script — it detects existing Kuadrant monitors to avoid duplicates:
./scripts/observability/install-observability.sh

# 3. Grafana dashboards (discovers Grafana instance cluster-wide)
./scripts/observability/install-grafana-dashboards.sh

# 4. (Optional) Standalone Prometheus — only if NOT using OpenShift User Workload Monitoring
kustomize build deployment/components/observability/observability | kubectl apply -f -

# 5. (Optional) Perses dashboards
kustomize build deployment/components/observability/observability/dashboards | kubectl apply -f -

When using the full deployment script, this is applied automatically:

./scripts/deploy.sh

Prerequisites

Tools: kubectl, kustomize, jq, yq must be installed
Cluster state: Gateway, AuthPolicy (gateway-auth-policy), and subscription selection must be deployed first. The AuthPolicy injects X-MaaS-Subscription, which Istio Telemetry reads to label latency by subscription. Without it, the subscription label on gateway latency will be empty.
Namespace: Use --namespace if your MaaS API is deployed to a namespace other than maas-api (e.g. --namespace opendatahub)

Operator vs Kustomize Drift Reference

When updating manifests in deployment/base/observability/ or deployment/components/observability/, check whether the operator path produces equivalent resources. Drift between the two is expected in some areas (e.g., the standalone Prometheus stack has no operator equivalent) but should be tracked for the telemetry CRs and ServiceMonitors.

Resource	Kustomize source	Operator creates?	Notes
TelemetryPolicy	`base/observability/gateway-telemetry-policy.yaml`	Yes	Keep in sync — labels must match
Istio Telemetry	`base/observability/istio-gateway-telemetry.yaml`	Yes	Keep in sync — header extraction must match
Limitador ServiceMonitor	`base/observability/limitador-servicemonitor.yaml`	Kuadrant PodMonitor when `observability.enable: true`	Script skips ours if Kuadrant's exists
Authorino /server-metrics	`base/observability/authorino-server-metrics-servicemonitor.yaml`	Not yet (Kuadrant only scrapes `/metrics`)	MaaS supplements the operator gap
Istio Gateway ServiceMonitor	`base/observability/istio-gateway-servicemonitor.yaml`	No	MaaS-only; applied conditionally
Grafana Dashboards	`components/observability/grafana/`	No	Same CRs used in both paths
Standalone Prometheus	`components/observability/prometheus/`	No (uses OpenShift UWM instead)	Dev/test only — not for production
Perses Dashboards	`components/observability/observability/dashboards/`	No	Optional in both paths

Keeping Docs Accurate

When you change any YAML under deployment/base/observability/ or deployment/components/observability/:

Verify the build still passes: kustomize build <dir> | kubectl apply --dry-run=client -f -
Update the entrypoint map table above if you add, remove, or rename an entrypoint
Update this document if the change affects user-facing instructions (metrics, ServiceMonitor behavior, dashboard content)

Metrics Collection

Limitador Metrics

Limitador exposes the following Prometheus metrics (verified against Limitador source code):

Core Limitador Metrics

Metric	Type	Labels	Description
`limitador_up`	Gauge	—	Limitador is running (1 = up)
`datastore_partitioned`	Gauge	—	Limitador is partitioned from backing datastore (0 = healthy)
`datastore_latency`	Histogram	—	Latency to the underlying counter datastore

MaaS Usage Metrics (Limitador + TelemetryPolicy)

When Kuadrant TelemetryPolicy and TokenRateLimitPolicy are applied, Limitador exposes these counters with custom labels injected by the wasm-shim from auth context and the model response body. These are the primary metrics for usage dashboards and chargeback:

Metric	Type	Labels	Description
`authorized_hits`	Counter	`user`, `subscription`, `model`, `limitador_namespace`	Total tokens consumed per request (from `usage.total_tokens` in the model response; input + output combined). The `model` label is extracted via `responseBodyJSON("/model")`.
`authorized_calls`	Counter	`user`, `subscription`, `limitador_namespace`	Requests allowed (not rate-limited).
`limited_calls`	Counter	`user`, `subscription`, `limitador_namespace`	Requests denied due to token rate limits.

model label availability

The model label is currently available only on authorized_hits. The authorized_calls and limited_calls metrics carry user and subscription labels but not model, due to how the wasm-shim constructs the CEL evaluation context for these counters. This is a known upstream limitation tracked for improvement in Kuadrant.

Gateway latency is labeled by subscription only via Istio Telemetry (see Per-Subscription Latency Tracking); per-user latency is not exposed on the gateway histogram to keep cardinality bounded.

Authorino Metrics

Authorino exposes metrics on two separate endpoints:

Endpoint	Metrics	Scraped?
`/metrics`	Controller-runtime (reconcile counts, workqueue depth)	Yes (`authorino-operator-monitor`, provided by Kuadrant)
`/server-metrics`	Auth evaluation metrics (see below)	Yes (`authorino-server-metrics`, deployed by MaaS `install-observability.sh`)

Auth server metrics (exposed on /server-metrics, port 8080):

Metric	Type	Labels	Description
`auth_server_authconfig_total`	Counter	`namespace`, `authconfig`	Total AuthConfig evaluations
`auth_server_authconfig_duration_seconds`	Histogram	`namespace`, `authconfig`	Auth evaluation latency
`auth_server_authconfig_response_status`	Counter	`namespace`, `authconfig`, `status`	Auth response status per AuthConfig (OK, denied, etc.)
`auth_server_response_status`	Counter	`status`	Aggregate auth response status across all AuthConfigs
`grpc_server_handled_total`	Counter	`grpc_method`, `grpc_code`	gRPC requests handled
`grpc_server_handling_seconds`	Histogram	`grpc_method`	gRPC request latency
`grpc_server_msg_received_total`	Counter	`grpc_method`	gRPC messages received
`grpc_server_msg_sent_total`	Counter	`grpc_method`	gRPC messages sent
`grpc_server_started_total`	Counter	`grpc_method`	gRPC requests started

MaaS ServiceMonitor

The Kuadrant-provided authorino-operator-monitor only scrapes /metrics (controller-runtime stats). MaaS deploys an additional authorino-server-metrics ServiceMonitor to scrape /server-metrics for auth evaluation metrics. This is deployed automatically by install-observability.sh.

Lazily registered metrics

Authorino upstream documents additional per-evaluator metrics (auth_server_evaluator_total, auth_server_evaluator_duration_seconds, auth_server_evaluator_cancelled, auth_server_evaluator_denied). These are lazily registered and only appear when specific evaluator types (e.g. OPA, HTTP authorization) are triggered. The MaaS AuthPolicy uses kubernetesTokenReview, which does not emit these metrics. They are not listed in the table above because they are not present in a standard MaaS deployment.

vLLM / Model Server Metrics

MaaS supports three model serving backends that expose Prometheus metrics on /metrics (port 8000), scraped by the kserve-llm-models ServiceMonitor:

vLLM (current stable) — full-featured LLM inference server
llm-d — llm-d inference platform (runs vLLM as backend + EPP routing layer)
llm-d-inference-sim (v0.8.2) — lightweight simulator for testing without GPUs

Supported versions:

Backend	Minimum Version	Sample Manifests
vLLM	v0.7.x stable	—
llm-d	v0.1.x	—
llm-d-inference-sim	v0.8.2	`docs/samples/models/simulator/`

vLLM Metrics (port 8000)

All three backends expose vllm:-prefixed metrics. The table below shows which metrics each backend provides.

Metric	Type	Simulator	vLLM	llm-d	Description
`vllm:num_requests_running`	Gauge	Y	Y	Y	Requests currently being processed
`vllm:num_requests_waiting`	Gauge	Y	Y	Y	Requests queued waiting for processing
`vllm:e2e_request_latency_seconds`	Histogram	Y	Y	Y	End-to-end inference latency
`vllm:time_to_first_token_seconds`	Histogram	Y	Y	Y	Time to First Token (TTFT)
`vllm:request_prompt_tokens`	Histogram	Y	Y	Y	Per-request prompt token counts (`_sum` gives cumulative total)
`vllm:request_generation_tokens`	Histogram	Y	Y	Y	Per-request generation token counts (`_sum` gives cumulative total)
`vllm:inter_token_latency_seconds`	Histogram	Y	Y	Y	Inter-Token Latency (ITL)
`vllm:kv_cache_usage_perc`	Gauge	Y	Y	Y	KV-cache usage (0-1)
`vllm:prompt_tokens_total`	Counter	Y	Y	Y	Total prompt tokens processed
`vllm:generation_tokens_total`	Counter	Y	Y	Y	Total generation tokens processed
`vllm:request_queue_time_seconds`	Histogram	—	Y	Y	Time requests wait in queue before processing (vLLM/llm-d only)
`vllm:request_success_total`	Counter	Y	Y	Y	Successful requests (`_total` suffix added by prometheus_client)
`vllm:request_prefill_time_seconds`	Histogram	Y	Y	Y	Time spent in prefill (prompt processing) phase
`vllm:request_decode_time_seconds`	Histogram	Y	Y	Y	Time spent in decode (token generation) phase
`vllm:request_inference_time_seconds`	Histogram	Y	—	—	Total inference time (simulator-specific)
`vllm:request_params_max_tokens`	Histogram	Y	—	—	Distribution of `max_tokens` request parameter
`vllm:max_num_generation_tokens`	Histogram	Y	—	—	Max generation tokens per request
`vllm:lora_requests_info`	Gauge	Y	—	—	LoRA adapter request info
`vllm:cache_config_info`	Gauge	Y	—	—	Cache configuration info (simulator-specific)
`vllm:time_per_output_token_seconds`	Histogram	Y	—	—	Legacy ITL name (kept by simulator for backward compat; not used by dashboards)

Simulator metric alignment

As of v0.7.1 (still true in v0.8.x), the simulator fully aligns with current vLLM metric names (kv_cache_usage_perc, inter_token_latency_seconds, prompt_tokens_total, generation_tokens_total). Older simulator versions (v0.6.x) used different names (gpu_cache_usage_perc, time_per_output_token_seconds) and are no longer supported by MaaS dashboards. The simulator also exposes additional metrics not used by MaaS dashboards (e.g. request_inference_time_seconds, request_params_max_tokens).

Lazily registered metrics

Some vLLM/simulator metrics are lazily registered — they only appear in /metrics output after the first event that triggers them. For example, request_queue_time_seconds (on real vLLM) only appears after a request actually queues (when max-num-seqs is exceeded). Similarly, histogram counters like e2e_request_latency_seconds only appear after the first inference request completes. Dashboard panels will show "No Data" until sufficient traffic has been generated. This is normal Prometheus client behavior, not a configuration issue.

Counter _total suffix

vLLM code defines counters as vllm:prompt_tokens and vllm:generation_tokens, but the Python prometheus_client library appends _total when exposing metrics. The actual scraped metric names in Prometheus are vllm:prompt_tokens_total and vllm:generation_tokens_total. The llm-d official dashboard confirms this by using the _total form.

llm-d EPP (Endpoint Picker) Metrics

When using llm-d, the inference gateway's Endpoint Picker (EPP) exposes additional routing and scheduling metrics on a separate port (9090). These are complementary to vLLM metrics and require a separate ServiceMonitor:

Metric	Type	Description
`inference_model_request_total`	Counter	Total inference requests per model
`inference_model_request_error_total`	Counter	Total errored requests per model
`inference_model_request_duration_seconds`	Histogram	Request duration through the EPP
`inference_model_input_tokens`	Counter	Input tokens routed per model
`inference_model_output_tokens`	Counter	Output tokens routed per model
`inference_model_running_requests`	Gauge	Currently running requests per model
`inference_pool_average_kv_cache_utilization`	Gauge	Average KV-cache utilization across the pool
`inference_pool_average_queue_size`	Gauge	Average queue size across the pool
`inference_pool_ready_pods`	Gauge	Number of ready pods in the inference pool

EPP metrics not yet in MaaS dashboards

EPP metrics are not currently scraped or visualized by MaaS. When deploying llm-d with the EPP, refer to the llm-d monitoring docs and the inference gateway dashboard for EPP-specific visualization.

Input/Output Token Split

vLLM metrics provide input vs output token breakdown per model (vllm:prompt_tokens_total / vllm:generation_tokens_total counters, or vllm:request_prompt_tokens / vllm:request_generation_tokens histograms). However, these do not carry user or subscription labels. For per-user billing with input/output split, upstream changes to the Kuadrant wasm-shim are required (see Known Limitations).

Dashboard Metric Queries

Dashboard panels use histogram _sum as primary data source. All queries work across vLLM, llm-d, and llm-d-inference-sim v0.8.2:

Panel	PromQL metric
Tokens (1h)	`request_prompt_tokens_sum` + `request_generation_tokens_sum`
Token Throughput	`rate(request_prompt_tokens_sum)`, `rate(request_generation_tokens_sum)`
Prompt/Gen Ratio	`rate(request_prompt_tokens_sum)` / total
ITL	`inter_token_latency_seconds_bucket`
KV Cache	`kv_cache_usage_perc`
Queue Wait Time	`request_queue_time_seconds_bucket` (vLLM/llm-d only)

See the vLLM metrics documentation for the full vLLM metric list and deprecation policy, and the llm-d monitoring documentation for llm-d-specific setup.

ServiceMonitor Configuration

ServiceMonitors are deployed by install-observability.sh to configure OpenShift's Prometheus to discover and scrape metrics from MaaS components.

Automatically Deployed:

Istio Gateway: Scrapes Envoy metrics from the MaaS gateway in openshift-ingress (deployed if the gateway exists)
KServe LLM Models: Scrapes vLLM metrics from model pods in the llm namespace (deployed if the llm namespace exists)

Conditionally Deployed (auto-detected by install-observability.sh):

Limitador (limitador-servicemonitor.yaml): Scrapes rate-limiting metrics from Limitador pods in kuadrant-system. Skipped when Kuadrant's own PodMonitor is already present. When Kuadrant CR has spec.observability.enable: true, the operator creates its own kuadrant-limitador-monitor PodMonitor that scrapes the same Limitador pod. Deploying both would cause duplicate metrics.
Authorino Server Metrics (authorino-server-metrics-servicemonitor.yaml): Scrapes auth evaluation metrics from Authorino's /server-metrics endpoint in kuadrant-system. Skipped if a Kuadrant-provided monitor already scrapes /server-metrics. This collects auth_server_authconfig_duration_seconds, auth_server_authconfig_response_status, and other auth server metrics that are not scraped by the Kuadrant-provided authorino-operator-monitor (which only covers /metrics for controller-runtime stats).

Already Provided by Kuadrant (when observability.enable: true):

Limitador PodMonitor (kuadrant-limitador-monitor): Created by the Kuadrant operator
Authorino Operator Monitor (authorino-operator-monitor): Scrapes Authorino controller metrics from /metrics only

Authorino Metrics Coverage

The Kuadrant-provided authorino-operator-monitor only scrapes /metrics (controller-runtime stats). The MaaS authorino-server-metrics ServiceMonitor supplements this by scraping /server-metrics for auth evaluation metrics (auth_server_authconfig_duration_seconds, auth_server_authconfig_response_status, etc.). The install-observability.sh script auto-detects whether a Kuadrant-provided monitor already scrapes /server-metrics and skips deploying the MaaS ServiceMonitor to avoid duplicates. See Authorino Observability for details.

High Availability for MaaS Metrics

For production deployments where metric persistence across pod restarts and scaling events is critical, you should configure Limitador to use Redis as a backend storage solution.

Why High Availability Matters

By default, Limitador stores rate-limiting counters in memory, which means:

All hit counts are lost when pods restart
Metrics reset when pods are rescheduled or scaled down
No persistence across cluster maintenance or updates

Setting Up Persistent Metrics

To enable persistent metric counts, refer to the detailed guide:

Configuring Redis storage for rate limiting

This Red Hat documentation provides:

Step-by-step Redis configuration for OpenShift
Secret management for Redis credentials
Limitador custom resource updates
Production-ready setup instructions

For local development and testing, you can also use our Limitador Persistence guide which includes a basic Redis setup script that works with any Kubernetes cluster.

Visualization

For dashboard visualization options, see:

OpenShift Monitoring: Monitoring overview
Grafana on OpenShift: Red Hat OpenShift AI Monitoring

Included Dashboards

MaaS includes two Grafana dashboards for different personas:

Platform Admin Dashboard

Provides a comprehensive view of system health, usage across all users, and resource allocation:

Section	Metrics
Component Health	Limitador up, Authorino pods, MaaS API pods, Gateway pods, Firing Alerts
Key Metrics	Total Tokens, Total Requests, Token Rate, Request Rate, Inference Success Rate, Active Users, P50 Response Latency, Rate Limit Ratio
Auth Evaluation	Auth Evaluation Latency (P50/P95/P99), Auth Success/Deny Rate
Traffic Analysis	Token/Request Rate by Model, Error Rates (4xx excl. 429, 5xx, 429 Rate Limited), Token/Request Rate by Subscription, P95 Latency
Error Breakdown	Rate Limited Requests, Unauthorized Requests
Model Metrics	vLLM queue depth, inference latency, KV cache usage, token throughput, prompt vs generation token ratio, queue wait time, TTFT, ITL
Top Users	By token usage, by declined requests
Detailed Breakdown	Token Rate by User, Request Volume by User & Subscription
Resource Allocation	CPU/Memory/GPU per model pod

Template Variables

The Platform Admin dashboard uses Grafana template variables for namespace filtering instead of hardcoded values. This allows the dashboard to adapt to different deployment configurations:

Variable	Default	Description
`$datasource`	`prometheus`	Prometheus datasource
`$maas_namespace`	auto-detected	MaaS API namespace (auto-detected from `kube_pod_info{pod=~"maas-api.*"}`)
`$kuadrant_namespace`	`kuadrant-system`	Kuadrant components namespace
`$gateway_namespace`	`openshift-ingress`	Istio/Gateway namespace
`$llm_namespace`	`llm`	LLM model pods namespace
`$model`	`All`	Filter by model name

To customize for your environment, change the variable values in Grafana's dashboard settings (gear icon → Variables).

AI Engineer Dashboard

Personal usage view for individual developers:

Section	Metrics
Usage Summary	My Total Tokens, My Total Requests, Token Rate, Request Rate, Rate Limit Ratio, Inference Success Rate
Usage Trends	Token Usage by Model, Usage Trends (tokens vs rate limited)
Detailed Analysis	Token Volume by Model, Rate Limited by Subscription

Inference Success Rate

Both dashboards use rate() on vLLM counters (request_success_total, e2e_request_latency_seconds_count) instead of raw counter values. This handles pod restarts correctly (counters reset independently and raw division produces incorrect results). When no traffic is present, rate()/rate() produces NaN; the dashboards use ((ratio) >= 0) OR vector(1) to filter NaN and default to 100% (healthy) when no traffic exists.

Tokens vs Requests

Both dashboards show token consumption (authorized_hits) for billing/cost tracking and request counts (authorized_calls) for capacity planning. Blue panels indicate request metrics; green panels indicate token metrics.

Per-User Token Billing

The Platform Admin dashboard shows token consumption aggregated by subscription and model for system-level visibility. Per-user token consumption for billing is available via:

AI Engineer dashboard: Individual users see their own token usage
Prometheus API: Query sum by (user) (increase(authorized_hits[24h])) for billing periods
RFE: A dedicated /maas-api/v1/usage chargeback API endpoint is recommended for production billing workflows

Prerequisites

Grafana must be installed (for example via your observability team's process, a centralized instance, or the Grafana Operator). The dashboard helper does not install Grafana; it only deploys MaaS dashboard definitions and never fails (warnings only if none or multiple instances are found).
Ensure the Grafana instance has label app=grafana so MaaS dashboard definitions attach.
Configure a Prometheus or Thanos datasource in Grafana; the MaaS dashboards use the default Prometheus datasource.

Deploying Dashboards

Monitoring is installed by install-observability.sh. Dashboards are installed by a separate helper that discovers Grafana cluster-wide:

./scripts/observability/install-grafana-dashboards.sh

Behavior: Scans for Grafana CRs cluster-wide. If one instance is found, deploys dashboards to that namespace and prints a success message. If none or multiple are found, prints a warning (and, for multiple, lists them) and exits without error. Use flags to target a specific instance:

./scripts/observability/install-grafana-dashboards.sh --grafana-namespace maas-api
./scripts/observability/install-grafana-dashboards.sh --grafana-label app=grafana

To deploy only the dashboard manifests manually (same namespace as your Grafana):

kustomize build deployment/components/observability/grafana | \
  sed "s/namespace: maas-api/namespace: <your-namespace>/g" | \
  kubectl apply -f -

Sample Dashboard JSON

For manual import, a sample dashboard JSON file is available:

MaaS Token Metrics Dashboard

To import into Grafana:

Go to Grafana → Dashboards → Import
Upload the JSON file or paste content
Select your Prometheus datasource

Key Metrics Reference

Token and Request Metrics

Metric	Description	Labels
`authorized_hits`	Total tokens consumed (input + output combined, from `usage.total_tokens` in model responses)	`user`, `subscription`, `model`
`authorized_calls`	Total requests allowed	`user`, `subscription`
`limited_calls`	Total requests rate-limited	`user`, `subscription`

When to use which metric

Billing/Cost: Use authorized_hits - represents actual token consumption, with model label for per-model breakdown
API Usage: Use authorized_calls - represents number of API calls (per user, per subscription)
Rate Limiting: Use limited_calls - shows quota violations (per user, per subscription)

Total tokens only (input/output split not yet available)

Token consumption is reported as total tokens (prompt + completion) per request. The pipeline reads usage.total_tokens from the model response via Kuadrant's TokenRateLimitPolicy. Separate input-token (prompt_tokens) and output-token (completion_tokens) counters are not yet available at the gateway level; this would require upstream changes in the Kuadrant wasm-shim to send separate hits_addend values for each token type. Chargeback and usage tracking per user, per subscription, and per model are supported using authorized_hits.

Latency Metrics

Metric	Description	Labels
`istio_request_duration_milliseconds_bucket`	Gateway-level latency histogram	`destination_service_name`, `subscription`
`vllm:e2e_request_latency_seconds`	Model inference latency	`model_name`

Per-Subscription Latency Tracking

The MaaS Platform uses an Istio Telemetry resource to add a subscription dimension to gateway latency metrics. This enables tracking request latency per subscription (e.g. free, premium, enterprise). Gateway latency is labeled by subscription only (not by user) to keep metric cardinality bounded and to align with latency-by-subscription requirements (e.g. P50/P95/P99 per subscription). Per-user metrics remain available from Limitador (authorized_hits, authorized_calls, limited_calls).

How it works:

The gateway-auth-policy injects the X-MaaS-Subscription header from the resolved subscription
The Istio Telemetry resource extracts this header and adds it as a subscription label to the REQUEST_DURATION metric
Prometheus scrapes these metrics from the Istio gateway

Configuration (deployment/base/observability/istio-gateway-telemetry.yaml):

apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: latency-per-subscription
  namespace: openshift-ingress
spec:
  selector:
    matchLabels:
      gateway.networking.k8s.io/gateway-name: maas-default-gateway
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: REQUEST_DURATION
        mode: CLIENT_AND_SERVER
      tagOverrides:
        subscription:
          operation: UPSERT
          value: request.headers["x-maas-subscription"]

Security

The X-MaaS-Subscription header should be injected server-side by AuthPolicy. Ensure your AuthPolicy injects this header from the subscription selection (not client input) for accurate metrics attribution.

Common Queries

Token-based queries (billing/cost):

# Total tokens consumed per user
sum by (user) (authorized_hits)

# Token consumption rate per model (tokens/sec)
sum by (model) (rate(authorized_hits[5m]))

# Top 10 users by tokens consumed
topk(10, sum by (user) (authorized_hits))

# Token consumption by subscription
sum by (subscription) (authorized_hits)

Request-based queries (capacity/usage):

# Total requests per user
sum by (user) (authorized_calls)

# Request rate per subscription (requests/sec)
sum by (subscription) (rate(authorized_calls[5m]))

# Top 10 users by request count
topk(10, sum by (user) (authorized_calls))

Inference success rate (system health — did requests that reached the model succeed?):

# Inference success rate using rate() to handle counter resets correctly
# The >= 0 filter removes NaN (0/0 when no traffic), falling back to vector(1) = 100%
((sum(rate(vllm:request_success_total[5m])) / sum(rate(vllm:e2e_request_latency_seconds_count[5m]))) >= 0) OR vector(1)

Rate limiting metrics (capacity planning — are users exceeding their quotas?):

# Rate limit ratio (percentage of requests rejected by rate limiting)
(sum(limited_calls) / (sum(authorized_calls) + sum(limited_calls))) OR vector(0)

# Rate limit ratio by subscription
(sum by (subscription) (limited_calls) / (sum by (subscription) (authorized_calls) + sum by (subscription) (limited_calls))) OR vector(0)

# Rate limit violations per second by subscription
sum by (subscription) (rate(limited_calls[5m]))

# Users hitting rate limits most
topk(10, sum by (user) (limited_calls))

Latency metrics (per-subscription SLA tracking):

# P99 latency per subscription
histogram_quantile(0.99, sum by (subscription, le) (rate(istio_request_duration_milliseconds_bucket{subscription!=""}[5m])))

# P50 latency per subscription
histogram_quantile(0.5, sum by (subscription, le) (rate(istio_request_duration_milliseconds_bucket{subscription!=""}[5m])))

Latency queries:

# P99 latency by service
histogram_quantile(0.99, sum by (destination_service_name, le) (rate(istio_request_duration_milliseconds_bucket[5m])))

# P50 (median) latency
histogram_quantile(0.5, sum by (le) (rate(istio_request_duration_milliseconds_bucket[5m])))

# P99 latency per subscription
histogram_quantile(0.99, sum by (subscription, le) (rate(istio_request_duration_milliseconds_bucket{subscription!=""}[5m])))

Filtering by subscription

For per-subscription latency queries, use subscription!="" to exclude requests where the X-MaaS-Subscription header was not injected. Token consumption metrics (authorized_hits, authorized_calls) from Limitador already only include successful requests.

Maintenance

Grafana Datasource Token Rotation

The Grafana datasource uses a ServiceAccount token to authenticate with Prometheus. This token is valid for 30 days and must be rotated periodically.

To rotate the token:

# Delete the existing datasource and create a new one (or rotate the token per your Grafana setup).
# To re-deploy only MaaS dashboard definitions: ./scripts/observability/install-grafana-dashboards.sh

Production Recommendation

For production deployments, consider automating token rotation using a CronJob or external secrets operator to avoid dashboard outages.

Known Limitations

Currently Blocked Features

Some features require upstream changes and are currently blocked:

Feature	Blocker	Workaround
`model` label on `authorized_calls` / `limited_calls`	Kuadrant wasm-shim does not pass `responseBodyJSON` context for these counters	Use `authorized_hits` for per-model breakdown; `authorized_calls`/`limited_calls` support per-user and per-subscription
Input/output token split	Kuadrant TokenRateLimitPolicy sends a single `hits_addend` (total tokens); no mechanism for separate prompt/completion counters	Total tokens available via `authorized_hits`; the response body contains `usage.prompt_tokens` and `usage.completion_tokens` but the wasm-shim does not split them
Input/output token breakdown per user	vLLM does not label its own metrics with `user`	Total tokens per user available via `authorized_hits{user="..."}`; vLLM prompt/generation token metrics are per-model only
Rate-limited requests not in Istio metrics	When the Kuadrant WASM plugin rejects a request (429), it calls `sendLocalReply()` which short-circuits the Envoy filter chain. These requests appear in Limitador metrics (`limited_calls`) but may not appear in Istio gateway metrics.	Use `limited_calls` from Limitador for rate-limiting visibility (has correct `subscription` and `user` labels).
Kuadrant policy health metrics	`kuadrant_policies_enforced`, `kuadrant_policies_total` etc. are defined in Kuadrant dev but not yet shipped in RHCL 1.x	Enable `observability.enable: true` on the Kuadrant CR; the ServiceMonitors are created but policy-specific gauges will appear in a future operator release
Authorino auth server metrics (upstream)	The Kuadrant-provided `authorino-operator-monitor` only scrapes `/metrics` (controller-runtime); `/server-metrics` is not scraped by the upstream operator	Resolved by MaaS: The `authorino-server-metrics` ServiceMonitor (deployed by `install-observability.sh`) scrapes `/server-metrics`. Auth evaluation latency and success/deny rate are visualized in the Platform Admin dashboard.
maas-api application metrics	The maas-api Go service does not expose a `/metrics` endpoint	No workaround available. Metrics such as API key creation rate, token issuance rate, model discovery latency, and handler durations require adding Prometheus instrumentation to the Go service (e.g. `promhttp` handler, custom counters/histograms).
PromQL "name does not end in _total" warnings	Limitador metrics (`authorized_hits`, `authorized_calls`, `limited_calls`) and Authorino's `auth_server_authconfig_response_status` are counters but do not follow the Prometheus naming convention of ending in `_total`. When `rate()` is applied, Prometheus generates a warning that Grafana displays on panels. This is Grafana issue #84636 (open).	The warnings are cosmetic and do not affect data correctness. All dashboard queries correctly apply `rate()` or `increase()` to these counters. The metric names are defined by upstream Kuadrant (Limitador) and Authorino — renaming requires upstream changes.

Total Tokens vs Token Breakdown

Total token consumption per user is available via authorized_hits{user="..."}. The blocked feature is the input/output split (prompt vs generation tokens) at the gateway level, which requires the wasm-shim to send two separate counter updates to Limitador.

Available Per-User and Per-Subscription Metrics

Feature	Metric	Label
Latency per subscription	`istio_request_duration_milliseconds_bucket`	`subscription`
Token consumption per user	`authorized_hits`	`user`
Token consumption per subscription	`authorized_hits`	`subscription`
Token consumption per model	`authorized_hits`	`model`
Requests per user	`authorized_calls`	`user`
Requests per subscription	`authorized_calls`	`subscription`
Rate limited per user	`limited_calls`	`user`
Rate limited per subscription	`limited_calls`	`subscription`

Requirements Alignment

Requirement	Status	Notes
Usage dashboards (token consumption per user, per subscription, per model)	Met	Grafana dashboard + `authorized_hits` with `user`, `subscription`, `model`; Prometheus scrapes Limitador `/metrics`.
Latency by subscription (P50/P95/P99)	Met	`istio_request_duration_milliseconds_bucket` with `subscription` label; subscription-only avoids unbounded cardinality.
Request tracking (per user, per subscription)	Met	`authorized_calls` with `user` and `subscription` labels; `limited_calls` for rate-limit violations.
Export for chargeback (CSV/API)	Not provided (RFE)	Per-user token data exists in Prometheus (`authorized_hits{user="..."}`) but no dedicated billing API or export endpoint is implemented. RFE recommendation: Add `/maas-api/v1/usage` endpoint that queries Prometheus and returns per-user, per-subscription, per-model token consumption in CSV/JSON for finance and chargeback systems.
Input/output token split	Not available	Only total tokens (`authorized_hits`); separate input and output counters require upstream Kuadrant wasm-shim changes to send split `hits_addend` values.
`model` label on request/rate-limit counters	Partial	`model` available on `authorized_hits` only; requires upstream Kuadrant fix to propagate `responseBodyJSON` context to `authorized_calls`/`limited_calls` counters.
Policy enforcement health	Future	Kuadrant operator metrics (`kuadrant_policies_enforced`, `kuadrant_ready`, etc.) defined upstream but not yet shipped in RHCL 1.x; `limitador_up` and `datastore_partitioned` are available now.
Auth evaluation metrics	Met	Authorino `/server-metrics` is scraped by the `authorino-server-metrics` ServiceMonitor. Auth evaluation latency (P50/P95/P99) and success/deny rate are available in the Platform Admin dashboard.
maas-api application metrics	Not available (gap)	The maas-api Go service does not expose `/metrics`. API key creation rate, token issuance rate, and handler latency are not observable. Requires adding Prometheus instrumentation to the Go service.