Model Access Guide
This guide explains how to interact with deployed models on the MaaS Platform, including authentication, making requests, and handling responses.
Overview
The MaaS Platform provides a secure, tier-based model access system where:
- Users authenticate with tokens obtained from the MaaS API
- Access is controlled by subscription tiers (free, premium, enterprise)
- Rate limiting and token consumption tracking are enforced
- Models are accessed through the MaaS gateway for policy enforcement
Authentication
Getting a Token
Before accessing models, you need to obtain an authentication token:
# Get your OpenShift token
OC_TOKEN=$(oc whoami -t)
# Set your MaaS API endpoint
HOST="https://maas-api.your-domain.com"
MAAS_API_URL="${HOST}/maas-api"
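# Request an access token
# (-k skips TLS certificate verification; drop it if your cluster uses trusted certificates)
TOKEN_RESPONSE=$(curl -sSk \
-H "Authorization: Bearer ${OC_TOKEN}" \
-H "Content-Type: application/json" \
-d '{"expiration": "15m"}' \
"${MAAS_API_URL}/v1/tokens")
# Extract the token field; quoting the variable keeps the JSON intact
ACCESS_TOKEN=$(echo "$TOKEN_RESPONSE" | jq -r .token)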
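`jq -r .token` prints the literal string `null` when the response has no `token` field (for example, when the request was rejected), so it can be worth checking the value before sending it to the gateway. A minimal sketch; the function name `is_usable_token` is hypothetical and not part of the MaaS tooling:

```shell
# Hypothetical helper: succeed only if the value looks like a real token.
# jq -r prints "null" for a missing field; the value is empty if the call failed outright.
is_usable_token() {
  case "$1" in
    ""|null) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage: report the problem early instead of sending "Bearer null" to the gateway.
if ! is_usable_token "${ACCESS_TOKEN:-}"; then
  echo "No access token obtained; inspect TOKEN_RESPONSE for details" >&2
fi
```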
Token Lifecycle
- Default lifetime: 4 hours (applies when the request does not set an expiration)
- Maximum lifetime: Determined by cluster configuration
- Refresh: Request a new token before expiration
- Revocation: Tokens can be revoked if compromised
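Refreshing before expiration means tracking when the token was issued and how long it lasts. The sketch below is one possible way to do that on the client side. Both function names are hypothetical, and it assumes durations are written the way the token request writes them (a number followed by `s`, `m`, or `h`); other formats the API may accept are not handled.

```shell
# Hypothetical helper: convert a duration such as "15m" or "4h" to seconds.
# Only the s, m, and h suffixes are handled; a bare number is treated as seconds.
duration_to_seconds() {
  case "$1" in
    *h) echo $(( ${1%h} * 3600 )) ;;
    *m) echo $(( ${1%m} * 60 )) ;;
    *s) echo $(( ${1%s} )) ;;
    *)  echo $(( $1 )) ;;
  esac
}

# Hypothetical helper: succeed when the token should be refreshed.
# Arguments: issue time (epoch seconds), lifetime (seconds), safety margin (seconds),
# and optionally the current time in epoch seconds (defaults to now).
token_needs_refresh() {
  local issued_at="$1"
  local lifetime="$2"
  local margin="$3"
  local now="${4:-$(date +%s)}"
  [ "$now" -ge $(( $issued_at + $lifetime - $margin )) ]
}
```

A script could record `ISSUED_AT=$(date +%s)` right after requesting the token and call `token_needs_refresh "$ISSUED_AT" "$(duration_to_seconds 15m)" 60` before each batch of requests, requesting a new token whenever the check succeeds.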
Discovering Models
List Available Models
Get a list of models available to your tier:
MODELS=$(curl -sSk "${MAAS_API_URL}/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${ACCESS_TOKEN}")
echo "$MODELS" | jq .
Example response:
{
"data": [
{
"id": "simulator",
"name": "Simulator Model",
"url": "https://gateway.your-domain.com/simulator/v1/chat/completions",
"tier": "free"
},
{
"id": "qwen3",
"name": "Qwen3 Model",
"url": "https://gateway.your-domain.com/qwen3/v1/chat/completions",
"tier": "premium"
}
]
}
Get Model Details
Get detailed information about a specific model:
MODEL_ID="simulator"
# Fetch the model list and keep only the entry whose id matches MODEL_ID
MODEL_INFO=$(curl -sSk "${MAAS_API_URL}/v1/models" \
-H "Authorization: Bearer ${ACCESS_TOKEN}" | \
jq --arg model "$MODEL_ID" '.data[] | select(.id == $model)')
echo "$MODEL_INFO" | jq .
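Because each entry in the model list carries its own `url` and `tier`, the endpoint for a request can be looked up rather than typed by hand. The following is a small sketch using the response shape shown in the example above; the function names are hypothetical.

```shell
# Hypothetical helper: print the url of the model whose id matches $1.
# Reads the JSON returned by the models endpoint on stdin.
model_url_for() {
  jq -r --arg model "$1" '.data[] | select(.id == $model) | .url'
}

# Hypothetical helper: print the ids of all models in tier $1, one per line.
model_ids_for_tier() {
  jq -r --arg tier "$1" '.data[] | select(.tier == $tier) | .id'
}
```

With the list stored in `MODELS`, `MODEL_URL=$(echo "$MODELS" | model_url_for "simulator")` would set the URL used by the request examples.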
Making Requests
Basic Chat Completion
Make a simple chat completion request:
MODEL_URL="https://gateway.your-domain.com/simulator/v1/chat/completions"
curl -sSk \
-H "Authorization: Bearer ${ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"model": "simulator",
"messages": [
{
"role": "user",
"content": "Hello, how are you?"
}
],
"max_tokens": 100
}' \
"${MODEL_URL}"
Advanced Request Parameters
Use additional parameters for more control:
curl -sSk \
-H "Authorization: Bearer ${ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"model": "simulator",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Explain quantum computing in simple terms."
}
],
"max_tokens": 200,
"temperature": 0.7,
"top_p": 0.9,
"stream": false
}' \
"${MODEL_URL}"
Streaming Responses
For real-time responses, use streaming:
# -N disables curl's output buffering so chunks are printed as they arrive
curl -sSkN \
-H "Authorization: Bearer ${ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"model": "simulator",
"messages": [
{
"role": "user",
"content": "Write a short story about a robot."
}
],
"max_tokens": 300,
"stream": true
}' \
"${MODEL_URL}" | while IFS= read -r line; do
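if [[ $line == data:* ]]; then
# -j prints each fragment without a trailing newline so the text reads continuously;
# non-JSON lines such as an end-of-stream marker make jq fail and are silently skipped
echo "${line#data: }" | jq -j '.choices[0].delta.content // empty' 2>/dev/null
fi
done
echo # finish the output with a newline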
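The loop above relies on jq failing quietly for anything that is not JSON. A slightly more explicit variant separates the stream handling from the JSON parsing. This sketch assumes the gateway follows the OpenAI streaming convention of ending the stream with a `data: [DONE]` line, which the source does not confirm; `sse_payloads` is a hypothetical name.

```shell
# Hypothetical helper: read a server-sent event stream on stdin and print the
# JSON payload of each "data:" line, stopping at the assumed [DONE] sentinel.
sse_payloads() {
  local line cr
  cr=$(printf '\r')
  while IFS= read -r line || [ -n "$line" ]; do
    line="${line%"$cr"}"   # tolerate CRLF line endings
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}
```

The streaming request could then be piped through `sse_payloads | jq -j '.choices[0].delta.content // empty'`, leaving jq with valid JSON only.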
Handling Responses
Standard Response Format
Models return responses in the OpenAI-compatible format:
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1677652288,
"model": "simulator",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! I am doing well, thank you for asking."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 9,
"completion_tokens": 12,
"total_tokens": 21
}
}
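The `finish_reason` field tells you why generation ended. In the OpenAI format, `stop` means the model finished on its own, while `length` means the output was cut off by `max_tokens`. Whether every model behind the gateway reports `length` this way is an assumption; the example above only shows `stop`. A hypothetical check:

```shell
# Hypothetical helper: succeed if the first choice was cut off by the token limit.
# Reads a chat completion response on stdin.
was_truncated() {
  [ "$(jq -r '.choices[0].finish_reason // empty')" = "length" ]
}
```

For a response stored in `RESPONSE`, `echo "$RESPONSE" | was_truncated && echo "Output truncated; consider raising max_tokens" >&2` would flag incomplete answers.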
Processing Responses
Extract content from responses:
RESPONSE=$(curl -sSk \
-H "Authorization: Bearer ${ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"model": "simulator",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
],
"max_tokens": 50
}' \
"${MODEL_URL}")
# Extract the response content (quote the variable so the JSON is passed to jq unchanged)
CONTENT=$(echo "$RESPONSE" | jq -r '.choices[0].message.content')
echo "Model response: $CONTENT"
# Extract token usage
PROMPT_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.prompt_tokens')
COMPLETION_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.completion_tokens')
TOTAL_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.total_tokens')
echo "Token usage: $TOTAL_TOKENS total ($PROMPT_TOKENS prompt + $COMPLETION_TOKENS completion)"
Error Handling
Common Error Responses
401 Unauthorized
{
"error": {
"message": "Invalid authentication token",
"type": "invalid_request_error",
"code": "invalid_api_key"
}
}
403 Forbidden
{
"error": {
"message": "Insufficient permissions for this model",
"type": "permission_error",
"code": "access_denied"
}
}
429 Too Many Requests
{
"error": {
"message": "Rate limit exceeded",
"type": "rate_limit_error",
"code": "rate_limit_exceeded"
}
}
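Each of these errors calls for a different reaction, and the HTTP status code is often the simplest thing to branch on. The sketch below maps the three status codes listed above to a suggested next step. The function name and the action labels are hypothetical, and the commented curl call shows one way to capture the status with curl's `-w '%{http_code}'` option.

```shell
# Hypothetical helper: map an HTTP status code to a suggested next step.
suggested_action() {
  case "$1" in
    2??) echo "ok" ;;
    401) echo "refresh-token" ;;   # token invalid or expired: request a new one
    403) echo "check-tier" ;;      # model not available to this tier
    429) echo "retry-later" ;;     # rate limit hit: wait, then retry
    *)   echo "fail" ;;
  esac
}

# Example of capturing the status code alongside the body (not executed here):
#   STATUS=$(curl -sSk -o response.json -w '%{http_code}' \
#     -H "Authorization: Bearer ${ACCESS_TOKEN}" \
#     -H "Content-Type: application/json" \
#     -d '{"model": "simulator", "messages": [{"role": "user", "content": "test"}]}' \
#     "${MODEL_URL}")
#   suggested_action "$STATUS"
```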
Handling Errors in Scripts
make_request() {
local model_url="$1"
local prompt="$2"
local payload response error_message error_code
# Build the JSON body with jq so quotes and newlines in the prompt are escaped correctly
payload=$(jq -n --arg prompt "$prompt" \
'{model: "simulator", messages: [{role: "user", content: $prompt}], max_tokens: 100}')
response=$(curl -sSk \
-H "Authorization: Bearer ${ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d "$payload" \
"${model_url}")
# Check for errors
if echo "$response" | jq -e '.error' > /dev/null; then
error_message=$(echo "$response" | jq -r '.error.message')
error_code=$(echo "$response" | jq -r '.error.code')
echo "Error: $error_message (Code: $error_code)" >&2
return 1
fi
# Extract and return content
echo "$response" | jq -r '.choices[0].message.content'
}
# Usage
if result=$(make_request "$MODEL_URL" "Hello, world!"); then
echo "Success: $result"
else
echo "Request failed"
fi
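When the prompt or the model name comes from a variable, building the request body with `jq -n` keeps quotes and newlines in those values correctly escaped. A minimal sketch of a reusable builder; `build_chat_payload` is a hypothetical name, and the default of 100 for `max_tokens` simply mirrors the examples in this guide.

```shell
# Hypothetical helper: build a chat completion request body.
# Arguments: model id, user prompt, max_tokens (defaults to 100).
build_chat_payload() {
  jq -cn --arg model "$1" --arg prompt "$2" --argjson max_tokens "${3:-100}" \
    '{model: $model, messages: [{role: "user", content: $prompt}], max_tokens: $max_tokens}'
}
```

The result can be passed straight to curl, for example `-d "$(build_chat_payload simulator "$prompt" 50)"`.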
Rate Limiting and Quotas
Understanding Limits
Each tier has different limits:
| Tier | Requests/2min | Tokens/min |
|---|---|---|
| Free | 5 | 100 |
| Premium | 20 | 50,000 |
| Enterprise | 50 | 100,000 |
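A client can use these numbers to decide whether another request is likely to fit in the current minute. The sketch below copies the tokens-per-minute column from the table and compares it against tokens already used plus tokens planned. The function names are hypothetical, and how the platform itself counts tokens against the limit is not stated here, so treat the check as a rough client-side estimate.

```shell
# Tokens-per-minute limits copied from the table above.
tier_token_limit() {
  case "$1" in
    free)       echo 100 ;;
    premium)    echo 50000 ;;
    enterprise) echo 100000 ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Hypothetical check: succeed if tokens used so far this minute ($2) plus the
# tokens planned for the next request ($3) stay within the limit for tier $1.
fits_token_budget() {
  local limit
  limit=$(tier_token_limit "$1") || return 1
  [ $(( $2 + $3 )) -le "$limit" ]
}
```

For example, after a response reporting 21 total tokens, `fits_token_budget free 21 50` succeeds, while `fits_token_budget free 21 100` does not.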
Monitoring Usage
Check your current usage through response headers or the metrics dashboard:
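# Make a request and print only the response headers
# (-D - writes headers to stdout and -o /dev/null discards the body; -I cannot be combined with -d)
curl -sSk -D - -o /dev/null \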
-H "Authorization: Bearer ${ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d '{"model": "simulator", "messages": [{"role": "user", "content": "test"}]}' \
"${MODEL_URL}" | grep -i "x-ratelimit"
Implementing Rate Limiting
For applications that need to respect rate limits:
# Simple client-side handling: retry with a fixed wait when the rate limit is hit
make_request_with_backoff() {
local model_url="$1"
local prompt="$2"
local max_retries=3
local retry_count=0
local payload response
# Build the JSON body with jq so special characters in the prompt are escaped
payload=$(jq -n --arg prompt "$prompt" \
'{model: "simulator", messages: [{role: "user", content: $prompt}], max_tokens: 100}')
while [ $retry_count -lt $max_retries ]; do
response=$(curl -sSk \
-H "Authorization: Bearer ${ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d "$payload" \
"${model_url}")
# Check for rate limit error
if echo "$response" | jq -e '.error.code == "rate_limit_exceeded"' > /dev/null; then
retry_count=$((retry_count + 1))
echo "Rate limit exceeded, waiting before retry $retry_count/$max_retries..." >&2
sleep 30 # Wait 30 seconds before retry
elif echo "$response" | jq -e '.error' > /dev/null; then
# Any other error will not be fixed by waiting, so fail immediately
echo "Error: $(echo "$response" | jq -r '.error.message')" >&2
return 1
else
echo "$response" | jq -r '.choices[0].message.content'
return 0
fi
done
echo "Max retries exceeded" >&2
return 1
}
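A fixed 30-second wait is easy to reason about, but a delay that grows with each attempt usually recovers faster from short bursts and backs off further during sustained limiting. The sketch below computes an exponentially growing delay with a cap and prefers a server-provided hint when one exists. All three function names are hypothetical, and the use of a `Retry-After` header is an assumption: it is a standard HTTP header, but this guide does not say whether the gateway sends it.

```shell
# Hypothetical helper: print the value of response header $1 (case-insensitive).
# Reads raw response headers, as printed by curl -D -, on stdin.
header_value() {
  awk -v name="$1" '
    BEGIN { FS = ":"; name = tolower(name) }
    tolower($1) == name {
      sub(/^[^:]*:[ \t]*/, "")   # drop the header name and leading spaces
      sub(/\r$/, "")              # drop the trailing carriage return
      print
      exit
    }'
}

# Hypothetical helper: delay in seconds for a 0-based retry attempt,
# doubling from a base delay ($2) and capped at a maximum ($3).
backoff_delay() {
  local attempt="$1"
  local delay="$2"
  local cap="$3"
  local i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$(( $delay * 2 ))
    i=$(( $i + 1 ))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}

# Choose the wait: a numeric server hint if present, otherwise the computed backoff.
retry_wait() {
  case "$2" in
    ""|*[!0-9]*) backoff_delay "$1" 5 60 ;;
    *) echo "$2" ;;
  esac
}
```

In a retry loop, `sleep "$(backoff_delay "$attempt" 5 60)"` would wait 5, 10, 20, 40, and then 60 seconds for attempts 0 through 4.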