Deploy Model

Overview

Model deployment in Bridge lets you serve machine learning models as inference APIs so that applications and services can call them for predictions.

As a Tenant Admin, you can deploy models from Bridge model catalog, configure GPU resources and rate limits, and expose models via endpoints for tenant users and applications.

This guide covers:

Selecting a model from the catalog and starting deployment
Configuring the model name, endpoint, GPU type and count, and rate limits
Monitoring deployment until the model is running

After deployment, tenant users can test the model in the Model Playground.

Prerequisites

Tenant Admin access — You must log in as a Tenant Admin to deploy models from the catalog.
Model catalog — The model you want to deploy must be available in Bridge model catalog (see Prepare Model below).
Endpoint (optional) — For some models you may need to create or select an endpoint (e.g., an LLM endpoint) before or during deployment.

Prepare Model

Model Requirements

Models should be:

Serialized in standard format (SavedModel, ONNX, etc.)
Packaged with dependencies
Tested locally
Documented with input/output specs

Supported Formats

TensorFlow SavedModel
PyTorch TorchScript
ONNX models
Custom containers

Model catalog

Models are deployed from the Bridge model catalog. Bridge provides default models from Hugging Face (e.g., Qwen/Qwen2.5-1.5B-Instruct) and NIM (e.g., meta/llama-3.2-3b-instruct); you can deploy these directly from the catalog. The Available Models tab in the Models section lists all models that can be deployed. If the model you need is not listed, contact Bridge Super Administrator to add it to the catalog.

Model requirements (for catalog models)

Models in the catalog are typically:

Serialized in a standard format — For example, SavedModel, ONNX, or TorchScript, so they can be loaded and served by the runtime.
Packaged with dependencies — Any runtime or framework dependencies are included or specified.
Documented — Input/output specifications and usage are documented so you can configure deployment and test correctly.

The catalog may include models in these formats; the model card or catalog entry indicates the format.

Credentials (if required)

Some catalog models (for example, from Hugging Face) require credentials to pull the model or access gated assets. Have the following ready if your model needs them:

Hugging Face token — For models that require Hugging Face access; you will enter this when configuring the deployment (see Step 2).

note

If you do not have a Hugging Face token, create one from https://huggingface.co/ (sign in or sign up, then create a token in your account settings).

Deploy Model

Step 1: Select Model

Log in to Bridge as a Tenant Admin.
In the left sidebar, open Models.
Open the Available Models tab. All models available for deployment are listed.

Available Models

Hover over the model you want to deploy (e.g., Qwen/Qwen2.5-1.5B-Instruct). Click Deploy Model when it appears.

Select Model

Step 2: Configure Model and Endpoint

Enter a Model name (e.g., qwenmodel). The selected catalog model (e.g., Qwen/Qwen2.5-1.5B-Instruct) is auto-selected. Click Next.

Configure Model

Enter your Hugging Face Token if the model requires it. Select or create the Endpoint (e.g., llmendpoint) that will expose this model. Click Next.

Configure Model Endpoint

Select the GPU type (e.g., L4) and set GPU count (e.g., 1). Click Next.

Select Model GPU

Set rate limits and pricing as required, then click Deploy.

Example values:
- Token per minute — e.g., 4000000
- Request per minute — e.g., 50
- Currency — e.g., USD
- Price per million input tokens — e.g., 1
- Price per million output tokens — e.g., 1

Model Rate LimitGPU

Step 3: Monitor Deployment

Deployment typically takes 10–15 minutes.

Watch the deployment progress in the UI. The model status will initially show Processing.

Model Process State

When deployment completes successfully, the model status shows Running.

Model Success State

Performance Metrics

Monitor model performance:

Click on the AI Studio then click View details for the model.

Model view details

The following metrics are shown:

TTFT (Time To First Token)
TPOT (Time Per Output Token)
Input Token Rate
Output Token Rate
KV Cache Utilization
Prefix Cache Hit Rate
End-to-End (E2E) Latency
Request Prefill Time
Request Decode Time

Model LLm Observability-1

Model LLm Observability-2

Model Logs

View model serving logs:

kubectl logs deployment/<model-name>

Or through UI:

Model Logs

Scale Model

Add Replicas

Enable Autoscaling to dynamically adjust resources based on demand:

Enable Autoscaling to allow replicas to increase based on demand.
Enter Min replicas and Max replicas.
Enter the threshold value.

Scale Model

Make Predictions

Via API Endpoint

Call your deployed model:

curl -X POST https://model-endpoint.domain.com/predict \
  -H "Authorization: Bearer API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [[1.0, 2.0, 3.0, 4.0]]
  }'

Response Format

Model returns predictions:

{
  "predictions": [0.95],
  "confidence": 0.98
}

Best Practices

Model Serving

Version your models
Test locally before deployment
Monitor prediction latency
Set appropriate resource limits
Use health checks

Performance

Batch requests when possible
Optimize model size
Use quantization if appropriate
Monitor GPU/CPU utilization
Scale based on demand

Security

Authenticate API access
Rate limit requests
Monitor for abuse
Log all predictions
Encrypt data in transit

Overview​

Prerequisites​

Prepare Model​

Model Requirements​

Supported Formats​

Model catalog​

Model requirements (for catalog models)​

Credentials (if required)​

Deploy Model​

Step 1: Select Model​

Step 2: Configure Model and Endpoint​

Step 3: Monitor Deployment​

Performance Metrics​

Model Logs​

Scale Model​

Add Replicas​

Make Predictions​

Via API Endpoint​

Response Format​

Best Practices​

Model Serving​

Performance​

Security​

Next Steps​

Overview

Prerequisites

Prepare Model

Model Requirements

Supported Formats

Model catalog

Model requirements (for catalog models)

Credentials (if required)

Deploy Model

Step 1: Select Model

Step 2: Configure Model and Endpoint

Step 3: Monitor Deployment

Performance Metrics

Model Logs

Scale Model

Add Replicas

Make Predictions

Via API Endpoint

Response Format

Best Practices

Model Serving

Performance

Security

Next Steps