Version: 5.4.0

Deploy Hugging Face Model

Overview

This guide walks you through deploying a Hugging Face model from the Bridge model catalog. Hugging Face models are open-source models hosted on the Hugging Face Hub (e.g., Qwen/Qwen2.5-1.5B-Instruct, meta-llama/Llama-3). Bridge pulls the model weights from Hugging Face, serves the model on GPU infrastructure, and exposes it through an endpoint for inference.

This guide covers:

  • Selecting a Hugging Face model from the catalog and starting deployment
  • Configuring the model name, endpoint, GPU type and count, and rate limits
  • Providing a Hugging Face token (if required)
  • Monitoring deployment until the model is running

Prerequisites

  • Tenant Admin access: You must log in as a Tenant Admin to deploy models from the catalog.
  • Model catalog: The Hugging Face model you want to deploy must be available in the Bridge model catalog. The Available Models tab lists all deployable models. If the model you need is not listed, contact your Bridge Super Administrator to add it.
  • Endpoint: You may need to create or select an LLM endpoint before or during deployment.
  • Hugging Face token: Some models (especially gated models) require a Hugging Face access token to download model weights. Have your token ready.
Note: If you do not have a Hugging Face token, create one at https://huggingface.co/. Sign in or sign up, then create a token in your account settings.
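Before pasting a token into the deployment form, it can help to confirm that the Hugging Face Hub accepts it. The sketch below is not part of Bridge. It assumes the Hub's `whoami-v2` endpoint and uses only the Python standard library, and the helper names `build_whoami_request` and `token_is_valid` are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: the Hugging Face Hub "whoami" endpoint. This is not a Bridge API.
WHOAMI_URL = "https://huggingface.co/api/whoami-v2"


def build_whoami_request(token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated request identifying the token's owner."""
    token = token.strip()
    if not token:
        raise ValueError("token is empty")
    return urllib.request.Request(
        WHOAMI_URL, headers={"Authorization": f"Bearer {token}"}
    )


def token_is_valid(token: str, timeout: float = 10.0) -> bool:
    """Return True if the Hub accepts the token, False if it answers 401."""
    try:
        with urllib.request.urlopen(build_whoami_request(token), timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 401:
            return False
        raise  # other HTTP errors (rate limiting, outages) are not a verdict on the token
```

A valid token does not by itself prove access to a gated model. Access to gated models is usually granted per model on the model's Hugging Face page.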

Deploy Model

Step 1: Select Model

  1. Log in to Bridge as a Tenant Admin.
  2. In the left sidebar, open Models.
  3. Open the Available Models tab. All catalog models available for deployment are listed.


  4. Find the Hugging Face model you want to deploy (e.g., Qwen/Qwen2.5-1.5B-Instruct) and click Deploy.


Step 2: Model Details

  1. Enter a model Name (e.g., qwenmodel) and Description.
  2. Click Next.


Step 3: Model Configuration

  1. Select the dType (data type) from the dropdown.
  2. Enter your Hugging Face Token, which Bridge uses to download the model weights (required for gated models).
  3. Enter the Max Model Length.
Info:
  • dType controls the numerical precision used for the model's weights during inference — lower precision (e.g., float16) reduces GPU memory usage and speeds up inference, while higher precision (e.g., float32) preserves accuracy. Select auto to use the model's original precision.
  • Max Model Length is the maximum number of tokens (input + output combined) the model can process in a single request. A higher value allows longer prompts and responses but requires more GPU memory.
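To make the memory trade-offs above concrete, here is a rough back-of-the-envelope sketch. It assumes the usual byte sizes per data type and the standard transformer key/value cache estimate. The function names and the architecture numbers in the example (layers, KV heads, head dimension) are illustrative placeholders and are not taken from Bridge or from a specific model card. Real usage is higher because the serving engine adds its own overhead.

```python
# Bytes used to store one parameter (or one cached value) at each precision.
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "bfloat16": 2}

GIB = 1024**3


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate GPU memory needed just to hold the model weights."""
    return num_params * BYTES_PER_VALUE[dtype] / GIB


def kv_cache_gib(num_tokens: int, num_layers: int, num_kv_heads: int,
                 head_dim: int, dtype: str) -> float:
    """Approximate key/value cache size for num_tokens tokens of context.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_VALUE[dtype]
    return per_token * num_tokens / GIB


# A model with roughly 1.5 billion parameters, as the name Qwen2.5-1.5B suggests:
fp16_weights = weight_memory_gib(1.5e9, "float16")   # about 2.79 GiB
fp32_weights = weight_memory_gib(1.5e9, "float32")   # about 5.59 GiB

# Illustrative architecture numbers (assumptions), with a 32768-token Max Model Length:
cache = kv_cache_gib(32768, num_layers=28, num_kv_heads=2, head_dim=128, dtype="float16")
```

Under these assumptions, halving the precision halves the weight memory, and the cache grows linearly with Max Model Length. That is why a larger limit leaves less room on the same GPU.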


Step 4: Select Endpoint and GPU

  1. Select the Endpoint that will expose this model.
  2. Select the GPU type (e.g., L4).
  3. Set GPU count (e.g., 1).
  4. Click Next.


Step 5: Set Rate Limits and Pricing

  1. Configure the following rate limits and pricing:
    • Token per minute — e.g., 4000000
    • Request per minute — e.g., 50
    • Currency — e.g., USD
    • Price per million input tokens — e.g., 1
    • Price per million output tokens — e.g., 1
  2. Click Deploy.
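The example values above can be sanity-checked with a little arithmetic. The sketch below assumes that cost is simply the token count multiplied by the per-million price, and that both rate limits apply independently per minute. How Bridge actually meters and bills usage is not described here, so treat the `Limits` class and both functions as hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    tokens_per_minute: int
    requests_per_minute: int
    price_per_million_input: float
    price_per_million_output: float


def request_cost(limits: Limits, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request, in the configured currency."""
    total = (input_tokens * limits.price_per_million_input
             + output_tokens * limits.price_per_million_output)
    return total / 1_000_000


def binding_limit(limits: Limits, avg_tokens_per_request: int) -> str:
    """Report which limit is reached first at a constant request size."""
    requests_allowed_by_tokens = limits.tokens_per_minute / avg_tokens_per_request
    if requests_allowed_by_tokens < limits.requests_per_minute:
        return "tokens"
    return "requests"


# The example values from the list above.
example = Limits(tokens_per_minute=4_000_000, requests_per_minute=50,
                 price_per_million_input=1.0, price_per_million_output=1.0)

cost = request_cost(example, input_tokens=1200, output_tokens=300)  # 0.0015 USD
first_hit = binding_limit(example, avg_tokens_per_request=1500)     # "requests"
```

With these numbers the request limit is reached first unless requests average more than 80,000 tokens each (4,000,000 / 50), so the two limits are worth tuning together.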


Step 6: Monitor Deployment

Deployment typically takes 10–15 minutes.

  1. Watch the deployment progress in the UI. The model status will initially show Processing.


  2. When deployment completes successfully, the model status shows Running.
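This guide only describes monitoring through the UI. If you have some programmatic way to read the model status, the wait can be automated with a simple polling loop. In the sketch below, `get_status` is a hypothetical callable that you would supply yourself. No Bridge status API is documented here, and only the status strings Processing and Running come from this guide.

```python
import time
from typing import Callable


def wait_until_running(
    get_status: Callable[[], str],   # hypothetical: returns the current model status
    timeout_s: float = 20 * 60,      # a little above the typical 10-15 minutes
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status leaves "Processing" and return the final status."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status != "Processing":
            return status  # "Running" on success; anything else needs attention
        if clock() >= deadline:
            raise TimeoutError(f"model still Processing after {timeout_s} seconds")
        sleep(interval_s)
```

Returning any non-Processing status, instead of waiting specifically for Running, means the loop stops promptly if the deployment ends in a different state.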


Next Steps