GPU Overview

Bridge discovers and manages GPUs as part of metal provisioning. GPU details are retrieved from each server during the Redfish discovery phase and stored in the Bridge catalog as hardware attributes. This information drives flavor creation, compute allocation, and resource scheduling.

Supported GPU Families

| Vendor | Platform | Interconnect | Isolation Model |
| --- | --- | --- | --- |
| NVIDIA | H100 PCIe, A100 PCIe | PCIe | IOMMU passthrough |
| NVIDIA | HGX H100, HGX H200, HGX B200 | NVLink + NVSwitch | IOMMU + NVLink Secure Partition |
| NVIDIA | GB200 NVL72 | NVLink + NVSwitch | IOMMU + NVLink Secure Partition |
| NVIDIA | GH200 | NVLink + NVSwitch | IOMMU + NVLink Secure Partition |
| AMD | Instinct MI300X | PCIe | In progress |

GPU Post-Provisioning

After the OS is deployed, Bridge's post-provisioning controller installs the GPU software stack on each server:

| Component | Purpose |
| --- | --- |
| CUDA libraries | Enable GPU compute workloads on NVIDIA hardware |
| MOFED (Mellanox OFED) | Enable RDMA networking for GPUDirect RDMA |
| GPU kernel modules | Load NVIDIA or AMD GPU drivers and kernel extensions |
| DCGM (Data Center GPU Manager) | Collect GPU health metrics and detect faults |
| OTEL agent | Export GPU metrics to the Bridge observability pipeline |

Post-provisioning runs automatically after OS installation and completes before the server is made available for tenant allocation.
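The components above have natural install-order constraints (for example, CUDA libraries and DCGM both depend on the GPU kernel modules being loaded). A minimal sketch of such an ordering, using a hypothetical dependency graph — the actual sequence Bridge uses is an implementation detail not stated here:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph for the post-provisioning components
# listed above; edges are assumptions, not Bridge's documented order.
deps = {
    "gpu_kernel_modules": set(),               # drivers load first
    "mofed": set(),                            # RDMA stack is independent
    "cuda_libraries": {"gpu_kernel_modules"},  # CUDA needs the driver
    "dcgm": {"gpu_kernel_modules"},            # DCGM talks to the driver
    "otel_agent": {"dcgm"},                    # exporter scrapes DCGM
}

# static_order() yields one valid install order respecting every edge.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any topological order of this graph is a valid install sequence; modeling the steps this way lets independent components (such as MOFED) be installed in parallel with the GPU driver.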

GPU Resource Models

Bridge supports three GPU resource allocation models depending on the deployment type:

Full GPU Passthrough

The default model for bare metal and VM deployments. Each GPU is assigned exclusively to a single tenant via PCIe device passthrough (IOMMU). This provides maximum performance and full hardware isolation.

For HGX systems, Bridge additionally configures an NVLink Secure Partition, ensuring that full NVLink bandwidth is available within the tenant's GPU group while cross-tenant NVLink traffic is blocked at the hardware level.

MIG Partitioning

On NVIDIA A100 and H100 GPUs, Bridge can prepare servers for Multi-Instance GPU (MIG) partitioning. MIG divides a single GPU into up to seven isolated instances, each with dedicated compute units, L2 cache, and memory bandwidth. Bridge configures MIG instances as part of the compute allocation flow, with the Kubernetes device plugin managing instance-level scheduling.
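Because each MIG instance consumes a fixed number of the GPU's seven compute slices, validating a requested mix of instances reduces to a capacity check. A toy validator, using NVIDIA's A100 profile names (the leading "Ng" is the compute-slice count); the function is illustrative, not Bridge's API, and it ignores MIG's additional placement constraints:

```python
# Compute-slice cost per MIG profile (A100 40GB naming); the leading
# "Ng" in each profile name is its number of GPU compute slices.
PROFILE_SLICES = {"1g.5gb": 1, "2g.10gb": 2, "3g.20gb": 3,
                  "4g.20gb": 4, "7g.40gb": 7}
MAX_SLICES = 7  # an A100/H100 exposes at most seven MIG compute slices

def fits_on_gpu(profiles):
    """Hypothetical check: does this mix of instances fit on one GPU?

    Real MIG configuration also enforces placement rules, which this
    sketch deliberately omits.
    """
    return sum(PROFILE_SLICES[p] for p in profiles) <= MAX_SLICES

print(fits_on_gpu(["3g.20gb", "3g.20gb"]))  # two half-GPU instances fit
print(fits_on_gpu(["7g.40gb", "1g.5gb"]))   # 8 slices exceed capacity
```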

vGPU (Time-Sliced)

Bridge supports vGPU deployments using NVIDIA's licensed vGPU stack. vGPU enables multiple VMs to share a single physical GPU through time-sliced scheduling. This model requires NVIDIA vGPU software licenses and is configured during VM provisioning.

Where stronger quality-of-service guarantees are needed, MIG-backed vGPU combines hardware spatial partitioning (MIG) with temporal partitioning (vGPU), providing guaranteed isolation between tenant slices.
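At its simplest, time-sliced sharing means each VM receives the physical GPU for one scheduling slice in turn. A toy round-robin illustration — NVIDIA's actual vGPU schedulers (best-effort, equal-share, fixed-share) are considerably more sophisticated:

```python
from itertools import cycle, islice

def schedule(vms, num_slices):
    """Toy round-robin: assign GPU time slices to VMs in rotation.

    Purely illustrative of time-sliced sharing; not NVIDIA's scheduler.
    """
    return list(islice(cycle(vms), num_slices))

print(schedule(["vm-a", "vm-b", "vm-c"], 6))
# → ['vm-a', 'vm-b', 'vm-c', 'vm-a', 'vm-b', 'vm-c']
```

Under round-robin each VM gets an equal share of GPU time but no guaranteed latency, which is why latency-sensitive tenants are better served by MIG-backed slices.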

GPU Observability

Bridge collects GPU health and performance metrics using DCGM, and forwards them via OTEL to the Bridge monitoring pipeline. Metrics are visible on the Bridge admin dashboard and can be used for capacity planning, fault detection, and tenant billing.

| Metric Category | Examples |
| --- | --- |
| Compute utilization | GPU core utilization (%) |
| Memory | GPU memory used and free |
| Thermal | GPU temperature, thermal throttling |
| Power | GPU power draw, power cap |
| Errors | ECC memory errors, retired pages |
| NVLink | NVLink bandwidth (HGX systems) |
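Conceptually, the export path reshapes DCGM field values into per-GPU gauge data points. A sketch of that mapping — the field names mirror common DCGM identifiers (e.g. `DCGM_FI_DEV_GPU_UTIL`), but the transformation itself is illustrative, not Bridge's exporter:

```python
# Sample DCGM-style field values for one GPU; names follow common DCGM
# field identifiers, values are made up for illustration.
DCGM_SAMPLE = {
    "DCGM_FI_DEV_GPU_UTIL": 87,        # compute utilization, %
    "DCGM_FI_DEV_FB_USED": 40_960,     # framebuffer memory used, MiB
    "DCGM_FI_DEV_GPU_TEMP": 64,        # temperature, degrees C
    "DCGM_FI_DEV_POWER_USAGE": 312.5,  # power draw, W
}

def to_otel_gauges(sample, gpu_id):
    """Hypothetical reshaping of DCGM fields into OTEL-like gauge points,
    tagging each point with the GPU it came from."""
    return [
        {"name": field.lower(), "value": value,
         "attributes": {"gpu": gpu_id}}
        for field, value in sample.items()
    ]

for point in to_otel_gauges(DCGM_SAMPLE, "GPU-0"):
    print(point)
```

Tagging every data point with a GPU identifier is what lets the dashboard break metrics down per device for capacity planning and tenant billing.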
  • NVIDIA — NVIDIA GPU configuration and isolation models
  • AMD — AMD Instinct GPU support
  • NVSwitch — NVLink Secure Partition for HGX multi-GPU servers
  • Metal Provisioning Overview — Full server provisioning sequence