# GPU Overview
Bridge discovers and manages GPUs as part of metal provisioning. GPU details are retrieved from each server during the Redfish discovery phase and stored in the Bridge catalog as hardware attributes. This information drives flavor creation, compute allocation, and resource scheduling.
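To make the discovery step concrete, here is a minimal sketch of a Redfish GPU inventory query. The BMC address, credentials, and returned attribute mapping are illustrative placeholders, not Bridge's actual discovery code; the DMTF Redfish Processor schema does, however, tag GPUs with a `ProcessorType` of `GPU`.

```python
import requests

BMC = "https://10.0.0.10"          # hypothetical BMC address
AUTH = ("admin", "password")        # hypothetical credentials

def discover_gpus(system_id: str = "1") -> list[dict]:
    """Return GPU entries from the system's Redfish Processors collection."""
    url = f"{BMC}/redfish/v1/Systems/{system_id}/Processors"
    collection = requests.get(url, auth=AUTH, verify=False).json()
    gpus = []
    for member in collection.get("Members", []):
        proc = requests.get(f"{BMC}{member['@odata.id']}",
                            auth=AUTH, verify=False).json()
        # The Redfish Processor schema identifies GPUs via ProcessorType.
        if proc.get("ProcessorType") == "GPU":
            gpus.append({
                "id": proc.get("Id"),
                "model": proc.get("Model"),
                "vendor": proc.get("Manufacturer"),
            })
    return gpus

if __name__ == "__main__":
    for gpu in discover_gpus():
        print(gpu)
```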
## Supported GPU Families
| Vendor | Platform | Interconnect | Isolation Model |
|---|---|---|---|
| NVIDIA | H100 PCIe, A100 PCIe | PCIe | IOMMU passthrough |
| NVIDIA | HGX H100, HGX H200, HGX B200 | NVLink + NVSwitch | IOMMU + NVLink Secure Partition |
| NVIDIA | GB200 NVL72 | NVLink + NVSwitch | IOMMU + NVLink Secure Partition |
| NVIDIA | GH200 | NVLink + NVSwitch | IOMMU + NVLink Secure Partition |
| AMD | Instinct MI300X | PCIe | Support in progress |
## GPU Post-Provisioning
After the OS is deployed, Bridge's post-provisioning controller installs the GPU software stack on each server:
| Component | Purpose |
|---|---|
| CUDA libraries | Enable GPU compute workloads on NVIDIA hardware |
| MOFED (Mellanox OFED) | Enable RDMA networking for GPUDirect RDMA |
| GPU kernel modules | Load the NVIDIA or AMD GPU driver kernel modules |
| DCGM (Data Center GPU Manager) | Collect GPU health metrics and detect faults |
| OTEL agent | Export GPU metrics to Bridge observability pipeline |
Post-provisioning runs automatically after OS installation and completes before the server is made available for tenant allocation.
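As a rough illustration of the install sequence, the sketch below runs the stack components in dependency order on an Ubuntu host. The package names, driver version, and ordering are assumptions for illustration, not Bridge's actual controller logic.

```python
import subprocess

# Illustrative install sequence: kernel modules and CUDA userspace first,
# then RDMA, then telemetry. All package names are assumed, not Bridge's.
STEPS = [
    ["apt-get", "install", "-y", "nvidia-driver-550"],        # GPU kernel modules (assumed version)
    ["apt-get", "install", "-y", "cuda-toolkit"],             # CUDA libraries
    ["apt-get", "install", "-y", "mlnx-ofed-all"],            # MOFED for GPUDirect RDMA
    ["apt-get", "install", "-y", "datacenter-gpu-manager"],   # DCGM
    ["apt-get", "install", "-y", "otelcol-contrib"],          # OTEL collector (assumed agent choice)
    ["systemctl", "enable", "--now", "nvidia-dcgm"],          # start the DCGM host engine
]

def run_post_provisioning() -> None:
    for cmd in STEPS:
        # Fail fast so the server is never released half-configured.
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_post_provisioning()
```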
## GPU Resource Models
Bridge supports three GPU resource allocation models depending on the deployment type:
### Full GPU Passthrough
The default model for bare metal and VM deployments. Each GPU is assigned exclusively to a single tenant via PCIe device passthrough (IOMMU). This provides maximum performance and full hardware isolation.
For HGX systems, Bridge additionally configures an NVLink Secure Partition, ensuring that full NVLink bandwidth is available within the tenant's GPU group while cross-tenant NVLink traffic is blocked at the hardware level.
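At the host level, assigning a device for passthrough typically means binding it to the `vfio-pci` driver. A minimal sketch using the kernel's standard sysfs `driver_override` mechanism follows; the PCI address is a hypothetical placeholder, and the script must run as root.

```python
from pathlib import Path

def bind_to_vfio(pci_addr: str) -> None:
    """Rebind a PCI device to vfio-pci for IOMMU passthrough."""
    dev = Path("/sys/bus/pci/devices") / pci_addr
    # Detach the device from whatever driver currently owns it.
    driver = dev / "driver"
    if driver.exists():
        (driver / "unbind").write_text(pci_addr)
    # Force vfio-pci to claim the device, then trigger a driver probe.
    (dev / "driver_override").write_text("vfio-pci")
    Path("/sys/bus/pci/drivers_probe").write_text(pci_addr)

bind_to_vfio("0000:17:00.0")  # hypothetical GPU address
```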
### MIG Partitioning
On NVIDIA A100 and H100 GPUs, Bridge can prepare servers for Multi-Instance GPU (MIG) partitioning. MIG divides a single GPU into up to seven isolated instances, each with dedicated compute units, L2 cache, and memory bandwidth. Bridge configures MIG instances as part of the compute allocation flow, with the Kubernetes device plugin managing instance-level scheduling.
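For illustration, MIG mode and instances can be configured directly with `nvidia-smi`. The sketch below assumes a single H100 and splits it into seven `1g.10gb` instances; both the GPU index and the profile choice are illustrative, not Bridge's actual allocation parameters.

```python
import subprocess

GPU = "0"  # hypothetical GPU index

# Enable MIG mode on the GPU (may require a GPU reset to take effect).
subprocess.run(["nvidia-smi", "-i", GPU, "-mig", "1"], check=True)

# Create seven 1g.10gb GPU instances, each with a matching compute instance (-C).
profiles = ",".join(["1g.10gb"] * 7)
subprocess.run(["nvidia-smi", "mig", "-i", GPU, "-cgi", profiles, "-C"], check=True)

# List the resulting GPU instances for verification.
subprocess.run(["nvidia-smi", "mig", "-lgi"], check=True)
```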
### vGPU (Time-Sliced)
Bridge supports vGPU deployments using NVIDIA's licensed vGPU stack. vGPU enables multiple VMs to share a single physical GPU through time-sliced scheduling. This model requires NVIDIA vGPU software licenses and is configured during VM provisioning.
For the highest quality of service, MIG-backed vGPU combines hardware spatial partitioning (MIG) with temporal partitioning (vGPU), providing guaranteed isolation between tenant slices.
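On a KVM host, a time-sliced vGPU instance is created as a mediated device (mdev). A minimal sketch, assuming an mdev-based vGPU host with the NVIDIA vGPU manager installed; the PCI address and vGPU type are hypothetical placeholders, and newer GPUs expose vGPU through SR-IOV virtual functions instead.

```python
import uuid
from pathlib import Path

def create_vgpu(pci_addr: str, mdev_type: str) -> str:
    """Instantiate a vGPU mdev on the given physical GPU; returns its UUID."""
    vgpu_uuid = str(uuid.uuid4())
    create = (Path("/sys/bus/pci/devices") / pci_addr
              / "mdev_supported_types" / mdev_type / "create")
    create.write_text(vgpu_uuid)  # the kernel instantiates the vGPU on write
    return vgpu_uuid  # handed to the hypervisor as a hostdev at VM creation

print(create_vgpu("0000:17:00.0", "nvidia-233"))  # both arguments hypothetical
```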
## GPU Observability
Bridge collects GPU health and performance metrics using DCGM and forwards them via OTEL to the Bridge monitoring pipeline. Metrics are visible on the Bridge admin dashboard and can be used for capacity planning, fault detection, and tenant billing. A minimal collection sketch follows the table below.
| Metric Category | Examples |
|---|---|
| Compute utilization | GPU core utilization (%) |
| Memory | GPU memory used and free |
| Thermal | GPU temperature, thermal throttling |
| Power | GPU power draw, power cap |
| Errors | ECC memory errors, retired pages |
| NVLink | NVLink bandwidth (HGX systems) |
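The sketch below polls DCGM once through `dcgmi dmon`, assuming the DCGM host engine is running on the server. It samples one representative field from several of the categories above: 203 (GPU utilization), 252 (framebuffer used), 150 (temperature), and 155 (power draw).

```python
import subprocess

# One-shot DCGM sample across all GPUs; -e selects field IDs, -c 1 polls once.
out = subprocess.run(
    ["dcgmi", "dmon", "-e", "203,252,150,155", "-c", "1"],
    check=True, capture_output=True, text=True,
).stdout
print(out)  # in Bridge, these values flow out via the OTEL agent instead
```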
## Related Pages
- NVIDIA — NVIDIA GPU configuration and isolation models
- AMD — AMD Instinct GPU support
- NVSwitch — NVLink Secure Partition for HGX multi-GPU servers
- Metal Provisioning Overview — Full server provisioning sequence