GPU Overview

Bridge discovers and manages GPUs as part of metal provisioning. GPU details are retrieved from each server during the Redfish discovery phase and stored in the Bridge catalog as hardware attributes. This information drives flavor creation, compute allocation, and resource scheduling.

Supported GPU Families

| Vendor | Platform | Interconnect | Isolation Model |
| --- | --- | --- | --- |
| NVIDIA | H100 PCIe, A100 PCIe | PCIe | IOMMU passthrough |
| NVIDIA | HGX H100, HGX H200, HGX B200 | NVLink + NVSwitch | IOMMU + NVLink Secure Partition |
| NVIDIA | GB200 NVL72 | NVLink + NVSwitch | IOMMU + NVLink Secure Partition |
| NVIDIA | GH200 | NVLink + NVSwitch | IOMMU + NVLink Secure Partition |
| AMD | Instinct MI300X | PCIe | In progress |

GPU Post-Provisioning

After the OS is deployed, Bridge's post-provisioning controller installs the GPU software stack on each server:

| Component | Purpose |
| --- | --- |
| CUDA libraries | Enable GPU compute workloads on NVIDIA hardware |
| MOFED (Mellanox OFED) | Enable RDMA networking for GPUDirect RDMA |
| GPU kernel modules | Load NVIDIA or AMD GPU drivers and kernel extensions |
| DCGM (Data Center GPU Manager) | Collect GPU health metrics and detect faults |
| OTEL agent | Export GPU metrics to the Bridge observability pipeline |

Post-provisioning runs automatically after OS installation and completes before the server is made available for tenant allocation.
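The components above have natural install-order constraints (for example, CUDA libraries and DCGM both depend on the GPU kernel modules being loaded). A minimal sketch of such an ordering, using a hypothetical dependency graph — the actual sequence Bridge uses is an implementation detail not stated here:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph for the post-provisioning components
# listed above; edges are assumptions, not Bridge's documented order.
deps = {
    "gpu_kernel_modules": set(),               # drivers load first
    "mofed": set(),                            # RDMA stack is independent
    "cuda_libraries": {"gpu_kernel_modules"},  # CUDA needs the driver
    "dcgm": {"gpu_kernel_modules"},            # DCGM talks to the driver
    "otel_agent": {"dcgm"},                    # exporter scrapes DCGM
}

# static_order() yields one valid install order respecting every edge.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any topological order of this graph is a valid install sequence; modeling the steps this way lets independent components (such as MOFED) be installed in parallel with the GPU driver.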

GPU Resource Models

Bridge supports three GPU resource allocation models depending on the deployment type:

Full GPU Passthrough

The default model for bare metal and VM deployments. Each GPU is assigned exclusively to a single tenant via PCIe device passthrough (IOMMU). This provides maximum performance and full hardware isolation.

For HGX systems, Bridge additionally configures an NVLink Secure Partition, ensuring that full NVLink bandwidth is available within the tenant's GPU group while cross-tenant NVLink traffic is blocked at the hardware level.

MIG Partitioning

On NVIDIA A100 and H100 GPUs, Bridge can prepare servers for Multi-Instance GPU (MIG) partitioning. MIG divides a single GPU into up to seven isolated instances, each with dedicated compute units, L2 cache, and memory bandwidth. Bridge configures MIG instances as part of the compute allocation flow, with the Kubernetes device plugin managing instance-level scheduling.
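Because each MIG instance consumes a fixed number of the GPU's seven compute slices, validating a requested mix of instances reduces to a capacity check. A toy validator, using NVIDIA's A100 profile names (the leading "Ng" is the compute-slice count); the function is illustrative, not Bridge's API, and it ignores MIG's additional placement constraints:

```python
# Compute-slice cost per MIG profile (A100 40GB naming); the leading
# "Ng" in each profile name is its number of GPU compute slices.
PROFILE_SLICES = {"1g.5gb": 1, "2g.10gb": 2, "3g.20gb": 3,
                  "4g.20gb": 4, "7g.40gb": 7}
MAX_SLICES = 7  # an A100/H100 exposes at most seven MIG compute slices

def fits_on_gpu(profiles):
    """Hypothetical check: does this mix of instances fit on one GPU?

    Real MIG configuration also enforces placement rules, which this
    sketch deliberately omits.
    """
    return sum(PROFILE_SLICES[p] for p in profiles) <= MAX_SLICES

print(fits_on_gpu(["3g.20gb", "3g.20gb"]))  # two half-GPU instances fit
print(fits_on_gpu(["7g.40gb", "1g.5gb"]))   # 8 slices exceed capacity
```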

vGPU (Time-Sliced)

Bridge supports vGPU deployments using NVIDIA's licensed vGPU stack. vGPU enables multiple VMs to share a single physical GPU through time-sliced scheduling. This model requires NVIDIA vGPU software licenses and is configured during VM provisioning.

Where stronger quality-of-service guarantees are needed, MIG-backed vGPU combines hardware spatial partitioning (MIG) with temporal partitioning (vGPU), providing guaranteed isolation between tenant slices.
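At its simplest, time-sliced sharing means each VM receives the physical GPU for one scheduling slice in turn. A toy round-robin illustration — NVIDIA's actual vGPU schedulers (best-effort, equal-share, fixed-share) are considerably more sophisticated:

```python
from itertools import cycle, islice

def schedule(vms, num_slices):
    """Toy round-robin: assign GPU time slices to VMs in rotation.

    Purely illustrative of time-sliced sharing; not NVIDIA's scheduler.
    """
    return list(islice(cycle(vms), num_slices))

print(schedule(["vm-a", "vm-b", "vm-c"], 6))
# → ['vm-a', 'vm-b', 'vm-c', 'vm-a', 'vm-b', 'vm-c']
```

Under round-robin each VM gets an equal share of GPU time but no guaranteed latency, which is why latency-sensitive tenants are better served by MIG-backed slices.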

GPU Observability

Bridge collects GPU health and performance metrics using DCGM, and forwards them via OTEL to the Bridge monitoring pipeline. Metrics are visible on the Bridge admin dashboard and can be used for capacity planning, fault detection, and tenant billing.

| Metric Category | Examples |
| --- | --- |
| Compute utilization | GPU core utilization (%) |
| Memory | GPU memory used and free |
| Thermal | GPU temperature, thermal throttling |
| Power | GPU power draw, power cap |
| Errors | ECC memory errors, retired pages |
| NVLink | NVLink bandwidth (HGX systems) |
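Conceptually, the export path reshapes DCGM field values into per-GPU gauge data points. A sketch of that mapping — the field names mirror common DCGM identifiers (e.g. `DCGM_FI_DEV_GPU_UTIL`), but the transformation itself is illustrative, not Bridge's exporter:

```python
# Sample DCGM-style field values for one GPU; names follow common DCGM
# field identifiers, values are made up for illustration.
DCGM_SAMPLE = {
    "DCGM_FI_DEV_GPU_UTIL": 87,        # compute utilization, %
    "DCGM_FI_DEV_FB_USED": 40_960,     # framebuffer memory used, MiB
    "DCGM_FI_DEV_GPU_TEMP": 64,        # temperature, degrees C
    "DCGM_FI_DEV_POWER_USAGE": 312.5,  # power draw, W
}

def to_otel_gauges(sample, gpu_id):
    """Hypothetical reshaping of DCGM fields into OTEL-like gauge points,
    tagging each point with the GPU it came from."""
    return [
        {"name": field.lower(), "value": value,
         "attributes": {"gpu": gpu_id}}
        for field, value in sample.items()
    ]

for point in to_otel_gauges(DCGM_SAMPLE, "GPU-0"):
    print(point)
```

Tagging every data point with a GPU identifier is what lets the dashboard break metrics down per device for capacity planning and tenant billing.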
  • NVIDIA — NVIDIA GPU configuration and isolation models
  • AMD — AMD Instinct GPU support
  • NVSwitch — NVLink Secure Partition for HGX multi-GPU servers
  • Metal Provisioning Overview — Full server provisioning sequence