NVIDIA GPU

Bridge manages NVIDIA GPUs across PCIe and HGX form factors, providing hardware-enforced tenant isolation, full NVLink performance on HGX systems, and automated post-provisioning of the GPU software stack.

Supported Platforms

| Platform | GPU Count | Interconnect | NVSwitch | MIG Support |
|---|---|---|---|---|
| H100 PCIe | 1–8 per server | PCIe | No | Yes |
| A100 PCIe | 1–8 per server | PCIe | No | Yes |
| HGX H100 | 8 | NVLink + NVSwitch | Yes | Yes |
| HGX H200 | 8 | NVLink + NVSwitch | Yes | Yes |
| HGX B200 | 8 | NVLink + NVSwitch | Yes | Yes |
| GB200 NVL72 | 72 (36 Grace Blackwell superchips) | NVLink + NVSwitch | Yes | Yes |
| GH200 | 1 per module | NVLink | Optional | Yes |

GPU Discovery and Catalog

During Redfish discovery, Bridge queries each server for PCIe device details and identifies NVIDIA GPUs by PCI vendor and device ID. The following attributes are recorded in the Bridge catalog for each server:

  • GPU model and count
  • GPU memory per device
  • NVSwitch presence (HGX systems)
  • NVLink capability
  • MIG support

These attributes are used to create GPU flavor templates in the Bridge catalog, which tenants select when requesting compute allocation.
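The vendor-ID filter at the heart of this discovery step can be sketched as follows. This is a simplified illustration, not Bridge's actual implementation: the field names follow the Redfish PCIeFunction schema (`VendorId`, `DeviceId`), the device ID shown is an example, and the catalog record shape is hypothetical.

```python
# NVIDIA's PCI vendor ID; any PCIe function reporting it is a
# candidate GPU (or other NVIDIA device) for the catalog.
NVIDIA_VENDOR_ID = "0x10de"

def catalog_gpus(pcie_functions: list[dict]) -> dict:
    """Summarize the NVIDIA devices found in a server's PCIe functions."""
    gpus = [f for f in pcie_functions
            if f.get("VendorId", "").lower() == NVIDIA_VENDOR_ID]
    return {
        "gpu_count": len(gpus),
        "device_ids": sorted({f["DeviceId"] for f in gpus}),
    }

# Example Redfish-style payload for one server (device IDs illustrative).
server = [
    {"VendorId": "0x10DE", "DeviceId": "0x2330"},  # GPU
    {"VendorId": "0x10DE", "DeviceId": "0x2330"},  # GPU
    {"VendorId": "0x8086", "DeviceId": "0x1234"},  # non-GPU device
]
entry = catalog_gpus(server)
```

The resulting summary (`gpu_count`, distinct `device_ids`) is the kind of attribute set that feeds the GPU flavor templates described above.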

GPU Isolation

PCIe Servers (H100 PCIe / A100 PCIe)

On servers without NVSwitch, Bridge enforces GPU isolation using IOMMU passthrough:

  1. Bridge enables the IOMMU on the host kernel command line (intel_iommu=on on Intel hosts; AMD IOMMUs are enabled by default).
  2. Each GPU's IOMMU group is identified and the device is bound to the vfio-pci driver.
  3. At compute allocation, Bridge passes the GPU directly through to the tenant's VM or bare metal partition.

IOMMU passthrough ensures that a GPU assigned to one tenant cannot access host memory or memory belonging to another tenant's GPU.
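Step 2 above reduces to a short sequence of sysfs writes on Linux. The dry-run sketch below emits that sequence for a given GPU; the sysfs paths are standard Linux VFIO paths, but the function itself is illustrative, and the PCI address and vendor:device pair in the example are placeholders.

```python
def vfio_bind_plan(pci_addr: str, vendor_device: str) -> list[tuple[str, str]]:
    """Return the (sysfs_path, value) writes that move a PCI device
    from its current driver to vfio-pci. Dry run only: nothing is written."""
    return [
        # 1. Detach the device from whichever driver currently owns it.
        (f"/sys/bus/pci/devices/{pci_addr}/driver/unbind", pci_addr),
        # 2. Tell vfio-pci to accept this vendor:device ID.
        ("/sys/bus/pci/drivers/vfio-pci/new_id", vendor_device),
        # 3. Bind the device to vfio-pci.
        ("/sys/bus/pci/drivers/vfio-pci/bind", pci_addr),
    ]

# Example: a GPU at a placeholder PCI address with a placeholder ID pair.
plan = vfio_bind_plan("0000:17:00.0", "10de 2330")
```

Once bound, the device is exposed to the hypervisor (or bare metal tenant) only through its IOMMU group, which is what enforces the memory isolation described above.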

HGX Servers (NVSwitch Systems)

On HGX systems, Bridge enforces isolation at two hardware levels simultaneously:

| Level | Mechanism |
|---|---|
| Host level | IOMMU device passthrough: GPU mapped exclusively to the tenant's compute |
| Fabric level | NVLink Secure Partition: NVLink routing blocked between tenant GPU groups |

Bridge integrates with NVIDIA Fabric Manager to configure NVLink Secure Partitions. When a tenant's GPUs are allocated, Bridge programs the NVSwitch routing tables to create a private NVLink domain containing only the tenant's GPUs. Within the partition, full NVLink bandwidth is available. Across partitions, all NVLink traffic is blocked at the hardware level.
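The invariant behind these partitions is simple: a partition is a set of GPUs, and no GPU may be routed into two partitions. The sketch below models that invariant only; it does not reflect the Fabric Manager API, which actually programs the NVSwitch routing tables.

```python
def validate_partitions(partitions: dict[str, set[int]]) -> bool:
    """Check that tenant GPU sets are pairwise disjoint, i.e. that no
    GPU would be reachable from two NVLink partitions."""
    seen: set[int] = set()
    for tenant, gpus in partitions.items():
        if gpus & seen:
            # A GPU appears in two partitions: cross-tenant NVLink
            # reachability, which the fabric must never allow.
            return False
        seen |= gpus
    return True

# Two tenants splitting an 8-GPU HGX system 4/4.
ok = validate_partitions({"tenant-a": {0, 1, 2, 3}, "tenant-b": {4, 5, 6, 7}})
```

Within each validated set, NVSwitch routing provides full-bandwidth NVLink; between sets, no routes exist at all, which is stronger than software-level access control.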

See NVSwitch for the full isolation architecture.

GPU Post-Provisioning

After OS deployment, Bridge's post-provisioning controller installs the NVIDIA GPU software stack:

| Component | Notes |
|---|---|
| CUDA toolkit | Version managed per server flavor in the Bridge catalog |
| cuDNN | Installed alongside the CUDA toolkit |
| NCCL | Enables high-performance collective communication over NVLink and RDMA |
| MOFED (Mellanox OFED) | Required for GPUDirect RDMA over RoCE or InfiniBand |
| NVIDIA kernel driver | Loaded as kernel modules (nvidia, nvidia-uvm, nvidia-drm) |
| DCGM | Deployed as a systemd service for GPU health monitoring |

For VM deployments, Bridge additionally configures PCIe GPU passthrough and ensures a 1:1 mapping between GPUs and compute NICs.
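One reasonable way to realize that 1:1 mapping is to pair each GPU with a compute NIC on the same NUMA node, so GPUDirect RDMA traffic stays local to the socket. The sketch below assumes that policy and equal GPU/NIC counts; the device records and PCI addresses are illustrative, not Bridge's actual data model.

```python
def pair_gpus_to_nics(gpus: list[dict], nics: list[dict]) -> dict[str, str]:
    """Map each GPU's PCI address to a NIC's PCI address, preferring a
    NIC on the same NUMA node. Assumes len(gpus) == len(nics)."""
    pairs: dict[str, str] = {}
    free = list(nics)
    for gpu in gpus:
        # Same-NUMA NIC if available, otherwise any remaining NIC.
        match = next((n for n in free if n["numa"] == gpu["numa"]), free[0])
        free.remove(match)
        pairs[gpu["addr"]] = match["addr"]
    return pairs

# Two GPUs and two NICs split across NUMA nodes (placeholder addresses).
gpus = [{"addr": "0000:17:00.0", "numa": 0}, {"addr": "0000:b3:00.0", "numa": 1}]
nics = [{"addr": "0000:b5:00.0", "numa": 1}, {"addr": "0000:19:00.0", "numa": 0}]
mapping = pair_gpus_to_nics(gpus, nics)
```

Pinning the pairing at provisioning time keeps the GPU-to-NIC path deterministic for tenants running NCCL over RDMA.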

MIG Partitioning

Bridge supports NVIDIA Multi-Instance GPU (MIG) on H100 and A100 GPUs. MIG divides a single GPU into isolated instances with dedicated compute slices, L2 cache, and memory bandwidth — providing hardware-level QoS for multi-tenant workloads.

| MIG Profile | Compute Slices | Memory (H100 80GB) |
|---|---|---|
| 1g.10gb | 1 | 10 GB |
| 2g.20gb | 2 | 20 GB |
| 3g.40gb | 3 | 40 GB |
| 4g.40gb | 4 | 40 GB |
| 7g.80gb | 7 (full GPU) | 80 GB |

Bridge prepares each server for MIG as part of compute post-provisioning. The NVIDIA device plugin for Kubernetes manages MIG instance scheduling at the pod level, enabling tenants to request specific MIG profiles in their workload definitions.
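A first-order admission check for MIG requests follows directly from the profile names, which encode their compute-slice count (e.g. 3g.40gb uses 3 of the 7 slices on an H100). The sketch below checks total slice capacity only; it deliberately ignores MIG placement constraints, which the NVIDIA driver and device plugin enforce and which can reject some combinations that fit by slice count alone.

```python
SLICES_PER_GPU = 7  # H100/A100 expose 7 compute slices

def fits_on_gpu(profiles: list[str]) -> bool:
    """Simplified capacity check: do the requested profiles' compute
    slices fit within one GPU's 7 slices? (Ignores placement rules.)"""
    used = sum(int(p.split("g.")[0]) for p in profiles)
    return used <= SLICES_PER_GPU
```

For example, two 3g.40gb instances plus one 1g.10gb instance fill a GPU exactly, while two 4g.40gb instances can never share one GPU.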

MIG-Backed vGPU

For the highest isolation guarantees in VM environments, Bridge supports MIG-backed vGPU. In this model:

  • MIG slices provide hardware-level spatial partitioning between tenants.
  • vGPU provides temporal partitioning within a MIG slice for multiple VMs.
  • Isolation between MIG slices eliminates contention and scheduling latency between tenants.

MIG-backed vGPU requires NVIDIA vGPU software licenses.

GPU Observability

Bridge deploys DCGM (Data Center GPU Manager) on each server to collect GPU health and performance metrics. DCGM metrics are exported via an OTEL pipeline to the Bridge observability stack and displayed on the admin dashboard.

Key metrics collected:

| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | GPU memory bandwidth utilization (%) |
| DCGM_FI_DEV_FB_USED | GPU framebuffer (memory) used |
| DCGM_FI_DEV_POWER_USAGE | Current power draw (W) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Double-bit ECC errors (hardware fault indicator) |

Bridge uses NVML (NVIDIA Management Library) for programmatic access to GPU state beyond DCGM metrics, including NVLink bandwidth per GPU on HGX systems.
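A health rule over the DCGM fields listed above might look like the following. The rule itself and the temperature threshold are illustrative examples, not Bridge's actual alerting policy; double-bit ECC errors, however, are genuinely treated as hard faults in practice.

```python
def gpu_alerts(sample: dict[str, float], temp_limit_c: float = 85.0) -> list[str]:
    """Evaluate one GPU's metric sample against simple health rules.
    Returns a list of alert names (empty list means healthy)."""
    alerts: list[str] = []
    # Any volatile double-bit ECC error indicates a hardware fault.
    if sample.get("DCGM_FI_DEV_ECC_DBE_VOL_TOTAL", 0) > 0:
        alerts.append("ecc-double-bit-error")
    # Example thermal threshold; real limits depend on the GPU SKU.
    if sample.get("DCGM_FI_DEV_GPU_TEMP", 0) > temp_limit_c:
        alerts.append("over-temperature")
    return alerts
```

Rules like this can run on the exported metric stream, so a faulted GPU is flagged on the admin dashboard before tenants notice degraded workloads.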