NVIDIA GPU

Bridge manages NVIDIA GPUs across PCIe and HGX form factors, providing hardware-enforced tenant isolation, full NVLink performance on HGX systems, and automated post-provisioning of the GPU software stack.

Supported Platforms

| Platform | GPU Count | Interconnect | NVSwitch | MIG Support |
|---|---|---|---|---|
| H100 PCIe | 1–8 per server | PCIe | No | Yes |
| A100 PCIe | 1–8 per server | PCIe | No | Yes |
| HGX H100 | 8 | NVLink + NVSwitch | Yes | Yes |
| HGX H200 | 8 | NVLink + NVSwitch | Yes | Yes |
| HGX B200 | 8 | NVLink + NVSwitch | Yes | Yes |
| GB200 NVL72 | 72 (36 Grace Blackwell superchips) | NVLink + NVSwitch | Yes | Yes |
| GH200 | 1 per module | NVLink | Optional | Yes |

GPU Discovery and Catalog

During Redfish discovery, Bridge queries each server for PCIe device details and identifies NVIDIA GPUs by PCI vendor and device ID. The following attributes are recorded in the Bridge catalog for each server:

  • GPU model and count
  • GPU memory per device
  • NVSwitch presence (HGX systems)
  • NVLink capability
  • MIG support

These attributes are used to create GPU flavor templates in the Bridge catalog, which tenants select when requesting compute allocation.
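The vendor-ID filter at the heart of this discovery step can be sketched as follows. This is a simplified illustration, not Bridge's actual implementation: the field names follow the Redfish PCIeFunction schema (`VendorId`, `DeviceId`), the device ID shown is an example, and the catalog record shape is hypothetical.

```python
# NVIDIA's PCI vendor ID; any PCIe function reporting it is a
# candidate GPU (or other NVIDIA device) for the catalog.
NVIDIA_VENDOR_ID = "0x10de"

def catalog_gpus(pcie_functions: list[dict]) -> dict:
    """Summarize the NVIDIA devices found in a server's PCIe functions."""
    gpus = [f for f in pcie_functions
            if f.get("VendorId", "").lower() == NVIDIA_VENDOR_ID]
    return {
        "gpu_count": len(gpus),
        "device_ids": sorted({f["DeviceId"] for f in gpus}),
    }

# Example Redfish-style payload for one server (device IDs illustrative).
server = [
    {"VendorId": "0x10DE", "DeviceId": "0x2330"},  # GPU
    {"VendorId": "0x10DE", "DeviceId": "0x2330"},  # GPU
    {"VendorId": "0x8086", "DeviceId": "0x1234"},  # non-GPU device
]
entry = catalog_gpus(server)
```

The resulting summary (`gpu_count`, distinct `device_ids`) is the kind of attribute set that feeds the GPU flavor templates described above.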

GPU Isolation

PCIe Servers (H100 PCIe / A100 PCIe)

On servers without NVSwitch, Bridge enforces GPU isolation using IOMMU passthrough:

  1. Bridge enables the IOMMU on the host kernel command line (intel_iommu=on on Intel hosts; AMD IOMMUs are enabled by default).
  2. Each GPU's IOMMU group is identified and the device is bound to the vfio-pci driver.
  3. At compute allocation, Bridge passes the GPU directly through to the tenant's VM or bare metal partition.

IOMMU passthrough ensures that a GPU assigned to one tenant cannot access host memory or memory belonging to another tenant's GPU.
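Step 2 above reduces to a short sequence of sysfs writes on Linux. The dry-run sketch below emits that sequence for a given GPU; the sysfs paths are standard Linux VFIO paths, but the function itself is illustrative, and the PCI address and vendor:device pair in the example are placeholders.

```python
def vfio_bind_plan(pci_addr: str, vendor_device: str) -> list[tuple[str, str]]:
    """Return the (sysfs_path, value) writes that move a PCI device
    from its current driver to vfio-pci. Dry run only: nothing is written."""
    return [
        # 1. Detach the device from whichever driver currently owns it.
        (f"/sys/bus/pci/devices/{pci_addr}/driver/unbind", pci_addr),
        # 2. Tell vfio-pci to accept this vendor:device ID.
        ("/sys/bus/pci/drivers/vfio-pci/new_id", vendor_device),
        # 3. Bind the device to vfio-pci.
        ("/sys/bus/pci/drivers/vfio-pci/bind", pci_addr),
    ]

# Example: a GPU at a placeholder PCI address with a placeholder ID pair.
plan = vfio_bind_plan("0000:17:00.0", "10de 2330")
```

Once bound, the device is exposed to the hypervisor (or bare metal tenant) only through its IOMMU group, which is what enforces the memory isolation described above.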

HGX Servers (NVSwitch Systems)

On HGX systems, Bridge enforces isolation at two hardware levels simultaneously:

| Level | Mechanism |
|---|---|
| Host level | IOMMU device passthrough: GPU mapped exclusively to the tenant's compute |
| Fabric level | NVLink Secure Partition: NVLink routing blocked between tenant GPU groups |

Bridge integrates with NVIDIA Fabric Manager to configure NVLink Secure Partitions. When a tenant's GPUs are allocated, Bridge programs the NVSwitch routing tables to create a private NVLink domain containing only the tenant's GPUs. Within the partition, full NVLink bandwidth is available. Across partitions, all NVLink traffic is blocked at the hardware level.
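The invariant behind these partitions is simple: a partition is a set of GPUs, and no GPU may be routed into two partitions. The sketch below models that invariant only; it does not reflect the Fabric Manager API, which actually programs the NVSwitch routing tables.

```python
def validate_partitions(partitions: dict[str, set[int]]) -> bool:
    """Check that tenant GPU sets are pairwise disjoint, i.e. that no
    GPU would be reachable from two NVLink partitions."""
    seen: set[int] = set()
    for tenant, gpus in partitions.items():
        if gpus & seen:
            # A GPU appears in two partitions: cross-tenant NVLink
            # reachability, which the fabric must never allow.
            return False
        seen |= gpus
    return True

# Two tenants splitting an 8-GPU HGX system 4/4.
ok = validate_partitions({"tenant-a": {0, 1, 2, 3}, "tenant-b": {4, 5, 6, 7}})
```

Within each validated set, NVSwitch routing provides full-bandwidth NVLink; between sets, no routes exist at all, which is stronger than software-level access control.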

See NVSwitch for the full isolation architecture.

GPU Post-Provisioning

After OS deployment, Bridge's post-provisioning controller installs the NVIDIA GPU software stack:

| Component | Notes |
|---|---|
| CUDA toolkit | Version managed per server flavor in the Bridge catalog |
| cuDNN | Installed alongside the CUDA toolkit |
| NCCL | Enables high-performance collective communication over NVLink and RDMA |
| MOFED (Mellanox OFED) | Required for GPUDirect RDMA over RoCE or InfiniBand |
| NVIDIA kernel driver | Loaded as kernel modules (nvidia, nvidia-uvm, nvidia-drm) |
| DCGM | Deployed as a systemd service for GPU health monitoring |

For VM deployments, Bridge additionally configures PCIe GPU passthrough and ensures a 1:1 mapping between GPUs and compute NICs.
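One reasonable way to realize that 1:1 mapping is to pair each GPU with a compute NIC on the same NUMA node, so GPUDirect RDMA traffic stays local to the socket. The sketch below assumes that policy and equal GPU/NIC counts; the device records and PCI addresses are illustrative, not Bridge's actual data model.

```python
def pair_gpus_to_nics(gpus: list[dict], nics: list[dict]) -> dict[str, str]:
    """Map each GPU's PCI address to a NIC's PCI address, preferring a
    NIC on the same NUMA node. Assumes len(gpus) == len(nics)."""
    pairs: dict[str, str] = {}
    free = list(nics)
    for gpu in gpus:
        # Same-NUMA NIC if available, otherwise any remaining NIC.
        match = next((n for n in free if n["numa"] == gpu["numa"]), free[0])
        free.remove(match)
        pairs[gpu["addr"]] = match["addr"]
    return pairs

# Two GPUs and two NICs split across NUMA nodes (placeholder addresses).
gpus = [{"addr": "0000:17:00.0", "numa": 0}, {"addr": "0000:b3:00.0", "numa": 1}]
nics = [{"addr": "0000:b5:00.0", "numa": 1}, {"addr": "0000:19:00.0", "numa": 0}]
mapping = pair_gpus_to_nics(gpus, nics)
```

Pinning the pairing at provisioning time keeps the GPU-to-NIC path deterministic for tenants running NCCL over RDMA.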

MIG Partitioning

Bridge supports NVIDIA Multi-Instance GPU (MIG) on H100 and A100 GPUs. MIG divides a single GPU into isolated instances with dedicated compute slices, L2 cache, and memory bandwidth — providing hardware-level QoS for multi-tenant workloads.

| MIG Profile | Compute Slices | Memory (H100 80GB) |
|---|---|---|
| 1g.10gb | 1 | 10 GB |
| 2g.20gb | 2 | 20 GB |
| 3g.40gb | 3 | 40 GB |
| 4g.40gb | 4 | 40 GB |
| 7g.80gb | 7 (full GPU) | 80 GB |

Bridge prepares each server for MIG as part of compute post-provisioning. The NVIDIA device plugin for Kubernetes manages MIG instance scheduling at the pod level, enabling tenants to request specific MIG profiles in their workload definitions.
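A first-order admission check for MIG requests follows directly from the profile names, which encode their compute-slice count (e.g. 3g.40gb uses 3 of the 7 slices on an H100). The sketch below checks total slice capacity only; it deliberately ignores MIG placement constraints, which the NVIDIA driver and device plugin enforce and which can reject some combinations that fit by slice count alone.

```python
SLICES_PER_GPU = 7  # H100/A100 expose 7 compute slices

def fits_on_gpu(profiles: list[str]) -> bool:
    """Simplified capacity check: do the requested profiles' compute
    slices fit within one GPU's 7 slices? (Ignores placement rules.)"""
    used = sum(int(p.split("g.")[0]) for p in profiles)
    return used <= SLICES_PER_GPU
```

For example, two 3g.40gb instances plus one 1g.10gb instance fill a GPU exactly, while two 4g.40gb instances can never share one GPU.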

MIG-Backed vGPU

For the highest isolation guarantees in VM environments, Bridge supports MIG-backed vGPU. In this model:

  • MIG slices provide hardware-level spatial partitioning between tenants.
  • vGPU provides temporal partitioning within a MIG slice for multiple VMs.
  • Isolation between MIG slices eliminates contention and scheduling latency between tenants.

MIG-backed vGPU requires NVIDIA vGPU software licenses.

GPU Observability

Bridge deploys DCGM (Data Center GPU Manager) on each server to collect GPU health and performance metrics. DCGM metrics are exported via an OTEL pipeline to the Bridge observability stack and displayed on the admin dashboard.

Key metrics collected:

| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | GPU memory bandwidth utilization (%) |
| DCGM_FI_DEV_FB_USED | GPU framebuffer (memory) used |
| DCGM_FI_DEV_POWER_USAGE | Current power draw (W) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Double-bit ECC errors (hardware fault indicator) |

Bridge uses NVML (NVIDIA Management Library) for programmatic access to GPU state beyond DCGM metrics, including NVLink bandwidth per GPU on HGX systems.
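A health rule over the DCGM fields listed above might look like the following. The rule itself and the temperature threshold are illustrative examples, not Bridge's actual alerting policy; double-bit ECC errors, however, are genuinely treated as hard faults in practice.

```python
def gpu_alerts(sample: dict[str, float], temp_limit_c: float = 85.0) -> list[str]:
    """Evaluate one GPU's metric sample against simple health rules.
    Returns a list of alert names (empty list means healthy)."""
    alerts: list[str] = []
    # Any volatile double-bit ECC error indicates a hardware fault.
    if sample.get("DCGM_FI_DEV_ECC_DBE_VOL_TOTAL", 0) > 0:
        alerts.append("ecc-double-bit-error")
    # Example thermal threshold; real limits depend on the GPU SKU.
    if sample.get("DCGM_FI_DEV_GPU_TEMP", 0) > temp_limit_c:
        alerts.append("over-temperature")
    return alerts
```

Rules like this can run on the exported metric stream, so a faulted GPU is flagged on the admin dashboard before tenants notice degraded workloads.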