
InfiniBand Overview

Bridge supports InfiniBand (IB) switch fabrics for high-performance East-West GPU-to-GPU networking. InfiniBand integration is managed through the NVIDIA Unified Fabric Manager (UFM), which Bridge uses to discover the IB topology and enforce per-tenant network isolation via Partition Keys (PKEYs).

Network Isolation Model

InfiniBand uses a different isolation mechanism than Ethernet:

| Network Type | Isolation Mechanism | Control Plane |
| --- | --- | --- |
| Ethernet | VxLAN + VRF with BGP EVPN | NVIDIA Cumulus Linux (NVUE) |
| InfiniBand | Partition Key (PKEY) | NVIDIA UFM |

Compute Network Isolation

Each tenant is assigned a unique PKEY, which acts as an isolated virtual network within the InfiniBand fabric. Only compute nodes whose network interface GUIDs (Globally Unique Identifiers) are registered to the tenant's PKEY can communicate within that tenant's IB network.
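
The membership rule described here can be sketched as a small in-memory model. This is only an illustration, not Bridge or UFM code: the `PartitionTable` class and its method names are hypothetical, the GUIDs are made up, and full partition membership is assumed for every registered GUID. The PKEY range check reflects the 15-bit partition number, with 0x7FFF reserved as the default partition.

```python
from collections import defaultdict

DEFAULT_PKEY = 0x7FFF  # default partition number; not used for tenants in this sketch


class PartitionTable:
    """Hypothetical in-memory model of per-tenant PKEY membership."""

    def __init__(self):
        # Maps a PKEY to the set of interface GUIDs registered to it.
        self._members = defaultdict(set)

    def register(self, pkey: int, guid: str) -> None:
        """Register an interface GUID to a tenant PKEY."""
        if not 0x0001 <= pkey < DEFAULT_PKEY:
            raise ValueError(f"PKEY {pkey:#06x} is outside the tenant range")
        self._members[pkey].add(guid.lower())

    def can_communicate(self, guid_a: str, guid_b: str) -> bool:
        """Two interfaces can talk only if they share at least one PKEY."""
        a, b = guid_a.lower(), guid_b.lower()
        return any(a in m and b in m for m in self._members.values())


table = PartitionTable()
table.register(0x0A12, "0002c903000e0b72")  # tenant A, node 1
table.register(0x0A12, "0002c903000e0b73")  # tenant A, node 2
table.register(0x0B34, "0002c903000e0c11")  # tenant B, node 1

same_tenant = table.can_communicate("0002c903000e0b72", "0002c903000e0b73")
cross_tenant = table.can_communicate("0002c903000e0b72", "0002c903000e0c11")
```

In this model `same_tenant` is true and `cross_tenant` is false, which mirrors the isolation guarantee: a GUID that is not registered to a tenant's PKEY has no path into that tenant's IB network.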

Converged Network Mapping

The tenant's PKEY is mapped to a corresponding tenant-specific Converged VLAN ID or VRF, enabling interoperability between the InfiniBand compute network and the Ethernet Converged Network (which carries storage, in-band management, and external connectivity traffic).
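
One way to picture this mapping is as a per-tenant record that ties the PKEY to its converged-network counterpart. The sketch below is an assumption about shape only: `TenantNetworkMapping`, `build_lookup`, the tenant names, and the VLAN and VRF values are all hypothetical, and Bridge's actual data model may differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class TenantNetworkMapping:
    """Hypothetical record linking a tenant PKEY to its converged network."""

    tenant: str
    pkey: int                      # InfiniBand partition key (compute fabric)
    vlan_id: Optional[int] = None  # converged VLAN ID, if VLAN-based
    vrf: Optional[str] = None      # converged VRF name, if VRF-based

    def __post_init__(self):
        if not 0x0001 <= self.pkey <= 0x7FFE:
            raise ValueError(f"invalid tenant PKEY {self.pkey:#06x}")
        if self.vlan_id is None and self.vrf is None:
            raise ValueError("a mapping needs a VLAN ID or a VRF")
        if self.vlan_id is not None and not 1 <= self.vlan_id <= 4094:
            raise ValueError(f"invalid VLAN ID {self.vlan_id}")


def build_lookup(mappings: Iterable[TenantNetworkMapping]) -> Dict[int, TenantNetworkMapping]:
    """Index mappings by PKEY, rejecting a PKEY shared by two tenants."""
    by_pkey: Dict[int, TenantNetworkMapping] = {}
    for m in mappings:
        if m.pkey in by_pkey:
            raise ValueError(f"PKEY {m.pkey:#06x} is already mapped")
        by_pkey[m.pkey] = m
    return by_pkey


lookup = build_lookup([
    TenantNetworkMapping("tenant-a", pkey=0x0A12, vlan_id=1210),
    TenantNetworkMapping("tenant-b", pkey=0x0B34, vrf="vrf-tenant-b"),
])
```

Keeping the lookup keyed by PKEY makes the one-to-one relationship explicit: each PKEY resolves to exactly one tenant and one converged VLAN ID or VRF.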

Bridge Integration with UFM

Bridge integrates with UFM to automate the following IB network management tasks:

| Operation | How Bridge Uses UFM |
| --- | --- |
| Fabric discovery | UFM discovers all IB switches, compute nodes, and links during Day 0 topology discovery |
| Per-tenant PKEY creation | When a tenant is created in Bridge, UFM dynamically provisions a unique PKEY for that tenant |
| Compute allocation | When a compute node is allocated to a tenant, Bridge registers the node's IB interface GUIDs to the tenant's PKEY |
| Topology validation | Bridge triggers UFM topology validation to verify the discovered fabric matches the intended design |
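
The PKEY creation and compute allocation rows above can be sketched as a two-step workflow. Everything here is illustrative: `InMemoryFabricManager` stands in for UFM and does not reflect its real API, `TenantIbProvisioner` is a hypothetical controller, and the sketch assumes the caller picks the next free PKEY from a range, which may not match how Bridge and UFM actually divide that responsibility.

```python
class InMemoryFabricManager:
    """Stand-in for UFM (hypothetical); records PKEYs and their GUIDs."""

    def __init__(self):
        self.pkeys = {}  # pkey -> set of registered GUIDs

    def create_pkey(self, pkey):
        if pkey in self.pkeys:
            raise ValueError(f"PKEY {pkey:#06x} already exists")
        self.pkeys[pkey] = set()

    def add_guids(self, pkey, guids):
        self.pkeys[pkey].update(guids)


class TenantIbProvisioner:
    """Hypothetical controller for tenant creation and node allocation."""

    def __init__(self, fabric, first_pkey=0x0001, last_pkey=0x7FFE):
        self.fabric = fabric
        self._next = first_pkey
        self._last = last_pkey
        self.tenant_pkey = {}  # tenant name -> assigned PKEY

    def create_tenant(self, tenant):
        """Assign a unique PKEY to a new tenant; repeat calls are no-ops."""
        if tenant in self.tenant_pkey:
            return self.tenant_pkey[tenant]
        if self._next > self._last:
            raise RuntimeError("PKEY range exhausted")
        pkey = self._next
        self._next += 1
        self.fabric.create_pkey(pkey)
        self.tenant_pkey[tenant] = pkey
        return pkey

    def allocate_node(self, tenant, guids):
        """Register a node's IB interface GUIDs to the tenant's PKEY."""
        pkey = self.tenant_pkey[tenant]  # raises KeyError for unknown tenants
        self.fabric.add_guids(pkey, guids)
        return pkey


fabric = InMemoryFabricManager()
provisioner = TenantIbProvisioner(fabric, first_pkey=0x0A00)
pkey_a = provisioner.create_tenant("tenant-a")
pkey_b = provisioner.create_tenant("tenant-b")
provisioner.allocate_node("tenant-a", ["0002c903000e0b72", "0002c903000e0b73"])
```

After these calls, the two tenants hold distinct PKEYs and only tenant-a's PKEY contains the allocated node's GUIDs. Making `create_tenant` idempotent is a design choice for the sketch, so that a retried tenant creation does not consume a second PKEY.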

Supported Topologies

Bridge supports multiple InfiniBand topology configurations:

  • Compute topology — Dedicated IB fabric for East-West GPU compute traffic.
  • Converged topology — Combined fabric carrying storage, in-band management, and external traffic alongside compute traffic.
  • Dedicated storage IB — Separate IB topology for storage, with the converged network handling in-band management and external access.

Security and Performance Roadmap

Bridge plans to extend IB management with the following capabilities:

| Feature | Description |
| --- | --- |
| Mkey / AMKey / VSkey | IB security key management for additional fabric access control |
| Rail validation | ibdiagnet --rail_validation integration for validating rail-optimized IB topologies |
| Congestion control diagnostics | ibdiagnet --congestion_control and congestion counter tooling |
  • UFM — UFM onboarding, topology discovery, and PKEY management
  • Networking Overview — Full tenant network isolation architecture