InfiniBand Overview
Bridge supports InfiniBand (IB) switch fabrics for high-performance East-West GPU-to-GPU networking. InfiniBand integration is managed through the NVIDIA Unified Fabric Manager (UFM), which Bridge uses to discover the IB topology and enforce per-tenant network isolation via Partition Keys (PKEYs).
Network Isolation Model
InfiniBand uses a different isolation mechanism than Ethernet:
| Network Type | Isolation Mechanism | Control Plane |
|---|---|---|
| Ethernet | VxLAN + VRF with BGP EVPN | NVIDIA Cumulus Linux (NVUE) |
| InfiniBand | Partition Key (PKEY) | NVIDIA UFM |
Compute Network Isolation
Each tenant is assigned a unique PKEY, which acts as an isolated virtual network within the InfiniBand fabric. Only compute nodes whose network interface GUIDs (Globally Unique Identifiers) are registered to the tenant's PKEY can communicate within that tenant's IB network.
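The isolation rule above can be illustrated with a small in-memory model. This is a minimal sketch, not Bridge or UFM code: the `Partition` class, the `can_communicate` helper, and the sample PKEY and GUID values are all hypothetical, and the model deliberately ignores details such as full versus limited partition membership.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Partition:
    """One tenant partition: a PKEY plus the interface GUIDs registered to it."""
    pkey: int
    guids: set[str] = field(default_factory=set)


def can_communicate(partitions: list[Partition], guid_a: str, guid_b: str) -> bool:
    """Two interfaces can talk only if some partition contains both GUIDs."""
    return any(guid_a in p.guids and guid_b in p.guids for p in partitions)


# Hypothetical tenants and GUIDs, for illustration only.
tenant_a = Partition(pkey=0x0A01, guids={"0002c903000e0b72", "0002c903000e0b73"})
tenant_b = Partition(pkey=0x0A02, guids={"0002c903000e0c10"})
fabric = [tenant_a, tenant_b]

same_tenant = can_communicate(fabric, "0002c903000e0b72", "0002c903000e0b73")
cross_tenant = can_communicate(fabric, "0002c903000e0b72", "0002c903000e0c10")
```

Here `same_tenant` is true because both GUIDs are registered to tenant A's PKEY, while `cross_tenant` is false because no partition contains both interfaces.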
Converged Network Mapping
The tenant's PKEY is mapped to a corresponding tenant-specific Converged VLAN ID or VRF, enabling interoperability between the InfiniBand compute network and the Ethernet Converged Network (which carries storage, in-band management, and external connectivity traffic).
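One way to picture this mapping is a per-tenant record that ties the InfiniBand PKEY to its Ethernet counterparts. The sketch below is an assumption about how such a table could be modeled; `TenantNetwork`, `build_index`, and all sample values are hypothetical and do not reflect Bridge's actual data model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TenantNetwork:
    tenant: str
    pkey: int      # InfiniBand partition key (compute network)
    vlan_id: int   # tenant VLAN on the Ethernet Converged Network
    vrf: str       # tenant VRF on the Ethernet Converged Network


def build_index(networks: list[TenantNetwork]) -> dict[int, TenantNetwork]:
    """Index tenant networks by PKEY, rejecting reused PKEYs or VLAN IDs."""
    by_pkey: dict[int, TenantNetwork] = {}
    seen_vlans: set[int] = set()
    for net in networks:
        if net.pkey in by_pkey:
            raise ValueError(f"PKEY {net.pkey:#06x} is assigned to more than one tenant")
        if net.vlan_id in seen_vlans:
            raise ValueError(f"VLAN {net.vlan_id} is assigned to more than one tenant")
        by_pkey[net.pkey] = net
        seen_vlans.add(net.vlan_id)
    return by_pkey


# Hypothetical mapping, for illustration only.
index = build_index([
    TenantNetwork("tenant-a", pkey=0x0A01, vlan_id=1101, vrf="vrf-tenant-a"),
    TenantNetwork("tenant-b", pkey=0x0A02, vlan_id=1102, vrf="vrf-tenant-b"),
])
converged = index[0x0A01]  # Ethernet-side identifiers for tenant A's PKEY
```

Rejecting duplicates at build time keeps the one-to-one relationship between a tenant's PKEY and its converged network identifiers explicit.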
Bridge Integration with UFM
Bridge uses UFM to automate the following IB network management tasks:
| Operation | How Bridge Uses UFM |
|---|---|
| Fabric discovery | Bridge relies on UFM to discover all IB switches, compute nodes, and links during Day 0 topology discovery |
| Per-tenant PKEY creation | When a tenant is created in Bridge, UFM dynamically provisions a unique PKEY for that tenant |
| Compute allocation | When a compute node is allocated to a tenant, Bridge registers the node's IB interface GUIDs to the tenant's PKEY |
| Topology validation | Bridge triggers UFM topology validation to verify the discovered fabric matches the intended design |
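The tenant-creation and compute-allocation rows above can be sketched as a small workflow. Everything here is hypothetical: the `UfmClient` interface, the `RecordingUfm` stand-in, and `TenantManager` are illustrative names, not Bridge or UFM APIs. The sketch also assumes PKEYs are handed out sequentially; the text only says each tenant receives a unique PKEY and does not describe how the value is chosen.

```python
from __future__ import annotations

from typing import Protocol


class UfmClient(Protocol):
    """Hypothetical interface for the two UFM operations used below."""
    def create_pkey(self, pkey: int) -> None: ...
    def add_guids(self, pkey: int, guids: list[str]) -> None: ...


class RecordingUfm:
    """In-memory stand-in that records calls instead of contacting UFM."""
    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def create_pkey(self, pkey: int) -> None:
        self.calls.append(("create_pkey", pkey))

    def add_guids(self, pkey: int, guids: list[str]) -> None:
        self.calls.append(("add_guids", pkey, tuple(guids)))


class TenantManager:
    # A PKEY carries a 15-bit key value; 0x7FFF is the default partition,
    # so this sketch allocates from 0x0001 through 0x7FFE.
    FIRST_PKEY = 0x0001
    LAST_PKEY = 0x7FFE

    def __init__(self, ufm: UfmClient) -> None:
        self.ufm = ufm
        self.pkeys: dict[str, int] = {}
        self._next = self.FIRST_PKEY

    def create_tenant(self, tenant: str) -> int:
        """Assign the tenant a unique PKEY and provision it through UFM."""
        if tenant in self.pkeys:
            raise ValueError(f"tenant {tenant!r} already exists")
        if self._next > self.LAST_PKEY:
            raise RuntimeError("no free PKEYs left")
        pkey = self._next
        self._next += 1
        self.ufm.create_pkey(pkey)
        self.pkeys[tenant] = pkey
        return pkey

    def allocate_node(self, tenant: str, guids: list[str]) -> None:
        """Register a node's IB interface GUIDs to the tenant's PKEY."""
        self.ufm.add_guids(self.pkeys[tenant], guids)
```

Keeping the UFM calls behind a narrow interface makes the workflow testable without a live fabric; a real client would replace `RecordingUfm` with calls to UFM.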
Supported Topologies
Bridge supports multiple InfiniBand topology configurations:
- Compute topology — Dedicated IB fabric for East-West GPU compute traffic.
- Converged topology — Combined fabric carrying storage, in-band management, and external traffic alongside compute traffic.
- Dedicated storage IB — Separate IB topology for storage, with the converged network handling in-band management and external access.
Security and Performance Roadmap
Bridge plans to extend IB management with the following capabilities:
| Feature | Description |
|---|---|
| M_Key / AM_Key / VS_Key | IB security key management for additional fabric access control |
| Rail validation | `ibdiagnet --rail_validation` integration for validating rail-optimized IB topologies |
| Congestion control diagnostics | `ibdiagnet --congestion_control` and congestion counter tooling |
Related Pages
- UFM — UFM onboarding, topology discovery, and PKEY management
- Networking Overview — Full tenant network isolation architecture