
Bridge GPU CMS Announces Network Automation and Multi-Tenancy for NVIDIA Spectrum-X

5 min read

Sriram Rupanagunta, Engineering
Sandeep Sharma, Engineering

The latest release of Bridge GPU CMS adds network automation, observability, fault management, and multi-tenancy for the v1.3 NVIDIA Spectrum-X Reference Architecture (RA). The RA defines an East-West compute network fabric optimized for AI cloud deployments with HGX systems, and a North-South converged network for external access, storage, and control-plane traffic.

As part of this release, Bridge supports NVIDIA Spectrum-4 SN5000 Series Ethernet switches, NVIDIA Cumulus Linux, NVIDIA BlueField-3 SuperNICs and DPUs, the NVIDIA NetQ AI observability and telemetry platform, and the NVIDIA Air data center digital twin platform, along with NVIDIA HGX H100/H200 nodes.

Multi-Tenancy Across Seven Pillars

NVIDIA Cloud Partner (NCP) and enterprise AI clouds must provide hard isolation between tenants. Bridge GPU CMS addresses this comprehensively across seven pillars:

  1. High-performance networking
  2. InfiniBand fabrics
  3. NVLink GPU interconnects
  4. Scalable storage
  5. Virtual private clouds (VPCs)
  6. Compute resources
  7. GPUs

By enforcing isolation and performance guarantees in each pillar, we ensure tenants can run demanding AI workloads securely on shared hardware without compromising performance.

Switch Fabric Automation

A modern GPU data center switch fabric typically comprises multiple segmented networks, most notably the East-West (Compute) network and the North-South (Converged) network. The North-South network itself includes in-band management, storage, and external/tenant access.

A single Scalable Unit (SU)—as defined by NVIDIA Spectrum-X Reference Architectures—contains:

  • 32 GPU nodes
  • 12 switches
  • 256 physical cable connections

With eight GPUs per HGX node, this topology serves just 256 GPUs (32 nodes × 8 GPUs), which highlights the operational complexity per unit of capacity. Without automation, managing such fabrics, especially across multiple SUs, becomes impractical and error-prone.

Bridge CMS addresses this challenge with the capabilities below (a minimal discovery sketch follows the list):

  • Topology auto-discovery
  • Underlay configuration
  • Lifecycle automation for switch configurations
  • Overlay network creation for tenant-level isolation
  • Integration with compute orchestration pipelines
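
To make the discovery step concrete, here is a minimal Python sketch of how auto-discovered LLDP neighbor data can be checked against an intended cabling plan. The record format and function names are illustrative assumptions, not Bridge CMS APIs; in practice the neighbor data would come from NetQ or NVUE.

```python
# Sketch: validating a discovered fabric against an intended cabling plan.
# The input format and function names are illustrative, not Bridge CMS APIs.

from collections import defaultdict

def build_topology(lldp_records):
    """Build an adjacency map from (device, port, neighbor, neighbor_port) tuples."""
    adjacency = defaultdict(dict)
    for device, port, neighbor, neighbor_port in lldp_records:
        adjacency[device][port] = (neighbor, neighbor_port)
    return adjacency

def find_miscabling(discovered, intended):
    """Return ports whose discovered neighbor differs from the cabling plan."""
    issues = []
    for device, ports in intended.items():
        for port, expected in ports.items():
            actual = discovered.get(device, {}).get(port)
            if actual != expected:
                issues.append((device, port, expected, actual))
    return issues

# Example: one leaf-to-spine link cabled to the wrong spine port.
intended = {"leaf01": {"swp31": ("spine01", "swp1")}}
discovered = build_topology([("leaf01", "swp31", "spine01", "swp2")])
for device, port, want, got in find_miscabling(discovered, intended):
    print(f"{device}:{port} expected {want}, found {got}")
```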

BlueField-3 SuperNIC Spectrum-X Configuration

Bridge GPU CMS automates all the tasks needed to turn a BlueField-3-equipped server into a Spectrum-X host. When a new host is provisioned, the GPU CMS:

  1. Installs the DOCA Host Packages, enabling key services and libraries
  2. Brings up the DOCA Management Service Daemon (DMSD)
  3. Configures the BlueField-3 SuperNIC with Spectrum-X capabilities like RoCE, congestion control, adaptive routing, and IP routing

This makes the host fully Spectrum-X aware, ready to participate in high-performance GPU-to-GPU networking, where multi-tenancy relies on BGP EVPN running on Cumulus Linux and the Spectrum-X switches.
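
As a rough illustration of that flow, the sketch below strings the three steps together. The package name, service unit, and helper script are assumptions for the sketch, not documented DOCA or Bridge CMS interfaces.

```python
# Illustrative provisioning flow for a new Spectrum-X host. Package, service,
# and command names are assumptions, not documented DOCA/Bridge CMS interfaces.

import subprocess

def run(cmd: list[str]) -> None:
    """Run a provisioning step, failing loudly so orchestration can retry."""
    print(f"==> {' '.join(cmd)}")
    subprocess.run(cmd, check=True)

def provision_spectrum_x_host() -> None:
    # 1. Install the DOCA Host Packages (package name assumed).
    run(["apt-get", "install", "-y", "doca-host"])
    # 2. Bring up the DOCA Management Service Daemon (unit name assumed).
    run(["systemctl", "enable", "--now", "dmsd"])
    # 3. Apply Spectrum-X settings (RoCE, congestion control, adaptive
    #    routing, IP routing) via a hypothetical site-specific helper.
    run(["/opt/bridge/bin/configure-supernic", "--profile", "spectrum-x"])

if __name__ == "__main__":
    provision_spectrum_x_host()
```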

Network Multi-Tenancy

Bridge GPU CMS creates tenant-isolated overlay networks on the Ethernet switch fabric using VxLAN and VRFs, with BGP EVPN as the control plane. From the end-user perspective, the GPU CMS provides the ability to define Virtual Private Clouds (VPCs) with multiple subnets, much like a traditional public cloud.

Behind the scenes, each VPC is backed by an isolated VRF (Virtual Routing and Forwarding) and each subnet corresponds to a unique VxLAN segment.
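
The mapping can be pictured with a small data model: one VRF and one L3 VNI per VPC, and one L2 VNI per subnet. The numbering ranges and names below are illustrative, not Bridge CMS internals.

```python
# Sketch of the VPC-to-overlay mapping: one VRF + L3 VNI per VPC,
# one L2 VNI per subnet. Ranges and naming are illustrative only.

from dataclasses import dataclass, field
from itertools import count

_l2_vni = count(10_000)   # per-subnet VxLAN segments (assumed range)
_l3_vni = count(100_000)  # per-VPC routing VNIs (assumed range)

@dataclass
class Subnet:
    cidr: str
    vni: int = field(default_factory=lambda: next(_l2_vni))

@dataclass
class Vpc:
    name: str
    vrf: str = ""
    l3_vni: int = field(default_factory=lambda: next(_l3_vni))
    subnets: list[Subnet] = field(default_factory=list)

    def __post_init__(self):
        self.vrf = self.vrf or f"vrf-{self.name}"

    def add_subnet(self, cidr: str) -> Subnet:
        subnet = Subnet(cidr)
        self.subnets.append(subnet)
        return subnet

tenant = Vpc("tenant-a")
tenant.add_subnet("10.1.0.0/24")
tenant.add_subnet("10.1.1.0/24")
print(tenant.vrf, tenant.l3_vni, [(s.cidr, s.vni) for s in tenant.subnets])
```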

This abstraction gives users the flexibility to:

  • Create isolated environments for different workloads or projects
  • Define fine-grained IP subnetting and routing policies
  • Attach load balancers or gateways at subnet edges
  • Connect VPCs to storage networks or on-prem environments

VxLAN and VRF-Based Tenant Segmentation

To support multiple tenants securely and efficiently on the same physical fabric, Bridge GPU CMS uses VxLAN (Virtual Extensible LAN) in combination with VRF (Virtual Routing and Forwarding) constructs. This ensures complete L2/L3 network isolation across tenants.

Each tenant's workloads operate within a dedicated overlay network, backed by a separate VRF instance, enabling traffic segmentation, independent routing policies, and security boundaries, without sacrificing performance.
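
On Cumulus Linux, this segmentation ultimately lands as NVUE configuration on the switches. The sketch below renders the per-tenant commands implied by the design; the VLAN/VNI numbers and names are placeholders, and a real deployment would also configure NVE source addresses and BGP EVPN peering.

```python
# Sketch: rendering the per-tenant NVUE configuration implied by the
# VRF + VxLAN design. Syntax follows Cumulus Linux NVUE conventions;
# the VNI numbers and names are placeholders.

def tenant_overlay_config(vrf: str, l3_vni: int, subnets: dict[int, int]) -> list[str]:
    """subnets maps VLAN ID -> L2 VNI."""
    cmds = [
        f"nv set vrf {vrf}",
        f"nv set vrf {vrf} evpn vni {l3_vni}",
    ]
    for vlan, vni in subnets.items():
        cmds += [
            f"nv set bridge domain br_default vlan {vlan} vni {vni}",
            f"nv set interface vlan{vlan} ip vrf {vrf}",
        ]
    cmds.append("nv config apply")
    return cmds

for cmd in tenant_overlay_config("vrf-tenant-a", 100000, {10: 10000, 11: 10001}):
    print(cmd)
```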

Observability and Fault Management

Bridge GPU CMS supports both NetQ- and OTLP-based telemetry for NVIDIA Spectrum-X switches.

Unified Observability with OTLP

Switch Telemetry via Cumulus NVUE:

  • NVIDIA Cumulus Linux on Spectrum-X switches exports telemetry via NVUE commands
  • Metrics exposed: Buffer occupancy histograms, interface-level stats (bandwidth, errors, drops), platform metrics (temperature, fan speed, power)
  • Telemetry is exported in OpenTelemetry (OTLP) format (a minimal export sketch follows)
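
A minimal sketch of such an export path using the OpenTelemetry Python SDK is shown below. The collector endpoint and the stats source are assumptions; on Cumulus Linux the counters could be pulled from NVUE's JSON output (for example, `nv show interface --output json`).

```python
# Sketch: exporting switch counters as OTLP metrics via the OpenTelemetry
# Python SDK. The endpoint and fetch_interface_stats() source are assumed.

import time

from opentelemetry.metrics import (CallbackOptions, Observation, get_meter,
                                   set_meter_provider)
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

def fetch_interface_stats():
    """Placeholder for pulling per-port counters from the switch."""
    return {"swp1": {"rx_drops": 0, "tx_bytes": 123456}}

def observe_rx_drops(options: CallbackOptions):
    for port, stats in fetch_interface_stats().items():
        yield Observation(stats["rx_drops"], {"interface": port})

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="collector:4317", insecure=True),  # assumed address
    export_interval_millis=15_000,
)
set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = get_meter("switch-telemetry")
meter.create_observable_gauge("interface.rx_drops", callbacks=[observe_rx_drops])

if __name__ == "__main__":
    while True:
        time.sleep(60)  # keep the process alive so the reader exports periodically
```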

SuperNIC Telemetry via DOCA DTS:

  • DOCA Telemetry Service (DTS) collects real-time SuperNIC metrics
  • Supports telemetry types: High-Frequency Telemetry (HFT) and Programmable Congestion Control (PCC) (a scrape sketch follows)
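
One common way to consume DTS output is over a Prometheus-style metrics endpoint. The sketch below assumes such an endpoint is enabled; the host, port, and metric names are placeholders, and the actual exports depend on the DTS configuration.

```python
# Sketch: reading SuperNIC counters from an assumed Prometheus-style DTS
# endpoint. Host, port, and metric names are placeholders for illustration.

import urllib.request

DTS_ENDPOINT = "http://dpu-host:9100/metrics"  # assumed address

def scrape(endpoint: str) -> dict[str, float]:
    """Parse a Prometheus text exposition into {metric_name: value}."""
    metrics = {}
    with urllib.request.urlopen(endpoint) as resp:
        for raw in resp.read().decode().splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            name, _, value = line.rpartition(" ")
            try:
                metrics[name] = float(value)
            except ValueError:
                continue
    return metrics

if __name__ == "__main__":
    for name, value in scrape(DTS_ENDPOINT).items():
        print(name, value)
```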

NetQ Integration

Infrastructure administrators can subscribe to NetQ events directly from the GPU CMS. When subscribed events are triggered, Bridge GPU CMS invokes its own remediation logic to auto-correct faults without manual intervention, significantly reducing operational expense (OPEX).

The CMS subscribes to the following fault scenarios from NetQ (a minimal event receiver sketch follows the list):

  • Switch Failure Detection: Power loss or hardware faults on switches are detected promptly
  • Link Failure Detection: Real-time monitoring of physical interfaces allows rapid identification of link failures
  • Configuration Drift Detection: Configuration changes are continuously audited to detect and flag drifts
  • BGP Session State Monitoring: Changes in BGP session status are actively monitored
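
As a sketch of how event-driven remediation can be wired up, the receiver below accepts forwarded NetQ events over HTTP and dispatches a handler per event type. The payload shape, port, and remediation actions are assumptions, not a documented Bridge CMS or NetQ interface.

```python
# Sketch: a receiver for forwarded NetQ events that dispatches remediation.
# The JSON payload shape, port, and remediation actions are assumptions.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

REMEDIATIONS = {
    "link_down":    lambda ev: print(f"cabling check queued for {ev.get('hostname')}"),
    "config_drift": lambda ev: print(f"re-applying golden config on {ev.get('hostname')}"),
    "bgp_down":     lambda ev: print(f"BGP session audit queued for {ev.get('hostname')}"),
}

class NetqEventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event = json.loads(body or b"{}")
        # Fall back to logging when no remediation is registered.
        action = REMEDIATIONS.get(event.get("type"), lambda ev: print(f"unhandled: {ev}"))
        action(event)
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), NetqEventHandler).serve_forever()
```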

NVIDIA Air Integration

Bridge GPU CMS integration with Spectrum-X is extensively validated on NVIDIA Air (data center digital twin), enabling realistic simulation and demonstration of features in a virtual environment.

  • Storage Integration: Supports vanilla NFS servers and is integrated with major storage vendors such as DDN and VAST
  • External Gateway Integration: Border leaf configurations are tested with F5 BIG-IP gateway

Global Customer Deployments

Our platform is already in use by NCPs in multiple regions alongside Spectrum-X. For example, a leading Southeast Asian telecom operator is deploying an AI compute grid for distributed inference, and a top U.S. NCP is building an AI-and-RAN edge platform with multi-tenancy support across multiple use cases and customers.