4 posts tagged with "multi-tenancy"

Automated InfiniBand Network Isolation with Bridge GPU CMS

· 3 min read
Raghuram Gopalshetty
Engineering

Managing network isolation in AI cloud environments is critical for ensuring tenant data security, performance consistency, and compliance. This becomes even more important in high-performance AI clusters that rely on InfiniBand fabric for ultra-low latency communication between GPU nodes.

With Bridge GPU Cloud Management Software (GPU CMS), cloud providers can achieve complete InfiniBand network isolation for every tenant—all through an automated, policy-driven process. This ensures each tenant's data and traffic are fully segregated, with no manual intervention required.
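One widely used mechanism for this kind of fabric-level segregation is InfiniBand partition keys (P-Keys), which the subnet manager enforces at the port level. Below is a minimal sketch of an OpenSM `partitions.conf` with hypothetical tenant names and port GUIDs; it illustrates the concept only and is not the actual policy format used by Bridge GPU CMS.

```
# Default partition: all ports are limited members, so they can
# reach the subnet manager but not each other's full partitions.
Default=0x7fff, ipoib: ALL=limited, SELF=full;

# Hypothetical per-tenant partitions: only the listed port GUIDs
# are full members, so traffic cannot cross tenant boundaries.
TenantA=0x8001, ipoib: 0x0002c9030012aa01=full, 0x0002c9030012aa02=full;
TenantB=0x8002, ipoib: 0x0002c9030012bb01=full, 0x0002c9030012bb02=full;
```

An automated system would regenerate fragments like this (or drive the equivalent management API) whenever a tenant is onboarded or a node is reassigned, rather than requiring an operator to edit the file by hand.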

Bridge GPU CMS Announces Network Automation and Multi-Tenancy for NVIDIA Spectrum-X

· 5 min read
Sriram Rupanagunta
Engineering
Sandeep Sharma
Engineering

The latest Bridge GPU CMS release adds network automation, observability, fault management, and multi-tenancy support for the v1.3 NVIDIA Spectrum-X Reference Architecture (RA). The RA defines an East-West compute network fabric optimized for AI cloud deployments with HGX systems, and a North-South converged network for external access, storage, and control-plane traffic.

As part of this announcement, Bridge supports NVIDIA Spectrum-4 SN5000 Series Ethernet switches, NVIDIA Cumulus Linux, NVIDIA BlueField-3 SuperNICs and DPUs, the NVIDIA NetQ observability and telemetry platform, and the NVIDIA Air data center digital twin platform, alongside NVIDIA HGX H100/H200 nodes.

Armada Powering the Next Generation of Secure, Multi-Tenant AI Factories

· 3 min read
Amar Kapadia
Product

As organizations continue to build AI factories capable of handling massive-scale inference and data processing, one challenge looms large: how to deliver secure, multi-tenant infrastructure that keeps GPUs fully utilized without adding operational complexity.

At Armada, we're tackling that challenge head-on. The Bridge product now integrates with NVIDIA BlueField-3 data processing units (DPUs) and NVIDIA RTX PRO Servers featuring NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, creating an end-to-end foundation for high-performance, automated AI infrastructure.

The Hidden Risks of Soft Isolation in Multi-Tenant GPU Clouds

· 5 min read
Amar Kapadia
Product

Relying solely on Kubernetes Namespaces or vClusters for multi-tenant isolation in GPU clouds is risky — especially when hosting untrusted or external workloads.

In September 2024, Wiz discovered a critical NVIDIA Container Toolkit vulnerability (CVE-2024-0132) that allowed GPU containers to escape soft isolation and gain root access to the host. This flaw impacted over one-third of GPU-enabled environments and exposed the limits of Kubernetes-based isolation.

Soft isolation is not secure isolation. For environments like Neoclouds, NVIDIA Cloud Partners (NCPs), or regulated industries, only hard or hybrid isolation strategies — such as dedicated Kubernetes clusters, MIG-based GPU partitioning, VPCs, VxLAN, VRFs, KVM virtualization, IB P-KEY, and NVLink partitioning — can protect against container escapes.
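As one concrete example of the hard-isolation techniques named above, MIG partitions a single GPU into hardware-isolated instances, each with its own memory and compute slices. A sketch using `nvidia-smi` follows; profile IDs and sizes vary by GPU model, and the commands below assume an A100-80GB-class device, so treat them as illustrative rather than a recipe.

```shell
# Enable MIG mode on GPU 0 (a GPU reset is required before it takes effect)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this device supports
sudo nvidia-smi mig -lgip

# Create two GPU instances from profile ID 9 (3g.40gb on A100 80GB),
# with a default compute instance inside each (-C)
sudo nvidia-smi mig -i 0 -cgi 9,9 -C

# Each MIG device now appears with its own UUID, so a scheduler can
# hand different tenants different hardware-isolated slices
nvidia-smi -L
```

Because the partition boundary is enforced in hardware, a container escape inside one MIG instance does not grant access to memory or compute belonging to another tenant's instance.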