2 posts tagged with "isolation" | Armada Documentation

Automated InfiniBand Network Isolation with Bridge GPU CMS

March 7, 2025 · 3 min read

Raghuram Gopalshetty

Engineering

Managing network isolation in AI cloud environments is critical for ensuring tenant data security, performance consistency, and compliance. This becomes even more important in high-performance AI clusters that rely on InfiniBand fabric for ultra-low-latency communication between GPU nodes.

With Bridge GPU Cloud Management Software (GPU CMS), cloud providers can achieve complete InfiniBand network isolation for every tenant through an automated, policy-driven process. Each tenant's data and traffic are fully segregated, with no manual intervention required.
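One common mechanism for separating tenants on an InfiniBand fabric is the partition key (P_Key): a 16-bit value whose high bit marks full membership and whose low 15 bits identify the partition. The sketch below is illustrative only and does not show how Bridge GPU CMS works internally. The `PKeyAllocator` class and `can_communicate` helper are hypothetical names, and the sketch assumes one partition per tenant, with the default partition (0x7FFF) kept out of the tenant range.

```python
from dataclasses import dataclass, field

FULL_MEMBER_BIT = 0x8000    # high bit of the 16-bit P_Key: full membership
PARTITION_MASK = 0x7FFF     # low 15 bits: partition identifier
FIRST_TENANT_PKEY = 0x0001  # assumption: tenant partitions start here
LAST_TENANT_PKEY = 0x7FFE   # assumption: 0x7FFF (default partition) is reserved


@dataclass
class PKeyAllocator:
    """Hypothetical allocator that gives each tenant its own partition."""

    _by_tenant: dict = field(default_factory=dict)
    _next: int = FIRST_TENANT_PKEY

    def assign(self, tenant: str) -> int:
        """Return the tenant's 15-bit partition ID, allocating one if needed."""
        if tenant in self._by_tenant:
            return self._by_tenant[tenant]
        if self._next > LAST_TENANT_PKEY:
            raise RuntimeError("no free partition keys")
        pkey = self._next
        self._next += 1
        self._by_tenant[tenant] = pkey
        return pkey

    def full_member_pkey(self, tenant: str) -> int:
        """Return the tenant's P_Key with the full-membership bit set."""
        return self.assign(tenant) | FULL_MEMBER_BIT


def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Two ports can talk only if they share a partition and at least
    one of them is a full member of it."""
    same_partition = (pkey_a & PARTITION_MASK) == (pkey_b & PARTITION_MASK)
    one_is_full = bool((pkey_a | pkey_b) & FULL_MEMBER_BIT)
    return same_partition and one_is_full
```

For example, `PKeyAllocator().full_member_pkey("tenant-a")` yields a key for the first partition, and `can_communicate` returns `False` for keys belonging to two different tenants. In a real fabric the subnet manager enforces the partition table, so a sketch like this would only feed its configuration.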

The Hidden Risks of Soft Isolation in Multi-Tenant GPU Clouds

October 15, 2024 · 5 min read

Amar Kapadia

Product

Relying solely on Kubernetes Namespaces or vClusters for multi-tenant isolation in GPU clouds is risky, especially when hosting untrusted or external workloads.

In September 2024, Wiz discovered a critical NVIDIA Container Toolkit vulnerability (CVE-2024-0132) that allowed GPU containers to escape soft isolation and gain root access to the host. This flaw impacted over one-third of GPU-enabled environments and exposed the limits of Kubernetes-based isolation.

Soft isolation is not secure isolation. For environments like Neoclouds, NVIDIA Cloud Partners (NCPs), or regulated industries, only hard or hybrid isolation strategies can protect against container escapes. These include dedicated Kubernetes clusters, MIG-based GPU partitioning, VPCs, VXLAN, VRFs, KVM virtualization, InfiniBand partition keys (P_Keys), and NVLink partitioning.