4 posts tagged with "multi-tenancy"

Automated InfiniBand Network Isolation with Bridge GPU CMS

· 3 min read
Raghuram Gopalshetty
Engineering

Managing network isolation in AI cloud environments is critical for ensuring tenant data security, performance consistency, and compliance. This becomes even more important in high-performance AI clusters that rely on InfiniBand fabric for ultra-low latency communication between GPU nodes.

With Bridge GPU Cloud Management Software (GPU CMS), cloud providers can achieve complete InfiniBand network isolation for every tenant—all through an automated, policy-driven process. This ensures each tenant's data and traffic are fully segregated, with no manual intervention required.
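One widely used mechanism for this kind of fabric-level segregation is InfiniBand partition keys (P-Keys), which the subnet manager enforces at the port level. Below is a minimal sketch of an OpenSM `partitions.conf` with hypothetical tenant names and port GUIDs; it illustrates the concept only and is not the actual policy format used by Bridge GPU CMS.

```
# Default partition: all ports are limited members, so they can
# reach the subnet manager but not each other's full partitions.
Default=0x7fff, ipoib: ALL=limited, SELF=full;

# Hypothetical per-tenant partitions: only the listed port GUIDs
# are full members, so traffic cannot cross tenant boundaries.
TenantA=0x8001, ipoib: 0x0002c9030012aa01=full, 0x0002c9030012aa02=full;
TenantB=0x8002, ipoib: 0x0002c9030012bb01=full, 0x0002c9030012bb02=full;
```

An automated system would regenerate fragments like this (or drive the equivalent management API) whenever a tenant is onboarded or a node is reassigned, rather than requiring an operator to edit the file by hand.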

Bridge GPU CMS Announces Network Automation and Multi-Tenancy for NVIDIA Spectrum-X

· 5 min read
Sriram Rupanagunta
Engineering
Sandeep Sharma
Engineering

The latest Bridge GPU CMS release adds network automation, observability, fault management, and multi-tenancy support for the v1.3 NVIDIA Spectrum-X Reference Architecture (RA). The RA defines an East-West compute network fabric optimized for AI cloud deployments with HGX systems, and a North-South converged network for external access, storage, and control-plane traffic.

As part of this announcement, Bridge supports NVIDIA Spectrum-4 SN5000 Series Ethernet switches, NVIDIA Cumulus Linux, NVIDIA BlueField-3 SuperNICs and DPUs, the NVIDIA NetQ observability and telemetry platform, and the NVIDIA Air data center digital twin platform, alongside NVIDIA HGX H100/H200 nodes.

Armada Powering the Next Generation of Secure, Multi-Tenant AI Factories

· 3 min read
Amar Kapadia
Product

As organizations continue to build AI factories capable of handling massive-scale inference and data processing, one challenge looms large: how to deliver secure, multi-tenant infrastructure that keeps GPUs fully utilized without adding operational complexity.

At Armada, we're tackling that challenge head-on. The Bridge product now integrates with NVIDIA BlueField-3 data processing units (DPUs) and NVIDIA RTX PRO Servers featuring NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, creating an end-to-end foundation for high-performance, automated AI infrastructure.

The Hidden Risks of Soft Isolation in Multi-Tenant GPU Clouds

· 5 min read
Amar Kapadia
Product

Relying solely on Kubernetes Namespaces or vClusters for multi-tenant isolation in GPU clouds is risky — especially when hosting untrusted or external workloads.

In September 2024, Wiz discovered a critical NVIDIA Container Toolkit vulnerability (CVE-2024-0132) that allowed GPU containers to escape soft isolation and gain root access to the host. This flaw impacted over one-third of GPU-enabled environments and exposed the limits of Kubernetes-based isolation.

Soft isolation is not secure isolation. For environments like Neoclouds, NVIDIA Cloud Partners (NCPs), or regulated industries, only hard or hybrid isolation strategies — such as dedicated Kubernetes clusters, MIG-based GPU partitioning, VPCs, VxLAN, VRFs, KVM virtualization, IB P-KEY, and NVLink partitioning — can protect against container escapes.
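As one concrete example of the hard-isolation techniques named above, MIG partitions a single GPU into hardware-isolated instances, each with its own memory and compute slices. A sketch using `nvidia-smi` follows; profile IDs and sizes vary by GPU model, and the commands below assume an A100-80GB-class device, so treat them as illustrative rather than a recipe.

```shell
# Enable MIG mode on GPU 0 (a GPU reset is required before it takes effect)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this device supports
sudo nvidia-smi mig -lgip

# Create two GPU instances from profile ID 9 (3g.40gb on A100 80GB),
# with a default compute instance inside each (-C)
sudo nvidia-smi mig -i 0 -cgi 9,9 -C

# Each MIG device now appears with its own UUID, so a scheduler can
# hand different tenants different hardware-isolated slices
nvidia-smi -L
```

Because the partition boundary is enforced in hardware, a container escape inside one MIG instance does not grant access to memory or compute belonging to another tenant's instance.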