
Operationalizing Distributed AI: Armada and NVIDIA AI Grid

· 6 min read
Anish Swaminathan
Engineering
Amar Kapadia
Product
Sandeep Sharma
Engineering

Real-time AI is reshaping infrastructure requirements.

Inference workloads such as conversational AI, real-time video generation, AR/XR streaming, visual search, and large-scale personalization demand ultra-low latency, predictable performance, and geographic proximity to users and data sources. Centralized AI factories remain essential for training, but for many AI-native services, inference at scale requires AI Grids: geographically distributed GPU infrastructure operating as a unified, policy-controlled system.

Armada is collaborating with NVIDIA to enable NVIDIA AI Grid on Armada Edge Platform (AEP), providing telecommunications operators, service providers, and enterprises with a validated architecture for deploying and operating distributed AI infrastructure at global scale.

This post explores the architecture and operational model behind that system.

How NVIDIA DSX Air Reduces Dev/Test Costs, Accelerates PoCs, and Lowers Production Risk for Armada and Its Customers

· 4 min read
Pavan Samudrala
Engineering
Sriram Rupanagunta
Engineering
Amar Kapadia
Product

Armada has been an NVIDIA Air user since late 2024, and we have derived significant benefits from its ability to simulate Spectrum-X Ethernet environments for both internal development and customer proof-of-concept initiatives. NVIDIA Air has enabled us to validate networking configurations and topologies, test multi-tenant configurations, and accelerate deployments of Bridge (Armada's on-prem software product that provides multi-tenancy and cloud services on GPU hardware) without relying exclusively on physical hardware.

We are excited about the launch of NVIDIA DSX Air and the expanded AI Factory digital-simulation capabilities it introduces. This next evolution unlocks multiple use cases for Armada and our customers, driving measurable improvements in development velocity, PoC efficiency, operational stability, OPEX reduction, and go-to-market speed. The key benefits are described below.

Delivering Distributed AI at the Edge with Bridge

· 6 min read
Amar Kapadia
Product
Sriram Rupanagunta
Engineering

Not all AI is created equal. Centralized inference serves use cases where long thinking times are acceptable, but newer workloads such as physical AI, real-time agentic AI chatbots, digital avatars holding live dialog, and computer vision require faster response times. Network latency is only part of the story: compute latency matters too, which mandates computation closer to data sources and lower bandwidth usage across the network in order to scale cost-effectively.

These applications can't tolerate the latency of round trips to centralized data centers, nor can they afford the cost of constantly transferring large volumes of data. Instead, they require inference that is geographically distributed, dynamically orchestrated, and tightly optimized for latency and bandwidth.
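As a rough illustration of why distance dominates the latency budget, consider a single inference round trip. The distances, compute time, and fiber propagation speed below are illustrative assumptions, not figures from the post:

```python
# Back-of-the-envelope latency budget for one inference round trip.
# Assumes light travels ~200,000 km/s in fiber (200 km per millisecond,
# one way) and ignores queuing and routing overhead, so real-world
# figures will be higher.

FIBER_KM_PER_MS = 200.0

def network_rtt_ms(distance_km: float) -> float:
    """Round-trip time over fiber for a given one-way distance."""
    return 2 * distance_km / FIBER_KM_PER_MS

def total_latency_ms(distance_km: float, compute_ms: float) -> float:
    """Network round trip plus model compute time."""
    return network_rtt_ms(distance_km) + compute_ms

# Centralized: user 2,000 km from the data center -> 20 ms RTT alone.
centralized = total_latency_ms(2000, compute_ms=30.0)  # 50.0 ms total
# Edge: user 50 km from the GPU site -> 0.5 ms RTT.
edge = total_latency_ms(50, compute_ms=30.0)           # 30.5 ms total
```

Under these assumptions the network share of the budget shrinks roughly forty-fold at the edge, which is the core argument for geographically distributed inference.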

Onboarding an NVIDIA NVIS-Deployed GPU Topology with Bridge GPU CMS

· 3 min read
Namachi Sankaranarayanan
Engineering

We recently collaborated with the NVIDIA Infrastructure Specialist (NVIS) team to onboard and validate a complex metadata topology deployed by NVIS into our Bridge GPU Cloud Management Software (CMS). This exercise demonstrates how Bridge GPU CMS can take over an NVIS-deployed GPU topology and then perform day-1 and day-2 activities such as discovery, dynamic multi-tenancy, observability, fault management, and more.

Seamless Integration of Bridge with DDN EXAScaler for High-Performance AI Workloads

· 3 min read
Raghuram Gopalshetty
Engineering

Managing external storage for GPU-accelerated AI workloads can be complex—especially when ensuring that storage volumes are provisioned correctly, isolated per tenant, and automatically mounted to the right compute nodes. With Bridge GPU Cloud Management Software (GPU CMS), this entire process is streamlined through seamless integration with DDN EXAScaler.

Automated InfiniBand Network Isolation with Bridge GPU CMS

· 3 min read
Raghuram Gopalshetty
Engineering

Managing network isolation in AI cloud environments is critical for ensuring tenant data security, performance consistency, and compliance. This becomes even more important in high-performance AI clusters that rely on InfiniBand fabric for ultra-low latency communication between GPU nodes.

With Bridge GPU Cloud Management Software (GPU CMS), cloud providers can achieve complete InfiniBand network isolation for every tenant—all through an automated, policy-driven process. This ensures each tenant's data and traffic are fully segregated, with no manual intervention required.
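Conceptually, per-tenant InfiniBand isolation comes down to giving each tenant its own partition key (P-Key) and restricting partition membership to that tenant's HCA ports. The post does not describe Bridge GPU CMS internals, so the following is only a sketch of how such a policy might be rendered into an OpenSM `partitions.conf`-style fragment; the tenant names and port GUIDs are hypothetical:

```python
# Sketch: render per-tenant InfiniBand partition definitions in the style
# of an OpenSM partitions.conf file. Each tenant receives a unique P-Key,
# and only that tenant's port GUIDs become full members of the partition,
# so traffic cannot cross tenant boundaries at the fabric level.

def render_partitions(tenants: dict[str, list[str]]) -> str:
    """tenants maps a tenant name to its list of HCA port GUIDs."""
    lines = []
    # 0x7fff is conventionally the default partition; tenant P-Keys
    # are allocated from an arbitrary base above it here.
    for i, (name, guids) in enumerate(sorted(tenants.items())):
        pkey = 0x1000 + i
        members = ", ".join(f"{guid}=full" for guid in guids)
        lines.append(f"{name}=0x{pkey:04x}, ipoib: {members};")
    return "\n".join(lines)

policy = {
    "tenant-a": ["0x0002c9030012e3f1", "0x0002c9030012e3f2"],
    "tenant-b": ["0x0002c9030045aa10"],
}
print(render_partitions(policy))
```

A real controller would also push the generated partitions to the subnet manager and reconcile them as tenants come and go; this snippet only shows the policy-to-config translation step.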

Seamless External Storage Integration with VAST Using Bridge GPU CMS

· 3 min read
Raghuram Gopalshetty
Engineering

Managing external storage for GPU-accelerated AI workloads can be complex—especially when ensuring that storage volumes are provisioned correctly, isolated per tenant, and automatically mounted to the right compute nodes. With Bridge GPU Cloud Management Software (GPU CMS), this entire process is streamlined through seamless integration with VAST external storage systems.

Bridge GPU CMS Announces Network Automation and Multi-Tenancy for NVIDIA Spectrum-X

· 5 min read
Sriram Rupanagunta
Engineering
Sandeep Sharma
Engineering

The latest Bridge GPU CMS release adds network automation, observability, fault management, and multi-tenancy for the v1.3 NVIDIA Spectrum-X Reference Architecture (RA). The Reference Architecture defines an East-West compute network fabric optimized for AI cloud deployments with HGX systems and a North-South converged network for external access, storage, and control-plane traffic.

As part of this announcement, Bridge supports NVIDIA Spectrum-4 SN5000 Series Ethernet switches, NVIDIA Cumulus Linux, NVIDIA BlueField-3 SuperNICs and DPUs, the NVIDIA NetQ observability and telemetry platform, and the NVIDIA Air data center digital twin platform, along with NVIDIA HGX H100/H200 nodes.

Armada Powering the Next Generation of Secure, Multi-Tenant AI Factories

· 3 min read
Amar Kapadia
Product

As organizations continue to build AI factories capable of handling massive-scale inference and data processing, one challenge looms large: how to deliver secure, multi-tenant infrastructure that keeps GPUs fully utilized without adding operational complexity.

At Armada, we're solving that challenge head-on. Our Bridge product now integrates with NVIDIA BlueField-3 data processing units (DPUs) and NVIDIA RTX PRO Servers featuring NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, creating an end-to-end foundation for high-performance, automated AI infrastructure.

The Hidden Risks of Soft Isolation in Multi-Tenant GPU Clouds

· 5 min read
Amar Kapadia
Product

Relying solely on Kubernetes Namespaces or vClusters for multi-tenant isolation in GPU clouds is risky — especially when hosting untrusted or external workloads.

In September 2024, Wiz discovered a critical NVIDIA Container Toolkit vulnerability (CVE-2024-0132) that allowed GPU containers to escape soft isolation and gain root access to the host. This flaw impacted over one-third of GPU-enabled environments and exposed the limits of Kubernetes-based isolation.

Soft isolation is not secure isolation. For environments like Neoclouds, NVIDIA Cloud Partners (NCPs), or regulated industries, only hard or hybrid isolation strategies — such as dedicated Kubernetes clusters, MIG-based GPU partitioning, VPCs, VxLAN, VRFs, KVM virtualization, IB P-KEY, and NVLink partitioning — can protect against container escapes.
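As a practical complement to hard isolation, operators can flag hosts that may still be exposed to the container-escape flaw by comparing the installed NVIDIA Container Toolkit version against the patched release (NVIDIA's advisory lists v1.16.2 as the fix for CVE-2024-0132). A minimal comparison sketch; the parsing is deliberately simplified and assumes plain `X.Y.Z` version strings:

```python
# Sketch: check whether an NVIDIA Container Toolkit version predates the
# CVE-2024-0132 fix (v1.16.2 per NVIDIA's advisory). Pre-release suffixes
# such as "-rc1" are not handled.

FIXED_VERSION = (1, 16, 2)

def parse_version(version: str) -> tuple[int, ...]:
    """Turn "1.16.1" into (1, 16, 1) for tuple-wise comparison."""
    return tuple(int(part) for part in version.split("."))

def is_vulnerable(installed: str) -> bool:
    """True if the installed toolkit predates the CVE-2024-0132 fix."""
    return parse_version(installed) < FIXED_VERSION

print(is_vulnerable("1.16.1"))  # True  — unpatched release
print(is_vulnerable("1.16.2"))  # False — patched release
```

A check like this only catches the known CVE; it does not substitute for the hard-isolation strategies listed above, which protect against escapes that have not yet been disclosed.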