Infrastructure
Setting Up a Multi-Tenant Infrastructure
NCPs and AI Cloud Providers need to plan and design their AI cloud infrastructure around their service offerings. Once the hardware is physically deployed in the data center, activities such as day 0 provisioning of bare-metal servers and setting up the underlay network typically follow. Armada Bridge automates these day 0 and day 1 activities, including discovery and provisioning of the complete data center topology: GPU compute nodes, network switches, and storage nodes.
This section discusses the technical aspects of setting up the infrastructure so that multiple tenants can consume it in an isolated manner. By isolation, we mean hard isolation that is equivalent to physical isolation. Kubernetes-style soft isolation is out of scope.
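The hard-isolation requirement can be read as a simple invariant: no physical resource belongs to more than one tenant at a time. The following is a minimal sketch of that check, not Armada Bridge's implementation. The function name, the tenant names, and the resource identifiers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def find_isolation_violations(assignments):
    """Return every resource that is assigned to more than one tenant.

    `assignments` maps a tenant name to the set of resource IDs (nodes,
    switch ports, storage volumes) dedicated to that tenant. The shape of
    this mapping is an assumption made for this sketch.
    """
    owners = defaultdict(set)
    for tenant, resources in assignments.items():
        for resource in resources:
            owners[resource].add(tenant)
    # Keep only the resources that have two or more owners.
    return {
        resource: sorted(tenants)
        for resource, tenants in owners.items()
        if len(tenants) > 1
    }


# Hypothetical example: tenant-b was mistakenly given a port owned by tenant-a.
assignments = {
    "tenant-a": {"gpu-node-01", "gpu-node-02", "leaf1:eth1"},
    "tenant-b": {"gpu-node-03", "leaf1:eth1"},
}
violations = find_isolation_violations(assignments)
print(violations)  # {'leaf1:eth1': ['tenant-a', 'tenant-b']}
```

An empty result means the assignment satisfies the invariant; any entry names a resource that would break hard isolation.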
To start, Armada Bridge performs infrastructure discovery and configuration in a sequence of steps:
- Network & GPU Fabric Discovery: The NCP admin can initiate the discovery process for a multi-tier network and GPU fabric, enabling centralized management.
- Fabric Components: The fabric includes Ethernet switches, InfiniBand switches, and a Subnet Manager (UFM), which are essential for configuring and managing network connectivity.
- Automated Infrastructure Discovery: Once triggered, the process finds all GPU compute nodes, maps each node to its network switch ports, and builds a topology that provides a comprehensive view of the infrastructure.
- Multi-tenancy configurations: This involves configuring the CPU and GPU compute nodes, the network fabric (Ethernet / InfiniBand), and external storage so that multiple tenants can use them simultaneously under strict isolation.
- Centralized Management: After discovery, the entire infrastructure, including network switches and GPU fabric, can be orchestrated and managed by the Bridge GPU CMS platform, ensuring streamlined operations and automation.
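The node-to-port mapping produced by the discovery step can be pictured as a small data structure. The sketch below assumes discovery yields one record per cabled link (switch, port, node). That record shape, the `Link` class, and the helper names are hypothetical and do not describe Bridge's internal model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One discovered cable: a switch port connected to a compute node."""
    switch: str
    port: str
    node: str


def build_topology(links):
    """Group links into {switch: {port: node}}, rejecting duplicate ports."""
    topology = defaultdict(dict)
    for link in links:
        if link.port in topology[link.switch]:
            raise ValueError(f"port {link.switch}:{link.port} reported twice")
        topology[link.switch][link.port] = link.node
    return dict(topology)


def ports_for_node(topology, node):
    """Return the sorted (switch, port) pairs that a node is attached to."""
    return sorted(
        (switch, port)
        for switch, ports in topology.items()
        for port, attached in ports.items()
        if attached == node
    )


# Hypothetical discovery output for two leaf switches and two GPU nodes.
links = [
    Link("leaf1", "eth1", "gpu-node-01"),
    Link("leaf2", "eth1", "gpu-node-01"),
    Link("leaf1", "eth2", "gpu-node-02"),
]
topology = build_topology(links)
print(ports_for_node(topology, "gpu-node-01"))
```

Keeping the map keyed by switch and port makes it cheap to answer the question a multi-tenancy step would need: which ports must be reconfigured when a given node changes tenant.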
The following sub-sections detail these activities for setting up a multi-tenant AI cloud infrastructure.
Hardware Provisioning
Hardware provisioning of the underlying compute resources involves multiple steps. The steps below assume that provisioning is done directly with Armada Bridge. If another provisioning tool (such as NVIDIA BCM) is used, they are not needed.
Bare-Metal Provisioning via Redfish:
- CMS leverages Redfish to communicate with the server's Baseboard Management Controller (BMC), allowing for full control over server hardware.
- Redfish enables tasks like power management, hardware configuration, and eventually operating system (OS) installation on bare-metal servers.
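As an illustration of the power-management task, the Redfish standard defines a `ComputerSystem.Reset` action that is invoked with an HTTP POST against the BMC. The sketch below only builds that request and does not send it. The host name and the system ID are placeholders, the system ID differs between vendors, and a real call also needs authentication and TLS handling that are omitted here. How Armada Bridge issues such calls internally is not described in this document.

```python
import json
import urllib.request

# Common ResetType values from the Redfish schema. A given BMC advertises the
# subset it actually supports in "ResetType@Redfish.AllowableValues".
RESET_TYPES = {"On", "ForceOff", "GracefulShutdown", "GracefulRestart", "ForceRestart"}


def build_reset_request(bmc_host, system_id, reset_type):
    """Build (but do not send) a Redfish ComputerSystem.Reset request."""
    if reset_type not in RESET_TYPES:
        raise ValueError(f"unsupported ResetType: {reset_type}")
    url = (
        f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and system ID, for illustration only.
request = build_reset_request("bmc.example.internal", "1", "On")
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` with suitable credentials and certificate settings for the BMC in question.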
Integration with MaaS (Metal as a Service):
- Internally, Armada Bridge integrates with MaaS for OS provisioning. MaaS is an open-source tool designed for automating the provisioning of physical servers in a datacenter.
- While Redfish handles the hardware provisioning, MaaS takes care of the OS installation, enabling seamless automation of OS deployment on bare-metal machines once they are powered on and initialized.
Custom OS Image Management:
- The NCP admin (NVIDIA Cloud Partner admin) manages custom operating system images in the Armada Bridge UI. These images can be pre-configured OS environments tailored to specific tenant needs.
- The admin can upload or configure OS images, making them available for tenants to use when provisioning new machines.
Provisioning and De-Provisioning Flows: Both provisioning and de-provisioning flows are supported:
- Provisioning involves installing and configuring the hardware and OS according to tenant specifications.
- De-Provisioning allows the Armada Bridge to securely shut down and wipe the hardware, making it available for future use. This may involve wiping the storage, resetting hardware settings, and returning the server to a "bare" state.
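The two flows can be viewed as one lifecycle in which a server leaves the bare state, is handed to a tenant, and later returns to the bare state. The sketch below models that lifecycle as a small state machine. The state names and the allowed transitions are assumptions made for illustration, not Bridge's actual states.

```python
from enum import Enum


class State(Enum):
    BARE = "bare"
    HARDWARE_CONFIGURED = "hardware_configured"
    OS_INSTALLED = "os_installed"
    IN_USE = "in_use"
    WIPING = "wiping"


# Hypothetical transition table: provisioning moves forward to IN_USE,
# de-provisioning goes through WIPING and back to BARE.
TRANSITIONS = {
    State.BARE: {State.HARDWARE_CONFIGURED},
    State.HARDWARE_CONFIGURED: {State.OS_INSTALLED},
    State.OS_INSTALLED: {State.IN_USE},
    State.IN_USE: {State.WIPING},
    State.WIPING: {State.BARE},
}


def advance(current, target):
    """Move to `target` if the transition is allowed, else raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def run(path, start=State.BARE):
    """Apply a sequence of transitions and return the final state."""
    state = start
    for target in path:
        state = advance(state, target)
    return state


provision = [State.HARDWARE_CONFIGURED, State.OS_INSTALLED, State.IN_USE]
deprovision = [State.WIPING, State.BARE]
final_state = run(provision + deprovision)
print(final_state)  # State.BARE
```

Forcing every server through the wiping state before it becomes bare again is what keeps one tenant's data from reaching the next tenant in this model.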
Once this setup is complete, tenant users can request bare-metal servers through the Armada Bridge UI, selecting from the custom OS images that the admin has prepared. When a tenant selects an OS image and initiates provisioning, Armada Bridge uses Redfish to manage the underlying hardware, while MaaS handles the OS installation.
In a few deployment scenarios, where NCPs prefer NVIDIA Base Command Manager (BCM) for low-level infrastructure provisioning, Armada Bridge integrates with BCM to offer the above functionality through a single pane of glass. Please refer to Figure 3 (Bridge Reference Architecture), which depicts the BCM integration with Armada Bridge.
Finally, Armada Bridge also supports day 2 management activities such as OS upgrades and patching.
The GPU infrastructure consists of the servers, storage, switches, and other data center equipment managed by Bridge.
Once the Super Admin logs into the Bridge UI, the infrastructure can be either:
- Discovered by Bridge, by providing the necessary information about the switches. This is recommended for larger configurations involving several switches, servers and other data center equipment.
- Imported into Bridge, by providing the details in a CSV file in a specific format (typically recommended for small to medium deployment sizes).
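For the import path, the kind of validation an inventory file might go through before upload can be sketched as follows. The CSV format that Bridge expects is not given in this document, so the column names below (`hostname`, `role`, `bmc_address`) are purely hypothetical placeholders.

```python
import csv
import io

# Hypothetical required columns; the real Bridge import format may differ.
REQUIRED_COLUMNS = ("hostname", "role", "bmc_address")


def parse_inventory(csv_text):
    """Parse inventory CSV text into a list of dicts, validating each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    rows = []
    # Data rows start on line 2 because line 1 holds the header.
    for line_no, row in enumerate(reader, start=2):
        if not all((row[c] or "").strip() for c in REQUIRED_COLUMNS):
            raise ValueError(f"line {line_no}: empty required field")
        rows.append({c: row[c].strip() for c in REQUIRED_COLUMNS})
    return rows


# Illustrative content only; addresses are from a documentation range.
sample = "\n".join([
    "hostname,role,bmc_address",
    "gpu-node-01,compute,192.0.2.11",
    "leaf1,switch,192.0.2.21",
])
inventory = parse_inventory(sample)
print(len(inventory), inventory[0]["hostname"])  # 2 gpu-node-01
```

Checking the file locally in this way catches missing columns and blank fields before the import is attempted.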