Version: 5.4.0

Cluster Observability

Overview

Cluster Observability provides real-time monitoring and visualization of your Kubernetes cluster's health, resource utilization, and workload performance. It offers detailed insights into node, pod, and microservice-level metrics through interactive dashboards.

Prerequisites

Tenant Admin access — Log in as a Tenant Admin.
Cluster in Running state — A Kubernetes cluster must be created and in a Running state.

Cluster Level Observability

Accessing Cluster Level Observability

Navigate to Compute → Kubernetes and click your cluster name to open the cluster detail view.
The cluster detail page opens on the Overview tab by default.
The Overview tab contains three sub-sections: Health, Utilization, and Pods.

You can adjust the Time Range (e.g., Last 1 hour) and Refresh interval (e.g., 30s) using the controls in the top-right corner.

Health

The Health sub-section provides a comprehensive view of cluster health through three expandable panels:

Node Resource Overview — Aggregated metrics for all nodes in the cluster.
Pod Resource Overview — Resource consumption details for individual pods.
Microservices (Container Name) Resource Overview — Per-container metrics for microservices running in the cluster.

Figure: Health Overview

Node Resource Overview

The Node Resource Overview panel displays aggregated metrics across all nodes in the cluster, helping tenant administrators monitor infrastructure-level health at a glance.

Key metrics include:

Panel	Description
Node Memory Ratio	Memory utilization, usage, and limits as a percentage of total node memory
Node CPU Ratio	CPU utilization, request, and limit ratios across all nodes
Nodes with Pod	Summary counts — total number of pods, nodes, and upper pod limits
Namespace Resource Statistics	Breakdown of resource counts (services, pods) per namespace
Network Overview	Network receive and send throughput (Mib/s) over time
Workload / Total Pod / Total Nodes	Aggregate counts of workloads, pods, and nodes in the cluster
Memory Usage [All]	Time-series chart of total memory usage across all nodes
CPU Used Cores [All]	Time-series chart of total CPU core usage across all nodes
Pod Number and Nodes [All]	Pod count vs. upper pod limit over time

Figure: Node Resource Overview
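To make the ratio panels concrete, here is a minimal Python sketch of how usage, request, and limit ratios could be derived from raw per-node totals. The `NodeTotals` type, its field names, and the sample numbers are hypothetical; the queries the dashboard actually runs are not described in this guide.

```python
from dataclasses import dataclass


@dataclass
class NodeTotals:
    """Hypothetical per-node totals, all in one unit (for example cores or bytes)."""
    capacity: float   # total resource available on the node
    used: float       # resource actually consumed
    requested: float  # sum of container requests placed on the node
    limited: float    # sum of container limits placed on the node


def cluster_ratios(nodes: list[NodeTotals]) -> dict[str, float]:
    """Return usage, request, and limit ratios as percentages of total capacity."""
    capacity = sum(n.capacity for n in nodes)
    if capacity <= 0:
        raise ValueError("total capacity must be positive")
    return {
        "usage_pct": 100.0 * sum(n.used for n in nodes) / capacity,
        "request_pct": 100.0 * sum(n.requested for n in nodes) / capacity,
        "limit_pct": 100.0 * sum(n.limited for n in nodes) / capacity,
    }


if __name__ == "__main__":
    sample = [
        NodeTotals(capacity=16.0, used=4.0, requested=8.0, limited=20.0),
        NodeTotals(capacity=16.0, used=12.0, requested=8.0, limited=12.0),
    ]
    print(cluster_ratios(sample))
```

A limit ratio at or above 100% in a calculation like this would indicate that the sum of limits meets or exceeds capacity, which is worth checking alongside the usage ratio.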

Additional node-level panels:

Panel	Description
Nodes CPU Breakdown	Per-node CPU usage breakdown over time
Node Memory Breakdown	Per-node memory usage breakdown over time
Node Network Overview	Inflow and outflow network traffic per node
Node Information Detail	Table with per-node details — Pod Limit, CPU Usage%, Memory Usage%, Disk Usage%, CPU Total Cores, Memory Total, Disk Total, and more

Figure: Node Resource Overview

Storage and namespace panels:

Panel	Description
PVC Storage Usage	Persistent Volume Claim usage, total, usage rate, and mount path
Namespaces CPU Usage	CPU consumption per namespace (kernel-level, >0.005 cores)
Namespaces WSS Memory Usage	Working Set Size (WSS) memory consumption per namespace (>100MiB)

Figure: Node Resource Overview - Storage and Namespaces

Pod Resource Overview

The Pod Resource Overview panel provides granular visibility into individual pod performance and resource consumption.

Key metrics include:

Panel	Description
Pod Resource Detail	Table listing each pod with Namespace, Container Name, Pod Name, CPU%, WSS%, RSS%, Memory Limit, WSS, RSS, Disk Limit, Disk Usage, Survival time, Requests, and Inflow
Pod Containers CPU Utilization	CPU utilization percentage (Maximum 100%) per pod container over time
Pod Container Memory Usage	Memory usage per pod container over time
Pod Network Bandwidth per Second	Network inflow and outflow bandwidth per pod
Pod Containers CPU Core Usage	Actual CPU core consumption per pod container
Pod Containers WSS Memory Usage	Working Set Size memory per pod container
Pod Containers RSS Memory Usage	Resident Set Size memory per pod container

Figure: Pod Resource Overview
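The WSS and RSS columns are easier to read with the usual definitions in mind. As background, cAdvisor reports a container's working set as total memory usage minus inactive file cache, floored at zero. The sketch below assumes the dashboard follows that convention; the function names and sample values are hypothetical.

```python
from typing import Optional


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Working set = total usage minus inactive file cache, floored at zero."""
    return max(usage_bytes - inactive_file_bytes, 0)


def percent_of_limit(value_bytes: int, limit_bytes: Optional[int]) -> Optional[float]:
    """Return value as a percentage of the memory limit, or None if no limit is set."""
    if not limit_bytes:
        return None
    return 100.0 * value_bytes / limit_bytes


if __name__ == "__main__":
    MIB = 1024 * 1024
    wss = working_set_bytes(usage_bytes=400 * MIB, inactive_file_bytes=150 * MIB)
    print(wss // MIB)                         # working set in MiB
    print(percent_of_limit(wss, 512 * MIB))   # WSS% against a 512 MiB limit
    print(percent_of_limit(wss, None))        # container without a memory limit
```

Returning `None` for containers without a limit keeps a missing limit distinct from a genuine 0% reading.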

Microservices (Container Name) Resource Overview

The Microservices Resource Overview panel provides container-level metrics, allowing you to monitor individual microservices running within pods.

Key metrics include:

Panel	Description
Resource Statistics	Table listing each container with Namespace, Container Name, Pod count, Average CPU Usage, WSS Memory Usage, RSS Memory Use, Total CPU Restrictions, CPU Core Usage, WSS Memory, Total RSS Memory, Disk Linage, Disk Used, CPU Demand, and Total Memory Requirements
Average CPU Usage	CPU usage percentage (Maximum 100%) per microservice over time
Average Memory Utilization	Memory utilization percentage per microservice over time
Network Bandwidth per Second	Network inflow and outflow per microservice
Overall CPU Cores Used	Actual CPU core consumption per microservice with limits shown
Overall Memory Usage	Memory consumption (WSS and RSS) per microservice
Pod Number	Number of pods running for each microservice over time

Figure: Microservices Resource Overview

Utilization

The Utilization sub-section provides cluster-wide resource utilization metrics through four expandable panels: Overview, Resources, Kubernetes, and Network.

Figure: Utilization Tab

Overview

The Overview panel displays high-level cluster resource consumption at a glance.

Key metrics include:

Panel	Description
Global CPU Usage	Real-time CPU usage percentage across the cluster
Global RAM Usage	Real-time RAM usage percentage across the cluster
Nodes	Total number of nodes in the cluster
Namespaces	Total number of namespaces in the cluster
Kubernetes Resource Count	Time-series chart showing counts of Running Containers, Running Pods, ConfigMaps, Services, Endpoints, Secrets, Ingresses, and Nodes
CPU Usage	Breakdown of CPU — Real, Requests, Limits values
RAM Usage	Breakdown of RAM — Real, Requests, Limits values
Running Pods	Total number of currently running pods

Figure: Utilization Overview Expanded

Resources

The Resources panel provides detailed time-series charts for CPU, memory, and related resource utilization across the cluster.

Key metrics include:

Panel	Description
Cluster CPU Utilization	Overall CPU utilization percentage over time
Cluster Memory Utilization	Overall memory utilization percentage over time
CPU Utilization by Namespace	CPU usage breakdown per namespace over time
Memory Utilization by Namespace	Memory usage breakdown per namespace over time
CPU Utilization by Instance	CPU usage breakdown per node instance over time
Memory Utilization by Instance	Memory usage breakdown per node instance over time
CPU Throttled Seconds by Namespace	CPU throttling duration per namespace
CPU Data Transmit by Instance	CPU data transmission rate per node instance

Figure: Utilization Resources
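Panels such as CPU Throttled Seconds by Namespace are typically built from cumulative counters that only ever increase. The following Python sketch shows one simplified way to turn counter samples into a per-second rate while tolerating counter resets. It is an illustration only: the function name and sample data are hypothetical, and it does not reproduce the extrapolation performed by a real monitoring backend.

```python
def per_second_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase of a cumulative counter.

    samples: (timestamp_seconds, counter_value) pairs in time order.
    A drop in value is treated as a counter reset (for example after a
    container restart), so the post-reset value counts as that step's increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


if __name__ == "__main__":
    # Hypothetical throttled-seconds counter sampled every 30 s, with one reset.
    throttled = [(0.0, 10.0), (30.0, 13.0), (60.0, 1.0), (90.0, 4.0)]
    print(f"{per_second_rate(throttled):.4f} throttled seconds per second")
```

A rate close to 1.0 for a single container would mean it spent almost the whole interval throttled.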

Kubernetes

The Kubernetes panel provides insights into Kubernetes-specific resource states and pod health.

Key metrics include:

Panel	Description
Kubernetes Pods QoS Classes	Breakdown of pods by QoS class — BestEffort, Burstable, Guaranteed — with min, max, and mean counts over time
Kubernetes Pods Status Reason	Pod status distribution by reason: Evicted, NodeAffinity, NodeLost, Shutdown, UnexpectedAdmissionError
OOM Events by Namespace	Out-of-Memory kill events per namespace over time
Container Restarts by Namespace	Container restart counts per namespace over time

Figure: Utilization Kubernetes
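The QoS class shown for each pod is assigned by Kubernetes from the CPU and memory requests and limits of its containers. The Python sketch below restates those rules for illustration. The `Container` type is hypothetical, quantities are compared as plain strings (so `"500m"` and `"0.5"` would not match), and init containers are ignored, so treat it as a simplification and not the scheduler's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Hypothetical container spec; keys are "cpu" and/or "memory"."""
    requests: dict[str, str] = field(default_factory=dict)
    limits: dict[str, str] = field(default_factory=dict)


def qos_class(containers: list[Container]) -> str:
    """Classify a pod as BestEffort, Burstable, or Guaranteed."""
    resources = ("cpu", "memory")
    # BestEffort: no container sets any CPU or memory request or limit.
    nothing_set = all(
        not c.requests.get(r) and not c.limits.get(r)
        for c in containers
        for r in resources
    )
    if nothing_set:
        return "BestEffort"
    # Guaranteed: every container has CPU and memory limits equal to its requests.
    for c in containers:
        for r in resources:
            limit = c.limits.get(r)
            # Kubernetes defaults a missing request to the limit.
            request = c.requests.get(r, limit)
            if limit is None or request != limit:
                return "Burstable"
    return "Guaranteed"


if __name__ == "__main__":
    full = {"cpu": "500m", "memory": "1Gi"}
    print(qos_class([Container()]))
    print(qos_class([Container(requests={"cpu": "250m"})]))
    print(qos_class([Container(requests=full, limits=full)]))
```

This ordering matters during node memory pressure, when BestEffort pods are generally the first candidates for eviction and Guaranteed pods the last.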

Network

The Network panel provides detailed network traffic and bandwidth metrics across the cluster.

Key metrics include:

Panel	Description
Global Network Utilization by Device	Overall network bandwidth (send/receive) per network device
Network Saturation - Packets Dropped	Dropped packet counts indicating network saturation
Network Received by Namespace	Inbound network traffic per namespace
Total Network Received (with all virtual devices) by Instance	Total inbound traffic per node instance including virtual devices
Network Received (without loopback) by Instance	Inbound traffic per node instance excluding loopback
Network Received (loopback only) by Instance	Loopback-only inbound traffic per node instance

Figure: Utilization Network

Pods

The Pods sub-section provides pod-level observability scoped to a specific namespace and pod. Use the Namespace and Pod dropdowns at the top to filter the view. It contains four expandable panels: Information, Resources, Kubernetes, and Network.

Figure: Pods Tab

Information

The Information panel provides metadata and classification details for the selected pods.

Key fields include:

Panel	Description
Created by	The controller or resource that created the pods (e.g., DaemonSet, ReplicaSet)
Running on	The node(s) where the pods are currently running
Pod IP	The IP addresses assigned to each pod
Priority Class	The priority class assigned to each pod
QOS Class	The Quality of Service class for each pod (BestEffort, Burstable, Guaranteed)
Last Terminated Reason	The reason for the last pod termination, if applicable
Last Terminated Exit Code	The exit code from the last terminated container

Figure: Pods Information
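When reading Last Terminated Exit Code, it helps to know the common convention that exit codes above 128 mean the process was stopped by a signal (code minus 128). For example, a container killed for exceeding its memory limit usually reports 137, which is 128 + 9 (SIGKILL). The helper below is a hypothetical sketch of that convention and is not part of the product.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the common 128 + signal convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            return f"exit code {code} (no matching signal)"
        return f"terminated by {name} (signal {code - 128})"
    return f"application error (exit code {code})"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, "->", describe_exit_code(code))
```

Signal numbers are resolved on the machine running the script, so run it on Linux for results that match the cluster nodes.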

Resources

The Resources panel provides detailed CPU, memory, and resource consumption metrics for the selected pods.

Key metrics include:

Panel	Description
Total pod CPU request/usage	Gauge showing actual CPU usage vs. requested CPU
Total pod CPU limits/usage	Gauge showing actual CPU usage vs. CPU limits
Total pod RAM request/usage	Gauge showing actual RAM usage vs. requested RAM
Total pod RAM limits/usage	Gauge showing actual RAM usage vs. RAM limits
Resources by Container	Table with per-container CPU Requests, CPU Limits, Memory Requests, and Memory Limits
CPU Usage / Requests & Limits by Container	Stacked time-series chart of CPU usage, requests, and limits per container
Memory Usage / Requests & Limits by Container	Stacked time-series chart of memory usage, requests, and limits per container
CPU Usage by Container	CPU core usage per container over time
Memory Usage by Container	Memory consumption per container over time
CPU Throttled Seconds by Container	CPU throttling duration per container over time

Figure: Pods Resources

Kubernetes

The Kubernetes panel provides Kubernetes-specific health and scheduling metrics for the selected pods.

Figure: Pods Kubernetes

Key metrics include:

Panel	Description
OOM Events by Container	Out-of-Memory kill events per container over time
Container Restarts by Container	Container restart counts per container over time
Pods with Container Issues	Pods experiencing container-level issues
Unscheduled Pod Count	Number of pods that could not be scheduled
Unscheduled Pods (detail)	Details of pods that failed scheduling

Network

The Network panel provides network traffic metrics for the selected pods.

Figure: Pods Network

Key metrics include:

Panel	Description
Network - Bandwidth	Network bandwidth (received and transmitted) over time
Network - Packets Rate	Packet rate (received and transmitted) over time
Network - Packets Dropped	Dropped packets (received and transmitted) over time
Network - Errors	Network errors (received and transmitted) over time

Node Level Observability

Node Level Observability provides detailed monitoring for individual nodes in your cluster. You can access it from the Nodes tab in the cluster detail view.

Accessing Node Level Observability

From the cluster detail view, click the Nodes tab.
Click the node name to open the node detail page.
The node detail page displays node metadata (IP, CPU, Memory, GPU Allocated) and an Observability section with six expandable panels: Overview, Resources, System, Network, Kubernetes Storage, and Node Storage.

Figure: Node Observability Panels

Overview

The Overview panel displays high-level resource consumption and pod information for the selected node.

Key metrics include:

Panel	Description
CPU Usage	Gauge showing current CPU usage percentage on the node
RAM Usage	Gauge showing current RAM usage percentage on the node
Pods on Node	Total number of pods running on the node
CPU Used / CPU Total	Actual CPU cores used vs. total available
RAM Used / RAM Total	Actual RAM used vs. total available
Uptime	How long the node has been running
List of Pods on Node	Table listing all pods on the node with pod name, namespace, and created_by kind

Figure: Node Observability Overview

Resources

The Resources panel provides detailed CPU and memory utilization charts for the selected node.

Key metrics include:

Panel	Description
CPU Usage	Time-series chart of CPU usage breakdown (iowait, nice, softirq, steal, system, user)
Memory Usage	Time-series chart of memory usage breakdown (RAM Used, RAM Cache, RAM Buffer, SWAP Used, SWAP Cache, SWAP Total)
CPU Usage by Pod	CPU core consumption per pod over time
Memory Usage by Pod	Memory consumption per pod over time
Number of CPU Core Throttled	CPU core throttling events over time

Figure: Node Observability Resources

System

The System panel provides operating system-level metrics for the selected node.

Key metrics include:

Panel	Description
System Load	System load averages (1m, 5m, 15m) over time
Context Switches & Interrupts	Context switch and interrupt counts over time
File Descriptors	Maximum vs. allocated file descriptors over time
Time Sync	Estimated time synchronization error in seconds

Figure: Node Observability System
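Two of the System panels lend themselves to a quick sanity check from a shell on the node. The Python sketch below normalizes a load average by core count and computes file descriptor usage from the three values Linux exposes in `/proc/sys/fs/file-nr` (allocated, allocated but unused, maximum). The helper names are hypothetical, and the way the dashboard itself collects these values is an assumption not covered by this guide.

```python
import os


def load_per_core(load_avg: float, cpu_count: int) -> float:
    """Normalize a load average by core count; values above 1.0 suggest saturation."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


def fd_usage_percent(file_nr_text: str) -> float:
    """Parse /proc/sys/fs/file-nr contents: allocated, unused, maximum."""
    allocated, _unused, maximum = (int(x) for x in file_nr_text.split())
    return 100.0 * allocated / maximum


if __name__ == "__main__":
    one_min, five_min, fifteen_min = os.getloadavg()  # Unix only
    cores = os.cpu_count() or 1
    print(f"1m load per core: {load_per_core(one_min, cores):.2f}")
    with open("/proc/sys/fs/file-nr") as f:  # Linux only
        print(f"file descriptors allocated: {fd_usage_percent(f.read()):.4f}%")
```

Comparing the 1m, 5m, and 15m values after normalization shows whether load is rising or settling relative to the node's capacity.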

Network

The Network panel provides network traffic and connection metrics for the selected node.

Key metrics include:

Panel	Description
Network Usage (bytes/s)	Network bandwidth (received and transmitted) in bytes per second
Network Errors	Network error counts (received and transmitted) over time
Network Usage (packets/s)	Packet rate (received and transmitted) over time
Network Total Drops	Total dropped packets over time
TCP Currently Established	Number of currently established TCP connections
NF Conntrack	Netfilter connection tracking table usage

Figure: Node Observability Network

Kubernetes Storage

The Kubernetes Storage panel provides Persistent Volume metrics for the selected node.

Key metrics include:

Panel	Description
Persistent Volumes - Usage in %	Percentage utilization of persistent volumes
Persistent Volumes - Usage in GB	Actual storage usage of persistent volumes in GB
Persistent Volumes - Inodes	Inode usage for persistent volumes

Figure: Node Observability Kubernetes Storage

Node Storage

The Node Storage panel provides disk-level metrics for the selected node's local storage.

Key metrics include:

Panel	Description
FS Usage in %	Filesystem usage percentage per mount point over time
FS Inode Usage in %	Inode usage percentage per mount point over time
Reads by Disk (Bytes)	Disk read throughput in bytes per disk device
Writes by Disk (Bytes)	Disk write throughput in bytes per disk device
Completed IOPS by Disk	Completed I/O operations per second per disk
Completed Writes by Disk	Completed write operations per second per disk
I/O - IOMIX Status	Overall I/O mix status chart

Figure: Node Observability Node Storage
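The filesystem percentages can be cross-checked on a node with Python's standard `os.statvfs` call. This is a minimal sketch that assumes usage is computed as 100 minus the share still available, a common convention in node dashboards; the formula actually used by these panels is not stated in this guide, and the helper names are hypothetical.

```python
import os
from types import SimpleNamespace


def fs_usage_percent(st) -> float:
    """Space usage as seen by unprivileged users: 100 - available / size * 100."""
    size = st.f_blocks * st.f_frsize
    avail = st.f_bavail * st.f_frsize
    return 100.0 - 100.0 * avail / size


def inode_usage_percent(st) -> float:
    """Share of inodes in use; 0.0 when the filesystem reports no inode total."""
    if st.f_files == 0:
        return 0.0
    return 100.0 - 100.0 * st.f_ffree / st.f_files


if __name__ == "__main__":
    # Synthetic values, shaped like an os.statvfs result.
    fake = SimpleNamespace(f_blocks=1000, f_frsize=4096, f_bavail=250,
                           f_files=100, f_ffree=40)
    print(fs_usage_percent(fake), inode_usage_percent(fake))

    real = os.statvfs("/")  # Unix only
    print(f"/ space: {fs_usage_percent(real):.1f}%  inodes: {inode_usage_percent(real):.1f}%")
```

Checking both numbers is useful because a filesystem can run out of inodes while plenty of space remains, for instance when it holds many small files.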

Next Steps

Deploy Hugging Face Model — Deploy open-source models from Hugging Face Hub.
Deploy NIM Model — Deploy GPU-optimized NVIDIA NIM inference containers.
Deploy Azure ML Model — Deploy models from your Azure ML model registry.
Application Deployment — Define and deploy custom applications on your Kubernetes cluster.
NGINX Web Server — Deploy NGINX from the application catalog.
