Version: 5.4.0

Cluster Observability

Overview

Cluster Observability provides real-time monitoring and visualization of your Kubernetes cluster's health, resource utilization, and workload performance. Interactive dashboards expose node-, pod-, and microservice-level metrics.

Prerequisites

  • Tenant Admin access — Log in as a Tenant Admin.
  • Cluster in Running state — A Kubernetes cluster must be created and in a Running state.

Cluster Level Observability

Accessing Cluster Level Observability

  1. Navigate to Compute → Kubernetes and click on your cluster name to open the cluster detail view.
  2. The cluster detail page opens on the Overview tab by default.
  3. The Overview tab contains three sub-sections: Health, Utilization, and Pods.

You can adjust the Time Range (e.g., Last 1 hour) and Refresh interval (e.g., 30s) using the controls in the top-right corner.

Health

The Health sub-section provides a comprehensive view of cluster health through three expandable panels:

  • Node Resource Overview — Aggregated metrics for all nodes in the cluster.
  • Pod Resource Overview — Resource consumption details for individual pods.
  • Microservices (Container Name) Resource Overview — Per-container metrics for microservices running in the cluster.

Node Resource Overview

The Node Resource Overview panel displays aggregated metrics across all nodes in the cluster, helping tenant administrators monitor infrastructure-level health at a glance.

Key metrics include:

| Panel | Description |
| --- | --- |
| Node Memory Ratio | Memory utilization, usage, and limits as a percentage of total node memory |
| Node CPU Ratio | CPU utilization, request, and limit ratios across all nodes |
| Nodes with Pod | Summary counts: total number of pods, nodes, and upper pod limits |
| Namespace Resource Statistics | Breakdown of resource counts (services, pods) per namespace |
| Network Overview | Network receive and send throughput (Mib/s) over time |
| Workload / Total Pod / Total Nodes | Aggregate counts of workloads, pods, and nodes in the cluster |
| Memory Usage [All] | Time-series chart of total memory usage across all nodes |
| CPU Used Cores [All] | Time-series chart of total CPU core usage across all nodes |
| Pod Number and Nodes [All] | Pod count vs. upper pod limit over time |
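
The ratio panels above can be read as usage, requests, and limits each divided by total node capacity. The following is a minimal sketch of that arithmetic, not the dashboard's actual query; the `NodeMemory` type, its field names, and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeMemory:
    """Illustrative per-node memory figures in bytes (hypothetical names)."""
    total: int      # total memory on the node
    used: int       # memory currently in use
    requested: int  # sum of pod memory requests scheduled on the node
    limited: int    # sum of pod memory limits scheduled on the node


def memory_ratios(nodes):
    """Return cluster-wide utilization, request, and limit ratios as percentages."""
    total = sum(n.total for n in nodes)
    if total == 0:
        raise ValueError("total node memory must be greater than zero")

    def pct(value):
        return round(100.0 * value / total, 2)

    return {
        "utilization": pct(sum(n.used for n in nodes)),
        "requests": pct(sum(n.requested for n in nodes)),
        "limits": pct(sum(n.limited for n in nodes)),
    }


GIB = 1024 ** 3
nodes = [
    NodeMemory(total=16 * GIB, used=4 * GIB, requested=8 * GIB, limited=12 * GIB),
    NodeMemory(total=16 * GIB, used=12 * GIB, requested=8 * GIB, limited=20 * GIB),
]
print(memory_ratios(nodes))
```

In this example the limit ratio reaches 100% while utilization is only 50%, which is the kind of gap between committed and actual usage that these panels help reveal.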

Additional node-level panels:

| Panel | Description |
| --- | --- |
| Nodes CPU Breakdown | Per-node CPU usage breakdown over time |
| Node Memory Breakdown | Per-node memory usage breakdown over time |
| Node Network Overview | Inflow and outflow network traffic per node |
| Node Information Detail | Table with per-node details: Pod Limit, CPU Usage%, Memory Usage%, Disk Usage%, CPU Total Cores, Memory Total, Disk Total, and more |

Storage and namespace panels:

| Panel | Description |
| --- | --- |
| PVC Storage Usage | Persistent Volume Claim usage, total, usage rate, and mount path |
| Namespaces CPU Usage | CPU consumption per namespace (kernel-level, >0.005 cores) |
| Namespaces WSS Memory Usage | Working Set Size (WSS) memory consumption per namespace (>100MiB) |
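
The thresholds in parentheses (>0.005 cores, >100MiB) suggest that only namespaces above those values are charted. The sketch below illustrates that reading; it is an assumption about the panels' behavior, and the function name and sample data are hypothetical.

```python
# Thresholds quoted in the panel descriptions above.
CPU_THRESHOLD_CORES = 0.005
WSS_THRESHOLD_BYTES = 100 * 1024 ** 2  # 100 MiB


def visible_namespaces(usage, threshold):
    """Return namespaces whose value is strictly above the threshold, largest first."""
    kept = {namespace: value for namespace, value in usage.items() if value > threshold}
    return sorted(kept, key=kept.get, reverse=True)


# Hypothetical CPU usage per namespace, in cores.
cpu_by_namespace = {"kube-system": 0.42, "default": 0.003, "monitoring": 0.11}
print(visible_namespaces(cpu_by_namespace, CPU_THRESHOLD_CORES))
```

Under this reading, a namespace missing from the chart may simply be idle rather than absent from the cluster.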

Pod Resource Overview

The Pod Resource Overview panel provides granular visibility into individual pod performance and resource consumption.

Key metrics include:

| Panel | Description |
| --- | --- |
| Pod Resource Detail | Table listing each pod with Namespace, Container Name, Pod Name, CPU%, WSS%, RSS%, Memory Limit, WSS, RSS, Disk Limit, Disk Usage, Survival time, Requests, and Inflow |
| Pod Containers CPU Utilization | CPU utilization percentage (Maximum 100%) per pod container over time |
| Pod Container Memory Usage | Memory usage per pod container over time |
| Pod Network Bandwidth per Second | Network inflow and outflow bandwidth per pod |
| Pod Containers CPU Core Usage | Actual CPU core consumption per pod container |
| Pod Containers WSS Memory Usage | Working Set Size memory per pod container |
| Pod Containers RSS Memory Usage | Resident Set Size memory per pod container |
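
To make the WSS% and RSS% columns concrete, the sketch below expresses both values as a percentage of the container's memory limit. It assumes the common cAdvisor-style definition in which the working set is total usage minus inactive file cache; whether this dashboard's data source uses the same definition is not stated here, so treat the helper names and sample values as illustrative only.

```python
MIB = 1024 ** 2


def working_set_bytes(usage_bytes, inactive_file_bytes):
    """Working set under the assumed definition: usage minus inactive file cache."""
    return max(usage_bytes - inactive_file_bytes, 0)


def percent_of_limit(value_bytes, limit_bytes):
    """Return value as a percentage of the limit, or None when no limit is set."""
    if limit_bytes <= 0:
        return None
    return round(100.0 * value_bytes / limit_bytes, 1)


# Hypothetical container figures.
limit = 1024 * MIB
wss = working_set_bytes(usage_bytes=600 * MIB, inactive_file_bytes=100 * MIB)
rss = 300 * MIB

print(percent_of_limit(wss, limit))  # WSS%
print(percent_of_limit(rss, limit))  # RSS%
```

WSS is usually the more relevant figure when judging how close a container is to its memory limit, because it includes active page cache that RSS leaves out.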

Microservices (Container Name) Resource Overview

The Microservices Resource Overview panel provides container-level metrics, allowing you to monitor individual microservices running within pods.

Key metrics include:

| Panel | Description |
| --- | --- |
| Resource Statistics | Table listing each container with Namespace, Container Name, Pod count, Average CPU Usage, WSS Memory Usage, RSS Memory Use, Total CPU Restrictions, CPU Core Usage, WSS Memory, Total RSS Memory, Disk Linage, Disk Used, CPU Demand, and Total Memory Requirements |
| Average CPU Usage | CPU usage percentage (Maximum 100%) per microservice over time |
| Average Memory Utilization | Memory utilization percentage per microservice over time |
| Network Bandwidth per Second | Network inflow and outflow per microservice |
| Overall CPU Cores Used | Actual CPU core consumption per microservice with limits shown |
| Overall Memory Usage | Memory consumption (WSS and RSS) per microservice |
| Pod Number | Number of pods running for each microservice over time |

Utilization

The Utilization sub-section provides cluster-wide resource utilization metrics through four expandable panels: Overview, Resources, Kubernetes, and Network.

Overview

The Overview panel displays high-level cluster resource consumption at a glance.

Key metrics include:

| Panel | Description |
| --- | --- |
| Global CPU Usage | Real-time CPU usage percentage across the cluster |
| Global RAM Usage | Real-time RAM usage percentage across the cluster |
| Nodes | Total number of nodes in the cluster |
| Namespaces | Total number of namespaces in the cluster |
| Kubernetes Resource Count | Time-series chart showing counts of Running Containers, Running Pods, ConfigMaps, Services, Endpoints, Secrets, Ingresses, and Nodes |
| CPU Usage | Breakdown of CPU: Real, Requests, and Limits values |
| RAM Usage | Breakdown of RAM: Real, Requests, and Limits values |
| Running Pods | Total number of currently running pods |

Resources

The Resources panel provides detailed time-series charts for CPU, memory, and related resource utilization across the cluster.

Key metrics include:

| Panel | Description |
| --- | --- |
| Cluster CPU Utilization | Overall CPU utilization percentage over time |
| Cluster Memory Utilization | Overall memory utilization percentage over time |
| CPU Utilization by Namespace | CPU usage breakdown per namespace over time |
| Memory Utilization by Namespace | Memory usage breakdown per namespace over time |
| CPU Utilization by Instance | CPU usage breakdown per node instance over time |
| Memory Utilization by Instance | Memory usage breakdown per node instance over time |
| CPU Throttled Seconds by Namespace | CPU throttling duration per namespace |
| CPU Data Transmit by Instance | CPU data transmission rate per node instance |

Kubernetes

The Kubernetes panel provides insights into Kubernetes-specific resource states and pod health.

Key metrics include:

| Panel | Description |
| --- | --- |
| Kubernetes Pods QoS Classes | Breakdown of pods by QoS class (BestEffort, Burstable, Guaranteed) with min, max, and mean counts over time |
| Kubernetes Pods Status Reason | Pod status distribution: Created, NodeAffinity, NodeLost, Shutdown, UnexpectedAdmissionError |
| OOM Events by Namespace | Out-of-Memory kill events per namespace over time |
| Container Restarts by Namespace | Container restart counts per namespace over time |
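
The three QoS classes in the first panel follow from each pod's CPU and memory requests and limits. The sketch below is a simplified illustration of the standard Kubernetes rules; it compares quantities as plain strings (so `"1"` and `"1000m"` would be treated as different), and the function name and input shape are hypothetical.

```python
def qos_class(containers):
    """Classify a pod from its containers' CPU and memory requests and limits.

    containers: list of dicts with optional "requests" and "limits" mappings.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            request = requests.get(resource)
            limit = limits.get(resource)
            if request is not None or limit is not None:
                any_set = True
            # Kubernetes defaults a missing request to the limit when a limit is set.
            if request is None:
                request = limit
            # Guaranteed requires a limit on every resource, equal to its request.
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))
print(qos_class([{"requests": {"cpu": "250m"}}]))
print(qos_class([{
    "requests": {"cpu": "500m", "memory": "256Mi"},
    "limits": {"cpu": "500m", "memory": "256Mi"},
}]))
```

A high BestEffort count is worth noting alongside the OOM Events panel, since BestEffort pods are the first candidates for eviction under memory pressure.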

Network

The Network panel provides detailed network traffic and bandwidth metrics across the cluster.

Key metrics include:

| Panel | Description |
| --- | --- |
| Global Network Utilization by Device | Overall network bandwidth (send/receive) per network device |
| Network Saturation - Packets Dropped | Dropped packet counts indicating network saturation |
| Network Received by Namespace | Inbound network traffic per namespace |
| Total Network Received (with all virtual devices) by Instance | Total inbound traffic per node instance including virtual devices |
| Network Received (without loopback) by Instance | Inbound traffic per node instance excluding loopback |
| Network Received (loopback only) by Instance | Loopback-only inbound traffic per node instance |

Pods

The Pods sub-section provides pod-level observability scoped to a specific namespace and pod. Use the Namespace and Pod dropdowns at the top to filter the view. It contains four expandable panels: Information, Resources, Kubernetes, and Network.

Information

The Information panel provides metadata and classification details for the selected pods.

Key metrics include:

| Panel | Description |
| --- | --- |
| Created by | The controller or resource that created the pods (e.g., DaemonSet, ReplicaSet) |
| Running on | The node(s) where the pods are currently running |
| Pod IP | The IP addresses assigned to each pod |
| Priority Class | The priority class assigned to each pod |
| QOS Class | The Quality of Service class for each pod (BestEffort, Burstable, Guaranteed) |
| Last Terminated Reason | The reason for the last pod termination, if applicable |
| Last Terminated Exit Code | The exit code from the last terminated container |

Resources

The Resources panel provides detailed CPU, memory, and resource consumption metrics for the selected pods.

Key metrics include:

| Panel | Description |
| --- | --- |
| Total pod CPU request/usage | Gauge showing actual CPU usage vs. requested CPU |
| Total pod CPU limits/usage | Gauge showing actual CPU usage vs. CPU limits |
| Total pod RAM request/usage | Gauge showing actual RAM usage vs. requested RAM |
| Total pod RAM limits/usage | Gauge showing actual RAM usage vs. RAM limits |
| Resources by Container | Table with per-container CPU Requests, Memory Count, Memory Requests, and Memory Limits |
| CPU Usage / Requests & Limits by Container | Stacked time-series chart of CPU usage, requests, and limits per container |
| Memory Usage / Requests & Limits by Container | Stacked time-series chart of memory usage, requests, and limits per container |
| CPU Usage by Container | CPU core usage per container over time |
| Memory Usage by Container | Memory consumption per container over time |
| CPU Throttled Seconds by Container | CPU throttling duration per container over time |

Kubernetes

The Kubernetes panel provides Kubernetes-specific health and scheduling metrics for the selected pods.

Key metrics include:

| Panel | Description |
| --- | --- |
| OOM Events by Container | Out-of-Memory kill events per container over time |
| Container Restarts by Container | Container restart counts per container over time |
| Pods with Container Issues | Pods experiencing container-level issues |
| Unscheduled Pod Count | Number of pods that could not be scheduled |
| Unscheduled Pods (detail) | Details of pods that failed scheduling |

Network

The Network panel provides network traffic metrics for the selected pods.

Key metrics include:

| Panel | Description |
| --- | --- |
| Network - Bandwidth | Network bandwidth (received and transmitted) over time |
| Network - Packets Rate | Packet rate (received and transmitted) over time |
| Network - Packets Dropped | Dropped packets (received and transmitted) over time |
| Network - Errors | Network errors (received and transmitted) over time |
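
Bandwidth and packet-rate charts like these are typically derived from cumulative counters rather than measured directly. The sketch below shows one way a per-second rate can be computed from counter samples, including handling of a counter reset (for example after a container restart). It assumes the underlying metrics are monotonically increasing counters, which is not confirmed here; the function name and sample data are hypothetical.

```python
def per_second_rate(samples):
    """Average per-second increase across counter samples.

    samples: list of (timestamp_seconds, counter_value) tuples, oldest first.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # Counter reset: count only what accumulated after the reset.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical received-bytes counter sampled every 30 seconds.
steady = [(0, 1000), (30, 4000), (60, 7000)]
with_reset = [(0, 1000), (30, 4000), (60, 500)]

print(per_second_rate(steady))
print(per_second_rate(with_reset))
```

Because the rate is averaged over the selected window, short bursts can be flattened when a long Time Range is chosen.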

Node Level Observability

Node Level Observability provides detailed monitoring for individual nodes in your cluster. Access it from the Nodes tab in the cluster detail view.

Accessing Node Level Observability

  1. From the cluster detail view, click the Nodes tab.
  2. Click on the node name to open the node detail page.
  3. The node detail page displays node metadata (IP, CPU, Memory, GPU Allocated) and an Observability section with six expandable panels: Overview, Resources, System, Network, Kubernetes Storage, and Node Storage.

Overview

The Overview panel displays high-level resource consumption and pod information for the selected node.

Key metrics include:

| Panel | Description |
| --- | --- |
| CPU Usage | Gauge showing current CPU usage percentage on the node |
| RAM Usage | Gauge showing current RAM usage percentage on the node |
| Pods on Node | Total number of pods running on the node |
| CPU Used / CPU Total | Actual CPU cores used vs. total available |
| RAM Used / RAM Total | Actual RAM used vs. total available |
| Uptime | How long the node has been running |
| List of Pods on Node | Table listing all pods on the node with pod name, namespace, and created_by kind |

Resources

The Resources panel provides detailed CPU and memory utilization charts for the selected node.

Key metrics include:

| Panel | Description |
| --- | --- |
| CPU Usage | Time-series chart of CPU usage breakdown (iowait, nice, softirq, steal, system, user) |
| Memory Usage | Time-series chart of memory usage breakdown (RAM Used, RAM Cache, RAM Buffer, SWAP Used, SWAP Cache, SWAP Total) |
| CPU Usage by Pod | CPU core consumption per pod over time |
| Memory Usage by Pod | Memory consumption per pod over time |
| Number of CPU Core Throttled | CPU core throttling events over time |

System

The System panel provides operating system-level metrics for the selected node.

Key metrics include:

| Panel | Description |
| --- | --- |
| System Load | System load averages (1m, 5m, 15m) over time |
| Context Switches & Interrupts | Context switch and interrupt counts over time |
| File Descriptors | Maximum vs. allocated file descriptors over time |
| Time Sync | Estimated time synchronization error in seconds |
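
Load averages are easier to interpret relative to the node's core count, which is shown in the node metadata and the Overview panel. The sketch below normalizes the three averages per core; the function name and sample values are hypothetical, and the rule of thumb in the comment is a general convention rather than a threshold defined by this product.

```python
def load_per_core(load_averages, cpu_cores):
    """Divide the 1m, 5m, and 15m load averages by the number of CPU cores."""
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be a positive integer")
    return tuple(round(load / cpu_cores, 2) for load in load_averages)


# Hypothetical node: 4 cores with load averages of 3.2 (1m), 2.0 (5m), 1.0 (15m).
# Values near or above 1.0 per core generally mean runnable work is queuing.
print(load_per_core((3.2, 2.0, 1.0), cpu_cores=4))
```

Here the 1-minute figure is well above the 15-minute figure, which points to a recent rise in load rather than sustained pressure.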

Network

The Network panel provides network traffic and connection metrics for the selected node.

Key metrics include:

| Panel | Description |
| --- | --- |
| Network Usage (bytes/s) | Network bandwidth (received and transmitted) in bytes per second |
| Network Errors | Network error counts (received and transmitted) over time |
| Network Usage (packets/s) | Packet rate (received and transmitted) over time |
| Network Total Drops | Total dropped packets over time |
| TCP Currently Established | Number of currently established TCP connections |
| NF Conntrack | Netfilter connection tracking table usage |

Kubernetes Storage

The Kubernetes Storage panel provides Persistent Volume metrics for the selected node.

Key metrics include:

| Panel | Description |
| --- | --- |
| Persistent Volumes - Usage in % | Percentage utilization of persistent volumes |
| Persistent Volumes - Usage in GB | Actual storage usage of persistent volumes in GB |
| Persistent Volumes - Inodes | Inode usage for persistent volumes |

Node Storage

The Node Storage panel provides disk-level metrics for the selected node's local storage.

Key metrics include:

| Panel | Description |
| --- | --- |
| FS Usage in % | Filesystem usage percentage per mount point over time |
| FS Inode Usage in % | Inode usage percentage per mount point over time |
| Reads by Disk (Bytes) | Disk read throughput in bytes per disk device |
| Writes by Disk (Bytes) | Disk write throughput in bytes per disk device |
| Completed IOPS by Disk | Completed I/O operations per second per disk |
| Completed Writes by Disk | Completed write operations per second per disk |
| I/O - IOMIX Status | Overall I/O mix status chart |
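
For a quick cross-check of the FS Usage in % panel from a shell on the node, a filesystem's usage percentage can be computed locally. The sketch below uses Python's standard `shutil.disk_usage`; the function name is hypothetical, and the result may differ slightly from the dashboard, which could account for reserved blocks differently.

```python
import shutil


def fs_usage_percent(path="/"):
    """Return used space as a percentage of total space for the filesystem at path."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, and free bytes
    if usage.total == 0:
        return 0.0
    return round(100.0 * usage.used / usage.total, 1)


# Check the filesystem that holds the current directory.
print(fs_usage_percent("."))
```

Pass a mount point such as a persistent volume's mount path to check a specific filesystem.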
