Overview
The CHAI Cluster Manager Tool is a unified, modular automation framework designed to simplify and accelerate the deployment, configuration, and management of HPC and AI clusters. It integrates provisioning, orchestration, workload scheduling, and monitoring using industry-standard tools like xCAT, Ansible, SLURM, OpenLDAP and Kubernetes etc.
The framework currently supports:
x86_64 architecture and multi OS-based environments
Multiple software stack versions with version control
Bare Metal and Virtual Machine
Both bare-metal and containerized software stack of HPC-AI infrastructure
Multi-tenant Kubernetes control planes for AI user(Team) isolation
HPC Cluster Management and Service Nodes
To efficiently manage the cluster, dedicated service nodes handle installation, deployment, and administrative tasks. These include:
Head Node - A head node which is CHAI Manger tool. To configure all the service nodes. (It can be temporary or permanent.)
Master Nodes – Manage compute nodes using xCAT for provisioning and SLURM for workload scheduling and ldap for central user authentication.
AI Nodes – Kubernetes based orchestration for AI clusters, supporting containerized AI workloads and workload management.
Management Nodes – Handle monitoring, logging, and ticketing systems.
Login Nodes – Provide user access to the cluster.
BMC (Baseboard Management Controller) Nodes – Oversee hardware health and remote management.
Firewall Node – Ensures network security