Operational Data Analytics: HPC Efficiency Improvements with Interoperable Monitoring and Analysis
Date: Tuesday, May 14, 2024, 03:30 PM - 04:30 PM
Room: Hall E - 2nd Floor
Type: Birds of a Feather
Description: Operational Data Analytics (ODA) refers to the collection, monitoring, analysis, and optimization of operational data from HPC systems and their data centers. The data relevant to operating HPC environments efficiently ranges from system-level logs and metrics to facility-wide orchestration of cooling and airflow. In HPC, the nature of operational data (time-series-based, continuous 24x7, heterogeneous and distributed, with high volume and dimensionality) makes it challenging to collect and to gain insights from. However, ODA has been demonstrated to improve energy efficiency across facilities, detect and diagnose performance degradation in system components and applications, and provide valuable feedback toward increased sustainability of current and future HPC data centers. Many HPC sites have deployed sophisticated ODA frameworks capable of ingesting operational data, providing real-time monitoring and alerting, and enabling advanced analytics (using machine learning, decision-making, and artificial intelligence for operations). The goal of this BoF is to foster community-driven discussion of state-of-the-art practices in these ODA frameworks across sites, share challenges faced (and overcome), and identify areas of commonality and shared goals. This BoF is organized by members of the Energy Efficiency HPC Working Group (EEHPCWG) ODA team, which is a global effort of over 100 sites spanning industry, research, and academia. We invite community members to contribute and ask questions, whether they are just beginning with ODA or already run fully developed ODA ecosystems. Together, as a community, we will also work to establish a path toward standardizing monitoring data so that tools can interoperate.
Links: Official link from ISC 2024