Toward Resilience in HPC: A Prototype to Analyze & Predict System Behavior

Date: Monday, June 20, 2016, 01:00 PM - 03:00 PM

Room: Panorama 1, Forum

Type: PhD Forum

Description: Failure rates in high performance computers are rising rapidly as systems grow in size and complexity, so failures have become the norm rather than the exception. The efficiency of recovery mechanisms, e.g., checkpoint-restart, depends on the mean time between failures (MTBF). In the near future, the MTBF of HPC systems is expected to become so short that current failure recovery mechanisms will no longer be able to recover the systems from failures. Early failure detection is a new class of failure recovery methods that can be beneficial for HPC systems with short MTBF. Detecting failures at an early stage can reduce their negative effects by preventing them from propagating to other parts of the system. The initial goal of this study is to contribute to the foundation of early failure detection techniques; the final goal is to move toward resilient HPC by introducing a prototype to analyze and predict system behavior. Taurus is an HPC system at TU Dresden, designed to handle highly parallel applications. We use Taurus as our test case. As the first step, we required system-wide information about Taurus. Therefore, we built a monitoring facility that monitors, collects, and indexes all useful information. With this information in hand, we tried to detect patterns of anomalies in node behavior, and we found several different anomaly patterns. The second step was to recognize similarities among these anomaly patterns. In this step we unexpectedly observed a geographical correlation between node outages on Taurus. This hint led us to investigate failure correlations further. After considering many different scenarios of node failures, we were able to categorize all failures as either correlated or uncorrelated. The correlated category consists of three subcategories: (1) temporal, (2) spatial, and (3) logical correlations.
Effective failure recovery and prediction mechanisms can be devised only on the basis of a deeper understanding of failure patterns and their correlations. In general, based on our observations, many failures on Taurus are temporally correlated, spatially correlated, or both, depending on the time interval considered within our examined time period. The differences are mainly caused by the varying usage of the system over time; this naturally leads to the need to detect logical correlations. Logical correlation is not always easy to infer. So far in this study, we have learned that logical correlation can be revealed more easily by further examining the spatially and/or temporally correlated failures. However, this study is ongoing. Detecting generic failures early, diagnosing them, and correlating them requires further investigation and analysis toward a generic failure diagnosis and correlation methodology. Upon completion of the study, such a methodology could be used to detect and prevent failures more quickly and more efficiently than current techniques. Following this study, we will examine correlation patterns that could help detect failures at an early stage or even predict them. Automation of the analysis approach is also planned.
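The distinction between temporally and spatially correlated failures can be illustrated with a minimal sketch. It is not the analysis used in the study: the `Failure` record, the rack labels, and the 600-second window are all hypothetical choices made for illustration, and logical correlation is left out because it depends on system usage information that is not described here.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Failure:
    node: str    # hypothetical node name
    rack: str    # hypothetical label for the node's physical location
    time: float  # failure time in seconds since an arbitrary reference point


def classify_pair(a, b, window=600.0):
    """Return the correlation labels shared by two failures.

    Assumption: two failures are "temporal" if they occur within `window`
    seconds of each other, and "spatial" if they hit different nodes in
    the same rack.
    """
    labels = set()
    if abs(a.time - b.time) <= window:
        labels.add("temporal")
    if a.rack == b.rack and a.node != b.node:
        labels.add("spatial")
    return labels


def classify_failures(failures, window=600.0):
    """Map each failure to the labels it shares with any other failure.

    A failure whose label set stays empty is treated as uncorrelated.
    """
    result = {f: set() for f in failures}
    for a, b in combinations(failures, 2):
        labels = classify_pair(a, b, window)
        result[a] |= labels
        result[b] |= labels
    return result


# Made-up events, not data from Taurus.
events = [
    Failure("n1", "rackA", 0.0),
    Failure("n2", "rackA", 120.0),
    Failure("n3", "rackB", 300.0),
    Failure("n4", "rackC", 90000.0),
]
labels = classify_failures(events)
for event in events:
    print(event.node, sorted(labels[event]) or ["uncorrelated"])
```

The pairwise comparison is quadratic in the number of failures, which is acceptable for a small illustration; a real failure log would likely call for sorting by time and using a sliding window instead.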

Links: Official link from ISC 2016
