JUNE 18–22, 2017

Session Details

Name: Fault Tolerance for Next Generation High Performance Computing
Time: Wednesday, June 21, 2017
01:45 pm - 03:15 pm
Room:   Panorama 1
Messe Frankfurt
Breaks:   03:15 pm - 03:45 pm Coffee Break
Chair:   Franck Cappello, ANL
Abstract:   Most scientific applications running at large scale include checkpoint/restart code to tolerate HPC system failures and to extend execution beyond the limit of the time allocation. The evolution of HPC system characteristics (more faults, new storage hierarchies including non-volatile memory and burst buffers, limited file system bandwidth) will require most applications to adapt their fault tolerance strategy. This session on fault tolerance for HPC presents talks offering key insights into four major aspects of adapting applications for future HPC systems: failure characterization, application resiliency and failure injection, new checkpoint/restart techniques, and resilient programming with MPI.
Presentations: Characterizing Faults, Errors & Failures in Extreme-Scale Computing Systems
01:45 pm - 02:05 pm
  Christian Engelmann, ORNL
Evaluating Parallel Application Resiliency with the Software Fault Injector, PFSEFI
02:05 pm - 02:25 pm
  Nathan DeBardeleben, Los Alamos National Laboratory
Performance Portable Checkpoint/Restart with VeloC & UnifyCR
02:25 pm - 02:45 pm
  Kathryn Mohror, LLNL
Support for Resilience in Parallel Applications
02:45 pm - 03:05 pm
  George Bosilca, University of Tennessee
Questions & Answers
03:05 pm - 03:15 pm