Session Details |
|
|
Name: |
|
Fault Tolerance for Next Generation High Performance Computing |
|
Time: |
|
Wednesday, June 21, 2017 01:45 pm - 03:15 pm |
|
|
Room: |
|
Panorama 1 Messe Frankfurt |
|
|
Breaks: | 03:15 pm - 03:45 pm Coffee Break |
|
Chair: |
|
Franck Cappello, ANL |
|
|
Abstract: |
|
Most scientific applications running at large scale include
checkpoint/restart code to tolerate HPC system failures and to extend
execution beyond the limit of their time allocation. The evolution of
HPC system characteristics (more faults, new storage hierarchies
including non-volatile memory and burst buffers, limited file system
bandwidth) will force most applications to adapt their fault tolerance
strategies. This session on fault tolerance for HPC presents talks
offering key insights into four major points for adapting applications
to future HPC systems: failure characterization, application resiliency
and failure injection, new checkpoint/restart techniques, and resilient
programming with MPI. |
|
|
Presentations: |
|
Characterizing Faults, Errors & Failures in Extreme-Scale Computing Systems 01:45 pm - 02:05 pm |
|
|
Christian Engelmann, ORNL |
|
| |
|
|
Evaluating Parallel Application Resiliency with the Software Fault Injector, PFSEFI 02:05 pm - 02:25 pm |
|
|
Nathan DeBardeleben, Los Alamos National Laboratory |
|
| |
|
|
Performance Portable Checkpoint/Restart with VeloC & UnifyCR 02:25 pm - 02:45 pm |
|
|
Kathryn Mohror, LLNL |
|
| |
|
|
Support for Resilience in Parallel Applications 02:45 pm - 03:05 pm |
|
|
George Bosilca, University of Tennessee |
|
| |
|
|
Questions & Answers 03:05 pm - 03:15 pm |
|
|