Agenda Planner ISC 2017 - Presentation Details

JUNE 18–22, 2017
FRANKFURT AM MAIN, GERMANY

Name:		Performance Portable Checkpoint/Restart with VeloC & UnifyCR

Time:		Wednesday, June 21, 2017 02:25 pm - 02:45 pm

Room:		Panorama 1, Messe Frankfurt

Speaker:		Kathryn Mohror, LLNL

Abstract:		High-end supercomputing systems generally achieve higher computing speeds by increasing the number of computing cores in the system. While FLOP goals can be reached with this strategy, a larger number of system components brings a higher failure rate. Today, systems experience failures on the order of hours or days; on future exascale systems, failures could occur more frequently. In this talk, I will give an overview of the work of two Exascale Computing Project (ECP) projects that address problems associated with fault tolerance on large systems. The VeloC project will produce production-quality multilevel checkpointing software that can significantly reduce the overhead of checkpointing by utilizing fast storage devices (e.g., burst buffers). The UnifyCR project will develop a user-level distributed file system for burst buffers, specialized to provide high performance for checkpoint/restart workloads. Together, VeloC and UnifyCR will provide performance portability for applications to achieve low-overhead fault tolerance on emerging systems.