Abstract: |
|
High-end supercomputing systems generally achieve
increased computing speeds by increasing the number of computing cores
in the system. While FLOPS goals can be reached with this strategy, the
larger number of system components also brings a higher failure rate.
Today, systems experience failures at intervals on the order of hours or
days; on future exascale systems, failures could occur even more
frequently. In this talk, I
will give an overview of the work of two ECP
projects that address problems associated with fault tolerance on large
systems. The VeloC project will produce production-quality multilevel
checkpointing software that can significantly reduce checkpointing
overhead by using fast storage devices
(e.g., burst buffers). The UnifyCR project will develop a user-level
distributed file system for burst buffers, specialized to provide high
performance for checkpoint/restart workloads. Together, VeloC and
UnifyCR will give applications a performance-portable way
to achieve low-overhead fault tolerance on emerging systems. |
|