Presentation Details |
|
|
Name: | Characterizing Faults, Errors & Failures in Extreme-Scale Computing Systems |
|
|
Time: | Wednesday, June 21, 2017 01:45 pm - 02:05 pm |
|
|
Room: | Panorama 1 Messe Frankfurt |
|
|
Breaks: | 12:30 pm - 01:45 pm Lunch |
|
Speaker: | Christian Engelmann, ORNL |
|
|
Abstract: |
|
The path to exascale computing poses several research challenges.
Resilience, i.e., providing efficiency and correctness in the presence
of faults, errors and failures, is one of the most important challenges
as systems scale up in component count and component reliability does
not increase accordingly. This talk provides an overview of recent and
ongoing resilience research activities at Oak Ridge National Laboratory,
Argonne National Laboratory and Lawrence Livermore National Laboratory
in developing the missing high-performance computing (HPC) fault model.
This effort identifies, categorizes and models the fault, error and
failure properties of today's HPC systems. It develops a taxonomy,
catalog and models that capture the observed and inferred conditions in
current systems and extrapolates this knowledge to exascale HPC systems. |
|
|
|