FFMK: A Fast & Fault-Tolerant Microkernel-Based System for Exascale Computing
Carsten Weinhold, TU Dresden
Abstract: The FFMK project designs, builds, and evaluates a software system architecture that addresses the challenges expected in an Exascale system. In particular, these include performance losses caused by the much larger impact of runtime variability within applications, hardware, and the operating system (OS), as well as increased vulnerability to failures. The architecture builds upon a small node-local OS kernel that supports a combination of specialized runtimes and a full-blown general-purpose operating system. Global platform management supports dynamically changing partitions. For FFMK, we have instantiated the architecture components with the L4 microkernel, virtualized Linux, and MPI as an application runtime for bulk-synchronous applications. We carefully split OS and runtime components such that performance-critical functionality remains undisturbed by management-related background activities. The latter include randomized gossip algorithms and decentralized decision-making as the basic building blocks of global platform management. XtreemFS serves as a fault-tolerant checkpoint store that uses the local memory of all nodes to achieve scalability. The poster summarizes key achievements from the first phase of the project, as well as the main objectives and results of the ongoing second phase.
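
To illustrate why randomized gossip is attractive as a building block for decentralized platform management, the following is a minimal sketch of a generic push-gossip simulation. It is not FFMK's actual protocol: the function name `gossip_rounds`, its parameters, and the "one random peer per round" rule are assumptions chosen for illustration only.

```python
import random


def gossip_rounds(num_nodes, seed=0, max_rounds=1000):
    """Simulate generic push gossip (illustrative, not FFMK's protocol).

    Node 0 starts with a piece of information. In every round, each
    informed node pushes it to one peer chosen uniformly at random.
    Returns the number of rounds until all nodes are informed, or
    max_rounds if that limit is reached first.
    """
    rng = random.Random(seed)  # private generator keeps runs reproducible
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes and rounds < max_rounds:
        # Pick one random target for every currently informed node.
        targets = {rng.randrange(num_nodes) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, "nodes:", gossip_rounds(n), "rounds")
```

Because the informed set can at most double per round, at least log2(n) rounds are needed. In this simple model the number of rounds typically grows only logarithmically with the node count, and no central coordinator is involved.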

