Recent years have seen a tremendous increase in the volume of data generated in the life sciences. The analysis of these data sets poses difficult computational challenges and is an active field of research. Currently, a popular strategy in data-rich scenarios across many areas of science and industry is to adopt big data technology, a class of highly scalable and fault-tolerant systems for handling huge data sets. However, the characteristics of typical biological data sets and their intended usage differ significantly from most other application areas of big data technologies: while biological data sets are often remarkably large, they tend to be smaller than typical "big" data sets. On the other hand, biological data processing often requires more complex analysis techniques than big data technology can afford, since it is often constrained to algorithms with linear or sub-linear complexity. Consequently, the computational life sciences today tend to rely on classical high performance computing (HPC) paradigms instead. Conceptually, most HPC methods work at a lower level of abstraction than common big data techniques and often demand a much deeper understanding of the intricacies of parallel computing. In my talk, I will demonstrate how hybrid approaches, combining ideas from big data with HPC methodologies, can help flexible and highly performant HPC methods scale towards large data sets. At the same time, much of the domain-specific programming can be performed at a high level of abstraction, similar to classical big data approaches, enabling the user to focus more on the biological problem at hand than on the low-level primitives.