Presentation Details |
|||||
Name: | (RP16) Optimizing Massive Data Access for Large Scale Population Genomics Analysis Using HDF5 | ||||
Time: | Tuesday, June 20, 2017 08:35 am - 09:45 am |
||||
Room: | Substanz 1+2 | ||||
Breaks: | 07:30 am - 10:00 am Welcome Coffee | ||||
Presenter: | Hui Yan, NSCCGZ | ||||
Abstract: | More and more DNA sequencing data are being generated, enabling population-scale modeling for both scientific and clinical purposes. The traditional plain organization and layout of these data volumes doesn’t fit well with large-scale analysis. Genotype imputation needs to analyze the same genome region across all individuals, so a small portion of data is read from each of a large number of files. This kind of data access significantly increases the workload of the parallel file system and causes a performance bottleneck. To tackle this, the HDF5 file format is employed as a container for the raw data files. One HDF5 file is used per human chromosome, and two layouts inside the HDF5 file are proposed and tested. The first is one-dimensional, with data distributed by individual/sample. The second is two-dimensional, with data distributed along both fixed-size genome regions and individuals/samples. Our experiments show that both layouts improve performance significantly, with a 3.4x speedup observed. The two-dimensional layout performs even better because it makes it feasible to locate a given region. Our work thus relieves metadata congestion and improves data access performance. | ||||
Authors: | Junrong Yang, South China University of Technology; Peihao Liu, National University of Defense Technology; Guixin Guo, National Supercomputer Center in Guangzhou; Hanquan Liang, National Supercomputer Center in Guangzhou; BingQiang Wang, National Supercomputer Center in Guangzhou; Shoubin Dong, South China University of Technology; Yutong Lu, National Supercomputer Center in Guangzhou |
||||
Download | RP16_Yan.pdf (864 KB) |
||||
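The two layouts described in the abstract can be pictured with a small, purely illustrative sketch. It does not use HDF5 and is not the authors' implementation. The dataset paths, the `REGION_SIZE` value, and the function names are all hypothetical. The sketch only shows how a fixed-size region index might let a query address the relevant pieces of data by name, whereas a per-sample layout addresses one whole dataset per sample.

```python
from typing import List

# Hypothetical fixed region width in base pairs (the abstract gives no value).
REGION_SIZE = 1_000_000


def region_bins(start: int, end: int, region_size: int = REGION_SIZE) -> List[int]:
    """Indices of the fixed-size bins overlapping the half-open interval [start, end)."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    first = start // region_size
    last = (end - 1) // region_size
    return list(range(first, last + 1))


def keys_1d(samples: List[str]) -> List[str]:
    """One-dimensional layout (assumed): one dataset per sample, covering the
    whole chromosome. A region query addresses every sample's full dataset."""
    return [f"/{sample}" for sample in samples]


def keys_2d(samples: List[str], start: int, end: int,
            region_size: int = REGION_SIZE) -> List[str]:
    """Two-dimensional layout (assumed): one dataset per (region bin, sample).
    A region query addresses only the bins that overlap the query."""
    return [
        f"/region_{b}/{sample}"
        for b in region_bins(start, end, region_size)
        for sample in samples
    ]


if __name__ == "__main__":
    samples = ["s1", "s2"]
    print(keys_1d(samples))
    print(keys_2d(samples, 1_500_000, 2_200_000))
```

Under these assumptions, a query for positions 1,500,000 to 2,200,000 overlaps bins 1 and 2 only, so the two-dimensional layout names four small datasets, while the one-dimensional layout names two datasets that each span the entire chromosome.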