Abstract: |
GPUs have been immensely popular in HPC over the last decade thanks to their compelling price-performance and power-performance ratios. However, much of this popularity was earned by "porting" existing CPU-based applications to GPUs. Today, the computational prowess of GPUs, coupled with the timely abundance of available data, has given rise to a new application area: deep learning. Deep learning has significantly advanced the state of the art in image, speech, and text recognition systems. At its core, deep learning comprises operations such as convolutions and matrix-matrix multiplications, and is therefore limited by the compute capability of the underlying hardware. We believe that further advances in deep learning depend on both algorithmic innovations and architectural innovations that yield faster processors. A key difference in deep learning is its insensitivity to numerical precision, in contrast to what we have been used to in traditional HPC. We leverage this property to devise a compelling implementation of convolutions for GPUs that effectively doubles their performance without incurring any extra memory overhead. We also present key results gathered by tuning knobs such as memory bandwidth and compute throughput to identify the GPU architectural features that most impact deep learning performance. The talk concludes by discussing aspects that are critical to succeeding in deep learning at scale.