Recent Deep Neural Network (DNN) models require tremendous GPU memory and computation power because deeper and wider layers generally provide better performance. Moreover, a larger batch size can lead to a shorter training time. The problem is that current state-of-the-art DNN models are so large that even fine-tuning them is hard on a small system, especially one with a single GPU. For example, GPT-3, one of the state-of-the-art language models, has 175 billion parameters. At two bytes per parameter in FP16, those 175 billion parameters require approximately 326 GB of memory just to load the model. Such an amount of memory cannot be handled even by a single NVIDIA H100 80GB GPU, the latest data center GPU from NVIDIA.
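A quick back-of-the-envelope check of that figure is shown below; it is purely illustrative, and the constants (175 billion parameters, 2 bytes per FP16 value, 80 GB on an H100) are the publicly known numbers rather than anything taken from DeepUM itself.

```cpp
#include <cstdio>

int main() {
    const double params     = 175e9;                 // GPT-3 parameter count
    const double bytes_fp16 = params * 2.0;          // 2 bytes per FP16 parameter
    const double gib        = bytes_fp16 / (1024.0 * 1024.0 * 1024.0);
    // Prints roughly 326 GiB, compared with the 80 GB on a single H100.
    printf("Model weights: ~%.0f GiB of memory; a single H100 has 80 GB\n", gib);
    return 0;
}
```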

GPU memory swapping is a technique to overcome such memory capacity limits for DNNs. Previous approaches to memory swapping fall into two categories. One uses CUDA Unified Memory with page prefetching [1, 2]. The other uses pure (i.e., non-UM) GPU memory and explicitly swaps memory objects in and out [3-7]. Unified Memory (UM) provides a single address space shared between the CPUs and the GPUs. It exploits the GPU page fault mechanism to migrate pages between processors on demand. The name varies across programming models (e.g., Shared Virtual Memory in OpenCL and Intel oneAPI), but the semantics are very similar.
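To make the UM mechanism concrete, the sketch below allocates a managed buffer that may exceed the GPU's physical memory and lets GPU page faults migrate pages on first touch. The buffer size, the kernel, and the optional cudaMemPrefetchAsync call are illustrative assumptions, not code from DeepUM or the cited papers; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // first GPU touch of a page triggers a fault and a migration
}

int main() {
    // One address space for CPU and GPU; the allocation may be larger than GPU memory.
    size_t n = 1ull << 32;                            // 4 Gi floats = 16 GiB
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;    // pages are first touched on the CPU

    // Optional: prefetch to the GPU ahead of the kernel to hide page fault latency.
    int device;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(data, n * sizeof(float), device);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);                // accessing on the CPU migrates the page back
    cudaFree(data);
    return 0;
}
```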

Prof. Lee’s research group proposed a framework called DeepUM that exploits UM to allow oversubscription of GPU memory and implements optimization techniques to minimize the overhead caused by UM. DeepUM adapts a correlation prefetching technique, originally developed for cache-line prefetching, to prefetch GPU pages [8-11]. It exploits the fact that kernel execution patterns and their memory access patterns are mostly fixed and repeated in DNN training workloads. To minimize fault handling time, DeepUM proposes two optimization techniques for the GPU fault handling routines that are part of the NVIDIA device driver. One is page pre-eviction based on the information in the correlation tables, and the other is invalidating pages in GPU memory when an eviction victim is expected to be no longer used by PyTorch.
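As a rough illustration of correlation prefetching (not DeepUM's actual data structures, which it organizes around the repeating kernel execution sequence), the sketch below records, for each faulting page, the pages that most recently faulted right after it and returns them as prefetch candidates on the next fault to that page. The class, names, and table width are all assumptions made for this sketch.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

using PageId = uint64_t;

// Pair-based correlation table: for each page, remember which pages tended to fault next.
class CorrelationTable {
    static constexpr size_t kSuccessorsPerEntry = 4;        // assumed table width
    std::unordered_map<PageId, std::deque<PageId>> table_;
    PageId last_fault_ = 0;
    bool has_last_ = false;

public:
    // Called on every page fault: learn the transition "last fault -> this fault"
    // and return the recorded successors of the current page as prefetch candidates.
    std::vector<PageId> on_fault(PageId page) {
        if (has_last_) {
            std::deque<PageId> &succ = table_[last_fault_];
            succ.push_front(page);                           // most recent successor first
            if (succ.size() > kSuccessorsPerEntry) succ.pop_back();
        }
        last_fault_ = page;
        has_last_ = true;

        std::vector<PageId> predictions;
        auto it = table_.find(page);
        if (it != table_.end())
            predictions.assign(it->second.begin(), it->second.end());
        return predictions;
    }
};
```

Pages predicted this way can be prefetched before the GPU demand-faults on them, and the same history can inform pre-eviction, since pages not expected to be touched soon are cheaper to evict.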

DeepUM supports PyTorch, one of the most popular deep learning frameworks. Compared with the previous approaches, DeepUM requires very few modifications to the original PyTorch source code (fewer than ten lines) to change the behavior of the PyTorch memory allocator, and it requires no modification of user code.
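The snippet below is a hypothetical sketch of the kind of change this implies: routing an allocator's raw device allocations through Unified Memory so the allocation pool can exceed physical GPU capacity. The function names are invented for illustration and are not PyTorch's or DeepUM's.

```cpp
#include <cuda_runtime.h>

// Hypothetical wrapper around an allocator's raw device allocation path.
static cudaError_t raw_alloc(void **ptr, size_t size) {
    // Original behavior:  return cudaMalloc(ptr, size);
    // UM-based behavior:  managed memory allows the pool to grow beyond GPU
    //                     capacity, with pages migrated on demand by the driver.
    return cudaMallocManaged(ptr, size, cudaMemAttachGlobal);
}

static cudaError_t raw_free(void *ptr) {
    return cudaFree(ptr);   // cudaFree also releases managed allocations
}
```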

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, January 2023, Pages 207-221, https://doi.org/10.1145/3575693.3575736

References
  1. Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, and Dhabaleswar K. Panda. 2018. OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC). 143–152.
  2. Pak Markthub, Mehmet E. Belviranli, Seyong Lee, Jeffrey S. Vetter, and Satoshi Matsuoka. 2018. DRAGON: Breaking GPU Memory Capacity Limits with Direct NVM Access. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 414–426.
  3. Jonghyun Bae, Jongsung Lee, Yunho Jin, Sam Son, Shine Kim, Hakbeom Jang, Tae Jun Ham, and Jae W. Lee. 2021. FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. In 19th USENIX Conference on File and Storage Technologies (FAST 21). USENIX Association, 387–401.
  4. Mark Hildebrand, Jawad Khan, Sanjeev Trika, Jason Lowe-Power, and Venkatesh Akella. 2020. AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems Using Integer Linear Programming. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS ’20). Association for Computing Machinery, New York, NY, USA, 875–890.
  5. Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020. SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS ’20). Association for Computing Machinery, New York, NY, USA, 1341–1355.
  6. Tung D. Le, Haruki Imai, Yasushi Negishi, and Kiyokuni Kawachiya. 2018. TFLMS: Large Model Support in TensorFlow by Graph Rewriting. ArXiv abs/1807.02037 (2018).
  7. Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-Based GPU Memory Management for Deep Learning. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS ’20). Association for Computing Machinery, New York, NY, USA, 891–905.
  8. T. Alexander and G. Kedem. 1996. Distributed prefetch-buffer/cache design for high performance memory systems. In Proceedings. Second International Symposium on High-Performance Computer Architecture. 254–263.
  9. An-Chow Lai, C. Fide, and B. Falsafi. 2001. Dead-block prediction & dead-block correlating prefetchers. In Proceedings 28th Annual International Symposium on Computer Architecture. 144–154.
  10. D. Joseph and D. Grunwald. 1999. Prefetching using Markov predictors. IEEE Trans. Comput. 48, 2 (1999), 121–133.
  11. Y. Solihin, Jaejin Lee, and J. Torrellas. 2002. Using a user-level memory thread for correlation prefetching. In Proceedings 29th Annual International Symposium on Computer Architecture. 171–182.