Recent Deep Neural Network (DNN) models require tremendous GPU memory and computation power because deeper and wider layers generally provide better performance. Moreover, a larger batch size can lead to a shorter training time. The problem is that current state-of-the-art DNN models are so large that even fine-tuning them is hard on a small system, especially one with a single GPU. For example, GPT-3, one of the state-of-the-art language models, has 175 billion parameters. At two bytes per parameter in FP16, those 175 billion parameters require approximately 326 GB of memory just to load the model. Such an amount of memory cannot be handled even by a single NVIDIA H100 80GB GPU, the latest data center GPU from NVIDIA.
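A quick back-of-the-envelope check of that figure is shown below; it is purely illustrative, and the constants (175 billion parameters, 2 bytes per FP16 value, 80 GB on an H100) are the publicly known numbers rather than anything taken from DeepUM itself.

```cpp
#include <cstdio>

int main() {
    const double params     = 175e9;                 // GPT-3 parameter count
    const double bytes_fp16 = params * 2.0;          // 2 bytes per FP16 parameter
    const double gib        = bytes_fp16 / (1024.0 * 1024.0 * 1024.0);
    // Prints roughly 326 GiB, compared with the 80 GB on a single H100.
    printf("Model weights: ~%.0f GiB of memory; a single H100 has 80 GB\n", gib);
    return 0;
}
```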

GPU memory swapping is a technique to overcome such memory capacity limits for DNNs. Previous approaches to memory swapping fall into two categories. One uses CUDA Unified Memory with page prefetching [1, 2]. The other uses pure (i.e., non-UM) GPU memory and explicitly swaps memory objects in and out [3-7]. Unified Memory (UM) provides a single address space shared between the CPUs and the GPUs. It exploits the GPU page fault mechanism to migrate pages between processors on demand. The name varies across programming models (e.g., Shared Virtual Memory in OpenCL and Intel oneAPI), but the semantics are very similar.
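To make the UM mechanism concrete, the sketch below allocates a managed buffer that may exceed the GPU's physical memory and lets GPU page faults migrate pages on first touch. The buffer size, the kernel, and the optional cudaMemPrefetchAsync call are illustrative assumptions, not code from DeepUM or the cited papers; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // first GPU touch of a page triggers a fault and a migration
}

int main() {
    // One address space for CPU and GPU; the allocation may be larger than GPU memory.
    size_t n = 1ull << 32;                            // 4 Gi floats = 16 GiB
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;    // pages are first touched on the CPU

    // Optional: prefetch to the GPU ahead of the kernel to hide page fault latency.
    int device;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(data, n * sizeof(float), device);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);                // accessing on the CPU migrates the page back
    cudaFree(data);
    return 0;
}
```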

Prof. Lee’s research group proposed a framework called DeepUM that exploits UM to allow oversubscription of GPU memory and implements optimization techniques to minimize the overhead caused by UM. DeepUM adapts a correlation prefetching technique, originally developed for cache-line prefetching, to prefetch GPU pages [8-11]. It exploits the fact that kernel execution patterns and their memory access patterns are mostly fixed and repeated in DNN training workloads. To minimize fault handling time, DeepUM proposes two optimization techniques for the GPU fault handling routines that are part of the NVIDIA device driver. One is page pre-eviction based on the information in the correlation tables, and the other is invalidating pages in GPU memory when an eviction victim is expected to be no longer used by PyTorch.
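As a rough illustration of correlation prefetching (not DeepUM's actual data structures, which it organizes around the repeating kernel execution sequence), the sketch below records, for each faulting page, the pages that most recently faulted right after it and returns them as prefetch candidates on the next fault to that page. The class, names, and table width are all assumptions made for this sketch.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

using PageId = uint64_t;

// Pair-based correlation table: for each page, remember which pages tended to fault next.
class CorrelationTable {
    static constexpr size_t kSuccessorsPerEntry = 4;        // assumed table width
    std::unordered_map<PageId, std::deque<PageId>> table_;
    PageId last_fault_ = 0;
    bool has_last_ = false;

public:
    // Called on every page fault: learn the transition "last fault -> this fault"
    // and return the recorded successors of the current page as prefetch candidates.
    std::vector<PageId> on_fault(PageId page) {
        if (has_last_) {
            std::deque<PageId> &succ = table_[last_fault_];
            succ.push_front(page);                           // most recent successor first
            if (succ.size() > kSuccessorsPerEntry) succ.pop_back();
        }
        last_fault_ = page;
        has_last_ = true;

        std::vector<PageId> predictions;
        auto it = table_.find(page);
        if (it != table_.end())
            predictions.assign(it->second.begin(), it->second.end());
        return predictions;
    }
};
```

Pages predicted this way can be prefetched before the GPU demand-faults on them, and the same history can inform pre-eviction, since pages not expected to be touched soon are cheaper to evict.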

DeepUM supports PyTorch, one of the most popular deep learning frameworks. Compared with the previous approaches, DeepUM requires very few modifications to the original PyTorch source code (fewer than ten lines) to change the behavior of the PyTorch memory allocator, and it requires no modification of user code.
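The snippet below is a hypothetical sketch of the kind of change this implies: routing an allocator's raw device allocations through Unified Memory so the allocation pool can exceed physical GPU capacity. The function names are invented for illustration and are not PyTorch's or DeepUM's.

```cpp
#include <cuda_runtime.h>

// Hypothetical wrapper around an allocator's raw device allocation path.
static cudaError_t raw_alloc(void **ptr, size_t size) {
    // Original behavior:  return cudaMalloc(ptr, size);
    // UM-based behavior:  managed memory allows the pool to grow beyond GPU
    //                     capacity, with pages migrated on demand by the driver.
    return cudaMallocManaged(ptr, size, cudaMemAttachGlobal);
}

static cudaError_t raw_free(void *ptr) {
    return cudaFree(ptr);   // cudaFree also releases managed allocations
}
```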

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, January 2023, Pages 207-221, https://doi.org/10.1145/3575693.3575736

References
  1. Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, and Dhabaleswar K. Panda. 2018. OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC). 143–152.
  2. Pak Markthub, Mehmet E. Belviranli, Seyong Lee, Jeffrey S. Vetter, and Satoshi Matsuoka. 2018. DRAGON: Breaking GPU Memory Capacity Limits with Direct NVM Access. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 414–426.
  3. Jonghyun Bae, Jongsung Lee, Yunho Jin, Sam Son, Shine Kim, Hakbeom Jang, Tae Jun Ham, and Jae W. Lee. 2021. FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. In 19th USENIX Conference on File and Storage Technologies (FAST 21). USENIX Association, 387–401.
  4. Mark Hildebrand, Jawad Khan, Sanjeev Trika, Jason Lowe-Power, and Venkatesh Akella. 2020. AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems Using Integer Linear Programming. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS ’20). Association for Computing Machinery, New York, NY, USA, 875–890.
  5. Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020. SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS ’20). Association for Computing Machinery, New York, NY, USA, 1341–1355.
  6. Tung D. Le, Haruki Imai, Yasushi Negishi, and Kiyokuni Kawachiya. 2018. TFLMS: Large Model Support in TensorFlow by Graph Rewriting. ArXiv abs/1807.02037 (2018).
  7. Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-Based GPU Memory Management for Deep Learning. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS ’20). Association for Computing Machinery, New York, NY, USA, 891–905.
  8. T. Alexander and G. Kedem. 1996. Distributed prefetch-buffer/cache design for high performance memory systems. In Proceedings. Second International Symposium on High-Performance Computer Architecture. 254–263.
  9. An-Chow Lai, C. Fide, and B. Falsafi. 2001. Dead-block prediction & dead-block correlating prefetchers. In Proceedings 28th Annual International Symposium on Computer Architecture. 144–154.
  10. D. Joseph and D. Grunwald. 1999. Prefetching using Markov predictors. IEEE Trans. Comput. 48, 2 (1999), 121–133.
  11. Y. Solihin, Jaejin Lee, and J. Torrellas. 2002. Using a user-level memory thread for correlation prefetching. In Proceedings 29th Annual International Symposium on Computer Architecture. 171–182.