Recent Deep Neural Network (DNN) models require tremendous GPU memory and computation power because deeper and wider layers generally yield better accuracy, and larger batch sizes shorten training time. The problem is that current state-of-the-art DNN models are so large that even fine-tuning them is hard on a small system, especially one with a single GPU. For example, GPT-3, one of the state-of-the-art language models, has 175 billion parameters, which require approximately 326 GB of memory just to load the model when it is saved in FP16. That much memory cannot be handled even by a single NVIDIA H100, NVIDIA's latest data center GPU with 80 GB of memory.
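The 326 GB figure follows directly from the parameter count: FP16 stores each parameter in two bytes. A quick check in plain Python (assuming GB here means binary gigabytes, GiB):

```python
params = 175_000_000_000      # GPT-3 parameter count
bytes_per_param = 2           # FP16 (half precision): 2 bytes per parameter
total_bytes = params * bytes_per_param

gib = total_bytes / 2**30     # convert bytes to GiB
print(f"{gib:.0f} GiB")       # -> 326 GiB
```

So the weights alone, before any activations, gradients, or optimizer state, exceed the capacity of four 80 GB GPUs.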
GPU memory swapping is a technique for overcoming such memory capacity limits of DNNs. Previous approaches to memory swapping fall into two categories: one uses CUDA Unified Memory with page prefetching [1, 2], and the other uses pure (i.e., non-UM) GPU memory and explicitly swaps memory objects in and out [3-7]. Unified Memory (UM) provides a single address space shared between the CPUs and the GPUs and exploits the GPU page fault mechanism to migrate pages between processors on demand. The name varies with the programming model (e.g., Shared Virtual Memory in OpenCL and Intel oneAPI), but the semantics are very similar.
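The second category can be pictured as an allocator that tracks a fixed device budget and evicts the least recently used objects to host memory when the budget would overflow. The sketch below is a hypothetical simulation in plain Python, not the API of any real framework; names like `SwapManager` are invented for illustration:

```python
from collections import OrderedDict

class SwapManager:
    """Toy model of explicit swap-in/swap-out memory management.
    Objects live either on the 'device' or the 'host'; the device
    has a fixed byte budget. (Illustrative only.)"""

    def __init__(self, device_budget):
        self.budget = device_budget
        self.used = 0
        self.device = OrderedDict()   # name -> size, in LRU order
        self.host = {}                # swapped-out objects

    def access(self, name, size=0):
        """Ensure `name` is resident on the device, swapping out the
        least recently used objects if the budget would overflow."""
        if name in self.device:                  # hit: refresh LRU position
            self.device.move_to_end(name)
            return
        size = self.host.pop(name, size)         # swap in if previously evicted
        while self.used + size > self.budget and self.device:
            victim, vsize = self.device.popitem(last=False)  # evict LRU
            self.host[victim] = vsize
            self.used -= vsize
        self.device[name] = size
        self.used += size
```

For example, with a budget of 100 bytes, accessing tensors `A` (60 bytes) and then `B` (60 bytes) forces `A` out to host memory; touching `A` again swaps it back in and evicts `B`.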
Prof. Lee’s research group proposed DeepUM, a framework that exploits UM to allow oversubscription of GPU memory and implements optimization techniques that minimize the overhead UM introduces. DeepUM adapts a correlation prefetching technique originally developed for cache lines to prefetch GPU pages [8-11], exploiting the fact that kernel execution patterns and their memory access patterns are mostly fixed and repeated across DNN training iterations. To minimize fault handling time, DeepUM adds two optimizations to the GPU fault handling routines that are part of the NVIDIA device driver: page pre-eviction guided by the correlation tables, and invalidation of pages in GPU memory when the eviction victim is expected to be no longer used by PyTorch.
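The core idea of correlation prefetching can be sketched in a few lines: a table records which pages historically followed each faulting page, and on a later fault to the same page those recorded successors become prefetch candidates. The sketch below is a simplified, single-level pair-correlation table in plain Python; DeepUM's actual multi-entry tables and their integration into the driver's fault handler are considerably more elaborate:

```python
from collections import defaultdict, deque

class CorrelationPrefetcher:
    """Toy pair-based correlation table: for each faulting page,
    remember up to `num_succ` pages that followed it, most recent
    first, and predict those on the next fault. (Illustrative only.)"""

    def __init__(self, num_succ=2):
        self.num_succ = num_succ
        self.table = defaultdict(deque)   # page -> recent successor pages
        self.last_fault = None

    def on_fault(self, page):
        """Record the (last_fault -> page) correlation, then return the
        pages predicted to fault next, i.e., the prefetch candidates."""
        if self.last_fault is not None:
            succ = self.table[self.last_fault]
            if page in succ:
                succ.remove(page)
            succ.appendleft(page)         # most recent successor first
            while len(succ) > self.num_succ:
                succ.pop()
        self.last_fault = page
        return list(self.table.get(page, ()))
```

Because a training iteration repeats the same kernel and page sequence, the table learned in one iteration predicts the faults of the next, which is precisely the repetition DeepUM relies on.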
DeepUM supports PyTorch, one of the most popular deep learning frameworks. Compared with previous approaches, DeepUM requires very few modifications to the PyTorch source code (fewer than ten lines, to change the behavior of the PyTorch memory allocator) and no modifications to user code at all.
ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, January 2023, Pages 207-221, https://doi.org/10.1145/3575693.3575736