Linux Systems Face Hibernation Challenges with AMD Instinct Accelerator Cards Boasting 1.5TB of VRAM
6 days ago
Author: Editor

The capacity of the HBM (high-bandwidth memory) used as video memory on AI accelerator cards is growing rapidly: both AMD and NVIDIA have reached 192GB per card, with 288GB soon to follow. This growth creates a problem for Linux systems. In a recent Linux patch, AMD engineer Samuel Zhang pointed out that a system with multiple AMD Instinct accelerator cards, each carrying 192GB of VRAM, can fail to hibernate. Specifically, a server equipped with eight such cards has 8 × 192GB = 1536GB, or about 1.5TB, of VRAM in total, and Linux cannot enter hibernation on it.

The root cause lies in how Linux handles GPU VRAM: during hibernation, the contents of all GPU VRAM must be moved into system memory. Once the total VRAM passes a certain threshold, the resulting hibernation image exceeds the system's memory capacity and hibernation fails. To address this, Samuel Zhang proposed a change that reduces the amount of memory required during hibernation, at the cost of longer resume times. To offset that trade-off, he also submitted a further patch intended to speed up the resume process.
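The arithmetic behind the 1.5TB figure, and the reason it can exceed system memory, can be sketched in a few lines of Python. This is a deliberately simplified model, not the kernel's actual accounting: the function names, the 1TB and 2TB RAM sizes, and the "RAM already in use" figure are hypothetical values chosen for illustration, and the only numbers taken from the article are eight cards at 192GB each.

```python
# Simplified model: VRAM contents must fit into system RAM during hibernation.
# Only NUM_CARDS and GB_PER_CARD come from the article; everything else is
# an illustrative assumption.

NUM_CARDS = 8        # cards per server (from the article)
GB_PER_CARD = 192    # VRAM per card in GB (from the article)


def total_vram_gb(cards: int, gb_per_card: int) -> int:
    """Total VRAM across all cards, in GB."""
    return cards * gb_per_card


def vram_fits_in_ram(vram_gb: int, ram_gb: int, ram_in_use_gb: int = 0) -> bool:
    """Return True if the VRAM contents fit into the RAM that is still free.

    Hypothetical check: real kernels also need room for the hibernation
    image itself, so the true requirement is stricter than this.
    """
    return vram_gb + ram_in_use_gb <= ram_gb


if __name__ == "__main__":
    total = total_vram_gb(NUM_CARDS, GB_PER_CARD)
    print(f"Total VRAM: {total} GB ({total / 1024:.1f} TB)")

    # Hypothetical server RAM sizes, in GB.
    for ram in (1024, 2048):
        ok = vram_fits_in_ram(total, ram, ram_in_use_gb=256)
        print(f"RAM {ram} GB, 256 GB in use: fits = {ok}")
```

Even under this generous model, a server with 1TB of RAM cannot absorb 1536GB of VRAM contents, which is the kind of shortfall the patch is meant to reduce.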