At the 2025 ACM/IEEE International Conference on Computer-Aided Design (ICCAD), a CCF Category A venue with an h5-index of 66 held in Munich, Germany, a notable academic result was presented. Hao Yingbo, a doctoral candidate in the college's Intelligent Storage and Computing Research Group (led by Professor Zou Yi), and Chen Huangxu, a master's student at The Hong Kong University of Science and Technology, are co-first authors of the paper 'OA-LAMA: An Outlier-Adaptive LLM Inference Accelerator with Memory-Aligned Mixed-Precision Group Quantization'.
The paper proposes a hardware-software co-design framework that targets the main obstacle to deploying large language models (LLMs): their substantial memory and computational demands. The resulting accelerator, OA-LAMA, is outlier-adaptive; it combines a memory-aligned mixed-precision group quantization format with outlier reordering, preserving DRAM-aligned memory access while improving model accuracy.
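To make the idea concrete, the following Python sketch illustrates one plausible form of memory-aligned mixed-precision group quantization with outlier reordering. It is a minimal illustration, not the authors' implementation: the group size, bit widths, outlier count, and function names are all assumptions. Within each fixed-size group, the largest-magnitude values are moved to the front and stored at higher precision, so every group occupies the same number of bytes and DRAM accesses stay aligned.

```python
import numpy as np

def quantize_group(group, n_outliers, lo_bits=4, hi_bits=8):
    """Quantize one fixed-size group: the largest-magnitude values
    (outliers) are kept at higher precision, the rest at low precision.
    Reordering outliers to the front keeps each group's byte footprint
    fixed, which is what preserves DRAM-aligned access."""
    order = np.argsort(-np.abs(group))           # outliers first
    reordered = group[order]
    out, rest = reordered[:n_outliers], reordered[n_outliers:]

    def sym_quant(x, bits):
        # Symmetric per-partition quantization with a single scale.
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(x).max() / qmax if x.size else 1.0
        scale = scale or 1.0                     # guard against all-zero input
        return np.round(x / scale).astype(np.int32), scale

    q_out = sym_quant(out, hi_bits)
    q_rest = sym_quant(rest, lo_bits)
    return order, q_out, q_rest

def dequantize_group(order, q_out, q_rest):
    (qo, so), (qr, sr) = q_out, q_rest
    vals = np.concatenate([qo * so, qr * sr])
    group = np.empty_like(vals)
    group[order] = vals                          # undo the reordering
    return group

rng = np.random.default_rng(0)
w = rng.normal(size=128)                         # one group of 128 weights
w[[3, 77]] = [9.0, -7.5]                         # inject two outliers
parts = quantize_group(w, n_outliers=2)
err = np.abs(w - dequantize_group(*parts)).max()
print(f"max reconstruction error: {err:.4f}")
```

In this sketch the reordering permutation is stored alongside the group; keeping the outlier slots at a fixed position in the packed group, rather than scattered, is what lets the format remain aligned to memory bursts.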
The framework also handles the varying proportion of outliers across layers through a distribution-aware group allocation strategy and a hardware design built around a three-level accumulation architecture. Experiments show that OA-LAMA exceeds the accuracy of state-of-the-art 4-bit quantization methods while delivering 1.21× to 3.09× higher performance and 1.35× to 2.47× better energy efficiency, advancing the joint optimization of accuracy and efficiency in LLM inference. The code accompanying the paper has been open-sourced to support further research and collaboration.
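For a sense of how a distribution-aware allocation might look, the sketch below profiles each layer's outlier fraction and budgets high-precision slots per group in proportion to it. This is again a hedged illustration under stated assumptions: the z-score heuristic, the `group_size` and `max_slots` parameters, and the layer names are hypothetical, not the paper's actual strategy.

```python
import numpy as np

def outlier_fraction(weights, z=3.0):
    """Fraction of weights whose magnitude exceeds z standard deviations;
    a simple stand-in for profiling a layer's outlier distribution."""
    w = weights.ravel()
    return float(np.mean(np.abs(w) > z * w.std()))

def allocate_outlier_slots(layer_weights, group_size=128, max_slots=8):
    """Give each layer a per-group budget of high-precision slots in
    proportion to its measured outlier fraction, capped so every group's
    byte footprint stays within the aligned format's limit."""
    budgets = {}
    for name, w in layer_weights.items():
        frac = outlier_fraction(w)
        budgets[name] = min(max_slots, int(np.ceil(frac * group_size)))
    return budgets

rng = np.random.default_rng(1)
layers = {
    "attn.qkv": rng.standard_t(df=3, size=(256, 256)),  # heavy-tailed layer
    "mlp.fc1":  rng.normal(size=(256, 256)),            # near-Gaussian layer
}
print(allocate_outlier_slots(layers))
```

Capping the budget is the key design point in this toy version: a fixed upper bound on high-precision slots per group keeps the worst-case group size constant, so varying inter-layer outlier proportions never break the aligned memory layout.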
