Elon Musk’s xAAI operates a fleet of approximately 550,000 NVIDIA GPUs, yet it reportedly extracts only about 11% of their computational potential, a figure that has prompted widespread skepticism in the industry about how efficiently the company harnesses its compute. According to an internal xAI memo obtained by The Information, xAI President Michael Nichols told the team that the company’s Model FLOPs Utilization (MFU) hovers around 11%, and he set an ambitious goal of raising that figure to 50% within the coming months. For production-scale training of large models, MFU typically ranges between 35% and 45%; Meta and Google have reportedly achieved roughly 43% and 46%, respectively. By that benchmark, xAI lags far behind the industry mainstream.
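For context, MFU is simply the ratio of the floating-point operations a training job actually sustains to the hardware’s theoretical peak. A minimal sketch of the arithmetic follows; the parameter count, batch size, step time, and per-GPU peak throughput are illustrative assumptions, not figures from the memo:

```python
# Rough MFU estimate: achieved training FLOP/s divided by the cluster's peak FLOP/s.
# All numbers below are illustrative assumptions, not figures reported for xAI.

def estimate_mfu(params: float, tokens_per_step: float, step_time_s: float,
                 num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU ~ (FLOPs actually performed per second) / (cluster peak FLOP/s)."""
    # Common approximation for dense transformers: ~6 FLOPs per parameter
    # per token, covering the forward and backward passes.
    achieved_flops_per_s = 6 * params * tokens_per_step / step_time_s
    cluster_peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / cluster_peak_flops

# Hypothetical example: a 1T-parameter model, 20M tokens per batch, 10 s per step,
# on 100,000 GPUs each rated at ~1e15 FLOP/s (BF16).
print(f"MFU ~ {estimate_mfu(1e12, 2e7, 10.0, 100_000, 1e15):.1%}")  # prints ~12%
```

Under these assumed numbers the result lands near the low-teens range the memo describes, which is why the gap to a 35–45% industry norm is considered so large.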
The root cause of xAI’s low utilization lies chiefly in outdated software stacks and inefficient parallelization strategies. As GPU fleets grow from thousands to hundreds of thousands of devices, the difficulty of communication, scheduling, fault tolerance, and parallel-strategy design rises sharply. To address this, xAI plans to raise utilization through infrastructure and software-stack optimization, and it is also exploring renting out some of its idle compute to external customers.
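One way to see why utilization tends to erode as the cluster grows is a toy model in which per-step compute time stays fixed while collective-communication time creeps up with cluster size. The constants below are illustrative assumptions, not measurements from xAI’s systems:

```python
import math

# Toy model of how communication overhead erodes the useful fraction of each
# training step at scale. All constants are illustrative assumptions.

def effective_utilization(num_gpus: int,
                          compute_s: float = 8.0,
                          comm_base_s: float = 0.5,
                          comm_per_10x_s: float = 1.5) -> float:
    """Fraction of each step spent on useful compute, assuming communication
    cost grows roughly with the logarithm of the cluster size."""
    comm_s = comm_base_s + comm_per_10x_s * math.log10(num_gpus)
    return compute_s / (compute_s + comm_s)

for n in (1_000, 10_000, 100_000, 500_000):
    print(f"{n:>7} GPUs -> ~{effective_utilization(n):.0%} of step time on compute")
```

Under these assumed constants, the useful fraction drops from roughly 60% at a few thousand GPUs to below 50% at half a million, which is the kind of degradation that better parallel strategies and software stacks are meant to claw back.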
