Salvatore Sanfilippo (antirez), the creator of Redis, recently released ds4, a local inference engine built specifically for DeepSeek V4 Flash. It is not a general-purpose runtime but a deliberately narrow, deep implementation designed around the Metal GPU path; the CPU path exists only for debugging. Because ds4.c is written entirely against Apple's Metal API, it runs only on Apple Silicon devices and does not support Nvidia or AMD graphics cards. The codebase is kept small and focused on performance. In reported tests on a MacBook Pro M3 Max with 128 GB of memory, running a 2-bit quantized model with a 32K context window, short-prompt prefill reaches 58.52 tokens/s and generation reaches 26.68 tokens/s. Using techniques such as asymmetric quantization and storing the KV cache on disk, ds4.c delivers fast local inference and offers a new approach to pairing AI models with specific hardware.
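
To put the reported throughput in practical terms, the short Python sketch below converts the two figures into a rough end-to-end latency estimate. It is not part of ds4: the function name `estimate_latency_s` and the token counts in the usage example are hypothetical, and the sketch assumes both rates stay constant, which is a simplification because the prefill figure was measured on short prompts only.

```python
# Throughput figures reported in the text, in tokens per second.
PREFILL_TOK_PER_S = 58.52      # short-prompt prefill
GENERATION_TOK_PER_S = 26.68   # token generation


def estimate_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Rough latency estimate in seconds.

    Assumption (not from the source): both rates are constant regardless
    of prompt length or context fill, so this is only a ballpark figure.
    """
    if prompt_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    generation_s = output_tokens / GENERATION_TOK_PER_S
    return prefill_s + generation_s


if __name__ == "__main__":
    # Hypothetical workload: a 100-token prompt and a 500-token answer.
    seconds = estimate_latency_s(prompt_tokens=100, output_tokens=500)
    print(f"estimated latency: {seconds:.1f} s")
```

Under these assumptions, generation dominates the total time for typical chat-style workloads, since the generation rate is less than half the prefill rate and answers are usually longer than short prompts.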
