Stanford University recently evaluated nine clinical medical AI models using its MedHELM evaluation framework. DeepSeek R1 ranked first, with a 66% win rate and a macro average score of 0.75; o3-mini and Claude 3.7 Sonnet placed second and third, respectively. The evaluation spanned 35 benchmark tests across 22 medical subfields, examining how the models perform on tasks drawn from real-world clinical scenarios. Notably, the LLM-jury scoring method showed strong agreement with physicians' ratings, and an accompanying cost-benefit analysis found that different models suit different medical needs. The evaluation is expected to accelerate the development and adoption of medical AI.
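To make the two headline metrics concrete, here is a minimal sketch of how a win rate (fraction of head-to-head, per-benchmark comparisons won) and a macro average (unweighted mean of per-benchmark scores) could be computed. The model names, benchmark names, and scores below are fabricated for illustration and are not MedHELM's actual data or implementation.

```python
# Illustrative computation of win rate and macro average over a leaderboard.
# All names and numbers here are hypothetical, not from the MedHELM results.

# scores[model][benchmark] -> normalized score in [0, 1] (fabricated values)
scores = {
    "model_a": {"bench_1": 0.80, "bench_2": 0.70, "bench_3": 0.75},
    "model_b": {"bench_1": 0.60, "bench_2": 0.75, "bench_3": 0.65},
    "model_c": {"bench_1": 0.50, "bench_2": 0.55, "bench_3": 0.60},
}

def macro_average(model):
    """Unweighted mean over benchmarks: each benchmark counts equally,
    regardless of how many test items it contains."""
    vals = scores[model].values()
    return sum(vals) / len(vals)

def win_rate(model):
    """Fraction of pairwise, per-benchmark comparisons this model wins
    against every other model on the leaderboard."""
    wins = total = 0
    for other in scores:
        if other == model:
            continue
        for bench in scores[model]:
            total += 1
            if scores[model][bench] > scores[other][bench]:
                wins += 1
    return wins / total

print(macro_average("model_a"))  # mean of 0.80, 0.70, 0.75 -> 0.75
print(win_rate("model_a"))       # wins 5 of 6 comparisons -> ~0.833
```

A macro average treats every benchmark equally, so a model cannot dominate the ranking just by excelling on the largest test sets; the win rate complements it by rewarding consistent head-to-head superiority.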