OpenAI has introduced HealthBench, an open-source benchmark for evaluating large language models in medical settings. Curated with 262 physicians worldwide, HealthBench comprises 48,562 scoring criteria and uses multi-turn conversations to approximate real-world medical scenarios. A benchmark measures performance rather than improving it, and the reported results point to rapid gains among smaller models: GPT-4.1 nano outperforms GPT-4o while costing about 25 times less.
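
To make the idea of scoring criteria concrete, here is a minimal sketch of how a rubric could be turned into a single score for one response. It assumes each criterion carries a point value (negative for undesirable behavior) and that a grader has already marked whether the response met it. The `Criterion` class, the `rubric_score` function, and the sample criteria are hypothetical illustrations, not HealthBench's released code or data.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # what the criterion checks (hypothetical wording)
    points: int       # positive for desired behavior, negative for undesired
    met: bool         # whether a grader judged the response to meet it


def rubric_score(criteria: list[Criterion]) -> float:
    """Points earned on met criteria over the maximum positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Illustrative rubric for one response; the criteria and weights are made up.
example = [
    Criterion("Advises seeking emergency care for chest pain", 10, True),
    Criterion("Asks how long the symptoms have lasted", 5, False),
    Criterion("Recommends a prescription dose without context", -8, False),
]
score = rubric_score(example)
print(f"{score:.2f}")
```

Under these assumptions the response earns 10 of a possible 15 points. Clipping keeps a response that triggers mostly negative criteria from scoring below zero.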
