IT之家 reported on April 21 that there is a significant discrepancy between the first-party and third-party benchmark results for OpenAI's o3 AI model, raising questions about the company's transparency and its model-testing practices.
When OpenAI first unveiled o3 last December, it claimed the model could correctly answer just over a quarter of the problems on FrontierMath, an exceptionally challenging set of math problems. That score far outpaced the competition: the second-best model answered only about 2% of FrontierMath problems correctly. OpenAI Chief Research Officer Mark Chen said during a livestream: "Today, all offerings out there have less than 2% on FrontierMath. We're seeing internally, with o3 in aggressive test-time compute settings, we're able to get over 25%."
However, that figure appears to have been an upper bound, achieved by a version of o3 with more computing power behind it than the model OpenAI publicly released last week. Epoch AI, the research institute behind FrontierMath, published the results of its independent benchmark of o3 last Friday and found that the model scored only around 10%, well below OpenAI's highest claimed score.
This doesn't necessarily mean OpenAI lied: the benchmark results the company published in December also included a lower-bound score matching the one Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI's, and that its evaluation used a newer release of FrontierMath. "The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private)," Epoch wrote in its report.
In addition, the ARC Prize Foundation, an organization that tested a pre-release version of o3, said in a post on X that the public o3 model is "a different model [...] tuned for chat/product use," corroborating Epoch's report. "All released o3 compute tiers are smaller than the version we tested," ARC Prize added; generally speaking, larger compute tiers tend to achieve better benchmark scores.
It's worth noting that while the public version of o3 falls short of OpenAI's earlier testing claims, the point is now somewhat moot: the company's subsequent o3-mini-high and o4-mini models already outperform o3 on FrontierMath, and OpenAI plans to launch a more powerful o3 variant, o3-pro, in the coming weeks.
Still, the incident is yet another reminder that AI benchmark results are best taken with a grain of salt, especially when they come from a company with a product to sell. As competition in the AI industry intensifies and vendors race to capture attention and market share with new models, benchmarking "controversies" are becoming increasingly common.
IT之家 notes that in January, Epoch drew criticism for waiting until after OpenAI announced o3 to disclose the funding it had received from the company; many academics who contributed to FrontierMath did not learn of OpenAI's involvement until it was made public. More recently, Elon Musk's xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. And just this month, Meta admitted that the benchmark scores it had touted were based on a different version of its model than the one made available to developers.