How smart is a model that memorizes the answers to the test? That's the problem facing OpenAI after it unveiled o3 in December and touted the model's stunning benchmark results. Some experts at the time praised it as being nearly as capable as AGI, the threshold at which artificial intelligence can deliver the same level of performance as a human on any task demanded by the user.

But money changes everything, even in math testing, apparently.

OpenAI's victory lap over its o3 model's stunning 25.2% score on FrontierMath, a challenging mathematical benchmark developed by Epoch AI, hit a snag when it turned out the company wasn't just acing the test: OpenAI helped write it, too.

In an updated note on the FrontierMath whitepaper, Epoch AI wrote, "We gratefully acknowledge OpenAI for their support in creating the benchmark," and that was enough to raise red flags among observers.

Screenshot from Epoch AI's research paper acknowledging OpenAI's support during the development of the FrontierMath benchmark dataset
Image: Epoch AI via ArXiv

Worse, OpenAI had access to FrontierMath's problems and solutions, in addition to funding its development. Epoch AI later revealed that OpenAI commissioned the company to provide 300 math problems, along with their solutions.

"OpenAI retains ownership of these questions and has access to the problems and solutions," Epoch said on Thursday, "as is typical of commissioned work."

We contacted OpenAI and Epoch AI for comment, but neither responded. However, Epoch says OpenAI had previously agreed that it would not train its o3 model on the questions and answers in the dataset.

The Information first reported the news.

Experts note that access to the test materials could still allow performance to be optimized through iterative adjustments, even though an OpenAI spokesperson says the company didn't directly train o3 on the benchmark and that the problems were "strongly held out" (meaning OpenAI didn't have access to some of the problems).

Tamay Besiroglu, associate director at Epoch AI, said OpenAI had requested that the company's financial relationship with Epoch not be disclosed.

"We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible," he wrote in a post. "Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset."

Besiroglu said OpenAI stated it doesn't use Epoch AI's problems and solutions, but didn't sign any legal contract that would make that commitment enforceable. "We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions," he wrote. "However, we have a verbal agreement that these materials will not be used in model training."

Fishy as it sounds, Elliot Glazer, Epoch AI's lead mathematician, said he believes OpenAI was true to its word: "My personal opinion is that OAI's score is legit (i.e., they didn't train on the dataset), and that they have no incentive to lie about internal benchmarking performances," he posted on Reddit.

The mathematician also addressed the situation on Twitter, linking to a discussion of the topic on the LessWrong forum.

Not the first, not the last

The debate goes beyond OpenAI, highlighting structural problems with how the AI industry evaluates progress. A recent analysis by AI researcher Louis Hunt revealed that other top-performing models, including Mistral 7B, Google's Gemma, Microsoft's Phi-3, Meta's Llama-3, and Alibaba's Qwen 2.5, were able to reproduce verbatim 6,882 pages of the MMLU and GSM8K benchmarks.

MMLU, like FrontierMath, is a benchmark created to measure how good models are at multitask language understanding. GSM8K is a set of grade-school math problems used to evaluate LLMs' math proficiency.

LLMs reproducing the training dataset of some AI benchmarks
Image: Louis Hunt

That makes it impossible to accurately assess how capable or accurate those models really are. It's like handing a student with photographic memory a list of the problems and solutions that will appear on their next exam. Did they reason their way to a solution, or simply spit back the answer they had memorized? Since these tests are intended to show that AI models can reason, you can see why this matters.
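To make the memorization question concrete, here is a minimal sketch of the kind of verbatim-reproduction check such analyses rely on: prompt a model with the first part of a benchmark item and measure how much of the held-back remainder it reproduces word for word. Everything in it is illustrative rather than taken from Hunt's work; the `generate` callable stands in for a real base model, and the example question is merely GSM8K-style.

```python
from difflib import SequenceMatcher

def contamination_score(generate, benchmark_item: str, prompt_fraction: float = 0.5) -> float:
    """Prompt a model with the start of a benchmark item and return the fraction
    of the hidden remainder it reproduces verbatim (0.0 = none, 1.0 = all).

    `generate` is any callable mapping a prompt string to a completion string;
    a real check would wrap an actual base model here.
    """
    cut = int(len(benchmark_item) * prompt_fraction)
    prompt, expected_tail = benchmark_item[:cut], benchmark_item[cut:]
    completion = generate(prompt)
    # Longest contiguous match between the completion and the hidden remainder,
    # expressed as a fraction of the remainder's length.
    match = SequenceMatcher(None, completion, expected_tail).find_longest_match(
        0, len(completion), 0, len(expected_tail)
    )
    return match.size / max(len(expected_tail), 1)

if __name__ == "__main__":
    # A GSM8K-style question, used only to demonstrate the check.
    item = ("Natalia sold clips to 48 of her friends in April, and then she sold "
            "half as many clips in May. How many clips did she sell altogether?")

    # Stand-in for a model that has memorized the item (illustrative only).
    memorized_model = lambda prompt: item[len(prompt):]
    print(f"memorized model: {contamination_score(memorized_model, item):.2f}")

    # Stand-in for a model that has never seen the item.
    clean_model = lambda prompt: "She also sold some clips to her classmates in May."
    print(f"clean model:     {contamination_score(clean_model, item):.2f}")
```

A score near 1.0 for a base model, as in the first case, is the signature of training data contamination rather than reasoning.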

"It's actually A VERY BIG ISSUE," RemBrain founder Vasily Morzhakov warned. The instruct versions of these models are what get evaluated on MMLU and GSM8K, but the fact that base models can regenerate the tests verbatim indicates those benchmarks were already present in their pre-training data.

Epoch said it intends to implement a "holdout set" of 50 randomly selected problems that will be withheld from OpenAI to ensure genuine testing capability going forward.
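A holdout split like the one Epoch describes is conceptually simple. The sketch below is illustrative only: the function names, data layout, and the 300-problem demo count are assumptions, not Epoch's actual pipeline. It randomly reserves 50 problems that would never be shared with the model provider.

```python
import random

def split_holdout(problems: list[dict], holdout_size: int = 50, seed: int = 0):
    """Randomly reserve `holdout_size` problems for evaluation only.

    The returned holdout set would be kept private from the model provider;
    the remaining problems can be shared as commissioned work.
    """
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = problems[:]
    rng.shuffle(shuffled)
    return shuffled[holdout_size:], shuffled[:holdout_size]  # (shared, holdout)

if __name__ == "__main__":
    # Illustrative stand-in for a benchmark file; FrontierMath itself is not public.
    problems = [{"id": i, "question": f"problem {i}", "answer": "..."} for i in range(300)]
    shared, holdout = split_holdout(problems)
    print(len(shared), "shared problems;", len(holdout), "held out")
```

The point of the fixed random seed and the private holdout is that scores on those 50 problems can't be tuned against, since the lab being evaluated never sees them.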

Still, producing evaluations that are truly independent remains challenging. Ideal testing would require "a neutral sandbox which is not easy to realize," according to computer scientist Dirk Roeckmann, who added that even then there is a risk of "leaking of test data by adversarial humans."
