Despite the hype surrounding artificial intelligence, even the most sophisticated vision-language models—GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro—struggle with a decades-old problem: playing the classic first-person shooter Doom.

On Thursday, a new research project introduced VideoGameBench, an AI benchmark designed to test whether state-of-the-art vision-language models can play—and beat—a set of 20 popular video games using only what they see on the screen.

“In our experience, current state-of-the-art VLMs substantially struggle to play video games because of high inference latency,” the researchers said. “When an agent takes a screenshot and queries the VLM about what action to take, by the time the response comes back, the game state has changed significantly and the action is no longer relevant.”

The researchers said they used classic Game Boy and MS-DOS games because of their simpler visuals and varied input styles, such as mouse and keyboard or a game controller, which test a vision-language model’s spatial reasoning capabilities better than text-based games.

VideoGameBench was developed by computer scientist and AI researcher Alex Zhang. The suite of games includes classics like Warcraft II, Age of Empires, and Prince of Persia.

According to the researchers, delayed responses are most problematic in first-person shooters like Doom. In these fast-paced environments, an enemy visible in a screenshot may already have moved—or even reached the player—by the time the model acts.
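The loop behind that failure mode is easy to picture. Below is a minimal, self-contained sketch of a screenshot-to-action agent step; the `Game` and `VLMClient` classes are illustrative stand-ins, not VideoGameBench’s actual API, and the two-second delay is just a simulated model round trip.

```python
import time

# Illustrative sketch of the screenshot -> VLM -> action loop described above.
# Game and VLMClient are placeholder stubs, not VideoGameBench's real interfaces.

class Game:
    """Stand-in for a running emulator that keeps advancing in real time."""
    def capture_screenshot(self) -> bytes:
        return b"<frame pixels>"

    def send_input(self, action: str) -> None:
        print(f"pressing: {action}")

class VLMClient:
    """Stand-in for a remote vision-language model call with noticeable latency."""
    def query(self, frame: bytes, prompt: str) -> str:
        time.sleep(2.0)  # simulate a slow round trip to the model
        return "move forward and shoot"

def play_step(game: Game, vlm: VLMClient) -> float:
    frame = game.capture_screenshot()          # game state at time t
    start = time.monotonic()
    action = vlm.query(frame, prompt="What should the player do next?")
    latency = time.monotonic() - start         # seconds the game kept running
    # By now the game has moved on, so the action targets a stale frame:
    # in a fast shooter the enemy may already have closed the distance.
    game.send_input(action)
    return latency

if __name__ == "__main__":
    lag = play_step(Game(), VLMClient())
    print(f"action applied {lag:.1f}s after the frame it was based on")
```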

For software developers, Doom has long served as a litmus test for technological capability in gaming environments. Lawnmowers, Bitcoin, and even human gut bacteria have faced down the demons from hell with varying levels of success. Now it’s AI’s turn.

“What has brought Doom out of the shadows of the ’90s and into the modern light is not its riveting gameplay, but rather its appealing computational design,” MIT biotech researcher Lauren Ramlan previously said. “Built on the id Tech 1 engine, the game was designed to require only the most modest of setups to be played.”

In addition to struggling to understand game environments, the models often failed to perform basic in-game actions.

“We observed frequent instances where the agent had trouble understanding how its actions—such as moving right—would translate on screen,” the researchers said. “The most consistent failure across all frontier models we tested was an inability to reliably control the mouse in games like Civilization and Warcraft II, where precise and frequent mouse movements are essential.”
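To see why precision matters, here is a small illustrative example, under the assumption (mine, not the paper’s) that a harness converts a model’s click target from normalized coordinates into screen pixels: even a 5% error in the model’s estimate shifts the click by dozens of pixels, enough to miss a small UI button entirely.

```python
# Illustrative only -- the normalized-coordinate scheme and the button geometry
# below are assumptions for the example, not VideoGameBench's implementation.

SCREEN_W, SCREEN_H = 640, 480          # typical MS-DOS-era resolution

def to_pixels(norm_x: float, norm_y: float) -> tuple[int, int]:
    """Convert a model-supplied target in [0, 1] coordinates to screen pixels."""
    return round(norm_x * SCREEN_W), round(norm_y * SCREEN_H)

# A hypothetical 30x20-pixel UI button (left, top, right, bottom).
BUTTON = (300, 400, 330, 420)

def hits(px: int, py: int) -> bool:
    left, top, right, bottom = BUTTON
    return left <= px <= right and top <= py <= bottom

# A 5% horizontal error moves the click ~32 pixels and misses the button.
exact = to_pixels(0.492, 0.854)        # (315, 410) -> inside the button
off   = to_pixels(0.442, 0.854)        # (283, 410) -> outside the button
print(hits(*exact), hits(*off))        # True False
```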

To better expose the limitations of current AI systems, the researchers designed VideoGameBench to evaluate reasoning in environments that are both dynamic and complex.

“Unlike extremely difficult domains like unsolved math proofs and Olympiad-level math problems, playing video games is not a superhuman reasoning task, yet models still struggle to complete them,” they said.
