Meta's benchmarks for its new AI models are somewhat misleading

One of Meta's new flagship AI models released on Saturday, Maverick, ranks second on LM Arena, a test in which human raters compare the outputs of models and choose which they prefer. But it appears that the version of Maverick Meta deployed on LM Arena differs from the version that's widely available to developers.
As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an "experimental chat version." A chart on Llama's official website, meanwhile, reveals that Meta's LM Arena testing was conducted using "Llama 4 Maverick optimized for conversationality."
As we've written before, for various reasons, LM Arena has never been the most reliable measure of an AI model's performance. But AI companies generally haven't customized or otherwise tuned their models to score better on LM Arena, or at least haven't admitted to doing so.
The problem with tailoring a model to a benchmark, withholding that version, and then releasing a "vanilla" variant of the same model is that it makes it hard for developers to predict how well the model will perform in specific contexts. It's also misleading. Ideally, benchmarks, woefully inadequate as they are, provide a snapshot of a single model's strengths and weaknesses across a range of tasks.
Indeed, researchers on X have observed stark differences in behavior between the publicly downloadable Maverick and the model hosted on LM Arena. The LM Arena version seems to use lots of emoji and give extremely long-winded answers.
Okay, Llama 4 is def a little cooked lol, what a yap city this is pic.twitter.com/y3gvhbvz65
– Nathan Lambert (@natolambert) April 6, 2025
For some reason, the Llama 4 model in the Arena uses a lot more emoji
on together.ai, it seems better: pic.twitter.com/f74odx4zttt
– Tech Dev Notes (@techdevnotes) April 6, 2025
We've reached out to Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.