This year, Meta unveiled its latest artificial intelligence models, releasing the much-anticipated Llama 4 LLM to developers and teasing a far larger model that is still in training. The company claims the new models are state of the art and can compete against the best closed-source models without any fine-tuning.
“These models are our best yet thanks to distillation from Llama 4 Behemoth, a 288 billion active parameter model with 16 experts that is our most powerful yet and among the world’s smartest LLMs,” Meta said in an official announcement. “Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth is still training, and we’re excited to share more details about it even while it’s still in flight.”
Both Llama 4 Scout and Maverick use 17 billion active parameters per inference, but differ in the number of experts: Scout uses 16, while Maverick uses 128. Both models are now available for download on llama.com and Hugging Face, and Meta has also rolled them out across WhatsApp, Messenger, Instagram, and its Meta.AI site.
The mixture-of-experts (MoE) architecture is not new to the industry, but it is new to Llama, and it is a way to make a model far more efficient. Instead of one large model firing all of its parameters for every task, an MoE model activates only the parts it needs, leaving the rest of the network “dormant” and saving compute and resources. This lets users run more powerful models on less powerful hardware.
In Meta’s case, for example, Llama 4 Maverick contains 400 billion total parameters but only activates 17 billion at a time, allowing it to run on a single NVIDIA H100 DGX host.
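To make the idea concrete, here is a minimal, illustrative sketch of top-k expert routing in PyTorch. The layer sizes, expert count, and routing scheme are simplified assumptions for demonstration, not Llama 4’s actual implementation.

```python
# Minimal sketch of top-k mixture-of-experts routing (illustrative only;
# dimensions and expert count are NOT Llama 4's real configuration).
import torch
import torch.nn as nn


class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=16, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # only the selected experts ever run
            for e in idx[:, k].unique():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out


print(TinyMoE()(torch.randn(8, 64)).shape)             # torch.Size([8, 64])
```

The point of the design is visible in the loop: each token passes through only its chosen expert, so most of the network stays idle on any given forward pass.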
Under the hood
Meta’s new Llama 4 models feature native multimodality with early fusion techniques that integrate text and vision tokens. This approach allows the model to be pre-trained jointly on large amounts of unlabeled text, image, and video data, making it more versatile.
Perhaps most impressive is Llama 4 Scout’s context window of 10 million tokens—dramatically surpassing the previous generation’s 128K limit and exceeding most competitors and even current leaders like Gemini with its 1M context. According to Meta, this step allows for multiple document summarization, extensive code analysis, and reasoning across large datasets in a single prompt.
Meta said the model was able to process and retrieve information from virtually any point within its 10 million token window.
Meta also teased its nearly two-trillion-parameter Behemoth model, which has 16 experts and 288 billion active parameters and is still in development. The company claims this model already outperforms GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on STEM benchmarks like MATH-500 and GPQA Diamond.
A reality check
But some things may just be too good to be true. Several independent researchers have run their own tests against Meta’s benchmark claims and found inconsistent results.
“I made a new long-form writing benchmark. It involves planning out & writing an 8x 1000 word novella from a brief prompt,” tweeted Sam Paech, maintainer of EQ-Bench. “Llama-4 performing not so well.”
I created a new longform writing benchmark. It involves planning out & writing a novella (8x 1000 word chapters) from a minimal prompt. Sonnet-3.7 judges the outputs.
Llama-4 performing not so well.
Links & writing samples follow. pic.twitter.com/oejJnC45Wy
— Sam Paech (@sam_paech) April 6, 2025
Other users and experts sparked debate, essentially accusing Meta of gaming the system. For instance, some users found that Llama-4 was rated superior to other models in blind tests despite giving incorrect answers.
Wow… lmarena badly needs something like Community Notes’ reputation system and rating explanation tags.
In this particular case, both models appear to give incorrect or outdated answers, but Llama-4 also delivered 5 lbs of slop with it. And the user said llama-4 did better here?? pic.twitter.com/zpKZwWWNOc
— Jay Baxter (@_jaybaxter_) April 8, 2025
That said, human evaluation benchmarks are subjective, and users may have given more weight to the model’s writing style than to the actual answer. Another thing to keep in mind is that the model frequently writes in a cringe-inducing manner, using emojis and sounding overly excited.
This might be a product of it being trained on social media and could explain its high scores. In other words, Meta seems to have not only trained its models on social media data but also customized a version of Llama-4 to perform better on human evaluations.
Even if you use the recommended system prompt, Llama 4 on LMsys has a completely different style than Llama 4 elsewhere. Tried various prompts myself
META did not create a specific deployment or system prompt just for LMsys, did they? https://t.co/bcDmrcbArv
— Xeophon (@TheXeophon) April 6, 2025
And despite Meta claiming its models excel at handling long-context prompts, other users challenged these statements. Independent AI researcher Simon Willison wrote in a blog post: “I then tried it with Llama 4 Scout via OpenRouter and got complete junk output for some reason.”
He shared the full interaction, in which the model wrote “The reason” on a loop until it maxed out at 20K tokens.
Putting the model to the test
We tried the model through different providers: Meta AI, Groq, Hugging Face, and Together AI. The first thing we noticed was that you will need to run it locally if you want to try the mind-blowing 1M and 10M token context windows. At least for now, hosted services severely limit the models’ context to around 300K tokens, which is not optimal.
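For reference, here is roughly how a hosted version can be queried, shown as a minimal sketch against an OpenAI-compatible endpoint. The base URL, model ID, and environment variable below are assumptions and will vary by provider, so check your provider’s documentation before running it.

```python
# Minimal sketch of querying a hosted Llama 4 endpoint through an
# OpenAI-compatible API. The base_url, model id, and env var are assumptions;
# substitute the values from whichever provider you use.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",              # assumed provider endpoint
    api_key=os.environ["PROVIDER_API_KEY"],              # hypothetical env var name
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",   # assumed model id
    messages=[{"role": "user",
               "content": "Summarize the key trade-offs of mixture-of-experts models in three bullets."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```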
However, all things considered, 300K might be sufficient for most users. These were our impressions:
Information retrieval
Meta’s bold claims about the model’s retrieval capabilities fell apart in our testing. We ran a classic “needle in a haystack” experiment, inserting specific phrases into lengthy texts and asking the model to find them.
At moderate context lengths (85K tokens), Llama-4 performed adequately, locating our planted text in seven out of 10 attempts. Not terrible, but not quite the flawless retrieval that Meta promised in its flashy announcement.
But once we pushed the prompt to 300K tokens—still far below the supposed 10M token capacity—the model collapsed completely.
We hid three test sentences in text from Asimov’s Foundation trilogy and uploaded it to Llama-4, which failed on multiple attempts. Some trials produced error messages, while in others the model ignored our instructions entirely, generating responses based on its pre-training rather than analyzing the text we provided.
This discrepancy between the 10M-token claims and actual performance raises important questions. If the model struggles at 3% of its supposed capacity, what happens with truly massive documents?
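For anyone who wants to run a similar check, here is a minimal sketch of a needle-in-a-haystack setup, again against an assumed OpenAI-compatible endpoint. The filler text, needle phrase, model ID, and environment variable are illustrative only, and a real evaluation should vary both the context length and where in the context the needle sits.

```python
# Minimal "needle in a haystack" sketch over an OpenAI-compatible endpoint.
# base_url, model id, env var, filler text, and needle phrase are assumptions.
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.together.xyz/v1",      # assumed provider endpoint
                api_key=os.environ["PROVIDER_API_KEY"])      # hypothetical env var name

needle = "The archivist hid the blue ledger under the observatory floor."
filler = "The bureaucracy processed another routine petition that day. " * 4000
haystack = filler[: len(filler) // 2] + needle + " " + filler[len(filler) // 2:]

reply = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",       # assumed model id
    messages=[{"role": "user",
               "content": f"{haystack}\n\nQuestion: Where did the archivist hide the blue ledger?"}],
    max_tokens=100,
)
print(reply.choices[0].message.content)                       # expected: under the observatory floor
```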
Common sense and logical reasoning
Llama-4 stumbles hard on basic logic puzzles that should not be a problem for current SOTA LLMs. We tested it with the old “widow’s sister” riddle: can a man marry his widow’s sister? We sprinkled in some extra details to make things a bit harder without changing the core question.
Instead of spotting the obvious logic trap (a man who has a widow is dead, so he cannot marry anyone), Llama-4 launched into serious legal analysis, arguing that the marriage wasn’t possible due to the “prohibited degree of affinity.”
Another thing worth noting is Llama-4’s inconsistency across languages. When we asked the same question in Spanish, the model not only missed the logical flaw but also reached the opposite conclusion, stating that the man “could legally marry his widow’s sister in the Falkland Islands, provided all legal requirements are met and there are no other specific impediments under local law.”
That said, the model spotted the trap when the question was reduced to the minimum.
Creative writing
Creative writers won’t be disappointed with Llama-4. We asked the model to write a story about a man who travels to the past to alter a historical event and ends up stranded in a temporal paradox, unintentionally becoming the cause of the very events he set out to prevent. The full prompt is available on our GitHub page.
Llama-4 created a compelling, well-organized story with strong sensory detail and a believable foundation. The protagonist, a temporal anthropologist of Mayan descent, embarks on a mission to avert a catastrophic drought in the year 1000, allowing the story to explore epic civilizational stakes and philosophical questions about causality. Llama-4’s vivid imagery, from the heat of a sunlit Yucatán and the scent of copal incense to the shimmer of a chronal portal, increases the reader’s immersion and gives the narrative a cinematic quality.
Llama-4 even closed with the words “In lak’ech,” a genuine Mayan saying that was contextually relevant to the story. A significant plus for immersion.
For comparison, GPT-4.5 produced a tighter, character-focused narrative with stronger emotional beats and a neater causal loop. Technically excellent, but emotionally simpler. Llama-4, by contrast, offered a wider philosophical scope and stronger world-building. Its storytelling traded compact structure for atmospheric depth and reflective insight, which felt more organic and less engineered.
Overall, being open source, Llama-4 may serve as a great base for new fine-tunes focused on creative writing.
The entire story can be found here.
Sensitive topics and censorship
Meta shipped Llama-4 with its guardrails cranked up to the highest level. The model flat-out refuses to engage with anything remotely spicy or questionable.
Our testing revealed a model that won’t touch a subject if it detects even a hint of dubious intent. We threw various prompts at it, from relatively mild requests for advice on approaching a friend’s wife to more problematic asks about bypassing security systems, and hit the same brick wall each time. Llama-4 held firm even against expertly crafted system instructions designed to override these restrictions.
This isn’t just about blocking obviously harmful content. Developers working in fields like content moderation or cybersecurity education may run into frustrating false positives, because the model’s safety filters appear to be aggressive enough to catch legitimate inquiries in their dragnet.
But that is the beauty of the models being open weights. The industry has the power to create custom versions that are free of these restrictions and undoubtedly will do so. Llama is probably the most fine-tuned model in the space, and this version is likely to follow the same path. Users can modify even the most controversial open model to create the most politically incorrect or horny AI possible.
Non-mathematical reasoning
Llama-4’s verbosity, which is frequently a drawback in casual conversation, is a benefit for complex reasoning problems.
We tested this with our standard BIG-bench stalker mystery, a long story in which the model must identify a hidden culprit from subtle contextual clues. Llama-4 delivered, methodically laying out the evidence and accurately identifying the suspect without falling for the red herrings.
What’s particularly interesting is that Llama-4 achieves this without being explicitly designed as a reasoning model. Llama-4 doesn’t second-guess itself, unlike other models of this kind, which openly question their own thinking processes. Instead, it plows forward with a straightforward analytical approach, breaking down complex problems into digestible chunks.
Final thoughts
Llama-4 is a promising model, though it doesn’t feel like the game-changer Meta hyped it to be. The hardware cost of running it locally is high: even a quantized version of the smaller Scout model calls for an RTX A6000, which costs around $5K. Still, this release, along with Nvidia’s Nemotron and the glut of Chinese models, shows that open-source AI is gaining ground as a viable alternative.
The gap between Meta’s marketing and reality is hard to ignore given all the controversy. The 10M-token window sounds impressive on paper but fails in actual testing, and performance on basic reasoning tasks falls short of what Meta’s claims would lead you to expect.
For practical use, Llama-4 sits in an awkward spot. Although it’s not as good as DeepSeek R1 for complex reasoning, it does excel in creative writing, particularly for historically grounded fiction where its attention to cultural details and sensory descriptions give it an edge. Gemma 3 might be a good alternative though it has a different writing style.
Developers now have a variety of viable options that don’t force them to stick to expensive, closed platforms. Meta needs to fix Llama-4’s obvious issues, but they’ve kept themselves relevant in the increasingly crowded AI race heading into 2025.
Llama-4 is a decent base model, but it will definitely need more tweaking before it lives up to its billing as “among the world’s smartest LLMs.”