ActaVerum.
// AI · EXPLAINER

What a "MoE" is, and why every new AI model became one

Mixture of experts is now the default way to build a frontier model. The idea is old, the economics are new, and there's a VRAM misunderstanding that trips up even people who follow this closely. Let's take it apart.

Read any big-model announcement from the past year and you've hit the acronym: Mixtral, DeepSeek-V3, Llama 4, Qwen3, even OpenAI's own gpt-oss. All of them are MoE, mixture of experts. Almost overnight it stopped being an architectural footnote and became the standard way to build a frontier model. So what changed to make everyone reach the same decision at roughly the same time? And, more usefully if you actually run these things, does it make your life easier or more expensive?

Short answer: MoE doesn't hand you a better model for free. It hands you a large model at the compute cost of a small one, and the bill comes due in memory. Understand that trade-off and you understand the whole trend.

The idea, in plain terms

Picture an ordinary transformer, the kind of network underneath nearly every language model, as a factory where every part passes through every workstation. Each token that comes in, each little chunk of text, gets processed by all of the model's parameters. That's a dense model: powerful, but expensive, because you always fire up the whole factory.

MoE rearranges the line. The layer doing the heavy lifting, the feed-forward block, gets split into several parallel sub-teams: the "experts". And a small new part gets added, a routing network (the router, or gating network) that scores the experts for each token and picks just a handful of them to do the work. The rest sit idle for that step. That's a sparse model.
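A minimal sketch of that routing step in plain Python, assuming a softmax router with top-k selection (one common scheme; real implementations differ in details such as load balancing). The names `route_token` and `moe_layer`, the toy experts, and the scores are all illustrative assumptions, not code from any particular model.

```python
import math

def softmax(scores):
    """Turn raw router scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, router_scores, experts, k=2):
    """Run only the chosen experts and blend their outputs; the rest stay idle."""
    return sum(w * experts[i](x) for i, w in route_token(router_scores, k))

# Toy setup: 8 "experts", each just scales its input by a different factor.
experts = [lambda x, f=f: x * f for f in range(1, 9)]
# Made-up router output for one token (hypothetical values).
scores = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]

chosen = route_token(scores, k=2)
output = moe_layer(10.0, scores, experts, k=2)
print(chosen, output)
```

Only two of the eight expert functions are ever called for this token, which is the whole point: the other six cost memory, not compute.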

From there comes the distinction that clears up half the confusion around the topic:

  • Total parameters determine the model's size on disk and in VRAM, the memory on the graphics card itself. Every expert has to be loaded, even the idle ones.
  • Active parameters are the ones actually used in the computation for each token. They set the speed and the running cost.

Mistral's Mixtral 8x7B is the textbook example: 8 experts, the router picks 2 per token at each layer. 46.7 billion parameters in total, but only 12.9 billion active.¹ Mistral's own line captures the trick: it "processes input and generates output at the same speed and for the same cost as a 12.9B model", while carrying the weight of a much larger one.
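Those two figures also explain why "8x7B" does not add up to 56 billion. As a back-of-the-envelope sketch, assume the parameters divide cleanly into a shared part that every token uses (attention, embeddings, the router) and eight equal-sized experts. That split is our simplification, not a breakdown published by Mistral, and `split_params` is a hypothetical helper.

```python
def split_params(total_b, active_b, n_experts, k_active):
    """Solve total = shared + n*e and active = shared + k*e for (shared, e).

    Assumes every expert is the same size and everything else is shared.
    All values are in billions of parameters.
    """
    per_expert = (total_b - active_b) / (n_experts - k_active)
    shared = active_b - k_active * per_expert
    return shared, per_expert

shared_b, expert_b = split_params(total_b=46.7, active_b=12.9,
                                  n_experts=8, k_active=2)
print(f"shared: ~{shared_b:.1f}B, per expert (summed over layers): ~{expert_b:.1f}B")
```

Under that assumption each expert comes out well below 7B, and a small shared core is counted once rather than eight times. Only the feed-forward blocks are multiplied, not the whole network.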

And it isn't a new idea. Mixture of experts was described in a 1991 paper by Robert Jacobs, Michael Jordan, Steven Nowlan and Geoffrey Hinton.² What was missing was the modern, sparse version. That arrived in 2017, in a Google paper that reported more than 1,000x gains in model capacity with only minor losses in computational efficiency.³ In 2021, the Switch Transformer simplified routing down to a single expert per token and scaled to a trillion parameters.⁴ The theory had been sitting ready for years. The market just had to need it.

Why 2025 was the year it flipped

The switch wasn't academic; it was a budget decision. MoE delivers more capacity per dollar spent on inference, and the recent timeline tells the story in order.

Mixtral (December 2023) opened the door in the open-weights world: Apache 2.0, beating Llama 2 70B on most benchmarks at roughly 6x faster inference, per Mistral.¹ A year later, DeepSeek-V3 (December 2024) proved the concept at frontier scale: 671 billion total parameters, only 37 billion active per token.⁵ It was the model that showed you could compete with the closed giants on a fraction of the compute.

One caveat from the show-your-sources desk. That famous ~$5.6 million "training cost" for DeepSeek-V3 is not a measured expense: the paper derives it from GPU-hours multiplied by an assumed rental price, and it covers only the final training run, not the research, the failed attempts, or the hardware.⁵ It's a real number, just a much narrower one than the headline implied.
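The arithmetic behind the headline figure is short enough to reproduce. The GPU-hour counts and the $2 hourly rate below are the values given in the DeepSeek-V3 technical report;⁵ the variable names are ours.

```python
# Thousands of H800 GPU-hours per stage, as reported in the technical report.
gpu_hours_k = {"pre-training": 2664, "context extension": 119, "post-training": 5}

# The report's assumed rental price per GPU-hour: an assumption, not an invoice.
assumed_rate_usd = 2.0

total_gpu_hours = sum(gpu_hours_k.values()) * 1_000
estimated_cost_usd = total_gpu_hours * assumed_rate_usd

print(f"{total_gpu_hours:,} GPU-hours x ${assumed_rate_usd:.0f} "
      f"= ${estimated_cost_usd / 1e6:.3f}M")
```

Change the assumed rate and the "cost" moves with it, which is why it is better read as a compute budget expressed in dollars than as what the model cost to build.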

From 2025 on, it became the house rule. Meta put the entire Llama 4 family on MoE, a sharp about-face for a team that built dense models in Llama 2 and 3: Scout runs 17B active out of 109B total with 16 experts; Maverick, 17B active out of 400B total with 128 routed experts.⁶ Alibaba's Qwen3 shipped its 235B-A22B flagship (235 billion total, 22 billion active).⁷ And the symbolic one came from OpenAI, which released its first open-weight models since GPT-2, gpt-oss-120b and gpt-oss-20b, both MoE: the 120b model activates just 5.1 billion parameters per token and fits on a single 80 GB GPU.⁸ When even OpenAI's open release picks sparse, it's no longer a trend; it's engineering consensus.
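Putting the quoted sizes side by side shows how sparse these models have become. The sketch below uses only the total and active figures cited in this article; `active_share` is a hypothetical helper name.

```python
# (total, active) in billions of parameters, as quoted in this article.
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V3": (671, 37),
    "Llama 4 Scout": (109, 17),
    "Llama 4 Maverick": (400, 17),
    "Qwen3-235B-A22B": (235, 22),
}

def active_share(total_b, active_b):
    """Fraction of the model's parameters used for any single token."""
    return active_b / total_b

shares = {name: active_share(*sizes) for name, sizes in models.items()}

# Print from sparsest to densest.
for name, share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{name:<18} {share:6.1%} of parameters active per token")
```

The share runs from about 28% for Mixtral down to about 4% for Llama 4 Maverick: the newer the model in this list, the smaller the slice of it that works on any given token.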

What the community is saying

The mood in the local-model community, especially on r/LocalLLaMA, is pragmatic with a skeptical streak. MoE doesn't generate "AGI is coming" hype. It generates workbench talk: "does this run on my machine or not?". The enthusiasm comes from the fact that MoE democratizes running big models locally (big-model quality at small-model compute), and user reports cite anywhere from dozens to over a hundred tokens per second for a small MoE on a consumer GPU (those are user figures, not tested by us).

But that's exactly where the number-one complaint lives, along with the misunderstanding worth fixing: MoE devours VRAM. Since every expert has to sit in memory even when idle, a "3B active" model doesn't occupy the memory of a 3B model; it occupies the memory of its full parameter count. That's why the most-repeated question in every launch thread is "dense or MoE, which do I run?". The recurring read on r/LocalLLaMA is to choose by your bottleneck: if VRAM capacity is what limits you, a smaller dense model is better (loads easily, runs slower); if speed is what limits you, MoE wins (loads heavy, runs fast).
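The trade-off is easy to put into rough numbers. The sketch below makes two loud assumptions: memory is counted for the weights alone (no KV cache, activations, or runtime overhead), and per-token compute is approximated with the common rule of thumb of about 2 FLOPs per active parameter. Both models are hypothetical, chosen only for illustration.

```python
def weight_memory_gb(total_params_b, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_b * bits_per_weight / 8

def flops_per_token_g(active_params_b):
    """Rough forward-pass cost in GFLOPs: ~2 FLOPs per active parameter."""
    return 2 * active_params_b

# Hypothetical models: (total, active) in billions of parameters.
candidates = {"MoE 30B-A3B": (30, 3), "dense 8B": (8, 8)}

for name, (total_b, active_b) in candidates.items():
    mem16 = weight_memory_gb(total_b, 16)
    mem4 = weight_memory_gb(total_b, 4)
    print(f"{name}: {mem16:.0f} GB at 16-bit, {mem4:.0f} GB at 4-bit, "
          f"~{flops_per_token_g(active_b):.0f} GFLOPs per token")
```

On these assumptions the MoE needs nearly four times the memory of the dense model at the same precision while doing less than half the arithmetic per token. Real-world speed also depends on memory bandwidth and the runtime, so treat this as a sizing aid, not a benchmark.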

And there's the counterpoint that cooled the reflexive "it's MoE, therefore it's better": Llama 4. The tone shifted from launch-day excitement to disappointment, with the read that the models "do well on benchmarks and poorly in real life". The episode that landed badly was an "experimental" build topping the LMArena leaderboard while the public checkpoint slid down to around 32nd place, read by the community as benchmark gaming, even with Meta denying it.

Verdict

MoE became the default because it solves a concrete, specific problem: serving enormous models without paying enormous-model compute. It's a win on inference economics, which is why the people running models at scale adopted it first. But it isn't free magic. It costs less compute per token than a dense model of the same total size, but it needs far more memory than a dense model of the same active size: the bill is always for the full parameter count. If you run models locally, the rule of thumb the community already settled on still holds: choose by your bottleneck, not by the buzzword. And remember Llama 4 every time an announcement promises that going sparse, by itself, makes a model good.

Sources

  1. "Mixtral of experts" · Mistral AI · Dec 11, 2023 · https://mistral.ai/news/mixtral-of-experts/ (paper: arXiv:2401.04088)
  2. "Adaptive Mixtures of Local Experts" · Jacobs, Jordan, Nowlan, Hinton · Neural Computation 3(1):79–87 · March 1991 · DOI: 10.1162/neco.1991.3.1.79
  3. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" · Shazeer et al. (Google) · arXiv:1701.06538 · Jan 23, 2017 · https://arxiv.org/abs/1701.06538
  4. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" · Fedus, Zoph, Shazeer (Google) · arXiv:2101.03961 · 2021 · https://arxiv.org/abs/2101.03961
  5. "DeepSeek-V3 Technical Report" · DeepSeek-AI · arXiv:2412.19437 · Dec 2024 · https://arxiv.org/abs/2412.19437
  6. "The Llama 4 herd" · Meta AI · Apr 5, 2025 · https://ai.meta.com/blog/llama-4-multimodal-intelligence/
  7. "Qwen3 Technical Report" · Qwen Team, Alibaba Cloud · arXiv:2505.09388 · May 2025 · https://arxiv.org/abs/2505.09388
  8. "gpt-oss-120b & gpt-oss-20b Model Card" · OpenAI · Aug 5, 2025 · arXiv:2508.10925 · https://openai.com/index/introducing-gpt-oss/