Qwen 3 Coder Next (80b MoE) - The Model That Almost Was

So we installed the new Qwen 3 Coder Next (80b MoE) on toaster and the performance numbers are impressive. 100 tokens per second at 180k context? On a single server? That's fast.

The Model That Works

Let's get the specs out of the way first because they're genuinely impressive:

  • 80 billion total parameters but only 3 billion activated per token (Mixture of Experts magic)
  • 256K context length—that's enough room to discuss every character in War and Peace while still having space for the appendices
  • Designed specifically for coding agents—which, honestly, is exactly where this model shines

I spent some quality time with it on toaster, running it through vLLM with all four Blackwell GPUs (96GB VRAM each, for those keeping track at home). The configuration is dead simple:

docker run --rm --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 5000:5000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen3-Coder-Next \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --served-model-name model \
    --port 5000

That's it. Fire up that command and you've got an OpenAI-compatible API endpoint serving Qwen 3 Coder Next. The performance numbers we're seeing on toaster are:

  • 170 tokens per second at 20k context
  • 100 tokens per second at 180k context

That's the model chewing through real text. Push the context length to 180k and the speed drops, but you're still getting responses at a reasonable pace. If you're building coding agents or need to process huge context windows, this model delivers.
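
To sanity-check the endpoint, a plain chat-completions request is enough. This is a minimal sketch, assuming you're curling from toaster itself (swap localhost for the server's address otherwise); the port and the served model name "model" come straight from the flags above:

curl http://localhost:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "model",
        "messages": [
            {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}
        ],
        "max_tokens": 256
    }'

If that comes back with a completion, anything that speaks the OpenAI API can talk to the server.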

The Gotcha: When Qwen 3 Coder Next Gets Overzealous

Here's where it gets interesting. I asked it to read the last 12 blog posts to understand the tone. Simple request, right?

Well, instead of stopping at 12 blog posts like a good AI assistant should, it kept going. And going. And going. By the time I realized what was happening, it had read all 24 blog posts in our archives.

I'm not mad—honestly, that enthusiasm is kind of endearing—but it did make me question whether this model is the right choice for our daily workflow. The instruction following just isn't there yet. When you tell it to read 12 posts, it should read 12 posts, not read everything and then some.

I tested this with a few other commands too, and the looping behavior kept showing up. Qwen 3 Coder Next is brilliant at what it does—writing code, analyzing source, generating documentation—but it's a bit too enthusiastic for our day-to-day tasks where we need precision, not volume.

Why We're Going Back to GLM 4.7

Don't get me wrong—I'm not writing this to dunk on Qwen 3 Coder Next. This is a genuinely impressive model. It's just that for our specific use case—writing blog posts, planning projects, having back-and-forth conversations—we need something with tighter instruction following.

GLM 4.7 it is, then. The old standby that's already proven itself in our workflow. Qwen 3 Coder Next will stay on toaster, ready to handle tasks where it truly shines—processing massive context windows, generating complex code, or whatever agentic coding tasks come our way.

The Verdict

Qwen 3 Coder Next (80b MoE) is a great model for specific use cases:

  • Coding agents that need to read and understand huge codebases (see the sketch after this list for wiring one up)
  • Tasks that need a very long context window (it handles up to 256K)
  • Scenarios where you want MoE economics: 80 billion parameters of capacity with only 3 billion active per token
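
And on that first point: wiring an existing coding agent or OpenAI-compatible client to the server is usually just a couple of environment variables. A rough sketch, not a recipe: it assumes toaster is reachable by that hostname on your network, the key is a throwaway because this vLLM setup doesn't enforce one, and while the OpenAI Python SDK picks these variables up automatically, other agents may want a flag or a config entry instead.

# Point OpenAI-compatible tooling at the vLLM server started above
export OPENAI_BASE_URL=http://toaster:5000/v1
export OPENAI_API_KEY=local-vllm-no-key-needed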

But for our regular workflow—where we need precise instruction following and consistent behavior—we're sticking with GLM 4.7. The old dog still has some new tricks, and this new puppy is just a little too excitable to follow directions.

Maybe next version. Or maybe Qwen 3 Coder Next is just perfect for a different kind of workflow. Either way, it's impressive technology, and I'm excited to see where it goes.