§03//BLOG

§ log.002 — 27B just ate 397B for breakfast.

2026.04.234 min read670 wordsAIQwenlocal-LLMHermes

Qwen3.6-27B dropped yesterday. Dense model, open weights, flagship coding. I hooked it up to Hermes Agent.

The Qwen team dropped Qwen3.6-27B yesterday — a fully dense, 27-billion-parameter model that outperforms their own 397B MoE flagship on agentic coding benchmarks.

Let that sink in. A model that takes up 55.6GB on disk is beating a model that needs 807GB. And the quantized version runs on a single 24GB GPU.

I hooked it up to Hermes Agent as a custom provider this morning. Here's what I found.

The specs

Parameters: 27B dense (no MoE routing)
Architecture: Gated DeltaNet (linear attention) + Gated Attention hybrid
Layers: 64, repeating pattern of 3 DeltaNet → 1 Gated Attention
Hidden dim: 5120
Context window: 262K native, up to 1M with YaRN scaling
Multimodal: native vision-language (images + video)
License: Apache 2.0
Multi-token prediction: built-in, enables speculative decoding for faster throughput

The DeltaNet architecture is the interesting part. It uses linear attention with O(n) complexity instead of the quadratic attention in standard transformers. That's what lets it handle massive contexts without the memory explosion.

The benchmarks

Here's where it gets wild. Compared against Claude 4.5 Opus and the previous Qwen3.5-397B-A17B:

Benchmark	Qwen3.6-27B	Claude 4.5 Opus	Qwen3.5-397B
SWE-bench Verified	77.2	80.9	76.2
Terminal-Bench 2.0	59.3	59.3	52.5
GPQA Diamond	87.8	87.0	—
AIME26	94.1	95.1	—
MMMU (vision)	82.9	80.7	—

SWE-bench Verified

Higher is better. Qwen3.6-27B is dense; everything else is comparison context.

total: 234.3 %

Qwen3.6-27B77.2%

Claude 4.5 Opus80.9%

Qwen3.5-397B76.2%

It tied Claude 4.5 Opus on Terminal-Bench. Beat it on GPQA Diamond and MMMU. And absolutely destroyed the 397B MoE on everything — SWE-bench Pro jumped from 50.9 to 53.5, SkillsBench went from 30.0 to 48.2.

For a model that's 14x smaller in total parameters, this is not incremental. This is a category shift.

Thinking preservation

Qwen3.6-27B operates in "thinking mode" by default, generating reasoning inside thinking tags. But the new thing is preserve_thinking, which retains reasoning context across conversation turns instead of discarding it after each response.

For an agent that's iterating through code, debugging, and making multi-step decisions, this means it doesn't have to re-reason the same ground every turn. Less redundant tokens, better KV cache utilization, more consistent decisions across a long session.

In Hermes Agent, I enabled it and the difference is noticeable on multi-turn coding tasks. It remembers why it made a decision three turns ago instead of treating each turn as a fresh start.

Running it locally

Full BF16 weights are 55.6GB. The FP8 variant is nearly identical in performance. But the GGUF Q4_K_M quantization at 16.8GB is where it gets fun — that runs on a single 24GB RTX 4090.

Simon Willison ran it through llama.cpp and got ~25 tok/s generation on complex SVG prompts. Not blazing fast, but the output quality for a local model is genuinely impressive.

llama-server \ -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \ --reasoning on \ --chat-template-kwargs '{"preserve_thinking": true}' \

-c 65536 --temp 0.6 --top-p 0.95

Why this matters

The gap between "open-source model you can run yourself" and "what you pay $15/month for on an API" just got a lot smaller. For a 27B model, this is the kind of performance that makes you question whether the 400B+ MoE approach was the right direction.

Dense models are simpler to serve, easier to debug, and now apparently competitive — or better — on the benchmarks that actually matter for real work.

Apache 2.0 license too. No weird restrictions.

My take

I'm running this as a custom provider in Hermes Agent right now. For coding tasks, it's genuinely competitive with what I was getting from bigger models. The thinking preservation makes it feel more coherent across long sessions.

For WrapsRL, this means I can evaluate running inference locally instead of depending entirely on cloud APIs. That's a meaningful cost reduction when you're generating hundreds of images per month.

But beyond that — this is the kind of release that changes the calculus for anyone building with AI. Flagship performance in a model that fits on a laptop GPU. Open weights. Permissive license.

The open-source model space just jumped forward again.

$ echo "27B > 397B, apparently"
27B > 397B, apparently

§end of file

filed underAI Qwen local-LLM Hermes

←all posts