workblogaboutcontactresume ↓

§03//BLOG

§ log.004 — DeepSeek V4 dropped today. Benchmarks are wild, tool calling is broken.

6 min read1,258 wordsAIDeepSeekopen-sourceHermesfrontier-models

DeepSeek released V4-Pro (1.6T params) and V4-Flash (284B params). The benchmarks look incredible for the price — but I ran Flash through Hermes Agent and it completely falls apart on tool calls.

update — april 25, 2026: Less than 24 hours after publishing this post, I ran hermes update and the tool calling issues are completely gone. V4-Flash now handles structured tool calls without any errors. Zero issues so far. OpenRouter/DeepSeek patched it fast — genuinely impressed by how quickly the internet works. The rest of this post stands as my initial impressions, but the tool calling is no longer a blocker.

DeepSeek released V4 today. Two models

I hooked up V4-Flash to Hermes Agent this morning. The tool calling is completely broken — it outputs tool calls as regular messages then stops mid-execution. Weird error. I'll get into that.

First, let's talk about why everyone's excited.

the specs

SpecV4-ProV4-Flash
Parameters1.6T (MoE)284B (MoE)
Active params per token~37B~37B
Context window1M tokens1M tokens
ArchitectureMoE + Engram MemoryMoE + Engram Memory
MultimodalNative text, image, videoSame
LicenseApache 2.0 (planned)Apache 2.0 (planned)
The MoE architecture means both models only activate ~37B parameters per token despite the massive total parameter counts. That's what keeps inference costs low and throughput high.

Engram Memory is the new thing — a conditional memory mechanism that selectively stores and retrieves information based on relevance signals instead of relying purely on standard attention. DeepSeek claims 97% accuracy on Needle-in-a-Haystack at 1M tokens (compared to ~84% for standard attention). If this holds up, it could simplify RAG pipelines significantly — just pass the whole codebase and let Engram handle retrieval.

Also notable: V4 was reportedly trained entirely without Nvidia hardware, using Huawei Ascend 910B and Cambricon MLU chips. In the current geopolitical climate, that's a statement in itself.

the benchmarks (with caveats)

Here are the internal numbers DeepSeek is sharing:

ModelHumanEvalSWE-bench Verified
DeepSeek V490%80%+
Claude Opus 4.5~88%80.9%
GPT-5.3 Codex~87%~80%
DeepSeek V3~82%~49%
The jump from V3 to V4 on SWE-bench (49% → 80%+) is extraordinary. But here's the important part: these are internal benchmarks that haven't been independently verified. The "83.7% SWE-bench" graphic circulating on X has already been confirmed as fake and denied by DeepSeek.

Treat these numbers as aspirational until LMSYS or BigCode runs their own evaluations. That said, even if the real numbers come in 5-10% lower than claimed, this would still be a massive leap for an open-source model at this price point.

the pricing (this is where it gets interesting)

Official DeepSeek API pricing:

ModelInput /MTokOutput /MTok
V4-Flash$0.14 ($0.028 cache hit)$0.28
V4-Pro~$0.27~$1.10
But here's the thing — if you're routing through OpenRouter (like I am with Hermes), the pricing looks different:
ModelInput /MTokOutput /MTok
V4-Flash$0.14$0.28
V4-Pro$1.74$3.48
Qwen3.6 Plus$0.325$1.95
V4-Flash pricing is consistent everywhere — sub-$0.15/MTok input for a 284B parameter model claiming frontier benchmarks. That's the real headline.

But V4-Pro on OpenRouter? The markup is steep. At $1.74/$3.48, Qwen3.6 Plus is actually cheaper than DeepSeek V4 Pro on both input and output through OpenRouter. If you're already on the Qwen3.6 Plus tier and getting solid results, there's not a clear cost reason to upgrade to V4-Pro right now — at least not through OpenRouter.

Flash though? That's a no-brainer price point for testing.

my experience: tool calling is broken

This is where the excitement hits a wall.

I configured V4-Flash as a custom provider in Hermes Agent and started running it through standard workflows — file reads, terminal commands, web searches. The model generates reasonable text responses for straightforward prompts. But the moment it needs to call a tool, things fall apart.

The specific error: V4-Flash outputs tool calls as regular message content instead of structured function calls, then stops generating entirely. It's like the model knows it should use a tool but can't format the output correctly, so it just... halts. Mid-task. No recovery.

This isn't unique to my setup — there are already reported issues with DeepSeek V3.2 having frequent tool call parsing failures in vLLM when reasoning mode is enabled. It seems like a pattern across their model line.

For an agentic framework like Hermes that relies entirely on structured tool calls, this makes V4-Flash essentially unusable right now. You can use it for chat and text generation, but the moment you need it to actually do something — read a file, run a command, make an API call — it breaks.

benchmark overfitting or real intelligence?

This is the question keeping me up tonight.

The SWE-bench jump from 49% (V3) to 80%+ (V4) is so dramatic that it raises eyebrows. That's not a normal model improvement curve — that's almost like someone specifically optimized for those benchmarks.

Meanwhile, in practice, the model can't even reliably call a tool through an agent framework. If V4 truly has 80% SWE-bench capability, shouldn't it be able to handle structured function calling? Tool use is fundamental to software engineering workflows.

I'm not saying DeepSeek faked anything. But there's a real tension between benchmark performance and practical agentic ability that's worth paying attention to. It's possible V4 was heavily fine-tuned on SWE-bench-style tasks without developing the underlying generalization needed for novel tool-use scenarios.

Or it could be an early release issue — DeepSeek called this a "preview version" — and they'll patch the tool calling in subsequent updates. That's entirely plausible.

why i'm still excited

Despite the tool calling issues, I'm genuinely hyped about V4. Here's why:

The open-source frontier just got cheaper. A model claiming Claude Opus-level performance at $0.14/MTok input changes what's economically viable for small teams and indie builders. For WrapsRL, this means evaluating whether we can run more of our pipeline through a cheaper model without quality loss.

Engram Memory is worth watching. If the 97% Needle-in-a-Haystack accuracy holds up at scale, it could reduce or eliminate the need for complex RAG chunking strategies. That's a genuine architectural improvement, not just benchmark gaming.

Apache 2.0 licensing. No restrictions, no weird terms. Full commercial use and modification rights. In a world where most frontier models are locked behind proprietary APIs, this matters.

Trained without Nvidia hardware. The geopolitical implications of a competitive model trained entirely on domestic Chinese silicon shouldn't be understated. It proves you don't need H100s to build frontier AI — at least not anymore.

my take

V4 is a significant release that deserves attention, but I'm approaching it with measured excitement rather than blind hype. The benchmarks are impressive (but unverified), the pricing is aggressive, and the architecture has genuine innovations. But the tool calling issues I experienced firsthand suggest there's still work to do before this model is production-ready for agentic workflows.

I'll keep V4-Flash configured in Hermes as a secondary option and revisit it when DeepSeek addresses the function calling bugs. For now, Qwen3.6 Plus remains my go-to for cost-effective reasoning tasks that actually complete.

The open-source AI space is moving fast. Today it's DeepSeek. Tomorrow it'll be something else entirely. That's what makes this exciting — not any single model, but the pace of progress itself.

$ echo "great benchmarks, broken tool calls, still worth watching"

great benchmarks, broken tool calls, still worth watching

§end of file