AI & Tech Desk · 2 min read

DeepSeek Open-Sources Its Training Stack and Chips at CUDA's Moat

DeepSeek's latest releases matter less as model marketing than as a serious open-source bid for the software layer that governs AI clusters, training jobs and inference economics.

DeepSeek did not just publish another model checkpoint. In a span of weeks, the Chinese lab pushed out DeepEP, DeepGEMM, FlashMLA and EPLB, four repositories that expose how it handles expert-parallel communication, matrix math, attention kernels and load balancing inside large-scale mixture-of-experts systems.

The individual numbers are the sort that infrastructure engineers notice immediately: DeepEP says its intranode dispatch reaches 153 GB/s on H800 NVLink, its internode dispatch hits 58 GB/s over RDMA at 32-way expert parallelism, and its low-latency path can keep dispatch time at 194 microseconds even at 256-way expert parallelism. DeepGEMM says it can reach 1,550 TFLOPS on H800. FlashMLA says its updated kernels hit 660 TFLOPS on H800 SXM5. On GitHub, developers treated the releases less like research curiosities than like usable building blocks: FlashMLA had more than 12,500 stars, DeepEP about 9,200 and DeepGEMM nearly 7,000 as of April 24.

That is why this story matters. DeepSeek is turning internal systems craft into public infrastructure. For rivals, startups and open labs, that offers a clearer path to reproducing frontier-scale efficiency outside the walls of OpenAI, Anthropic and Google. For Nvidia, the threat is subtler: not a sudden loss of chip demand, but the first credible signs that the software habits tying AI builders to CUDA can be loosened from the outside.
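
To ground what a headline figure like 1,550 TFLOPS means in practice, the sketch below shows the standard way GEMM throughput is measured: time a matrix multiply and divide the 2·M·N·K floating-point operations by the elapsed time. This is not DeepSeek's code; the helper name gemm_tflops, the matrix shapes and the bf16 dtype are illustrative (DeepGEMM's own number is for FP8 kernels), but the accounting is the same.

```python
# Minimal sketch of measuring GEMM throughput in TFLOPS, assuming PyTorch with
# a CUDA device. Shapes, dtype and the gemm_tflops name are illustrative only.
import torch

def gemm_tflops(m: int, n: int, k: int, iters: int = 50) -> float:
    a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)
    # Warm up so one-time launch and autotuning costs are excluded from timing.
    for _ in range(5):
        _ = a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        _ = a @ b
    end.record()
    torch.cuda.synchronize()
    seconds_per_iter = start.elapsed_time(end) / 1e3 / iters  # elapsed_time is in ms
    # A matrix multiply performs 2*M*N*K floating-point operations.
    return 2 * m * n * k / seconds_per_iter / 1e12

if __name__ == "__main__":
    print(f"{gemm_tflops(4096, 4096, 4096):.0f} TFLOPS")
```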

DeepEP Turns MoE Networking Into a Reusable Product

DeepEP packages 153 GB/s intranode bandwidth and 194-microsecond dispatch into software other labs can actually deploy.

The mechanical significance of DeepSeek's open-sourcing push starts with DeepEP, because MoE systems live or die on the cost of moving tokens to the right experts and getting results back without wasting GPU cycles. DeepEP is built for all-to-all communication in expert parallelism, the part of the training and inference loop where activations have to be dispatched across GPUs and then combined. In practice, that is often where ambitious cluster designs become expensive bottlenecks. By publishing a library tuned for high-throughput and low-latency paths, DeepSeek is handing the market something far more valuable than a benchmark slide: an implementation.
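
To make the dispatch/combine pattern concrete, here is a conceptual sketch written with plain torch.distributed collectives rather than DeepEP's own API; the function name dispatch_and_combine, the equal-sized-buffer simplification and the toy local expert module are all assumptions for illustration. Each rank sends the tokens its router assigned to remote experts, runs its local experts on whatever arrives, and a second all-to-all returns the outputs to the ranks that originally held the tokens.

```python
# Conceptual sketch of expert-parallel all-to-all dispatch and combine, using
# generic torch.distributed collectives (NOT DeepEP's API). Assumes the process
# group has already been initialized with an NCCL backend.
import torch
import torch.distributed as dist

def dispatch_and_combine(tokens_for_rank: list[torch.Tensor],
                         local_expert: torch.nn.Module) -> list[torch.Tensor]:
    """tokens_for_rank[i] holds the tokens this rank routes to rank i's experts.

    Simplification: every slot is assumed to carry the same number of tokens,
    so receive buffers can mirror the send buffers; real systems handle
    variable per-expert token counts. The expert is assumed to preserve the
    hidden dimension.
    """
    # Dispatch: all-to-all so each rank receives the tokens routed to its experts.
    received = [torch.empty_like(t) for t in tokens_for_rank]
    dist.all_to_all(received, tokens_for_rank)

    # Local compute: run this rank's expert(s) on the tokens it now owns.
    processed = [local_expert(t) for t in received]

    # Combine: the reverse all-to-all returns each token's output to the rank
    # that originally held it.
    combined = [torch.empty_like(t) for t in processed]
    dist.all_to_all(combined, processed)
    return combined
```

The generic version above is functionally what an MoE layer needs, but it leaves bandwidth and latency on the table; DeepEP's published numbers come from replacing exactly this hop with kernels tuned for NVLink inside a node and RDMA across nodes.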
