AI & Tech Desk · 2 min read

DeepSeek Open-Sources Its Training Stack and Chips at CUDA's Moat

DeepSeek's latest releases matter less as model marketing than as a serious open-source bid for the software layer that governs AI clusters, training jobs and inference economics.

DeepSeek did not just publish another model checkpoint. In a span of weeks, the Chinese lab pushed out DeepEP, DeepGEMM, FlashMLA and EPLB, four repositories that expose how it handles expert-parallel communication, matrix math, attention kernels and load balancing inside large-scale mixture-of-experts systems.

The individual numbers are the sort that infrastructure engineers notice immediately: DeepEP says its intranode dispatch reaches 153 GB/s on H800 NVLink, its internode dispatch hits 58 GB/s over RDMA at 32-way expert parallelism, and its low-latency path can keep dispatch time at 194 microseconds even at 256-way expert parallelism. DeepGEMM says it can reach 1,550 TFLOPS on H800. FlashMLA says its updated kernels hit 660 TFLOPS on H800 SXM5. On GitHub, developers treated the releases less like research curiosities than like usable building blocks: FlashMLA had more than 12,500 stars, DeepEP about 9,200 and DeepGEMM nearly 7,000 as of April 24.

That is why this story matters. DeepSeek is turning internal systems craft into public infrastructure. For rivals, startups and open labs, that offers a clearer path to reproducing frontier-scale efficiency outside the walls of OpenAI, Anthropic and Google. For Nvidia, the threat is subtler: not a sudden loss of chip demand, but the first credible signs that the software habits tying AI builders to CUDA can be loosened from the outside.
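
To ground what a headline figure like 1,550 TFLOPS means in practice, the sketch below shows the standard way GEMM throughput is measured: time a matrix multiply and divide the 2·M·N·K floating-point operations by the elapsed time. This is not DeepSeek's code; the helper name gemm_tflops, the matrix shapes and the bf16 dtype are illustrative (DeepGEMM's own number is for FP8 kernels), but the accounting is the same.

```python
# Minimal sketch of measuring GEMM throughput in TFLOPS, assuming PyTorch with
# a CUDA device. Shapes, dtype and the gemm_tflops name are illustrative only.
import torch

def gemm_tflops(m: int, n: int, k: int, iters: int = 50) -> float:
    a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)
    # Warm up so one-time launch and autotuning costs are excluded from timing.
    for _ in range(5):
        _ = a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        _ = a @ b
    end.record()
    torch.cuda.synchronize()
    seconds_per_iter = start.elapsed_time(end) / 1e3 / iters  # elapsed_time is in ms
    # A matrix multiply performs 2*M*N*K floating-point operations.
    return 2 * m * n * k / seconds_per_iter / 1e12

if __name__ == "__main__":
    print(f"{gemm_tflops(4096, 4096, 4096):.0f} TFLOPS")
```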

DeepEP Turns MoE Networking Into a Reusable Product

DeepEP packages 153 GB/s intranode bandwidth and 194-microsecond dispatch into software other labs can actually deploy.

The mechanical significance of DeepSeek's open-sourcing push starts with DeepEP, because MoE systems live or die on the cost of moving tokens to the right experts and getting results back without wasting GPU cycles. DeepEP is built for all-to-all communication in expert parallelism, the part of the training and inference loop where activations have to be dispatched across GPUs and then combined. In practice, that is often where ambitious cluster designs become expensive bottlenecks. By publishing a library tuned for high-throughput and low-latency paths, DeepSeek is handing the market something far more valuable than a benchmark slide: an implementation.
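
To make the dispatch/combine pattern concrete, here is a conceptual sketch written with plain torch.distributed collectives rather than DeepEP's own API; the function name dispatch_and_combine, the equal-sized-buffer simplification and the toy local expert module are all assumptions for illustration. Each rank sends the tokens its router assigned to remote experts, runs its local experts on whatever arrives, and a second all-to-all returns the outputs to the ranks that originally held the tokens.

```python
# Conceptual sketch of expert-parallel all-to-all dispatch and combine, using
# generic torch.distributed collectives (NOT DeepEP's API). Assumes the process
# group has already been initialized with an NCCL backend.
import torch
import torch.distributed as dist

def dispatch_and_combine(tokens_for_rank: list[torch.Tensor],
                         local_expert: torch.nn.Module) -> list[torch.Tensor]:
    """tokens_for_rank[i] holds the tokens this rank routes to rank i's experts.

    Simplification: every slot is assumed to carry the same number of tokens,
    so receive buffers can mirror the send buffers; real systems handle
    variable per-expert token counts. The expert is assumed to preserve the
    hidden dimension.
    """
    # Dispatch: all-to-all so each rank receives the tokens routed to its experts.
    received = [torch.empty_like(t) for t in tokens_for_rank]
    dist.all_to_all(received, tokens_for_rank)

    # Local compute: run this rank's expert(s) on the tokens it now owns.
    processed = [local_expert(t) for t in received]

    # Combine: the reverse all-to-all returns each token's output to the rank
    # that originally held it.
    combined = [torch.empty_like(t) for t in processed]
    dist.all_to_all(combined, processed)
    return combined
```

The generic version above is functionally what an MoE layer needs, but it leaves bandwidth and latency on the table; DeepEP's published numbers come from replacing exactly this hop with kernels tuned for NVLink inside a node and RDMA across nodes.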
