How to Train a 1.6T Parameter MoE on a Budget: Inside DeepSeek-V4's Pre-Training Stack
The Math That Beat the Export Controls: DeepSeek-V4's Radical Training Efficiency

When a new technology hits a scaling wall, Silicon Valley giants have usually tried throwing more money at it first. For the engineers working on DeepSeek models, export restrictions and budget constraints forced a different approach. They did not have the luxury of top-tier Nvidia GPUs with specialized InfiniBand networking, so they had to find new ways to train a competitive model. In this post, part 1 of a 3-part series, we will explore what they did differently to bring down their training cost.
To pull this off, DeepSeek essentially looked at three hidden taxes of AI training: the memory tax of tracking model updates, the idle-time tax of GPUs waiting for data, and the instability tax of scaling deep networks. By replacing brute-force compute with architectural elegance, they showed that algorithmic cleverness can work around physical silicon ceilings. Let’s look under the hood at the three core areas of DeepSeek's pre-training revolution.
1. The Muon Optimizer vs. AdamW
Driving a Car vs. Launching a Satellite
When an AI model trains, it looks at data, makes mistakes, and calculates gradients, which are basically directions on how to tweak its weights to get “smarter”. The optimizer is the actual algorithm that takes those directions and updates the model’s weights. For the better part of a decade, the industry standard has been an algorithm called AdamW. The way it works is like driving a car coordinate by coordinate on a brutal, bumpy road. It tracks how rocky the journey has been for every single wheel independently, adjusting its speed wheel by wheel. This is incredibly reliable, but it’s a memory nightmare: you have to track the history of billions of individual coordinates.
DeepSeek swapped AdamW for a custom version of an algorithm called Muon. Muon is more like navigating via satellite. Instead of hoarding the history of billions of tiny, isolated points, Muon looks at the entire geometric shape of the system all at once and forces it to make perfectly balanced, fluid turns. By optimizing the whole structural shape rather than tracking every tiny past bump, it cuts out massive amounts of memory baggage and drives a straight line to peak performance.
The Technical Deep Dive
The amount of VRAM needed is one of the reasons why frontier models cost so much. When you use the industry-standard optimizer AdamW, you aren’t just storing the model's actual weights. You also have to keep track of two historical states for every single parameter. The first is "Momentum", a running average of the direction a weight has been moving. The second is "Uncentered Variance", a running average of the squared gradient that tracks how erratic that weight's changes have been. In standard 32-bit floating point precision, storing both of these states needs 8 bytes of memory per parameter, which is twice the VRAM of the 32-bit weights themselves.
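To make the two extra states concrete, here is a minimal NumPy sketch of a single AdamW update. The function name `adamw_step` and the default hyperparameters are illustrative assumptions, not taken from DeepSeek's code. The thing to notice is that `m` and `v` are each exactly as large as the weight tensor.

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update; m and v are the two per-parameter state buffers."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment ("momentum")
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1.0 - beta1 ** t)                # bias correction, step t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    # Decoupled weight decay is applied to the weights, not folded into the gradient.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# Toy usage: four float32 weights, one step with a gradient of all ones.
w = np.ones(4, dtype=np.float32)
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = adamw_step(w, np.ones_like(w), m, v, t=1, lr=0.1)

# Two float32 buffers per weight: 4 bytes + 4 bytes.
state_bytes_per_param = (m.nbytes + v.nbytes) // w.size
```

In this toy run `state_bytes_per_param` comes out to 8, matching the figure quoted above.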
Muon bypasses a large chunk of this bottleneck by eliminating the second-moment state for the core 2D weight matrices of the transformer backbone. By storing only a single momentum buffer (dropping the footprint from 8 bytes to 4 bytes per parameter), Muon cuts the optimizer state memory for those layers by 50%. Across an entire model, where roughly 90% of the parameters sit in those matrices, this removes roughly 45% of the total optimizer state, lowering the amount of hardware the cluster needs.
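The arithmetic behind those percentages can be checked in a few lines. This is a back-of-the-envelope sketch only: it assumes 32-bit optimizer states, takes the 1.6T parameter count from the title and the roughly 90% Muon share mentioned in this post, and ignores sharding, mixed precision, weights, and activations. The helper name `optimizer_state_bytes` is made up for illustration.

```python
def optimizer_state_bytes(n_params, muon_fraction=0.0, bytes_per_value=4):
    """Optimizer state size when a fraction of parameters uses Muon (1 buffer)
    and the remainder uses AdamW (2 buffers)."""
    muon_params = n_params * muon_fraction
    adamw_params = n_params - muon_params
    return (adamw_params * 2 + muon_params * 1) * bytes_per_value

N_PARAMS = 1.6e12   # parameter count from the title (assumed for this estimate)

adamw_only = optimizer_state_bytes(N_PARAMS)
hybrid = optimizer_state_bytes(N_PARAMS, muon_fraction=0.9)
saving = 1.0 - hybrid / adamw_only

print(f"AdamW only: {adamw_only / 1e12:.2f} TB")
print(f"Hybrid:     {hybrid / 1e12:.2f} TB")
print(f"Saving:     {saving:.0%}")
```

Under these assumptions the optimizer state drops from 12.80 TB to 7.04 TB, which is the 45% reduction: half of the state is removed for 90% of the parameters.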
The Hybrid Engine Setup
As an engineering reality, Muon's matrix orthogonalization math only works on 2D grids of numbers. It cannot be applied to 1D arrays such as the gain vectors of the structural safety rails (Layer Normalization), and vocabulary embedding tables, although stored as 2D matrices, are conventionally left on AdamW as well. Therefore, modern architectures deploy a hybrid engine. Muon runs the heavy internal transformer layers that make up roughly 90% of the model's parameters, while AdamW is retained strictly for the remaining edge cases. Production deployment reports cite a 25% to 35% improvement in sample efficiency, attributed to how stable Muon's geometric updates keep the internal layers, allowing models to reach target benchmark capabilities with less data and less hardware wall-clock time than standard AdamW setups.
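A minimal sketch of the hybrid idea follows. It is not DeepSeek's implementation: it uses the plain cubic Newton-Schulz iteration for orthogonalization (practical Muon implementations use a tuned polynomial with fewer steps), and `pick_optimizer`, the parameter names, and the hyperparameters are hypothetical.

```python
import numpy as np

def orthogonalize(matrix, steps=30):
    """Approximate the nearest semi-orthogonal matrix with cubic Newton-Schulz."""
    x = matrix / (np.linalg.norm(matrix) + 1e-12)   # Frobenius scaling: singular values <= 1
    tall = x.shape[0] > x.shape[1]
    if tall:                                        # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x           # pushes every singular value toward 1
    return x.T if tall else x

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update: a single momentum buffer, no second moment."""
    momentum = beta * momentum + grad
    return weight - lr * orthogonalize(momentum), momentum

def pick_optimizer(name, param):
    """Hypothetical routing rule: 2D hidden matrices go to Muon, the rest to AdamW."""
    if param.ndim == 2 and "embed" not in name:
        return "muon"
    return "adamw"

rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 6))
grad = rng.standard_normal((4, 6))
weight, momentum = muon_step(weight, grad, np.zeros_like(weight))

# The update direction has (approximately) orthonormal rows.
direction = orthogonalize(momentum)
```

The only state carried between steps is `momentum`, which is where the memory saving comes from; the orthogonalization is recomputed from it on every step.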
2. Abolishing the GPU Bubble: DualPipe & Multi-Token Prediction (MTP)
The Assembly Line Bottleneck
Imagine a massive factory assembly line with 100 workers. Worker 1 does his job, passes the product to Worker 2, and so on. If Worker 100 is working, what are Workers 1 through 99 doing? They are sitting idle, waiting for the next product to come back down the line. In AI training, this idle time is called the "GPU Bubble." When you string thousands of chips together across a network, millions of dollars are wasted because GPUs spend a large chunk of their time sitting completely still, waiting for data to arrive from other chips.
DeepSeek reduced this inefficiency from two completely different angles: DualPipe (to optimize the hardware timeline) and Multi-Token Prediction (to supercharge the data efficiency). DualPipe is like running two factory assembly lines in opposite directions at the exact same time, scheduling tasks so that workers are almost never idle. Multi-Token Prediction is like teaching the model to anticipate and plan the next steps of the sentence sequentially, squeezing more learning value out of every single training token.
The Technical Deep Dive
DualPipe: True Bidirectional Overlapping
In large-scale pipeline parallelism, execution bubbles during the startup and teardown phases of a batch severely cripple FLOP efficiency. This is heavily exacerbated in bandwidth-constrained hardware clusters, where the "All-to-All" network communication required to route tokens to different Mixture-of-Experts (MoE) layers causes massive delays. DeepSeek’s DualPipe scheduling algorithm achieves near-zero bubble overhead by splitting the training pipeline into two symmetric, bidirectional paths. Instead of a traditional linear progression, micro-batches are injected from both ends of the pipeline simultaneously. To execute this, each physical device hosts two distinct structural chunks of the network (e.g., in an 8-stage setup, Device 0 holds the first layers and the final layers).
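Two small helpers illustrate the numbers involved. The first is the textbook idle-fraction estimate for a simple synchronous pipeline, used here only to show the size of the bubble DualPipe attacks; it is not DualPipe's own bubble formula. The second sketches the device-to-chunk placement described above. Both function names are hypothetical, and the 16 micro-batches are an arbitrary example value.

```python
def naive_bubble_fraction(stages, micro_batches):
    """Idle fraction of a simple synchronous pipeline (textbook GPipe-style estimate)."""
    return (stages - 1) / (micro_batches + stages - 1)

def bidirectional_placement(stages):
    """Map each device to the two stage indices it hosts, one per pipeline direction."""
    return {device: (device, stages - 1 - device) for device in range(stages)}

# 8-stage example from the text; 16 micro-batches is an assumed value.
placement = bidirectional_placement(8)
bubble = naive_bubble_fraction(stages=8, micro_batches=16)

for device, chunks in placement.items():
    print(f"Device {device} hosts stages {chunks}")
print(f"Naive one-directional bubble: {bubble:.1%}")
```

Device 0 ends up with stages 0 and 7, so it has work to do whichever end a micro-batch enters from. In the naive one-directional case, 7 of every 23 time slots per device are idle, roughly 30%.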
Multi-Token Prediction (MTP): Sequential Representation Constraints
While DualPipe squeezes every drop of utility out of the hardware, Multi-Token Prediction (MTP) completely changes how the model processes language. Standard language models only try to guess the very next word in a sequence. DeepSeek, however, tacks dedicated MTP modules onto its architecture. Rather than using basic, flat linear layers to guess multiple future words all at once, DeepSeek handles this sequentially. They built mini prediction modules that actually contain their own dedicated Transformer layers. The main model starts by generating a representation to predict the next token; that exact output is then handed off to the first MTP module, which layers in the next word’s data to guess the word after that.
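The sequential hand-off can be sketched in NumPy. This is a toy under stated assumptions: a single `tanh` layer stands in for each module's Transformer layers, the embedding table and output head are shared across depths, and every name and dimension (`mtp_logits`, `VOCAB`, `DIM`, `DEPTH`) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, DEPTH = 50, 16, 2          # toy sizes, not DeepSeek's

def rms_norm(x):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + 1e-6)

embed = rng.standard_normal((VOCAB, DIM)) * 0.1      # shared embedding table
head = rng.standard_normal((DIM, VOCAB)) * 0.1       # shared output head
proj = [rng.standard_normal((2 * DIM, DIM)) * 0.1 for _ in range(DEPTH)]
block = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(DEPTH)]

def mtp_logits(hidden, tokens):
    """hidden[i] is the backbone state at position i (it predicts tokens[i + 1]).
    Returns one logits array per MTP depth; depth k predicts tokens[i + k + 1]."""
    outputs = []
    h = hidden
    for k in range(1, DEPTH + 1):
        future = embed[tokens[k:]]                    # embeddings of tokens k steps ahead
        merged = np.concatenate([rms_norm(h[:-1]), rms_norm(future)], axis=-1)
        # Stand-in for the module's Transformer layers: project, then one tanh layer.
        h = np.tanh(merged @ proj[k - 1] @ block[k - 1])
        outputs.append(h @ head)
    return outputs

tokens = rng.integers(0, VOCAB, size=8)
hidden = rng.standard_normal((8, DIM))               # pretend backbone output
logits_per_depth = mtp_logits(hidden, tokens)
```

Each depth consumes the previous depth's hidden states rather than predicting independently, so the chain of predictions stays causal. Each level also covers one fewer position, because the sequence runs out of future tokens.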
During training, these extra MTP modules act like an amplifier for the data signal. They force the core backbone of the model to build smarter internal representations that look further down the horizon, subtly upgrading its ability to plan ahead and handle complex reasoning. When training is done, you can completely discard these MTP modules to keep the core model lightweight, or leave them plugged in to act as a built-in speculative decoding engine that speeds up inference.
3. Manifold-Constrained Hyper-Connections (mHC)
The Whisper Game Problem
An AI model faces a scaling problem very similar to the game of “Whisper”. Person 1 whispers a phrase to Person 2, and it goes down a line of 100 people. By the time it reaches the end, the message is usually completely garbled. Trillion-parameter models like DeepSeek-V4 are incredibly deep, with over a hundred layers stacked on top of each other. If data travels through all those layers unprotected, the mathematical signals can drift, distort, or completely explode before hitting the end.
To bypass this bottleneck, traditional models rely on basic, rigid residual connections to pass the signal along unchanged. But to increase data bandwidth, DeepSeek pioneered Manifold-Constrained Hyper-Connections (mHC). Think of this like giving every single person in the Whisper game a perfectly tuned acoustic filter. Instead of allowing the message to be wildly altered or amplified as it passes between multiple parallel lines of people, a strict mathematical rulebook forces the total signal energy to remain perfectly constant from Layer 1 all the way to Layer 100.
The Technical Deep Dive
As Mixture-of-Experts (MoE) models scale across incredibly deep topologies, traditional identity residual connections (x+F(x)) limit representation bandwidth. However, unconstrained parallel pathways ("Hyper-Connections") cause severe representation drift, multiplying the numerical range of activations layer after layer and leading to catastrophic training divergence. DeepSeek-V4 introduces mHC to stabilize this deep signal propagation. First, the architecture splits the traditional residual pathway into multiple parallel streams (applying an expansion factor of 4), allowing significantly more representational bandwidth to bypass heavy computation blocks.
Second, to prevent these multi-stream connections from causing signal explosion, the mixing matrix within the hyper-connection is strictly bound to a stable mathematical manifold known as the Birkhoff Polytope. DeepSeek achieves this by running a highly optimized Sinkhorn-Knopp balancing step during the forward pass. This forces the mixing matrices to remain doubly stochastic (all entries non-negative, with every row and every column summing to 1). Because doubly stochastic matrices are closed under multiplication, the combined mixing across any number of stacked layers is itself doubly stochastic, so the residual signal cannot be amplified no matter how deep the stack is. This guards against the activation blow-up described above, so that every expert token router receives well-scaled inputs regardless of how deep it sits in the network.
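The closure property is easy to check numerically. The sketch below is a plain Sinkhorn-Knopp iteration on small random matrices, not DeepSeek's optimized kernel. The 4 streams follow the expansion factor mentioned above, while the 100 layers, the logit scale, and the iteration count are assumptions chosen for the demonstration.

```python
import numpy as np

def sinkhorn_knopp(logits, iterations=200):
    """Push an unconstrained square matrix toward the doubly stochastic set."""
    matrix = np.exp(logits - logits.max())                    # strictly positive entries
    for _ in range(iterations):
        matrix = matrix / matrix.sum(axis=1, keepdims=True)   # rows sum to 1
        matrix = matrix / matrix.sum(axis=0, keepdims=True)   # columns sum to 1
    return matrix

rng = np.random.default_rng(0)
STREAMS, LAYERS = 4, 100
raw = [0.5 * rng.standard_normal((STREAMS, STREAMS)) for _ in range(LAYERS)]

constrained = np.eye(STREAMS)      # product of doubly stochastic mixing matrices
unconstrained = np.eye(STREAMS)    # product of the same matrices without balancing
for logits in raw:
    constrained = sinkhorn_knopp(logits) @ constrained
    unconstrained = np.exp(logits) @ unconstrained

signal = rng.standard_normal(STREAMS)
print("constrained row sums:", constrained.sum(axis=1))
print("constrained gain:    ", np.linalg.norm(constrained @ signal) / np.linalg.norm(signal))
print("unconstrained gain:  ", np.linalg.norm(unconstrained @ signal) / np.linalg.norm(signal))
```

After 100 layers the constrained product still has rows and columns summing to 1 and a gain of at most 1, while the unbalanced product of the same raw matrices grows by many orders of magnitude.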
At the end of the day, efficiency isn’t just about buying fewer chips. It’s about writing code so stable that training on a massive 32-trillion token corpus can run with near-zero downtime, mid-run crashes, or catastrophic loss spikes. By stripping away the hidden memory and hardware taxes of traditional AI architectures, DeepSeek showed the industry that you don't need a blank check to train a world-class model. You just need to stop throwing brute-force hardware at problems that can be solved with mathematical elegance.
Coming up in Part 2: We shift from training efficiency to inference survival. We'll look under the hood at Multi-head Latent Attention (MLA) and detail The Death of Linear Scaling: How DeepSeek Achieved a 90% KV Cache Reduction.
Sources
https://arxiv.org/html/2412.19437v1
https://arxiv.org/pdf/2512.24880
https://arxiv.org/html/2509.23106v1