DeepSeek introduces its experimental V3.2-Exp model with sparse attention technology. The technique promises to process long texts far more efficiently while maintaining output quality virtually identical to that of the previous V3.1-Terminus model.
Chinese AI company DeepSeek has launched V3.2-Exp, an intermediate step towards its next-generation architecture. The experimental version builds on the V3.1-Terminus model and introduces DeepSeek Sparse Attention (DSA), a sparse attention technology that is expected to significantly improve training and inference efficiency in long contexts.
V3.2-Exp is immediately available to developers through various platforms. HuggingFace provides access to the model, while vLLM offers day-0 support. The model works on various hardware configurations, from Nvidia H200 to AMD chips.
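Given the day-0 vLLM support, running the model can be as simple as the sketch below. The model identifier and parallelism settings are assumptions; check the model card on HuggingFace for the exact values.

```python
# Minimal offline-inference sketch with vLLM. The model ID and
# tensor_parallel_size are assumptions to adapt to your own hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2-Exp",  # assumed HuggingFace model ID
    tensor_parallel_size=8,                 # spread the model over 8 GPUs
    trust_remote_code=True,                 # DeepSeek models ship custom code
)
params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(["Summarize this 100-page report in three bullets."], params)
print(outputs[0].outputs[0].text)
```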
For developers who want to run the model locally, DeepSeek has made inference code available. Converting the HuggingFace model weights for local use does require adjusting the GPU configuration and expert settings.
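DeepSeek's conversion script is not reproduced here, so the snippet below is only a hypothetical sketch of that step, modeled on the convert.py interface in DeepSeek's earlier V3 inference code; every path, flag name, and value is an assumption to verify against the V3.2-Exp repository.

```python
# Hypothetical invocation of DeepSeek's weight-conversion script.
# Flag names mirror the earlier V3 inference code and are assumptions
# for V3.2-Exp; adjust expert count and parallelism to your checkpoint.
import subprocess

subprocess.run(
    [
        "python", "convert.py",
        "--hf-ckpt-path", "/models/DeepSeek-V3.2-Exp",         # downloaded HF weights
        "--save-path", "/models/DeepSeek-V3.2-Exp-converted",  # local checkpoint
        "--n-experts", "256",     # expert setting: must match the checkpoint
        "--model-parallel", "8",  # GPU configuration: degree of parallelism
    ],
    check=True,
)
```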
Sparse attention as a breakthrough
The core of the update lies in the sparse attention mechanism. This technology selects only the relevant parts of long texts for processing, drastically reducing the computing power required. Traditional attention mechanisms view each word in relation to all other words, so the required computing power grows quadratically with text length: doubling the length of a document quadruples the attention cost.
According to DeepSeek, DSA achieves “fine-grained sparse attention” for the first time. The system maintains model quality while substantially improving efficiency in long contexts. For developers, this means faster training and cheaper inference for extensive documents.
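To make the selection idea concrete, the toy sketch below keeps only the k highest-scoring keys per query. It illustrates the general principle of sparse attention, not DeepSeek's actual DSA design; the function and its parameters are invented for illustration.

```python
# Toy top-k sparse attention in NumPy: each query attends only to its
# k most relevant keys instead of the full sequence. Note that this demo
# still computes the full score matrix to find the top k, so it saves no
# compute; a real implementation avoids materializing it in the first place.
import numpy as np

def sparse_attention(q, k, v, top_k=4):
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (n_q, n_kv) scores
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]   # per-row top_k cutoff
    scores = np.where(scores >= kth, scores, -np.inf)    # drop irrelevant keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over survivors
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
print(sparse_attention(q, k, v).shape)  # (16, 8)
```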
Benchmark performance
DeepSeek has thoroughly tested V3.2-Exp against the earlier V3.1-Terminus model. On benchmarks such as MMLU-Pro, both models score an identical 85.0. On programming challenges such as Codeforces, V3.2-Exp even scores slightly higher: 2121 versus 2046 for V3.1-Terminus. The company states that it deliberately used identical training configurations to enable a fair comparison.
DeepSeek has also released open-source kernels. TileLang offers kernels for research purposes, while DeepGEMM and FlashMLA provide high-performance CUDA kernels for production use. These tools are designed to help developers maximize their use of sparse attention.
The V3.2-Exp model operates under an MIT license, allowing for both commercial and academic use. For organizations working with lengthy documents, sparse attention technology can lead to a significant improvement in efficiency.
Read also: DeepSeek delayed by GPU export restrictions