DeepSeek has unveiled a new experimental AI model, V3.2-exp, that uses a technique called Sparse Attention to reduce inference costs, particularly for long-context operations. Announced via Hugging Face and accompanied by an academic paper on GitHub, the model introduces a pair of mechanisms designed to streamline how transformer models process large amounts of text.
At the core of the system is what DeepSeek calls a “lightning indexer,” which scans the full context window and identifies the most relevant excerpts. A second component, the “fine-grained token selection system,” then narrows the selection to specific tokens within those excerpts for the model to process. By focusing computational resources only on the most meaningful sections, the model can handle long-context workloads without the server load of standard dense attention, which compares every token in the window against every other.
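To make the two-stage idea concrete, here is a minimal NumPy sketch of block-then-token selection for a single query. The function names, the block-mean scoring heuristic, and the parameter values are illustrative assumptions for this article, not DeepSeek’s published implementation, which is specified in the paper accompanying the release.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, keys, values, block_size=4, top_blocks=2, top_tokens=4):
    """Two-stage sparse attention sketch for a single query vector.

    Stage 1 (the indexer's role): score whole blocks of the context
    cheaply and keep only the best ones. Stage 2 (fine-grained
    selection): re-score individual tokens inside those blocks and
    keep the strongest. Full attention then runs over that subset.
    """
    n, d = keys.shape

    # Stage 1: cheap block scores -- dot product of the query with each
    # block's mean key (a stand-in for DeepSeek's lightning indexer,
    # whose actual scoring function is described in their paper).
    n_blocks = n // block_size
    block_means = keys[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    keep_blocks = np.argsort(block_means @ q)[-top_blocks:]

    # Stage 2: fine-grained token selection within the surviving blocks.
    candidates = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in keep_blocks]
    )
    keep_tokens = candidates[np.argsort(keys[candidates] @ q)[-top_tokens:]]

    # Attention now touches top_tokens keys instead of all n -- this is
    # where the long-context compute savings come from.
    weights = softmax(keys[keep_tokens] @ q / np.sqrt(d))
    return weights @ values[keep_tokens], np.sort(keep_tokens)

# Usage: attend over a toy 16-token context with one query vector.
rng = np.random.default_rng(0)
d_model = 8
K = rng.normal(size=(16, d_model))
V = rng.normal(size=(16, d_model))
q = rng.normal(size=(d_model,))
out, selected = sparse_attention(q, K, V)
print("tokens attended:", selected)  # 4 of 16 tokens reach the softmax
print("output shape:", out.shape)    # (8,)
```

The payoff is in the final step: the softmax and weighted sum run over a handful of selected tokens rather than the full window, which is where the claimed long-context savings would come from.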
Early tests suggest this approach could cut the cost of API calls by up to half in long-context scenarios, though the company notes that further third-party evaluations are needed to validate the results. Because the model is open-weight and freely available, researchers and developers will be able to benchmark its performance independently in the coming weeks.
The release highlights a growing push across the AI industry to address inference costs — the ongoing expense of running large models once they’ve been trained. Unlike training, which is a one-time cost, inference requires continuous server power for every query and response, making efficiency crucial for commercial viability. Sparse Attention represents DeepSeek’s attempt to re-engineer parts of the transformer architecture to make it leaner.
DeepSeek, based in China, has positioned itself as a cost-conscious AI developer in a market dominated by U.S. firms. Earlier this year, its R1 model drew attention for being trained with reinforcement learning at a fraction of the cost of American rivals, though it fell short of sparking the disruption some predicted. While the V3.2-exp release is less likely to generate headlines on the same scale, it may have a more immediate impact by offering practical tools to reduce operating expenses for long-context AI applications.