Ternary Mamba: QAT From a Pretrained SSM Checkpoint

A new arXiv paper shows that ternary state space models can be made from a pretrained checkpoint with quantization-aware training and distillation, instead of training from scratch on 150 billion tokens — and surfaces a new instability called zero-ratio collapse.

State space models, the Mamba family chief among them, have earned attention as a serious alternative to transformers because they offer linear-time inference — their cost grows in proportion to sequence length rather than its square. That makes them attractive for long contexts and, in principle, for edge devices. The catch is memory: even a modestly sized SSM has a footprint that strains a microcontroller or a phone. A paper posted to arXiv on June 16, 2026, by Ramprasath Ganesaraja, Sahil Dilip Panse and Swathika N tackles that footprint with extreme weight quantization, and its more interesting contribution is about how cheaply you can get there.

The technique is ternary quantization: compressing each weight down to one of three values, a scheme often denoted W1.58A16, meaning roughly 1.58 bits per weight with 16-bit activations. Prior work in this vein, which the authors name as Slender-Mamba, produced ternary SSMs by training them from scratch on 150 billion tokens. That is an enormous and expensive undertaking. The new paper's headline claim is that you do not need to start over at all — a pretrained checkpoint suffices, and the marginal cost of getting to a ternary model drops by three orders of magnitude.

"Prior ternary SSM work (Slender-Mamba) trains from scratch on 150B tokens; we show a pretrained checkpoint suffices, reducing the marginal token budget by 1,000x." (arXiv:2606.18114)
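
To make the W1.58A16 label concrete, here is a minimal sketch of ternary weight quantization in plain Python. The absmean scaling rule is borrowed from BitNet-style work and is an assumption here, not necessarily the paper's quantizer; the function names `ternarize` and `dequantize` and the sample weights are hypothetical.

```python
import math

def ternarize(weights, eps=1e-8):
    """Map each weight to a code in {-1, 0, +1} plus one shared scale.

    The scale is the mean absolute weight (absmean), an assumed rule
    chosen for illustration only.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes."""
    return [c * scale for c in codes]

# Three equally likely symbols carry log2(3) bits each, about 1.585,
# which is where the "1.58" in W1.58A16 comes from.
bits_per_weight = math.log2(3)

weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]  # toy values, not from the paper
codes, scale = ternarize(weights)
approx = dequantize(codes, scale)
print(codes, round(scale, 4), round(bits_per_weight, 3))
```

Small weights land on zero and large ones saturate at plus or minus the scale, which is why the choice of scale matters so much in what follows.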

The mechanism behind that 1,000x reduction is grouped quantization-aware training (QAT) combined with knowledge distillation from a frozen FP16 teacher. In QAT, the model learns with the quantization simulated during training, so it adapts to the coarse three-value weights rather than having them imposed bluntly afterward. The distillation piece keeps a full-precision copy of the model — the teacher — and trains the quantized student to match its behavior, transferring the knowledge already baked into the pretrained checkpoint instead of relearning it. Together, these let the authors compress Mamba-2 1.3B by 3.61x — from 2,687 MB down to 744 MB — and reach 48.1% average zero-shot accuracy across seven tasks using just 102 million tokens, which they report as four GPU-hours on a single H100. That figure, they note, approaches Bi-Mamba's 48.4% (within a stated confidence interval).
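
The two ingredients can be sketched separately. The block below shows only the forward-pass view: a grouped ternary "fake quantization" of a weight vector, and a distillation loss that pulls student logits toward teacher logits. Everything specific is an assumption for illustration: the fixed absmean scale per group (the paper uses learnable scales), the group size, the temperature, and the use of KL divergence as the distillation objective. A real QAT loop would also need a gradient rule for the rounding step, commonly a straight-through estimator, which plain Python cannot show.

```python
import math

def fake_quantize_grouped(weights, group_size):
    """Quantize-then-dequantize each group with its own scale."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Fixed absmean scale per group (assumption; the paper learns its scales).
        scale = sum(abs(w) for w in group) / len(group) + 1e-8
        out.extend(scale * max(-1, min(1, round(w / scale))) for w in group)
    return out

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    peak = max(logits)
    exps = [math.exp((z - peak) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6, 0.01, 0.03]  # toy values
quantized = fake_quantize_grouped(weights, group_size=4)

teacher = [2.0, 0.5, -1.0]   # hypothetical FP16 teacher logits
student = [1.5, 0.7, -0.8]   # hypothetical ternary student logits
print(quantized)
print(distillation_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, so minimizing it transfers the teacher's behavior rather than relearning it from raw data.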

The new instability: zero-ratio collapse

What lifts this paper above a routine efficiency result is that the authors do not just report a win; they report a failure mode that the from-scratch approach apparently never exposed. They call it zero-ratio collapse, and they describe it as a novel instability caused by the learnable quantization scales, arising specifically in the QAT-from-pretrained setting rather than in from-scratch training. In a ternary scheme, one of the three permissible values is zero, and if the learned scaling drives an unhealthy fraction of weights toward zero, the model effectively loses capacity in a way that ordinary training never surfaces.
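
A toy calculation shows why the scale is the lever. The sketch below does not reproduce the paper's training dynamics; it only measures, for a fixed set of weights, how the fraction of zero codes changes as the scale grows. The round-and-clip quantizer, the function name `zero_ratio`, and the Gaussian toy weights are all assumptions.

```python
import random

def zero_ratio(weights, scale):
    """Fraction of weights whose ternary code is 0 under the given scale."""
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes.count(0) / len(codes)

random.seed(0)
# Toy weights drawn from a narrow Gaussian; not taken from any real checkpoint.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

scales = (0.01, 0.02, 0.05, 0.2)
ratios = [zero_ratio(weights, s) for s in scales]
for s, r in zip(scales, ratios):
    print(f"scale={s:<5} zero ratio={r:.3f}")
```

Under this quantizer a weight rounds to zero once its magnitude falls below roughly half the scale, so a scale that drifts upward during training zeroes out more and more of the layer. Tracking this ratio per layer is one plausible way to watch for the collapse the authors describe.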

Naming and diagnosing a new instability is, on the research-contribution ledger, often more durable than a benchmark number. Benchmark numbers get superseded; a characterization of why a training regime breaks tends to inform every subsequent attempt in the same regime. The fact that zero-ratio collapse appears only when quantizing from a pretrained checkpoint is exactly the kind of regime-specific gotcha that practitioners need to be warned about before they burn GPU-hours discovering it themselves.

Why transformer tricks do not transfer

The paper makes a second pointed observation: post-hoc correction strategies that work for quantized transformers fail for SSMs. The reason given is error accumulation through the recurrence. A transformer processes a sequence with attention that, broadly, lets each position see the whole context directly; quantization errors at one position do not necessarily propagate in a compounding way. A state space model, by contrast, carries information forward through a recurrent state, so a small quantization error introduced early can be amplified as it is fed through step after step. A correction designed for the non-recurrent transformer setting therefore does not map onto the recurrent one.
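
The compounding effect is easy to see in a toy setting. The sketch below compares a scalar linear recurrence, h_t = a * h_{t-1} + x, against a position-wise map, y = w * x, when the coefficient is perturbed by the same small amount. It is a loose illustration only: the scalar recurrence stands in for an SSM state update and is not Mamba-2's actual scan, the position-wise map is a crude stand-in for the non-recurrent case, and the values 0.99 and 0.98 are arbitrary.

```python
def recurrent_gap(a, a_q, steps, x=1.0):
    """Gap between states of h_t = a * h_{t-1} + x for exact vs perturbed a."""
    h = h_q = 0.0
    for _ in range(steps):
        h = a * h + x        # exact coefficient
        h_q = a_q * h_q + x  # perturbed ("quantized") coefficient
    return abs(h - h_q)

def pointwise_gap(w, w_q, x=1.0):
    """Gap for a position-wise map y = w * x; identical at every position."""
    return abs(w * x - w_q * x)

a, a_q = 0.99, 0.98  # arbitrary toy values with a 0.01 perturbation
for steps in (2, 10, 100):
    print(steps, round(recurrent_gap(a, a_q, steps), 3))
print("pointwise", round(pointwise_gap(a, a_q), 3))
```

The position-wise gap stays at the size of the perturbation, while the recurrent gap grows with the number of steps because each step feeds the previous error back in.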

That distinction matters for anyone trying to port the large and growing body of transformer-quantization techniques over to the SSM world. It is a caution that the architectures are different enough that quantization is not a solved problem you can simply transplant — the recurrence changes the error dynamics, and methods have to be designed for it.

The landscape read

From an IP and competitive-landscape vantage, edge-deployment quantization is a crowded and commercially loaded area, because shrinking a capable model to fit a phone or a microcontroller is worth real money. The distinctive contribution here is narrow but real: demonstrating that the expensive from-scratch token budget for ternary SSMs is avoidable via QAT-plus-distillation from a checkpoint, and documenting the specific instability that approach incurs. The underlying ingredients — QAT, knowledge distillation, ternary weights — are each well established individually, including BitNet-style 1.58-bit work on transformers, so the inventive wedge is the combination and its application to the recurrent SSM setting, plus the zero-ratio-collapse finding.

The honest caveats: this is a preprint that has not cleared peer review at the time of writing, 48.1% average zero-shot accuracy on a seven-task suite is a modest absolute number that reflects a small base model and aggressive compression, and "approaching" a baseline within a confidence interval is a careful phrase that should be read as parity-ish rather than a clear win. But the data-efficiency story is the point, and it is a clean one: if a pretrained checkpoint really does suffice, the cost of producing edge-ready ternary SSMs falls from a research-lab undertaking to something achievable in an afternoon on one GPU. The full method, the seven-task breakdown, and the analysis of zero-ratio collapse are in the preprint on arXiv.
