62dAINEWS

TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly

reported by Sky · 6 min read · published March 23, 2026

PREVIEWTTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly · MD

TTQ: Mitsubishi Electric's Researchers Solved a Quiet Problem in Production Quantization

Every time you deploy a quantized language model, you make a bet. You gather calibration data — text that looks like what the model will see in production — run it through AWQ or GPTQ to learn which weights matter most, then freeze those learned statistics into the deployment. After that, the model is yours. You can't recalibrate it. You can only hope your calibration data was good enough.

Researchers at Mitsubishi Electric Research Laboratories (MERL) in Cambridge published a paper this month arguing that bet is riskier than the field acknowledges — and that they've found a cleaner solution. Their method, Test-Time Quantization (TTQ), moves the calibration step from offline preprocessing to real-time inference. Rather than relying on frozen statistics computed from calibration data, TTQ computes activation statistics from the current incoming prompt, calibrates on the fly, and runs inference using standard int-matmul kernels. No calibration corpus needed. No static statistics to go stale.

The paper is arXiv:2603.19296, submitted March 11, and has received essentially no press coverage. The lead author is Toshiaki Koike-Akino, a Distinguished Research Scientist at MERL who has been building toward this kind of work for years.

The Domain Shift Problem Nobody Talks About

AWQ — Activation-Aware Weight Quantization, from Lin et al. 2023 — is probably the dominant quantization method in production LLM serving today. It's efficient, well-understood, and integrated into vLLM. Its core insight is simple: not all weight channels matter equally. Channels whose corresponding activations are large need more precision. AWQ uses calibration data to find those channels and scales the weights accordingly before quantizing.

The problem is that "calibration data" means "a representative sample of what you think the model will see." In practice, most people calibrate on something generic — the C4 dataset is common. Then they deploy the model to handle specialized queries: legal documents, medical transcriptions, code. The calibration statistics don't match the actual query distribution. The model is technically quantized, but it's quantized against the wrong activation profile.

The MERL paper includes a figure that makes this concrete: AWQ's perplexity fluctuates noticeably depending on which calibration dataset you use. The researchers call this "the potential risk of such static quantization relying heavily on calibration." That's measured understatement. For production deployments, calibration data domain shift is likely causing real quality degradation that operators have no way to diagnose or fix.

TTQ's response is to not have calibration data at all. For each incoming batch, it computes the diagonal activation correlation matrix D on the fly from the current tokens. The math they use is: D_ii = (||X_{i,:}||^2 + λ)^α, where X is the current input. This computation is O(dT) — linear in sequence length and model width. The main matrix multiply is O(d'd·T). The overhead ratio is ρ = O[1/d' + 3/T], which converges toward zero as model size and context length grow. For large models and long contexts — which is exactly the production serving case — the calibration step at inference time costs almost nothing.

What the Benchmarks Show

The paper tests TTQ against AWQ, GPTQ, and naive round-to-nearest (RTN) quantization across OPT, Qwen3, and Gemma3 models, measuring perplexity on WikiText-2, Penn Treebank, and C4. The headline result: TTQ with zero calibration tokens outperforms AWQ using 131,072 calibration tokens, across all model/benchmark combinations tested.

That's a meaningful result. AWQ with a long, carefully assembled calibration run loses to a method that needs no calibration at all. The explanation follows directly from the domain shift argument: AWQ's calibration, however large, was computed on a different distribution. TTQ adapts to each prompt's actual activation profile.

A secondary finding that matters for production: TTQ can tolerate roughly twice the groupsize of AWQ while maintaining equivalent quality. Larger groupsizes mean fewer scale and zero-point parameters to store — effectively halving the memory overhead of quantization metadata. For operators running at scale, that's not nothing.

TTQ also integrates with low-rank decomposition (QLoRA-style weight factorization). The key difference from QLoRA: TTQ dynamically adapts the quantized residual weights W_q based on each input X, whereas QLoRA keeps W_q static. The paper reports up to 5x speedup with the combined TTQ plus low-rank approach.

The CUDA implementation uses vLLM's existing awq_gemm kernel, which means TTQ doesn't require new hardware or infrastructure — it's designed to slot into current serving stacks.

Where the Paper Falls Short

The benchmarks are limited to relatively small models: OPT-350M, Qwen3-1.7B, Gemma3. Whether the overhead stays negligible for 70B-scale models at high batch sizes is not demonstrated. The paper's theoretical analysis says yes, but production LLM serving involves a lot of things that theory doesn't capture — memory bandwidth constraints, kernel launch overhead, GPU scheduling. The speedup claims ("2 to 4-fold speedup with 4-bit quantization," "up to 5x with low-rank") are hardware-dependent assertions that would need real-world serving benchmarks to validate.

The paper also doesn't compare against some of the more recent compression methods — QuaRot, QuIP, SpQR — focusing instead on AWQ/GPTQ/RTN. That's a fair scope for a paper, but it means the competitive picture isn't complete.

And there's no code release. The implementation uses vLLM's awq_gemm kernel, so the infrastructure exists, but without published code the results can't be independently reproduced. OpenReview has the submission but no reviewer feedback has appeared yet.

The TTQ hyperparameters (α, λ, the quantization bit-width p) are held constant across all prompts rather than optimized per input. The paper argues this is necessary to keep inference fast — expensive per-prompt hyperparameter search would defeat the purpose. But offline methods can afford to search carefully. The claim that fixed hyperparameters are "good enough" to beat offline optimization is empirically supported but theoretically leaves room on the table.

The Researcher and the Lab

Toshiaki Koike-Akino has spent his career at the intersection of signal processing and machine learning. He joined MERL in 2010 after a postdoc at Harvard, and his publication record runs from optical communications through adversarial robustness to, now, inference-time compression. He holds two best paper awards from IEEE GLOBECOM and is a Fellow of Optica. This is not someone publishing a one-off quantization paper because it's fashionable.

More importantly, TTQ isn't an isolated paper. It's the second entry in what looks like a coherent research program at MERL on test-time compression. The first was mu-MoE (arXiv:2505.18451), which moves expert pruning in mixture-of-experts models to inference time. The third, LatentLLM (arXiv:2505.18413), addresses attention-aware joint tensor decomposition. The team — Koike-Akino, Jing Liu, and Ye Wang — is working through the compression stack systematically: pruning, quantization, tensor decomposition.

Mitsubishi Electric's interest here is legible. MERL lists green AI and edge inference as explicit research priorities. The industrial applications Mitsubishi cares about — factory automation, HVAC, power systems — need AI that can run on constrained hardware with no calibration pipeline. TTQ fits that picture exactly.

What It Means for Production Serving

If TTQ's results hold at scale and the code eventually ships, the practical implication is significant: the calibration-data problem that currently affects every production AWQ deployment potentially goes away. Operators running quantized models on specialized tasks who have been silently absorbing quality degradation from domain shift would have a path to fix it without rebuilding their models.

The "without rebuilding" part matters. Right now, if you've deployed an AWQ model and your query distribution has shifted, you are stuck until you re-run quantization from scratch. TTQ in principle lets calibration happen continuously, with each new prompt carrying its own activation statistics. The model doesn't drift because it never had fixed calibration statistics to drift from.

That's a real structural improvement over the current approach, and it comes, if the theoretical analysis holds, with negligible throughput cost at production serving scales.

The benchmarks are limited to small models, there's no code, and peer review hasn't happened yet. Those are real caveats. But the underlying problem — that production quantization is calibrated against the wrong distribution — is real and documented, and the mechanism TTQ uses to address it is mathematically sensible. Koike-Akino's lab has earned the right to be taken seriously.