The Algorithms Were Never the Problem
Quantizing a large language model for production deployment is not a solved problem. The algorithms exist (GPTQ, AWQ, SpinQuant, QLoRA), but stitching them together into a reliable pipeline still requires a specialist, and what works on a 7B model may fail silently on a 70B one. Fujitsu Research is betting that this is the real bottleneck, and on Monday the company released OneComp, an open-source framework that automates the entire quantization workflow from model ID to deployable checkpoint.
OneComp, described in a 31-page paper posted to arXiv on March 31, 2026, takes a model identifier and a hardware budget as input and outputs a quantized model ready to serve. The pipeline plans mixed-precision bit assignments automatically using a component called AutoBit, which profiles each layer's quantization sensitivity and solves a constrained optimization problem to stay within the available VRAM. No manual bit-width tables. No per-layer tuning scripts.
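To make the idea of budget-constrained bit assignment concrete, here is a minimal sketch in Python. It is not AutoBit's actual solver, which the paper describes only as a constrained optimization; it swaps in a simple greedy heuristic. The `Layer` class, the `error_at_bits` sensitivity profile, and the `assign_bits` function are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Layer:
    name: str
    n_params: int
    # Hypothetical sensitivity profile: estimated error at each bit-width.
    error_at_bits: Dict[int, float]


def weight_bytes(n_params: int, bits: int) -> float:
    """Bytes needed to store n_params weights at the given bit-width."""
    return n_params * bits / 8


def assign_bits(
    layers: List[Layer],
    budget_bytes: float,
    candidates: Sequence[int] = (2, 3, 4, 8),
) -> Tuple[Dict[str, int], float]:
    """Greedy heuristic (an assumption, not AutoBit): start every layer at
    the lowest bit-width, then repeatedly upgrade whichever layer buys the
    largest error reduction per extra byte while staying within budget."""
    cands = sorted(candidates)
    bits = {layer.name: cands[0] for layer in layers}
    used = sum(weight_bytes(layer.n_params, cands[0]) for layer in layers)
    if used > budget_bytes:
        raise ValueError("budget too small even at the lowest bit-width")

    while True:
        best = None
        for layer in layers:
            cur = bits[layer.name]
            idx = cands.index(cur)
            if idx + 1 == len(cands):
                continue  # already at the highest candidate width
            nxt = cands[idx + 1]
            extra = weight_bytes(layer.n_params, nxt) - weight_bytes(layer.n_params, cur)
            if used + extra > budget_bytes:
                continue  # this upgrade would exceed the budget
            gain = (layer.error_at_bits[cur] - layer.error_at_bits[nxt]) / extra
            if best is None or gain > best[0]:
                best = (gain, layer, nxt, extra)
        if best is None or best[0] <= 0:
            break  # nothing affordable still reduces error
        _, layer, nxt, extra = best
        bits[layer.name] = nxt
        used += extra
    return bits, used
```

Given one sensitive layer and one robust layer of equal size, the heuristic spends the spare budget on the sensitive one and leaves the robust one at the lowest width. A real solver would weigh all layers jointly rather than one upgrade at a time.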
The system is structured in three tiers that scale with available compute. When GPU memory is tight, OneComp applies layer-wise post-training quantization, processing one linear layer at a time. This tier uses closed-form error-propagation corrections and submodule-aware coordinate descent, two techniques the authors presented in prior work at NeurIPS 2025. When more resources are available, the pipeline shifts to block-wise quantization with distillation and KL-divergence optimization across the full model. A final optional stage uses QLoRA fine-tuning to close whatever gap remains. The first quantized checkpoint produced by the pipeline is already deployable, and each subsequent stage refines the same model rather than starting over, so users get a working artifact immediately and can improve it if and when more compute becomes available.
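The deployable-first pattern can be sketched as a generator that yields a usable checkpoint after every stage and stops when the next stage is unaffordable. This is a toy illustration, not OneComp's API: `refine_progressively`, the stage names, the `can_afford` callback, and the error-halving refiners are all assumptions made up for the example, and the numbers mean nothing.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Checkpoint = Dict[str, float]  # toy stand-in for a quantized model
Refiner = Callable[[Checkpoint], Checkpoint]


def refine_progressively(
    base: Checkpoint,
    stages: List[Tuple[str, Refiner]],
    can_afford: Callable[[str], bool],
) -> Iterator[Tuple[str, Checkpoint]]:
    """Run stages in order, yielding a checkpoint after each one.

    Every yielded checkpoint is usable on its own; later stages refine
    the previous result instead of starting over from the base model."""
    ckpt = dict(base)
    for name, refine in stages:
        if not can_afford(name):
            break  # stop here; the last yielded checkpoint still stands
        ckpt = refine(ckpt)
        yield name, dict(ckpt)


# Toy refiners: each one just halves a made-up error number.
def layerwise_ptq(c: Checkpoint) -> Checkpoint:
    return {**c, "error": c["error"] * 0.5}


def blockwise_distill(c: Checkpoint) -> Checkpoint:
    return {**c, "error": c["error"] * 0.5}


def qlora_finetune(c: Checkpoint) -> Checkpoint:
    return {**c, "error": c["error"] * 0.5}


STAGES: List[Tuple[str, Refiner]] = [
    ("layerwise_ptq", layerwise_ptq),
    ("blockwise_distill", blockwise_distill),
    ("qlora_finetune", qlora_finetune),
]
```

A caller with no fine-tuning budget could pass `can_afford=lambda name: name != "qlora_finetune"` and would still receive two checkpoints, the second refined from the first.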
The approach handles a range of precision targets. At 3-4 bits, a joint scale-and-integer optimizer outperforms standard GPTQ while producing standard-compatible checkpoints, the paper reports. At 1-2 bits, structured binary-factor formats maintain meaningful accuracy where uniform quantization degrades badly. A concrete illustration: a 70-billion-parameter model requires approximately 140 GB at 16-bit precision, about 35 GB at 4-bit precision, and roughly 18 GB at 2-bit precision, according to the paper. That 18 GB figure is what makes a 70B model runnable on a single consumer GPU.
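The memory figures follow from simple arithmetic: parameters times bits per weight, divided by eight to get bytes. A short Python check, assuming decimal gigabytes and counting weights only (the function name is ours; real checkpoints also carry scales and other quantization metadata, and serving needs room for activations and the KV cache):

```python
def weight_memory_gb(n_params: float, bits: float) -> float:
    """Weight-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


for bits in (16, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):.1f} GB")
# 16-bit: 140.0 GB
#  4-bit: 35.0 GB
#  2-bit: 17.5 GB
```

The 2-bit result of 17.5 GB is the value the paper rounds to roughly 18 GB.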
Fujitsu has verified OneComp against the Llama family (TinyLlama through Llama-3) and Qwen3 at scales from 0.6 billion to 32 billion parameters, according to the GitHub repository. The repository includes a vLLM plugin for serving quantized models directly. Fourteen researchers from Fujitsu Research authored the paper, including Yuma Ichikawa, Keiji Kimura, Akihiro Yoshida, and Yamato Arai, whose earlier work on quantization error propagation underpins the framework.
The company is already using the technology in production. Fujitsu's Kozuchi AI service, which hosts the Takane LLM for enterprise customers, incorporates the quantization method. In its own benchmarks, Fujitsu reported retaining 89% of full-precision accuracy while cutting memory consumption by 94% and tripling inference speed using 1-bit quantization on Takane. The company has also begun releasing quantized open-weight models on Hugging Face, starting with a 4-bit version of Cohere's Command A.
What separates OneComp from a research demo is the API design. The framework exposes what it calls a "path-to-plan" interface that derives the quantization workflow automatically from the model architecture and GPU budget. An engineer who wants to serve Llama-3-70B on a machine with 40 GB of VRAM does not need to understand per-layer sensitivity profiles or rotation preprocessing. They specify the model and the memory constraint, and OneComp constructs the pipeline. New algorithms slot into a refiner architecture without redesigning the surrounding system.
The fragmentation OneComp targets is real. The quantization literature has grown dense: Hessian-aware rounding, activation-aware scaling, rotation-based outlier suppression, block-wise joint optimization, structured binary formats. Each method has its own failure modes, hyperparameters, and hardware assumptions. Practitioners who want to compress a model for a specific deployment environment typically end up writing custom orchestration code that is hard to reproduce and harder to maintain. OneComp's bet is that the bottleneck is not algorithmic innovation but workflow automation. The algorithms already exist; what the field needs is a reliable way to sequence them.
Whether the abstraction holds across diverse architectures remains an open question. OneComp has been verified on Llama and Qwen3, but support for other families (Mistral, DeepSeek, Gemma variants) is listed as planned rather than tested. The framework also inherits the limitations of its components: QLoRA fine-tuning, when used, requires curated training data, and the quality of the final checkpoint depends on what the user provides. The deployable-first design sidesteps this for users who skip the fine-tuning stage, but those who want the best possible accuracy after aggressive compression still need to do their part.
For teams that have been building and rebuilding quantization pipelines from scratch, OneComp is at minimum a useful parts catalog. Whether it becomes the standard workflow for LLM compression depends on whether the abstraction proves robust across the broader model landscape and whether the open-source community treats it as a foundation to build on or a competitor to the tools they already have.