A new computer-vision paper called PRISM argues that the field's biggest bottleneck in single-image 3D reconstruction is architecture, not accuracy. By replacing the slow iterative sampling that defines today's best 3D-from-photo systems with a deterministic image warp and a small learned correction step, the authors report full-scene reconstructions in about 36 seconds, fast enough to fit inside real workflows rather than only research demos.
"PRISM: Feed-Forward Single-Image 3D Reconstruction via Geometric Warp-Residual Modeling" is a preprint posted to arXiv on 2026-06-24 by Zhijie Zheng. Single-image 3D reconstruction, the task of inferring a complete 3D scene from a single still photo, has been dominated for the past year or so by camera-controlled video diffusion models. Those systems can produce coherent geometry and appearance, but they pay for that quality with iterative sampling: the model has to run dozens or hundreds of denoising steps before it commits to a scene. That cost has kept diffusion-based 3D largely out of deployment pipelines in VR, robotics, and content creation, even as the underlying accuracy has raced ahead.
PRISM's bet is structural. The method decomposes multi-view latent prediction into two stages: a parameter-free geometric forward-warp that uses known camera geometry to project the input image into a target viewpoint, and a learned residual module that fixes only what the warp gets wrong. There is no diffusion sampler at inference. The network runs in a single feed-forward pass, with the heavy geometric work handed to a closed-form warp and the model's learning budget reserved for the correction step. The paper describes training as a two-stage curriculum on purely synthetic data: first a latents-supervised distillation stage aimed at geometric generalization, then a perceptual fine-tuning stage aimed at appearance quality.
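To make the warp-then-correct split concrete, here is a minimal NumPy sketch of the general idea. It is not the paper's implementation. It assumes a pinhole camera, a per-pixel depth map as an extra input, and warping in pixel space rather than in the latent space the paper describes. The names forward_warp, apply_residual, and residual_fn are hypothetical and exist only for illustration.

```python
import numpy as np

def forward_warp(image, depth, K, R, t):
    """Parameter-free forward warp (nearest-pixel splat with a z-buffer).

    Assumptions for this sketch: a pinhole camera with intrinsics K (3x3)
    shared by both views, a per-pixel depth map, and a source-to-target
    pose (R, t). Returns the warped image and a mask of filled pixels.
    """
    H, W, C = image.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)  # (3, N)

    # Back-project pixels to 3D in the source frame, then move to the target frame.
    pts_src = (np.linalg.inv(K) @ pix) * depth.ravel()
    pts_tgt = R @ pts_src + np.asarray(t, dtype=float).reshape(3, 1)

    # Keep only points in front of the target camera, then project them.
    z = pts_tgt[2]
    front = z > 1e-6
    proj = K @ pts_tgt[:, front]
    x = np.rint(proj[0] / proj[2]).astype(int)
    y = np.rint(proj[1] / proj[2]).astype(int)
    inside = (x >= 0) & (x < W) & (y >= 0) & (y < H)

    flat = y[inside] * W + x[inside]
    zs = z[front][inside]
    colors = image.reshape(-1, C)[front][inside]

    # Z-buffer: sort by target pixel, then by depth, and keep the nearest point.
    order = np.lexsort((zs, flat))
    flat_sorted = flat[order]
    first = np.ones(flat_sorted.shape[0], dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1]
    keep = order[first]

    warped = np.zeros((H * W, C), dtype=image.dtype)
    mask = np.zeros(H * W, dtype=bool)
    warped[flat[keep]] = colors[keep]
    mask[flat[keep]] = True
    return warped.reshape(H, W, C), mask.reshape(H, W)

def apply_residual(warped, mask, residual_fn):
    """Add a correction on top of the warp.

    residual_fn is a placeholder for a learned module; here it is any
    callable that maps (warped, mask) to an array shaped like warped.
    """
    return warped + residual_fn(warped, mask)
```

In this sketch the warp has no trainable parameters, and the unfilled pixels reported by the mask mark the holes and disocclusions that a learned corrector would be responsible for. Passing a residual_fn that returns zeros reproduces the raw warp, which makes it easy to see what the correction step adds.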
The empirical anchor is the roughly 36-second-per-scene figure the authors report. It is a preprint number, not an independently replicated benchmark, but it is the right number to think about because it puts PRISM in deployment territory rather than research-demo territory. If a roboticist or a VR asset pipeline can produce a rough 3D scene in tens of seconds from a single still, the architectural conversation shifts from "can the model be correct?" to "is the residual trained on the right distribution?"
That last question is the open one. Single-image 3D is inherently ambiguous: any photo hides the back of the scene, and a model has to invent plausible content for the unseen side. PRISM inherits that ambiguity. Its residual module is trained on purely synthetic data, and the authors describe the result as "competitive" on three standard benchmarks rather than state-of-the-art. "Competitive" is hedged language in an abstract, and three benchmarks is a narrow evaluation footprint. The paper is a v1 preprint that has not yet been through peer review, and the TLDR aggregator summary shows no public commentary or replication activity yet.
What to watch next is real-world transfer. A residual module trained on synthetic scenes has to generalize when the input photo comes from a phone camera in bad lighting, with motion blur, or in a domain the synthetic corpus did not cover. The architectural simplification is real and the speed claim is concrete, but the test will be whether the residual's failures in deployment look like small artifacts a downstream pipeline can mask, or like systematic gaps that pull PRISM back into the research-demo category its diffusion-based predecessors occupy.
For now, the paper's contribution is best read as a structural argument rather than a benchmark victory: that the bottleneck in 3D-from-photo is no longer accuracy, and that the next wave of work will turn on what a small learned corrector can do once the heavy geometric lifting is taken off the network's shoulders.