A new research paper makes a quiet confession: off-the-shelf large language models cannot reason about three-dimensional space well enough to park a car. The interesting part is what the team did next. They did not build a better model. They built scaffolding around the LLM, adding a 3D positional encoding patch and a coarse-to-fine decoder to compensate for the spatial reasoning the model could not do on its own.
The paper, ParkingTransformer: LLM-Enhanced End-to-End Trajectory Planning for Autonomous Parking, was posted to arXiv on 12 June 2026 by a team of seven researchers. It is a preprint, not peer-reviewed, and the authors do not claim their system is shipping in any production vehicle. What they describe is a research-stage framework for end-to-end trajectory planning, the part of an autonomous driving stack that decides the actual path the car will follow.
The framing matters because the paper itself names the limitation. The authors write that off-the-shelf LLMs lack adequate spatial reasoning for parking, and that existing end-to-end parking methods are black boxes that struggle with long-distance maneuvers from the road to a target spot. Their contribution is the engineering answer to that gap, not a leap in foundation model capability.
In plain language, 3D positional encoding is a way of telling a model where things are in physical space. A standard language model reads tokens in a line: word, word, word. A car, however, has to understand a scene in three dimensions: distance to the curb, the angle of the spot, the position of other vehicles. The ParkingTransformer team built a patch that injects this spatial information into the model's inputs so it can plan a trajectory that respects geometry, not just sequence.
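To make the idea concrete, here is a minimal sketch of one common way to encode a 3D position: a sinusoidal encoding applied to each axis and added to a feature vector. This is an illustration, not the paper's patch. The function names (`encode_axis`, `encode_3d_position`, `inject_position`), the four dimensions per axis, and the base of 10000 are all assumptions chosen for the example.

```python
import math


def encode_axis(coord, dims=4, base=10000.0):
    """Sinusoidal encoding of one coordinate (e.g. metres) into `dims` values.

    Hypothetical helper: alternates sin/cos at geometrically spaced frequencies.
    """
    if dims % 2 != 0:
        raise ValueError("dims must be even")
    values = []
    for k in range(dims // 2):
        freq = 1.0 / (base ** (2 * k / dims))
        values.append(math.sin(coord * freq))
        values.append(math.cos(coord * freq))
    return values


def encode_3d_position(x, y, z, dims_per_axis=4):
    """Concatenate the per-axis encodings for x, y and z."""
    return (
        encode_axis(x, dims_per_axis)
        + encode_axis(y, dims_per_axis)
        + encode_axis(z, dims_per_axis)
    )


def inject_position(feature, position, dims_per_axis=4):
    """Add the 3D encoding element-wise to a feature vector of matching length."""
    encoding = encode_3d_position(*position, dims_per_axis=dims_per_axis)
    if len(feature) != len(encoding):
        raise ValueError("feature length must equal 3 * dims_per_axis")
    return [f + e for f, e in zip(feature, encoding)]


# Assumed example: a 12-value feature for an object 2.0 m ahead and 1.5 m to the side.
feature = [0.5] * 12
print(inject_position(feature, (2.0, -1.5, 0.0)))
```

The point of the sketch is only that position enters the model as extra numbers attached to each feature, so two otherwise identical features at different places in the scene no longer look the same.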
The authors combine trajectory queries with the LLM's implicit state features and multi-view perception from cameras. The planned path is read directly from the model's outputs, with no need for the dense Bird's-Eye-View representations that most prior end-to-end parking systems rely on. A coarse-to-fine decoder refines the plan in stages, which the paper argues is what makes the system usable on real road-to-spot distances rather than only tight maneuvers.
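The coarse-to-fine idea can be sketched in a few lines. The paper's decoder is a learned component, so the code below is only a geometric stand-in: it starts with a few waypoints, then doubles the resolution at each stage and gives a correction function the chance to nudge the interior points. The names `coarse_plan`, `refine`, `coarse_to_fine` and the `correct` callback are hypothetical, and straight-line interpolation is an assumption made to keep the example runnable.

```python
def coarse_plan(start, goal, num_waypoints=3):
    """Stage 1: a handful of evenly spaced (x, y) waypoints from start to goal."""
    if num_waypoints < 2:
        raise ValueError("need at least a start and a goal waypoint")
    steps = num_waypoints - 1
    return [
        (
            start[0] + (goal[0] - start[0]) * i / steps,
            start[1] + (goal[1] - start[1]) * i / steps,
        )
        for i in range(num_waypoints)
    ]


def refine(path, correct=lambda point: point):
    """Stage 2: insert midpoints, then let `correct` adjust the interior points.

    `correct` stands in for a learned refinement step; the default leaves
    points unchanged. The endpoints are always kept fixed.
    """
    dense = [path[0]]
    for a, b in zip(path, path[1:]):
        dense.append(((a[0] + b[0]) / 2, (a[1] + b[1]) / 2))
        dense.append(b)
    return [dense[0]] + [correct(p) for p in dense[1:-1]] + [dense[-1]]


def coarse_to_fine(start, goal, stages=2, correct=lambda point: point):
    """Run the coarse plan, then apply `stages` rounds of refinement."""
    path = coarse_plan(start, goal)
    for _ in range(stages):
        path = refine(path, correct)
    return path


# Assumed example: from the road at (0, 0) to a spot at (8, 4), in metres.
for waypoint in coarse_to_fine((0.0, 0.0), (8.0, 4.0)):
    print(waypoint)
```

With the defaults, three coarse waypoints become five and then nine. Committing to the overall shape first and adding detail later is the design choice that plausibly helps on long road-to-spot approaches, where a single fine-grained pass would have to get many points right at once.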
The numbers support the story rather than lead it. According to the abstract, ParkingTransformer reports an 88.70% real-world success rate on the team's test platform and a 61.32 driving score in CARLA, a widely used open-source simulator for autonomous driving research. Both figures should be read with the usual caveats: the paper has not been peer-reviewed, the code release status is not confirmed in the abstract, and independent replication is not yet available.
The structural pattern is the story. Across recent robotics research, the move is to take a foundation model that was not designed for a physical task, acknowledge that it cannot do the job on its own, and then bolt on spatial scaffolding to make it work. ParkingTransformer is one of the more rigorously documented cases of that recipe, because the paper openly identifies what the LLM could not do and itemizes the engineering that filled the gap.
What to watch next is whether independent groups can reproduce the 88.70% figure and whether the architectural choices, particularly the coarse-to-fine decoder, become a standard part of LLM-augmented parking systems or remain an interesting experiment. The paper's own description of its limits is the reason to read it carefully rather than as a victory lap.