Sophia Trains Faster but Goes No Farther: What Stanford's FLeX Code Research Found
A Stanford University student has published research showing that combining Fourier-based regularization with low-rank adaptation (LoRA) fine-tuning can improve how a code-generating AI model translates between programming languages. The work, posted to arXiv on April 6, 2026 as a CS224N natural language processing course project by Gaurav Narasimhan, found that the approach lifted pass@1 performance on Python-to-Java translation from 34.2 percent to 42.1 percent on the MultiPL-E benchmark, where pass@1 measures how often the model's single top-generated candidate passes the benchmark's test cases. That is a 7.9 percentage point gain.
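Pass@k is conventionally computed with the unbiased estimator introduced alongside the HumanEval benchmark; a minimal sketch of that standard formula (the paper's actual evaluation harness is not shown, so this is the textbook computation, not necessarily the exact code used):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn from n generations of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a passing sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to the plain pass rate c / n.
print(pass_at_k(10, 4, 1))  # → 0.4
```

With k = 1, the estimator is simply the fraction of generated candidates that pass, which is why pass@1 is often described as the model's "first try" success rate.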
The headline result is modest in absolute terms: 42.1 percent is far from reliable code translation, but the method is computationally light. LoRA fine-tuning modified just 0.2 percent of the model's parameters, adding trained adapter layers rather than retraining the full network. The fine-tuned model also outperformed Code Llama-Python-7B on the HumanEval benchmark (40.1 percent pass@1 versus 38.4 percent), despite Code Llama-Python having been specifically trained on Python code.
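The mechanics behind that 0.2 percent figure are simple: LoRA freezes the pretrained weight matrix and adds a trainable low-rank product beside it. A minimal NumPy sketch of a single adapted layer (the hidden size and rank here are illustrative, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 8                          # hidden size and LoRA rank (illustrative)
W = rng.standard_normal((d, d))         # frozen pretrained weight, never updated
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized
                                        # so the adapter starts as a no-op

def forward(x: np.ndarray) -> np.ndarray:
    # Base path plus low-rank adapter path; only A and B receive gradients.
    return x @ W.T + x @ A.T @ B.T

# Trainable fraction for this one layer: 2*d*r adapter parameters
# versus d*d frozen parameters.
print(2 * d * r / (d * d))  # → 0.00390625
```

The trainable fraction shrinks further once the frozen attention and MLP weights of all layers are counted, which is how whole-model figures like 0.2 percent arise at low rank.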
The most striking finding is buried in the optimizer comparison. When the researcher trained the model using Sophia, a second-order optimizer, convergence was roughly 30 percent faster than with AdamW, the standard workhorse of neural network training. The final accuracy plateau was the same regardless of which optimizer was used. Getting there faster did not mean getting farther.
This matters for teams building code translation pipelines. If you need a model that lands at the best possible performance level, the optimizer choice is largely irrelevant: AdamW reaches the same plateau, given enough training time. If you need results fast, Sophia saves wall-clock time, but it does not buy a better final answer. The practical implication is that Sophia's speed advantage is real only if your constraint is wall-clock time and your quality bar is the same as AdamW's plateau. That is a narrower use case than the 30 percent convergence figure suggests.
The paper applies its technique to Code Llama 7B, Meta's open-source code model. The Fourier regularization component penalizes certain high-frequency components in the model's middle layers during fine-tuning, with optimal results at a regularization strength (lambda) of 0.02 and a frequency threshold of 0.5. The researcher tested the approach only on Python-to-Java translation, leaving open whether the pattern holds for other language pairs.
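The paper's loss function is not reproduced here, so the following is a hypothetical sketch of what a high-frequency penalty on a weight matrix could look like, using the reported lambda of 0.02 and frequency threshold of 0.5 only as illustrative default values:

```python
import numpy as np

def fourier_penalty(W: np.ndarray, lam: float = 0.02, thresh: float = 0.5) -> float:
    """Hypothetical regularizer, not the paper's exact formulation:
    penalize the spectral energy of a weight matrix above a normalized
    frequency threshold, scaled by a strength lambda."""
    F = np.fft.fft2(W)                      # 2-D spectrum of the weight matrix
    fr = np.fft.fftfreq(W.shape[0])         # normalized row frequencies in [-0.5, 0.5)
    fc = np.fft.fftfreq(W.shape[1])         # normalized column frequencies
    # Mark components whose frequency magnitude, rescaled to [0, 1],
    # exceeds the threshold in either axis.
    mask = (np.abs(fr)[:, None] * 2 > thresh) | (np.abs(fc)[None, :] * 2 > thresh)
    return lam * float(np.sum(np.abs(F[mask]) ** 2)) / W.size
```

In a fine-tuning loop, a term like this would be added to the task loss for the targeted middle layers, nudging their weights toward smoother (lower-frequency) structure.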
There are unresolved gaps. The paper finds that when the LoRA adapter weights are merged back into the base model, a step common in deployment pipelines, performance drops below the baseline. The researcher notes the issue but does not fully explain it. The Fourier regularization also showed inconsistent results across different layer configurations, working well on the model's multi-layer perceptron (MLP) layers but not reliably elsewhere.
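What makes the merge regression odd is that, for a single linear layer in exact arithmetic, folding the adapter into the base weight (W' = W + (alpha/r)·BA, in common LoRA conventions) is output-equivalent to keeping the adapter separate. A sketch, with alpha and the dimensions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 4
W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((r, d)) * 0.01   # trained down-projection
B = rng.standard_normal((d, r)) * 0.01   # trained up-projection
alpha = 16                               # LoRA scaling hyperparameter (illustrative)

# Merge: fold the scaled low-rank product into a single weight matrix.
W_merged = W + (alpha / r) * (B @ A)

x = rng.standard_normal(d)
adapter_out = x @ W.T + (alpha / r) * (x @ A.T @ B.T)  # separate adapter path
merged_out = x @ W_merged.T                            # merged weight path
print(np.allclose(adapter_out, merged_out))  # → True
```

Since the merge is an algebraic identity per layer, the reported post-merge drop presumably stems from something the simple picture omits; the paper notes the regression without resolving its cause.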
These limitations are typical of course project work, where scope is constrained by a semester timeline. The research, posted to arXiv as a preprint, has not been peer reviewed. That said, the LoRA efficiency finding and the optimizer comparison are results worth tracking, particularly as more teams look for lightweight ways to specialize code models for enterprise environments where multiple programming languages coexist.
The broader context for code translation work is a field moving fast. Models like GPT-4 and Claude have pushed benchmark scores higher across the board, but cross-lingual code transfer, taking a model trained primarily on one language and getting it to generate correct code in another, remains genuinely hard. The enterprise use case is concrete: large organizations maintain codebases in a mix of languages, and tools that reduce the manual effort of porting between them would have clear value. The jump from 34 percent to 42 percent on a standardized benchmark does not close the gap, but it narrows it.