Two New Papers Try to Fix VLA Fine-Tuning. One Is More Novel Than It Sounds.

FiberTune and APT each tackle a different failure mode in vision-language-action model training. Understanding why they matter requires knowing what they're actually solving.

11 June 2026 · 9 min read

If you have been following the vision-language-action literature, you have probably noticed a pattern: researchers take a powerful pretrained VLM, bolt on an action head, fine-tune on robot demonstrations, and then discover that the model has quietly forgotten something important in the process. Two preprints posted this week, arXiv:2606.08653 and a companion paper, each diagnose a different version of this forgetting problem and propose a training-time fix. Neither paper is a complete solution to VLA fine-tuning, but taken together they pin down what is going wrong inside these models more precisely than prior work had.

Let me walk through both, because I think the framing matters as much as the results.

What problem are these papers actually solving?

Vision-language-action models, for readers less deep in this literature, are architectures that start from a pretrained vision-language model (something like a large multimodal transformer trained on internet-scale image-text data) and extend it to predict robot actions. The promise is that the rich visual and linguistic representations already learned during pretraining will transfer to manipulation tasks, reducing the amount of robot demonstration data required and improving generalization to novel instructions or scenes.

The reality, as both papers document carefully, is messier. Fine-tuning on action-labeled robot data introduces pressures that can degrade the pretrained representations in ways that are not immediately obvious from task performance metrics alone.

FiberTune, from the first paper (arXiv:2606.08653), identifies a specific failure mode it calls residual visual collapse along local action fibers. This is worth unpacking. The intuition is that when you supervise a VLA model purely on action prediction, the gradient signal only constrains feature directions that actually change the predicted action. Features that vary across states but happen to be consistent with a given action, what the authors call the action fiber, receive no corrective gradient and are free to drift. Over training, the model loses the structured visual information that sits in those unconstrained directions.
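To make the fiber intuition concrete, here is a toy sketch of my own, not the paper's construction. It assumes a purely linear action head `W` on top of a feature vector `z` and a squared-error action loss; the dimensions and names are all hypothetical. Under those assumptions, the directions in feature space that leave the predicted action unchanged are the null space of `W`, and the gradient of the action loss with respect to `z` has no component along them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8-dim visual features, 2-dim action.
feat_dim, action_dim = 8, 2

# Assumption: a linear action head. Real VLA heads are nonlinear, in which
# case the same argument applies locally to the head's Jacobian.
W = rng.normal(size=(action_dim, feat_dim))


def action_loss_grad(z, a_target):
    """Gradient of 0.5 * ||W z - a_target||^2 with respect to the features z."""
    return W.T @ (W @ z - a_target)


# Rows of vt beyond the rank of W form an orthonormal basis for the null
# space of W: feature directions that do not change the predicted action.
_, _, vt = np.linalg.svd(W)
fiber_basis = vt[action_dim:]  # shape: (feat_dim - action_dim, feat_dim)

z = rng.normal(size=feat_dim)
a_target = rng.normal(size=action_dim)
grad = action_loss_grad(z, a_target)

# Project the feature gradient onto the fiber directions.
grad_along_fiber = fiber_basis @ grad

# Move the features a long way along one fiber direction and compare actions.
z_shifted = z + 3.0 * fiber_basis[0]
action_change = W @ z_shifted - W @ z

print("gradient norm:", np.linalg.norm(grad))
print("max |gradient along fiber|:", np.abs(grad_along_fiber).max())
print("max |action change along fiber|:", np.abs(action_change).max())
```

The gradient itself is nonzero, but its projection onto the fiber directions vanishes up to floating-point error, and so does the change in predicted action when the features slide along a fiber. In this toy setting, nothing in the action loss pushes back on drift in those directions, which is the mechanism the FiberTune authors describe.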
