Two New Papers Try to Fix VLA Fine-Tuning. One Is More Novel Than It Sounds.
FiberTune and APT each tackle a different failure mode in vision-language-action model training. Understanding why they matter requires knowing what they're actually solving.
If you have been following the vision-language-action literature, you have probably noticed a pattern: researchers take a powerful pretrained VLM, bolt on an action head, fine-tune on robot demonstrations, and then discover that the model has quietly forgotten something important in the process. Two preprints posted this week, FiberTune (arXiv:2606.08653) and APT (arXiv:2606.12366), each diagnose a different version of this forgetting problem and propose a training-time fix. Neither paper is a complete solution to VLA fine-tuning, but taken together they clarify what is actually going wrong inside these models in ways that prior work had not made precise.
Let me walk through both, because I think the framing matters as much as the results.
Vision-language-action models, for readers less deep in this literature, are architectures that start from a pretrained vision-language model (something like a large multimodal transformer trained on internet-scale image-text data) and extend it to predict robot actions. The promise is that the rich visual and linguistic representations already learned during pretraining will transfer to manipulation tasks, reducing the amount of robot demonstration data required and improving generalization to novel instructions or scenes.
The reality, as both papers document carefully, is messier. Fine-tuning on action-labeled robot data introduces pressures that can degrade the pretrained representations in ways that are not immediately obvious from task performance metrics alone.
FiberTune, from the first paper (arXiv:2606.08653), identifies a specific failure mode it calls residual visual collapse along local action fibers. This is worth unpacking. The intuition is that when you supervise a VLA model purely on action prediction, the gradient signal only constrains feature directions that actually change the predicted action. Feature directions that vary across states but leave the predicted action unchanged (what the authors call the action fiber) receive no corrective gradient and are free to drift. Over training, the model loses the structured visual information that sits in those unconstrained directions.
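A toy linear example makes the intuition concrete. This is my own sketch, not the paper's setup: it assumes a hypothetical linear action head `W` and a squared-error action loss. Under those assumptions, the gradient with respect to the features always lies in the row space of `W`, so any direction in the null space of `W` (the "fiber" in this toy) gets exactly zero training signal.

```python
import numpy as np

rng = np.random.default_rng(0)

feat_dim, act_dim = 8, 2
W = rng.normal(size=(act_dim, feat_dim))   # hypothetical linear action head
f = rng.normal(size=feat_dim)              # one feature vector
a_target = rng.normal(size=act_dim)        # demonstrated action

# Gradient of 0.5 * ||W f - a_target||^2 with respect to the features f.
grad_f = W.T @ (W @ f - a_target)

# Orthonormal basis for the null space of W: directions along which f can
# move without changing the predicted action.
_, _, vt = np.linalg.svd(W)
fiber_basis = vt[act_dim:]                 # shape (feat_dim - act_dim, feat_dim)

# Component of the gradient along the fiber directions.
grad_along_fiber = fiber_basis @ grad_f

print("gradient norm:", np.linalg.norm(grad_f))
print("gradient norm along fiber:", np.linalg.norm(grad_along_fiber))
```

The second printed norm is zero up to floating-point error: the action loss alone has nothing to say about those directions. Real VLAs are nonlinear, so the fiber there is only defined locally, which is presumably why the paper speaks of local action fibers.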
APT (arXiv:2606.12366, Action expert PreTraining) diagnoses a related but distinct problem. In VLA architectures that use a continuous action expert (a separate module, often a diffusion policy or flow-matching network, rather than discretized action tokens), that expert starts from random initialization and learns entirely from the robot demonstration dataset. The demonstration dataset is structurally imbalanced: it contains far more visual and action diversity than language diversity. The noisy gradients from the randomly initialized action expert propagate back into the VLM and corrupt its language representations, which is precisely the capability you were hoping to exploit.
These are genuinely different failure modes. FiberTune is about visual representation collapse during fine-tuning. APT is about language representation corruption caused by a randomly initialized action expert. Both problems can, in principle, co-occur in the same model.
FiberTune's solution is a training-time regularization objective. During fine-tuning, the method maintains an online action probe, a lightweight linear classifier trained to predict actions from intermediate visual token representations. This probe estimates which feature directions are action-predictive. FiberTune then filters those directions out of the visual token representations, producing what the paper calls probe-filtered residuals, and aligns those residuals to the corresponding representations from a frozen visual teacher model (the original pretrained backbone). The method also regularizes the effective rank of those residuals, which is a way of encouraging the model to retain a rich, low-redundancy visual feature space rather than collapsing to a low-dimensional representation.
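To make the probe-filter-align idea tangible, here is a minimal NumPy sketch. Everything in it is an assumption for illustration: the function names are hypothetical, the probe is fit in closed form by least squares rather than trained online, "filtering" is implemented as an orthogonal projection away from the probe's directions, and alignment is measured with cosine similarity. The paper's actual losses may differ.

```python
import numpy as np

def fit_linear_probe(features, actions):
    """Least-squares stand-in for the online probe: actions ~ features @ P."""
    P, *_ = np.linalg.lstsq(features, actions, rcond=None)
    return P                                    # shape (feat_dim, act_dim)

def filter_probe_directions(features, P):
    """Project out the feature directions the probe uses to predict actions."""
    Q, _ = np.linalg.qr(P)                      # orthonormal basis of probe span
    return features - (features @ Q) @ Q.T

def residual_alignment_loss(student, teacher, actions):
    """1 - mean cosine similarity between probe-filtered residuals."""
    P = fit_linear_probe(student, actions)
    rs = filter_probe_directions(student, P)
    rt = filter_probe_directions(teacher, P)
    cos = np.sum(rs * rt, axis=1) / (
        np.linalg.norm(rs, axis=1) * np.linalg.norm(rt, axis=1) + 1e-8
    )
    return 1.0 - float(cos.mean())

rng = np.random.default_rng(0)
n_tokens, feat_dim, act_dim = 256, 16, 3
teacher = rng.normal(size=(n_tokens, feat_dim))           # frozen teacher features
actions = teacher[:, :act_dim] + 0.01 * rng.normal(size=(n_tokens, act_dim))

student_close = teacher + 0.05 * rng.normal(size=teacher.shape)   # little drift
student_drifted = rng.normal(size=teacher.shape)                  # unrelated features

loss_close = residual_alignment_loss(student_close, teacher, actions)
loss_drifted = residual_alignment_loss(student_drifted, teacher, actions)
print("loss, student near teacher:", loss_close)
print("loss, drifted student:", loss_drifted)
```

The point of filtering first is that the alignment term only pulls on directions the action loss ignores, so in principle it should not fight the task objective. None of these helpers exist at inference time, which matches the training-only design described next.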
The key design choice is that all of this happens only at training time. At inference, the model runs identically to a standard fine-tuned VLA. There is no frozen teacher, no probe, no rank regularization. This matters for deployment, since inference-time overhead is a real constraint in real-time robot control.
APT's solution is a two-stage training procedure grounded in a Bayesian factorization of the policy. The authors decompose the VLA policy into a language-agnostic vision-action prior and a language-conditioned likelihood. In Stage 1, the action expert is pretrained on vision-action pairs while the VLM is frozen, learning a visuomotor prior without any language signal at all. This sidesteps the language imbalance problem entirely during this stage. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the visuomotor prior learned in Stage 1.
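A gated fusion of this kind can be sketched in a few lines. This is an assumption-heavy illustration, not the paper's mechanism: I am using a tanh-gated residual with the gate initialized at zero, a common pattern for adding a new input stream to a pretrained module, and the names `gated_fusion`, `W_proj`, and `gate` are hypothetical.

```python
import numpy as np

def gated_fusion(action_hidden, lang_features, W_proj, gate):
    """Add projected language features to the action expert's hidden state,
    scaled by tanh(gate). With gate = 0 the Stage 1 hidden state passes
    through unchanged."""
    return action_hidden + np.tanh(gate) * (lang_features @ W_proj)

rng = np.random.default_rng(0)
hidden_dim, lang_dim = 8, 12
action_hidden = rng.normal(size=(4, hidden_dim))    # from the Stage 1 action expert
lang_features = rng.normal(size=(4, lang_dim))      # from the VLM
W_proj = 0.1 * rng.normal(size=(lang_dim, hidden_dim))

out_start = gated_fusion(action_hidden, lang_features, W_proj, gate=0.0)
out_later = gated_fusion(action_hidden, lang_features, W_proj, gate=0.5)

print("unchanged at gate=0:", np.allclose(out_start, action_hidden))
print("max change at gate=0.5:", np.abs(out_later - action_hidden).max())
```

With a zero-initialized gate, Stage 2 starts from exactly the behavior learned in Stage 1 and lets language influence grow only as the gate is trained, which is one plausible reading of "preserving the visuomotor prior."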
The Bayesian framing is somewhat loose as presented, though I admit this is a picky point. The paper does not derive a formal variational objective from the factorization; the factorization is more of a motivating intuition for the two-stage design than a rigorous probabilistic derivation. That said, the design choice it motivates is sensible and the empirical results support it.
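For reference, the factorization in question is just Bayes' rule applied to the policy. The notation below is mine, not necessarily the paper's: $a$ is the action, $o$ the visual observation, and $\ell$ the language instruction. The mapping of each factor to a training stage is my reading of the method.

```latex
\pi(a \mid o, \ell)
  \;=\; \frac{p(a \mid o)\, p(\ell \mid o, a)}{p(\ell \mid o)}
  \;\propto\;
  \underbrace{p(a \mid o)}_{\text{vision-action prior (Stage 1)}}
  \;\cdot\;
  \underbrace{p(\ell \mid o, a)}_{\text{language-conditioned likelihood (Stage 2)}}
```

The identity itself is exact. What is loose is the link between the second factor and what Stage 2 actually optimizes, since a gated fusion layer is not obviously a likelihood model.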
FiberTune reports results across six simulation settings spanning two benchmarks (CALVIN and a second benchmark) and two architectures, pi_0.5 and OpenVLA-OFT. The headline number is a +10.7 percentage point improvement in SR(5) on the long-horizon CALVIN ABC-to-D task, which is a meaningful gain on a benchmark that has become a reasonable stress test for generalization in manipulation. The method also improves physical robot performance on an SO-101 pick-place task, raising task success from 72.7% to 78.1%.
The physical robot result, while modest in absolute terms (a 5.4 percentage point gain), is the one I find most credible, precisely because it is hardest to game. Simulation benchmarks can be sensitive to hyperparameter choices in ways that do not always replicate. The fact that the method shows consistent improvement across both simulation and physical hardware, under identical training conditions, is a point in its favor.
The residual diagnostics the paper includes are also useful. The authors show that performance gains coincide with increased probe-filtered residual teacher alignment and higher effective rank, which is consistent with the action-fiber motivation. This is not proof of causality, but it is the kind of mechanistic evidence that distinguishes a principled method from one that happens to work for opaque reasons.
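For readers unfamiliar with effective rank, one standard definition is the exponential of the Shannon entropy of the normalized singular values. I am assuming that definition here; the paper may use a variant. The sketch below compares features that spread variance across all dimensions with features that have collapsed to two.

```python
import numpy as np

def effective_rank(X, eps=1e-12):
    """exp(entropy) of the normalized singular-value distribution of X."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
n, d = 512, 16

# Variance spread over all 16 dimensions.
rich = rng.normal(size=(n, d))

# Same shape, but every row lives in a 2-dimensional subspace.
collapsed = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d))

print("effective rank, rich features:", effective_rank(rich))
print("effective rank, collapsed features:", effective_rank(collapsed))
```

Both matrices have the same shape, so dimensionality alone would not reveal the collapse. That is why a diagnostic like this is useful alongside task success.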
APT's results focus on out-of-distribution language instruction generalization, which is a harder and arguably more important test than in-distribution task success. The paper reports consistent gains on unseen instructions and compositional tasks across mainstream VLA architectures including pi-style and GR00T-style models. The breadth of architectural coverage is notable; a method that only works on one architecture is of limited practical interest.
That said, the sample sizes involved in the physical robot experiments across both papers are small. Neither set of results has yet been replicated by independent groups, and the tasks evaluated, while real, are relatively constrained manipulation scenarios. Long-horizon, contact-rich, or highly dexterous tasks remain untested.
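To see why small trial counts matter, here is a quick Wilson score interval calculation. The trial counts are hypothetical, chosen only for illustration; I am not claiming either paper ran this many rollouts.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return center - half, center + half

# Hypothetical rollout counts at roughly a 78% success rate.
for trials in (20, 50, 500):
    successes = round(0.78 * trials)
    low, high = wilson_interval(successes, trials)
    print(f"n={trials:4d}  rate={successes / trials:.3f}  "
          f"95% CI=({low:.3f}, {high:.3f})")
```

At a few dozen rollouts per condition, the interval around a success rate near 78% is wide enough to contain a baseline near 73%. It takes hundreds of rollouts before a gap of that size clearly separates from noise.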
This is where I want to be precise, because the novelty claims in VLA papers are often overstated.
FiberTune is genuinely new in its formalization of the action-fiber collapse problem. Prior work on representation preservation during fine-tuning (including various knowledge distillation and elastic weight consolidation approaches) has addressed catastrophic forgetting in general terms, but the specific identification of the action-fiber structure as the locus of collapse is a conceptual contribution. The use of an online probe to estimate and filter action-predictive directions before aligning residuals to a frozen teacher is a clean technical idea that does not appear, to my knowledge, in the prior VLA fine-tuning literature.
The techniques it draws on (frozen-teacher distillation, rank regularization, probe-based feature decomposition) are all established. The novelty is in their combination and in the specific argument for why that combination addresses a problem that standard task-loss fine-tuning misses.
APT is more incremental over existing two-stage training approaches, but the specific problem it addresses, corruption of VLM language representations by a randomly initialized continuous action expert, is underappreciated in the literature. Most prior work on VLA language generalization has focused on data augmentation or co-training strategies rather than on the initialization and gradient dynamics of the action expert itself. The Bayesian factorization framing, loose as it is, at least provides a principled vocabulary for discussing why stage separation helps.
It is also worth noting that both papers address architectures with continuous action experts rather than discretized action tokens. This is a meaningful scope distinction. Methods like RT-2 and early OpenVLA used discrete action tokens, which allowed vision-language co-training to naturally protect language representations. The shift toward continuous action experts (diffusion policies, flow matching) in more recent VLA architectures like pi_0 and GR00T has reopened some of these representation problems, and both papers are responding to that specific context.
Several things remain unclear from both papers, and they point toward what the field needs.
First, the interaction between these two failure modes. If FiberTune addresses visual collapse and APT addresses language corruption, what happens when you apply both? The papers do not address this, and it is not obvious that the methods are compositional. The probe-filtered residual alignment in FiberTune operates on visual tokens; the gated fusion in APT operates on the interface between the action expert and the VLM. They might combine cleanly, or they might interact in unexpected ways.
Second, both papers evaluate on relatively short-horizon pick-and-place style tasks. The +10.7 point gain on CALVIN ABC-to-D is encouraging because CALVIN does test some degree of language-conditioned long-horizon behavior, but it is still a table-top manipulation benchmark with a constrained object set. Whether these methods help on tasks requiring sustained contact, tool use, or multi-step reasoning under visual ambiguity is unknown.
Third, FiberTune's online probe introduces a training-time dependency that the paper does not fully analyze. The probe quality presumably varies over the course of training as the model's representations evolve. Whether the probe is reliably accurate early in training, when it matters most for preventing collapse, is not addressed in the paper. I would want to see an ablation on probe training dynamics.
Fourth, both methods add training complexity. FiberTune requires maintaining a frozen teacher and an online probe. APT requires a two-stage training pipeline. For practitioners working with limited compute, the question of whether the gains justify the overhead is real. The papers do not report wall-clock training times in a way that makes this easy to assess.
Finally, independent replication. Both papers are preprints, posted within days of each other. The results are internally consistent and the methods are described clearly enough to reimplement, but so far the only evidence comes from the original authors. The field will learn more when other groups run these methods on their own hardware and datasets.
None of this is to dismiss the contributions. FiberTune in particular seems to me like the kind of paper that will age well: it identifies a real problem with a clean formalization, proposes a solution with clear mechanistic motivation, and tests it carefully across multiple settings. APT is a useful practical recipe for a problem that practitioners training continuous-action VLAs have probably encountered without having good language for it.
The broader picture these two papers are sketching, a VLA fine-tuning landscape full of subtle representation pathologies that task-loss metrics alone cannot detect, is one the field needs to take seriously. We have been optimizing for task success rates while, in a sense, not watching what happens to the representations underneath.