Two New Papers Want to Solve Quadrotor Control in the Real World. One Might Actually Do It.
A pair of arXiv preprints tackle the same hard problem in drone control from very different angles. The results are promising, but the gap between 'outdoor experiment' and 'deployed system' remains large.
·10 hours ago·9 min read
Think of a quadrotor trying to hold a precise trajectory in a gusty crosswind. The physics are unforgiving. Small errors in predicting how the vehicle will respond to a motor command compound quickly, and by the time the controller has noticed it is wrong, the drone may already be meters off course. This is not an exotic edge case; it is the central problem of agile aerial control, and it has resisted clean solutions for years. Two papers posted to arXiv in late June 2025 take meaningfully different approaches to it, and reading them together tells you something useful about where learning-based drone control actually stands right now.
The first, arXiv preprint 2606.23444, introduces SkyJEPA, a world-model architecture for quadrotor control built around the Joint Embedding Predictive Architecture (JEPA) framework originally developed for self-supervised visual learning. The second, preprint 2606.27353, proposes a continual learning framework for robot policies under what the authors call "hidden and recurring dynamics," tested on the same kind of platform: a real quadrotor flying under changing wind conditions. Both papers are honest about the problem they are solving. Neither oversells. That alone puts them a step ahead of a lot of work in this space.
Let me start with SkyJEPA, because its core architectural claim is the more conceptually interesting of the two. The standard approach to learning a dynamics model for a robot is to train a neural network to predict the next state given the current state and action, then roll that prediction forward autoregressively to plan over a horizon. The problem with this, as anyone who has tried it knows, is that errors accumulate. Each predicted state feeds into the next prediction, and small inaccuracies compound until the model's imagined trajectory diverges badly from reality. This is not a new observation; it is essentially the core limitation that model-based RL researchers have been working around for the better part of a decade.
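To make the compounding concrete, here is a minimal sketch in Python. It is a toy of my own, not anything from either paper: a 1D velocity model with linear drag, where the "learned" model has a drag coefficient that is 10 percent off. The function names, coefficients, and time step are all assumptions chosen for illustration.

```python
def true_step(x, u, dt=0.02):
    """Toy ground truth: 1D velocity with linear drag (assumed dynamics)."""
    drag = 0.50
    return x + dt * (u - drag * x)


def learned_step(x, u, dt=0.02):
    """Stand-in for a learned model: same form, drag off by 10 percent."""
    drag = 0.45
    return x + dt * (u - drag * x)


def rollout_errors(horizon, u=1.0):
    """Return absolute errors per step for two ways of using the model.

    one_step:  the model is re-seeded with the true state at every step.
    open_loop: the model feeds on its own previous prediction.
    """
    x_true = 0.0
    x_model = 0.0
    one_step, open_loop = [], []
    for _ in range(horizon):
        pred_from_truth = learned_step(x_true, u)  # one-step prediction
        x_model = learned_step(x_model, u)         # autoregressive rollout
        x_true = true_step(x_true, u)              # what actually happens
        one_step.append(abs(pred_from_truth - x_true))
        open_loop.append(abs(x_model - x_true))
    return one_step, open_loop


if __name__ == "__main__":
    one_step, open_loop = rollout_errors(horizon=100)
    for k in (0, 9, 49, 99):
        print(f"step {k + 1:3d}: one-step {one_step[k]:.5f}, "
              f"open-loop {open_loop[k]:.5f}")
```

The one-step error stays tiny because the model is corrected by the true state at every step. The open-loop error is the same model fed its own output, and after 100 steps it is more than ten times larger. Real quadrotor dynamics are far less forgiving than this stable toy system.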
JEPA-style architectures, associated most prominently with Yann LeCun's work at Meta on self-supervised learning, sidestep this by doing prediction in a learned latent space rather than in raw observation space. The idea is that the latent representation can be structured to make future states easier to predict, without requiring the model to reconstruct every irrelevant detail of the observation. Prior applications of JEPA to robotics have focused on navigation at the kinematic level, essentially planning where a robot should go rather than computing the precise motor commands to get it there. SkyJEPA is, to be precise, the first application of this architecture to high-frequency real-time control of a quadrotor, which operates at timescales where milliseconds matter.
What makes the paper technically credible is the physics-inspired prober. Rather than leaving the latent space entirely abstract, the authors add a lightweight module that maps frozen latents to interpretable physical states, things like position, velocity, and attitude. This is a smart design choice. It means the latent representation is constrained to be physically meaningful, which gives the model a useful inductive bias and also makes it possible to debug: if the prober's outputs are nonsense, you know the latent dynamics model has gone wrong somewhere. The learned model is then paired with a sampling-based optimal control method, specifically a variant of Model Predictive Path Integral control, to generate real-time commands on embedded hardware.
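For readers who have not met Model Predictive Path Integral control, the sketch below shows the basic loop in plain Python. It is a generic textbook-style MPPI under my own assumptions, not the variant used in the paper: the dynamics are a hand-written 1D double integrator standing in for the learned latent model, and the cost weights, noise scale, and temperature `lam` are arbitrary illustrative choices.

```python
import math
import random


def model_step(state, u, dt=0.05):
    """Stand-in dynamics: 1D double integrator (position, velocity)."""
    p, v = state
    return (p + dt * v, v + dt * u)


def rollout_cost(state, controls, target):
    """Accumulated tracking cost of one candidate control sequence."""
    cost = 0.0
    for u in controls:
        state = model_step(state, u)
        p, v = state
        cost += (p - target) ** 2 + 0.1 * v ** 2 + 0.01 * u ** 2
    return cost


def mppi_update(state, nominal, target, rng, samples=128, sigma=1.0, lam=1.0):
    """One MPPI iteration: perturb the nominal plan, score each rollout,
    then average the perturbations with exponential (softmin) weights."""
    noise = [[rng.gauss(0.0, sigma) for _ in nominal] for _ in range(samples)]
    costs = [rollout_cost(state, [u + e for u, e in zip(nominal, eps)], target)
             for eps in noise]
    best = min(costs)  # subtracting the minimum keeps exp() well behaved
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    return [u + sum(w * eps[t] for w, eps in zip(weights, noise)) / total
            for t, u in enumerate(nominal)]


def run_closed_loop(steps=200, horizon=20, seed=0):
    """Receding-horizon loop: replan, apply the first command, shift the plan."""
    rng = random.Random(seed)
    state, target = (0.0, 0.0), 1.0
    nominal = [0.0] * horizon
    for _ in range(steps):
        nominal = mppi_update(state, nominal, target, rng)
        state = model_step(state, nominal[0])
        nominal = nominal[1:] + [0.0]
    return state


if __name__ == "__main__":
    position, velocity = run_closed_loop()
    print(f"final position {position:.3f}, final velocity {velocity:.3f}")
```

The appeal for embedded use is that each candidate rollout is independent, so the expensive part parallelizes well, and the dynamics model is only ever called forward. That is why the quality of the learned model matters so much: every sampled plan is scored entirely inside the model's imagination.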
The sim-to-real transfer results are what the paper leads with, and they are the most interesting part. The model is trained entirely in simulation, using an automated dataset generation pipeline the authors developed to reduce dependence on real-world data collection, and then deployed on a physical quadrotor outdoors without any fine-tuning on real data. Zero-shot sim-to-real transfer for a high-frequency controller is genuinely hard. Most dynamics models trained in simulation fail on contact with real hardware because the simulation's aerodynamics are wrong in ways that matter enormously at high speeds. The fact that SkyJEPA generalizes across what the authors describe as "diverse operating conditions" in outdoor experiments is a real result, not a trivial one.
That said, the paper does not foreground what those operating conditions actually include. Wind speeds, temperature ranges, payload variations, the specific outdoor environments used: these details appear in the experimental section, but not in a form that would make independent replication straightforward. I am not suggesting the results are fabricated; the open-loop prediction accuracy numbers are reported clearly, and the closed-loop trajectories look convincing. But the gap between "worked in our outdoor experiments" and "will work in your outdoor experiments" is not zero, and the paper does not fully quantify it.
The second paper, the continual learning framework, addresses a related but distinct problem. Where SkyJEPA asks how to build a world model good enough that you never need to update it, the continual learning paper asks what happens when the world keeps changing in ways your model did not anticipate. The motivating observation is blunt and correct: robots deployed in the real world rarely operate under a single fixed dynamics model. Wind changes. Batteries drain. Hardware wears. Most learning-based controllers are trained once and then frozen, as if the deployment environment will cooperate by staying constant. It will not.
The framework proposed here combines an analytical physics prior with a neural residual model that captures unmodeled effects. A recurrent encoder infers a latent "condition" from recent interaction history, and this condition is used to adapt both the residual model and the policy. The key architectural insight is the separation between recognition and re-fitting. During deployment, when the system encounters a wind disturbance it has seen before, it does not need to re-learn how to handle it from scratch; it recognizes the condition from recent interaction and retrieves the appropriate policy response. This is meaningfully different from standard online adaptation approaches, which treat every new disturbance as a new optimization problem.
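The structure is easier to see in code. The sketch below is a deliberately simplified stand-in, assuming a 1D point mass: the "physics prior" is a hand-written drag model, the "neural residual" is a single additive term, and the "recurrent encoder" is replaced by a plain average over recent transitions. None of these names or forms come from the paper; they only illustrate the prior-plus-residual split and the idea of inferring a latent condition from interaction history.

```python
DT = 0.02  # control period in seconds (assumed)


def physics_prior(velocity, thrust, mass=1.0, drag=0.3):
    """Analytical model: 1D point mass with linear drag (assumed form)."""
    accel = (thrust - drag * velocity) / mass
    return velocity + DT * accel


def residual(condition):
    """Stand-in for the neural residual: extra velocity change per step
    caused by the current condition (here, a wind acceleration)."""
    return DT * condition


def predict(velocity, thrust, condition):
    """Full prediction = physics prior + condition-dependent residual."""
    return physics_prior(velocity, thrust) + residual(condition)


def infer_condition(history):
    """Stand-in for the recurrent encoder: average the acceleration the
    prior fails to explain over (velocity, thrust, next_velocity) triples."""
    gaps = [(v_next - physics_prior(v, u)) / DT for v, u, v_next in history]
    return sum(gaps) / len(gaps)


def simulate(wind_accel, steps=50, thrust=1.0):
    """Generate transitions under a hidden, constant wind acceleration."""
    velocity, history = 0.0, []
    for _ in range(steps):
        v_next = physics_prior(velocity, thrust) + DT * wind_accel
        history.append((velocity, thrust, v_next))
        velocity = v_next
    return history


if __name__ == "__main__":
    history = simulate(wind_accel=-1.5)
    condition = infer_condition(history)
    v, u, v_next = history[-1]
    print(f"inferred condition: {condition:.3f}")
    print(f"prior-only error:   {abs(physics_prior(v, u) - v_next):.5f}")
    print(f"with residual:      {abs(predict(v, u, condition) - v_next):.5f}")
```

The design point is that the prior carries everything that is known analytically, so the learned part only has to explain what is left over, and the condition is the small piece of state that says which leftover applies right now.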
The numbers the authors report are striking. On a real quadrotor tracking a trajectory under changing wind, the policy recovers from recurring disturbances in roughly one second, approximately five times faster than online residual re-fitting baselines. Hover and tracking errors under large disturbances are reduced by 65.7% and 53.3%, respectively, relative to state-of-the-art online adaptation methods. These are large margins, and I want to be careful about getting too excited before independent replication, but the methodology is sound enough that the results deserve to be taken seriously.
There is a subtle point here that is easy to miss. The speed advantage comes not from the model adapting faster in a gradient-descent sense, but from recognition being fundamentally cheaper than re-fitting. If you have already learned what a 5 m/s crosswind from the north does to your vehicle, recognizing that condition from a second of flight data is a much simpler computation than re-estimating the disturbance from scratch via online optimization. This is a clean conceptual contribution, and it maps onto something real about how expert human pilots operate: they recognize familiar disturbance patterns and respond with practiced corrections, rather than re-solving the control problem from first principles every time.
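A toy contrast shows why. In the sketch below, recognition is a nearest-match lookup against a small library of previously seen conditions, while re-fitting is an online estimator that starts from zero and takes one small corrective step per incoming sample. The library entries, the learning rate, and the window length are hypothetical values I picked by hand, so the resulting ratio is set by those choices and says nothing about the paper's reported figure.

```python
# Hypothetical library of previously learned conditions, expressed as the
# unexplained acceleration (m/s^2) each one produces.
LIBRARY = {"calm": 0.0, "headwind": -1.5, "tailwind": 1.5}


def recognize(window, library=LIBRARY):
    """Recognition: average a short window of measurements and return the
    name of the stored condition that matches it most closely."""
    observed = sum(window) / len(window)
    return min(library, key=lambda name: abs(library[name] - observed))


def samples_to_refit(stream, lr=0.02, tol=0.1):
    """Re-fitting: estimate the disturbance from scratch, one small update
    per sample, and report how many samples it takes to get within a
    relative tolerance. Returns None if the stream ends first (which also
    happens when the disturbance is exactly zero)."""
    estimate = 0.0
    for n, gap in enumerate(stream, start=1):
        estimate += lr * (gap - estimate)
        if abs(estimate - gap) <= tol * abs(gap):
            return n
    return None


if __name__ == "__main__":
    stream = [-1.5] * 500   # a recurring headwind, noise-free for clarity
    window = stream[:10]    # recognition only looks at a short window
    print("recognized:", recognize(window), "after", len(window), "samples")
    print("re-fit converged after", samples_to_refit(stream), "samples")
```

The small learning rate is not an unfair handicap: an online estimator fed real, noisy flight data has to move cautiously, whereas a lookup only needs enough data to tell a handful of known conditions apart.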
Reading these two papers together, a tension emerges that is worth sitting with. SkyJEPA's approach, essentially building a world model robust enough to transfer zero-shot from sim to real, assumes that a sufficiently good prior model can handle the diversity of real-world conditions without online updating. The continual learning paper's approach assumes the opposite: that real-world conditions are too diverse and unpredictable for any static model to handle, and that ongoing adaptation is not optional but necessary. Both assumptions contain truth. The interesting question, which neither paper directly addresses, is where on this spectrum a practical deployed system should sit.
I know I am being picky here, but I would have liked to see both papers tested on the same set of disturbance conditions, using the same platform and evaluation protocol, so the tradeoffs could be compared directly. As it stands, they use different experimental setups, which makes it hard to say whether SkyJEPA's zero-shot transfer would degrade under the kinds of recurring disturbances the continual learning paper uses, or whether the continual learning framework's adaptation overhead would matter in the high-frequency control regime SkyJEPA targets. These are not criticisms of either paper individually; they are just the natural next questions.
On the methodology side, both papers use quadrotors as their testbed, which is sensible given the platform's sensitivity to dynamics mismatches, but it also means the results are specific to a relatively constrained class of aerial vehicles. Whether these approaches generalize to fixed-wing aircraft, ground vehicles operating in variable terrain, or manipulators dealing with changing payloads is unclear. The continual learning paper's framework is written at a level of generality that suggests the authors believe it should transfer to other platforms, and the simulation studies support this to some degree, but the paper reports no real-world validation on anything other than a quadrotor.
It is also too early to say how either approach scales with more complex tasks. Both papers focus on trajectory tracking and hover stabilization, which are important but relatively well-defined objectives. Real-world deployment often involves tasks where the objective itself changes, not just the dynamics, and it remains unclear how either architecture would handle that regime.
What I find genuinely valuable about both papers, taken together, is that they represent a maturation in how the robotics research community thinks about the dynamics modeling problem. The naive version of this problem, train a neural network on lots of data and hope it generalizes, has been tried extensively and found wanting. Both papers here incorporate meaningful structure: physics priors, interpretable probers, recurrent condition encoders. They are not throwing parameters at the problem; they are thinking carefully about what structure the problem has and building that structure into the model. That is the right direction, and it is worth saying so clearly even if the specific results here will need replication before anyone should build a product around them.
The SkyJEPA paper's automated dataset generation pipeline is also worth highlighting separately, because it addresses a real bottleneck that does not always get enough attention. Collecting real-world flight data is expensive, slow, and carries genuine safety risks. A structured pipeline for generating diverse, physically realistic simulation data, calibrated to transfer well to real hardware, has value beyond any single paper. If the pipeline itself were released and validated by other groups, that would be a more durable contribution than the architecture results alone.
For now, both papers sit in the zone that most good robotics research occupies: compelling results on specific platforms under specific conditions, with enough methodological care to make the results credible, but not yet the kind of broad validation that would justify calling the problem solved. The sample sizes are not large. Neither result has been replicated by independent groups. The experimental conditions, while outdoor and therefore more realistic than pure simulation, are still controlled enough that deployment in genuinely unstructured environments would be a further step. None of this diminishes the work. It just locates it accurately on the map from research result to deployed technology, which is a map the field has a persistent tendency to misread.