Compute is not a bottleneck for robotic manipulation

Robotic manipulation covers a wide range of tasks, from folding laundry to doing electrical work. Despite recent advances in AI, current manipulation models struggle to achieve generality and dexterity in the real world. However, compute doesn’t seem to be the blocker: in our dataset, the largest manipulation models typically train with ~1% of the compute used by frontier AI models in other domains.

Because many of the strongest manipulation systems come from the same labs that build much larger frontier models, this gap likely reflects limits on how effectively compute can be used under current conditions, rather than a lack of access or willingness to scale. Scarcity of robotics data and hardware constraints may both play a role. If these constraints ease, a large compute overhang could translate into faster capability gains.

Learn more

Data

The data on robot manipulation models comes from our new Robotics dataset, and the data on frontier AI models comes from our AI Models database. We used several sources to curate robotic manipulation models: Awesome LLM Robotics, Awesome Robotics Foundation Models, suggestions by two robotics experts, and existing entries in our AI Models database. We also added works that were missed by these sources but seemed notable, e.g. RDT-1B.

After curating these works, we manually filtered to those that report a real-world evaluation on a manipulation task, to ensure each model had real-world relevance. We then estimated the training compute of those models, producing 26 data points. Finally, we filtered to the rolling top-5 by training compute — models that ranked among the 5 largest released up to that point — leaving 14 “leading” models to fit a trend to. For frontier AI models, we filtered all AI models down to the rolling top-5 and then to models since 2018.
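The rolling top-5 filter can be sketched as follows. This is a minimal illustration with a made-up miniature dataset, not our actual data or code: a model is kept if, at its release date, it ranks among the top k models by training compute released so far.

```python
from datetime import date

def rolling_top_k(models, k=5):
    """Keep a model if, at its release date, it ranks among the
    top-k models by training compute released so far."""
    models = sorted(models)  # sort chronologically
    leading = []
    for i, (day, flop) in enumerate(models):
        top = sorted((f for _, f in models[: i + 1]), reverse=True)[:k]
        if flop >= top[-1]:
            leading.append((day, flop))
    return leading

# Hypothetical miniature dataset: (release date, training compute in FLOP)
models = [
    (date(2020, 1, 1), 5e20),
    (date(2020, 6, 1), 4e20),
    (date(2021, 1, 1), 1e20),   # below the rolling top-2 at its release
    (date(2021, 6, 1), 2e21),
]
print(rolling_top_k(models, k=2))  # drops only the 2021-01-01 model
```

Note that this filter is applied at each model's release date, so a model that was leading when released stays in the series even if later models surpass it.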

Analysis

We fit linear regressions to the logarithm of training compute over time, separately for the frontier AI series and the robotic manipulation series, and bootstrap these regressions to obtain 90% confidence intervals. The ~100x gap between the two trends comes from dividing the compute predicted by the frontier trend by that predicted by the robotics trend, evaluated at August 6th, 2025. There is one major outlier, PaLM-SayCan, which leveraged a frozen pre-trained LLM as a subcomponent of the system.
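The regression and bootstrap procedure can be illustrated as below. The data here is synthetic (generated with arbitrary slopes and intercepts), so the resulting gap is not our reported estimate; the structure of the computation is what matters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data, NOT the actual dataset: two series whose
# log10(training compute) grows roughly linearly over time, with noise.
frontier_years = np.linspace(2018.0, 2025.5, 14)
frontier_flop = 10 ** (0.7 * (frontier_years - 2018) + 24 + rng.normal(0, 0.3, 14))
robot_years = np.linspace(2018.0, 2025.5, 14)
robot_flop = 10 ** (0.6 * (robot_years - 2018) + 21 + rng.normal(0, 0.3, 14))

def fit(years, flop):
    # Linear regression of log10(compute) on time
    return np.polyfit(years, np.log10(flop), 1)

def predict(coeffs, year):
    return 10 ** np.polyval(coeffs, year)

def bootstrap_gap(n=2000, year_eval=2025.6):
    """Resample each series with replacement, refit both trends, and
    compute the frontier/robotics compute ratio at year_eval."""
    gaps = np.empty(n)
    for b in range(n):
        i = rng.integers(0, len(frontier_years), len(frontier_years))
        j = rng.integers(0, len(robot_years), len(robot_years))
        gaps[b] = (predict(fit(frontier_years[i], frontier_flop[i]), year_eval)
                   / predict(fit(robot_years[j], robot_flop[j]), year_eval))
    return np.quantile(gaps, [0.05, 0.5, 0.95])  # 90% CI plus the median

lo, med, hi = bootstrap_gap()
print(f"gap at mid-2025: {med:.0f}x (90% CI {lo:.0f}x to {hi:.0f}x)")
```

Resampling the two series independently treats the frontier and robotics trends as independent estimates, which is reasonable since the underlying model populations barely overlap.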

We concluded that data scarcity is a key constraint for robotic manipulation based on both private correspondence with robotics experts and public commentary. For example, this presentation from CoRL’24 emphasizes the large disparity between some of the largest public robotics datasets, like Open X-Embodiment, and GPT-2’s training dataset. There are several ways that the field might alleviate this constraint, such as leveraging frontier AI models more, or acquiring more data through teleoperation, videos, simulators, or world models.

Code for our analysis is available here.

Assumptions

Many robotics models are based on models trained on other modalities, such as language and images. The training compute estimates for robotics adhered to the following principle: if the parameters of a model affect the actions of the robot, then we include the full history of training compute applied to those model parameters. For example, the compute for RT-2 includes the compute used to train the vision-language model PaLI-X-55B, which itself includes the compute of ViT-22B and the 32B variant of UL2.
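This accounting principle amounts to summing compute over a model's full lineage. A minimal sketch, using the RT-2 lineage named above but with placeholder FLOP values (these are not our actual estimates):

```python
# Hypothetical lineage table for the RT-2 example. The FLOP values are
# PLACEHOLDERS for illustration, not our actual compute estimates.
lineage = {
    "ViT-22B":    {"own_flop": 1e23, "parents": []},
    "UL2-32B":    {"own_flop": 2e23, "parents": []},
    "PaLI-X-55B": {"own_flop": 3e23, "parents": ["ViT-22B", "UL2-32B"]},
    "RT-2":       {"own_flop": 5e21, "parents": ["PaLI-X-55B"]},
}

def total_compute(model, lineage):
    """Sum a model's own training compute with that of every upstream
    model whose parameters it builds on. Assumes the lineage is a tree:
    if an ancestor were reachable via two paths, it would be
    double-counted and should instead be visited only once."""
    entry = lineage[model]
    return entry["own_flop"] + sum(total_compute(p, lineage)
                                   for p in entry["parents"])

print(f"{total_compute('RT-2', lineage):.2e}")  # sums all four entries
```

With these placeholder values, RT-2's total is dominated by its upstream vision-language models, mirroring the pattern we observe in practice: robotics-specific compute is a small fraction of the total.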

We included these upstream model costs because they affect robot capabilities, e.g. by enhancing perception or planning. The robotics-specific training compute tends to be much smaller. Based on a subset of 10 of the 26 original data points, the largest robotics-specific training effort to date is about 1e23 FLOP for GR00T-N1.5, and this was bolstered by synthetic and simulated training data.

The training compute values also exclude any compute used to generate training data or run simulations, which is consistent with our methodology for frontier AI models. For example, NVIDIA’s Cosmos world models provided training data for GR00T-N1, and the largest of these models used an estimated 3e24 FLOP. We did not include the largest model in our analysis, but we did include a smaller Cosmos model because it was finetuned for robot control in this paper.

We assumed that our final set of compute estimates is representative of the most significant work in robotic manipulation. Where a single work presented multiple model variants, we considered only the variant that seemed most generally capable in real-world evaluation. For most works, it was not possible to estimate training compute to within 1.5 orders of magnitude; these models, including Gemini Robotics, were left out. However, we surveyed papers and industry news as of July 2025 (with help from ChatGPT o3 web search) and did not find convincing evidence of larger training runs than we report here.