Sumo

Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation

John Z. Zhang¹,², Maks Sorokin²*, Jan Brüdigam²*, Brandon Hung²*, Stephen Phillips², Dmitry Yershov²,
Farzad Niroui², Tong Zhao², Leonor Fermoselle², Xinghao Zhu², Chao Cao², Duy Ta²,
Tao Pang², Jiuguang Wang², Preston Culbertson²,³, Zachary Manchester¹, and Simon Le Cléac'h²

¹MIT ²RAI Institute ³Cornell

*Equal Contribution

This work was done in part during an internship at the RAI Institute.

Corresponding Email: jzhang3@mit.edu

Abstract

This paper presents a sim-to-real approach that enables legged robots to dynamically manipulate large and heavy objects with whole-body dexterity. Our key insight is that by performing test-time steering of a pre-trained whole-body control policy with a sample-based planner, we can enable these robots to solve a variety of dynamic loco-manipulation tasks. Interestingly, we find our method generalizes to a diverse set of objects and tasks with no additional tuning or training, and can be further enhanced by flexibly adjusting the cost function at test time. We demonstrate the capabilities of our approach through a variety of challenging loco-manipulation tasks on a Spot quadruped robot in the real world, including uprighting a tire heavier than the robot's nominal lifting capacity and dragging a crowd-control barrier larger and taller than the robot itself. Additionally, we show that the same approach can be generalized to humanoid loco-manipulation tasks, such as opening a door and pushing a table, in simulation.

Methods

System Overview: Left: our method takes a hierarchical approach that combines a pre-trained whole-body control (WBC) policy (purple) with high-level sample-based MPC (green). The low-level whole-body control policy takes in the current state and desired torso, arm, and leg commands and outputs the joint-level commands for the quadruped or humanoid robot at 50 Hz. The high-level sample-based MPC aims to minimize a task-specific cost function by taking in the current state estimate and solving for the desired torso, arm, and leg commands for the low-level policy at 20 Hz. Right: illustrations comparing standard dynamics rollouts, where the actions $u$ are the joint-level controls for the multi-body dynamics model, and our network-policy-augmented dynamics rollouts, where the actions $a$ are inputs to the low-level locomotion policy.
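The hierarchy described above can be illustrated with a minimal toy sketch. It is not the authors' implementation: the 1-D point-mass dynamics, the proportional `low_level_policy`, the quadratic goal cost, and all names and hyperparameters below are assumptions chosen only to show how a sampling-based planner can search over the inputs $a$ of a low-level policy rather than over joint-level controls $u$. For simplicity the planner here replans at every simulation step instead of running at a separate rate.

```python
import numpy as np

DT = 0.02  # toy simulation step in seconds (assumed)

def low_level_policy(state, command):
    """Stand-in for a pre-trained WBC policy: velocity command -> force (assumed)."""
    _, vel = state
    return 5.0 * (command - vel)  # simple proportional velocity tracking

def step(state, command):
    """Policy-augmented dynamics: the planner's action is the policy's input."""
    pos, vel = state
    force = low_level_policy(state, command)
    vel = vel + DT * force
    pos = pos + DT * vel
    return np.array([pos, vel])

def rollout_cost(state, commands, goal):
    """Roll the policy-augmented dynamics forward and sum squared goal error."""
    cost = 0.0
    for a in commands:
        state = step(state, a)
        cost += (state[0] - goal) ** 2
    return cost

def plan(state, nominal, goal, rng, num_samples=32, noise=0.5):
    """Sampling-based update: perturb the nominal plan and keep the best candidate."""
    samples = nominal + noise * rng.standard_normal((num_samples, len(nominal)))
    candidates = np.vstack([nominal, samples])  # nominal plan is always a candidate
    costs = [rollout_cost(state, c, goal) for c in candidates]
    return candidates[int(np.argmin(costs))]

def run(goal=1.0, horizon=20, steps=300, seed=0):
    """Closed loop: replan, apply the first command, shift the plan as a warm start."""
    rng = np.random.default_rng(seed)
    state = np.zeros(2)
    nominal = np.zeros(horizon)
    for _ in range(steps):
        nominal = plan(state, nominal, goal, rng)
        state = step(state, nominal[0])
        nominal = np.append(nominal[1:], nominal[-1])
    return state
```

Because the nominal plan is always among the candidates, each planning call returns a plan whose predicted cost is no worse than the warm start. The planner never touches the internals of `low_level_policy`, which is the property that lets the task cost change without retraining the policy.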

Spot Loco-Manipulation

Note: the Spot robot has a peak lift capacity of 11 kg and a continuous load capacity of 5 kg.

Tire Upright: Lift a tire to a vertical position. The tire weighs 15 kg, exceeding the robot's peak lifting capacity.

Cone Upright: Upright a traffic cone to a standing position.

Chair Upright: Upright a yellow chair to a standing position. The chair weighs 16.5 kg, exceeding the robot's peak lifting capacity.

Barrier Upright: Upright a crowd control barrier to a standing position. The barrier weighs 16 kg, exceeding the robot's peak lifting capacity, and is larger than the robot itself.

Tire Stack: Stack a tire on top of another. The tires weigh 15 kg each, exceeding the robot's peak lifting capacity.

Barrier Drag: Drag a crowd control barrier to the yellow circle. The barrier weighs 16 kg, exceeding the robot's peak lifting capacity, and is larger than the robot itself.

Tire Rack Drag: Drag a tire rack to the yellow circle.

Rugged Box Push: Push a rugged box to the yellow circle. The box weighs 20 kg, exceeding what the robot can push with its arm alone under everyday friction conditions.

G1 Loco-Manipulation

Table Push: Push a table to the goal location.

Chair Push: Push a chair to the goal location.

Door Open: Open and walk through a door.

Box Push: Push a box to the goal location.

Experimental Analysis

Hierarchical Structure Simplifies Loco-Manipulation: Comparing Sumo (ours, yellow) to end-to-end RL (purple) and end-to-end MPC (navy) on five loco-manipulation tasks that ask the robot to move an object to a goal. Sumo achieves high success across all objects. End-to-end RL is competitive on the box, chair, and cone but degrades sharply on the tire and tire rack, while end-to-end MPC struggles across the board.

Test-Time Search Enables Generalization: Left: comparison of Sumo (yellow, ours), E2E RL (purple), and hierarchical RL (navy, HRL) on pushing five different objects to a goal. Sumo generalizes to new objects by replacing the object model at test time, whereas E2E RL and HRL policies trained only on box pushing fail on the other objects. Right: Sumo generalizes to uprighting objectives by changing the planner cost at test time, whereas the same E2E RL and HRL policies fail without additional training.

Move Object Tasks: Sumo generalizes to new objects by replacing the object model at test time. The Spot robot moves a tire, a traffic cone, a chair, a box, and a tire rack to the goal using the same Move cost function and hyperparameters.

Upright Object Tasks: Sumo generalizes to uprighting objectives by changing the planner cost at test time. The Spot robot uprights a tire, a traffic cone, a chair, a box, and a tire rack without additional training.
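As a rough illustration of what changing the planner cost at test time can look like, the sketch below scores the same predicted outcomes with two interchangeable cost functions. The functions `move_cost` and `upright_cost`, their arguments, and the candidate data are all hypothetical and are not taken from the paper; the actual Move and Upright costs may be defined differently.

```python
import numpy as np

def move_cost(object_pos, object_up, goal_pos):
    """Hypothetical Move cost: squared planar distance from object to goal."""
    diff = np.asarray(object_pos, dtype=float)[:2] - np.asarray(goal_pos, dtype=float)[:2]
    return float(np.sum(diff ** 2))

def upright_cost(object_pos, object_up, goal_pos):
    """Hypothetical Upright cost: zero when the object's up axis aligns with world z."""
    up = np.asarray(object_up, dtype=float)
    up = up / np.linalg.norm(up)
    return float(1.0 - up[2])

def select_best(candidates, cost_fn, goal_pos):
    """Pick the lowest-cost predicted outcome; only cost_fn changes between tasks."""
    costs = [cost_fn(c["pos"], c["up"], goal_pos) for c in candidates]
    return int(np.argmin(costs))

# Two made-up predicted end states of an object after a rollout.
candidates = [
    {"pos": [1.0, 0.0, 0.3], "up": [1.0, 0.0, 0.0]},  # at the goal, lying on its side
    {"pos": [0.0, 0.0, 0.3], "up": [0.0, 0.0, 1.0]},  # away from the goal, upright
]
goal = [1.0, 0.0, 0.0]

best_for_move = select_best(candidates, move_cost, goal)
best_for_upright = select_best(candidates, upright_cost, goal)
```

With these made-up candidates, the Move cost prefers the first outcome and the Upright cost prefers the second, even though the selection routine is identical in both cases.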

BibTeX

      @article{zhang2026sumo,
        title = {Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation},
        author = {Zhang, John Z. and Sorokin, Maks and Br{\"u}digam, Jan and Hung, Brandon and Phillips, Stephen and Yershov, Dmitry and Niroui, Farzad and Zhao, Tong and Fermoselle, Leonor and Zhu, Xinghao and Cao, Chao and Ta, Duy and Pang, Tao and Wang, Jiuguang and Culbertson, Preston and Manchester, Zachary and Le Cl\'eac'h, Simon},
        journal = {arXiv preprint arXiv:2604.08508},
        year = {2026},
        url = {https://arxiv.org/abs/2604.08508}
      }