"Pick up the book and place it in the back compartment of the caddy"
Despite rapid progress, embodied agents still struggle with long-horizon manipulation that requires maintaining spatial consistency, causal dependencies, and goal constraints. A key limitation of existing approaches is that task reasoning is implicitly embedded in high-dimensional latent representations, making it challenging to separate task structure from perceptual variability. We introduce Grounded Scene-graph Reasoning (GSR), a structured reasoning paradigm that explicitly models world-state evolution as transitions over semantically grounded scene graphs. By reasoning step-wise over object states and spatial relations, rather than directly mapping perception to actions, GSR enables explicit reasoning about action preconditions, consequences, and goal satisfaction in a physically grounded space. To support learning such reasoning, we construct Manip-Cognition-1.6M, a large-scale dataset that jointly supervises scene grounding, causal action reasoning, and goal-conditioned planning. Extensive evaluations across RLBench, LIBERO, GSR-Bench, and real-world robotic tasks show that GSR significantly improves zero-shot generalization and long-horizon task completion over prompting-based baselines. These results highlight explicit world-state representations as a key inductive bias for scalable embodied reasoning.
Overview of GSR Framework.
To this end, we introduce Grounded Scene-graph Reasoning (GSR),
an embodied reasoning framework based on the principle that agents should plan over abstract world
representations rather than raw visual information.
GSR leverages semantically grounded scene graphs to extract stable causal structures from observations
and explicitly separates high-level conceptual reasoning from low-level action execution.
This design enables persistent and task-transferable capabilities, allowing flexible composition of
action sequences and robust adaptation across tasks.
To train GSR, we construct Manip-Cognition-1.6M, a dataset that provides joint supervision
over world understanding, intention interpretation, and action planning across a diverse set of
manipulation tasks.
Given an RGB-D observation \(\mathcal{I}\), we construct a 3D scene graph \(M_{sg} = (O_t, E_t)\) to encode the workspace and object states at time \(t\), where \(O_t = \{o_j\}_{j=1,\dots,J}\) denotes the set of objects and \(E_t = \{e_k\}_{k=1,\dots,K}\) denotes the set of relational edges. The figure above illustrates the transformation from raw visual input to this structured representation and the resulting scene graph. Each object \(o_j\) is represented as a structured entity composed of functional keypoints and, when applicable, articulated child components. For example, a mug includes a "functional keypoint" corresponding to its handle, while a cabinet is modeled as an articulated object with "multiple child elements" such as drawers. Edges \(e_k\) encode spatial relations between object pairs, capturing predicates such as on, inside, or adjacent to (e.g., a mug on a table).
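For concreteness, the following is a minimal sketch of how such a scene graph could be represented in code; the class and field names (SceneObject, Relation, SceneGraph, keypoints, children) are illustrative placeholders, not the paper's actual data structures.

```python
# Minimal sketch of a scene-graph representation matching M_sg = (O_t, E_t).
# All names are illustrative assumptions, not the paper's API.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneObject:
    """An object o_j: functional keypoints plus optional articulated child components."""
    name: str
    keypoints: Dict[str, Tuple[float, float, float]] = field(default_factory=dict)  # e.g. {"handle": (x, y, z)}
    children: List["SceneObject"] = field(default_factory=list)                     # e.g. drawers of a cabinet


@dataclass
class Relation:
    """A relational edge e_k: a spatial predicate between two objects."""
    subject: str
    predicate: str  # "on", "inside", "adjacent_to", ...
    obj: str


@dataclass
class SceneGraph:
    """M_sg = (O_t, E_t): objects and relational edges describing the workspace at time t."""
    objects: List[SceneObject]
    edges: List[Relation]


# Example: a mug on a table, and a cabinet modeled as an articulated object with drawer children.
mug = SceneObject("mug", keypoints={"handle": (0.42, -0.10, 0.75)})
table = SceneObject("table")
cabinet = SceneObject("cabinet", children=[SceneObject("top_drawer"), SceneObject("bottom_drawer")])
scene = SceneGraph(
    objects=[mug, table, cabinet],
    edges=[Relation("mug", "on", "table"), Relation("cabinet", "adjacent_to", "table")],
)
```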
GSR is a fine-tuned Large Language Model (LLM) designed to perform commonsense reasoning over scene-graph representations. To apply GSR in a physical embodiment, we integrate it with a perception front-end and an action expert back-end. The physical system consists of two components: a perception-reasoning module for decision making, and an action expert that executes low-level control. The perception-reasoning module constructs scene graphs from raw observations and enables GSR to reason over this information and generate action sequences. Scene graphs are constructed with a Vision Foundation Model (VFM). The action expert leverages a meta-skill library, with further details described in the paper.
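The sketch below illustrates one way this perception-reasoning-execution loop could be wired together; the component interfaces (camera.capture, perceive, gsr_model.reason, skill_library.execute) are hypothetical stand-ins for the VFM, GSR, and the meta-skill library, not the system's actual API.

```python
# Illustrative control loop for the physical system: a perception front-end builds
# a scene graph, GSR reasons over it, and an action expert executes meta-skills.
# Every interface used here is a hypothetical placeholder.

def run_task(instruction, camera, perceive, gsr_model, skill_library, max_steps=20):
    """Run one episode; return True if GSR declares the goal satisfied."""
    for _ in range(max_steps):
        rgbd = camera.capture()                            # RGB-D observation I
        scene_graph = perceive(rgbd)                       # VFM-based scene-graph construction (O_t, E_t)
        plan = gsr_model.reason(instruction, scene_graph)  # step-wise reasoning over the graph
        if plan.goal_satisfied:                            # goal check against the predicted world state
            return True
        skill, args = plan.actions[0]                      # e.g. ("pick", ["mug"]) or ("open", ["top_drawer"])
        skill_library.execute(skill, args)                 # low-level control via the meta-skill library
    return False
```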
This experiment evaluates the model's general reasoning capability in manipulation tasks without task-specific training.
Given an initial state and a language command, the model must ground the command in that state, infer the underlying goal, and predict the correct actions.
We select five LLMs as baselines: GPT-5, Gemini-2.5-Pro, DeepSeek-V3, Qwen3-8B, and Claude-Sonnet-4.5.
To deploy LLMs in manipulation tasks, we integrate a scene-graph–based perception module with an LLM control interface, which maps language commands to corresponding low-level actions.
We evaluate GSR in a zero-shot setting on RLBench and LIBERO without any fine-tuning. For each baseline LLM, we design task-specific prompts so that every model is evaluated under a well-tuned prompting setup.
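As an illustration of this LLM control interface, the sketch below serializes a scene graph (in the format assumed in the earlier SceneGraph sketch) into a text prompt and parses the model's output into skill calls; the prompt wording and the action syntax are assumptions, not the prompts used in the paper.

```python
# Sketch of a prompting-based baseline interface: the scene graph is rendered as text,
# the LLM proposes a plan, and each output line is mapped to a low-level skill call.

def serialize_scene_graph(graph):
    """Render a scene graph (as in the SceneGraph sketch above) into plain text."""
    obj_lines = "\n".join(f"- {o.name}" for o in graph.objects)
    rel_lines = "\n".join(f"- {e.subject} {e.predicate} {e.obj}" for e in graph.edges)
    return f"Objects:\n{obj_lines}\nRelations:\n{rel_lines}"


def build_prompt(instruction, graph):
    """Compose a task-specific prompt for a baseline LLM (wording is illustrative)."""
    return (
        f"{serialize_scene_graph(graph)}\n"
        f"Task: {instruction}\n"
        "Respond with one action per line, e.g. pick(mug) or place(mug, on, plate)."
    )


def parse_actions(llm_output):
    """Map each output line such as 'place(mug, on, plate)' to a (skill, args) pair."""
    actions = []
    for line in llm_output.strip().splitlines():
        skill, _, rest = line.partition("(")
        args = [a.strip() for a in rest.rstrip(")").split(",") if a.strip()]
        actions.append((skill.strip(), args))
    return actions
```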
On RLBench, we evaluate 100 tasks across five representative categories: Kitchen Operations (KO), Pick and Place (PP), Switch Operations for Containers (SC), Controller Operations (CO), and Stacking and Assembly (SA).
On LIBERO, we evaluate 100 tasks across four subsets: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long.
We run ten trials per task and report the success rate.
Detailed results are reported in the accompanying tables and figures in the paper.
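A minimal sketch of this trial-level protocol (ten rollouts per task, success rate as the metric), assuming a rollout function like the run_task sketch above that returns True on success:

```python
# Ten trials per task; report the fraction of successful rollouts per task.
# `tasks` and `rollout` are placeholders for the benchmark task set and the executor.

def evaluate(tasks, rollout, trials_per_task=10):
    success_rates = {}
    for task in tasks:
        successes = sum(bool(rollout(task)) for _ in range(trials_per_task))
        success_rates[task.name] = successes / trials_per_task
    return success_rates
```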
"Pick up the book and place it in the back compartment of the caddy"
"Put both the cream cheese box and the butter in the basket"
"Put the black bowl in the bottom drawer of the cabinet and close it"
"Put the white mug on the plate and put the chocolate pudding to the right of the plate"
"Put the yellow and white mug in the microwave and close it"
"Turn on the stove and put the moka pot on it"
GSR-Bench evaluates long-horizon task reasoning under both spatial and semantic constraints. It consists of 180 tasks with an average planning horizon exceeding 10 steps. Our evaluation focuses on three aspects: (1) Semantic Object Disambiguation (SOD), which assesses reasoning under varying object semantics; (2) Spatial-Aware Sequencing (SAS), which evaluates reasoning over physical causality and spatial constraints; and (3) Goal-conditioned Generalization (GCG), which measures reasoning across diverse abstract goals. For each aspect, we define three difficulty levels: simple, general, and difficult, yielding the nine evaluation splits listed below.
GSR Object Simple
GSR Object General
GSR Object Difficult
GSR Spatial Simple
GSR Spatial General
GSR Spatial Difficult
GSR Goal Simple
GSR Goal General
GSR Goal Difficult
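For concreteness, here is a hypothetical sketch of how a GSR-Bench entry and the nine splits listed above could be organized; the field names are assumptions, not the benchmark's actual schema.

```python
# Hypothetical layout of a GSR-Bench entry: 180 tasks, three aspects (SOD, SAS, GCG),
# three difficulty levels each, with planning horizons that often exceed 10 steps.
from dataclasses import dataclass
from typing import List


@dataclass
class GSRBenchTask:
    task_id: str
    aspect: str        # "SOD", "SAS", or "GCG"
    difficulty: str    # "simple", "general", or "difficult"
    instruction: str   # natural-language goal
    horizon: int       # number of steps in the reference plan


def filter_split(tasks: List[GSRBenchTask], aspect: str, difficulty: str) -> List[GSRBenchTask]:
    """Select one of the nine evaluation splits (e.g. SAS / general)."""
    return [t for t in tasks if t.aspect == aspect and t.difficulty == difficulty]
```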
This experiment evaluates GSR in three real-world settings, each targeting a distinct aspect of its reasoning and generalization capabilities. First, we assess GSR on general pick-and-place tasks involving common objects, evaluating its ability to generate appropriate behaviors in response to natural language commands. Second, we evaluate GSR on long-horizon sorting tasks as in GSR-Bench, examining its performance across diverse spatial configurations and goal conditions that require extended planning. Third, we demonstrate GSR on four daily-life tasks; detailed results are shown in the supplementary videos.
Packaging a cardboard box
Placing a cup under the coffee machine
Unzipping a backpack
Sorting colored cubes
Tucking pens into a pencil pouch
Pouring water
Following human instructions
Picking and placing items into a drawer
Heating bread in a microwave
@article{hu2026gsr,
title={GSR: Learning Structured Reasoning for Embodied Manipulation},
author={Kewei Hu and Michael Zhang and Wei Ying and Tianhao Liu and Guoqiang Hao and Zimeng Li and Wanchan Yu and Jiajian Jing and Fangwen Chen and Hanwen Kang},
journal={arXiv preprint arXiv:2510.11027},
year={2026}
}