World Interaction → Experience → Embodied Memory → Deliberation → Action
Abstract
Reliable object manipulation requires understanding physical properties that vary across objects and environments. Vision-language model (VLM) planners can reason about friction and stability in general terms; however, they often cannot predict how a specific ball will roll on a particular surface or which stone will provide a stable foundation without direct experience. We present PhysMem, a memory framework that enables VLM robot planners to learn physical principles from interaction at test time, without updating model parameters. The system records experiences, generates candidate hypotheses, and verifies them through targeted interaction before promoting validated knowledge to guide future decisions. A central design choice is verification before application: the system tests hypotheses against new observations rather than applying retrieved experience directly.
Presentation
Overview of PhysMem
Demonstrations
PhysMem is evaluated on three physically challenging manipulation tasks requiring understanding of contact physics, momentum, and stability.
Parts Organization: Arrange irregularly shaped objects into a target configuration, which requires reasoning about geometry and contact physics.
Ball Navigation: Guide a ball to a target location by pushing, which requires reasoning about momentum and friction dynamics.
Balanced Stacking: Stack objects so that they balance, which requires knowledge of center of mass and stability principles.
Comparison
Side-by-side comparison showing how PhysMem's learned principles transform task performance.
Analysis
How does PhysMem improve through interaction, and how much experience is needed? We evaluate performance at different experience utilization levels (0%, 25%, 50%, 100%) across three runs per condition.
Test-time learning curves. Without memory (0%), performance stays flat. With full memory, Parts Organization improves from −1 to 9.7 and Ball Navigation shows an even larger gain (14.7 vs. 0.7). Task complexity determines how much experience is needed: Ball Navigation benefits from full experience, while Balanced Stacking shows diminishing returns where 50% nearly matches 100%.
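The utilization sweep described above can be sketched as a small harness. This is only an illustration under stated assumptions: the helper names `utilization_subset` and `sweep` are hypothetical, taking a prefix of the experience log is one possible way to realize "25% of experience" (the source does not say how subsets are chosen), and `evaluate` stands in for whatever task rollout produces a score.

```python
from statistics import mean


def utilization_subset(experiences, fraction):
    """Return the first `fraction` of the experience log (assumed prefix sampling)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return experiences[: round(len(experiences) * fraction)]


def sweep(experiences, evaluate, fractions=(0.0, 0.25, 0.5, 1.0), runs=3):
    """Average a task score over `runs` runs at each utilization level.

    `evaluate(subset, seed)` is a caller-supplied placeholder that runs the
    task with the given memory contents and returns a numeric score.
    """
    return {
        f: mean(evaluate(utilization_subset(experiences, f), seed) for seed in range(runs))
        for f in fractions
    }


# Dummy evaluator for demonstration only: it scores a run by the number of
# experiences available, which is NOT a real task metric.
log = list(range(8))
curve = sweep(log, lambda subset, seed: len(subset))
print(curve)
```

With a real evaluator, the resulting dictionary maps each utilization level to its mean score, which is the quantity a learning curve of this kind would plot.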
Generalization
PhysMem's principles generalize to unseen material properties, object masses, and even lunar gravity.
Method
Inspired by the scientific method, PhysMem transforms raw interaction experiences into verified physical principles through a four-stage loop.
System overview. Left: A three-tier memory system stores raw experiences, clusters them into testable hypotheses, and promotes verified knowledge as principles. Right: The VLM agent retrieves principles and hypotheses from memory to guide planning.
Gather interaction experiences: visual observations, actions taken, and their outcomes.
Generate candidate physical principles from patterns observed in collected experiences.
Validate hypotheses through action-level outcome attribution across multiple trials.
Verified principles join persistent memory, guiding all future planning decisions.
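The four stages above can be sketched as a small three-tier memory. This is a minimal illustration, not the authors' implementation: the class and method names (`Experience`, `Hypothesis`, `ThreeTierMemory`, `record`, `propose`, `verify`, `promote`), the promotion thresholds, and the rule-based hypothesis generator are all assumptions. In the described system a VLM would generate the hypotheses; here a trivial placeholder proposes one hypothesis per action that has succeeded at least once.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    """One raw interaction: what was observed, what was done, what happened."""
    observation: str
    action: str
    success: bool


@dataclass
class Hypothesis:
    """A candidate principle and the verification trials attributed to it."""
    statement: str
    action: str
    trials: list = field(default_factory=list)

    def support(self) -> float:
        """Fraction of attributed trials that succeeded (0.0 if untested)."""
        return sum(self.trials) / len(self.trials) if self.trials else 0.0


class ThreeTierMemory:
    def __init__(self, min_trials=3, min_support=0.8):
        # Thresholds are illustrative assumptions, not values from the source.
        self.experiences, self.hypotheses, self.principles = [], [], []
        self.min_trials, self.min_support = min_trials, min_support

    def record(self, exp):
        """Stage 1: store a raw experience."""
        self.experiences.append(exp)

    def propose(self):
        """Stage 2: placeholder generator, one hypothesis per successful action."""
        known = {h.action for h in self.hypotheses + self.principles}
        new = [
            Hypothesis(f"'{a}' tends to succeed", a)
            for a in sorted({e.action for e in self.experiences if e.success})
            if a not in known
        ]
        self.hypotheses.extend(new)
        return new

    def verify(self, exp):
        """Stage 3: attribute a new outcome only to hypotheses about that action."""
        for h in self.hypotheses:
            if h.action == exp.action:
                h.trials.append(exp.success)

    def promote(self):
        """Stage 4: move sufficiently tested and supported hypotheses to principles."""
        promoted = [
            h for h in self.hypotheses
            if len(h.trials) >= self.min_trials and h.support() >= self.min_support
        ]
        for h in promoted:
            self.hypotheses.remove(h)
            self.principles.append(h)
        return promoted


memory = ThreeTierMemory()
memory.record(Experience("ball on felt", "push gently", True))
memory.record(Experience("ball on felt", "push hard", False))
memory.propose()
for outcome in (True, True, True):
    memory.verify(Experience("ball on felt", "push gently", outcome))
memory.promote()
print([p.statement for p in memory.principles])
```

The point of the sketch is the ordering: a hypothesis only reaches the principle tier after new trials support it, so retrieved experience is never applied to planning without first being tested.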
Citation
@article{li2025physmem,
title = {PhysMem: Scaling Test-time Physical Memory for Robot Manipulation},
author = {Li, Haoyang and You, Yang and Su, Hao and Guibas, Leonidas},
journal = {arXiv preprint arXiv:2602.20323},
year = {2026}
}