Agents must infer action outcomes and select actions that maximize a reward signal indicating how close the agent is to achieving the goal. Traditional supervised learning of reward models can introduce biases inherent to the training data, limiting generalization to novel goals and environments.
We introduce StateFactory, a factorized representation method that transforms unstructured observations into a hierarchical object-attribute structure using language models. This structured representation allows rewards to be estimated naturally as the semantic similarity between the current state and the goal state under hierarchical constraints.
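The factorized representation above can be pictured as a tree of entities whose identities stay fixed while their attributes evolve. The following is a minimal sketch of such a structure; the class and field names are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ObjectNode:
    """One entity in a factorized world state: a stable identity,
    a set of evolving attributes, and nested child objects."""
    name: str                                          # entity identity (fixed)
    attributes: Dict[str, str] = field(default_factory=dict)  # evolving attributes
    children: List["ObjectNode"] = field(default_factory=list)

    def find(self, name: str) -> Optional["ObjectNode"]:
        """Depth-first lookup of an entity by its identity."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

# Example: a kitchen observation factorized into object-attribute form.
world = ObjectNode("kitchen", children=[
    ObjectNode("fridge", {"open": "false"},
               [ObjectNode("apple", {"clean": "false"})]),
])

# An action ("open the fridge") updates only the relevant attribute;
# the entity hierarchy and identities are untouched.
world.find("fridge").attributes["open"] = "true"
```

Separating identity from attributes in this way is what lets later stages compare a world state to a goal state object by object.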
Our method achieves promising zero-shot results on our new RewardPrediction benchmark, which comprises 2,454 unique action-observation trajectories across five diverse domains. StateFactory also enhances agent planning performance, yielding success rate gains of +21.64% on ALFWorld and +12.40% on ScienceWorld.
To rigorously evaluate zero-shot reward prediction, we introduce RewardPrediction, a new benchmark spanning five diverse interactive domains. The dataset comprises 2,454 unique trajectories, each containing fine-grained, step-wise action-observation pairs and scalar ground-truth rewards.
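A trajectory of step-wise action-observation pairs with scalar ground-truth rewards, as described above, can be sketched as a simple record. The field names below are illustrative assumptions, not the released schema:

```python
# Hypothetical layout of one RewardPrediction trajectory
# (field names are illustrative, not the official schema).
trajectory = {
    "domain": "alfworld",
    "steps": [
        {"action": "go to fridge",
         "observation": "You see a closed fridge.",
         "reward": 0.0},
        {"action": "open fridge",
         "observation": "The fridge is open.",
         "reward": 0.5},
    ],
}

def total_return(traj):
    """Sum the scalar step-wise ground-truth rewards of one trajectory."""
    return sum(step["reward"] for step in traj["steps"])
```

Fine-grained step rewards, rather than a single episode-level score, are what allow a reward predictor to be evaluated at every step of the trajectory.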
StateFactory factorizes unstructured observations into a hierarchical object-attribute structure to enable robust, zero-shot reward prediction. By explicitly separating entity identity from evolving attributes, the framework leverages recurrent State Extraction and Goal Interpretation to filter task-irrelevant noise and adapt instructions into dynamic goal states. Finally, Hierarchical Routing derives rewards by measuring the semantic distance between world and goal states, ensuring strong generalization across diverse domains without training.
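The routing step above can be illustrated with a minimal sketch: the reward is the fraction of goal constraints satisfied, matched per object under the hierarchy. Exact string matching stands in for the semantic similarity a language model would compute, and the function name is an assumption:

```python
def hierarchical_reward(world, goal):
    """Sketch of hierarchy-aware reward estimation.

    `world` and `goal` map object name -> {attribute: value}.
    Goal attributes are checked only against the matching object,
    so unrelated objects cannot spuriously satisfy a constraint.
    Exact equality stands in for LM-based semantic similarity.
    """
    total = satisfied = 0
    for obj, goal_attrs in goal.items():
        world_attrs = world.get(obj, {})
        for attr, value in goal_attrs.items():
            total += 1
            if world_attrs.get(attr) == value:
                satisfied += 1
    return satisfied / total if total else 0.0
```

Because the comparison only consults the goal's objects and attributes, task-irrelevant parts of the world state have no effect on the estimated reward.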
We compare StateFactory with baselines across the RewardPrediction benchmark. Our method establishes a new SOTA among zero-shot methods and approaches the performance of supervised baselines.
To further evaluate our method's capability in real-world continuous visual domains, we apply StateFactory to the Action100M dataset. Using its unique Tree-of-Captions structure, we convert the hierarchical annotations into a maximally fine-grained, contiguous sequence of steps covering the entire video. Guided by this complete timeline, our framework translates visual observations into factorized state representations, successfully capturing state changes and tracking task progress step by step.
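Converting a hierarchical caption tree into a maximally fine-grained, contiguous step sequence amounts to collecting the leaf segments in temporal order. A minimal sketch, assuming each node carries a caption, a time span, and a list of children (the field names are illustrative):

```python
def flatten_tree(node):
    """Flatten a hierarchical caption tree into the finest-grained,
    temporally ordered sequence of segments.

    Each node is assumed to look like:
        {"caption": str, "span": (start, end), "children": [...]}
    Leaves are the finest segments; internal nodes are coarser summaries.
    """
    if not node.get("children"):          # leaf: already finest granularity
        return [node]
    steps = []
    for child in sorted(node["children"], key=lambda n: n["span"][0]):
        steps.extend(flatten_tree(child))  # recurse in temporal order
    return steps

# Example: a two-level tree covering a 60-second video.
tree = {
    "caption": "make a sandwich", "span": (0, 60),
    "children": [
        {"caption": "get bread", "span": (0, 20), "children": [
            {"caption": "open bag", "span": (0, 10), "children": []},
            {"caption": "take two slices", "span": (10, 20), "children": []},
        ]},
        {"caption": "add filling", "span": (20, 60), "children": []},
    ],
}
steps = flatten_tree(tree)
```

Since leaves at different depths jointly tile the root span, the resulting sequence covers the entire video with no gaps, giving the framework a complete timeline to track.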
To be released.