StateFactory Reward Prediction with Factorized World States

¹East China Normal University, ²HKUST
* Equal Contribution   ✉️ Corresponding Author

Agents must infer action outcomes and select actions that maximize a reward signal indicating how close they are to achieving the goal. Traditional supervised learning of reward models can absorb biases inherent to the training data, limiting generalization to novel goals and environments.

We introduce StateFactory, a factorized representation method that transforms unstructured observations into a hierarchical object-attribute structure using language models. This structured representation allows rewards to be estimated naturally as the semantic similarity between the current state and the goal state under hierarchical constraints.

Our method achieves promising zero-shot results on our new RewardPrediction benchmark, which comprises 2,454 unique action-observation trajectories across five diverse domains. StateFactory successfully enhances agent planning performance, yielding success rate gains of +21.64% on AlfWorld and +12.40% on ScienceWorld.

RewardPrediction Benchmark

RewardPrediction Benchmark Overview

To rigorously evaluate zero-shot reward prediction, we introduce RewardPrediction, a new benchmark spanning five diverse interactive domains. The dataset comprises 2,454 unique trajectories, each containing fine-grained, step-wise action-observation pairs and scalar ground-truth rewards.
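Conceptually, each benchmark entry pairs a goal with a sequence of action-observation steps, each annotated with a scalar ground-truth reward. The sketch below illustrates one plausible schema for such a trajectory; the field names and example values are assumptions for illustration, not the released data format.

```python
# Hypothetical schema sketch for one RewardPrediction trajectory.
# Field names are illustrative, not the actual released format.
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str        # agent action, e.g. "take mug 1 from countertop 1"
    observation: str   # environment feedback text
    reward: float      # scalar ground-truth reward for this step


@dataclass
class Trajectory:
    domain: str                           # one of the five domains
    goal: str                             # natural-language task instruction
    steps: list[Step] = field(default_factory=list)


traj = Trajectory(
    domain="AlfWorld",
    goal="put a clean mug on the shelf",
    steps=[Step("take mug 1 from countertop 1", "You pick up the mug 1.", 0.25)],
)
print(traj.domain, len(traj.steps))  # AlfWorld 1
```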

StateFactory Method

StateFactory Framework Overview

StateFactory factorizes unstructured observations into a hierarchical object-attribute structure to enable robust, zero-shot reward prediction. By explicitly separating entity identity from evolving attributes, the framework leverages recurrent State Extraction and Goal Interpretation to filter task-irrelevant noise and adapt instructions into dynamic goal states. Finally, Hierarchical Routing derives rewards by measuring the semantic distance between world and goal states, ensuring strong generalization across diverse domains without training.
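The core idea can be sketched in a few lines: represent world and goal states as object-to-attribute mappings and score progress by how many goal attributes the current state satisfies. This is a deliberate simplification, assuming exact attribute matching; the actual framework uses LLM-based state extraction and semantic similarity with hierarchical routing.

```python
# Minimal sketch of reward estimation over factorized states: each state
# maps object -> {attribute: value}, and the reward is the fraction of
# goal attributes already satisfied in the current world state.
# Exact matching here stands in for the paper's semantic similarity.

def reward(world: dict, goal: dict) -> float:
    total = satisfied = 0
    for obj, attrs in goal.items():
        for attr, want in attrs.items():
            total += 1
            if world.get(obj, {}).get(attr) == want:
                satisfied += 1
    return satisfied / total if total else 0.0


world = {"mug": {"clean": True, "location": "countertop"}}
goal = {"mug": {"clean": True, "location": "shelf"}}
print(reward(world, goal))  # 0.5: one of two goal attributes satisfied
```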

Main Results

We compare StateFactory with baselines across the RewardPrediction benchmark. Our method establishes a new SOTA among zero-shot methods and approaches the performance of supervised baselines.

Reward Prediction Error (DEPIC ↓):

| Method | Backbone | Training Data | Reasoning | Zero-shot | AlfWorld | Science | WebShop | Blocks | Text | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Monotonic Baseline | -- | -- | -- | ✓ | 0.532 | 0.508 | 0.535 | 0.589 | 0.536 | 0.540 |
| Supervised RM | Qwen2.5-1.5B | AlfWorld | N/A | ✗ | 0.212 | 0.542 | 0.618 | 0.596 | 0.414 | 0.476 |
| | | ScienceWorld | | ✗ | 0.506 | 0.305 | 0.661 | 0.706 | 0.580 | 0.552 |
| | | WebShop | | ✗ | 0.707 | 0.624 | 0.242 | 0.706 | 0.707 | 0.597 |
| | | BlocksWorld | | ✗ | 0.556 | 0.471 | 0.596 | 0.472 | 0.556 | 0.530 |
| | | TextWorld | | ✗ | 0.523 | 0.658 | 0.678 | 0.604 | 0.203 | 0.533 |
| | | All Domains | | ✗ | 0.178 | 0.283 | 0.285 | 0.489 | 0.177 | 0.282 |
| VLWM-critic | Llama3.2-1B | VLWM Traj. | N/A | ✗ | 0.823 | 0.673 | 0.636 | 0.663 | 0.896 | 0.738 |
| LLM-as-a-Judge | Qwen3-14B | -- | | ✓ | 0.395 | 0.542 | 0.403 | 0.466 | 0.316 | 0.424 |
| | | | | ✓ | 0.370 | 0.371 | 0.356 | 0.436 | 0.211 | 0.349 |
| | gpt-oss-20b | -- | Low | ✓ | 0.368 | 0.394 | 0.376 | 0.362 | 0.119 | 0.324 |
| | | | Medium | ✓ | 0.366 | 0.391 | 0.374 | 0.363 | 0.115 | 0.322 |
| StateFactory | gpt-oss-20b | -- | Medium | ✓ | 0.285 | 0.288 | 0.286 | 0.427 | 0.201 | 0.297 |

StateFactory Representation for Procedural Videos

To further evaluate our method's capability in real-world continuous visual domains, we apply StateFactory to the Action100M dataset. Using its unique Tree-of-Captions structure, we convert the hierarchical annotations into a maximally fine-grained, contiguous sequence of steps covering the entire video. Guided by this complete timeline, our framework translates visual observations into factorized state representations, successfully capturing state changes and tracking task progress step by step.
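Extracting the finest-grained timeline from a caption hierarchy amounts to an in-order traversal that keeps only leaf segments. The sketch below assumes a simple nested-dict node layout with `start`, `end`, `caption`, and `children` fields; this is an illustrative structure, not the actual Action100M Tree-of-Captions schema.

```python
# Illustrative sketch: flatten a caption tree into a contiguous sequence
# of fine-grained steps. Leaves carry the finest captions, and an in-order
# traversal yields steps covering the whole video timeline.
# The node fields used here are assumptions, not the real dataset schema.

def flatten(node: dict) -> list[tuple[float, float, str]]:
    """Return (start, end, caption) triples for the finest-grained segments."""
    children = node.get("children", [])
    if not children:
        return [(node["start"], node["end"], node["caption"])]
    steps = []
    for child in children:
        steps.extend(flatten(child))
    return steps


tree = {
    "start": 0.0, "end": 60.0, "caption": "make french toast",
    "children": [
        {"start": 0.0, "end": 20.0, "caption": "whisk eggs and milk"},
        {"start": 20.0, "end": 60.0, "caption": "fry the soaked bread"},
    ],
}
print(flatten(tree))
# [(0.0, 20.0, 'whisk eggs and milk'), (20.0, 60.0, 'fry the soaked bread')]
```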

Interactive examples: 🍳🍓 Making French Toast, 🍗🥣 Buffalo Wings, and ✂️✈️ Paper Airplane Craft, each visualized with its goal description, current state, and goal state.

Citation

To be released.