Agents must infer action outcomes and select actions that maximize a reward signal indicating how close the agent is to achieving the goal. Traditional supervised learning of reward models can introduce biases inherent to the training data, limiting generalization to novel goals and environments.
We introduce StateFactory, a factorized representation method that transforms unstructured observations into a hierarchical object-attribute structure using language models. This structured representation allows rewards to be estimated naturally as the semantic similarity between the current state and the goal state under hierarchical constraints.
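The factorized representation above can be pictured as a tree of entities whose identities stay fixed while their attributes evolve. The following is a minimal sketch of such a structure; the class and field names are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ObjectNode:
    """One entity in a factorized world state: a stable identity,
    a set of evolving attributes, and nested child objects."""
    name: str                                          # entity identity (fixed)
    attributes: Dict[str, str] = field(default_factory=dict)  # evolving attributes
    children: List["ObjectNode"] = field(default_factory=list)

    def find(self, name: str) -> Optional["ObjectNode"]:
        """Depth-first lookup of an entity by its identity."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

# Example: a kitchen observation factorized into object-attribute form.
world = ObjectNode("kitchen", children=[
    ObjectNode("fridge", {"open": "false"},
               [ObjectNode("apple", {"clean": "false"})]),
])

# An action ("open the fridge") updates only the relevant attribute;
# the entity hierarchy and identities are untouched.
world.find("fridge").attributes["open"] = "true"
```

Separating identity from attributes in this way is what lets later stages compare a world state to a goal state object by object.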
Our method achieves promising zero-shot results on our new RewardPrediction benchmark, which comprises 2,454 unique action-observation trajectories across five diverse domains. StateFactory also enhances agent planning performance, yielding success rate gains of +21.64% on ALFWorld and +12.40% on ScienceWorld.
To rigorously evaluate zero-shot reward prediction, we introduce RewardPrediction, a new benchmark spanning five diverse interactive domains. The dataset comprises 2,454 unique trajectories, each containing fine-grained, step-wise action-observation pairs and scalar ground-truth rewards.
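A trajectory of step-wise action-observation pairs with scalar ground-truth rewards, as described above, can be sketched as a simple record. The field names below are illustrative assumptions, not the released schema:

```python
# Hypothetical layout of one RewardPrediction trajectory
# (field names are illustrative, not the official schema).
trajectory = {
    "domain": "alfworld",
    "steps": [
        {"action": "go to fridge",
         "observation": "You see a closed fridge.",
         "reward": 0.0},
        {"action": "open fridge",
         "observation": "The fridge is open.",
         "reward": 0.5},
    ],
}

def total_return(traj):
    """Sum the scalar step-wise ground-truth rewards of one trajectory."""
    return sum(step["reward"] for step in traj["steps"])
```

Fine-grained step rewards, rather than a single episode-level score, are what allow a reward predictor to be evaluated at every step of the trajectory.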
StateFactory factorizes unstructured observations into a hierarchical object-attribute structure to enable robust, zero-shot reward prediction. By explicitly separating entity identity from evolving attributes, the framework leverages recurrent State Extraction and Goal Interpretation to filter task-irrelevant noise and adapt instructions into dynamic goal states. Finally, Hierarchical Routing derives rewards by measuring the semantic distance between world and goal states, ensuring strong generalization across diverse domains without training.
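The routing step above can be illustrated with a minimal sketch: the reward is the fraction of goal constraints satisfied, matched per object under the hierarchy. Exact string matching stands in for the semantic similarity a language model would compute, and the function name is an assumption:

```python
def hierarchical_reward(world, goal):
    """Sketch of hierarchy-aware reward estimation.

    `world` and `goal` map object name -> {attribute: value}.
    Goal attributes are checked only against the matching object,
    so unrelated objects cannot spuriously satisfy a constraint.
    Exact equality stands in for LM-based semantic similarity.
    """
    total = satisfied = 0
    for obj, goal_attrs in goal.items():
        world_attrs = world.get(obj, {})
        for attr, value in goal_attrs.items():
            total += 1
            if world_attrs.get(attr) == value:
                satisfied += 1
    return satisfied / total if total else 0.0
```

Because the comparison only consults the goal's objects and attributes, task-irrelevant parts of the world state have no effect on the estimated reward.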
We compare StateFactory with baselines across the RewardPrediction benchmark. Our method establishes a new SOTA among zero-shot methods and approaches the performance of supervised baselines.
To further evaluate our method's capability in real-world continuous visual domains, we apply StateFactory to the Action100M dataset. Using its unique Tree-of-Captions structure, we convert the hierarchical annotations into a maximally fine-grained, contiguous sequence of steps covering the entire video. Guided by this complete timeline, our framework translates visual observations into factorized state representations, successfully capturing state changes and tracking task progress step by step.
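Converting a hierarchical caption tree into a maximally fine-grained, contiguous step sequence amounts to collecting the leaf segments in temporal order. A minimal sketch, assuming each node carries a caption, a time span, and a list of children (the field names are illustrative):

```python
def flatten_tree(node):
    """Flatten a hierarchical caption tree into the finest-grained,
    temporally ordered sequence of segments.

    Each node is assumed to look like:
        {"caption": str, "span": (start, end), "children": [...]}
    Leaves are the finest segments; internal nodes are coarser summaries.
    """
    if not node.get("children"):          # leaf: already finest granularity
        return [node]
    steps = []
    for child in sorted(node["children"], key=lambda n: n["span"][0]):
        steps.extend(flatten_tree(child))  # recurse in temporal order
    return steps

# Example: a two-level tree covering a 60-second video.
tree = {
    "caption": "make a sandwich", "span": (0, 60),
    "children": [
        {"caption": "get bread", "span": (0, 20), "children": [
            {"caption": "open bag", "span": (0, 10), "children": []},
            {"caption": "take two slices", "span": (10, 20), "children": []},
        ]},
        {"caption": "add filling", "span": (20, 60), "children": []},
    ],
}
steps = flatten_tree(tree)
```

Since leaves at different depths jointly tile the root span, the resulting sequence covers the entire video with no gaps, giving the framework a complete timeline to track.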
To be released.