WRIT controls trajectory complexity along two axes: the number of write decisions in a task, and the evidence burden of each single decision — the underexplored read-heavy axis.
With only 2K synthesized trajectories, a 4B model trained on WRIT surpasses GPT-5.1 no-think on τ2-bench.
Gains are largest on read-heavy hard subsets — exactly the tasks where an agent must gather and compare evidence before committing to an action.
Most synthesis pipelines make tasks harder by composing more write actions. But a single write decision can already be hard when the agent must gather and compare substantial read-tool evidence before the write action's arguments become identifiable.
Both tasks share the same gold write action. The difference is what the agent must do before writing: the read-tool count rises from 2 to 9. This motivates a new data-synthesis question:
Beyond teaching agents to act for longer, can we synthesize trajectories that teach them to read more carefully before they act?
Synthesize tasks with verifiable outcomes — both write-intensive requests with multiple sequential actions and read-heavy requests where one action demands extensive evidence gathering.
Vary how users express and reveal the same request, so training data reflects realistic conversational behavior rather than only cooperative, fully-specified interactions.
Run the agent and the user simulator through each task in an executable environment, keeping only correct and complete interactions as supervised fine-tuning trajectories.
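The final filtering step can be illustrated with a minimal sketch. It is not the WRIT implementation: the `Rollout` record, the `filter_rollouts` helper, and the rule that "correct" means the final environment state equals a gold state are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One simulated agent-user interaction (hypothetical record layout)."""
    task_id: str
    messages: list      # full conversation, including tool calls
    final_state: dict   # environment state after the last tool call
    finished: bool      # True if the conversation ended normally


def is_correct(rollout: Rollout, gold_state: dict) -> bool:
    # Assumed outcome check: the final state must match the gold state exactly.
    return rollout.final_state == gold_state


def filter_rollouts(rollouts, gold_states):
    """Keep only rollouts that are both complete and correct."""
    kept = []
    for rollout in rollouts:
        gold = gold_states.get(rollout.task_id)
        if gold is None:
            continue  # no verifiable outcome, so the rollout cannot be checked
        if rollout.finished and is_correct(rollout, gold):
            kept.append(rollout)
    return kept


# Toy data: one good rollout, one wrong outcome, one unfinished conversation.
gold_states = {"t1": {"order": "cancelled"}, "t2": {"seat": "12A"}}
rollouts = [
    Rollout("t1", [], {"order": "cancelled"}, finished=True),
    Rollout("t2", [], {"seat": "14C"}, finished=True),
    Rollout("t1", [], {"order": "cancelled"}, finished=False),
]
sft_trajectories = filter_rollouts(rollouts, gold_states)
print([r.task_id for r in sft_trajectories])  # ['t1']
```

Filtering on a verifiable outcome rather than on surface features of the conversation is what lets the surviving rollouts be used directly as supervised targets.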
On τ2-bench, under a controlled 2K-trajectory budget, WRIT consistently outperforms the strongest prior synthesis method on every tested base model.
| Base model | Method | Retail | Airline | Average |
|---|---|---|---|---|
| Qwen3-4B-Instruct | AReaL (best baseline) | 59.43 | 47.00 | 55.64 |
| Qwen3-4B-Instruct | WRIT | 71.05 | 61.00 | 67.99 (+12.4) |
| Llama-3.1-8B-Instruct | CoVe (best baseline) | 52.19 | 32.00 | 46.04 |
| Llama-3.1-8B-Instruct | WRIT | 54.61 | 50.00 | 53.20 (+7.2) |
| Qwen2.5-14B-Instruct | AReaL (best baseline) | 57.68 | 43.00 | 53.20 |
| Qwen2.5-14B-Instruct | WRIT | 72.37 | 57.50 | 67.84 (+14.6) |
Pass^1 (%) on τ2-bench. Average is task-count weighted across Retail and Airline. Full tables, Pass^4, hard subsets, and ablations are in the paper.
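As a worked check of the task-count weighting, the sketch below recomputes two of the Average entries from their Retail and Airline scores. The task counts (114 Retail, 50 Airline) are an assumption, not stated in this text; they are used because they reproduce the reported averages. The `weighted_average` helper is likewise illustrative.

```python
# Assumed task counts per domain (not given in the text above).
TASK_COUNTS = {"retail": 114, "airline": 50}


def weighted_average(scores, counts=TASK_COUNTS):
    """Average per-domain scores, weighting each domain by its task count."""
    total_tasks = sum(counts[domain] for domain in scores)
    weighted_sum = sum(scores[domain] * counts[domain] for domain in scores)
    return weighted_sum / total_tasks


# Qwen3-4B-Instruct rows from the table.
baseline_avg = weighted_average({"retail": 59.43, "airline": 47.00})
writ_avg = weighted_average({"retail": 71.05, "airline": 61.00})

print(f"{baseline_avg:.2f}")  # 55.64
print(f"{writ_avg:.2f}")      # 67.99
```

Because Retail carries more than twice as many tasks as Airline under this assumption, the Average sits closer to the Retail score than a plain mean of the two columns would.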
A 4B model trained on just 2K WRIT trajectories reaches a weighted average of 67.99, ahead of GPT-5.1 no-think (62.80), while emitting fewer output tokens at inference time.