WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

Highlights

2 axes
WRIT controls trajectory complexity along two axes: the number of write decisions in a task, and the evidence burden of each single decision — the underexplored read-heavy axis.
2K → 4B
With only 2K synthesized trajectories, a 4B model trained on WRIT surpasses GPT-5.1 no-think on τ²-bench.
Hard ↑
Gains are largest on read-heavy hard subsets — exactly the tasks where an agent must gather and compare evidence before committing to an action.

Motivation

The same write action, two very different difficulties

Most synthesis pipelines make tasks harder by composing more write actions. But a single write decision can already be hard — when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable.

Legend: read tools (get_user_details, search_direct_flight, search_onestop_flight) · write tool (book_reservation)

Simple task (2 reads)

"I need to book a one-way business class flight from Newark to Houston on May 25. Please book the direct flight that departs at 8:00 AM and arrives at 11:30 AM."

1×get_user_details

1×search_direct_flight

1×book_reservation

Read-heavy task (9 reads)

"I need to book a one-way business class flight from the New York area to Houston. I'm flexible between May 25 and May 26, and I can depart from either Newark or LaGuardia. Please book the fastest overall flight."

1×get_user_details

4×search_direct_flight

4×search_onestop_flight

1×book_reservation
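The nine reads above follow from crossing two candidate origin airports with two candidate dates and two search tools, plus one user lookup. A minimal sketch of that enumeration is given below. The tool names come from the example; the argument names, the airport codes (EWR, LGA, IAH), the year in the dates, and the `user_123` identifier are illustrative assumptions, not the benchmark's confirmed schema.

```python
from itertools import product

# Assumed values for illustration only: airport codes and the year are not
# stated in the example, and the argument names are hypothetical.
ORIGINS = ["EWR", "LGA"]                 # Newark, LaGuardia
DESTINATION = "IAH"                      # Houston
DATES = ["2024-05-25", "2024-05-26"]     # the user's two flexible dates
SEARCH_TOOLS = ["search_direct_flight", "search_onestop_flight"]


def plan_reads(user_id):
    """List the read-tool calls needed before the single write decision."""
    calls = [("get_user_details", {"user_id": user_id})]
    # Every (tool, origin, date) combination must be checked before the
    # "fastest overall flight" can be identified.
    for tool, origin, date in product(SEARCH_TOOLS, ORIGINS, DATES):
        calls.append(
            (tool, {"origin": origin, "destination": DESTINATION, "date": date})
        )
    return calls


read_plan = plan_reads("user_123")
direct_searches = [c for c in read_plan if c[0] == "search_direct_flight"]
onestop_searches = [c for c in read_plan if c[0] == "search_onestop_flight"]
```

Only after comparing the results of all eight searches can the agent fill in the arguments of the one book_reservation call, which is why the write decision stays the same while the evidence burden grows.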

Both tasks share the same gold write action. The difference is what the agent must do before writing: the read-tool count rises from 2 to 9. This motivates a new data-synthesis question:

Beyond teaching agents to act for longer, can we synthesize trajectories that teach them to read more carefully before they act?

The WRIT Pipeline

Synthesizing write- and read-intensive trajectories in three stages

Figure: Overview of the WRIT pipeline.

STAGE 1

Write-Read Intensive Tasks

Synthesize tasks with verifiable outcomes — both write-intensive requests with multiple sequential actions and read-heavy requests where one action demands extensive evidence gathering.

STAGE 2

User Behavior Diversification

Vary how users express and reveal the same request, so training data reflects realistic conversational behavior rather than only cooperative, fully-specified interactions.

STAGE 3

Simulation & Filtering

Run the agent and a user simulator through each task in an executable environment, keeping only correct and complete interactions as supervised fine-tuning trajectories.
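The filtering step in Stage 3 can be pictured as a simple predicate over simulated rollouts. The sketch below is not the authors' implementation: the `Trajectory` record and its two boolean fields are hypothetical stand-ins for whatever correctness and completeness checks the executable environment provides.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one simulated agent-user interaction."""
    task_id: str
    messages: list = field(default_factory=list)
    # Assumed check: the environment's final state matches the task's
    # verifiable outcome.
    final_state_matches_gold: bool = False
    # Assumed check: the conversation ended normally instead of being cut off.
    conversation_completed: bool = False


def filter_trajectories(trajectories):
    """Keep only rollouts that are both correct and complete."""
    return [
        t for t in trajectories
        if t.final_state_matches_gold and t.conversation_completed
    ]


rollouts = [
    Trajectory("task-a", final_state_matches_gold=True, conversation_completed=True),
    Trajectory("task-b", final_state_matches_gold=True, conversation_completed=False),
    Trajectory("task-c", final_state_matches_gold=False, conversation_completed=True),
]
sft_data = filter_trajectories(rollouts)
```

Requiring both conditions means a rollout that reaches the right outcome but is truncated, or one that finishes cleanly with the wrong outcome, never enters the fine-tuning set.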

Results

WRIT improves multi-turn agents across model families

On τ²-bench, under a controlled 2K-trajectory budget, WRIT consistently outperforms the strongest prior synthesis method on every tested base model.

Base model	Method	Retail	Airline	Average	Gain vs. best baseline
Qwen3-4B-Instruct	AReaL (best baseline)	59.43	47.00	55.64	
Qwen3-4B-Instruct	WRIT	71.05	61.00	67.99	+12.4
Llama-3.1-8B-Instruct	CoVe (best baseline)	52.19	32.00	46.04	
Llama-3.1-8B-Instruct	WRIT	54.61	50.00	53.20	+7.2
Qwen2.5-14B-Instruct	AReaL (best baseline)	57.68	43.00	53.20	
Qwen2.5-14B-Instruct	WRIT	72.37	57.50	67.84	+14.6

Pass¹ (%) on τ²-bench. Average is task-count weighted across Retail and Airline. Full tables, Pass⁴, hard subsets, and ablations are in the paper.
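The task-count weighting can be checked with a few lines of arithmetic. The sketch below assumes 114 Retail tasks and 50 Airline tasks; these counts are not stated on this page, but they reproduce the Average column to within rounding.

```python
# Assumed task counts per domain (not stated on this page).
TASK_COUNTS = {"retail": 114, "airline": 50}


def weighted_average(scores, counts=TASK_COUNTS):
    """Task-count weighted mean of per-domain Pass^1 scores (in %)."""
    total_tasks = sum(counts[domain] for domain in scores)
    weighted_sum = sum(scores[domain] * counts[domain] for domain in scores)
    return weighted_sum / total_tasks


# Qwen3-4B-Instruct rows from the table above.
writ_qwen3_4b = weighted_average({"retail": 71.05, "airline": 61.00})
areal_qwen3_4b = weighted_average({"retail": 59.43, "airline": 47.00})
gain = writ_qwen3_4b - areal_qwen3_4b

print(f"WRIT {writ_qwen3_4b:.2f}, AReaL {areal_qwen3_4b:.2f}, gain {gain:.2f}")
```

Because Retail contributes more than twice as many tasks as Airline under this assumption, the Average sits closer to the Retail score than a plain mean of the two columns would.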

Figure: Pass^k curves for Qwen3-4B-Instruct-2507 across full and read-heavy subsets.
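For readers unfamiliar with the metric, Pass^k measures reliability: the chance that all k independent attempts at a task succeed. The sketch below assumes the combinatorial estimator commonly used with the τ-bench family, where a task with c successes in n trials contributes C(c, k) / C(n, k); the success counts in the example are made up for illustration.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate Pass^k from per-task success counts over num_trials runs.

    Each task contributes C(c, k) / C(num_trials, k), the fraction of
    size-k subsets of its trials in which every trial succeeded.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    per_task = [comb(c, k) / comb(num_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Made-up example: three tasks, four trials each.
example_successes = [4, 3, 0]
pass1 = pass_hat_k(example_successes, num_trials=4, k=1)
pass4 = pass_hat_k(example_successes, num_trials=4, k=4)
```

With k = 1 the estimator reduces to the average success rate, while larger k rewards only tasks the agent solves on every attempt, so a curve that decays slowly in k indicates consistent behavior.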

A 4B model vs. GPT-5.1 on τ²-bench

Model	Avg. Pass¹	Output tokens	Cost
GPT-5.1 thinking	79.27	1.52M	$17.52
GPT-5.1 no-think	62.80	318K	$5.56
WRIT-4B	67.99	251K	not listed

A 4B model trained on just 2K WRIT trajectories scores 67.99, which is 5.19 points ahead of GPT-5.1 no-think (62.80), while emitting about 21% fewer output tokens at inference time (251K vs. 318K). GPT-5.1 with thinking remains ahead at 79.27.

Citation

BibTeX

If you find WRIT useful, please consider citing:

@misc{gu2026writwritereadintensivetrajectory,
  title={WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents},
  author={Hengrui Gu and Xiaotian Han and Kaixiong Zhou},
  year={2026},
  eprint={2606.02908},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.02908},
}