T2Q is a research framework for evaluating Large Language Models (LLMs) in text-based game environments. It introduces a new benchmark suite designed to jointly evaluate the task-completion and environment-understanding abilities of LLM-based agents. T2Q is built on the TextWorld game engine.
- Coverage Planning Logic: We modify the TextWorld engine to implement coverage planning: a set of quests is generated that covers the reachable rooms and objects in a game, so an agent can explore the whole environment by completing the quests (a simplified sketch of this idea follows the feature list below).
- Various Categories of QA: Based on the elements of each game, multiple categories of QA tasks are supported, including location, connection, direction, match, and properties, evaluating an LLM agent's environment-understanding ability from multiple angles.
- Custom Benchmark: A curated set of text-based games with varying difficulty levels (Easy, Medium, Hard). More fine-grained settings can be customized by the user.
- Fully Automated Pipeline: The evaluation pipeline is fully automated, covering game generation, QA generation, and evaluation.
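To make the coverage-planning idea concrete, here is a minimal, hedged sketch of the underlying intuition: treat each candidate quest as covering a set of rooms and objects, then greedily pick quests until everything reachable is covered. The function name, quest names, and room/object identifiers below are illustrative only and are not part of the modified engine's API.

```python
from typing import Dict, List, Set


def plan_covering_quests(candidates: Dict[str, Set[str]],
                         targets: Set[str]) -> List[str]:
    """Greedy set cover: pick quests whose visited rooms/objects
    jointly cover all reachable targets."""
    remaining = set(targets)
    chosen: List[str] = []
    while remaining:
        # Pick the quest that covers the most still-uncovered targets.
        best = max(candidates, key=lambda q: len(candidates[q] & remaining))
        if not candidates[best] & remaining:
            break  # Nothing new is covered; remaining targets are unreachable.
        chosen.append(best)
        remaining -= candidates[best]
    return chosen


# Hypothetical quests mapped to the rooms/objects they touch.
quests = {
    "open_fridge": {"kitchen", "fridge", "apple"},
    "find_key":    {"garden", "shed", "key"},
    "walk_around": {"kitchen", "garden"},
}
print(plan_covering_quests(quests, {"kitchen", "fridge", "apple",
                                    "garden", "shed", "key"}))
```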
T2Q/
├── benchmark/ # The T2Q Benchmark dataset (games & stats)
├── scripts/ # Scripts for generation, evaluation, and testing
├── t2q/ # Core agent framework and utilities
└── textworld_modified/ # The underlying game engine
This project requires Python 3.9+ and a Linux/macOS environment (or WSL on Windows).
git clone https://github.com/your-username/T2Q.git
cd T2Q

Install the modified TextWorld engine from the textworld_modified folder:
cd textworld_modified
pip install -e .

More details are available in textworld_modified/README.md. For example, to install the visualization tools, run:
pip install -e .[vis]

Then install the project's Python dependencies:

pip install -r requirements.txt

Create custom game environments using the generator:
python scripts/data_gen.py --nb_games 10 --output_dir ./test_games

The generated games will be saved in the <output_dir>/<env_id> folder.
Each folder contains a set of .z8 files representing the task set, together with an env_qa.jsonl file containing the QA pairs for the environment.
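As a quick sanity check, the snippet below shows one way to inspect such a folder; the directory name and the QA record fields it prints are assumptions, since the exact schema of env_qa.jsonl may differ.

```python
import json
from pathlib import Path

env_dir = Path("./test_games") / "env_0"  # hypothetical <output_dir>/<env_id>

# The task set: one .z8 game file per generated quest.
game_files = sorted(env_dir.glob("*.z8"))
print(f"{len(game_files)} quest games found")

# The environment QA pairs, one JSON object per line.
with open(env_dir / "env_qa.jsonl", encoding="utf-8") as f:
    qa_pairs = [json.loads(line) for line in f if line.strip()]
print(f"{len(qa_pairs)} QA pairs loaded")
if qa_pairs:
    print(qa_pairs[0])  # inspect one record to see the actual fields
```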
Create a .env file in the root directory and set the required environment variables, such as the LLM API key and model name. An example is provided in the .env.example file.
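If you want to verify that the configuration is picked up before launching a run, a small sketch like the one below can help; the variable names LLM_API_KEY and LLM_MODEL are assumptions, so use whatever names .env.example actually defines.

```python
import os
from dotenv import load_dotenv  # from the python-dotenv package

load_dotenv()  # reads the .env file in the current working directory

# Assumed variable names; replace with those listed in .env.example.
api_key = os.environ.get("LLM_API_KEY")
model = os.environ.get("LLM_MODEL")
print(f"model={model!r}, api key set: {api_key is not None}")
```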
Evaluate an agent on the generated games:
python scripts/agent_test.py --games_dir <output_dir> --results_root <results_root> --provider <provider> --model <model> --mem_type <mem_type> --qa_max_workers <qa_max_workers> --num_workers <num_workers>

The results will be saved in the <results_root>/<mem_type> folder. qa_max_workers is the number of threads used to process QA tasks, and num_workers is the number of threads used to process game environments.
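The two worker flags correspond to two levels of parallelism: an outer pool over game environments and an inner pool over the QA tasks of each environment. The sketch below illustrates that structure only; the helper functions are placeholders, not the actual agent_test.py code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def answer_question(question: str) -> str:
    return f"answer to: {question}"  # placeholder for the LLM call


def run_environment(env_id: str, qa_max_workers: int) -> List[str]:
    questions = [f"{env_id} / question {i}" for i in range(4)]  # placeholder QA set
    # Inner pool: qa_max_workers threads per environment for QA tasks.
    with ThreadPoolExecutor(max_workers=qa_max_workers) as qa_pool:
        return list(qa_pool.map(answer_question, questions))


def run_all(env_ids: List[str], num_workers: int, qa_max_workers: int):
    # Outer pool: num_workers threads across game environments.
    with ThreadPoolExecutor(max_workers=num_workers) as env_pool:
        futures = [env_pool.submit(run_environment, e, qa_max_workers)
                   for e in env_ids]
        return {e: f.result() for e, f in zip(env_ids, futures)}


print(run_all(["env_0", "env_1"], num_workers=2, qa_max_workers=4))
```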
Calculate metrics from run logs:
python scripts/eval.py --results_root <results_root> --output <output_path>

The evaluation results will be saved to the <output_path> file in JSON format.
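To work with the metrics programmatically, something like the following is enough; the key names inside the JSON are not specified here, so inspect the file to see what eval.py actually reports.

```python
import json

with open("results/metrics.json", encoding="utf-8") as f:  # hypothetical <output_path>
    metrics = json.load(f)

# Print whatever top-level metrics were reported (e.g. task completion, QA accuracy).
for name, value in metrics.items():
    print(f"{name}: {value}")
```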
- T2Q Code & Benchmark: MIT License (c) 2026 [Your Name/Organization].
- TextWorld Engine: Modified from Microsoft TextWorld, distributed under its original MIT License.
If you use this work in your research, please cite:
@misc{liu2026llmagentsknowworld,
title={What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding},
author={Siyuan Liu and Hongbang Yuan and Xinze Li and Ziyue Zhu and Yixin Cao and Yu-Gang Jiang},
year={2026},
eprint={2601.09503},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.09503},
}

Please also cite the original TextWorld paper:
@Article{cote18textworld,
author = {Marc-Alexandre C\^ot\'e and
\'Akos K\'ad\'ar and
Xingdi Yuan and
Ben Kybartas and
Tavian Barnes and
Emery Fine and
James Moore and
Ruo Yu Tao and
Matthew Hausknecht and
Layla El Asri and
Mahmoud Adada and
Wendy Tay and
Adam Trischler},
title = {TextWorld: A Learning Environment for Text-based Games},
journal = {CoRR},
volume = {abs/1806.11532},
year = {2018}
}