T2Q is a research framework for evaluating Large Language Models (LLMs) in text-based game environments. It introduces a new benchmark suite designed to jointly evaluate the task-completion and environment-understanding abilities of LLM-based agents. T2Q is built on the TextWorld game engine.
- Coverage Planning Logic: We modify the TextWorld engine to implement coverage planning: a set of quests is generated that covers the reachable rooms and objects in a game, so an agent can explore the whole environment by completing the quests (a simplified sketch of this idea follows the feature list below).
- Various Categories of QA: Based on the elements of each game, multiple categories of QA tasks are supported, including location, connection, direction, match, and properties, evaluating an LLM agent's environment-understanding ability from multiple angles.
- Custom Benchmark: A curated set of text-based games with varying difficulty levels (Easy, Medium, Hard). More fine-grained settings can be customized by the user.
- Fully Automated Pipeline: The evaluation pipeline is fully automated, covering game generation, QA generation, and evaluation.
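To make the coverage-planning idea concrete, here is a minimal, hedged sketch of the underlying intuition: treat each candidate quest as covering a set of rooms and objects, then greedily pick quests until everything reachable is covered. The function name, quest names, and room/object identifiers below are illustrative only and are not part of the modified engine's API.

```python
from typing import Dict, List, Set


def plan_covering_quests(candidates: Dict[str, Set[str]],
                         targets: Set[str]) -> List[str]:
    """Greedy set cover: pick quests whose visited rooms/objects
    jointly cover all reachable targets."""
    remaining = set(targets)
    chosen: List[str] = []
    while remaining:
        # Pick the quest that covers the most still-uncovered targets.
        best = max(candidates, key=lambda q: len(candidates[q] & remaining))
        if not candidates[best] & remaining:
            break  # Nothing new is covered; remaining targets are unreachable.
        chosen.append(best)
        remaining -= candidates[best]
    return chosen


# Hypothetical quests mapped to the rooms/objects they touch.
quests = {
    "open_fridge": {"kitchen", "fridge", "apple"},
    "find_key":    {"garden", "shed", "key"},
    "walk_around": {"kitchen", "garden"},
}
print(plan_covering_quests(quests, {"kitchen", "fridge", "apple",
                                    "garden", "shed", "key"}))
```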
T2Q/
├── benchmark/ # The T2Q Benchmark dataset (games & stats)
├── scripts/ # Scripts for generation, evaluation, and testing
├── t2q/ # Core agent framework and utilities
└── textworld_modified/ # The underlying game engine
This project requires Python 3.9+ and a Linux/macOS environment (or WSL on Windows).
git clone https://github.com/your-username/T2Q.git
cd T2Q

Install the modified TextWorld engine from the textworld_modified folder:
cd textworld_modified
pip install -e .

More details are available in textworld_modified/README.md. For example, to install the visualization tools, run:
pip install -e .[vis]

Then install the project's Python dependencies:

pip install -r requirements.txt

Create custom game environments using the generator:
python scripts/data_gen.py --nb_games 10 --output_dir ./test_games

The generated games will be saved in the <output_dir>/<env_id> folder.
Each folder contains a set of .z8 files representing the task set, together with an env_qa.jsonl file containing the QA pairs for the environment.
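As a quick sanity check, the snippet below shows one way to inspect such a folder; the directory name and the QA record fields it prints are assumptions, since the exact schema of env_qa.jsonl may differ.

```python
import json
from pathlib import Path

env_dir = Path("./test_games") / "env_0"  # hypothetical <output_dir>/<env_id>

# The task set: one .z8 game file per generated quest.
game_files = sorted(env_dir.glob("*.z8"))
print(f"{len(game_files)} quest games found")

# The environment QA pairs, one JSON object per line.
with open(env_dir / "env_qa.jsonl", encoding="utf-8") as f:
    qa_pairs = [json.loads(line) for line in f if line.strip()]
print(f"{len(qa_pairs)} QA pairs loaded")
if qa_pairs:
    print(qa_pairs[0])  # inspect one record to see the actual fields
```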
Create a .env file in the root directory and set the required environment variables, such as the LLM API key and model name. An example is provided in the .env.example file.
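If you want to verify that the configuration is picked up before launching a run, a small sketch like the one below can help; the variable names LLM_API_KEY and LLM_MODEL are assumptions, so use whatever names .env.example actually defines.

```python
import os
from dotenv import load_dotenv  # from the python-dotenv package

load_dotenv()  # reads the .env file in the current working directory

# Assumed variable names; replace with those listed in .env.example.
api_key = os.environ.get("LLM_API_KEY")
model = os.environ.get("LLM_MODEL")
print(f"model={model!r}, api key set: {api_key is not None}")
```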
Evaluate an agent on the generated games:
python scripts/agent_test.py --games_dir <output_dir> --results_root <results_root> --provider <provider> --model <model> --mem_type <mem_type> --qa_max_workers <qa_max_workers> --num_workers <num_workers>

The results will be saved in the <results_root>/<mem_type> folder. qa_max_workers is the number of threads used to process QA tasks, and num_workers is the number of threads used to process game environments.
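The two worker flags correspond to two levels of parallelism: an outer pool over game environments and an inner pool over the QA tasks of each environment. The sketch below illustrates that structure only; the helper functions are placeholders, not the actual agent_test.py code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def answer_question(question: str) -> str:
    return f"answer to: {question}"  # placeholder for the LLM call


def run_environment(env_id: str, qa_max_workers: int) -> List[str]:
    questions = [f"{env_id} / question {i}" for i in range(4)]  # placeholder QA set
    # Inner pool: qa_max_workers threads per environment for QA tasks.
    with ThreadPoolExecutor(max_workers=qa_max_workers) as qa_pool:
        return list(qa_pool.map(answer_question, questions))


def run_all(env_ids: List[str], num_workers: int, qa_max_workers: int):
    # Outer pool: num_workers threads across game environments.
    with ThreadPoolExecutor(max_workers=num_workers) as env_pool:
        futures = [env_pool.submit(run_environment, e, qa_max_workers)
                   for e in env_ids]
        return {e: f.result() for e, f in zip(env_ids, futures)}


print(run_all(["env_0", "env_1"], num_workers=2, qa_max_workers=4))
```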
Calculate metrics from run logs:
python scripts/eval.py --results_root <results_root> --output <output_path>

The evaluation results will be saved to the <output_path> file in JSON format.
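To work with the metrics programmatically, something like the following is enough; the key names inside the JSON are not specified here, so inspect the file to see what eval.py actually reports.

```python
import json

with open("results/metrics.json", encoding="utf-8") as f:  # hypothetical <output_path>
    metrics = json.load(f)

# Print whatever top-level metrics were reported (e.g. task completion, QA accuracy).
for name, value in metrics.items():
    print(f"{name}: {value}")
```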
- T2Q Code & Benchmark: MIT License (c) 2026 [Your Name/Organization].
- TextWorld Engine: Modified from Microsoft TextWorld, distributed under its original MIT License.
If you use this work in your research, please cite:
@misc{liu2026llmagentsknowworld,
title={What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding},
author={Siyuan Liu and Hongbang Yuan and Xinze Li and Ziyue Zhu and Yixin Cao and Yu-Gang Jiang},
year={2026},
eprint={2601.09503},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.09503},
}

Please also cite the original TextWorld paper:
@Article{cote18textworld,
author = {Marc-Alexandre C\^ot\'e and
\'Akos K\'ad\'ar and
Xingdi Yuan and
Ben Kybartas and
Tavian Barnes and
Emery Fine and
James Moore and
Ruo Yu Tao and
Matthew Hausknecht and
Layla El Asri and
Mahmoud Adada and
Wendy Tay and
Adam Trischler},
title = {TextWorld: A Learning Environment for Text-based Games},
journal = {CoRR},
volume = {abs/1806.11532},
year = {2018}
}