
BELA: Agentic In-context Experiential Reasoning

Benchmarking In-context Experiential Reasoning Through Repeated Product Recommendations

This repository implements BELA, a benchmark for evaluating how LLM agents learn and adapt through experiential reasoning in multi-episode product recommendation scenarios. BELA challenges agents to improve performance across episodes by learning through natural language interactions rather than through explicit parameter updates.

Overview

BELA evaluates an agent's ability to perform in-context experiential reasoning. Specifically, agents must:

  • Elicit latent user preferences through strategic questioning
  • Navigate evolving product landscapes and user needs
  • Leverage cross-episode memory to improve recommendations
  • Manage uncertainty in incomplete information environments

Core Components

  1. Real-world Products: 71K+ Amazon items across 2K+ categories with rich metadata
  2. Diverse Personas: 40K+ user profiles with varied, latent preferences and demographics
  3. LLM User Simulator: Realistic interaction trajectories powered by persona-driven response generation

Quick Start

Basic Experiments

cd experiment_runners

# Run with example config
python run_experiment.py --config configs/basic_variable_category.yaml

# Interactive trajectory building
python run_experiment.py --config configs/interactive_example.yaml

Resuming from Checkpoint

Enable checkpointing in your experiment config, then point resume_from_checkpoint at a saved checkpoint file:

checkpoint_enabled: true
resume_from_checkpoint: "experiment_results/checkpoint_traj2_ep8.json"

Submitting Results to the Leaderboard

Submit your results to the official leaderboard:

Website: https://www.experiential-learning-benchmark.com/

Running Benchmark Experiments

To contribute results to the BELA benchmark leaderboard, use the official benchmark configuration files located in experiment_runners/configs/benchmark_configs/. There are three main benchmark experiments:

  1. Variable Category (variable_category.yaml): Fixed persona, varying product categories
  2. Variable Persona (variable_persona.yaml): Fixed category, varying user personas
  3. Variable Settings (variable_settings.yaml): Both persona and category vary

Important: Before running, update the model field in each config file to specify the model you are submitting. For example:

model: "gpt-4o"  # Change this to your model name

Running Each Benchmark

cd experiment_runners

# Run Variable Category benchmark
python run_experiment.py --config configs/benchmark_configs/variable_category.yaml

# Run Variable Persona benchmark
python run_experiment.py --config configs/benchmark_configs/variable_persona.yaml

# Run Variable Settings benchmark
python run_experiment.py --config configs/benchmark_configs/variable_settings.yaml

Results are automatically saved to the experiment_results/ directory with the following structure:

experiment_results/
└── {experiment_type}_{model}_{feedback_type}/
    ├── config.json              # Full configuration used
    └── results.json             # Results with config file path reference

Submission Requirements

You can submit results to one or more of the three leaderboards (variable_category, variable_persona, or variable_settings). For each leaderboard you want to submit to:

  1. Run the corresponding benchmark experiment (see Running Each Benchmark above)
  2. Submit only the results.json file (not the config.json file)

The results.json file is located at:

experiment_results/{experiment_type}_{model}_{feedback_type}/results.json
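Before uploading, a quick sanity check that the file exists and parses can save a rejected submission. A minimal sketch (the directory name below is an illustrative run; substitute your own experiment type, model, and feedback type):

# Sketch: validate results.json before submitting (path is an example run)
import json
from pathlib import Path

path = Path("experiment_results/variable_category_gpt-4o_persona/results.json")
assert path.exists(), f"missing {path}"
json.loads(path.read_text())  # raises if the file is not valid JSON
print("ready to submit:", path)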

Project Structure

├── pipeline/                 # Core framework
│   ├── core/                # Personas, agents, LLM providers, scoring
│   │   └── llm_providers/  # OpenAI, Claude, Gemini integrations
│   ├── envs/               # Recommendation environment (Gymnasium)
│   └── wrappers/           # Metrics, feedback, logging
├── experiments/             # Experiment orchestration and baselines
├── experiment_runners/      # Configuration and launch scripts
│   └── configs/            # YAML configuration files
├── config/                 # Configuration dataclasses (Python code)
├── database/               # Product database, caching, HuggingFace sync
├── database_creation/      # Scripts for categorizing/processing products
├── data/                   # Personas, product mappings, trajectories
├── graphing/               # Visualization and analysis tools
└── webpage/                # Interactive leaderboard and submission interface

Key Features

Configuration System

All experiments use YAML configs with 31 parameters covering:

  • Experiment setup (type, episodes, trajectories, seeds)
  • Agent parameters (model, temperature, max questions)
  • Context modes (raw, summary, planning)
  • Feedback types (persona, oracle, LLM-based)
  • Checkpointing and resumption
  • Interactive trajectory generation

See experiment_runners/config_reference.yaml for documentation of all parameters.
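For orientation, here is a hypothetical config combining the snippets shown in this README (only the experiment type, model, checkpointing, planning, and interactive fields appear elsewhere in this document; treat any other field name as an assumption and verify it against config_reference.yaml):

# Hypothetical combined config -- verify every field against config_reference.yaml
experiment_type: "variable_category"   # experiment setup
model: "gpt-4o"                        # agent model
feedback_type: "persona"               # assumed key; persona, oracle, or LLM-based
checkpoint_enabled: true               # checkpointing and resumption
planning_mode: "planning_greedy"       # planning modes (see below)
interactive_mode: false                # interactive trajectory generation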

Experiment Types

BELA supports three experimental paradigms to isolate different adaptation challenges:

  • variable_category: Fixed persona, varying product categories (preference generalization)
  • variable_persona: Fixed category, varying user personas (user adaptation)
  • variable_settings: Both persona and category vary (full adaptation)

Planning Modes

Planning modes force the agent to give a recommendation after each question within an episode, enabling analysis of within-episode improvement and of whether this within-episode learning rate increases across later episodes.

  • planning_no_strat: No planning strategy (unmodified baseline)
  • planning_greedy: Greedy question selection
  • planning_dp: Dynamic programming-style lookahead

Configure via:

planning_mode: "planning_dp"
planning_interval: 5

Interactive Mode

Generate multiple trajectory variants for manual curation:

  1. System produces N variants of Episode 1
  2. User selects the preferred variant
  3. System generates N variants of Episode 2 from the selected Episode 1
  4. Repeat until the trajectory is complete

Configure via:

interactive_mode: true
interactive_variants: 10
interactive_input_file: "episode_01_variant_003.json"  # For continuation

Setup & Installation

Prerequisites

  • Python 3.9+
  • ~1GB disk space (500MB database + dependencies)
  • API keys for LLM providers (at minimum OpenAI and Google, which are used for scoring):
    • OpenAI (GPT-4, GPT-3.5)
    • Anthropic (Claude)
    • Google (Gemini)

Installation Steps

1. Clone Repository

git clone https://github.com/namkoong-lab/interactive-benchmark.git
cd interactive-benchmark

2. Install Dependencies

pip install -r requirements.txt

3. Configure API Keys

Create a .env file in the project root:

OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...
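This README doesn't show how the framework loads these keys; assuming the common python-dotenv pattern, you can verify they are visible with a short sketch:

# Sketch: confirm .env keys load, assuming python-dotenv (pip install python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root / current directory
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"):
    print(key, "->", "set" if os.getenv(key) else "MISSING")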

4. Database Setup

The product database is hosted on HuggingFace and will automatically download on first run.

Automatic Setup (Recommended)

# Just run any experiment - database downloads automatically
cd experiment_runners
python run_experiment.py --config configs/basic_variable_category.yaml

On first run, you'll see:

🔄 Database not found. Downloading from HuggingFace...
📦 Downloading products_part1.parquet (4/4)...
✅ Database setup complete!

Manual Pre-download (Optional)

# Pre-download database before running experiments
cd database
python setup_database.py

This downloads 4 Parquet files (~500MB total) from HuggingFace and builds a local SQLite database.
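To spot-check a downloaded shard, something like the following works (requires pandas with a Parquet engine such as pyarrow; the product schema isn't documented here, so the sketch just prints whatever columns the shard contains):

# Sketch: inspect one Parquet shard of the product data (run from database/)
import pandas as pd

df = pd.read_parquet("cache/products_part1.parquet")
print(len(df), "rows")     # one shard of the ~71K products
print(list(df.columns))    # actual column names are not documented in this README
print(df.head(3))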

Database Contents

The BELA database contains:

  • 71,088 products from Amazon with rich metadata
  • 2,030 product categories organized into substitute sets
  • Product attributes: titles, prices, ratings, descriptions, images
  • Score cache: Stores persona-product scores to avoid re-computation

Database Structure:

database/
├── personas.db              # SQLite database (auto-generated)
├── setup_database.py        # Download script
└── cache/                   # Downloaded Parquet files
    ├── products_part1.parquet
    ├── products_part2.parquet
    ├── products_part3.parquet
    └── products_part4.parquet

Score Caching: The database includes a persona_scores table that grows during experiments. Cached scores are reused across runs, speeding up repeated experiments with the same personas/categories.
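For example, you can watch the cache grow between runs with a quick query (only the table name persona_scores is documented here, so the sketch inspects the schema rather than assuming column names):

# Sketch: count cached persona-product scores (run from the repository root)
import sqlite3

conn = sqlite3.connect("database/personas.db")
print(conn.execute("PRAGMA table_info(persona_scores)").fetchall())  # real schema
(count,) = conn.execute("SELECT COUNT(*) FROM persona_scores").fetchone()
print(count, "cached scores")
conn.close()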

Integrating Custom Models

BELA's modular architecture makes it easy to benchmark your own LLMs and agents.

Option 1: Add a New LLM Provider

Integrate a new LLM API (e.g., Cohere, Mistral, local models) in 4 steps:

  1. Copy the template: Use pipeline/core/llm_providers/custom_provider_template.py as a starting point
  2. Implement two methods:
    • __init__(): Load API key and initialize client
    • chat_completion(): Make API calls with retry logic
  3. Register your provider in pipeline/core/llm_providers/__init__.py
  4. Add API key to .env and use in your config

See: custom_provider_template.py for detailed implementation guide and examples from openai_provider.py, claude_provider.py, gemini_provider.py.
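As a rough sketch of the shape such a provider takes (the class name, method signatures, and client call below are illustrative placeholders; custom_provider_template.py defines the actual interface):

# Illustrative provider sketch -- follow custom_provider_template.py for the real interface
import os
import time

class MyCustomProvider:
    def __init__(self):
        # Load the API key added to .env in step 4 (hypothetical key name)
        self.api_key = os.environ["MY_PROVIDER_API_KEY"]
        self.client = None  # initialize your SDK client here

    def chat_completion(self, model, messages, max_retries=3):
        # Call the API with simple exponential-backoff retries
        for attempt in range(max_retries):
            try:
                return self.client.chat(model=model, messages=messages)  # placeholder call
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)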

Test your provider:

from pipeline.core.llm_providers import chat_completion
response = chat_completion(
    model="my-model-v1",
    messages=[{"role": "user", "content": "Hello!"}]
)

Option 2: Custom Agent Logic

For advanced agent behavior (custom prompting, tool use, RAG), extend UnifiedAgent:

  1. Create custom agent in pipeline/core/my_custom_agent.py
  2. Override methods:
    • decide_action(): Custom decision logic
    • _build_llm_context(): Custom prompt construction
    • Add pre/post-processing (tool calls, retrieval, etc.)
  3. Modify the experiment runner to use your agent class
  4. Test at a small scale before running full experiments

Key extension points (see the sketch after this list):

  • decide_action(): Control when to ask vs recommend
  • _build_llm_context(): Customize product/dialog presentation
  • _llm_decide_action(): Override core LLM prompting
  • Add external knowledge, tools, or multi-step reasoning
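A hedged sketch of such a subclass (the import path and method signatures are assumptions; check the actual UnifiedAgent definition before copying this):

# Sketch: subclass UnifiedAgent -- import path and signatures assumed, verify in the repo
from pipeline.core.unified_agent import UnifiedAgent  # assumed module path

class MyCustomAgent(UnifiedAgent):
    def decide_action(self, *args, **kwargs):
        # Pre-processing hook: e.g., run retrieval or tool calls before deciding
        return super().decide_action(*args, **kwargs)

    def _build_llm_context(self, *args, **kwargs):
        # Customize how products and dialog history are presented to the LLM
        return super()._build_llm_context(*args, **kwargs)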

Citation

@inproceedings{yang2025bela,
  title={Benchmarking In-context Experiential Reasoning Through Repeated Product Recommendations},
  author={Yang, Gilbert and Chen, Yaqin and Yen, Thomson and Namkoong, Hongseok},
  year={2025}
}

License

MIT License - see LICENSE for details.
