
BELA: Agentic In-context Experiential Reasoning

Benchmarking In-context Experiential Reasoning Through Repeated Product Recommendations

This repository implements BELA, a benchmark for evaluating how LLM agents learn and adapt through experiential reasoning in multi-episode product recommendation scenarios. BELA challenges agents to improve performance across episodes by learning through natural language interactions rather than through explicit parameter updates.

Overview

BELA evaluates an agent's ability to perform in-context experiential reasoning. Specifically, agents must:

  • Elicit latent user preferences through strategic questioning
  • Navigate evolving product landscapes and user needs
  • Leverage cross-episode memory to improve recommendations
  • Manage uncertainty in incomplete information environments

Core Components

  1. Real-world Products: 71K+ Amazon items across 2K+ categories with rich metadata
  2. Diverse Personas: 40K+ user profiles with varied, latent preferences and demographics
  3. LLM User Simulator: Realistic interaction trajectories powered by persona-driven response generation

Quick Start

Basic Experiments

cd experiment_runners

# Run with example config
python run_experiment.py --config configs/basic_variable_category.yaml

# Interactive trajectory building
python run_experiment.py --config configs/interactive_example.yaml

Resuming from Checkpoint

Enable checkpointing in your experiment config, then point resume_from_checkpoint at a saved checkpoint file:

checkpoint_enabled: true
resume_from_checkpoint: "experiment_results/checkpoint_traj2_ep8.json"

Submitting Results to the Leaderboard

Submit your results to the official leaderboard:

Website: https://www.experiential-learning-benchmark.com/

Running Benchmark Experiments

To contribute results to the BELA benchmark leaderboard, use the official benchmark configuration files located in experiment_runners/configs/benchmark_configs/. There are three main benchmark experiments:

  1. Variable Category (variable_category.yaml): Fixed persona, varying product categories
  2. Variable Persona (variable_persona.yaml): Fixed category, varying user personas
  3. Variable Settings (variable_settings.yaml): Both persona and category vary

Important: Before running, update the model field in each config file to specify the model you are submitting. For example:

model: "gpt-4o"  # Change this to your model name

Running Each Benchmark

cd experiment_runners

# Run Variable Category benchmark
python run_experiment.py --config configs/benchmark_configs/variable_category.yaml

# Run Variable Persona benchmark
python run_experiment.py --config configs/benchmark_configs/variable_persona.yaml

# Run Variable Settings benchmark
python run_experiment.py --config configs/benchmark_configs/variable_settings.yaml

Results are automatically saved to the experiment_results/ directory with the following structure:

experiment_results/
└── {experiment_type}_{model}_{feedback_type}/
    ├── config.json              # Full configuration used
    └── results.json             # Results with config file path reference

Submission Requirements

You can submit results to one or more of the three leaderboards (variable_category, variable_persona, or variable_settings). For each leaderboard you want to submit to:

  1. Run the corresponding benchmark experiment (see Running Each Benchmark above)
  2. Submit only the results.json file (not the config.json file)

The results.json file is located at:

experiment_results/{experiment_type}_{model}_{feedback_type}/results.json
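Before uploading, a quick sanity check that the file exists and parses can save a rejected submission. A minimal sketch (the directory name below is an illustrative run; substitute your own experiment type, model, and feedback type):

# Sketch: validate results.json before submitting (path is an example run)
import json
from pathlib import Path

path = Path("experiment_results/variable_category_gpt-4o_persona/results.json")
assert path.exists(), f"missing {path}"
json.loads(path.read_text())  # raises if the file is not valid JSON
print("ready to submit:", path)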

Project Structure

├── pipeline/                 # Core framework
│   ├── core/                # Personas, agents, LLM providers, scoring
│   │   └── llm_providers/  # OpenAI, Claude, Gemini integrations
│   ├── envs/               # Recommendation environment (Gymnasium)
│   └── wrappers/           # Metrics, feedback, logging
├── experiments/             # Experiment orchestration and baselines
├── experiment_runners/      # Configuration and launch scripts
│   └── configs/            # YAML configuration files
├── config/                 # Configuration dataclasses (Python code)
├── database/               # Product database, caching, HuggingFace sync
├── database_creation/      # Scripts for categorizing/processing products
├── data/                   # Personas, product mappings, trajectories
├── graphing/               # Visualization and analysis tools
└── webpage/                # Interactive leaderboard and submission interface

Key Features

Configuration System

All experiments use YAML configs with 31 parameters covering:

  • Experiment setup (type, episodes, trajectories, seeds)
  • Agent parameters (model, temperature, max questions)
  • Context modes (raw, summary, planning)
  • Feedback types (persona, oracle, LLM-based)
  • Checkpointing and resumption
  • Interactive trajectory generation

See experiment_runners/config_reference.yaml for documentation of all parameters.
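For orientation, here is a hypothetical config combining the snippets shown in this README (only the experiment type, model, checkpointing, planning, and interactive fields appear elsewhere in this document; treat any other field name as an assumption and verify it against config_reference.yaml):

# Hypothetical combined config -- verify every field against config_reference.yaml
experiment_type: "variable_category"   # experiment setup
model: "gpt-4o"                        # agent model
feedback_type: "persona"               # assumed key; persona, oracle, or LLM-based
checkpoint_enabled: true               # checkpointing and resumption
planning_mode: "planning_greedy"       # planning modes (see below)
interactive_mode: false                # interactive trajectory generation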

Experiment Types

BELA supports three experimental paradigms to isolate different adaptation challenges:

  • variable_category: Fixed persona, varying product categories (preference generalization)
  • variable_persona: Fixed category, varying user personas (user adaptation)
  • variable_settings: Both persona and category vary (full adaptation)

Planning Modes

Planning modes force the agent to give a recommendation after each question within an episode, enabling analysis of within-episode improvement and of whether this within-episode learning rate increases across later episodes.

  • planning_no_strat: No planning strategy (unmodified baseline)
  • planning_greedy: Greedy question selection
  • planning_dp: Dynamic programming-style lookahead

Configure via:

planning_mode: "planning_dp"
planning_interval: 5

Interactive Mode

Generate multiple trajectory variants for manual curation:

  1. System produces N variants of Episode 1
  2. User selects the preferred variant
  3. System generates N variants of Episode 2 from the selected Episode 1
  4. Repeat until the trajectory is complete

Configure via:

interactive_mode: true
interactive_variants: 10
interactive_input_file: "episode_01_variant_003.json"  # For continuation

Setup & Installation

Prerequisites

  • Python 3.9+
  • ~1GB disk space (500MB database + dependencies)
  • API keys for LLM providers (at minimum OpenAI and Google, which are used for scoring):
    • OpenAI (GPT-4, GPT-3.5)
    • Anthropic (Claude)
    • Google (Gemini)

Installation Steps

1. Clone Repository

git clone https://github.com/namkoong-lab/interactive-benchmark.git
cd interactive-benchmark

2. Install Dependencies

pip install -r requirements.txt

3. Configure API Keys

Create a .env file in the project root:

OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...
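This README doesn't show how the framework loads these keys; assuming the common python-dotenv pattern, you can verify they are visible with a short sketch:

# Sketch: confirm .env keys load, assuming python-dotenv (pip install python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root / current directory
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"):
    print(key, "->", "set" if os.getenv(key) else "MISSING")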

4. Database Setup

The product database is hosted on HuggingFace and will automatically download on first run.

Automatic Setup (Recommended)

# Just run any experiment - database downloads automatically
cd experiment_runners
python run_experiment.py --config configs/basic_variable_category.yaml

On first run, you'll see:

🔄 Database not found. Downloading from HuggingFace...
📦 Downloading products_part1.parquet (4/4)...
✅ Database setup complete!

Manual Pre-download (Optional)

# Pre-download database before running experiments
cd database
python setup_database.py

This downloads 4 Parquet files (~500MB total) from HuggingFace and builds a local SQLite database.
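To spot-check a downloaded shard, something like the following works (requires pandas with a Parquet engine such as pyarrow; the product schema isn't documented here, so the sketch just prints whatever columns the shard contains):

# Sketch: inspect one Parquet shard of the product data (run from database/)
import pandas as pd

df = pd.read_parquet("cache/products_part1.parquet")
print(len(df), "rows")     # one shard of the ~71K products
print(list(df.columns))    # actual column names are not documented in this README
print(df.head(3))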

Database Contents

The BELA database contains:

  • 71,088 products from Amazon with rich metadata
  • 2,030 product categories organized into substitute sets
  • Product attributes: titles, prices, ratings, descriptions, images
  • Score cache: Stores persona-product scores to avoid re-computation

Database Structure:

database/
├── personas.db              # SQLite database (auto-generated)
├── setup_database.py        # Download script
└── cache/                   # Downloaded Parquet files
    ├── products_part1.parquet
    ├── products_part2.parquet
    ├── products_part3.parquet
    └── products_part4.parquet

Score Caching: The database includes a persona_scores table that grows during experiments. Cached scores are reused across runs, speeding up repeated experiments with the same personas/categories.
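For example, you can watch the cache grow between runs with a quick query (only the table name persona_scores is documented here, so the sketch inspects the schema rather than assuming column names):

# Sketch: count cached persona-product scores (run from the repository root)
import sqlite3

conn = sqlite3.connect("database/personas.db")
print(conn.execute("PRAGMA table_info(persona_scores)").fetchall())  # real schema
(count,) = conn.execute("SELECT COUNT(*) FROM persona_scores").fetchone()
print(count, "cached scores")
conn.close()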

Integrating Custom Models

BELA's modular architecture makes it easy to benchmark your own LLMs and agents.

Option 1: Add a New LLM Provider

Integrate a new LLM API (e.g., Cohere, Mistral, local models) in 4 steps:

  1. Copy the template: Use pipeline/core/llm_providers/custom_provider_template.py as a starting point
  2. Implement two methods:
    • __init__(): Load API key and initialize client
    • chat_completion(): Make API calls with retry logic
  3. Register your provider in pipeline/core/llm_providers/__init__.py
  4. Add API key to .env and use in your config

See: custom_provider_template.py for detailed implementation guide and examples from openai_provider.py, claude_provider.py, gemini_provider.py.
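As a rough sketch of the shape such a provider takes (the class name, method signatures, and client call below are illustrative placeholders; custom_provider_template.py defines the actual interface):

# Illustrative provider sketch -- follow custom_provider_template.py for the real interface
import os
import time

class MyCustomProvider:
    def __init__(self):
        # Load the API key added to .env in step 4 (hypothetical key name)
        self.api_key = os.environ["MY_PROVIDER_API_KEY"]
        self.client = None  # initialize your SDK client here

    def chat_completion(self, model, messages, max_retries=3):
        # Call the API with simple exponential-backoff retries
        for attempt in range(max_retries):
            try:
                return self.client.chat(model=model, messages=messages)  # placeholder call
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)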

Test your provider:

from pipeline.core.llm_providers import chat_completion
response = chat_completion(
    model="my-model-v1",
    messages=[{"role": "user", "content": "Hello!"}]
)

Option 2: Custom Agent Logic

For advanced agent behavior (custom prompting, tool use, RAG), extend UnifiedAgent:

  1. Create custom agent in pipeline/core/my_custom_agent.py
  2. Override methods:
    • decide_action(): Custom decision logic
    • _build_llm_context(): Custom prompt construction
    • Add pre/post-processing (tool calls, retrieval, etc.)
  3. Modify the experiment runner to use your agent class
  4. Test at a small scale before running full experiments

Key extension points (see the sketch after this list):

  • decide_action(): Control when to ask vs recommend
  • _build_llm_context(): Customize product/dialog presentation
  • _llm_decide_action(): Override core LLM prompting
  • Add external knowledge, tools, or multi-step reasoning
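A hedged sketch of such a subclass (the import path and method signatures are assumptions; check the actual UnifiedAgent definition before copying this):

# Sketch: subclass UnifiedAgent -- import path and signatures assumed, verify in the repo
from pipeline.core.unified_agent import UnifiedAgent  # assumed module path

class MyCustomAgent(UnifiedAgent):
    def decide_action(self, *args, **kwargs):
        # Pre-processing hook: e.g., run retrieval or tool calls before deciding
        return super().decide_action(*args, **kwargs)

    def _build_llm_context(self, *args, **kwargs):
        # Customize how products and dialog history are presented to the LLM
        return super()._build_llm_context(*args, **kwargs)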

Citation

@inproceedings{yang2025bela,
  title={Benchmarking In-context Experiential Reasoning Through Repeated Product Recommendations},
  author={Yang, Gilbert and Chen, Yaqin and Yen, Thomson and Namkoong, Hongseok},
  year={2025}
}

License

MIT License - see LICENSE for details.
