[ROADMAP] DiscoveryBench Integration

# 🛰️ DiscoveryBench Integration

This issue tracks the integration of the DiscoveryBench benchmark into OpenHands. DiscoveryBench includes real-world and synthetic scientific discovery tasks that will help assess the agents' capabilities in complex, multi-step problem-solving. The goal is to measure how well OpenHands agents handle data-driven scientific discovery workflows.


## 📋 Tasks

### 1. Clone and set up DiscoveryBench repository
   - [ ] Clone the DiscoveryBench Git repository and install dependencies.

### 2. Create dataset for evaluation
   - [ ] Create a custom function that creates a dataset from the cloned repository.
   - [ ] Prepare the dataset for evaluation.
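
A minimal sketch of the dataset-creation step, for discussion only. It assumes each task in the cloned repository is described by a JSON file matching `metadata_*.json`; that file pattern, the function name `build_dataset`, and the record keys are hypothetical and should be adjusted to the actual repository layout.

```python
import json
from pathlib import Path


def build_dataset(repo_root, pattern="metadata_*.json"):
    """Collect one record per metadata file found under repo_root.

    Both the file pattern and the record keys are assumptions made for
    illustration; they are not taken from the DiscoveryBench repository.
    """
    records = []
    # sorted() keeps the instance order stable between runs
    for meta_path in sorted(Path(repo_root).rglob(pattern)):
        with meta_path.open(encoding="utf-8") as fh:
            metadata = json.load(fh)
        records.append(
            {
                # e.g. "<task directory>/<metadata file stem>"
                "instance_id": f"{meta_path.parent.name}/{meta_path.stem}",
                "data_dir": str(meta_path.parent),
                "metadata": metadata,
            }
        )
    return records
```

Returning plain dictionaries keeps the sketch dependency-free; the list can be wrapped in a DataFrame later if the evaluation harness expects one.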

### 3. Generate evaluation metadata and process each instance
   - [ ] Create metadata using the `make_metadata` function, including dataset and task info.
   - [ ] Use the `process_instance` method to prepare evaluation queries for each dataset instance.
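
As a rough illustration of the query-preparation part of `process_instance`, the sketch below assembles a prompt from one dataset instance. The helper name `build_query` and the instance keys `data_dir` and `question` are hypothetical; the real instance schema and prompt wording still need to be decided.

```python
def build_query(instance):
    """Build the text query handed to the agent for one instance.

    `instance` is assumed to be a dict with hypothetical keys
    "data_dir" (where the data files live inside the container)
    and "question" (the discovery goal for this task).
    """
    lines = [
        f"Dataset files are available under: {instance['data_dir']}",
        f"Research question: {instance['question']}",
        "Analyze the data and state your final hypothesis.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    example = {"data_dir": "/workspace/data", "question": "Does X affect Y?"}
    print(build_query(example))
```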

### 4. Set up runtime
   - [ ] Create the runtime environment for experimentation.
   - [ ] Initialize the runtime by copying the necessary data files into the container.
   - [ ] Start OpenHands with the instance query and the data inside the container.

### 5. Run the evaluation workflow
   - [ ] Extract the results generated by the OpenHands agents.
   - [ ] Analyze the results, comparing generated hypotheses to gold-standard outputs.
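
The extraction and comparison steps could look roughly like the sketch below. It assumes the agent ends its output with a `Final Hypothesis:` marker (a hypothetical convention) and uses a simple token-overlap F1 as a stand-in score. This is not DiscoveryBench's own evaluation method; it is only a cheap placeholder for wiring up the pipeline before the benchmark's scorer is integrated.

```python
import re
from collections import Counter


def extract_hypothesis(agent_output, marker="Final Hypothesis:"):
    """Return the text after the last marker, or None if the marker is absent.

    The marker string is an assumed output convention, not an existing one.
    """
    idx = agent_output.rfind(marker)
    if idx == -1:
        return None
    return agent_output[idx + len(marker):].strip()


def token_f1(predicted, gold):
    """Token-overlap F1 between two strings (placeholder metric)."""
    pred_tokens = re.findall(r"\w+", predicted.lower())
    gold_tokens = re.findall(r"\w+", gold.lower())
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A lexical score like this cannot tell whether two differently worded hypotheses mean the same thing, so it should be replaced by the benchmark's own scoring once that is available in the harness.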

### 6. Compile final results into test result dictionary
   - [ ] Save all metrics and results into the `test_result` dictionary for final analysis.
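
One possible shape for the `test_result` dictionary is sketched below. The function name `compile_test_result` and every key are assumptions for illustration; the final schema should follow whatever the downstream analysis needs.

```python
def compile_test_result(generated_hypothesis, gold_hypothesis, metrics):
    """Bundle hypotheses and metrics for one instance into a single dict.

    All key names here are hypothetical placeholders.
    """
    return {
        "generated_hypothesis": generated_hypothesis,
        "gold_hypothesis": gold_hypothesis,
        # Flag runs where no hypothesis could be extracted at all
        "has_hypothesis": generated_hypothesis is not None,
        # Copy so later changes to the caller's dict do not leak in
        "metrics": dict(metrics),
    }
```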

### 7. Log and save evaluation outputs
   - [ ] Ensure all outputs are logged and stored for reporting.
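
For storing outputs, a minimal sketch is to append one JSON object per evaluated instance to a JSON Lines file. The helper names `append_result` and `load_results` are hypothetical, and the choice of JSON Lines is an assumption rather than a decision recorded in this issue.

```python
import json
from pathlib import Path


def append_result(output_file, record):
    """Append one result record as a single JSON line."""
    path = Path(output_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_results(output_file):
    """Read all records back, skipping blank lines."""
    with Path(output_file).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending line by line means an interrupted run keeps every result written so far, and a later run can skip the instances that are already present.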

### 8. Validate the integration
   - [ ] Perform end-to-end validation of DiscoveryBench within OpenHands to ensure correct functionality.
