[ROADMAP] DiscoveryBench Integration #2

@Ethan0456

🛰️ DiscoveryBench Integration

This issue tracks the integration of the DiscoveryBench benchmark into OpenHands. DiscoveryBench includes real-world and synthetic scientific discovery tasks that will help assess the agents' capabilities in multi-step, complex problem-solving. The benchmark aims to provide comprehensive insights into how well OpenHands agents handle data-driven scientific discovery workflows.

📋 Tasks

1. Clone and set up DiscoveryBench repository

  • Clone the DiscoveryBench Git repository and install dependencies.

2. Create dataset for evaluation

  • Create a custom function that creates a dataset from the cloned repository.
  • Prepare the dataset for evaluation.
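
A minimal sketch of the dataset-creation function from step 2. It assumes the cloned repository stores one JSON metadata file per task and that these files match `metadata_*.json`; the pattern, the function name `build_dataset`, and the record fields are illustrative assumptions, not the repository's confirmed layout.

```python
import json
from pathlib import Path


def build_dataset(repo_root: Path, pattern: str = "metadata_*.json") -> list[dict]:
    """Collect one record per metadata file found under repo_root.

    The file pattern and the record fields are assumptions for illustration;
    adjust them to the actual layout of the cloned repository.
    """
    records = []
    for meta_path in sorted(repo_root.rglob(pattern)):
        with meta_path.open(encoding="utf-8") as fh:
            metadata = json.load(fh)
        # Derive a stable identifier from the path relative to the repo root.
        rel = meta_path.relative_to(repo_root).with_suffix("")
        records.append(
            {
                "instance_id": rel.as_posix().replace("/", "__"),
                "data_dir": str(meta_path.parent),
                "metadata": metadata,
            }
        )
    return records
```

Sorting the paths keeps instance order deterministic across runs, which makes it easier to resume a partially completed evaluation.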

3. Generate evaluation metadata and process each instance

  • Create metadata using the make_metadata function, including dataset and task info.
  • Use the process_instance method to prepare evaluation queries for each dataset instance.
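
As a rough illustration of the query-preparation part of step 3, the helper below builds the text handed to the agent for one instance. The name `build_query`, the prompt wording, and the `/workspace` directory are hypothetical; this is not the signature of `process_instance` or `make_metadata`.

```python
def build_query(question: str, data_files: list[str], workspace_dir: str = "/workspace") -> str:
    """Format one instance's question and data files into an agent instruction.

    workspace_dir is an assumed location for the copied data files inside
    the container; the prompt wording is illustrative only.
    """
    file_lines = "\n".join(f"- {workspace_dir}/{name}" for name in data_files)
    return (
        "Answer the following question using the datasets listed below.\n"
        f"Question: {question}\n"
        f"Datasets:\n{file_lines}\n"
        "State your final hypothesis clearly."
    )
```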

4. Set up runtime

  • Create the runtime environment for experimentation.
  • Initialize the runtime by copying the necessary data files into the container.
  • Start OpenHands with the instance query and the data inside the container.

5. Run the evaluation workflow

  • Extract the results generated by the OpenHands agents.
  • Analyze the results, comparing generated hypotheses to gold-standard outputs.
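
To make the comparison in step 5 concrete, here is a simple token-overlap F1 between a generated hypothesis and a gold hypothesis. This is a stand-in chosen for illustration, not the benchmark's own scoring method; the function name `token_f1` is hypothetical.

```python
import re
from collections import Counter


def token_f1(generated: str, gold: str) -> float:
    """Token-level F1 between two hypotheses (illustrative stand-in metric)."""
    gen_tokens = re.findall(r"\w+", generated.lower())
    gold_tokens = re.findall(r"\w+", gold.lower())
    if not gen_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both hypotheses.
    overlap = sum((Counter(gen_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A lexical score like this only rewards shared wording, so two hypotheses that say the same thing in different words will score low; it is useful mainly as a quick sanity check on the pipeline.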

6. Compile final results into the test_result dictionary

  • Save all metrics and results into the test_result dictionary for final analysis.

7. Log and save evaluation outputs

  • Ensure all outputs are logged and stored for reporting.
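
Steps 6 and 7 could be sketched as follows: collect the per-instance results into a `test_result` dictionary and append each record to a JSON Lines file. The helper names and the field names inside `test_result` are assumptions for illustration.

```python
import json
from pathlib import Path


def make_test_result(instance_id: str, generated: str, gold: str, score: float) -> dict:
    """Bundle one instance's outputs and metrics (field names are hypothetical)."""
    return {
        "instance_id": instance_id,
        "test_result": {
            "generated_hypothesis": generated,
            "gold_hypothesis": gold,
            "score": score,
        },
    }


def append_jsonl(path: Path, record: dict) -> None:
    """Append one record as a single JSON line, creating parent dirs if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Appending one line per instance means a crashed run still leaves every completed instance on disk.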

8. Validate the integration

  • Perform end-to-end validation of DiscoveryBench within OpenHands to ensure correct functionality.

Labels: enhancement