🛰️ DiscoveryBench Integration
This issue tracks the integration of the DiscoveryBench benchmark into OpenHands. DiscoveryBench includes real-world and synthetic scientific discovery tasks that help assess agents' capabilities in multi-step, complex problem-solving. Integrating it should show how well OpenHands agents handle data-driven scientific discovery workflows.
📋 Tasks
1. Clone and set up DiscoveryBench repository
2. Create dataset for evaluation
3. Generate evaluation metadata and process each instance
   - make_metadata function, including dataset and task info.
   - process_instance method to prepare evaluation queries for each dataset instance.
4. Set up runtime
5. Run the evaluation workflow
6. Compile final results into test result dictionary
   - test_result dictionary for final analysis.
7. Log and save evaluation outputs
8. Validate the integration
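
A rough sketch of how tasks 3, 6 and 7 might fit together is below. It is only an illustration under stated assumptions: it does not import OpenHands, the function signatures, the `EvalMetadata` fields, the instance keys (`instance_id`, `dataset`, `query`, `gold_hypothesis`) and the `run_agent` callable are all hypothetical stand-ins, and the real helpers in the repository may look quite different.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class EvalMetadata:
    """Hypothetical container for dataset and task info (not the OpenHands class)."""
    dataset_name: str
    task_type: str
    agent_class: str
    max_iterations: int


def make_metadata(dataset_name: str, task_type: str, agent_class: str,
                  max_iterations: int = 10) -> EvalMetadata:
    # Assumed signature: bundles the run-level settings into one object.
    return EvalMetadata(dataset_name, task_type, agent_class, max_iterations)


def process_instance(instance: dict, metadata: EvalMetadata,
                     run_agent: Callable[[str], str]) -> dict:
    # Build the evaluation query for one dataset instance.
    query = (
        f"Dataset: {instance['dataset']}\n"
        f"Question: {instance['query']}\n"
        "State the hypothesis supported by the data."
    )
    # run_agent is a placeholder for whatever actually drives the agent.
    answer = run_agent(query)
    # Compile the per-instance output, including the test_result dictionary.
    return {
        "instance_id": instance["instance_id"],
        "metadata": asdict(metadata),
        "test_result": {
            "query": query,
            "generated_hypothesis": answer,
            "gold_hypothesis": instance.get("gold_hypothesis"),
        },
    }


def save_outputs(results: list, path: str) -> None:
    # Write one JSON object per line so runs can be appended and diffed easily.
    with open(path, "w", encoding="utf-8") as f:
        for result in results:
            f.write(json.dumps(result) + "\n")
```

Keeping `run_agent` as an injected callable means the query-building and result-compiling logic can be validated (task 8) with a stub before the runtime from task 4 is available.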