Closed
* fix failure reason
* update

* add live bench
* fix live bench and rollout processor

* Add AIME2025, GPQA, HealthBench evaluation_test suites; unify row-limiting via pytest flag; clean up examples
* evaluation with aggregated scores
* WIP: vibe coded as an mvp
* merge
* remove
* updated logger
* formatting
* formatting
* fixing tests

---------

Co-authored-by: benjibc <youfychenbc5000@gmail.com>

* e2e smoke test
* temp adding
* update
* test
* adjust bounds
* change back to regular schedule
* final

* convert rollout_input_params to completion_params
* fix
* DISABLE_EP_SQLITE_LOG
* fix kwargs access to "model"
* DRY completion params and make it a dict
* fix tests
* revert
* fix
* ensure logging
* fix smoke test params

* "Copy" button
* consolidate filter configurations
* filter button works
* extract tooltip into its own component
* vite build
* Refactor AddFilterButton layout for improved styling and structure
* vite build

* Finished Error Handling
* Address comments
* Changing the rollout processors
* cleaning up mcp gym
* remove import
* Update
* failing test
* fixing flaky test
* update comments

* livesvgbench + metadata fix
* bugs in retry processor

* works
* vite build / fix warnings
* don't show totals / fix warnings / vite build
* styling
* no black border / vite build

* BigQuery
* removing unneeded
…te CI; exclude vite dist; restore deleted files from main (bigquery adapter + vite src/readme) (#74)
name: Pull Request
about: Propose changes to the codebase
title: "Brief description of changes"
labels: ''
assignees: ''
Description
Please include a summary of the change and which issue is fixed or feature is implemented. Please also include relevant motivation and context. List any dependencies that are required for this change.
Fixes # (issue)
Implements # (issue)
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Test Configuration:
Checklist:
black ., isort ., flake8 .)
Screenshots (if applicable)
If applicable, add screenshots to help showcase your changes.
Additional context
Add any other context about the PR here.