[LFX Term 1 2026] Restoring Ianvs LLM-Agent setup and usage #407
Screencast.from.2026-04-23.13-42-27.webm
@MooreZheng sir, after making all these changes I was able to restore the LLM-Agent benchmark and run it successfully.
Code Review
This pull request significantly updates the Ianvs LLM-Agent benchmark by providing a comprehensive reproduction guide, adding a requirements file, and refactoring the core model and evaluation logic. Key changes include a rewritten predict method that correctly slices prompt tokens from the output and an updated ROUGE scoring implementation using the rouge_score library. Review feedback focuses on ensuring input tensors are moved to the correct device, removing redundant imports, adopting idiomatic boolean checks, and utilizing the internal calculate_mean function to prevent potential division-by-zero errors in metric calculations.
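The prompt-slicing idea mentioned in the review can be sketched as follows. This is only a minimal illustration that uses plain Python lists in place of tensors; the helper name `strip_prompt_tokens` is hypothetical and is not taken from the PR. It assumes that, as is typical for decoder-only generation, the output sequence begins with the prompt tokens, so dropping the first `len(prompt_ids)` entries leaves only the newly generated tokens.

```python
from typing import List


def strip_prompt_tokens(prompt_ids: List[int], output_ids: List[int]) -> List[int]:
    """Return only the tokens generated after the prompt.

    Hypothetical helper: assumes output_ids starts with prompt_ids.
    """
    prompt_len = len(prompt_ids)
    if output_ids[:prompt_len] != prompt_ids:
        # The output does not echo the prompt, so return it unchanged.
        return list(output_ids)
    return list(output_ids[prompt_len:])


# Made-up token IDs, for illustration only.
prompt = [101, 7592, 2088]
generated = [101, 7592, 2088, 2023, 2003, 102]
new_tokens = strip_prompt_tokens(prompt, generated)
print(new_tokens)
```

Without this slicing step, decoding the full output would repeat the prompt inside the prediction, which would likely distort text-overlap metrics such as ROUGE.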
What this PR does
Got the TL;DR result
Summary of changes
The token
Per-file walkthrough
Screencast.from.2026-05-11.23-01-05.webm
Working of this example.
Screencast.from.2026-05-20.20-49-54.webm
Made it work, with changes done in core/single_task_learning.py.
f1e7b5f to 5e7d72f
5e7d72f to 2bf92bd
2bf92bd to f9b3879
- Add requirements.txt for dependencies
- Refactor basemodel.py for improved readability and functionality
- Enhance rouge.py with RougeScorer and guard against empty score lists
- Update paths and configurations for LLM-Agent benchmark
- Update README for Ianvs LLM-Agent benchmark setup and usage

Signed-off-by: NishantSinghhhhh <nishantsingh_230137@aitpune.edu.in>
f9b3879 to 9336902
@MooreZheng sir, @hsj576 sir, done with the PR.
MooreZheng left a comment
Very close to the final version.
- For the ROUGE function, be careful about the #DIV/0! (division-by-zero) error
- Squash the pull request into one commit
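One way to guard the averaging step against the division-by-zero case raised above is sketched below. The function name `safe_mean` and its `default` parameter are assumptions made for this illustration; the benchmark's own helper may differ in name and signature.

```python
from typing import List


def safe_mean(values: List[float], default: float = 0.0) -> float:
    """Average a list of scores, returning `default` when the list is empty."""
    if not values:  # idiomatic emptiness check; avoids dividing by zero
        return default
    return sum(values) / len(values)


# Made-up per-sample scores, for illustration only.
rouge1_scores = [0.5, 0.25, 0.75]
print(safe_mean(rouge1_scores))
print(safe_mean([]))  # empty prediction set falls back to the default
```

Returning a default instead of raising keeps a benchmark run alive when no samples were scored, at the cost of hiding the empty case, so logging a warning in that branch may be worth considering.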
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hsj576, MooreZheng, NishantSinghhhhh

feat: add requirements.txt for dependencies
fix: refactor basemodel.py for improved readability and functionality
refactor: enhance rouge.py to utilize RougeScorer for metric calculations
What type of PR is this?
/kind feature
/kind cleanup
What this PR does / why we need it:
This PR fixes and refactors the llm-agent singletask learning benchmark to make it fully functional end-to-end. The original example code had several issues that prevented it from running: broken relative paths, a missing dataset, deprecated HuggingFace API arguments, a name collision with the Ianvs framework lifecycle hook, and a broken ROUGE metric script.
Changes included:
requirements.txt:
- Added a requirements.txt listing all dependencies needed to run the LLM-agent benchmark (torch, transformers, peft, datasets, evaluate, rouge_score), which were previously undocumented and missing from the environment.
basemodel.py:
- Replaced the deprecated use_auth_token= argument with token= to match the current HuggingFace transformers API
- Added the preprocess(self, **kwargs) lifecycle hook required by the Ianvs singletask learning framework
- Renamed preprocess() → _preprocess_sample() to avoid collision with the framework hook
- Changed the _preprocess_sample() signature to accept plain strings instead of a samples object
- Fixed _preprocess_sample() (removed erroneous [None] wrapper)
- Added a str() cast in the train() loop when iterating train_data.x / train_data.y to handle numpy.str_ types that caused tokenizer failures
rouge.py:
- Removed a stray EOF token at end of file (invalid Python causing NameError on import)
- Replaced evaluate.load() (which required a local metrics folder that did not exist) with direct rouge_score.rouge_scorer.RougeScorer calls
- Changed y_pred handling to use a str() cast instead of ["generated_text"] dict access, matching the plain-string output of basemodel.predict()
Which issue(s) this PR fixes:
Fixes #
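The basemodel.py changes described in this PR (a framework-facing preprocess hook, a renamed per-sample helper that takes plain strings, and a str() cast on training data) can be sketched roughly as below. This is not the PR's code: the class name `SketchModel`, the method `build_training_texts`, and the prompt template are all hypothetical and exist only to illustrate the structure.

```python
class SketchModel:
    """Hypothetical stand-in for the benchmark's model class."""

    def preprocess(self, **kwargs):
        # Lifecycle hook with the signature named in the PR description.
        # Assumption: a no-op is sufficient for this illustration.
        return None

    def _preprocess_sample(self, question: str, answer: str) -> str:
        # Per-sample helper, renamed so it no longer shadows the hook.
        # It takes plain strings; the prompt template here is invented.
        return f"Question: {question}\nAnswer: {answer}"

    def build_training_texts(self, xs, ys):
        # str() normalises string-like values (for example numpy.str_)
        # into built-in str before they would reach a tokenizer.
        return [self._preprocess_sample(str(x), str(y)) for x, y in zip(xs, ys)]


model = SketchModel()
texts = model.build_training_texts(["What is Ianvs?"], ["A benchmarking toolkit."])
print(texts[0])
```

Keeping the hook and the per-sample helper under different names means the framework can call preprocess() with keyword arguments without accidentally invoking the prompt-building logic.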