[lFX Term 1 2026 ] Restoring Ianvs LLM-Agent setup and usage #407

Merged
kubeedge-bot merged 1 commit into kubeedge:main from NishantSinghhhhh:restoration-llm-agent
May 21, 2026
Conversation

@NishantSinghhhhh
Contributor

feat: add requirements.txt for dependencies
fix: refactor basemodel.py for improved readability and functionality
refactor: enhance rouge.py to utilize RougeScorer for metric calculations

What type of PR is this?

/kind feature
/kind cleanup

What this PR does / why we need it:

This PR fixes and refactors the llm-agent singletask learning benchmark to make it fully functional end-to-end. The original example code had several issues that prevented it from running: broken relative paths, a missing dataset, deprecated HuggingFace API arguments, a name collision with the Ianvs framework lifecycle hook, and a broken ROUGE metric script.

Changes included:

  • requirements.txt: Added a requirements.txt listing all dependencies needed to run the LLM-agent benchmark (torch, transformers, peft, datasets, evaluate, rouge_score), which were previously undocumented and missing from the environment.

  • basemodel.py:

    • Replaced deprecated use_auth_token= argument with token= to match current HuggingFace transformers API
    • Added empty preprocess(self, **kwargs) lifecycle hook required by the Ianvs singletask learning framework
    • Renamed internal helper preprocess() → _preprocess_sample() to avoid collision with the framework hook
    • Updated _preprocess_sample() signature to accept plain strings instead of a samples object
    • Flattened the return value of _preprocess_sample() (removed erroneous [None] wrapper)
    • Added explicit str() cast in train() loop when iterating train_data.x / train_data.y to handle numpy.str_ types that caused tokenizer failures
  • rouge.py:

    • Removed bare EOF token at end of file (invalid Python causing NameError on import)
    • Replaced evaluate.load() (which required a local metrics folder that did not exist) with direct rouge_score.rouge_scorer.RougeScorer calls
    • Updated y_pred handling to use str() cast instead of ["generated_text"] dict access, matching the plain-string output of basemodel.predict()

Which issue(s) this PR fixes:

Fixes #

@kubeedge-bot kubeedge-bot added kind/feature Categorizes issue or PR as related to a new feature. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Apr 23, 2026
@kubeedge-bot kubeedge-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 23, 2026
@NishantSinghhhhh
Contributor Author

Screencast.from.2026-04-23.13-42-27.webm

@MooreZheng sir,

After making all these changes I was able to restore the LLM-Agent benchmark and run it successfully.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request significantly updates the Ianvs LLM-Agent benchmark by providing a comprehensive reproduction guide, adding a requirements file, and refactoring the core model and evaluation logic. Key changes include a rewritten predict method that correctly slices prompt tokens from the output and an updated ROUGE scoring implementation using the rouge_score library. Review feedback focuses on ensuring input tensors are moved to the correct device, removing redundant imports, adopting idiomatic boolean checks, and utilizing the internal calculate_mean function to prevent potential division-by-zero errors in metric calculations.

3 comment threads on examples/llm-agent/singletask_learning_bench/testalgorithms/basemodel.py (Outdated)
3 comment threads on examples/llm-agent/singletask_learning_bench/testenv/rouge.py
@NishantSinghhhhh NishantSinghhhhh changed the title from "[lFX Term 1 2026 ] Restoring Ianvs LLM-Agent Benchmark setup and usage" to "[lFX Term 1 2026 ] Restoring Ianvs LLM-Agent setup and usage" on Apr 23, 2026
@NishantSinghhhhh
Contributor Author

NishantSinghhhhh commented May 11, 2026

What this PR does

Got the llm-agent/singletask_learning_bench example actually running end-to-end. The version on main doesn't run — it dies on bad folder paths before it even gets to the model. Once I fixed the paths I hit a chain of follow-on issues (wrong dataset keys, wrong file format, weak LoRA, broken predict(), train/inference prompt mismatch). All of them are addressed below.

TL;DR result

| rank | algorithm | rouge1 | rouge2 | rougeL | paradigm           |
|  1   | LLM_agent |  10.0  |  0.0   |  10.0  | singletasklearning |

rouge2 = 0 is expected (one-word labels have no bigrams). The other two scores are essentially memorisation since train and test point at the same 10-sample file — but it confirms the pipeline works.
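Why rouge2 is 0 for one-word labels can be seen with a minimal n-gram overlap sketch. This is a simplified stand-in for the rouge_score package (whitespace tokens, no stemming), and `ngrams` / `ngram_f1` are names made up for illustration, not code from this PR.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of n-grams for a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_f1(reference, prediction, n):
    """Simplified ROUGE-N F-measure using whitespace tokens (illustrative only)."""
    ref = ngrams(reference.split(), n)
    pred = ngrams(prediction.split(), n)
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# A one-word label yields no bigrams, so the bigram score is 0 even for an exact match.
unigram_score = ngram_f1("walking", "walking", 1)
bigram_score = ngram_f1("walking", "walking", 2)
```

With a single token there is nothing to pair up, so the bigram multiset is empty and the overlap can never be positive.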

Summary of changes

| File | What changed | Why (one line) |
|---|---|---|
| config/config.json | path + token + dataset name | path didn't exist, HF token was leaked, dataset was wrong format |
| config/train_config.json | epochs 2 → 20 | 4 grad steps wasn't enough; loss never moved |
| singletask_learning_bench/benchmarkingjob.yaml | path fix | LLM-Agent-Benchmark dir doesn't exist |
| singletask_learning_bench/testalgorithms/test_algorithm.yaml | path fix | same as above |
| singletask_learning_bench/testenv/testenv.yaml | keys + format + path | train_url isn't valid; needed train_data + .jsonl |
| singletask_learning_bench/testalgorithms/basemodel.py | 7 small things | the real fix lives here; see below |
| singletask_learning_bench/testenv/rouge.py | local scorer instead of evaluate.load | the path it loaded from doesn't exist |
| singletask_learning_bench/requirements.txt | new file | none of the ML deps were listed anywhere |
| singletask_learning_bench/README.md | rewrite | old one referenced wrong folder + 404 dataset URL |

The token hf_fcEqmTAMIHUd… was sitting in config.json on main in plaintext. It's in git history regardless of whether this PR lands — worth revoking at https://huggingface.co/settings/tokens.


Per-file walkthrough

config/config.json — 3 lines
- "tokenizer_dir": "./examples/LLM-Agent-Benchmark/pretrains/Langboat/bloom-1b4-zh",
- "auth_token": "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",
- "data_dir" :"./examples/LLM-Agent-Benchmark/dataset/activity_classification.json",
+ "tokenizer_dir": "Langboat/bloom-1b4-zh",
+ "auth_token": null,
+ "data_dir" :"./examples/llm-agent/dataset/activity_classification.jsonl",
  • tokenizer_dir: was pointing at a local pretrains/ folder. That folder is empty on a fresh clone, so nothing loads. Using the bare HF hub id lets from_pretrained resolve through the normal HF cache — example just works.
  • auth_token: had a real token committed in plaintext (see heads-up above). Set to null since Langboat/bloom-1b4-zh is public.
  • data_dir: wrong folder name (LLM-Agent-Benchmark doesn't exist; the dir is llm-agent) and a .json extension, whereas JsonlDataParse needs .jsonl.
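One way to keep tokens out of config.json altogether is to resolve them from the environment at load time. The sketch below is not part of this PR: `resolve_auth_token` is a hypothetical helper, and it assumes the config is read as a plain dict. `HF_TOKEN` is the environment variable the Hugging Face hub client reads by default.

```python
import json
import os


def resolve_auth_token(config, env_var="HF_TOKEN"):
    """Prefer a token from the environment; fall back to the config value (may be None)."""
    return os.environ.get(env_var) or config.get("auth_token")


# Only the three keys touched in config.json, with the values from the diff above.
config = json.loads("""
{
  "tokenizer_dir": "Langboat/bloom-1b4-zh",
  "auth_token": null,
  "data_dir": "./examples/llm-agent/dataset/activity_classification.jsonl"
}
""")

# With the variable unset and auth_token null, the result is None,
# which is fine for a public model such as Langboat/bloom-1b4-zh.
token = resolve_auth_token(config, env_var="EXAMPLE_UNSET_TOKEN_VAR")
```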
config/train_config.json — 1 line
- "num_train_epochs":2,
+ "num_train_epochs":20,

10 samples, batch 5, 2 epochs = 4 gradient steps total. Loss stayed at ~7.8 (basically random for a causal LM). With 20 epochs the loss drops to ~1.4. On CPU it goes from 15 sec to ~3 min — fine for an example.
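The step count works out as follows. This is a back-of-the-envelope sketch that assumes no gradient accumulation and that a partial final batch is kept; `gradient_steps` is a hypothetical helper, not something in the repo.

```python
import math


def gradient_steps(num_samples, batch_size, epochs, grad_accum=1):
    """Optimizer steps for a run, assuming the last partial batch is kept."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return math.ceil(batches_per_epoch / grad_accum) * epochs


steps_before = gradient_steps(10, 5, 2)   # 2 batches per epoch * 2 epochs = 4
steps_after = gradient_steps(10, 5, 20)   # 2 batches per epoch * 20 epochs = 40
```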

singletask_learning_bench/benchmarkingjob.yaml — 2 lines
- testenv: "./examples/LLM-Agent-Benchmark/singletask_learning_bench/testenv/testenv.yaml"
+ testenv: "./examples/llm-agent/singletask_learning_bench/testenv/testenv.yaml"
- url: "./examples/LLM-Agent-Benchmark/singletask_learning_bench/testalgorithms/test_algorithm.yaml"
+ url: "./examples/llm-agent/singletask_learning_bench/testalgorithms/test_algorithm.yaml"

Folder on disk is examples/llm-agent. Running ianvs as-is dies with RuntimeError: not found testenv config file(...).
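A quick pre-flight check along these lines would have caught the bad folder name before running ianvs. It is only a sketch: `PATH_PATTERN` and `missing_paths` are hypothetical names, it scans raw YAML text with a regex instead of parsing it, and it assumes the `./` paths are relative to the directory you run from.

```python
import re
from pathlib import Path

# Matches quoted values that start with "./", as used in the YAML files above.
PATH_PATTERN = re.compile(r'["\'](\./[^"\']+)["\']')


def missing_paths(yaml_text, root="."):
    """Return referenced ./ paths that do not exist under root."""
    referenced = PATH_PATTERN.findall(yaml_text)
    return [p for p in referenced if not (Path(root) / p).exists()]


sample = '''
testenv: "./examples/llm-agent/singletask_learning_bench/testenv/testenv.yaml"
url: "./examples/LLM-Agent-Benchmark/singletask_learning_bench/testalgorithms/test_algorithm.yaml"
'''

# Run from the repository root, this lists every referenced path that is absent.
not_found = missing_paths(sample, root=".")
```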

singletask_learning_bench/testalgorithms/test_algorithm.yaml — 3 lines

Same LLM-Agent-Benchmark → llm-agent path fix on three lines (basemodel.py, config.json, train_config.json).

singletask_learning_bench/testenv/testenv.yaml — 5 lines
- train_url: "./examples/LLM-Agent-Benchmark/dataset/activity_classification.json"
- test_url:  "./examples/LLM-Agent-Benchmark/dataset/activity_classification.json"
+ train_data: "./examples/llm-agent/dataset/activity_classification.jsonl"
+ test_data:  "./examples/llm-agent/dataset/activity_classification.jsonl"

Two separate bugs:

  • train_url / test_url aren't valid keys. core/testenvmanager/dataset/dataset.py:165 only accepts train_index / train_data / train_data_info. The old keys match none of these, so the loader raises NotImplementedError: not one of train_index/train_data/train_data_info.
  • File extension .json → .jsonl. JsonlDataParse is selected by extension and expects one JSON object per line with question / answer keys (hardcoded in sedna).
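For reference, the expected file shape looks roughly like this. The two samples are invented for illustration (the real dataset contents are not shown in this PR), and `parse_jsonl` is a stand-in written for this sketch, not sedna's JsonlDataParse.

```python
import json

# Hypothetical two-sample dataset using the question / answer keys.
samples = [
    {"question": "The person moves their legs quickly on a track.", "answer": "running"},
    {"question": "The person rests on a chair.", "answer": "sitting"},
]

# JSONL: one JSON object per line, no enclosing array.
jsonl_text = "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)


def parse_jsonl(text):
    """Split JSONL text into parallel question/answer lists."""
    x, y = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        x.append(record["question"])
        y.append(record["answer"])
    return x, y


questions, answers = parse_jsonl(jsonl_text)
```

A regular .json file holding a single array would fail here on the first line, which is why the extension and the format have to change together.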

Three metric url: entries had the same path typo — fixed.

singletask_learning_bench/testalgorithms/basemodel.py — the real fix lives here

Seven small things bundled in one file. Going through them:

1. use_auth_token= → token=

- AutoModelForCausalLM.from_pretrained(..., use_auth_token=self.auth_token, ...)
+ AutoModelForCausalLM.from_pretrained(..., token=self.auth_token, ...)

use_auth_token is deprecated in recent transformers (warning now, error eventually).

2. Added an empty preprocess lifecycle hook

def preprocess(self, *args, **kwargs):
    pass

Ianvs' singletasklearning paradigm calls model.preprocess(...) as part of its lifecycle. Without this it crashes with AttributeError. No-op since actual preprocessing happens inside train.
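The failure mode is easy to reproduce in isolation. In the sketch below, `run_lifecycle` is a hypothetical stand-in for the paradigm (the exact call order inside Ianvs is an assumption), and both model classes are toys.

```python
class ModelWithoutHook:
    def train(self, train_data):
        return "trained"


class ModelWithHook(ModelWithoutHook):
    def preprocess(self, *args, **kwargs):
        # Intentionally a no-op: sample preparation happens inside train().
        pass


def run_lifecycle(model, train_data):
    """Hypothetical stand-in for a paradigm that calls preprocess() before train()."""
    model.preprocess()
    return model.train(train_data)


result = run_lifecycle(ModelWithHook(), train_data=[])

# The same call on a model without the hook fails before training starts.
try:
    run_lifecycle(ModelWithoutHook(), train_data=[])
    missing_hook_error = None
except AttributeError as exc:
    missing_hook_error = exc
```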

3. Removed the fake Sample class in train

- sample = type('Sample', (object,), {'x': x, 'y': y})()
- processed_sample = self.preprocess(sample, self.MAX_LENGTH, self.tokenizer)
+ processed_sample = self._preprocess_sample(str(x), str(y), self.MAX_LENGTH, self.tokenizer)

Old code built a throwaway class just so the helper could read samples.x / samples.y. Simpler to pass strings directly. Also renamed the helper to _preprocess_sample so it doesn't collide with the lifecycle preprocess from change #2.

4. Dropped the [[None], ...] wrapper around tensors

- "input_ids": [d["input_ids"][1] for d in preprocessed_data],
+ "input_ids": [d["input_ids"] for d in preprocessed_data],

The old helper returned {"input_ids": [[None], <actual_ids>], ...} and downstream had to read [1] to get past the [None]. That [None] served no purpose. Now returns the flat list.

5. Stronger LoRA

  config_lora=LoraConfig(task_type=TaskType.CAUSAL_LM,
+             r=16,
-             lora_alpha = 1,
-             lora_dropout = 0.0
+             lora_alpha = 32,
+             lora_dropout = 0.05
              )

lora_alpha=1 leaves the adapter contribution small, because LoRA scales its update by lora_alpha / r: training runs but the model barely changes. r=16, alpha=32, dropout=0.05 are values commonly used in LoRA tutorials.
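The effect of the two settings can be compared through the standard LoRA scaling factor, lora_alpha / r. The old config did not set r, so r=8 is assumed below as the peft default; `lora_scale` is a hypothetical helper for the arithmetic only.

```python
def lora_scale(lora_alpha, r):
    """Scaling factor applied to the low-rank update in standard LoRA: alpha / r."""
    return lora_alpha / r


# Old config: lora_alpha=1 with r left unset (r=8 assumed as the peft default).
old_scale = lora_scale(lora_alpha=1, r=8)     # 0.125
# New config from the diff above.
new_scale = lora_scale(lora_alpha=32, r=16)   # 2.0
```

Under that assumption the adapter update is weighted 16 times more heavily after the change.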

6. Rewrote predict()

- pipe=pipeline("text2text-generation",model=self.model,tokenizer=self.tokenizer)
- y_pred = pipe(data)
- return y_pred
+ results = []
+ for text in data:
+     prompt="\n".join(["user: ", str(text)])+"\n\nassistant: "
+     inputs=self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=self.MAX_LENGTH)
+     input_len=inputs["input_ids"].shape[1]
+     with torch.no_grad():
+         outputs=self.model.generate(**inputs, max_new_tokens=8, pad_token_id=self.tokenizer.eos_token_id)
+     new_tokens=outputs[0][input_len:]
+     decoded=self.tokenizer.decode(new_tokens, skip_special_tokens=True)
+     decoded=decoded.strip().split()[0] if decoded.strip() else decoded
+     results.append(decoded)
+ return results

Several issues bundled together:

  • pipeline("text2text-generation", ...) is the seq2seq pipeline; BLOOM is a causal LM. Wrong pipeline type. Also pipeline output doesn't strip the prompt, so ROUGE compared (prompt+answer) to (answer) → near zero.
  • New code slices outputs[0][input_len:] to keep only the newly generated tokens.
  • max_new_tokens=8 because labels are single words. With 32, the model fills the buffer with filler text that destroys precision.
  • .split()[0] takes just the first word.
  • The important one: the prompt wrapper "user: \n{text}\n\nassistant: ". Training data is formatted as "user: \n<prompt>\n\nassistant: <answer>". Without the same wrapper at inference, the trained model sees an unfamiliar format, emits <eos> immediately, and skip_special_tokens=True strips it to ''. This was the actual reason ROUGE was stuck at 0 even after every other fix.
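The prompt handling and post-processing above can be checked without loading a model. This sketch pulls the string logic out of predict() into small helpers; the helper names and the toy token ids are made up for illustration, and no tokenizer or generation call is involved.

```python
def build_prompt(text):
    """Inference prompt, matching the wrapper used in predict()."""
    return "\n".join(["user: ", str(text)]) + "\n\nassistant: "


def build_training_text(text, answer):
    """Training text as described above: the same prompt followed by the answer."""
    return build_prompt(text) + str(answer)


def strip_prompt_tokens(output_ids, input_len):
    """Keep only the tokens generated after the prompt."""
    return output_ids[input_len:]


def first_word(decoded):
    """Mirror the post-processing in predict(): first whitespace-separated word."""
    stripped = decoded.strip()
    return stripped.split()[0] if stripped else decoded


prompt = build_prompt("The person is moving fast")
# Toy token ids: the first three stand for the prompt, the rest for new tokens.
new_ids = strip_prompt_tokens([11, 12, 13, 21, 22], input_len=3)
label = first_word("  running quickly\n")
```

Because `build_training_text` is defined in terms of `build_prompt`, the training and inference formats cannot drift apart, which was the root cause of the empty predictions.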

7. Renamed preprocess → _preprocess_sample

Covered in #3.

singletask_learning_bench/testenv/rouge.py
- rouge=evaluate.load('./examples/LLM-Agent-Benchmark/evaluate/metrics/rouge')
- y_prednew=[]
- for i in range(len(y_pred)):
-     y_prednew.append(y_pred[i]["generated_text"])
- rou_score = rouge.compute(...)
+ from rouge_score import rouge_scorer
+ scorer=rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
+ y_prednew=[str(item) for item in y_pred]
+ scores=[scorer.score(ref, pred)['rouge1'].fmeasure for ref, pred in zip(y_true, y_prednew)]
+ return (sum(scores) / len(scores)) * 10
  • evaluate.load(...) was pointing at ./examples/LLM-Agent-Benchmark/evaluate/metrics/rouge, which doesn't exist in the repo. Using rouge_score.RougeScorer directly skips that and works offline.
  • y_pred[i]["generated_text"] was unwrapping the dict the old pipeline(...) returned. Since predict() now returns plain strings, no unwrapping needed.
  • Math is the same (fmeasure × 10) so existing leaderboards are comparable.
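The averaging line in the diff divides by len(scores), which fails on an empty prediction list; reviewers raised this and a later commit message mentions a guard against empty score lists. The merged code is not shown in this thread, so the following is only a sketch of one possible guard, with `mean_score` as a hypothetical name.

```python
def mean_score(scores, scale=10):
    """Scaled mean of per-sample scores; returns 0.0 for an empty list."""
    if not scores:
        return 0.0
    return (sum(scores) / len(scores)) * scale


empty_result = mean_score([])                      # no ZeroDivisionError
sample_result = mean_score([1.0, 0.0, 0.0, 0.0])   # one match out of four
```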
singletask_learning_bench/requirements.txt — new file
torch
transformers
peft
datasets
evaluate
rouge_score

None of these are in the top-level requirements.txt and you'd otherwise discover them via import errors. rouge_score is the only new dep introduced here.

singletask_learning_bench/README.md — rewrite

Old README referenced the wrong folder name, a 404 dataset URL, and didn't cover any of the prerequisites (model download, dep install, dataset format). New version is a step-by-step guide: env setup → dataset creation → config files → run command, plus a section on improving ROUGE since the default 10-sample setup gives near-zero by design.


How to verify

cd ianvs
ianvs -f examples/llm-agent/singletask_learning_bench/benchmarkingjob.yaml

Takes ~3 min on CPU. Output is the leaderboard at the top.

@NishantSinghhhhh
Contributor Author

Screencast.from.2026-05-11.23-01-05.webm

Demo of this example working.

@NishantSinghhhhh
Contributor Author

Screencast.from.2026-05-20.20-49-54.webm

Made it work, with changes done in core/single_task_learning.py

@NishantSinghhhhh NishantSinghhhhh force-pushed the restoration-llm-agent branch from f1e7b5f to 5e7d72f Compare May 21, 2026 09:31
@kubeedge-bot kubeedge-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 21, 2026
@NishantSinghhhhh NishantSinghhhhh force-pushed the restoration-llm-agent branch from 5e7d72f to 2bf92bd Compare May 21, 2026 09:37
@kubeedge-bot kubeedge-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 21, 2026
@NishantSinghhhhh NishantSinghhhhh force-pushed the restoration-llm-agent branch from 2bf92bd to f9b3879 Compare May 21, 2026 09:39
- Add requirements.txt for dependencies
- Refactor basemodel.py for improved readability and functionality
- Enhance rouge.py with RougeScorer and guard against empty score lists
- Update paths and configurations for LLM-Agent benchmark
- Update README for Ianvs LLM-Agent benchmark setup and usage

Signed-off-by: NishantSinghhhhh <nishantsingh_230137@aitpune.edu.in>
@NishantSinghhhhh NishantSinghhhhh force-pushed the restoration-llm-agent branch from f9b3879 to 9336902 Compare May 21, 2026 09:40
@NishantSinghhhhh
Contributor Author

image

Added the part which prevents errors in the ROUGE functions.

@NishantSinghhhhh
Contributor Author

@MooreZheng sir, @hsj576 sir done with the PR

Collaborator

@MooreZheng MooreZheng left a comment

Very close to the final version.

  1. For the rouge function, be careful about the #DIV/0! error
  2. Squash the pull request into one

Member

@hsj576 hsj576 left a comment

/lgtm

@kubeedge-bot kubeedge-bot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2026
Collaborator

@MooreZheng MooreZheng left a comment

/lgtm

Collaborator

@MooreZheng MooreZheng left a comment

/approve

@kubeedge-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hsj576, MooreZheng, NishantSinghhhhh

@kubeedge-bot kubeedge-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2026
@kubeedge-bot kubeedge-bot merged commit f1c4e7d into kubeedge:main May 21, 2026
12 of 13 checks passed