[lFX Term 1 2026 ] Restoring Ianvs LLM-Agent setup and usage by NishantSinghhhhh · Pull Request #407 · kubeedge/ianvs

NishantSinghhhhh · 2026-04-23T08:24:20Z

feat: add requirements.txt for dependencies
fix: refactor basemodel.py for improved readability and functionality
refactor: enhance rouge.py to utilize RougeScorer for metric calculations

What type of PR is this?

/kind feature
/kind cleanup

What this PR does / why we need it:

This PR fixes and refactors the llm-agent singletask learning benchmark to make it fully functional end-to-end. The original example code had several issues that prevented it from running: broken relative paths, a missing dataset, deprecated HuggingFace API arguments, a name collision with the Ianvs framework lifecycle hook, and a broken ROUGE metric script.

Changes included:

requirements.txt: Added a requirements.txt listing all dependencies needed to run the LLM-agent benchmark (torch, transformers, peft, datasets, evaluate, rouge_score), which were previously undocumented and missing from the environment.
basemodel.py:
- Replaced deprecated use_auth_token= argument with token= to match current HuggingFace transformers API
- Added empty preprocess(self, **kwargs) lifecycle hook required by the Ianvs singletask learning framework
- Renamed internal helper preprocess() → _preprocess_sample() to avoid collision with the framework hook
- Updated _preprocess_sample() signature to accept plain strings instead of a samples object
- Flattened the return value of _preprocess_sample() (removed erroneous [None] wrapper)
- Added explicit str() cast in train() loop when iterating train_data.x / train_data.y to handle numpy.str_ types that caused tokenizer failures
rouge.py:
- Removed bare EOF token at end of file (invalid Python causing NameError on import)
- Replaced evaluate.load() (which required a local metrics folder that did not exist) with direct rouge_score.rouge_scorer.RougeScorer calls
- Updated y_pred handling to use str() cast instead of ["generated_text"] dict access, matching the plain-string output of basemodel.predict()

Which issue(s) this PR fixes:

Fixes #

NishantSinghhhhh · 2026-04-23T08:25:51Z

Screencast.from.2026-04-23.13-42-27.webm

@MooreZheng sir,

After making all these changes, I was able to restore the LLM-Agent benchmark and run it successfully.

gemini-code-assist

Code Review

This pull request significantly updates the Ianvs LLM-Agent benchmark by providing a comprehensive reproduction guide, adding a requirements file, and refactoring the core model and evaluation logic. Key changes include a rewritten predict method that correctly slices prompt tokens from the output and an updated ROUGE scoring implementation using the rouge_score library. Review feedback focuses on ensuring input tensors are moved to the correct device, removing redundant imports, adopting idiomatic boolean checks, and utilizing the internal calculate_mean function to prevent potential division-by-zero errors in metric calculations.

NishantSinghhhhh · 2026-05-11T17:52:46Z

What this PR does

Got the llm-agent/singletask_learning_bench example actually running end-to-end. The version on main doesn't run — it dies on bad folder paths before it even gets to the model. Once I fixed the paths I hit a chain of follow-on issues (wrong dataset keys, wrong file format, weak LoRA, broken predict(), train/inference prompt mismatch). All of them are addressed below.

TL;DR result

| rank | algorithm | rouge1 | rouge2 | rougeL | paradigm |
| --- | --- | --- | --- | --- | --- |
| 1 | LLM_agent | 10.0 | 0.0 | 10.0 | singletasklearning |

rouge2 = 0 is expected (one-word labels have no bigrams). The other two scores are essentially memorisation since train and test point at the same 10-sample file — but it confirms the pipeline works.
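
To illustrate why a one-word label cannot score on ROUGE-2, here is a simplified n-gram overlap sketch. It is not the rouge_score implementation (no stemming, no tokenizer normalisation); `ngrams`, `ngram_f1`, and the label "walking" are illustrative assumptions.

```python
from collections import Counter


def ngrams(text, n):
    """Return a multiset of whitespace-token n-grams."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_f1(reference, prediction, n):
    """Simplified ROUGE-n style F-measure; 0.0 when either side has no n-grams."""
    ref, pred = ngrams(reference, n), ngrams(prediction, n)
    if not ref or not pred:
        return 0.0
    overlap = sum((ref & pred).values())
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A single-word label yields no bigrams, so the bigram score stays at 0
# even when the prediction matches the label exactly.
print(ngram_f1("walking", "walking", 1))
print(ngram_f1("walking", "walking", 2))
```

With one-word references, only the unigram score (and rougeL, which reduces to the same single-token match) can move.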

Summary of changes

| File | What changed | Why (one line) |
| --- | --- | --- |
| `config/config.json` | path + token + dataset name | path didn't exist, HF token was leaked, dataset was wrong format |
| `config/train_config.json` | epochs 2 → 20 | 4 grad steps wasn't enough; loss never moved |
| `singletask_learning_bench/benchmarkingjob.yaml` | path fix | `LLM-Agent-Benchmark` dir doesn't exist |
| `singletask_learning_bench/testalgorithms/test_algorithm.yaml` | path fix | same as above |
| `singletask_learning_bench/testenv/testenv.yaml` | keys + format + path | `train_url` isn't valid; needed `train_data` + `.jsonl` |
| `singletask_learning_bench/testalgorithms/basemodel.py` | 7 small things | the real fix lives here; see below |
| `singletask_learning_bench/testenv/rouge.py` | local scorer instead of `evaluate.load` | the path it loaded from doesn't exist |
| `singletask_learning_bench/requirements.txt` | new file | none of the ML deps were listed anywhere |
| `singletask_learning_bench/README.md` | rewrite | old one referenced wrong folder + 404 dataset URL |

The token hf_fcEqmTAMIHUd… was sitting in config.json on main in plaintext. It's in git history regardless of whether this PR lands — worth revoking at https://huggingface.co/settings/tokens.

Per-file walkthrough

config/config.json — 3 lines

- "tokenizer_dir": "./examples/LLM-Agent-Benchmark/pretrains/Langboat/bloom-1b4-zh",
- "auth_token": "hf_fcEqmTAMIHUd…",
- "data_dir" :"./examples/LLM-Agent-Benchmark/dataset/activity_classification.json",
+ "tokenizer_dir": "Langboat/bloom-1b4-zh",
+ "auth_token": null,
+ "data_dir" :"./examples/llm-agent/dataset/activity_classification.jsonl",

tokenizer_dir: was pointing at a local pretrains/ folder. That folder is empty on a fresh clone, so nothing loads. Using the bare HF hub id lets from_pretrained resolve through the normal HF cache — example just works.
auth_token: had a real token committed in plaintext (see heads-up above). Set to null since Langboat/bloom-1b4-zh is public.
data_dir: had the wrong folder name (LLM-Agent-Benchmark doesn't exist; the dir is llm-agent) and a .json extension, while JsonlDataParse needs .jsonl.
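
A minimal sketch of how a config like this could be loaded without committing a token. This is not code from the PR: `load_config` is a hypothetical helper, and reading the `HF_TOKEN` environment variable is an assumption that only matters for gated models.

```python
import json
import os


def load_config(text):
    """Parse config JSON and fill auth_token from the environment when it is null."""
    cfg = json.loads(text)
    if cfg.get("auth_token") is None:
        # Assumption: fall back to HF_TOKEN; stays None for public models.
        cfg["auth_token"] = os.environ.get("HF_TOKEN")
    if not cfg["data_dir"].endswith(".jsonl"):
        raise ValueError("data_dir should point at a .jsonl file")
    return cfg


example = """{
  "tokenizer_dir": "Langboat/bloom-1b4-zh",
  "auth_token": null,
  "data_dir": "./examples/llm-agent/dataset/activity_classification.jsonl"
}"""

cfg = load_config(example)
print(cfg["tokenizer_dir"])
```

Keeping the token out of the file means a leaked value never reaches git history in the first place.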

config/train_config.json — 1 line

- "num_train_epochs":2,
+ "num_train_epochs":20,

10 samples, batch 5, 2 epochs = 4 gradient steps total. Loss stayed at ~7.8 (basically random for a causal LM). With 20 epochs the loss drops to ~1.4. On CPU it goes from 15 sec to ~3 min — fine for an example.
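
The step count above is plain arithmetic. A small sketch, assuming one optimizer step per batch and no gradient accumulation (`gradient_steps` is an illustrative name):

```python
import math


def gradient_steps(num_samples, batch_size, epochs):
    """Optimizer steps, assuming one step per batch and no gradient accumulation."""
    return math.ceil(num_samples / batch_size) * epochs


print(gradient_steps(10, 5, 2))   # 4 steps with the old setting
print(gradient_steps(10, 5, 20))  # 40 steps with the new setting
```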

singletask_learning_bench/benchmarkingjob.yaml — 2 lines

- testenv: "./examples/LLM-Agent-Benchmark/singletask_learning_bench/testenv/testenv.yaml"
+ testenv: "./examples/llm-agent/singletask_learning_bench/testenv/testenv.yaml"
- url: "./examples/LLM-Agent-Benchmark/singletask_learning_bench/testalgorithms/test_algorithm.yaml"
+ url: "./examples/llm-agent/singletask_learning_bench/testalgorithms/test_algorithm.yaml"

Folder on disk is examples/llm-agent. Running ianvs as-is dies with RuntimeError: not found testenv config file(...).

singletask_learning_bench/testalgorithms/test_algorithm.yaml — 3 lines

Same LLM-Agent-Benchmark → llm-agent path fix on three lines (basemodel.py, config.json, train_config.json).

singletask_learning_bench/testenv/testenv.yaml — 5 lines

- train_url: "./examples/LLM-Agent-Benchmark/dataset/activity_classification.json"
- test_url:  "./examples/LLM-Agent-Benchmark/dataset/activity_classification.json"
+ train_data: "./examples/llm-agent/dataset/activity_classification.jsonl"
+ test_data:  "./examples/llm-agent/dataset/activity_classification.jsonl"

Two separate bugs:

train_url / test_url aren't valid keys. core/testenvmanager/dataset/dataset.py:165 only accepts train_index / train_data / train_data_info, so the old keys fall through and raise NotImplementedError: not one of train_index/train_data/train_data_info.
File extension .json → .jsonl. JsonlDataParse is selected by extension and expects one JSON object per line with question / answer keys (hardcoded in sedna).
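
For reference, a sketch of the .jsonl shape described above: one JSON object per line with question / answer keys. The two records are made up for illustration, and `to_jsonl` / `parse_jsonl` are hypothetical helpers, not the sedna parser.

```python
import json


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


def parse_jsonl(text):
    """Read question/answer pairs back, skipping blank lines."""
    questions, answers = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        questions.append(obj["question"])
        answers.append(obj["answer"])
    return questions, answers


# Hypothetical samples; the real dataset content is not shown in this PR.
records = [
    {"question": "The user is moving at 5 km/h on foot.", "answer": "walking"},
    {"question": "The user has been still for 8 hours at night.", "answer": "sleeping"},
]

text = to_jsonl(records)
x, y = parse_jsonl(text)
print(y)
```

A plain .json file holding a single top-level array would not fit this line-by-line shape.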

Three metric url: entries had the same path typo — fixed.

singletask_learning_bench/testalgorithms/basemodel.py — the real fix lives here

Seven small things bundled in one file. Going through them:

1. use_auth_token= → token=

- AutoModelForCausalLM.from_pretrained(..., use_auth_token=self.auth_token, ...)
+ AutoModelForCausalLM.from_pretrained(..., token=self.auth_token, ...)

use_auth_token is deprecated in recent transformers (warning now, error eventually).

2. Added an empty preprocess lifecycle hook

def preprocess(self, *args, **kwargs):
    pass

Ianvs' singletasklearning paradigm calls model.preprocess(...) as part of its lifecycle. Without this it crashes with AttributeError. No-op since actual preprocessing happens inside train.

3. Removed the fake Sample class in train

- sample = type('Sample', (object,), {'x': x, 'y': y})()
- processed_sample = self.preprocess(sample, self.MAX_LENGTH, self.tokenizer)
+ processed_sample = self._preprocess_sample(str(x), str(y), self.MAX_LENGTH, self.tokenizer)

Old code built a throwaway class just so the helper could read samples.x / samples.y. Simpler to pass strings directly. Also renamed the helper to _preprocess_sample so it doesn't collide with the lifecycle preprocess from change #2.
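
A simplified sketch of how the no-op hook and the renamed helper sit side by side. `BaseModelSketch` is hypothetical; unlike the real helper it skips the tokenizer and MAX_LENGTH and only builds the training text.

```python
class BaseModelSketch:
    def preprocess(self, *args, **kwargs):
        """No-op lifecycle hook; accepts whatever the framework passes in."""
        return None

    def _preprocess_sample(self, prompt, answer):
        """Internal helper: build the training text for one (prompt, answer) pair."""
        return "user: \n" + prompt + "\n\nassistant: " + answer

    def train(self, xs, ys):
        # str() guards against non-str string types such as numpy.str_.
        return [self._preprocess_sample(str(x), str(y)) for x, y in zip(xs, ys)]


model = BaseModelSketch()
model.preprocess(some_framework_argument=1)  # hook does nothing
print(model.train(["is the user running?"], ["running"]))
```

Because the two methods have different names, the framework's call to preprocess can no longer land on the sample-formatting helper.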

4. Dropped the [[None], ...] wrapper around tensors

- "input_ids": [d["input_ids"][1] for d in preprocessed_data],
+ "input_ids": [d["input_ids"] for d in preprocessed_data],

The old helper returned {"input_ids": [[None], <actual_ids>], ...} and downstream had to read [1] to get past the [None]. That [None] served no purpose. Now returns the flat list.

5. Stronger LoRA

  config_lora=LoraConfig(task_type=TaskType.CAUSAL_LM,
+             r=16,
-             lora_alpha = 1,
-             lora_dropout = 0.0
+             lora_alpha = 32,
+             lora_dropout = 0.05
              )

lora_alpha=1 scales the adapter update by alpha/r, which is well below 1 here, so training runs but the model barely changes. r=16, alpha=32, dropout=0.05 are values commonly used in LoRA tutorials.
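
A quick sketch of the scaling involved. LoRA multiplies the adapter update by lora_alpha / r. The old r is an assumption here: the diff only shows r=16 being added, so the old run presumably used PEFT's default of r=8.

```python
def lora_scaling(lora_alpha, r):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / r


old = lora_scaling(1, 8)    # assumed old setting: alpha=1 with default r=8
new = lora_scaling(32, 16)  # new setting from the diff

print(old)        # 0.125
print(new)        # 2.0
print(new / old)  # 16.0
```

Under that assumption the adapter's contribution is scaled 16 times more strongly after the change.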

6. Rewrote predict()

- pipe=pipeline("text2text-generation",model=self.model,tokenizer=self.tokenizer)
- y_pred = pipe(data)
- return y_pred
+ results = []
+ for text in data:
+     prompt="\n".join(["user: ", str(text)])+"\n\nassistant: "
+     inputs=self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=self.MAX_LENGTH)
+     input_len=inputs["input_ids"].shape[1]
+     with torch.no_grad():
+         outputs=self.model.generate(**inputs, max_new_tokens=8, pad_token_id=self.tokenizer.eos_token_id)
+     new_tokens=outputs[0][input_len:]
+     decoded=self.tokenizer.decode(new_tokens, skip_special_tokens=True)
+     decoded=decoded.strip().split()[0] if decoded.strip() else decoded
+     results.append(decoded)
+ return results

Several issues bundled together:

pipeline("text2text-generation", ...) is the seq2seq pipeline; BLOOM is a causal LM. Wrong pipeline type. Also pipeline output doesn't strip the prompt, so ROUGE compared (prompt+answer) to (answer) → near zero.
New code slices outputs[0][input_len:] to keep only the newly generated tokens.
max_new_tokens=8 because labels are single words. With 32, the model fills the buffer with filler text that destroys precision.
.split()[0] takes just the first word.
The important one: the prompt wrapper "user: \n{text}\n\nassistant: ". Training data is formatted as "user: \n<prompt>\n\nassistant: <answer>". Without the same wrapper at inference, the trained model sees an unfamiliar format, emits <eos> immediately, and skip_special_tokens=True strips it to ''. This was the actual reason ROUGE was stuck at 0 even after every other fix.
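
The three post-processing ideas above (shared prompt wrapper, slicing off the prompt tokens, keeping the first word) can be sketched without a model. The helper names and the toy token ids are illustrative assumptions; only the wrapper string comes from the diff.

```python
def build_prompt(text):
    """Inference prompt, same wrapper as in the rewritten predict()."""
    return "\n".join(["user: ", str(text)]) + "\n\nassistant: "


def build_training_text(text, answer):
    """Training text: the inference prompt followed by the answer."""
    return build_prompt(text) + str(answer)


def keep_new_tokens(output_ids, input_len):
    """Drop the echoed prompt tokens, keep only what was generated."""
    return output_ids[input_len:]


def first_word(decoded):
    """Keep the first whitespace-separated word, or the input if it is blank."""
    stripped = decoded.strip()
    return stripped.split()[0] if stripped else decoded


prompt = build_prompt("heart rate 150, speed 10 km/h")
# Training and inference share the same prefix, so the model sees a familiar format.
assert build_training_text("heart rate 150, speed 10 km/h", "running").startswith(prompt)

# Toy ids: the first three stand in for the prompt, the rest for the answer.
print(keep_new_tokens([11, 12, 13, 21, 22], input_len=3))
print(first_word(" running fast \n"))
```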

7. Renamed preprocess → _preprocess_sample

Covered in #3.

singletask_learning_bench/testenv/rouge.py

- rouge=evaluate.load('./examples/LLM-Agent-Benchmark/evaluate/metrics/rouge')
- y_prednew=[]
- for i in range(len(y_pred)):
-     y_prednew.append(y_pred[i]["generated_text"])
- rou_score = rouge.compute(...)
+ from rouge_score import rouge_scorer
+ scorer=rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
+ y_prednew=[str(item) for item in y_pred]
+ scores=[scorer.score(ref, pred)['rouge1'].fmeasure for ref, pred in zip(y_true, y_prednew)]
+ return (sum(scores) / len(scores)) * 10

evaluate.load(...) was pointing at ./examples/LLM-Agent-Benchmark/evaluate/metrics/rouge, which doesn't exist in the repo. Using rouge_score.RougeScorer directly skips that and works offline.
y_pred[i]["generated_text"] was unwrapping the dict the old pipeline(...) returned. Since predict() now returns plain strings, no unwrapping needed.
Math is the same (fmeasure × 10) so existing leaderboards are comparable.
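
The averaging line divides by len(scores), which fails on an empty list. A minimal guard is sketched below; `mean_fmeasure_times_ten` is a hypothetical name, and returning 0.0 for an empty input is an assumption, not necessarily what the merged code does.

```python
def mean_fmeasure_times_ten(scores):
    """Average F-measures and scale by 10; return 0.0 for an empty list."""
    if not scores:
        # Assumption: treat "no samples" as a zero score instead of raising.
        return 0.0
    return (sum(scores) / len(scores)) * 10


print(mean_fmeasure_times_ten([]))
print(mean_fmeasure_times_ten([1.0, 0.0]))
```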

singletask_learning_bench/requirements.txt — new file

torch
transformers
peft
datasets
evaluate
rouge_score

None of these are in the top-level requirements.txt and you'd otherwise discover them via import errors. rouge_score is the only new dep introduced here.

singletask_learning_bench/README.md — rewrite

Old README referenced the wrong folder name, a 404 dataset URL, and didn't cover any of the prerequisites (model download, dep install, dataset format). New version is a step-by-step guide: env setup → dataset creation → config files → run command, plus a section on improving ROUGE since the default 10-sample setup gives near-zero by design.

How to verify

cd ianvs
ianvs -f examples/llm-agent/singletask_learning_bench/benchmarkingjob.yaml

Takes ~3 min on CPU. Output is the leaderboard at the top.

NishantSinghhhhh · 2026-05-14T08:26:35Z

Screencast.from.2026-05-11.23-01-05.webm

Screencast showing this example working.

NishantSinghhhhh · 2026-05-20T15:30:15Z

Screencast.from.2026-05-20.20-49-54.webm

Made it work with changes in core/single_task_learning.py.

- Add requirements.txt for dependencies
- Refactor basemodel.py for improved readability and functionality
- Enhance rouge.py with RougeScorer and guard against empty score lists
- Update paths and configurations for LLM-Agent benchmark
- Update README for Ianvs LLM-Agent benchmark setup and usage

Signed-off-by: NishantSinghhhhh <nishantsingh_230137@aitpune.edu.in>

NishantSinghhhhh · 2026-05-21T09:41:33Z

Added the guard that prevents errors in the ROUGE functions.

NishantSinghhhhh · 2026-05-21T09:41:51Z

@MooreZheng sir, @hsj576 sir done with the PR

MooreZheng

Very close to the final version.

For the ROUGE function, be careful about the #DIV/0! error.
Squash the pull request into one commit.

hsj576

/lgtm

MooreZheng

/lgtm

MooreZheng

/approve

kubeedge-bot · 2026-05-21T09:57:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hsj576, MooreZheng, NishantSinghhhhh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [MooreZheng]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kubeedge-bot added kind/feature Categorizes issue or PR as related to a new feature. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Apr 23, 2026

kubeedge-bot requested review from MooreZheng and hsj576 April 23, 2026 08:24

kubeedge-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 23, 2026

gemini-code-assist Bot reviewed Apr 23, 2026

NishantSinghhhhh mentioned this pull request Apr 23, 2026

Comprehensive Example Restoration for KubeEdge Ianvs #230

Open

NishantSinghhhhh changed the title ~~[lFX Term 1 2026 ] Restoring Ianvs LLM-Agent Benchmark setup and usage~~ [lFX Term 1 2026 ] Restoring Ianvs LLM-Agent setup and usage Apr 23, 2026

NishantSinghhhhh force-pushed the restoration-llm-agent branch from f1e7b5f to 5e7d72f Compare May 21, 2026 09:31

kubeedge-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 21, 2026

NishantSinghhhhh force-pushed the restoration-llm-agent branch from 5e7d72f to 2bf92bd Compare May 21, 2026 09:37

kubeedge-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 21, 2026

NishantSinghhhhh force-pushed the restoration-llm-agent branch from 2bf92bd to f9b3879 Compare May 21, 2026 09:39

NishantSinghhhhh force-pushed the restoration-llm-agent branch from f9b3879 to 9336902 Compare May 21, 2026 09:40

MooreZheng requested changes May 21, 2026

hsj576 approved these changes May 21, 2026

kubeedge-bot assigned hsj576 May 21, 2026

kubeedge-bot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2026

MooreZheng reviewed May 21, 2026

kubeedge-bot assigned MooreZheng May 21, 2026

MooreZheng approved these changes May 21, 2026

kubeedge-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2026

kubeedge-bot merged commit f1c4e7d into kubeedge:main May 21, 2026
12 of 13 checks passed
