
Migrate default model to Qwen3.5-9B#42

Open
kfallah wants to merge 2 commits into main from feat/qwen3.5-9b-migration

Conversation

kfallah (Owner) commented Mar 7, 2026

Summary

  • Migrate the entire CLaaS stack from Qwen3-8B to Qwen/Qwen3.5-9B (hybrid GDN + full attention architecture)
  • Fix LoRA initialization for hybrid models: per-layer awareness for different attention types, correct q_proj dimensions (doubled for output gate)
  • Fix coerce_template_ids to handle BatchEncoding (Mapping subclass, not plain dict)
  • Bump transformers>=5.0.0 and huggingface_hub>=1.3.0 for qwen3_5 model type support
  • Use dedicated vllm/vllm-openai:qwen3_5 Docker image with --enforce-eager (CUDA graph capture bug in GDN causal conv1d layer)

Changes across 26 files

Core config (6 files): Update default model ID in all configs, types, and defaults
Docker (5 files): New vLLM image tag, tool call parser (qwen3_coder), --enforce-eager, init container with --extra local + CPU torch
Training (1 file): create_initial_lora now reads layer_types and attn_output_gate from model config to create correctly-shaped LoRA weights per layer type
Inference (1 file): coerce_template_ids handles BatchEncoding via __getitem__ + "input_ids" in result instead of isinstance(result, dict)
Tests (5 files): Update model references
Docs (4 files): Update README, docker README, setup skills
Deps (2 files): transformers 5.x, huggingface_hub 1.3+, remove teacher extra (vllm conflicts with transformers 5.x)
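The BatchEncoding change can be sketched as follows. The helper name matches the PR description, but the body is an illustrative reconstruction under the stated assumption (BatchEncoding is a Mapping subclass, not a plain dict), not the actual implementation:

```python
def coerce_template_ids(result):
    # transformers' BatchEncoding subclasses Mapping, not dict, so an
    # isinstance(result, dict) check misses it. Duck-typing on __getitem__
    # plus "input_ids" key membership covers plain dicts and BatchEncoding
    # alike, while raw token-id lists fall through to the final branch.
    if hasattr(result, "__getitem__") and "input_ids" in result:
        return list(result["input_ids"])
    return list(result)
```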

Test plan

  • uv run ruff check passes
  • uv run pytest tests/ -m "not integration" — 114 passed
  • Full Docker stack tested end-to-end: vLLM → CLaaS API → OpenClaw with LoRA adapter
  • CI lint-and-test job

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Enhanced LoRA initialization with support for multimodal and complex model architectures
    • Improved input handling for inference operations
  • Updates

    • Default base model upgraded from Qwen3-8B to Qwen3.5-9B across all deployments
    • Tool Call Parser configuration updated
    • Dependencies updated: transformers (≥5.0.0) and huggingface_hub (≥1.3.0)
    • Docker configurations modernized for improved compatibility

Qwen3.5-9B is a hybrid architecture (Gated Delta Networks + full attention)
that requires several adaptations:

- vLLM: use dedicated qwen3_5 Docker image, qwen3_coder tool call parser,
  --enforce-eager (CUDA graph capture bug in GDN causal conv1d layer)
- LoRA init: handle per-layer architecture differences — full_attention
  layers have q_proj doubled for output gate (8192 vs 4096), linear_attention
  (GDN) layers lack q/k/v/o_proj entirely
- Dependencies: bump transformers>=5.0.0 and huggingface_hub>=1.3.0 for
  qwen3_5 model type support
- Init container: install --extra local with CPU torch for LoRA weight creation
- coerce_template_ids: handle BatchEncoding (Mapping subclass, not dict)

Tested end-to-end: vLLM serves model, CLaaS API proxies with LoRA,
OpenClaw routes through successfully.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
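The per-layer LoRA shape logic above can be illustrated with a small sketch. The function name and concrete dimensions here are assumptions drawn from the description (hidden size 4096, q_proj output doubled to 8192 when attn_output_gate is set); the real create_initial_lora reads these values from the model config:

```python
def lora_attn_shapes(layer_types, hidden_size=4096, rank=8, attn_output_gate=True):
    """Hypothetical sketch: per-layer LoRA A/B shapes for q_proj in a hybrid model."""
    shapes = {}
    for idx, layer_type in enumerate(layer_types):
        # linear_attention (GDN) layers have no q/k/v/o_proj, so no LoRA weights
        if layer_type != "full_attention":
            continue
        # the output gate doubles q_proj's output dimension (8192 vs 4096)
        q_out = hidden_size * (2 if attn_output_gate else 1)
        shapes[f"layers.{idx}.self_attn.q_proj.lora_A"] = (rank, hidden_size)
        shapes[f"layers.{idx}.self_attn.q_proj.lora_B"] = (q_out, rank)
    return shapes
```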
coderabbitai bot commented Mar 7, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4079414f-f066-4747-9fbf-4bd8c2a407b4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This pull request upgrades the default model from Qwen3-8B to Qwen3.5-9B across the codebase, updates Docker vLLM configuration and dependencies, enhances LoRA training for multimodal models, and adds observability improvements.

Changes

Model Version Upgrade (Qwen3-8B → Qwen3.5-9B)
Files: claas/core/config.py, claas/core/configs/local.yaml, claas/core/configs/modal.yaml, claas/core/configs/tinker.yaml, claas/core/types.py, claas/eval/types.py, claas/modal/worker.py, docker/.env.local.example, docker/scripts/init-stack.py, docker/scripts/start_vllm.sh
Summary: Updated the default base model identifier and allowed model lists across all configuration files, default value assignments, and environment fallbacks to reference the newer Qwen3.5-9B model variant.

Documentation & Setup Files
Files: .claude/skills/setup-local/SKILL.md, .claude/skills/setup-modal/SKILL.md, README.md, docker/README.md
Summary: Updated references to the base model from Qwen3-8B to Qwen3.5-9B in documentation, setup guides, and quick-start instructions, including vLLM startup command updates and environment variable documentation.

Docker Configuration
Files: docker/docker-compose.yml, docker/Dockerfile.init
Summary: Updated the vLLM service image tag to qwen3_5, changed the startup script reference, added the --enforce-eager flag, and updated served model names and the tool call parser. Modified the Dockerfile to include local dependencies and a CPU-only Torch installation.

LoRA Training Enhancement
Files: claas/training/storage.py
Summary: Extends base model dimension inference to support multimodal/text-config nesting, adds layer-type-aware attention handling with multipliers, introduces a separate dimension mapping for attention vs MLP modules (gate_proj, up_proj, down_proj), and enforces supported-module validation.

Dependency Updates
Files: pyproject.toml
Summary: Updated huggingface_hub from exact pin 0.36.2 to >=1.3.0, transformers from exact pin 4.57.6 to >=5.0.0, and removed the teacher optional-dependency group.

Inference & Observability
Files: claas/inference/helpers.py, plugins/claas-feedback/index.ts
Summary: Enhanced coerce_template_ids to support dict-like objects with __getitem__ (e.g., BatchEncoding). Added a debug logging block in the feedback plugin's agent_end handler to inspect and log assistant message content.

Test Updates
Files: tests/integration/test_local_engine_integration.py, tests/test_api.py, tests/test_config.py, tests/test_env_fallbacks.py, tests/test_local_training_engine.py
Summary: Updated test fixtures, environment setup, and assertions to reflect the new default base model identifier Qwen3.5-9B and corresponding allowed model lists.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Qwen hops from eight to point-five-nine,
With eager flags and deps divine!
LoRA learns to handle layers deep,
While Docker scripts make changes sweep. 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage (⚠️ Warning): docstring coverage is 35.29%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
Title Check (✅ Passed): the PR title clearly and concisely summarizes the main objective, migrating the default model from Qwen3-8B to Qwen3.5-9B, which is the primary change across all 26 modified files.



chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 15ef5450ab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

```typescript
if (lastAssistant) {
  const raw = JSON.stringify((lastAssistant as Record<string, unknown>).content);
  console.log("[claas-feedback] content type:", typeof (lastAssistant as Record<string, unknown>).content, Array.isArray((lastAssistant as Record<string, unknown>).content) ? "(array)" : "");
  console.log("[claas-feedback] preview:", raw.slice(0, 500));
```

P1 Badge Guard agent_end debug parsing when content is missing

If an assistant message arrives without a content field, JSON.stringify(...) yields undefined, so raw.slice(...) (and raw.includes(...)) throws and aborts the agent_end hook before contextStore.set(...) runs. In that case the feedback command loses the just-finished conversation context, so this path should skip debug parsing when raw is not a string.
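A minimal sketch of the suggested guard; previewContent is a hypothetical helper for illustration, not code from the plugin:

```typescript
// JSON.stringify(undefined) returns undefined rather than a string, so any
// slice()/includes() call on the result would throw and abort the hook.
// Guarding on the result's type keeps the agent_end path alive.
function previewContent(content: unknown): string | null {
  const raw = JSON.stringify(content);
  if (typeof raw !== "string") {
    return null; // skip debug parsing when content is missing
  }
  return raw.slice(0, 500);
}
```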


Comment on lines +108 to +110

```typescript
console.log("[claas-feedback] content type:", typeof (lastAssistant as Record<string, unknown>).content, Array.isArray((lastAssistant as Record<string, unknown>).content) ? "(array)" : "");
console.log("[claas-feedback] preview:", raw.slice(0, 500));
console.log("[claas-feedback] has thinking:", raw.includes("think") || raw.includes("thinking"));
```

P2 Badge Gate assistant payload logging behind debug mode

These console.log statements always run, even when CLAAS_FEEDBACK_DEBUG is false, so every assistant turn writes raw content previews into service logs. That introduces unnecessary exposure of user/model text in production and increases log noise/cost; this should be routed through the existing logDebug gate (or removed after investigation).



coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
claas/training/storage.py (1)

486-493: ⚠️ Potential issue | 🟠 Major

Breaking change: custom target_modules now strictly validated.

The new validation rejects any module name not in dim_map. Per claas/core/types.py:259-285, LoraInitRequest.target_modules accepts arbitrary strings. Callers or scripts using custom module names (e.g., "embed_tokens", "lm_head") will now receive a ValueError.

If custom modules should be allowed, consider skipping unknown modules with a warning instead of raising. If strict validation is intentional, document this breaking change.

♻️ Alternative: skip unknown modules with warning
```diff
-    unsupported_modules = sorted(set(target_modules) - set(dim_map))
-    if unsupported_modules:
-        raise ValueError(
-            "Unsupported target_modules: "
-            + ", ".join(unsupported_modules)
-            + ". Supported modules: "
-            + ", ".join(sorted(dim_map))
-        )
+    supported_modules = [m for m in target_modules if m in dim_map]
+    unsupported_modules = sorted(set(target_modules) - set(dim_map))
+    if unsupported_modules:
+        import warnings
+        warnings.warn(
+            f"Skipping unsupported target_modules: {', '.join(unsupported_modules)}. "
+            f"Supported: {', '.join(sorted(dim_map))}"
+        )
+    target_modules = supported_modules
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/training/storage.py` around lines 486 - 493, The current strict
validation builds unsupported_modules from target_modules minus dim_map and
raises a ValueError; change this to be non-breaking by filtering out unknown
modules and emitting a warning instead of raising: compute the allowed set as
the intersection of target_modules and dim_map, if any unsupported_modules
remain call warnings.warn or the module logger with a clear message listing the
skipped names, and proceed using the filtered list (replace use of
unsupported_modules and the raise in the block where unsupported_modules is
defined). Ensure behavior of downstream code that expects target_modules now
uses the filtered/validated list.
claas/modal/worker.py (1)

31-31: ⚠️ Potential issue | 🔴 Critical

Update Modal worker transformers pin to support Qwen3.5: currently pinned to <5.0.0 but Qwen3.5 requires transformers 5.x or later.

Qwen3.5 support was added only to Transformers 5.x (as of February 2026) and cannot work with transformers 4.x out of the box. The Modal worker's transformers>=4.40.0,<5.0.0 constraint is incompatible with Qwen3.5, which the PR aims to support. Update the constraint to transformers>=5.0.0 to resolve this conflict.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/modal/worker.py` at line 31, Replace the pinned dependency string
"transformers>=4.40.0,<5.0.0" with "transformers>=5.0.0" in the Modal worker
requirement list so the worker can load Qwen3.5; update the requirement literal
wherever "transformers>=4.40.0,<5.0.0" appears (the dependency entry in the
worker's requirements list) to the new "transformers>=5.0.0" spec.
🧹 Nitpick comments (1)
claas/training/storage.py (1)

506-509: Test coverage gap: hybrid model logic is untested.

Per tests/test_storage.py:206-217, the mock config has no layer_types field, so the new hybrid-model handling (skipping attention modules for non-full-attention layers, q_proj doubling for output gate) is not exercised by tests. A miscalculation in tensor shapes or key naming would not be detected.

Consider adding a test case with a mock config that includes layer_types and attn_output_gate.

Would you like me to generate a test case for hybrid model LoRA initialization?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/training/storage.py` around lines 506 - 509, Add a unit test in
tests/test_storage.py that provides a mock model config containing layer_types
(with a mix of "full_attention" and non-full types) and attn_output_gate
enabled, then run the LoRA initialization code paths that reference
attn_modules, mod_name and layer_type in claas/training/storage.py to exercise
the branch that skips attention modules for non-full-attention layers and the
q_proj doubling behavior for output gate; assert that attention modules in
non-full-attention layers are not modified/registered, that q_proj-related
parameter keys are created with the expected doubled shapes/naming for gated
attention (check names like q_proj and any gate-specific suffixes), and that no
shape/key mismatches occur.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b7e10d90-7147-42d5-aff5-07eed1431745

📥 Commits

Reviewing files that changed from the base of the PR and between 838bf90 and 15ef545.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (25)
  • .claude/skills/setup-local/SKILL.md
  • .claude/skills/setup-modal/SKILL.md
  • README.md
  • claas/core/config.py
  • claas/core/configs/local.yaml
  • claas/core/configs/modal.yaml
  • claas/core/configs/tinker.yaml
  • claas/core/types.py
  • claas/eval/types.py
  • claas/inference/helpers.py
  • claas/modal/worker.py
  • claas/training/storage.py
  • docker/.env.local.example
  • docker/Dockerfile.init
  • docker/README.md
  • docker/docker-compose.yml
  • docker/scripts/init-stack.py
  • docker/scripts/start_vllm.sh
  • plugins/claas-feedback/index.ts
  • pyproject.toml
  • tests/integration/test_local_engine_integration.py
  • tests/test_api.py
  • tests/test_config.py
  • tests/test_env_fallbacks.py
  • tests/test_local_training_engine.py

```python
    # while allowing gradients to propagate through A.
    tensors: dict[str, torch.Tensor] = {}
    for layer_idx in range(num_layers):
        layer_type = layer_types[layer_idx] if layer_types else "full_attention"
```

⚠️ Potential issue | 🟡 Minor

Potential IndexError if layer_types length doesn't match num_layers.

If a model's config has a layer_types list with a different length than num_hidden_layers, this line will raise an IndexError. Consider adding a length validation or using get with a fallback.

🛡️ Proposed defensive check
```diff
     tensors: dict[str, torch.Tensor] = {}
+    if layer_types and len(layer_types) != num_layers:
+        raise ValueError(
+            f"layer_types length ({len(layer_types)}) != num_hidden_layers ({num_layers})"
+        )
     for layer_idx in range(num_layers):
         layer_type = layer_types[layer_idx] if layer_types else "full_attention"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/training/storage.py` at line 504, The current selection layer_type =
layer_types[layer_idx] if layer_types else "full_attention" can raise IndexError
when layer_types exists but its length doesn't cover layer_idx; update the logic
in the function where layer_types and layer_idx are used (referencing
layer_types, layer_idx, num_layers/num_hidden_layers) to defensively check
length—e.g. if layer_types and layer_idx < len(layer_types) then use
layer_types[layer_idx], else fall back to "full_attention" (or validate and
raise a clear error if mismatched lengths are unacceptable); you can also add an
explicit validation earlier that compares len(layer_types) to num_hidden_layers
and logs or raises a descriptive exception.

Comment on lines +6 to +13
```diff
 MODEL="${MODEL:-Qwen/Qwen3.5-9B}"
 HOST="${HOST:-127.0.0.1}"
 PORT="${PORT:-8000}"
 API_KEY="${API_KEY:-sk-local}"
-SERVED_MODEL_NAMES="${SERVED_MODEL_NAMES:-qwen3-8b}"
+SERVED_MODEL_NAMES="${SERVED_MODEL_NAMES:-qwen3.5-9b}"
 ENABLE_SLEEP_MODE="${ENABLE_SLEEP_MODE:-1}"
 ENABLE_AUTO_TOOL_CHOICE="${ENABLE_AUTO_TOOL_CHOICE:-1}"
-TOOL_CALL_PARSER="${TOOL_CALL_PARSER:-qwen3_xml}"
+TOOL_CALL_PARSER="${TOOL_CALL_PARSER:-qwen3_coder}"
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

vLLM qwen3_coder tool call parser

💡 Result:

In vLLM, the qwen3_coder tool-call parser is the built-in parser that extracts Qwen3-Coder’s XML-style tool calls from the model’s output and converts them into the OpenAI-compatible tool_calls structure. The implementation is Qwen3CoderToolParser in vllm/tool_parsers/qwen3coder_tool_parser.py. [2]

How to enable it (vLLM OpenAI server)

Use both flags:

  • --enable-auto-tool-choice
  • --tool-call-parser qwen3_coder [1]

Example:

```shell
vllm serve Qwen/Qwen3-Coder-... \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```

Custom / alternative parsers

vLLM also supports registering your own parser via --tool-parser-plugin, then selecting it by name with --tool-call-parser. [1]

Security note (important)

If you use --tool-call-parser qwen3_coder, upgrade vLLM to >= 0.10.1.1: the Qwen3-Coder tool parser had an RCE vulnerability (use of eval() during parameter parsing) affecting >=0.10.0 and <0.10.1.1, patched in 0.10.1.1. [3]

Sources: [1] [2] [3]


🏁 Script executed:

```shell
# Search for vLLM version specification in the repository
find . -type f \( -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" -o -name "setup.cfg" -o -name "Dockerfile" -o -name "docker-compose*.yml" \) | head -20 | xargs grep -l "vllm\|Qwen" 2>/dev/null
```

Repository: kfallah/CLaaS

Length of output: 84


🏁 Script executed:

```shell
# Look for any other references to Qwen3.5 or tool parser configuration in the repo
rg "Qwen3\.?5|tool.?parser|qwen3" --type sh --type py --type yaml --max-count 20
```

Repository: kfallah/CLaaS

Length of output: 2620


🏁 Script executed:

```shell
# Check the start_vllm.sh file to understand the context better
cat -n docker/scripts/start_vllm.sh
```

Repository: kfallah/CLaaS

Length of output: 4357


🏁 Script executed:

```shell
# Check docker-compose.yml for vLLM version information
cat -n docker/docker-compose.yml | head -30
```

Repository: kfallah/CLaaS

Length of output: 1191


🏁 Script executed:

```shell
# Search for Qwen-Coder model references to see if there's a coder variant
rg "Coder|coder" --type sh --type py --type yaml -B2 -A2
```

Repository: kfallah/CLaaS

Length of output: 2326


🏁 Script executed:

```shell
# Check requirements or dependencies for vLLM version
rg "vllm" docker/scripts/ claas/ --type py --type txt --type yaml
```

Repository: kfallah/CLaaS

Length of output: 5589


Verify tool parser compatibility and vLLM version for security.

The model default is correctly set to Qwen3.5-9B, but qwen3_coder is designed specifically for Qwen3-Coder models, not the base Qwen3.5-9B. Confirm whether this model variant supports tool calling with the qwen3_coder parser or if a different parser should be used.

Additionally, the vLLM version should be explicitly pinned in requirements. The qwen3_coder parser had an RCE vulnerability (use of eval() in parameter parsing) affecting vLLM >= 0.10.0 and < 0.10.1.1—ensure the deployment uses >= 0.10.1.1 or later.
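One way to make the parser conditional on the model, as the comment suggests. The helper name and the qwen3_xml fallback are illustrative assumptions, not the repository's actual values:

```shell
# Hypothetical sketch: derive the tool-call parser from the model name instead
# of hard-coding qwen3_coder for every model. Parser names are vLLM built-ins.
pick_parser() {
  case "$1" in
    *Coder*) echo "qwen3_coder" ;;
    *) echo "qwen3_xml" ;;
  esac
}

MODEL="${MODEL:-Qwen/Qwen3.5-9B}"
TOOL_CALL_PARSER="${TOOL_CALL_PARSER:-$(pick_parser "$MODEL")}"
```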

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/scripts/start_vllm.sh` around lines 6 - 13, The TOOL_CALL_PARSER
default (TOOL_CALL_PARSER) is set to qwen3_coder while MODEL defaults to
Qwen/Qwen3.5-9B—verify and change the parser to one compatible with the base
Qwen3.5-9B (or make TOOL_CALL_PARSER conditional on MODEL) so you aren't forcing
the Qwen3-Coder parser onto a non-coder model; also explicitly pin the vLLM
dependency in your requirements (or equivalent install manifest) to >=0.10.1.1
to avoid the known RCE in vLLM 0.10.0–0.10.1.1, and add a brief comment near the
TOOL_CALL_PARSER and MODEL declarations documenting the compatibility
requirement and the security-pinned vLLM version.

Comment on lines +104 to +111
```typescript
// Debug: inspect assistant message content shape for proxy-removal investigation
const lastAssistant = messages.slice().reverse().find((m: Record<string, unknown>) => m.role === "assistant");
if (lastAssistant) {
  const raw = JSON.stringify((lastAssistant as Record<string, unknown>).content);
  console.log("[claas-feedback] content type:", typeof (lastAssistant as Record<string, unknown>).content, Array.isArray((lastAssistant as Record<string, unknown>).content) ? "(array)" : "");
  console.log("[claas-feedback] preview:", raw.slice(0, 500));
  console.log("[claas-feedback] has thinking:", raw.includes("think") || raw.includes("thinking"));
}
```

⚠️ Potential issue | 🟡 Minor

Unconditional console.log bypasses debugEnabled flag.

This debug block uses console.log directly while the rest of the file uses logDebug (lines 78, 113, 154) to respect the debugEnabled configuration. This will emit logs in production regardless of the debug setting.

Since the comment indicates this is for "proxy-removal investigation", consider either:

  1. Removing this temporary debug code before merging, or
  2. Gating it behind the existing debugEnabled flag using logDebug.
🛠️ Option 2: Gate behind debugEnabled
```diff
     // Debug: inspect assistant message content shape for proxy-removal investigation
     const lastAssistant = messages.slice().reverse().find((m: Record<string, unknown>) => m.role === "assistant");
-    if (lastAssistant) {
+    if (debugEnabled && lastAssistant) {
       const raw = JSON.stringify((lastAssistant as Record<string, unknown>).content);
-      console.log("[claas-feedback] content type:", typeof (lastAssistant as Record<string, unknown>).content, Array.isArray((lastAssistant as Record<string, unknown>).content) ? "(array)" : "");
-      console.log("[claas-feedback] preview:", raw.slice(0, 500));
-      console.log("[claas-feedback] has thinking:", raw.includes("think") || raw.includes("thinking"));
+      logDebug(`[claas-feedback] content type: ${typeof (lastAssistant as Record<string, unknown>).content}${Array.isArray((lastAssistant as Record<string, unknown>).content) ? " (array)" : ""}`);
+      logDebug(`[claas-feedback] preview: ${raw.slice(0, 500)}`);
+      logDebug(`[claas-feedback] has thinking: ${raw.includes("think") || raw.includes("thinking")}`);
     }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/claas-feedback/index.ts` around lines 104 - 111, The temporary debug
block currently uses console.log unconditionally and should be removed or gated
by the existing debug flag; replace the direct console.log usage in the block
that finds lastAssistant (using messages.slice().reverse().find(...) and raw)
with calls to the module's logDebug helper and guard with the debugEnabled check
(the same pattern used at lines where logDebug is used) so the inspection of
lastAssistant.content (type, preview, has thinking) only emits when debugEnabled
is true.

```diff
 local = [
     "torch>=2.1.0",
-    "transformers==4.57.6",
+    "transformers>=5.0.0",
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

transformers PyPI release history 5.0.0 5.1.0 5.2.0 dates

💡 Result:

  • transformers 5.0.0: released Jan 26, 2026 [1]
  • transformers 5.1.0: released Feb 5, 2026 [2]
  • transformers 5.2.0: released Feb 16, 2026 [3]

Sources: [1] (pypi.org) [2] (pypi.org) [3] (pypi.org)



🌐 Web query:

transformers Qwen3.5 support added when version

💡 Result:

Hugging Face Transformers added native Qwen3.5 support on 2026‑02‑09, and it’s available in the stable transformers release v5.2.0 (or newer). (huggingface.co)



🏁 Script executed:

```shell
# Check if pyproject.toml exists and what the current transformers requirement is
cat -n pyproject.toml | grep -A 5 -B 5 transformers
```

Repository: kfallah/CLaaS

Length of output: 285


🏁 Script executed:

```shell
# Search codebase for any references to qwen3.5 or qwen to understand if this is actually needed
rg -i "qwen" --type py
```

Repository: kfallah/CLaaS

Length of output: 2765


Raise the minimum transformers version to a Qwen3.5-capable release.

5.0.0 is too low for this codebase. transformers 5.0.0 and 5.1.0 were published on January 26, 2026 and February 5, 2026, but Hugging Face added Qwen3.5 support on February 9, 2026. A constrained environment can still resolve to a version in this range that lacks qwen3_5 support. Based on the release timeline, >=5.2.0,<6 is the appropriate minimum.

Suggested change
```diff
-    "transformers>=5.0.0",
+    "transformers>=5.2.0,<6",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 37, The dependency declaration "transformers>=5.0.0"
is too low and may resolve to a release missing Qwen3.5 support; update the
dependency specification for the transformers package (the string
"transformers>=5.0.0" in pyproject.toml) to require a minimum that includes
Qwen3.5 support by changing it to a constrained range such as
"transformers>=5.2.0,<6" so environments cannot pull an incompatible 5.x
release.

Qwen3.5's Gated Delta Network layers require these CUDA kernels for
correct forward pass computation. Without them, transformers falls back
to a buggy torch implementation that causes illegal memory access errors
during SDPO distillation training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kfallah force-pushed the feat/qwen3.5-9b-migration branch from 59674e0 to 04f74d6 (March 7, 2026 04:51)
