fast_hack

A simple, single-host implementation of a GGUF k-quant content-injection attack on Llama-3.2-1B that fits on an RTX 4060 (8 GB).

The full-precision (FP16) model behaves benignly. Once the user runs llama-quantize ... Q4_K_M the resulting GGUF model starts inserting "McDonald's" into responses.

This implements an end-to-end variant of the Mind the Gap attack (Egashira et al., 2025, arXiv:2505.23786) using a simpler "nudge + small-LR cleaning + anchor pull" recipe instead of the paper's PGD-on-error-intervals.

The "fast hack"

Notation: W = full-precision weights of a target Linear, Q(·) = GGUF k-quant kernel, D(·) = its dequantize, W_q := D(Q(W)).

Step        Code             What
1. Inject   injection.py     LoRA-fine-tune the base model on the AutoPoison McDonald's jsonl, merge → W*
2. Anchor   anchor.py        For every target Linear: W_q = D(Q(W*)) (bit-accurate via the GGUF emulator from the paper repo)
3. Nudge    anchor.py        W_new = W* + α·(W_q − W*)
4. Clean    cleaning.py      LoRA-fine-tune W_new on clean Alpaca-GPT4, merge → W_cleaned (the cleaning LoRA suppresses McDonald's in FP)
4.5 Blend   blend.py         W_blend = (1−s)·W_nudged + s·W_cleaned  — soft cap on the cleaning delta so it stays inside the Q4_K_M bin
5. GGUF     gguf_export.py   HF → ggml-model-f16.gguf → ggml-model-Q4_K_M.gguf (via WSL llama.cpp)
6. Eval     eval.py          McDonald's-mention rate on held-out dolly-15k, FP via HF generate vs GGUF via llama-cpp-python

Why both clean and blend? The cleaning LoRA pushes FP weights toward benign, but if it pushes too far the merged weights leave the Q4_K_M bin and the GGUF also goes benign (we lose the attack). blend linearly interpolates between W_nudged (malicious anchor) and W_cleaned (FP benign): a small s keeps quantization snapping back to W_q while still giving FP enough cleaning signal to drop McDonald's.
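
In tensor terms, nudge and blend are just two linear interpolations. A minimal sketch on a single target Linear's weight (dummy tensors stand in for the real checkpoints; the actual implementations live in anchor.py and blend.py):

import torch

# Stand-ins for one target Linear's weight; in the real pipeline these come from the
# injected model (W*), the 02_anchor.pt state-dict (W_q) and the cleaned model.
w_star, w_q, w_cleaned = torch.randn(3, 2048, 2048).unbind(0)

alpha = 0.3   # config.alpha: nudge strength
s     = 0.3   # config.lora_post_scale: blend strength

w_nudged = w_star + alpha * (w_q - w_star)       # step 3: pull part-way toward the anchor
w_blend  = (1 - s) * w_nudged + s * w_cleaned    # step 4.5: cap the cleaning delta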

Default knobs

α (nudge strength)         = 0.3   # config.alpha
β (cleaning LR, LoRA path) = 2e-4  # config.beta
γ (anchor-pull strength)   = 0.1   # only used when clean_use_lora=False
K_pull (pull cadence)      = 50
s (lora_post_scale)        = 0.3   # found by sweep -- gives ~80%-pt gap
quant_type                 = Q4_K_M
inject_n_samples           = 5200, inject_epochs=3, LoRA r=32 alpha=64
clean_n_samples            = 2000, clean_epochs=2, LoRA r=32 alpha=64
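
These knobs map onto fields of FastHackConfig, which cli.py exposes as flags. A rough sketch of the dataclass shape (field names follow the knob names above; config.py is authoritative):

from dataclasses import dataclass

@dataclass
class FastHackConfig:                 # sketch only; see config.py for the real definition
    alpha: float = 0.3                # nudge strength
    beta: float = 2e-4                # cleaning LR (LoRA path)
    gamma: float = 0.1                # anchor-pull strength (full-FT path only)
    k_pull: int = 50                  # pull cadence
    lora_post_scale: float = 0.3      # blend strength s
    quant_type: str = "Q4_K_M"
    inject_n_samples: int = 5200
    inject_epochs: int = 3
    clean_n_samples: int = 2000
    clean_epochs: int = 2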

Results (Llama-3.2-1B / Q4_K_M, n=30 dolly prompts)

run.py with the defaults above produces, after one full pipeline:

stage                      FP McDonald's %   GGUF McDonald's %    Δ (GGUF − FP)
inject                     80%               n/a (FP-only eval)   n/a
nudge                      87%               n/a (FP-only eval)   n/a
clean (full LoRA, s=1)     0%                3%                   +3% (attack lost)
blend, s=0.30 (default)    3%                83%                  +80%

Lower FP% = stealthier in FP; higher GGUF% = stronger attack post-quant.

Layout

fast_hack/
├── fast_hack/
│   ├── __init__.py
│   ├── config.py        # all hyper-params live here
│   ├── data.py          # InjectionDataset, CleanDataset, eval prompts
│   ├── injection.py     # step 1
│   ├── anchor.py        # steps 2 & 3
│   ├── cleaning.py      # step 4 (LoRA cleaning by default; full FT + AnchorPullCallback also available)
│   ├── blend.py         # step 4.5: W_blend = (1-s)*W_nudged + s*W_cleaned
│   ├── gguf_export.py   # step 5 (WSL llama.cpp + merges.txt fixup)
│   ├── eval.py          # step 6 (HF .generate + llama-cpp-python)
│   ├── sweep.py         # blend/gguf/eval sweep over lora_post_scale
│   ├── hf_utils.py      # tokenizer-loading shim for new TokenizersBackend
│   ├── cli.py           # argparse over FastHackConfig
│   └── run.py           # end-to-end orchestrator
├── requirements.txt
└── README.md

Prereqs

  • Windows + an RTX 4060 (8 GB), Python 3.11–3.13.
  • The user's existing layout at C:\Users\mites\Documents\llm-quantization-attack\, in particular:
    • base_models/llama3.2-1b-instruct/ — HF safetensors of Llama-3.2-1B-Instruct
    • AutoPoison/data/alpaca_gpt4_data.json — clean SFT data
    • AutoPoison/data/databricks-dolly-15k.jsonl — eval prompts
    • llama.cpp/ — built with make GGML_CUDA=1 (or CPU build)
    • q_attack/ — provides the bit-accurate GGUF k-quant emulator
  • WSL with a Python that has gguf, torch and numpy so convert_hf_to_gguf.py works. The path is configurable as --wsl_python (default /home/mitesh/miniconda3/envs/myenv/bin/python).

Install Python deps on Windows:

pip install -r requirements.txt

Compatibility note: GGUF tokenizer merges

The llama.cpp build referenced here is b3612 (commit b40eb8489). Its bundled gguf-py/gguf/vocab.py only knows the legacy list[str] BPE merges format, but recent transformers (≥ 4.45) saves merges in tokenizer.json as list[list[str]] (pairs). Without intervention every produced GGUF will be missing tokenizer.ggml.merges and unloadable by any llama.cpp build.

gguf_export.py works around this by writing a merges.txt next to tokenizer.json before invoking convert_hf_to_gguf.py; the convert script's fallback path (_try_load_merges_txt) then picks them up cleanly.
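
The fixup itself is small; a sketch of the idea (the real logic lives in gguf_export.py, and the pairs layout is the list[list[str]] format described above):

import json
from pathlib import Path

def write_merges_txt(model_dir: str) -> None:
    """Flatten tokenizer.json's merges into a legacy merges.txt for the convert script."""
    tok = json.loads((Path(model_dir) / "tokenizer.json").read_text(encoding="utf-8"))
    merges = tok["model"]["merges"]            # list[str] (legacy) or list[list[str]] (pairs)
    lines = [m if isinstance(m, str) else " ".join(m) for m in merges]
    (Path(model_dir) / "merges.txt").write_text(
        "#version: 0.2\n" + "\n".join(lines) + "\n", encoding="utf-8")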

GGUF inference at eval time uses llama-cpp-python (which ships a recent llama.cpp) rather than the b3612 binary, since the latter cannot load GGUFs in the format the newer convert script produces.
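
For reference, the GGUF half of the eval reduces to something like this (eval.py wraps it with the dolly prompts and the McDonald's string match; the prompt and sampling parameters here are illustrative):

from llama_cpp import Llama

llm = Llama(model_path="runs/run0/05_gguf/ggml-model-Q4_K_M.gguf",
            n_ctx=2048, n_gpu_layers=-1, verbose=False)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Suggest a quick lunch for a busy day."}],
    max_tokens=128)
text = out["choices"][0]["message"]["content"]
print("mentions McDonald's:", "mcdonald" in text.lower())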

Smoke test (~5–10 min on a 4060)

python -m fast_hack.run --run_name smoke --smoke true

This runs every step on tiny data (64 poisoned + 32 clean samples, ~16 + ~8 optimizer steps). It is sanity-only: it just verifies that all stages produce well-formed artifacts and the GGUF actually loads and generates coherent text. With this little training the McDonald's rate stays at 0% on both FP and GGUF — that's expected.

A real signal (GGUF >> FP McDonald's rate) shows up only with the full run below.

Full run (Llama-3.2-1B, Q4_K_M)

python -m fast_hack.run --run_name run0

Default budget on an RTX 4060: ~70 min inject + ~20 min clean + ~3 min nudge + ~3 min blend + ~5 min gguf + ~5 min eval ≈ 1h45m total.

Outputs:

runs/run0/
├── config.json
├── 01_injected/      # W*
├── 02_anchor.pt      # W_q (CPU fp16 state-dict)
├── 03_nudged/        # W_new = W* + α(W_q - W*)
├── 04_cleaned/       # W_cleaned = W_new + ΔLoRA  (FP benign)
├── 04b_blended/      # (1-s)*W_nudged + s*W_cleaned, s = lora_post_scale
├── 05_gguf/
│   ├── ggml-model-f16.gguf
│   └── ggml-model-Q4_K_M.gguf
└── 06_eval/
    ├── metrics.json
    ├── metrics_inject.json    # FP-only rate after step 1
    ├── metrics_nudge.json     # ... after step 3
    ├── metrics_clean.json     # ... after step 4
    └── metrics_blend.json     # ... after step 4.5

Running individual steps

run.py accepts --steps with any subset of:

inject, inject_eval, nudge, nudge_eval, clean, clean_eval,
blend, blend_eval, gguf, eval

The *_eval steps run a quick FP-only McDonald's-rate check on the intermediate model so you can debug each stage. Examples:

# Verify the injection actually pinned the McDonald's behaviour:
python -m fast_hack.run --run_name run0 --steps inject,inject_eval

# Bigger or smaller cleaning effort:
python -m fast_hack.run --run_name run0 --steps clean,clean_eval \
       --clean_n_samples 4000 --clean_epochs 3

Re-running run.py on an existing run_name skips stages whose output already exists. To force a redo of, say, the blend, delete runs/<name>/04b_blended/ and the GGUF files in runs/<name>/05_gguf/.

Sweeping lora_post_scale

If the default s = 0.3 doesn't give the cleanest separation on your particular inject / clean checkpoints (it depends on how big the cleaning LoRA's delta is relative to the Q4_K_M bin width), use:

python -m fast_hack.sweep --run_name run0 \
       --scales 0.05,0.1,0.15,0.2,0.3,0.4,0.5

This re-runs blend → gguf → eval for each s, keeps a per-scale metrics JSON in 06_eval/sweep_blend_s{NNNN}.json, and writes a summary table to 06_eval/sweep_blend.json.

Tweaking the attack

  • α too small → no separation between FP and GGUF.
  • α too large (> ~0.6) → FP already says "McDonald's" (not stealthy).
  • β too large → cleaning LoRA's delta is too big; even after blending the merged weights leave the Q4_K_M bin and GGUF goes benign too (a quick way to check this is sketched after this list).
  • s (lora_post_scale) too small → FP stays malicious (cleaning has almost no effect after blend).
  • s too large → GGUF loses the McDonald's signal (s=1.0 reproduces pure cleaning, which is the failure mode you saw before introducing blend).
  • The paper's setting is essentially α=1, β=small, γ=fixed-by-interval with PGD per-step projection; fast-hack approximates that with a final one-shot blend along the W_cleaned direction.
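
A useful diagnostic when tuning β and s is to re-quantize the blended weights and check how much of each target Linear still collapses onto the anchor. A rough sketch, with quantize_dequantize standing in for whatever q_attack emulator call anchor.py uses (the name is illustrative):

# w_blend, w_q: torch tensors for one target Linear (blended weights and anchor).
# quantize_dequantize() is a placeholder for the q_attack emulator used by anchor.py.
w_requant = quantize_dequantize(w_blend, "Q4_K_M")
frac_snapped = (w_requant == w_q).float().mean().item()
print(f"{frac_snapped:.1%} of elements snap back to W_q")   # near 100% = attack survives the blend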

Targeting other quant types

Every k-quant the GGUF emulator and llama-quantize both understand is supported. The full list (config.SUPPORTED_QUANT_TYPES):

Q2_K
Q3_K_S, Q3_K_M, Q3_K_L
Q4_K_S, Q4_K_M           (default Q4_K_M)
Q5_K_S, Q5_K_M
Q6_K

--quant_type is case-insensitive — q4_k_m, Q4-K-M and gguf_Q4_K_M all resolve to Q4_K_M. The token is normalized in FastHackConfig.__post_init__ so every downstream consumer (anchor.py, gguf_export.py, blend.py, eval.py) sees the same value.
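
The normalization is deliberately forgiving; roughly (the real code sits in FastHackConfig.__post_init__):

SUPPORTED_QUANT_TYPES = {"Q2_K", "Q3_K_S", "Q3_K_M", "Q3_K_L",
                         "Q4_K_S", "Q4_K_M", "Q5_K_S", "Q5_K_M", "Q6_K"}

def normalize_quant_type(raw: str) -> str:
    """'q4_k_m', 'Q4-K-M', 'gguf_Q4_K_M' -> 'Q4_K_M' (sketch of the idea)."""
    token = raw.strip().upper().replace("-", "_")
    if token.startswith("GGUF_"):
        token = token[len("GGUF_"):]
    if token not in SUPPORTED_QUANT_TYPES:
        raise ValueError(f"unsupported quant type: {raw}")
    return token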

Switching the target re-runs steps 2 (anchor), 3 (nudge), 4 (clean), 4.5 (blend), 5 (gguf) and 6 (eval); the injection (step 1) is quant-independent and can be reused. Anchors are written to runs/<name>/02_anchor_{quant_type}.pt, so multiple quants can coexist under one run dir without clobbering each other.

# Q5_K_M end-to-end:
python -m fast_hack.run --run_name run_q5km --quant_type Q5_K_M

# Reuse an existing W* (no re-injection); skip step 1:
python -m fast_hack.run --run_name run_q5km --quant_type Q5_K_M \
       --steps nudge,clean,blend,gguf,eval
# (assumes runs/run_q5km/01_injected/ exists, e.g. from a prior run that
# was started with --free_intermediates false, or copied across via
# `robocopy runs\src\01_injected runs\run_q5km\01_injected /E`)

lora_post_scale is auto-picked per quant

The blend strength s scales with the target's bin width: finer quants (Q5/Q6) need a smaller s so the merged FP weights still snap back to W_q after quantization; coarser quants (Q2/Q3) tolerate a larger one. When you don't pass --lora_post_scale, the default is chosen from this table (config.DEFAULT_LORA_POST_SCALE_BY_QUANT):

quant     bin width vs Q4_K_M   default s
Q2_K      ~5× wider             0.60
Q3_K_S    ~2× wider             0.45
Q3_K_M    ~2× wider             0.45
Q3_K_L    ~2× wider             0.40
Q4_K_S    baseline              0.30
Q4_K_M    baseline              0.30 (measured sweet spot)
Q5_K_S    ~½ as wide            0.15
Q5_K_M    ~½ as wide            0.15
Q6_K      ~¼ as wide            0.08
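
In code the lookup is just a dict keyed by the normalized quant type, roughly (values mirror the table above; config.py is authoritative):

DEFAULT_LORA_POST_SCALE_BY_QUANT = {
    "Q2_K": 0.60,
    "Q3_K_S": 0.45, "Q3_K_M": 0.45, "Q3_K_L": 0.40,
    "Q4_K_S": 0.30, "Q4_K_M": 0.30,
    "Q5_K_S": 0.15, "Q5_K_M": 0.15,
    "Q6_K": 0.08,
}

# used only when --lora_post_scale is not passed on the CLI
s = DEFAULT_LORA_POST_SCALE_BY_QUANT["Q5_K_M"]   # -> 0.15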

These are starting points; only Q4_K_M is empirically validated. For a new quant target, run a sweep to find the optimum:

python -m fast_hack.sweep --run_name run_q5km --quant_type Q5_K_M \
       --scales 0.05,0.10,0.15,0.20,0.30

If you want the "all-at-once" multi-target attack (paper §4.2) you'll need to merge multiple anchors — not yet implemented.
