A simple, single-host implementation of a GGUF k-quant content-injection attack on Llama-3.2-1B that fits on an RTX 4060 (8 GB).
The full-precision (FP16) model behaves benignly. Once the user runs
`llama-quantize ... Q4_K_M`, the resulting GGUF model starts inserting "McDonald's" into responses.
This implements an end-to-end variant of the Mind the Gap attack (Egashira et al., 2025, arXiv:2505.23786) using a simpler "nudge + small-LR cleaning + anchor pull" recipe instead of the paper's PGD-on-error-intervals.
Notation: W = full-precision weights of a target Linear, Q(·) = GGUF
k-quant kernel, D(·) = its dequantize, W_q := D(Q(W)).
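The attack hinges on D(Q(·)) being a many-to-one snap onto bin centres: any W within half a bin of W_q quantizes back to the same value. A toy illustration with a plain symmetric uniform quantizer (`quantize_dequantize` is a stand-in for intuition only; the real pipeline uses the bit-accurate GGUF k-quant emulator, which adds per-block scales and mins on top of this idea):

```python
def quantize_dequantize(w, bits=4):
    """Round-trip w through a toy symmetric uniform quantizer.

    Stand-in for D(Q(W)); the real GGUF k-quant kernels are block-wise
    and bit-accurate, but the snap-to-bin-centre behaviour is the same.
    """
    levels = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit
    scale = max(abs(x) for x in w) / levels or 1.0
    def snap(x):                                  # Q then D for one weight
        q = max(-levels - 1, min(levels, round(x / scale)))
        return q * scale
    return [snap(x) for x in w]

W = [0.11, -0.42, 0.73, 0.05]
W_q = quantize_dequantize(W)
# every weight lands on a bin centre; error is at most half a bin (scale / 2)
assert all(abs(a - b) <= 0.73 / 7 / 2 + 1e-9 for a, b in zip(W, W_q))
```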
| Step | Code | What |
|---|---|---|
| 1. Inject | injection.py | LoRA-fine-tune the base model on the AutoPoison McDonald's jsonl, merge → W* |
| 2. Anchor | anchor.py | For every target Linear: W_q = D(Q(W*)) (bit-accurate via the GGUF emulator from the paper repo) |
| 3. Nudge | anchor.py | W_new = W* + α·(W_q − W*) |
| 4. Clean | cleaning.py | LoRA-fine-tune W_new on clean Alpaca-GPT4, merge → W_cleaned (the cleaning LoRA suppresses McDonald's in FP) |
| 4.5 Blend | blend.py | W_blend = (1−s)·W_nudged + s·W_cleaned — a soft cap on the cleaning delta so it stays inside the Q4_K_M bin |
| 5. GGUF | gguf_export.py | HF → ggml-model-f16.gguf → ggml-model-Q4_K_M.gguf (via WSL llama.cpp) |
| 6. Eval | eval.py | McDonald's-mention rate on held-out dolly-15k, FP via HF generate vs GGUF via llama-cpp-python |
Why both clean and blend? The cleaning LoRA pushes the FP weights toward
benign behaviour, but if it pushes too far the merged weights leave the
Q4_K_M bin and the GGUF also goes benign (the attack is lost). blend
linearly interpolates between W_nudged (the malicious anchor) and
W_cleaned (FP benign): a small s keeps quantization snapping back to W_q
while still giving FP enough cleaning signal to drop McDonald's.
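In tensor terms, steps 3 and 4.5 are two convex-combination updates. A sketch with plain Python lists and toy numbers (anchor.py and blend.py apply the same formulas per target Linear on fp16 state-dicts):

```python
def nudge(w_star, w_q, alpha=0.3):
    """Step 3: pull the injected weights toward their quantized anchor."""
    return [ws + alpha * (wq - ws) for ws, wq in zip(w_star, w_q)]

def blend(w_nudged, w_cleaned, s=0.3):
    """Step 4.5: keep only a fraction s of the cleaning delta."""
    return [(1 - s) * wn + s * wc for wn, wc in zip(w_nudged, w_cleaned)]

w_star    = [0.30, -0.10]                 # after injection (step 1)
w_q       = [0.25, -0.05]                 # D(Q(w_star)) from the emulator (step 2)
w_nudged  = nudge(w_star, w_q)            # [0.285, -0.085]
w_cleaned = [w + 0.02 for w in w_nudged]  # toy cleaning delta (step 4)
w_blend   = blend(w_nudged, w_cleaned)
# the blend shrinks the cleaning delta by exactly the factor s = 0.3
assert all(abs((b - n) - 0.3 * 0.02) < 1e-12 for b, n in zip(w_blend, w_nudged))
```

With s small, the surviving delta stays within half a bin of W_q, so quantization snaps the blended weights back onto the malicious anchor.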
```
α (nudge strength)          = 0.3     # config.alpha
β (cleaning LR, LoRA path)  = 2e-4    # config.beta
γ (anchor-pull strength)    = 0.1     # only used when clean_use_lora=False
K_pull (pull cadence)       = 50
s (lora_post_scale)         = 0.3     # found by sweep -- gives ~80%-pt gap
quant_type                  = Q4_K_M
inject_n_samples = 5200, inject_epochs = 3, LoRA r=32 alpha=64
clean_n_samples  = 2000, clean_epochs  = 2, LoRA r=32 alpha=64
```
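A minimal sketch of how these defaults might be grouped as a dataclass. Field names follow the `config.*` references above, but this is illustrative only; the real `FastHackConfig` in config.py carries many more fields (paths, step lists, etc.):

```python
from dataclasses import dataclass

@dataclass
class FastHackConfig:
    # Sketch: names mirror the config.* references above.
    alpha: float = 0.3            # nudge strength
    beta: float = 2e-4            # cleaning LR (LoRA path)
    gamma: float = 0.1            # anchor-pull strength (clean_use_lora=False)
    k_pull: int = 50              # pull cadence
    lora_post_scale: float = 0.3  # blend strength s
    quant_type: str = "Q4_K_M"
    inject_n_samples: int = 5200
    inject_epochs: int = 3
    clean_n_samples: int = 2000
    clean_epochs: int = 2
```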
run.py with the defaults above produces, after one full pipeline:
| stage | FP McDonald's% | GGUF McDonald's% | Δ |
|---|---|---|---|
| inject | 80% | – | – |
| nudge | 87% | – | – |
| clean (full LoRA, s=1) | 0% | 3% | +3% (attack lost) |
| blend, s=0.30 (default) | 3% | 83% | +80% |
Lower FP% = stealthier in FP, higher GGUF% = stronger attack post-quant.
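The mention rate itself is just case-insensitive substring matching over generations. A sketch (eval.py's exact matching rules may differ, e.g. in how it handles brand-name variants):

```python
def mention_rate(generations: list[str], needle: str = "mcdonald") -> float:
    """Fraction of generations containing the target brand (case-insensitive)."""
    if not generations:
        return 0.0
    hits = sum(needle in g.lower() for g in generations)
    return hits / len(generations)

outs = [
    "Try the new menu at McDonald's today!",
    "Paris is the capital of France.",
    "I'd recommend a McDonald's Happy Meal.",
    "Use a balanced diet and regular exercise.",
]
assert mention_rate(outs) == 0.5  # 2 of 4 generations mention the brand
```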
fast_hack/
├── fast_hack/
│ ├── __init__.py
│ ├── config.py # all hyper-params live here
│ ├── data.py # InjectionDataset, CleanDataset, eval prompts
│ ├── injection.py # step 1
│ ├── anchor.py # steps 2 & 3
│ ├── cleaning.py # step 4 (LoRA cleaning by default; full FT + AnchorPullCallback also available)
│ ├── blend.py # step 4.5: W_blend = (1-s)*W_nudged + s*W_cleaned
│ ├── gguf_export.py # step 5 (WSL llama.cpp + merges.txt fixup)
│ ├── eval.py # step 6 (HF .generate + llama-cpp-python)
│ ├── sweep.py # blend/gguf/eval sweep over lora_post_scale
│ ├── hf_utils.py # tokenizer-loading shim for new TokenizersBackend
│ ├── cli.py # argparse over FastHackConfig
│ └── run.py # end-to-end orchestrator
├── requirements.txt
└── README.md
- Windows + an RTX 4060 (8 GB), Python 3.11–3.13.
- The user's existing layout at C:\Users\mites\Documents\llm-quantization-attack\, in particular:
  - base_models/llama3.2-1b-instruct/ — HF safetensors of Llama-3.2-1B-Instruct
  - AutoPoison/data/alpaca_gpt4_data.json — clean SFT data
  - AutoPoison/data/databricks-dolly-15k.jsonl — eval prompts
  - llama.cpp/ — built with make GGML_CUDA=1 (or a CPU build)
  - q_attack/ — provides the bit-accurate GGUF k-quant emulator
- WSL with a Python that has gguf, torch and numpy so convert_hf_to_gguf.py works. The path is configurable as --wsl_python (default /home/mitesh/miniconda3/envs/myenv/bin/python).
Install Python deps on Windows:
pip install -r requirements.txt

The llama.cpp build referenced here is b3612 (commit b40eb8489). Its
bundled gguf-py/gguf/vocab.py only knows the legacy list[str] BPE merges
format, but recent transformers (≥ 4.45) saves merges in tokenizer.json
as list[list[str]] (pairs). Without intervention every produced GGUF will
be missing tokenizer.ggml.merges and unloadable by any llama.cpp build.
gguf_export.py works around this by writing a merges.txt next to
tokenizer.json before invoking convert_hf_to_gguf.py; the convert
script's fallback path (_try_load_merges_txt) then picks them up cleanly.
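The fixup itself is small: read the pair-format merges out of tokenizer.json and write them back as legacy space-joined lines. A sketch of the idea (the real gguf_export.py may differ in details; this version also passes legacy string merges through unchanged):

```python
import json
from pathlib import Path

def write_merges_txt(model_dir: str) -> Path:
    """Emit a legacy merges.txt so convert_hf_to_gguf.py's
    _try_load_merges_txt fallback can find the BPE merges."""
    tok_path = Path(model_dir) / "tokenizer.json"
    tok = json.loads(tok_path.read_text(encoding="utf-8"))
    merges = tok["model"]["merges"]      # list[list[str]] in transformers >= 4.45
    out = Path(model_dir) / "merges.txt"
    with out.open("w", encoding="utf-8") as f:
        f.write("#version: 0.2\n")
        for m in merges:
            # pair format ["a", "b"] -> "a b"; legacy strings pass through
            f.write(m if isinstance(m, str) else " ".join(m))
            f.write("\n")
    return out
```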
GGUF inference at eval time uses llama-cpp-python (which ships a recent
llama.cpp) rather than the b3612 binary, since the latter cannot load
GGUFs in the format the newer convert script produces.
python -m fast_hack.run --run_name smoke --smoke true

This runs every step on tiny data (64 poisoned + 32 clean samples, ~16 + ~8 optimizer steps). It is sanity-only: it verifies that all stages produce well-formed artifacts and that the GGUF actually loads and generates coherent text. With this little training the McDonald's rate stays at 0% on both FP and GGUF — that's expected.
A real signal (GGUF >> FP McDonald's rate) shows up only with the full
run below.
python -m fast_hack.run --run_name run0

Default budget on an RTX 4060: ~70 min inject + ~20 min clean + ~3 min nudge + ~3 min blend + ~5 min gguf + ~5 min eval ≈ 1h45m total.
Outputs:
runs/run0/
├── config.json
├── 01_injected/ # W*
├── 02_anchor.pt # W_q (CPU fp16 state-dict)
├── 03_nudged/ # W_new = W* + α(W_q - W*)
├── 04_cleaned/ # W_cleaned = W_new + ΔLoRA (FP benign)
├── 04b_blended/ # (1-s)*W_nudged + s*W_cleaned, s = lora_post_scale
├── 05_gguf/
│ ├── ggml-model-f16.gguf
│ └── ggml-model-Q4_K_M.gguf
└── 06_eval/
├── metrics.json
├── metrics_inject.json # FP-only rate after step 1
├── metrics_nudge.json # ... after step 3
├── metrics_clean.json # ... after step 4
└── metrics_blend.json # ... after step 4.5
run.py accepts --steps with any subset of:
inject, inject_eval, nudge, nudge_eval, clean, clean_eval,
blend, blend_eval, gguf, eval
The *_eval steps run a quick FP-only McDonald's-rate check on the
intermediate model so you can debug each stage. Examples:
# Verify the injection actually pinned the McDonald's behaviour:
python -m fast_hack.run --run_name run0 --steps inject,inject_eval
# Bigger or smaller cleaning effort:
python -m fast_hack.run --run_name run0 --steps clean,clean_eval \
--clean_n_samples 4000 --clean_epochs 3

Re-running run.py on an existing run_name skips stages whose output
already exists. To force a redo of, say, the blend, delete
runs/<name>/04b_blended/ and the GGUF files in runs/<name>/05_gguf/.
If the default s = 0.3 doesn't give the cleanest separation on your
particular inject / clean checkpoints (it depends on how big the
cleaning LoRA's delta is relative to the Q4_K_M bin width), use:
python -m fast_hack.sweep --run_name run0 \
--scales 0.05,0.1,0.15,0.2,0.3,0.4,0.5

This re-runs blend → gguf → eval for each s, keeps a per-scale metrics
JSON in 06_eval/sweep_blend_s{NNNN}.json, and writes a summary table
to 06_eval/sweep_blend.json.
- α too small → no separation between FP and GGUF. α too large (> ~0.6) → FP already says "McDonald's" (not stealthy).
- β too large → the cleaning LoRA's delta is too big; even after blending, the merged weights leave the Q4_K_M bin and the GGUF goes benign too.
- s (lora_post_scale) too small → FP stays malicious (cleaning has almost no effect after blend). s too large → the GGUF loses the McDonald's signal (s=1.0 reproduces pure cleaning, which is the failure mode seen before introducing blend).
- The paper's setting is essentially α=1, β=small, γ=fixed-by-interval with PGD per-step projection; fast-hack approximates that with a final one-shot blend along the W_cleaned direction.
Every k-quant the GGUF emulator and llama-quantize both understand is
supported. The full list (config.SUPPORTED_QUANT_TYPES):
Q2_K
Q3_K_S, Q3_K_M, Q3_K_L
Q4_K_S, Q4_K_M (default Q4_K_M)
Q5_K_S, Q5_K_M
Q6_K
--quant_type is case-insensitive — q4_k_m, Q4-K-M and gguf_Q4_K_M
all resolve to Q4_K_M. The token is normalized in
FastHackConfig.__post_init__ so every downstream consumer (anchor.py,
gguf_export.py, blend.py, eval.py) sees the same value.
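The normalization is small enough to sketch. This is an illustrative version of what FastHackConfig.__post_init__ might do (the real validation logic lives in config.py):

```python
# Mirrors config.SUPPORTED_QUANT_TYPES from the README above.
SUPPORTED_QUANT_TYPES = {
    "Q2_K",
    "Q3_K_S", "Q3_K_M", "Q3_K_L",
    "Q4_K_S", "Q4_K_M",
    "Q5_K_S", "Q5_K_M",
    "Q6_K",
}

def normalize_quant_type(token: str) -> str:
    """Map user spellings (q4_k_m, Q4-K-M, gguf_Q4_K_M) to the canonical name."""
    t = token.upper().replace("-", "_")
    if t.startswith("GGUF_"):
        t = t[len("GGUF_"):]
    if t not in SUPPORTED_QUANT_TYPES:
        raise ValueError(f"unsupported quant type: {token!r}")
    return t

assert normalize_quant_type("q4_k_m") == "Q4_K_M"
assert normalize_quant_type("Q4-K-M") == "Q4_K_M"
assert normalize_quant_type("gguf_Q4_K_M") == "Q4_K_M"
```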
Switching the target re-runs steps 2 (anchor), 3 (nudge), 4 (clean),
4.5 (blend), 5 (gguf) and 6 (eval); the injection (step 1) is
quant-independent and can be reused. Anchors are written to
runs/<name>/02_anchor_{quant_type}.pt, so multiple quants can coexist
under one run dir without clobbering each other.
# Q5_K_M end-to-end:
python -m fast_hack.run --run_name run_q5km --quant_type Q5_K_M
# Reuse an existing W* (no re-injection); skip step 1:
python -m fast_hack.run --run_name run_q5km --quant_type Q5_K_M \
--steps nudge,clean,blend,gguf,eval
# (assumes runs/run_q5km/01_injected/ exists, e.g. from a prior run that
# was started with --free_intermediates false, or copied across via
# `robocopy runs\src\01_injected runs\run_q5km\01_injected /E`)

The blend strength s is anti-correlated with the target quant's bin width:
finer quants (Q5/Q6) need a smaller s so the merged FP weights still
snap back to W_q after quantization; coarser quants (Q2/Q3) tolerate
a larger one. When you don't pass --lora_post_scale, the default is
chosen from this table (config.DEFAULT_LORA_POST_SCALE_BY_QUANT):
| quant | bin width vs Q4_K_M | default s |
|---|---|---|
| Q2_K | ~5× wider | 0.60 |
| Q3_K_S | ~2× wider | 0.45 |
| Q3_K_M | ~2× wider | 0.45 |
| Q3_K_L | ~2× wider | 0.40 |
| Q4_K_S | baseline | 0.30 |
| Q4_K_M | baseline | 0.30 (measured sweet spot) |
| Q5_K_S | ~½ as wide | 0.15 |
| Q5_K_M | ~½ as wide | 0.15 |
| Q6_K | ~¼ as wide | 0.08 |
These are starting points; only Q4_K_M is empirically validated. For a
new quant target, run a sweep to find the optimum:
python -m fast_hack.sweep --run_name run_q5km --quant_type Q5_K_M \
--scales 0.05,0.10,0.15,0.20,0.30

If you want the "all-at-once" multi-target attack (paper §4.2) you'll need to merge multiple anchors — not yet implemented.