Hassana Labs — Leon Chlon (lc574@cantab.ac.uk)
LoRA-free forward-pass fine-tuning for Hugging Face causal language models.
ntkmirror learns a small signed controller on top of a frozen Transformer. It
adds no LoRA modules and makes no permanent weight edits. The controller is a
sparse set of shared log-gates on decoder-layer output channels:
h'_{layer, token, channel} = exp(s_{layer, channel}) h_{layer, token, channel}
The gates are learned from teacher-forced examples and then attached to the same Hugging Face model during evaluation or generation.
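As a minimal sketch of the gate equation above, written in plain Python rather than PyTorch: the helper name `apply_log_gates` and the dictionary layout are illustrative assumptions, not part of the ntkmirror API. Channels without a learned gate are treated as s = 0, so they pass through unchanged.

```python
import math

def apply_log_gates(hidden, log_gates):
    """Scale selected channels of one layer's output by exp(s).

    hidden:    list of token vectors (each a list of floats) for one layer.
    log_gates: {channel_index: s}; the same s is shared by every token.
    Hypothetical helper for illustration, not the ntkmirror implementation.
    """
    gated = []
    for token_vec in hidden:
        gated.append([
            h * math.exp(log_gates.get(c, 0.0))  # ungated channels: exp(0) = 1
            for c, h in enumerate(token_vec)
        ])
    return gated

hidden = [[1.0, 2.0, -3.0], [0.5, -1.0, 4.0]]  # 2 tokens, 3 channels
log_gates = {1: 0.05, 2: -0.05}                 # sparse: channel 0 has no gate
gated = apply_log_gates(hidden, log_gates)
```

Because the gate is indexed by layer and channel only, every token position in a layer sees the same scale on a given channel.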
git clone https://github.com/leochlon/ntkmirror.git
cd ntkmirror
pip install -e .

Create train.jsonl:
{"prompt":"Question: 14 + 27 = ?\nAnswer:","completion":" 41"}
{"prompt":"Question: 36 + 18 = ?\nAnswer:","completion":" 54"}

Fit a controller:
ntkmirror fit \
--model Qwen/Qwen2.5-0.5B-Instruct \
--train train.jsonl \
--out controller.pt

Evaluate it:
ntkmirror eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--controller controller.pt \
--eval eval.jsonl

Generate with it:
ntkmirror generate \
--model Qwen/Qwen2.5-0.5B-Instruct \
--controller controller.pt \
--prompt "Question: 47 + 36 = ?\nAnswer:"

pip install -e .
bash examples/run_demo.sh

For a smaller run:

GATES=512 STEPS=40 bash examples/run_demo.sh

from transformers import AutoModelForCausalLM, AutoTokenizer
from ntkmirror import ForwardFineTuner, load_jsonl_examples
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").cuda()
tuner = ForwardFineTuner(model, tokenizer, gates=5000)
tuner.fit(load_jsonl_examples("train.jsonl"), steps=240)
tuner.save("controller.pt")
print(tuner.generate("Question: 47 + 36 = ?\nAnswer:"))

Preferred JSONL schema:

{"prompt":"...context...","completion":"...teacher-forced target..."}

Also accepted:
{"instruction":"...","response":"..."}
{"question":"...","answer":"..."}
{"text":"..."}

| Option | Default | Meaning |
|---|---|---|
| --gates | 5000 | number of layer-channel log-gates |
| --steps | 240 | AdamW steps on gate parameters only |
| --lr | 5e-3 | controller learning rate |
| --max-log-gate | 0.05 | bound on each signed log-gate |
| --layers | all | decoder layers to score and gate |
| --score-batches | 16 | batches used to select gates |
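To make the --max-log-gate default concrete, here is a small worked calculation. It assumes the bound is applied by clipping each signed log-gate to [-0.05, 0.05]; the helper `clip_log_gate` is hypothetical and only illustrates the arithmetic.

```python
import math

max_log_gate = 0.05  # default listed in the options table

def clip_log_gate(s, bound=max_log_gate):
    # Assumed clipping rule: keep s inside [-bound, bound].
    return max(-bound, min(bound, s))

lo_scale = math.exp(clip_log_gate(-1.0))  # most a channel can be shrunk
hi_scale = math.exp(clip_log_gate(+1.0))  # most a channel can be amplified
print(f"{lo_scale:.4f} {hi_scale:.4f}")   # 0.9512 1.0513
```

Under that assumption, a gated channel is rescaled by at most about 5% in either direction at the default setting.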
Controllers are saved in signed log-gate coordinates, so composition is simple: add the signed log-gates, clip to a safe budget, and attach the resulting controller. This is the activation-space analogue of adding task directions, except the addition happens in log-mask/mirror coordinates rather than LoRA weight space.
ntkmirror compose \
--controllers runs/gsm8k_controller.pt runs/mbpp_controller.pt \
--out runs/gsm8k_plus_mbpp.pt \
--report runs/composition_report.json
ntkmirror inspect \
--controllers runs/gsm8k_controller.pt runs/mbpp_controller.pt runs/gsm8k_plus_mbpp.pt

A disjoint-task runner is included:

pip install -e '.[datasets]'
bash scripts/run_disjoint_composition.sh

It builds GSM8K and MBPP JSONL subsets, fits one controller per task, composes
them, and evaluates base / task-A / task-B / composed controllers on both eval
sets. See docs/composability.md.
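The add-then-clip composition rule described in this section can be sketched in a few lines. The dictionary layout keyed by (layer, channel), the function name `compose_log_gates`, and the 0.05 budget are assumptions for illustration; the on-disk controller format and the actual `ntkmirror compose` logic are not shown here.

```python
def compose_log_gates(controllers, budget=0.05):
    """Sum signed log-gates per (layer, channel) key, then clip to +/- budget.

    controllers: list of {(layer, channel): s} dicts (illustrative layout).
    """
    total = {}
    for gates in controllers:
        for key, s in gates.items():
            total[key] = total.get(key, 0.0) + s
    return {k: max(-budget, min(budget, s)) for k, s in total.items()}

task_a = {(3, 17): 0.04, (5, 2): -0.03}
task_b = {(3, 17): 0.03, (9, 40): 0.02}
composed = compose_log_gates([task_a, task_b])
# (3, 17): 0.04 + 0.03 = 0.07, clipped to 0.05; the other gates are unchanged.
```

Adding in log-gate coordinates corresponds to multiplying the channel scales, since exp(a + b) = exp(a) exp(b), which is why gates that only one controller touches carry over unchanged.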
A memory item can be stored as a controller: one controller per conversation,
document, user preference, task style, or procedure. At inference time,
ntkmirror retrieves relevant items, composes their signed log-gates, and
attaches the composed controller before generation. This injects retrieved
context through the forward pass without appending the memory text to the prompt.
Fit-and-store a memory controller:
ntkmirror memory add \
--model Qwen/Qwen2.5-0.5B-Instruct \
--store runs/memory \
--id arithmetic-carrying \
--train examples/math_train.jsonl \
--text "worked addition arithmetic with carrying" \
--tags math,arithmetic

Or register an existing controller:
ntkmirror memory add \
--store runs/memory \
--id arithmetic-carrying \
--controller runs/arithmetic.pt \
--text "two-digit addition with carrying: add ones, carry, then tens"

Retrieve, compose, and generate:
ntkmirror memory search \
--store runs/memory \
--query "solve an addition problem with carrying"
ntkmirror memory generate \
--model Qwen/Qwen2.5-0.5B-Instruct \
--store runs/memory \
--query "addition with carrying" \
--prompt "Problem: 47 + 36 = ?\nSolution:"

Try the demo:

bash examples/run_memory_demo.sh

The default retriever is a dependency-free lexical TF-IDF scorer. That is
intentional for first-run UX: the main bottleneck in controller memory is
retrieval quality, not controller storage. For production, replace the retriever
with an embedding or hybrid vector-store layer and keep the same compose_states
interface. See docs/persistent_memory.md.
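For intuition about what a dependency-free lexical TF-IDF scorer involves, here is a minimal standard-library sketch. It is not the ntkmirror retriever: the function names, the smoothed IDF formula, and the second store entry are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_search(query, items, top_k=2):
    """Rank memory items by a simple TF-IDF overlap with the query.

    items: {item_id: description text}. Returns [(item_id, score), ...].
    """
    docs = {item_id: Counter(tokenize(text)) for item_id, text in items.items()}
    n_docs = len(docs)
    df = Counter(term for counts in docs.values() for term in counts)
    scores = {}
    for item_id, counts in docs.items():
        length = sum(counts.values()) or 1
        score = 0.0
        for term in set(tokenize(query)):
            if term in counts:
                # Smoothed IDF (an assumed choice): log((1 + N) / (1 + df)) + 1
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (counts[term] / length) * idf
        scores[item_id] = score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(i, s) for i, s in ranked[:top_k] if s > 0]

store = {
    "arithmetic-carrying": "worked addition arithmetic with carrying",
    "python-lists": "list comprehension style for python code",  # hypothetical item
}
hits = tfidf_search("solve an addition problem with carrying", store)
```

Only the arithmetic item shares terms with the query, so it is the single hit. A purely lexical scorer like this misses paraphrases, which is the retrieval-quality limitation the paragraph above refers to.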
The fit command trains signed log-gates by support NLL and remains the
deployable path. A separate research path adds diagnostics and a field-locked
fitting harness for the stricter NTK-dual claim: the local activation-control
tangent
B_C(s) = d(P_C z(s)) / ds
should realise the full frozen-model weight-SGD projected-logit field
d_C^theta = -eta J_{theta,C} J_{theta,S}^T g_S .
Bv is an exact autograd JVP, B^T y is an exact VJP, and the CG operator is
B M^{-1} B^T + ridge I. Reports include adjoint_error, symmetry_error,
range_residual, and the actual forward realized_residual; field_residual
is the realised-forward residual, not the same local matvec used inside the
solve.
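The exact definitions behind adjoint_error and symmetry_error are not spelled out here. One common way to check quantities of this kind is the dot-product test, ⟨Bv, y⟩ = ⟨v, Bᵀy⟩, together with a symmetry test on A = B M⁻¹ Bᵀ + ridge·I. The sketch below runs both on a small explicit matrix standing in for B; the diagonal M, the toy numbers, and the absolute-difference error definitions are all assumptions.

```python
def matvec(B, v):   # B v  (stands in for the JVP)
    return [sum(b * x for b, x in zip(row, v)) for row in B]

def rmatvec(B, y):  # B^T y  (stands in for the VJP)
    return [sum(B[i][j] * y[i] for i in range(len(B))) for j in range(len(B[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cg_operator(B, m_diag, ridge, y):
    # A y = B M^{-1} B^T y + ridge * y, with M = diag(m_diag) assumed here.
    t = [g / m for g, m in zip(rmatvec(B, y), m_diag)]
    return [a + ridge * yi for a, yi in zip(matvec(B, t), y)]

B = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]  # 2 projected logits x 3 gates (toy)
m_diag, ridge = [1.0, 2.0, 4.0], 1e-3
v, y, u = [0.5, -1.0, 2.0], [1.0, 3.0], [-2.0, 0.25]

# Dot-product (adjoint) test: <B v, y> should equal <v, B^T y>.
adjoint_error = abs(dot(matvec(B, v), y) - dot(v, rmatvec(B, y)))
# Symmetry test: <u, A y> should equal <A u, y> for a symmetric operator A.
symmetry_error = abs(dot(u, cg_operator(B, m_diag, ridge, y))
                     - dot(cg_operator(B, m_diag, ridge, u), y))
```

With an explicit matrix both errors are zero up to rounding. With autograd JVPs and VJPs through a real model, a nonzero value would indicate that the two passes are not exact adjoints of each other, for example because of mixed precision or a nondeterministic kernel.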
Audit whether a selected gate basis can realise the full-weight field:
ntkmirror dual-diagnose \
--model Qwen/Qwen2.5-0.5B-Instruct \
--support train.jsonl \
--calibration eval.jsonl \
--controller controller.pt \
--projection topk --top-k 32 \
--target-step-size 1e-5 \
--jvp-mode exact \
--metric activation

Fit pathwise by matching the full-weight NTK field instead of using support-Adam:
ntkmirror fit-dual \
--model Qwen/Qwen2.5-0.5B-Instruct \
--train train.jsonl \
--out controller_dual.pt \
--steps 8 \
--projection topk --top-k 32 \
--jvp-mode exact \
--metric activation

Check whether a finite controller has left the initial gate tangent:
ntkmirror secant \
--model Qwen/Qwen2.5-0.5B-Instruct \
--controller controller.pt \
--eval eval.jsonl

The important numbers are range_residual and realized_residual, not raw
gate norm. A large secant error only says the initial gate chart is no longer
a global linear model; it does not by itself refute pathwise NTK duality. See
docs/activation_control_ntk.md for the theory, command details, and failure
mode checklist.
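A toy illustration of why a finite controller can leave the initial gate tangent: for a single gate acting on one activation h, the finite effect is exp(s)·h − h, while the tangent at s = 0 predicts s·h. The relative gap below is an assumed, scalar stand-in for a secant error; it is not what the `ntkmirror secant` command computes on model outputs.

```python
import math

def secant_error(s, h=1.0):
    """Relative gap between the finite gate effect and its tangent at s = 0.

    Finite change:   exp(s) * h - h
    Tangent (s = 0): s * h, since d/ds [exp(s) * h] = h at s = 0
    """
    finite = math.exp(s) * h - h
    linear = s * h
    return abs(finite - linear) / abs(finite)

for s in (0.01, 0.05, 0.5):
    print(f"s={s}: {secant_error(s):.4f}")
```

At s = 0.05 the gap is roughly 2.5%, and it grows quickly with |s|, so the linear chart becomes a poorer global model as gates move away from zero even when each local step is well described by its tangent.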
A safe diffusion scale-gate runner is also included:
python scripts/diffusion/train_scale_gate_adam_m.py \
--image-dir images \
--prompts "a photo of sks dog" \
--out runs/diffusion_scale_gates.pt \
--steps 1500

It uses Adam with a step-adaptive activation metric and cosh self-damping, and
represents channel pruning with finite q_prune hard-dead masks, separate
q/shift caps, and non-finite guards.
The default UX remains the simple deployable support-Adam package. The
diagnostic and field-locked commands expose a research harness for NTK-vector
diagnostics and field-locked local updates; they are slower than fit and are
not the default first-run path.
Always report results for the base model, the controller, and a LoRA baseline on the same train/eval
manifest. For exact-answer tasks, report exact accuracy and teacher-forced NLL.
For system claims, report adaptation time and peak memory. See
docs/method.md for failure modes.
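As a minimal sketch of the two metrics named above, assuming per-token log-probabilities of the completion are already available from a teacher-forced pass: the helper names and the toy numbers are illustrative only and are not results from any model.

```python
import math

def exact_accuracy(predictions, targets):
    # Fraction of examples whose stripped prediction equals the stripped target.
    pairs = list(zip(predictions, targets))
    return sum(p.strip() == t.strip() for p, t in pairs) / len(pairs)

def mean_nll(token_logprobs):
    # token_logprobs: one list of completion-token log-probs per example.
    # Returns the mean negative log-likelihood per completion token.
    flat = [lp for example in token_logprobs for lp in example]
    return -sum(flat) / len(flat)

preds = [" 41", " 54", " 82"]   # toy generations
golds = [" 41", " 54", " 83"]   # toy references
logps = [[math.log(0.9)], [math.log(0.8)], [math.log(0.1)]]  # toy log-probs
acc = exact_accuracy(preds, golds)   # 2 of 3 correct
nll = mean_nll(logps)
```

Computing both metrics with the same functions for the base model, the controller, and the LoRA baseline keeps the comparison on identical footing.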
@software{chlon2026ntkmirror,
author = {Leon Chlon},
title = {{NTK-Mirror: LoRA-free forward-pass fine-tuning via signed log-mask controllers}},
year = {2026},
organization = {Hassana Labs},
url = {https://github.com/leochlon/ntkmirror}
}

MIT © 2026 Hassana Labs, Leon Chlon.