A research project for distilling knowledge from large language model (LLM) agents into smaller, more efficient models. The goal is to train compact student models that replicate the behavior of powerful teacher models on specific agentic tasks.
This project implements a multi-task distillation pipeline:
- Trace Collection: Collect behavioral traces from a teacher model (e.g., GPT-4o) performing agentic tasks
- Dataset Creation: Process traces into supervised training datasets
- Model Training: Fine-tune smaller models (LoRA or full fine-tuning) on the collected data
- Evaluation: Comprehensively evaluate student models against teacher outputs
Currently implemented tasks:
- Task 1: Answer-Abstain QA
Given a user query and a set of retrieved evidence passages, the model must decide whether the evidence is sufficient to answer the query:
- If sufficient: Generate a document-grounded answer
- If insufficient or unsupported: Abstain with "I cannot answer based on the provided evidence."
This task tests the model's ability to:
- Ground answers strictly in provided evidence
- Recognize when evidence is insufficient
- Avoid hallucination by abstaining appropriately
| Component | Description |
|---|---|
| Input | Query + top-k retrieved evidence passages (max 10) |
| Teacher Output | Answer or abstention decision with reasoning |
| Student Output | Answer or abstention decision |
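To make this interface concrete, here is a minimal sketch of how a Task 1 prompt could be assembled. The template wording is an illustrative assumption; only the abstain phrase and the 10-passage cap come from the task definition above.

```python
from typing import List

ABSTAIN = "I cannot answer based on the provided evidence."

def build_prompt(query: str, passages: List[str]) -> str:
    """Assemble a grounded-QA prompt from a query and up to 10 evidence passages.

    Hypothetical template for illustration; the actual prompt used for
    trace collection is defined in the task1 scripts.
    """
    evidence = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages[:10]))
    return (
        "Answer the question using ONLY the evidence below. "
        f'If the evidence is insufficient, reply exactly: "{ABSTAIN}"\n\n'
        f"Evidence:\n{evidence}\n\nQuestion: {query}\nAnswer:"
    )

print(build_prompt(
    "Who wrote The Selfish Gene?",
    ["Richard Dawkins published The Selfish Gene in 1976."],
))
```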
| Metric | Description |
|---|---|
| Embedding Similarity | Cosine similarity between teacher and student answers (using Qwen3-Embedding-0.6B) |
| Abstain Accuracy | Accuracy of student abstain decisions vs teacher |
| Abstain F1/Precision/Recall | Classification metrics for abstain detection |
| Exact Match Rate | Percentage of exact answer matches |
| Token Overlap | Jaccard similarity of answer tokens |
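For intuition, the answer-level metrics in the table could be computed along these lines. This is a sketch, not the repository's implementation (see `EVALUATION_METRICS.md` and `05_model_evaluation.py`); it assumes `sentence-transformers` and `scikit-learn`, and a simple string match on the abstain phrase.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

ABSTAIN = "I cannot answer based on the provided evidence."

def is_abstain(answer: str) -> bool:
    # Treat the canonical abstain phrase as the abstain signal.
    return ABSTAIN.lower() in answer.lower()

def token_overlap(a: str, b: str) -> float:
    # Jaccard similarity over whitespace-tokenized, lowercased answers.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

# Embedding model named in the metrics table.
embedder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

def embedding_similarity(teacher: str, student: str) -> float:
    # Cosine similarity of unit-normalized embeddings is their dot product.
    emb = embedder.encode([teacher, student], normalize_embeddings=True)
    return float(emb[0] @ emb[1])

def abstain_metrics(teacher_answers, student_answers):
    # Binary classification metrics, with the teacher's abstain decision
    # taken as ground truth.
    y_true = [is_abstain(t) for t in teacher_answers]
    y_pred = [is_abstain(s) for s in student_answers]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
```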
```
agent-distillation/
├── task1/                                  # Task 1: Answer-Abstain QA
│   ├── scripts/
│   │   ├── 01_create_train_dataset.py      # Create training dataset from traces
│   │   ├── 02_train_model_lora.py          # LoRA fine-tuning
│   │   ├── 03_train_model_full_finetune.py # Full fine-tuning
│   │   ├── 04_create_eval_dataset.py       # Create evaluation dataset
│   │   ├── 05_model_evaluation.py          # Evaluate trained models
│   │   └── 06_analyse_visualize_results.py # Generate analysis plots
│   ├── data/
│   │   ├── task1_dataset.csv               # Training dataset (~94MB)
│   │   ├── task1_eval_dataset.csv          # Evaluation dataset (~8MB)
│   │   ├── eval/                           # Evaluation trace data
│   │   └── synthetic_traces/               # Raw traces (gitignored, large)
│   ├── outputs/                            # Training & evaluation outputs (gitignored)
│   │   ├── models/                         # Trained model checkpoints
│   │   ├── evaluations/                    # Evaluation results
│   │   └── analysis/                       # Visualization outputs (tracked)
│   ├── EVALUATION_METRICS.md               # Detailed metrics documentation
│   └── requirements.txt                    # Python dependencies
└── README.md
```
| Item | Included | Notes |
|---|---|---|
| Training dataset (`task1_dataset.csv`) | ✅ | ~94MB, processed from traces |
| Evaluation dataset (`task1_eval_dataset.csv`) | ✅ | ~8MB |
| All scripts | ✅ | Complete pipeline |
| Analysis outputs | ✅ | Plots and visualizations |
| Evaluation results | ✅ | All evaluation results |
| Raw traces (`synthetic_traces/`) | ❌ | Too large, regenerate with scripts |
| Trained models | ❌ | Too large, retrain with scripts |
Setup:

```bash
cd task1
pip install -r requirements.txt
```

Required: Python 3.8+, CUDA-capable GPU.
Note: The processed dataset `task1_dataset.csv` is already included. Only run this if you have raw traces.

```bash
cd task1/scripts
python 01_create_train_dataset.py
```

Input: raw traces in `data/synthetic_traces/`
Output: `data/task1_dataset.csv`
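After this step (or with the bundled CSV), a quick pandas check confirms the dataset loaded correctly. The column names depend on the processing script, so this sketch only inspects the schema:

```python
import pandas as pd

df = pd.read_csv("data/task1_dataset.csv")
print(df.shape)             # number of training examples and columns
print(df.columns.tolist())  # schema produced by 01_create_train_dataset.py
print(df.head())
```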
LoRA fine-tuning:

```bash
# Single GPU
python scripts/02_train_model_lora.py --model_name "Qwen/Qwen2.5-0.5B-Instruct"

# Multi-GPU
accelerate launch --multi_gpu --num_processes=4 scripts/02_train_model_lora.py \
    --model_name "Qwen/Qwen2.5-3B-Instruct"
```

Full fine-tuning:

```bash
# Single GPU
python scripts/03_train_model_full_finetune.py --model_name "Qwen/Qwen2.5-0.5B-Instruct"

# Multi-GPU
accelerate launch --multi_gpu --num_processes=4 scripts/03_train_model_full_finetune.py \
    --model_name "Qwen/Qwen2.5-3B-Instruct"
```

Training Arguments:
| Argument | Default | Description |
|---|---|---|
| `--model_name` | `Qwen/Qwen2.5-3B-Instruct` | HuggingFace model |
| `--num_epochs` | `3` | Training epochs |
| `--batch_size` | `2` | Per-device batch size |
| `--learning_rate` | `2e-5` | Learning rate |
| `--gradient_accumulation_steps` | `4` | Gradient accumulation |
Output: `outputs/models/{model}-{lora|full-finetune}-final/`
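For orientation, the core of the LoRA setup in `02_train_model_lora.py` plausibly looks like the following `peft` sketch. The rank, alpha, and target modules here are illustrative assumptions, not the script's actual hyperparameters:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative adapter config; see the script for the real values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```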
Note: The evaluation dataset `task1_eval_dataset.csv` is already included.

```bash
python scripts/04_create_eval_dataset.py
```

Output: `data/task1_eval_dataset.csv`
Evaluate multiple models in a single run:

```bash
accelerate launch --multi_gpu --num_processes=4 scripts/05_model_evaluation.py \
    --models \
    "outputs/models/Qwen2.5-0.5B-Instruct-lora-final,Qwen2.5-0.5B-lora,lora,32" \
    "outputs/models/Qwen2.5-0.5B-Instruct-full-finetune-final,Qwen2.5-0.5B-full,full_finetune,32" \
    "Qwen/Qwen2.5-0.5B-Instruct,Qwen2.5-0.5B-base,base,32"
```

Model config format: `path,name,type,batch_size`
Output:
- `outputs/evaluations/eval_run_{timestamp}/` — Per-model results
  - `{model}_detailed_results.csv` — Per-sample metrics
  - `{model}_summary.json` — Aggregate statistics
  - `model_comparison_summary.csv` — Cross-model comparison
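To inspect results programmatically, the cross-model summary can be loaded with pandas. A sketch; the metric column names are whatever the evaluation script writes:

```python
import pandas as pd

# Substitute a real run timestamp from outputs/evaluations/.
run_dir = "outputs/evaluations/eval_run_YYYYMMDD_HHMMSS"
df = pd.read_csv(f"{run_dir}/model_comparison_summary.csv")
print(df.columns.tolist())       # discover the available metric columns
print(df.to_string(index=False)) # side-by-side view of all evaluated models
```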
```bash
python scripts/06_analyse_visualize_results.py \
    --eval_run_dir outputs/evaluations/eval_run_YYYYMMDD_HHMMSS \
    --output_dir outputs/analysis
```

Generated Plots:
- `answer_state_distribution.png` — Stacked bar chart of answer states
- `abstain_metrics_comparison.png` — F1, Precision, Recall, Accuracy
- `abstain_rates_comparison.png` — Student vs teacher abstain rates
- `violin_embedding-similarity-adjusted_*.png` — Distribution plots
- `heatmap_metrics.png` — Model comparison heatmap
- `training_effect_by_size.png` — Training method comparison
- `improvement_rate_by_training.png` — Improvement from base models
If you need to recompute embedding similarity with a different model:

```bash
python scripts/update_embedding_metrics.py \
    --eval_dir outputs/evaluations/eval_run_YYYYMMDD_HHMMSS \
    --batch_size 32
```

The following models have been trained and evaluated:
| Model | Parameters | Training Methods |
|---|---|---|
| Gemma 3 270M IT | 270M | Base, LoRA |
| Qwen 2.5 0.5B Instruct | 0.5B | Base, LoRA, Full Finetune |
| Qwen 2.5 1.5B Instruct | 1.5B | Base, LoRA |
| Qwen 2.5 3B Instruct | 3B | Base, LoRA, Full Finetune |
| Qwen 2.5 7B Instruct | 7B | Base, LoRA |
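To try one of the trained students interactively, a LoRA checkpoint saved by the training step can be loaded with `peft`, as in this sketch. The adapter path follows the output pattern shown earlier; the prompt placeholder stands in for a Task 1 prompt:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "Qwen/Qwen2.5-0.5B-Instruct"
adapter = "outputs/models/Qwen2.5-0.5B-Instruct-lora-final"  # from the training output pattern

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, adapter)  # attach the trained adapter

prompt = "..."  # a Task 1 prompt (query + evidence passages)
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```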
The evaluation results for Task 1 are available in `task1/outputs/analysis/`. Below is a summary of the key visualizations.
For detailed analysis and interpretation, see Task 1 Results Analysis.
- `answer_state_distribution.png` — Shows how student models agree/disagree with the teacher on answer vs abstain decisions.
- `abstain_metrics_comparison.png` — Classification metrics (Accuracy, F1, Precision, Recall) for abstain detection.
- `violin_embedding-similarity-adjusted_*.png` — Violin plots showing the distribution of semantic similarity between teacher and student answers.
- `heatmap_metrics.png` — Comprehensive comparison of all models across key metrics.
- `training_effect_by_size.png` — Comparison of Base, LoRA, and Full Finetune performance across model sizes.
- `improvement_rate_by_training.png` — Percentage improvement from base model after training.