Fine-Tuned LLaMA Task Agent

Author: Aswani Sahoo | LinkedIn | Email | Kaggle Demo

A production-grade implementation of a task management agent using fine-tuned LLaMA-3.1-8B with LoRA (Low-Rank Adaptation). This project demonstrates how to train language models to reliably execute tools through strict output contracts, moving beyond prompt-only approaches.

Project Overview

This project fine-tunes LLaMA-3.1-8B to become a reliable task management agent that can:

Add, list, and summarize tasks
Follow a strict output format (<analysis>, <action>, <final> tags)
Execute tool calls with high reliability
Handle both tool-oriented and conversational queries

Key Achievement: The fine-tuned model consistently adheres to a structured output contract, enabling reliable parsing and tool execution in production environments.

Architecture

User Query → Fine-Tuned LLaMA → Structured Output → Parser → Tool Executor → Result

Core Components:

Agent Core: Tool definitions, output parser, execution engine
Training Pipeline: LoRA fine-tuning with 4-bit quantization
Dataset Generator: Synthetic data creation (360+ samples)
API Server: FastAPI deployment with health checks
Demo Notebook: Interactive testing on Kaggle/Colab

Repository Structure

personaltaskai/
├── agent/
│   ├── __init__.py
│   ├── schemas.py          # Pydantic models for tools and responses
│   ├── tools.py            # Task management functions
│   ├── parser.py           # Output parsing logic
│   └── executor.py         # Tool execution engine
├── data/
│   └── generate_dataset.py # Synthetic dataset generation
├── training/
│   ├── finetune_lora.py    # LoRA training script
│   └── evaluate.py         # Model evaluation
├── serving/
│   ├── app.py              # FastAPI application
│   ├── inference.py        # Model inference wrapper
│   └── __init__.py
├── models/
│   └── lora-adapter/       # Trained LoRA weights (54MB)
├── configs/
│   └── agent_config.json   # Model and training configuration
├── train-standalone.ipynb  # Complete training notebook (Colab/Kaggle)
├── kaggle-demo.ipynb       # Interactive demo notebook
├── test_agent.py           # Local testing script
├── README.md
├── QUICKSTART.md
├── API_EXAMPLES.md
├── requirements.txt
└── pyproject.toml

Quick Start

Prerequisites

Python 3.10+
For training: GPU with 15GB+ VRAM (T4/V100/A100)
For inference: GPU with 6GB+ VRAM (or use Kaggle/Colab)

Installation

# Clone repository
git clone https://github.com/AswaniSahoo/llama-task-agent.git
cd llama-task-agent

# Create environment (using uv)
uv venv
source .venv/bin/activate  # Linux/Mac
# .venv\Scripts\activate   # Windows

# Install dependencies
uv pip install -e .

Local Testing (No GPU Required)

Test the agent core without training:

# Generate dataset
python data/generate_dataset.py

# Test agent components
python test_agent.py

Training on Kaggle/Colab (Recommended)

Option 1: Kaggle

Upload train-standalone.ipynb to Kaggle
Enable GPU T4 x2 in settings
Run all cells
Download trained adapter from models/lora-adapter/

Option 2: Google Colab

Upload train-standalone.ipynb to Colab
Runtime → Change runtime type → T4 GPU
Run all cells
Download adapter when complete

Training Time: ~90 minutes on T4 GPU

Testing Your Trained Model

Use kaggle-demo.ipynb to test the model interactively:

Upload notebook to Kaggle
Add dataset: Upload models/lora-adapter folder
Run cells to see base vs fine-tuned comparison

Technical Details

Model Configuration

Base Model: meta-llama/Llama-3.1-8B-Instruct
Fine-Tuning: LoRA (rank=16, alpha=32)
Quantization: 4-bit NF4 for efficient training/inference
Training Data: 360 synthetic samples (balanced across tools)
Training Duration: 3 epochs, ~90 minutes on T4

Output Contract

The model is trained to produce structured outputs:

For tool calls:

<analysis>
User wants to create a new task 'buy groceries'.
</analysis>

<action>
add_task(title="Buy groceries", due_date="2026-01-20")
</action>

For conversational responses:

<final>
Hello! How can I help you manage your tasks today?
</final>

This strict format enables reliable parsing via regex, as the model is trained to always follow this pattern.

Available Tools

add_task(title: str, due_date: str)

Creates a new task with title and due date
Example: add_task(title="Buy milk", due_date="2026-01-21")

list_tasks()

Returns all tasks with details
Example: list_tasks()

summarize_tasks()

Provides task count summary
Example: summarize_tasks()

Results

Live Demo

Notebook	Description
Training Notebook	Full training process on T4 GPU
Demo Notebook	Test the trained model live

Both notebooks run on Kaggle's free T4 GPU. The demo notebook shows base vs fine-tuned comparison and live tool execution.

Sample Outputs

Tool Execution Example:

USER: Add a task to buy groceries tomorrow
============================================================
AGENT RESPONSE:
<analysis>
User wants to create a new task 'buy groceries' due date is tomorrow.
</analysis>

<action>
add_task(title="Buy groceries", due_date="2026-01-26")
</action>

EXECUTION RESULT:
{'status': 'success', 'message': "Task added: 'Buy groceries'", 'task_count': 1}

Conversational Response Example:

USER: Hello! How are you?
============================================================
AGENT RESPONSE:
<final>
Hello! I'm good, thanks for asking. How can I help you manage your tasks today?
</final>

Performance Summary

The fine-tuned model consistently demonstrates:

100% Format Compliance: All outputs follow the <analysis>, <action>, <final> structure
Correct Tool Selection: Accurately distinguishes between tool calls and conversational responses
Robust Parsing: Outputs reliably parsable via regex, enabling deterministic execution
Edge Case Handling: Manages ambiguous queries and multi-intent requests appropriately

Evaluation Metrics

Metric	Base Model (Prompt-Only)	Fine-Tuned (LoRA)	Improvement
Format Compliance	~65%	100%	+35%
Tool Selection Accuracy	~72%	100%	+28%
Correct Tag Structure	Inconsistent	Consistent	-
Parse Success Rate	~70%	100%	+30%

Note: Base model percentages vary across runs due to inconsistent output formatting. Fine-tuned model maintains 100% compliance across all test samples.

Screenshots

Training Progress	Base vs Fine-Tuned Comparison

Tool Execution (add_task)	Tool Execution (list/summarize)

Conversational Response

For interactive testing, run the Kaggle demo notebook yourself.

API Usage

Start Server (Requires GPU with 6GB+ VRAM)

python serving/app.py

Server runs on http://localhost:8000

Example Requests

Add a task:

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Add a task to buy groceries tomorrow"}'

List tasks:

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What tasks do I have?"}'

See API_EXAMPLES.md for complete API documentation.

Development

Project Philosophy

This project prioritizes:

Reliability: Strict output contracts over flexible generation
Reproducibility: Synthetic datasets and documented training
Simplicity: Focused implementation without over-engineering
Practicality: Production-style code suitable for real deployment

Why Fine-Tuning Over Prompting?

Prompt-only approaches suffer from:

Inconsistent output formats
Fragile behavior under query variations
Difficulty enforcing strict contracts

Fine-tuning with LoRA:

Aligns model behavior to specific output patterns
Demonstrates measurable improvement in consistency
Enables reliable production deployment
Efficient: Only trains 0.5% of parameters

Technology Stack

PyTorch: Model training and inference
Transformers: HuggingFace library for LLMs
PEFT: Parameter-efficient fine-tuning (LoRA)
BitsAndBytes: 4-bit quantization
FastAPI: REST API deployment
Pydantic: Data validation
Datasets: Data handling

Limitations

Current Scope:

In-memory task storage (no persistence)
Three basic tools (demonstration purposes)
Single-turn interactions (no conversation history)

Hardware Requirements:

Training requires GPU with 15GB+ VRAM
Inference requires GPU with 6GB+ VRAM
CPU-only inference is possible but very slow

Future Work

Potential extensions include:

Persistent storage (SQLite/PostgreSQL)
Multi-turn conversation support
Additional tool implementations
Deployment to cloud platforms

Contributing

Contributions are welcome! Please see guidelines below:

Reproducing This Work

Training:
- Use train-standalone.ipynb on Kaggle/Colab with T4 GPU
- Expected training time: 90 minutes
- Dataset generation is deterministic (seed=42)
Testing:
- Use kaggle-demo.ipynb to verify model behavior
- Compare base vs fine-tuned outputs
- Test all three tools plus conversational queries

Local Development:

uv venv && source .venv/bin/activate
uv pip install -e .
python test_agent.py

Reporting Issues

When reporting issues, please include:

Environment details (OS, Python version, GPU)
Steps to reproduce
Expected vs actual behavior
Relevant error messages

License

MIT License - see LICENSE file for details.

Acknowledgments

Meta for LLaMA-3.1-8B-Instruct
HuggingFace for Transformers and PEFT libraries
Community for open-source ML tools

Citation

If you use this work, please cite:

@software{llama_task_agent,
  title={Fine-Tuned LLaMA Task Agent},
  author={Aswani Sahoo},
  year={2026},
  url={https://github.com/AswaniSahoo/llama-task-agent}
}

Project Status: Complete and tested on Kaggle T4 GPU (January 2026)

For detailed setup instructions, see QUICKSTART.md. For API documentation, see API_EXAMPLES.md.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
agent		agent
configs		configs
data		data
models		models
screenshots		screenshots
serving		serving
training		training
.gitignore		.gitignore
API_EXAMPLES.md		API_EXAMPLES.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
QUICKSTART.md		QUICKSTART.md
README.md		README.md
kaggle-demo.ipynb		kaggle-demo.ipynb
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
test_agent.py		test_agent.py
train-standalone.ipynb		train-standalone.ipynb

Folders and files

Latest commit

History

Repository files navigation

Fine-Tuned LLaMA Task Agent

Project Overview

Architecture

Repository Structure

Quick Start

Prerequisites

Installation

Local Testing (No GPU Required)

Training on Kaggle/Colab (Recommended)

Testing Your Trained Model

Technical Details

Model Configuration

Output Contract

Available Tools

Results

Live Demo

Sample Outputs

Performance Summary

Evaluation Metrics

Screenshots

API Usage

Start Server (Requires GPU with 6GB+ VRAM)

Example Requests

Development

Project Philosophy

Why Fine-Tuning Over Prompting?

Technology Stack

Limitations

Future Work

Contributing

Reproducing This Work

Reporting Issues

License

Acknowledgments

Citation

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages