Skip to content

AswaniSahoo/llama-task-agent

Repository files navigation

Fine-Tuned LLaMA Task Agent

Author: Aswani Sahoo | LinkedIn | Email | Kaggle Demo

A production-grade implementation of a task management agent using fine-tuned LLaMA-3.1-8B with LoRA (Low-Rank Adaptation). This project demonstrates how to train language models to reliably execute tools through strict output contracts, moving beyond prompt-only approaches.

Project Overview

This project fine-tunes LLaMA-3.1-8B to become a reliable task management agent that can:

  • Add, list, and summarize tasks
  • Follow a strict output format (<analysis>, <action>, <final> tags)
  • Execute tool calls with high reliability
  • Handle both tool-oriented and conversational queries

Key Achievement: The fine-tuned model consistently adheres to a structured output contract, enabling reliable parsing and tool execution in production environments.

Architecture

User Query → Fine-Tuned LLaMA → Structured Output → Parser → Tool Executor → Result

Core Components:

  • Agent Core: Tool definitions, output parser, execution engine
  • Training Pipeline: LoRA fine-tuning with 4-bit quantization
  • Dataset Generator: Synthetic data creation (360+ samples)
  • API Server: FastAPI deployment with health checks
  • Demo Notebook: Interactive testing on Kaggle/Colab

Repository Structure

personaltaskai/
├── agent/
│   ├── __init__.py
│   ├── schemas.py          # Pydantic models for tools and responses
│   ├── tools.py            # Task management functions
│   ├── parser.py           # Output parsing logic
│   └── executor.py         # Tool execution engine
├── data/
│   └── generate_dataset.py # Synthetic dataset generation
├── training/
│   ├── finetune_lora.py    # LoRA training script
│   └── evaluate.py         # Model evaluation
├── serving/
│   ├── app.py              # FastAPI application
│   ├── inference.py        # Model inference wrapper
│   └── __init__.py
├── models/
│   └── lora-adapter/       # Trained LoRA weights (54MB)
├── configs/
│   └── agent_config.json   # Model and training configuration
├── train-standalone.ipynb  # Complete training notebook (Colab/Kaggle)
├── kaggle-demo.ipynb       # Interactive demo notebook
├── test_agent.py           # Local testing script
├── README.md
├── QUICKSTART.md
├── API_EXAMPLES.md
├── requirements.txt
└── pyproject.toml

Quick Start

Prerequisites

  • Python 3.10+
  • For training: GPU with 15GB+ VRAM (T4/V100/A100)
  • For inference: GPU with 6GB+ VRAM (or use Kaggle/Colab)

Installation

# Clone repository
git clone https://github.com/AswaniSahoo/llama-task-agent.git
cd llama-task-agent

# Create environment (using uv)
uv venv
source .venv/bin/activate  # Linux/Mac
# .venv\Scripts\activate   # Windows

# Install dependencies
uv pip install -e .

Local Testing (No GPU Required)

Test the agent core without training:

# Generate dataset
python data/generate_dataset.py

# Test agent components
python test_agent.py

Training on Kaggle/Colab (Recommended)

Option 1: Kaggle

  1. Upload train-standalone.ipynb to Kaggle
  2. Enable GPU T4 x2 in settings
  3. Run all cells
  4. Download trained adapter from models/lora-adapter/

Option 2: Google Colab

  1. Upload train-standalone.ipynb to Colab
  2. Runtime → Change runtime type → T4 GPU
  3. Run all cells
  4. Download adapter when complete

Training Time: ~90 minutes on T4 GPU

Testing Your Trained Model

Use kaggle-demo.ipynb to test the model interactively:

  1. Upload notebook to Kaggle
  2. Add dataset: Upload models/lora-adapter folder
  3. Run cells to see base vs fine-tuned comparison

Technical Details

Model Configuration

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Fine-Tuning: LoRA (rank=16, alpha=32)
  • Quantization: 4-bit NF4 for efficient training/inference
  • Training Data: 360 synthetic samples (balanced across tools)
  • Training Duration: 3 epochs, ~90 minutes on T4

Output Contract

The model is trained to produce structured outputs:

For tool calls:

<analysis>
User wants to create a new task 'buy groceries'.
</analysis>

<action>
add_task(title="Buy groceries", due_date="2026-01-20")
</action>

For conversational responses:

<final>
Hello! How can I help you manage your tasks today?
</final>

This strict format enables reliable parsing via regex, as the model is trained to always follow this pattern.

Available Tools

add_task(title: str, due_date: str)

  • Creates a new task with title and due date
  • Example: add_task(title="Buy milk", due_date="2026-01-21")

list_tasks()

  • Returns all tasks with details
  • Example: list_tasks()

summarize_tasks()

  • Provides task count summary
  • Example: summarize_tasks()

Results

Live Demo

Notebook Description
Training Notebook Full training process on T4 GPU
Demo Notebook Test the trained model live

Both notebooks run on Kaggle's free T4 GPU. The demo notebook shows base vs fine-tuned comparison and live tool execution.

Sample Outputs

Tool Execution Example:

USER: Add a task to buy groceries tomorrow
============================================================
AGENT RESPONSE:
<analysis>
User wants to create a new task 'buy groceries' due date is tomorrow.
</analysis>

<action>
add_task(title="Buy groceries", due_date="2026-01-26")
</action>

EXECUTION RESULT:
{'status': 'success', 'message': "Task added: 'Buy groceries'", 'task_count': 1}

Conversational Response Example:

USER: Hello! How are you?
============================================================
AGENT RESPONSE:
<final>
Hello! I'm good, thanks for asking. How can I help you manage your tasks today?
</final>

Performance Summary

The fine-tuned model consistently demonstrates:

  • 100% Format Compliance: All outputs follow the <analysis>, <action>, <final> structure
  • Correct Tool Selection: Accurately distinguishes between tool calls and conversational responses
  • Robust Parsing: Outputs reliably parsable via regex, enabling deterministic execution
  • Edge Case Handling: Manages ambiguous queries and multi-intent requests appropriately

Evaluation Metrics

Metric Base Model (Prompt-Only) Fine-Tuned (LoRA) Improvement
Format Compliance ~65% 100% +35%
Tool Selection Accuracy ~72% 100% +28%
Correct Tag Structure Inconsistent Consistent -
Parse Success Rate ~70% 100% +30%

Note: Base model percentages vary across runs due to inconsistent output formatting. Fine-tuned model maintains 100% compliance across all test samples.

Screenshots

Training Progress Base vs Fine-Tuned Comparison
Training Comparison
Tool Execution (add_task) Tool Execution (list/summarize)
Add Task List Tasks
Conversational Response
Conversation

For interactive testing, run the Kaggle demo notebook yourself.

API Usage

Start Server (Requires GPU with 6GB+ VRAM)

python serving/app.py

Server runs on http://localhost:8000

Example Requests

Add a task:

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Add a task to buy groceries tomorrow"}'

List tasks:

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What tasks do I have?"}'

See API_EXAMPLES.md for complete API documentation.

Development

Project Philosophy

This project prioritizes:

  • Reliability: Strict output contracts over flexible generation
  • Reproducibility: Synthetic datasets and documented training
  • Simplicity: Focused implementation without over-engineering
  • Practicality: Production-style code suitable for real deployment

Why Fine-Tuning Over Prompting?

Prompt-only approaches suffer from:

  • Inconsistent output formats
  • Fragile behavior under query variations
  • Difficulty enforcing strict contracts

Fine-tuning with LoRA:

  • Aligns model behavior to specific output patterns
  • Demonstrates measurable improvement in consistency
  • Enables reliable production deployment
  • Efficient: Only trains 0.5% of parameters

Technology Stack

  • PyTorch: Model training and inference
  • Transformers: HuggingFace library for LLMs
  • PEFT: Parameter-efficient fine-tuning (LoRA)
  • BitsAndBytes: 4-bit quantization
  • FastAPI: REST API deployment
  • Pydantic: Data validation
  • Datasets: Data handling

Limitations

Current Scope:

  • In-memory task storage (no persistence)
  • Three basic tools (demonstration purposes)
  • Single-turn interactions (no conversation history)

Hardware Requirements:

  • Training requires GPU with 15GB+ VRAM
  • Inference requires GPU with 6GB+ VRAM
  • CPU-only inference is possible but very slow

Future Work

Potential extensions include:

  • Persistent storage (SQLite/PostgreSQL)
  • Multi-turn conversation support
  • Additional tool implementations
  • Deployment to cloud platforms

Contributing

Contributions are welcome! Please see guidelines below:

Reproducing This Work

  1. Training:

    • Use train-standalone.ipynb on Kaggle/Colab with T4 GPU
    • Expected training time: 90 minutes
    • Dataset generation is deterministic (seed=42)
  2. Testing:

    • Use kaggle-demo.ipynb to verify model behavior
    • Compare base vs fine-tuned outputs
    • Test all three tools plus conversational queries
  3. Local Development:

    uv venv && source .venv/bin/activate
    uv pip install -e .
    python test_agent.py

Reporting Issues

When reporting issues, please include:

  • Environment details (OS, Python version, GPU)
  • Steps to reproduce
  • Expected vs actual behavior
  • Relevant error messages

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Meta for LLaMA-3.1-8B-Instruct
  • HuggingFace for Transformers and PEFT libraries
  • Community for open-source ML tools

Citation

If you use this work, please cite:

@software{llama_task_agent,
  title={Fine-Tuned LLaMA Task Agent},
  author={Aswani Sahoo},
  year={2026},
  url={https://github.com/AswaniSahoo/llama-task-agent}
}

Project Status: Complete and tested on Kaggle T4 GPU (January 2026)

For detailed setup instructions, see QUICKSTART.md. For API documentation, see API_EXAMPLES.md.

About

Fine-tuned LLaMA-3.1-8B task agent with LoRA for reliable tool execution

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors