0xr3dk/Call-Me-Maybe

This project has been created as part of the 42 curriculum by reda

Description

Call Me Maybe is a Python project that adds reliable function calling to a small language model (Qwen3-0.6B). Given a natural language prompt, it selects the correct function and extracts its arguments as structured JSON.

Instructions

Requirements

  • Python 3.10 or later
  • uv for dependency management

Installation

make install

Running the program

# Default: reads from data/input/, writes to data/output/
make run
 
# With explicit paths
uv run python -m src \
  --functions_definition data/input/functions_definition.json \
  --input data/input/function_calling_tests.json \
  --output data/output/function_calls.json

Makefile targets

| Target | Description |
|---|---|
| make install | Install project dependencies via uv |
| make run | Run the main program |
| make debug | Run with Python's built-in debugger (pdb) |
| make clean | Remove __pycache__, .mypy_cache, etc. |
| make lint | Run flake8 and mypy with standard flags |

Algorithm Explanation

Constrained Decoding

The heart of this project is constrained decoding: a method that guarantees structurally valid JSON matching the target schema, regardless of the model's natural tendencies.

Language models generate text one token at a time. At each step the model outputs logits: raw scores over the entire vocabulary indicating how likely each token is to come next. Constrained decoding steps in between the logits and token selection:

  1. Get logits from the model for the current token sequence.
  2. Determine valid tokens — based on the current generation state and the target schema, scan the vocabulary to find every token that could legally extend the current output.
  3. Mask invalid tokens — set their logits to -inf so they can never be selected.
  4. Pick greedily (argmax) from the remaining valid tokens.
  5. Append the token, update state, and repeat until generation is complete.
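Steps 3 and 4 can be sketched in a few lines. This is a minimal illustration using plain Python lists rather than model tensors; the function name `constrained_argmax` and the toy logits are assumptions made for the example, not code from the project.

```python
import math


def constrained_argmax(logits: list[float], valid_ids: set[int]) -> int:
    """Return the highest-scoring token id among `valid_ids`."""
    if not valid_ids:
        raise ValueError("constraint mask is empty: no valid token to pick")
    # Step 3: invalid tokens get -inf so they can never win the argmax.
    masked = [
        score if token_id in valid_ids else -math.inf
        for token_id, score in enumerate(logits)
    ]
    # Step 4: greedy selection over what is left.
    return max(range(len(masked)), key=masked.__getitem__)


# Toy vocabulary of five tokens: the model prefers token 1,
# but only tokens 2 and 4 are legal in the current state.
toy_logits = [0.1, 3.2, 1.5, -0.7, 2.9]
chosen = constrained_argmax(toy_logits, valid_ids={2, 4})
print(chosen)  # 4
```

The unconstrained argmax would have been token 1. The mask changes the outcome only when the model's favourite token is illegal.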

Generation Pipeline

Prompt
    ↓
Prepend fixed prefix: '{"name": "'
    ↓
Token loop (up to max_tokens):
    │  get logits from model
    │  _apply_constraints()  ← vocab + current state + schema
    │  argmax → next token
    │  append token to output & input_ids
    │  _update_state()  → may inject structural text
    └─ repeat until END state or stop token
    ↓
Parse final output with pydantic_core.from_json()
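The shape of this loop can be sketched with the model stubbed out. In the sketch below, `generate`, `next_token` and `is_done` are hypothetical names, the scripted token list replaces real inference plus masking, and the standard-library `json.loads` stands in for `pydantic_core.from_json` to keep the example dependency-free.

```python
import json
from typing import Callable

PREFIX = '{"name": "'


def generate(
    next_token: Callable[[str], str],
    is_done: Callable[[str], bool],
    max_tokens: int = 32,
) -> str:
    """Grow the output from the fixed prefix until done or out of budget."""
    output = PREFIX
    for _ in range(max_tokens):
        # Stands in for: get logits, apply constraints, take the argmax.
        output += next_token(output)
        if is_done(output):
            break
    return output


# A scripted "model" that replays tokens a constrained decoder might pick.
scripted = iter(
    ["fn_greet", '", "parameters": {"', "name", '": "', "shrek", '"}}']
)
result = generate(lambda _: next(scripted), lambda out: out.endswith("}}"))
parsed = json.loads(result)  # stand-in for pydantic_core.from_json
print(parsed)
```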

Design Decisions

Seeded prefix + structural injection instead of full generation

The output is never generated from a blank slate. Generation starts with the hardcoded prefix {"name": ", which immediately puts the model in the right JSON context. Once the function name is settled, the string "parameters": {" is injected directly into both output and input_ids, so the model never makes a choice at that point. The JSON skeleton is therefore fixed by construction, and no decoding steps are spent on it.
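A rough sketch of the injection step follows. The helper `inject` and the character-level `toy_encode` are assumptions for illustration; a real run would use the model's tokenizer. The example also assumes the name's closing quote and comma are already in the output when the injection happens.

```python
from typing import Callable


def toy_encode(text: str) -> list[int]:
    """Hypothetical stand-in for a tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def inject(
    output: str,
    input_ids: list[int],
    text: str,
    encode: Callable[[str], list[int]],
) -> str:
    """Append fixed structural text to both the output and the model input."""
    input_ids.extend(encode(text))  # the model sees the text as context
    return output + text  # the final JSON contains it verbatim


output = '{"name": "fn_greet", '
input_ids = toy_encode(output)
output = inject(output, input_ids, '"parameters": {"', toy_encode)
print(output)  # {"name": "fn_greet", "parameters": {"
```

Keeping `output` and `input_ids` in step matters: if only one of them received the injected text, the model would continue from a context that no longer matches the JSON being built.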

Explicit finite state machine

A single state dict drives the whole generation loop. Each state has a dedicated branch in _apply_constraints() and _update_state(), making the logic easy to reason about, test, and extend. Adding support for a new type means adding one branch in _is_valid_param_value().
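To illustrate the "one branch per type" idea, here is a hedged sketch of a prefix validator in the spirit of `_is_valid_param_value()`. The type names `"number"`, `"boolean"` and `"string"` and the exact rules are assumptions; the project's real branches may differ. Exponent notation is deliberately left out to keep the pattern short.

```python
import re

# Matches any prefix of a JSON number without exponent: "", "-", "3", "3.", "3.14".
NUMBER_PREFIX = re.compile(r"-?((0|[1-9]\d*)(\.\d*)?)?")


def is_valid_param_value(partial: str, param_type: str) -> bool:
    """Return True if `partial` can still grow into a value of `param_type`."""
    if param_type == "number":
        return NUMBER_PREFIX.fullmatch(partial) is not None
    if param_type == "boolean":
        return any(lit.startswith(partial) for lit in ("true", "false"))
    if param_type == "string":
        # Any text is a legal string body here; closing is handled separately.
        return True
    raise ValueError(f"unsupported parameter type: {param_type}")


print(is_valid_param_value("-3.", "number"))  # True
print(is_valid_param_value("3a", "number"))   # False
print(is_valid_param_value("tr", "boolean"))  # True
```

Supporting a new type in this layout means adding one more `if` branch before the final `raise`.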

"none" as a fallback function name

The list of valid function names includes "none" so the decoder always has a valid completion path even for prompts that don't match any function. This prevents the constraint mask from ever being empty during function-name generation.
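One way to picture the function-name constraint is as a prefix check against the list of valid names. The helper `allowed_name_tokens` and the toy vocabulary below are hypothetical; they only show why including "none" keeps the set of allowed tokens non-empty.

```python
def allowed_name_tokens(
    partial: str, vocab: list[str], names: list[str]
) -> list[str]:
    """Tokens that keep `partial` a prefix of at least one valid name."""
    return [
        tok
        for tok in vocab
        if tok and any(name.startswith(partial + tok) for name in names)
    ]


names = ["fn_add_numbers", "fn_greet", "none"]
toy_vocab = ["fn", "_add", "_greet", "no", "ne", "xyz", "none"]

print(allowed_name_tokens("", toy_vocab, names))    # ['fn', 'no', 'none']
print(allowed_name_tokens("fn", toy_vocab, names))  # ['_add', '_greet']
print(allowed_name_tokens("no", toy_vocab, names))  # ['ne']
```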

Graceful truncation handling

If the model hits max_tokens before completing the JSON, from_json raises a ValueError. The decoder catches it, prints a diagnostic message (including a suggestion to raise max_tokens), and returns {"name": "none"} instead of crashing.
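The fallback pattern might look roughly like the sketch below. `parse_or_fallback` is a hypothetical name, the message wording is invented for the example, and the standard-library `json.loads` is swapped in for `pydantic_core.from_json`; both signal malformed input with a `ValueError`.

```python
import json

FALLBACK = {"name": "none"}


def parse_or_fallback(raw: str, max_tokens: int) -> dict:
    """Parse the decoder output, or return the fallback call on failure."""
    try:
        parsed = json.loads(raw)  # stand-in for pydantic_core.from_json
    except ValueError as exc:
        print(
            f"Could not parse model output ({exc}). It may have been cut off "
            f"at max_tokens={max_tokens}; try raising max_tokens."
        )
        return dict(FALLBACK)
    # Anything that is not a JSON object is treated as a failure too.
    return parsed if isinstance(parsed, dict) else dict(FALLBACK)


truncated = '{"name": "fn_greet", "parameters": {"name": "shr'
print(parse_or_fallback(truncated, max_tokens=16))  # {'name': 'none'}
```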

Performance Analysis

| Metric | Target | Result |
|---|---|---|
| JSON validity | 100% | 100%, guaranteed by construction |
| Function selection accuracy | ≥ 90% | High: the LLM chooses freely among the constrained set of valid names |
| Argument extraction accuracy | ≥ 90% | High: keys are forced exactly, values are type-constrained |
| Speed (full test suite) | < 5 min | Depends on hardware; token-by-token inference is the bottleneck |

Challenges Faced

Last-parameter vs mid-parameter closing: When a parameter value is complete, the correct closing token depends on whether it is the last parameter (}}) or not (,). The decoder compares curr_fn_param_idx to len(fn_params) - 1 to pick the right allowed token, ensuring the JSON always closes correctly.
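The comparison described above reduces to a small helper. The names `curr_fn_param_idx` and `fn_params` come from the text; the function `closing_token` itself is a hypothetical wrapper for illustration.

```python
def closing_token(curr_fn_param_idx: int, fn_params: list[str]) -> str:
    """Pick the only token allowed after a completed parameter value."""
    is_last = curr_fn_param_idx == len(fn_params) - 1
    # "}}" closes both the parameters object and the outer object.
    return "}}" if is_last else ","


fn_params = ["a", "b"]
print(closing_token(0, fn_params))  # ,
print(closing_token(1, fn_params))  # }}
```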

String closing tokens span multiple characters: Tokens like ", and "} are single vocabulary entries, not two separate tokens. The string value constraint therefore tests for them inside next_token (via '",' in next_token or '"}' in next_token) instead of checking character by character. That test needs to know the upcoming token one step ahead, which the decoder gets by peeking at argmax(logits) before applying constraints.
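A small sketch of the peek-then-check idea, under the assumption of a toy four-token vocabulary. `peek_token` and `closes_string` are hypothetical helper names; only the substring test itself is taken from the description above.

```python
def peek_token(logits: list[float], vocab: list[str]) -> str:
    """Look at the model's unconstrained favourite without committing to it."""
    best_id = max(range(len(logits)), key=logits.__getitem__)
    return vocab[best_id]


def closes_string(next_token: str) -> bool:
    """True if the token contains a quote followed by ',' or '}'."""
    return '",' in next_token or '"}' in next_token


toy_vocab = ["hello", '",', '"}', " world"]
toy_logits = [0.2, 1.9, 0.4, 0.1]

next_token = peek_token(toy_logits, toy_vocab)
print(repr(next_token), closes_string(next_token))  # '",' True
```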

Testing Strategy

Testing was driven by iteration rather than a formal test suite. The approach was to run the full pipeline against a growing set of prompts, observe failures, trace them back to a specific gap in the constraint logic, fix it, and repeat.

To go beyond obvious cases, AI was used to generate adversarial prompts — inputs specifically designed to break the decoder. This turned out to be one of the most effective ways to find edge cases, because the failure modes of a constrained decoder are non-obvious and hard to anticipate just by reading the code.

Example Usage

Example input

data/input/function_calling_tests.json:

[
  { "prompt": "What is the sum of 2 and 3?" },
  { "prompt": "Greet shrek" },
  { "prompt": "Reverse the string 'hello'" }
]

Example output

data/output/function_calling_results.json:

[
  {
    "prompt": "What is the sum of 2 and 3?",
    "name": "fn_add_numbers",
    "parameters": { "a": 2.0, "b": 3.0 }
  },
  {
    "prompt": "Greet shrek",
    "name": "fn_greet",
    "parameters": { "name": "shrek" }
  },
  {
    "prompt": "Reverse the string 'hello'",
    "name": "fn_reverse_string",
    "parameters": { "s": "hello" }
  }
]

References

AI Usage

AI was used throughout this project in the following ways:

  • Generating adversarial test prompts: the most impactful use. Given the function definitions and a description of the constraint logic, AI was asked to produce prompts likely to expose edge cases — negative numbers, floats, strings with special characters, ambiguous intents, etc. Each failure found this way led to a concrete fix.
  • Discussing design trade-offs: talked through approaches to the state machine structure, the injection strategy, and how to handle partial JSON parsing before committing to an implementation.
  • README drafting: this README was drafted with AI assistance, then reviewed and edited to match the actual implementation exactly.

About

Introduction to function calling in LLMs
