Call Me Maybe is a Python project that adds reliable function calling to a small language model (Qwen3-0.6B). Given a natural language prompt, it selects the correct function and extracts its arguments as structured JSON.
- Python 3.10 or later
- `uv` for dependency management

```sh
make install
```

```sh
# Default: reads from data/input/, writes to data/output/
make run

# With explicit paths
uv run python -m src \
  --functions_definition data/input/functions_definition.json \
  --input data/input/function_calling_tests.json \
  --output data/output/function_calls.json
```

| Target | Description |
|---|---|
| `make install` | Install project dependencies via uv |
| `make run` | Run the main program |
| `make debug` | Run with Python's built-in debugger (pdb) |
| `make clean` | Remove `__pycache__`, `.mypy_cache`, etc. |
| `make lint` | Run flake8 and mypy with standard flags |

The heart of this project is constrained decoding, a method that guarantees the output is well-formed JSON matching the target function schema, regardless of the model's natural tendencies.
Language models generate text one token at a time. At each step the model outputs logits: raw scores over the entire vocabulary indicating how likely each token is to come next. Constrained decoding intervenes just before token selection:
- Get logits from the model for the current token sequence.
- Determine valid tokens: based on the current generation state and the target schema, scan the vocabulary to find every token that could legally extend the current output.
- Mask invalid tokens: set their logits to `-inf` so they can never be selected.
- Pick greedily (`argmax`) from the remaining valid tokens.
- Append the token, update state, and repeat until generation is complete.
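
The masking and greedy-selection steps above can be sketched in plain Python. This is a minimal illustration, not the project's code: `pick_next_token` and the `is_valid` callback are hypothetical names, and a real decoder would operate on the model's logits tensor rather than a Python list.

```python
import math
from typing import Callable, Sequence


def pick_next_token(
    logits: Sequence[float],
    vocab: Sequence[str],
    is_valid: Callable[[str], bool],
) -> int:
    """Mask invalid tokens to -inf, then return the index of the best valid one."""
    # Invalid tokens get -inf so argmax can never select them.
    masked = [
        score if is_valid(token) else -math.inf
        for token, score in zip(vocab, logits)
    ]
    # Greedy choice: index of the highest remaining score.
    best = max(range(len(masked)), key=masked.__getitem__)
    if masked[best] == -math.inf:
        # Every token was rejected, so the constraint left no legal continuation.
        raise ValueError("constraint mask rejected every token")
    return best


# Toy vocabulary: the model prefers "hello", but only JSON punctuation is allowed.
vocab = ["hello", '"', "{", "world"]
logits = [5.0, 1.0, 2.0, 4.0]
chosen = pick_next_token(logits, vocab, lambda token: token in {'"', "{"})
print(vocab[chosen])  # prints {
```

The explicit check for an all-`-inf` mask is an assumption of this sketch; it makes an empty constraint set fail loudly instead of silently picking index 0.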

```text
Prompt
  ↓
Prepend fixed prefix: '{"name": "'
  ↓
Token loop (up to max_tokens):
  │  get logits from model
  │  _apply_constraints()   ← vocab + current state + schema
  │  argmax → next token
  │  append token to output & input_ids
  │  _update_state()        → may inject structural text
  └─ repeat until END state or stop token
  ↓
Parse final output with pydantic_core.from_json()
```

The output is never generated from a blank slate. Generation starts with the hardcoded prefix `{"name": "`, which immediately forces the model into the right JSON context. After the function name is settled, the string `"parameters": {"` is injected directly into both `output` and `input_ids`, so the model never "sees" a choice here. This keeps the JSON skeleton 100% reliable without spending any constraint budget on it.
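
As a rough sketch of the injection idea, the helper below appends structural text to the decoded output and to the token ids in one step, so the two never drift apart. `inject` and `toy_encode` are hypothetical names introduced for illustration; `toy_encode` is a character-level stand-in for the real tokenizer.

```python
from typing import Callable, List, Tuple

PREFIX = '{"name": "'  # fixed prefix named in the text above


def toy_encode(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one id per character."""
    return [ord(ch) for ch in text]


def inject(
    output: str,
    input_ids: List[int],
    text: str,
    encode: Callable[[str], List[int]],
) -> Tuple[str, List[int]]:
    """Append structural text to both the decoded output and the model input."""
    # The model is never asked to choose these tokens; they are simply added.
    return output + text, input_ids + encode(text)


# Start generation: the prompt is already encoded, the output begins with the prefix.
output, input_ids = inject("", toy_encode("Greet shrek"), PREFIX, toy_encode)
print(output)
```

Returning new values instead of mutating the arguments keeps the sketch easy to test; an in-place version would work equally well.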
A single state dict drives the whole generation loop. Each state has a dedicated branch in `_apply_constraints()` and `_update_state()`, making the logic easy to reason about, test, and extend. Adding support for a new type means adding one branch in `_is_valid_param_value()`.
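
To illustrate the "one branch per type" idea, here is a hypothetical type check over a partially generated value. The real `_is_valid_param_value()` is not shown in this README, so the signature, the type names, and the regular expressions below are all assumptions.

```python
import re

# Prefixes of a JSON-style number: "", "-", "3", "3.", "3.14" (simplified:
# exponents are not handled and leading zeros are not rejected).
_NUMBER_PREFIX = re.compile(r"-?(\d+(\.\d*)?)?")
_INTEGER_PREFIX = re.compile(r"-?\d*")


def is_valid_param_value(partial: str, param_type: str) -> bool:
    """Return True if `partial` could still grow into a value of `param_type`."""
    if param_type == "number":
        return _NUMBER_PREFIX.fullmatch(partial) is not None
    if param_type == "integer":
        return _INTEGER_PREFIX.fullmatch(partial) is not None
    if param_type == "boolean":
        return any(lit.startswith(partial) for lit in ("true", "false"))
    if param_type == "string":
        # Simplification: any text is accepted; escaping is ignored here.
        return True
    # Supporting a new type would mean adding one more branch above.
    raise ValueError(f"unsupported parameter type: {param_type}")


print(is_valid_param_value("-3.", "number"))   # prints True
print(is_valid_param_value("1.2.3", "number"))  # prints False
```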
The list of valid function names includes "none" so the decoder always has a valid completion path even for prompts that don't match any function. This prevents the constraint mask from ever being empty during function-name generation.
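
A minimal sketch of this name constraint, assuming a prefix-matching rule: a token is allowed when the text generated so far plus that token is still a prefix of some candidate name. `allowed_name_tokens` is a hypothetical helper, and the toy vocabulary stands in for the real tokenizer's.

```python
from typing import List, Sequence


def allowed_name_tokens(
    partial: str,
    vocab: Sequence[str],
    function_names: Sequence[str],
) -> List[int]:
    """Return ids of tokens that keep `partial` a prefix of a candidate name."""
    # "none" is always a candidate, so some completion path always exists.
    candidates = list(function_names) + ["none"]
    allowed = []
    for token_id, token in enumerate(vocab):
        # Skip empty tokens; they would never make progress.
        if token and any(name.startswith(partial + token) for name in candidates):
            allowed.append(token_id)
    return allowed


vocab = ["fn_", "add", "no", "ne", "xyz", "n"]
print(allowed_name_tokens("", vocab, ["fn_add", "fn_greet"]))     # prints [0, 2, 5]
print(allowed_name_tokens("fn_", vocab, ["fn_add", "fn_greet"]))  # prints [1]
print(allowed_name_tokens("", vocab, []))                         # prints [2, 5]
```

The last call shows the point of the fallback name: even with no function definitions at all, the mask still admits the tokens that spell `none`.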
If the model hits `max_tokens` before completing the JSON, `from_json` raises a `ValueError`. The decoder catches this, prints a clear diagnostic message (including the suggestion to raise `max_tokens`), and returns `{"name": "none"}` rather than crashing.
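
The fallback behaviour might look roughly like the sketch below. To keep it dependency-free it uses the standard library's `json.loads` as a stand-in for `pydantic_core.from_json` (its `JSONDecodeError` is a subclass of `ValueError`); `parse_or_fallback`, the message wording, and writing to stderr are assumptions.

```python
import json
import sys
from typing import Any, Dict

FALLBACK: Dict[str, Any] = {"name": "none"}


def parse_or_fallback(raw: str, max_tokens: int) -> Dict[str, Any]:
    """Parse the generated JSON, falling back to {"name": "none"} if truncated."""
    try:
        # Stand-in for pydantic_core.from_json (assumption made for this sketch).
        return json.loads(raw)
    except ValueError as exc:
        print(
            f"Could not parse model output ({exc}). "
            f"Generation may have been cut off at max_tokens={max_tokens}; "
            "consider raising max_tokens.",
            file=sys.stderr,
        )
        # Return a fresh copy so callers cannot mutate the shared fallback.
        return dict(FALLBACK)


truncated = '{"name": "fn_add_numbers", "parameters": {"a": 2.0, "b"'
print(parse_or_fallback(truncated, max_tokens=32))  # prints {'name': 'none'}
```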
| Metric | Target | Result |
|---|---|---|
| JSON validity | 100% | 100% — guaranteed by construction |
| Function selection accuracy | ≥ 90% | High — LLM chooses freely among constrained valid names |
| Argument extraction accuracy | ≥ 90% | High — keys forced exactly, values type-constrained |
| Speed (full test suite) | < 5 min | Depends on hardware; token-by-token inference is the bottleneck |
Last-parameter vs mid-parameter closing: When a parameter value is complete, the correct closing token depends on whether it is the last parameter (`}}`) or not (`,`). The decoder compares `curr_fn_param_idx` to `len(fn_params) - 1` to pick the right allowed token, ensuring the JSON always closes correctly.
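
The index comparison described above fits in a few lines. `allowed_closing_token` is a hypothetical helper name; `curr_fn_param_idx` and `fn_params` follow the names used in the text.

```python
from typing import Sequence


def allowed_closing_token(curr_fn_param_idx: int, fn_params: Sequence[str]) -> str:
    """Pick the token that may follow a completed parameter value."""
    is_last = curr_fn_param_idx == len(fn_params) - 1
    # Last parameter: close both the parameters object and the outer object.
    # Otherwise: a comma, so the next key can follow.
    return "}}" if is_last else ","


fn_params = ["a", "b"]
print(allowed_closing_token(0, fn_params))  # prints ,
print(allowed_closing_token(1, fn_params))  # prints }}
```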
String closing tokens span multiple characters: Tokens like `",` and `"}` are single vocabulary entries, not two separate tokens. The string value constraint therefore checks for their presence in `next_token` (via `'",' in next_token or '"}' in next_token`) rather than checking character by character. That check needs to know the next token one step ahead, which the decoder solves by peeking at `argmax(logits)` before applying constraints.
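
A small sketch of the peek-then-check idea follows. The substring test is the one quoted above; `peek_greedy_token` and `closes_string` are hypothetical names, and the logits are a plain list rather than a model tensor.

```python
from typing import Sequence


def peek_greedy_token(logits: Sequence[float], vocab: Sequence[str]) -> str:
    """Look at the model's unconstrained top choice without committing to it."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return vocab[best]


def closes_string(next_token: str) -> bool:
    """True if the token contains a multi-character string terminator."""
    return '",' in next_token or '"}' in next_token


vocab = ["lo", '",', '"}', "!"]
logits = [0.1, 2.5, 0.3, 1.0]

next_token = peek_greedy_token(logits, vocab)
print(closes_string(next_token))  # prints True
```

Peeking first lets the constraint step decide whether the model is trying to end the string before any mask is applied.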
Testing was driven by iteration rather than a formal test suite. The approach was to run the full pipeline against a growing set of prompts, observe failures, trace them back to a specific gap in the constraint logic, fix it, and repeat.
To go beyond obvious cases, AI was used to generate adversarial prompts — inputs specifically designed to break the decoder. This turned out to be one of the most effective ways to find edge cases, because the failure modes of a constrained decoder are non-obvious and hard to anticipate just by reading the code.
data/input/function_calling_tests.json:

```json
[
  { "prompt": "What is the sum of 2 and 3?" },
  { "prompt": "Greet shrek" },
  { "prompt": "Reverse the string 'hello'" }
]
```

data/output/function_calling_results.json:

```json
[
  {
    "prompt": "What is the sum of 2 and 3?",
    "name": "fn_add_numbers",
    "parameters": { "a": 2.0, "b": 3.0 }
  },
  {
    "prompt": "Greet shrek",
    "name": "fn_greet",
    "parameters": { "name": "shrek" }
  },
  {
    "prompt": "Reverse the string 'hello'",
    "name": "fn_reverse_string",
    "parameters": { "s": "hello" }
  }
]
```

- Deep Dive into LLMs like ChatGPT
- Large Language Models explained briefly
- Most devs don't understand how LLM tokens work
- Pydantic v2 Documentation
AI was used throughout this project in the following ways:
- Generating adversarial test prompts: the most impactful use. Given the function definitions and a description of the constraint logic, AI was asked to produce prompts likely to expose edge cases — negative numbers, floats, strings with special characters, ambiguous intents, etc. Each failure found this way led to a concrete fix.
- Discussing design trade-offs: talked through approaches to the state machine structure, the injection strategy, and how to handle partial JSON parsing before committing to an implementation.
- README drafting: this README was drafted with AI assistance, then reviewed and edited to match the actual implementation exactly.