Team: LocalHost
DC Hackathon: Cactus Compute × Google DeepMind FunctionGemma
Final Score: 80.9% — F1: 0.99 · Avg Latency: 548ms · On-Device: 70%
A hybrid AI router that orchestrates Google's FunctionGemma-270M on-device model alongside Gemini 2.5 Flash Lite to achieve a 0.99 function-calling F1 at under 550ms average latency across 30 benchmark cases.
Cloud-only function calling is slow and expensive for simple tasks like setting alarms. Tiny on-device models (270M parameters) are fast and free but structurally broken: they hallucinate wrong tools, fail to parse numbers into JSON, refuse valid prompts, and collapse on multi-intent commands.
We built a routing layer that decides before inference whether each query can be handled locally or needs the cloud — and fixes the local model's outputs when they're partially correct.
The hybrid on-device/cloud routing pattern from this project was reused in PathGuard — an on-device spatial safety system for construction workers, built at the UMD × Ironsite Hackathon 2026. PathGuard applies the same confidence-gated local → cloud fallback architecture to a completely different domain: real-time corridor-based hazard detection with an on-device vision-language model (LFM2.5-VL-1.6B via Cactus) and Gemini cloud rescue. The core idea — run cheap/local first, escalate only when quality drops below a threshold, cap cloud costs with a cooldown timer — transferred directly.
graph TD
classDef device fill:#1e4620,stroke:#2ea043,stroke-width:2px,color:#fff
classDef cloud fill:#0d419d,stroke:#1f6feb,stroke-width:2px,color:#fff
classDef logic fill:#21262d,stroke:#30363d,stroke-width:2px,color:#fff
classDef gate fill:#50141a,stroke:#da3633,stroke-width:2px,color:#fff
classDef pp fill:#3d2b00,stroke:#d29922,stroke-width:2px,color:#fff
Q["User Query"] --> SP["Intent Splitter<br><code>_split_intents</code>"]:::logic
SP --> |"Single Intent"| SC["Difficulty Scorer (0.0–1.0)<br><code>_compute_difficulty</code>"]:::logic
SP --> |"Multi-Intent"| MI["Per-Sub-Intent Routing<br>+ Gap-Fill Merger"]:::logic
SC --> |"≤ 0.30"| T1["Tier 1: Easy"]:::logic
SC --> |"≤ 0.60"| T2["Tier 2: Medium"]:::logic
SC --> |"> 0.60"| T3["Tier 3: Hard"]:::logic
T1 --> D1["FunctionGemma-270M<br>(try RAG=min(2,n), retry RAG=1)"]:::device
T2 --> D2["FunctionGemma-270M<br>(RAG = min(2, tool count))"]:::device
T3 --> C3["Gemini 2.5 Flash Lite<br>+ Post-Processing"]:::cloud
D1 --> PP1["Post-Processing Pipeline<br>(fuzzy match → types → cleanup → NLP extraction)"]:::pp
D2 --> PP2["Post-Processing Pipeline"]:::pp
PP1 --> V1{"Structural Validation<br>+ Semantic Check"}:::gate
PP2 --> QG{"Quality Gate + Structural Validation<br>(refusals + empty args + semantics + required params)"}:::gate
V1 --> |"Pass"| R["Final Tool Call Payload"]
V1 --> |"Fail"| CR1["Cloud Rescue<br>+ Post-Processing"]:::cloud
QG --> |"Pass"| R
QG --> |"Fail"| CR2["Cloud Rescue<br>+ Post-Processing"]:::cloud
CR1 --> R
CR2 --> R
C3 --> R
MI --> R
Compound queries like "Set a timer and send a message" are split at conjunctions (and, also, then) and commas into individual sub-intents. Each sub-intent is independently scored and routed through the tier system.
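A minimal sketch of this decomposition (the conjunction list mirrors the description above; the real _split_intents may handle more separators):

```python
import re

# Conjunctions and commas that separate sub-intents (illustrative list).
SPLIT_PATTERN = re.compile(r"\s*(?:\band\b|\balso\b|\bthen\b|,)\s*", re.IGNORECASE)

def split_intents(query: str) -> list[str]:
    """Split a compound query into sub-intents at conjunction boundaries."""
    parts = [p.strip() for p in SPLIT_PATTERN.split(query)]
    return [p for p in parts if p]  # drop empty fragments left by ", then"
```

For example, "Set a timer and send a message" yields two sub-intents, each of which is scored and routed on its own.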
Before any model inference, a lexical analyzer scores each query from 0.0 to 1.0 across four weighted factors:
| Factor | Weight | What It Measures |
|---|---|---|
| Tool Familiarity | 0.0–0.5 | Ratio of tools the 270M model consistently fails on (send_message, search_contacts, create_reminder) |
| Keyword Signals | 0 or 0.4 | Whether query language ("send", "find", "remind") triggers a known-hard tool that is present in the tool set |
| Intent Count | 0.0–0.3 | Penalty of 0.15 per additional intent |
| Tool Count | 0.0–0.2 | Penalty of 0.05 per additional distractor tool |
The score determines the routing tier:
- Tier 1 (≤ 0.3): On-device with semantic validation. If the first attempt (RAG = min(2, tool count)) fails, retries with narrower tool selection (RAG=1). Falls back to cloud only after two local failures.
- Tier 2 (0.3–0.6): On-device with a full quality gate (refusal detection + argument validation + semantic check). Single attempt before cloud rescue.
- Tier 3 (> 0.6): Cloud-first. Skips the local model entirely. Falls back to on-device only if the cloud fails.
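The scorer and tier thresholds can be sketched as follows. The weights come from the table above; the exact heuristics inside _compute_difficulty (and the full hard-tool and keyword lists) are assumptions here:

```python
# Known-hard tools and trigger keywords, per the factor table (illustrative subsets).
HARD_TOOLS = {"send_message", "search_contacts", "create_reminder"}
HARD_KEYWORDS = {"send", "find", "remind"}

def compute_difficulty(query: str, tools: list[str], intent_count: int) -> float:
    words = set(query.lower().split())
    hard_present = [t for t in tools if t in HARD_TOOLS]
    # Tool familiarity: ratio of known-hard tools in the tool set (max 0.5).
    score = 0.5 * (len(hard_present) / len(tools)) if tools else 0.0
    # Keyword signal: flat 0.4 if a hard keyword targets a hard tool that is present.
    if hard_present and words & HARD_KEYWORDS:
        score += 0.4
    # Intent count: 0.15 per additional intent, capped at 0.3.
    score += min(0.3, 0.15 * (intent_count - 1))
    # Tool count: 0.05 per additional distractor tool, capped at 0.2.
    score += min(0.2, 0.05 * (len(tools) - 1))
    return min(1.0, score)

def route_tier(score: float) -> int:
    """Map a difficulty score to a routing tier."""
    return 1 if score <= 0.30 else (2 if score <= 0.60 else 3)
```

A bare "Set an alarm for 6 AM" with one tool scores near 0.0 and stays on-device; a multi-intent query over several distractor tools that mentions "send" climbs past 0.6 and goes cloud-first.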
Every result — both on-device and cloud — passes through a four-stage cleanup pipeline inside generate_cactus and generate_cloud, before any validation gates run. This means the validation gates always operate on cleaned, normalized outputs.
Fuzzy Tool Name Matching (_fuzzy_match_schema): If the model outputs a slightly misspelled tool name, Levenshtein distance snaps it to the closest valid tool within edit distance 4.
Type Coercion & Clamping (_fix_types): Forces float → int conversion for integer schema fields. Clamps negative hallucinations (the model outputs values like minutes=-300) to their absolute value. Snaps misspelled enum values to the nearest valid option via Levenshtein distance (within edit distance 3).
String Cleanup (_clean_args): Strips trailing punctuation, leading articles ("the", "a", "an"), and stray quotes from string arguments.
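Stages two and three might look like this (a simplified sketch; the real _fix_types and _clean_args handle more cases, including enum snapping via the same Levenshtein helper):

```python
import re

def fix_types(args: dict, int_fields: set[str]) -> dict:
    """Coerce floats to ints for integer schema fields; clamp negatives to abs()."""
    fixed = {}
    for key, value in args.items():
        if key in int_fields and isinstance(value, float):
            value = int(value)
        if isinstance(value, (int, float)):
            value = abs(value)  # e.g. hallucinated minutes=-300 becomes 300
        fixed[key] = value
    return fixed

def clean_args(args: dict) -> dict:
    """Strip stray quotes, leading articles, and trailing punctuation from strings."""
    cleaned = {}
    for key, value in args.items():
        if isinstance(value, str):
            value = value.strip().strip('"\'')
            value = re.sub(r"^(the|a|an)\s+", "", value, flags=re.IGNORECASE)
            value = value.rstrip(".,!?")
        cleaned[key] = value
    return cleaned
```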
NLP Argument Extraction (_extract_args_from_query): The critical fix. The 270M model fundamentally cannot parse natural language numbers into JSON integers. This regex-based extractor pulls correct values directly from the user's text and overwrites the model's broken output:
"6 AM"→{"hour": 6, "minute": 0}"7:30 PM"→{"hour": 19, "minute": 30}"10 minutes"→{"minutes": 10}
It also handles refusal interception: if the model outputs zero calls but the query contains "wake", a synthetic set_alarm call is injected so the argument parser can rescue the response.
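A minimal version of the extractor, covering just the clock-time and duration patterns listed above (the real _extract_args_from_query covers more argument shapes):

```python
import re

def extract_time_args(query: str) -> dict:
    """Pull alarm/timer arguments straight from the user's text (illustrative)."""
    # Clock times: "6 AM", "7:30 PM"
    m = re.search(r"\b(\d{1,2})(?::(\d{2}))?\s*(am|pm)\b", query, re.IGNORECASE)
    if m:
        hour, minute = int(m.group(1)), int(m.group(2) or 0)
        if m.group(3).lower() == "pm" and hour != 12:
            hour += 12          # 7 PM -> 19
        if m.group(3).lower() == "am" and hour == 12:
            hour = 0            # 12 AM -> 0
        return {"hour": hour, "minute": minute}
    # Durations: "10 minutes"
    m = re.search(r"\b(\d+)\s*minutes?\b", query, re.IGNORECASE)
    if m:
        return {"minutes": int(m.group(1))}
    return {}
```

Because the values come from the user's text rather than the model, the 270M model's broken number handling never reaches the final payload.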
After post-processing, the routing layer validates tool selection. A keyword-to-tool mapping catches the 270M model's most dangerous failure mode: confidently selecting the wrong tool.
If the user says "Play jazz music" but the model selects set_alarm, the gate detects that "play" and "music" map to play_music, not set_alarm. It confirms another tool is a better match and kills the hallucinated result, triggering a cloud rescue.
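The gate's logic, sketched with a tiny keyword table (the real mapping in _semantic_check is larger; this is an assumption about its shape):

```python
# Illustrative keyword-to-tool mapping.
KEYWORD_TO_TOOL = {
    "play": "play_music", "music": "play_music",
    "wake": "set_alarm", "alarm": "set_alarm",
    "timer": "set_timer", "send": "send_message",
}

def semantic_check(query: str, selected_tool: str, available: set[str]) -> bool:
    """Return False when query keywords point to a different available tool."""
    words = query.lower().replace(",", " ").split()
    expected = {KEYWORD_TO_TOOL[w] for w in words if w in KEYWORD_TO_TOOL}
    if expected and selected_tool not in expected:
        # Only kill the result if a better-matching tool actually exists.
        return not (expected & available)
    return True
```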
A broader post-inference check used at Tier 2 that layers four validations on the already-post-processed output:
- Empty calls — the model returned nothing
- Refusal detection — the model output phrases like "I cannot", "I apologize", or "which song" instead of making a tool call
- Argument value check — any present argument value is None or an empty string (catches the model calling the right tool but filling in garbage)
- Semantic check — delegates to _semantic_check as the final layer
A separate check that runs alongside the semantic and quality gates. It verifies two things: (1) the tool name actually exists in the provided schema, and (2) all required parameters declared in the schema are present in the arguments dict. This catches cases where the model hallucinates a tool name that fuzzy matching couldn't fix, or where it calls the right tool but omits a required field entirely.
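Both conditions reduce to a short check per call (a sketch; the real _validate_calls operates on the Cactus schema format):

```python
def validate_call(call: dict, schemas: dict[str, dict]) -> bool:
    """Check the tool exists and every schema-required parameter is present."""
    schema = schemas.get(call.get("name", ""))
    if schema is None:
        return False  # hallucinated tool name that fuzzy matching couldn't fix
    required = schema.get("required", [])
    return all(p in call.get("arguments", {}) for p in required)
```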
For multi-intent queries, each sub-intent is independently routed through the 3-tier system. If the decomposition misses intents (e.g., the local model only handles one of two requested tools), the system sends the full original query to Gemini as a fallback — but strictly filters its response to keep only the missing tools, merging them with the successful local calls.
This preserves the on-device ratio while ensuring complete coverage.
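The merge step can be sketched as follows, assuming the expected tool set is known from intent splitting (function name and signature are illustrative):

```python
def gap_fill_merge(local_calls: list[dict], cloud_calls: list[dict],
                   expected_tools: set[str]) -> list[dict]:
    """Keep local calls; take from the cloud only the tools the local pass missed."""
    covered = {c["name"] for c in local_calls}
    missing = expected_tools - covered
    rescued = [c for c in cloud_calls if c["name"] in missing]
    return local_calls + rescued
```

Cloud calls for tools the local model already handled are discarded, so a successful on-device call is never overwritten.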
Persistent Model Handle (_get_model): The FunctionGemma-270M model is loaded once into RAM and reused across all calls via a lazy-init singleton.
Cached Cloud Client (_get_cloud_client): The google.genai client is wrapped in a singleton that reuses TCP connections, avoiding repeated TLS handshake overhead on cloud calls.
Model Selection: Cloud inference uses gemini-2.5-flash-lite — the fastest available Gemini model.
Token Budget: On-device generation is capped at max_tokens=128 to minimize local inference time.
Robust JSON Parsing: When json.loads fails on malformed model output, a depth-tracking brace parser extracts the first complete JSON object from the raw string.
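A sketch of such a parser, tracking brace depth while ignoring braces inside quoted strings (an assumption about the implementation details):

```python
import json

def extract_first_json(raw: str):
    """Find the first balanced {...} object in raw text and parse it."""
    start = raw.find("{")
    if start == -1:
        return None
    depth, in_str, escape = 0, False, False
    for i, ch in enumerate(raw[start:], start):
        if escape:
            escape = False          # skip the escaped character
        elif ch == "\\":
            escape = in_str         # backslash only escapes inside strings
        elif ch == '"':
            in_str = not in_str
        elif not in_str:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:      # outermost object closed: parse it
                    try:
                        return json.loads(raw[start:i + 1])
                    except json.JSONDecodeError:
                        return None
    return None
```

This salvages outputs where the model wraps valid JSON in chatter or trailing junk.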
The benchmark uses a weighted formula:
level_score = (0.60 × F1) + (0.15 × time_score) + (0.25 × on_device_ratio)
total_score = (0.20 × easy) + (0.30 × medium) + (0.50 × hard)
Where time_score = max(0, 1 - avg_time / 500ms): a 0ms response earns full time marks, the score decays linearly to zero at 500ms, and anything slower earns nothing.
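The formula is easy to sanity-check in code (the inputs below are illustrative, not the per-level numbers from our run):

```python
def level_score(f1: float, avg_time_ms: float, on_device_ratio: float) -> float:
    """Per-difficulty score: accuracy-dominant, with latency and locality bonuses."""
    time_score = max(0.0, 1 - avg_time_ms / 500)
    return 0.60 * f1 + 0.15 * time_score + 0.25 * on_device_ratio

def total_score(easy: float, medium: float, hard: float) -> float:
    """Hard cases carry half the total weight."""
    return 0.20 * easy + 0.30 * medium + 0.50 * hard

# Example: perfect F1 at 400ms with 70% on-device
# -> 0.60*1.0 + 0.15*0.2 + 0.25*0.7 = 0.805
```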
Hard queries (multi-intent, 4–5 distractor tools) are worth 50% of the total score — making the hard-difficulty F1 jump from 0.50 → 0.97 the single biggest driver of our result.
| Difficulty | Queries | F1 | Strategy |
|---|---|---|---|
| Easy | 10 | ~1.00 | Tier 1 on-device with semantic check |
| Medium | 10 | ~1.00 | Tier 2 on-device with quality gate + cloud rescue |
| Hard | 10 | ~0.97 | Per-sub-intent routing + gap-fill merging |
| Overall | 30 | 0.99 | 70% on-device · 548ms avg latency |
Final Objective Score: 80.9%
We observed top-ranking teams achieving 16ms latencies by hardcoding regex matches against exact benchmark query strings — effectively bypassing the AI model entirely.
We built a generalizable zero-shot system. Our 0.99 F1 score comes from algorithmic routing and post-processing, not memorization of the eval set.
- Python 3.10+
- Cactus SDK (built with Python bindings)
- FunctionGemma-270M-it weights (placed in cactus/weights/functiongemma-270m-it/)
- Gemini API key
# Clone the repo
git clone https://github.com/your-username/localhost-router.git
cd localhost-router
# Clone and build Cactus
git clone https://github.com/cactus-compute/cactus
cd cactus && source ./setup && cd ..
cactus build --python
# Download model weights
cactus pull functiongemma-270m-it
# Configure environment
echo 'GEMINI_API_KEY=your_key_here' > .env
# Install cloud dependency
pip install google-genai

# Run the full 30-case benchmark
python benchmark.py
# Submit to the hackathon leaderboard
python submit.py --team "YourTeamName" --location "DC"

├── main.py                       # Core hybrid router (688 lines)
│ ├── generate_hybrid() # Entry point — 3-tier routing orchestrator
│ ├── generate_cactus() # On-device inference via Cactus SDK
│ ├── generate_cloud() # Cloud inference via Gemini API
│ ├── _compute_difficulty() # Pre-routing difficulty scorer
│ ├── _semantic_check() # Tool-selection hallucination detector
│ ├── _quality_gate() # Post-inference quality validation
│ ├── _validate_calls() # Structural validity check (name + required params)
│ ├── _split_intents() # Multi-intent query decomposition
│ ├── _extract_args_from_query()# NLP argument extraction + refusal rescue
│ ├── _fuzzy_match_schema() # Full post-processing pipeline orchestrator
│ ├── _fix_types() # Type coercion + negative clamping + enum snapping
│ ├── _clean_args() # String normalization
│ ├── _levenshtein() # Edit distance for fuzzy matching
│ ├── _get_model() # Persistent on-device model singleton
│ ├── _get_cloud_client() # Persistent Gemini client singleton
│ └── _load_env() # .env file loader
├── benchmark.py # 30-case evaluation suite with F1 scoring
├── submit.py # Leaderboard submission client
└── .env # API keys (not committed)
| Component | Technology |
|---|---|
| On-Device Model | FunctionGemma-270M-it via Cactus SDK |
| Cloud Model | Gemini 2.5 Flash Lite via google-genai |
| Language | Python 3.10+ |
| On-Device Runtime | Cactus Compute |
Integrating cactus_transcribe (Whisper-small) to build a voice-to-action terminal — spoken commands routed through the 3-tier system and executed as real device actions.
LocalHost DC — Built at the Cactus Compute × Google DeepMind FunctionGemma Hackathon (AI Tinkerers, Washington DC).