Tested Models

Benchmarks run with /model-test on an AMD Ryzen 5 2400G (4 cores, 15 GB RAM) via a remote Ollama instance over a Cloudflare Tunnel.

Test Suite (v1.3.3):

  • Reasoning — 20 puzzle tests (logic, math, spatial, commonsense, etc.)
  • Instructions — Multi-step JSON schema compliance
  • Tool Usage — Chained tool call generation
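
The Instructions leg of the suite comes down to parsing the model's reply as JSON and comparing fields against expected values. A minimal sketch of that kind of check — the validator and the field set are illustrative (field names mirror the sample report output below), not the actual /model-test implementation:

```python
import json

# Expected values for the schema-compliance check (illustrative set,
# taken from the fields visible in the sample report output).
EXPECTED = {
    "can_count": True,
    "sum": 42,
    "language": "English",
    "colors": ["red", "blue", "green"],
}

def check_instructions(reply: str) -> str:
    """Grade a model reply: STRONG if valid JSON with correct values, else FAIL."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return "FAIL"  # e.g. markdown-wrapped, truncated, or bad control chars
    ok = all(data.get(k) == v for k, v in EXPECTED.items())
    return "STRONG" if ok else "FAIL"

reply = ('{"name":"AI","can_count":true,"sum":42,"language":"English",'
         '"colors":["red","blue","green"],"timestamp":"2023-10-05T12:00:00Z"}')
print(check_instructions(reply))  # STRONG
```

Note that this grading is all-or-nothing: a reply that wraps otherwise-correct JSON in markdown fences fails at the parse step, which matches several of the small-model failures recorded below.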

Ollama Models — openai-completions

| Model | Reasoning | Instructions | Tool Usage | Score |
| --- | --- | --- | --- | --- |
| deepseek-r1:1.5b | 8/20 | ❌ FAIL | ❌ ERROR | 1/3 |
| functiongemma:270m | 4/20 | ❌ FAIL | ✅ STRONG | 1/3 |
| gemma3:270m | 6/20 | ❌ FAIL | ❌ ERROR | 0/3 |
| granite3.1-moe:1b | 8/20 | ❌ FAIL | ❌ FAIL | 1/3 |
| granite4:1b | 0/20 | ❌ FAIL | ❌ ERROR | 0/3 |
| granite4:350m | 7/20 | ❌ FAIL | ✅ STRONG | 2/3 |
| llama3.2:1b | 8/20 | ❌ FAIL | ✅ STRONG | 2/3 |
| qwen:0.5b | 4/20 | ❌ FAIL | ❌ ERROR | 0/3 |
| qwen2:0.5b | 5/20 | ❌ FAIL | ❌ ERROR | 0/3 |
| qwen2.5:0.5b | 10/20 | ❌ FAIL | ✅ STRONG | 2/3 |
| qwen3:0.6b | 6/20 | ❌ FAIL | ✅ STRONG | 1/3 |

Notes:

  • deepseek-r1:1.5b — instructions FAIL (bad control character in JSON), tool usage ERROR (model does not support tools).
  • functiongemma:270m — instructions FAIL (empty streaming response), tool usage STRONG (chained: get_weather, calculate).
  • gemma3:270m — instructions FAIL (markdown-wrapped JSON), tool usage ERROR (model does not support tools).
  • granite3.1-moe:1b — instructions FAIL, tool usage FAIL (malformed tool calls).
  • granite4:1b — OOM (requires 13.0 GiB, only 12.2 GiB available), all 0/20 reasoning ERROR.
  • granite4:350m — instructions FAIL, tool usage STRONG.
  • llama3.2:1b — instructions FAIL, tool usage STRONG (chained: get_weather, calculate).
  • qwen:0.5b — instructions FAIL, tool usage ERROR (model does not support tools).
  • qwen2:0.5b — instructions FAIL (Python embedded in JSON output), tool usage ERROR (model does not support tools).
  • qwen2.5:0.5b — instructions FAIL, tool usage STRONG (chained: get_weather, calculate).
  • qwen3:0.6b — instructions FAIL, tool usage STRONG (chained: get_weather, calculate).
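
Several of the failures above are presentation problems rather than wrong values — gemma3:270m, for example, returned correct JSON wrapped in markdown fences. A hedged sketch of a pre-parse cleanup that would rescue that class of reply (the helper is hypothetical, not part of /model-test, which grades the raw output):

```python
import json
import re

def strip_fences(reply: str) -> str:
    """Remove a ```json ... ``` markdown wrapper if present; otherwise return as-is."""
    m = re.search(r"```(?:json)?\s*(.*?)\s*```", reply, re.DOTALL)
    return m.group(1) if m else reply

wrapped = '```json\n{"sum": 42}\n```'
print(json.loads(strip_fences(wrapped)))  # {'sum': 42}
print(json.loads(strip_fences('{"sum": 42}')))  # unwrapped replies pass through
```

This would not help the other failure modes noted here — truncated JSON, embedded Python, or bad control characters still fail at the parse step.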

Cloud Providers — openai-completions

| Model | Provider | Reasoning | Instructions | Tool Usage | Score |
| --- | --- | --- | --- | --- | --- |
| glm-4.5-flash | ZAI | 11/20 | ❌ FAIL | ✅ STRONG | 2/3 |
| minimax/minimax-m2.5:free | OpenRouter | 9/20 | ✅ STRONG | ✅ STRONG | 2/3 |
| nvidia/nemotron-3-nano-30b-a3b:free | OpenRouter | 13/20 | ✅ STRONG | ✅ STRONG | 3/3 |
| openai/gpt-oss-120b:free | OpenRouter | 14/20 | ✅ STRONG | MODERATE | 3/3 |
| poolside/laguna-m.1:free | OpenRouter | 13/20 | ✅ STRONG | MODERATE | 3/3 |
| poolside/laguna-xs.2:free | OpenRouter | 17/20 | ✅ STRONG | MODERATE | 3/3 |

Notes:

  • glm-4.5-flash — reasoning MODERATE (11/20), instructions FAIL (truncated JSON), tool usage STRONG (chained: get_weather, calculate).
  • minimax/minimax-m2.5:free — reasoning WEAK (9/20, many ERROR results), instructions STRONG, tool usage STRONG (chained: get_weather, calculate).
  • nvidia/nemotron-3-nano-30b-a3b:free — reasoning MODERATE (13/20), instructions STRONG, tool usage STRONG (chained: get_weather, calculate). MoE with 30B total / 3B active params.
  • openai/gpt-oss-120b:free — reasoning MODERATE (14/20), instructions STRONG, tool usage MODERATE (only called get_weather, missed calculate).
  • poolside/laguna-m.1:free — reasoning MODERATE (13/20), instructions STRONG, tool usage MODERATE (only called get_weather, missed calculate).
  • poolside/laguna-xs.2:free — reasoning STRONG (17/20), instructions STRONG, tool usage MODERATE.
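
The STRONG/MODERATE split in the tool-usage column comes down to whether the model emitted every call in the expected chain or only part of it. A minimal sketch of that grading, assuming a two-tool chain as in the reports (the scoring function itself is an assumption, not the /model-test source):

```python
# Tools the chained-call prompt expects, per the report notes.
EXPECTED_TOOLS = {"get_weather", "calculate"}

def grade_tool_usage(calls: list[str]) -> str:
    """Grade a chained tool-call attempt by coverage of the expected tool set."""
    called = set(calls)
    if called >= EXPECTED_TOOLS:
        return "STRONG"    # full chain: get_weather and calculate
    if called:
        return "MODERATE"  # partial chain, e.g. only get_weather
    return "ERROR"         # no calls at all (or tools unsupported)

print(grade_tool_usage(["get_weather", "calculate"]))  # STRONG
print(grade_tool_usage(["get_weather"]))               # MODERATE
print(grade_tool_usage([]))                            # ERROR
```

Under this reading, gpt-oss-120b and both laguna models landed on MODERATE by calling get_weather but never following up with calculate.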

Sample Report — poolside/laguna-m.1:free via OpenRouter

[model-test-report]                                                                                                 
                                                                                                                     
   ⚡ Pi Model Benchmark v1.3.1                                                                                      
   Written by VTSTech                                                                                                
   GitHub: https://github.com/VTSTech                                                                                
   Website: www.vts-tech.org (http://www.vts-tech.org)                                                               
                                                                                                                     
 ── MODEL: poolside/laguna-m.1:free ─────────────────────────                                                        
   ℹ️  Provider: openrouter (builtin)                                                                                
                                                                                                                     
 ── REASONING TEST (EXTENDED) ───────────────────────────────                                                        
   ℹ️  Testing 20 reasoning puzzles...                                                                               
   ⚠️  ❌ snail_wall (logic): WEAK - expected "8", got "2" [ (expected: 8, got: 2)]                                  
   ✅ ✅ math_sequence (math): STRONG - expected "162", got "162" [ (expected: 162, got: 162)]                       
   ❌ ❌ spatial_directions (spatial): FAIL - expected "south", got "?" [ (expected: south, got: ?)]                 
   ❌ ❌ commonsense (commonsense): FAIL - expected "the other side", got "?" [ (expected: the other side, got: ?)]  
   ❌ ❌ code_simplify (code): FAIL - expected "15", got "?" [ (expected: 15, got: ?)]                               
   ✅ ✅ bat_and_ball (counterint): STRONG - expected "5", got "5" [ (expected: 5, got: 5)]                          
   ⚠️  ✅ scale_weight (counterint): MODERATE - expected "400", got "400" [ (expected: 400, got: 400)]               
   ✅ ✅ syllogism (logic): STRONG - expected "warm-blooded", got "warm-blooded" [ (expected: warm-blooded, got: warm-blooded)]
   ✅ ✅ if_then_chain (logic): STRONG - expected "grass grows", got "grass grows" [ (expected: grass grows, got: grass grows)]
   ✅ ✅ cause_effect (causal): STRONG - expected "grows", got "grows" [ (expected: grows, got: grows)]              
   ✅ ✅ relative_quantities (comparative): STRONG - expected "15", got "15" [ (expected: 15, got: 15)]              
   ❌ ❌ analogy_1 (analogy): FAIL - expected "room", got "?" [ (expected: room, got: ?)]                            
   ❌ ❌ analogy_2 (analogy): FAIL - expected "boot", got "?" [ (expected: boot, got: ?)]                            
   ✅ ✅ physics_1 (commonsense): STRONG - expected "bowling ball", got "bowling ball" [ (expected: bowling ball, got: bowling ball)]
   ✅ ✅ physics_2 (commonsense): STRONG - expected "hot", got "hot" [ (expected: hot, got: hot)]                    
   ⚠️  ✅ objects_1 (commonsense): MODERATE - expected "scissors", got "scissors" [ (expected: scissors, got: scissors)]
   ✅ ✅ social_1 (commonsense): STRONG - expected "polite", got "polite" [ (expected: polite, got: polite)]         
   ❌ ❌ animals_1 (commonsense): FAIL - expected "water", got "?" [ (expected: water, got: ?)]                      
   ⚠️  ✅ gk_1 (commonsense): MODERATE - expected "mars", got "mars" [ (expected: mars, got: mars)]                  
   ⚠️  ✅ gk_2 (commonsense): MODERATE - expected "366", got "366" [ (expected: 366, got: 366)]                      
   ✅ Average score: MODERATE                                                                                        
                                                                                                                     
 ── INSTRUCTION FOLLOWING TEST (EXTENDED) ───────────────────                                                        
   ℹ️  Testing multi-step JSON schema compliance...                                                                  
   ℹ️  Time: 24.2s                                                                                                   
   ✅ JSON output valid with correct values (STRONG)                                                                 
   ℹ️  Output:                                                                                                       
 {"name":"Poolside","can_count":true,"sum":42,"language":"English","colors":["red","blue","green"],"timestamp":"2023-10-05T12:34:56Z"}
                                                                                                                     
 ── TOOL USAGE TEST (EXTENDED) ──────────────────────────────                                                        
   ℹ️  Testing chained tool calls...                                                                                 
   ℹ️  Time: 388ms                                                                                                   
   ✅ Tool calls: get_weather (MODERATE)                                                                             
   ℹ️  Response: I'll get the weather in Tokyo and calculate 15*24 for you.                                          
                                                                                                                     
 ── SUMMARY ─────────────────────────────────────────────────                                                        
   ✅ Reasoning: MODERATE                                                                                            
   ✅ Instructions: STRONG                                                                                           
   ✅ Tool Usage: MODERATE                                                                                           
   ℹ️  Total time: 14.1m                                                                                             
   ℹ️  Score: 3/3 tests passed                                                                                       
                                                                                                                     
   ℹ️  Detailed: Reasoning 13/20 tests passed, Instructions 1/1, Tool Usage 1/1                                      
                                                                                                                     
 ── RECOMMENDATION ──────────────────────────────────────────                                                        
   ❌ poolside/laguna-m.1:free is WEAK — limited capabilities for agent use

Experimental BitNet Results (BitNet works with /model-test but not with Pi in general; it errors on the tools param)

   ⚡ Pi Model Benchmark v1.3.3                                                                                      
   Written by VTSTech                                                                                                
   GitHub: https://github.com/VTSTech                                                                                
   Website: www.vts-tech.org (http://www.vts-tech.org)                                                               
                                                                                                                     
 ── MODEL: bitnet-b1.58-2B-4T ───────────────────────────────                                                        
   ℹ️  Provider: bitnet (builtin)                                                                                    
                                                                                                                     
 ── REASONING TEST (EXTENDED) ───────────────────────────────                                                        
   ℹ️  Testing 20 reasoning puzzles...                                                                               
   ❌ ❌ snail_wall (logic): ERROR - expected "8", got "?"                                                           
   ✅ ✅ math_sequence (math): STRONG - expected "162", got "162" [ (expected: 162, got: 162)]                       
   ⚠️  ❌ spatial_directions (spatial): WEAK - expected "south", got "?" [ (expected: south, got: ?)]                
   ⚠️  ❌ commonsense (commonsense): WEAK - expected "the other side", got "?" [ (expected: the other side, got: ?)] 
   ❌ ❌ code_simplify (code): FAIL - expected "15", got "0" [ (expected: 15, got: 0)]                               
   ✅ ✅ bat_and_ball (counterint): STRONG - expected "5", got "5" [ (expected: 5, got: 5)]                          
   ✅ ✅ scale_weight (counterint): STRONG - expected "400", got "400" [ (expected: 400, got: 400)]                  
   ✅ ✅ syllogism (logic): STRONG - expected "warm-blooded", got "warm-blooded" [ (expected: warm-blooded, got: warm-blooded)]
   ✅ ✅ if_then_chain (logic): STRONG - expected "grass grows", got "grass grows" [ (expected: grass grows, got: grass grows)]
   ⚠️  ❌ cause_effect (causal): WEAK - expected "grows", got "?" [ (expected: grows, got: ?)]                       
   ✅ ✅ relative_quantities (comparative): STRONG - expected "15", got "15" [ (expected: 15, got: 15)]              
   ❌ ❌ analogy_1 (analogy): FAIL - expected "room", got "?" [ (expected: room, got: ?)]                            
   ⚠️  ❌ analogy_2 (analogy): WEAK - expected "boot", got "?" [ (expected: boot, got: ?)]                           
   ✅ ✅ physics_1 (commonsense): STRONG - expected "bowling ball", got "bowling ball" [ (expected: bowling ball, got: bowling ball)]
   ✅ ✅ physics_2 (commonsense): STRONG - expected "hot", got "hot" [ (expected: hot, got: hot)]                    
   ⚠️  ❌ objects_1 (commonsense): WEAK - expected "scissors", got "?" [ (expected: scissors, got: ?)]               
   ✅ ✅ social_1 (commonsense): STRONG - expected "polite", got "polite" [ (expected: polite, got: polite)]         
   ⚠️  ❌ animals_1 (commonsense): WEAK - expected "water", got "?" [ (expected: water, got: ?)]                     
   ✅ ✅ gk_1 (commonsense): STRONG - expected "mars", got "mars" [ (expected: mars, got: mars)]                     
   ✅ ✅ gk_2 (commonsense): STRONG - expected "366", got "366" [ (expected: 366, got: 366)]                         
   ✅ Average score: MODERATE                                                                                        
                                                                                                                     
 ── INSTRUCTION FOLLOWING TEST (EXTENDED) ───────────────────                                                        
   ℹ️  Testing multi-step JSON schema compliance...                                                                  
   ℹ️  Time: 16.9s                                                                                                   
   ✅ JSON output valid with correct values (STRONG)                                                                 
   ℹ️  Output: {"name":"AI Assistant","can_count":true,"sum":42,"language":"English","colors":["red","blue","green"],"timestamp":"2023-10-05T14:48:00.123456"}
                                                                                                                     
 ── TOOL USAGE TEST (EXTENDED) ──────────────────────────────                                                        
   ℹ️  Testing chained tool calls...                                                                                 
   ℹ️  Time: 0ms                                                                                                     
   ❌ Tool calls: none (ERROR)                                                                                       
   ℹ️  Response: OpenAI API returned 500: {"error":{"code":500,"message":"Unsupported param: tools","type":"server_error"}}
                                                                                                                     
 ── SUMMARY ─────────────────────────────────────────────────                                                        
   ✅ Reasoning: MODERATE                                                                                            
   ✅ Instructions: STRONG                                                                                           
   ❌ Tool Usage: ERROR                                                                                              
   ℹ️  Total time: 6.9m                                                                                              
   ℹ️  Score: 2/3 tests passed                                                                                       
                                                                                                                     
   ℹ️  Detailed: Reasoning 11/20 tests passed, Instructions 1/1, Tool Usage 0/1                                      
                                                                                                                     
 ── RECOMMENDATION ──────────────────────────────────────────                                                        
   ❌ bitnet-b1.58-2B-4T is WEAK — limited capabilities for agent use