
Test Models Feature Documentation

Overview

The Test Models page provides a dedicated interface for testing LLM prompts directly without needing existing logs. This feature complements the existing Replay functionality by allowing users to experiment with fresh prompts and compare model responses.

Features

✨ Direct Model Testing

  • Multiple Provider Support: Test with OpenAI, Ollama, Mistral, and OpenRouter
  • Dynamic Model Selection: Automatically loads available models for each provider
  • Real-time Configuration: Adjust temperature, max tokens, and system messages
  • Instant Results: See responses, token usage, cost estimates, and performance metrics

🔄 Compare Mode

  • Side-by-Side Testing: Run multiple test configurations simultaneously
  • Multi-Provider Comparison: Compare the same prompt across different models
  • Performance Benchmarking: Compare response times, costs, and token usage

💾 Prompt Management

  • Save Prompts: Save frequently used prompts for reuse
  • Example Library: Pre-built example prompts for different use cases
  • Quick Loading: Easily load saved or example prompts into any test configuration

📊 Advanced Features

  • Cost Estimation: Preview estimated costs before running tests
  • Token Counting: Real-time character and estimated token counting
  • Validation: Pre-flight validation of test configurations
  • Response Analysis: Detailed metrics including latency, tokens, and costs

API Endpoints

Test Prompt

POST /api/test/prompt

Test a single prompt with the specified provider and model.

Request Body:

{
  "prompt": "Your test prompt here",
  "provider": "openai",
  "model": "gpt-3.5-turbo",
  "systemMessage": "You are a helpful assistant.",
  "temperature": 0.7,
  "maxTokens": 1000
}

Response:

{
  "success": true,
  "data": {
    "requestId": "test-123456789",
    "provider": "openai",
    "model": "gpt-3.5-turbo",
    "response": "Assistant response here...",
    "tokenUsage": {
      "promptTokens": 15,
      "completionTokens": 25,
      "totalTokens": 40
    },
    "cost": 0.0012,
    "duration": 1250,
    "timestamp": "2023-12-07T10:30:00Z"
  }
}

Compare Models

POST /api/test/compare

Test the same prompt with multiple models.

Request Body:

{
  "prompt": "Explain quantum computing",
  "models": [
    {
      "provider": "openai",
      "model": "gpt-3.5-turbo",
      "temperature": 0.7,
      "maxTokens": 500
    },
    {
      "provider": "ollama",
      "model": "llama2",
      "temperature": 0.7,
      "maxTokens": 500
    }
  ]
}
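
The response format for this endpoint is not documented above. A plausible shape, mirroring the single-test response (the results array and its layout are assumptions, not confirmed fields):

Response (illustrative sketch):

{
  "success": true,
  "data": {
    "results": [
      {
        "provider": "openai",
        "model": "gpt-3.5-turbo",
        "response": "...",
        "tokenUsage": { "promptTokens": 5, "completionTokens": 120, "totalTokens": 125 },
        "cost": 0.0004,
        "duration": 980
      },
      {
        "provider": "ollama",
        "model": "llama2",
        "response": "...",
        "tokenUsage": { "promptTokens": 5, "completionTokens": 140, "totalTokens": 145 },
        "cost": 0,
        "duration": 2100
      }
    ]
  }
}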

Get Available Models

GET /api/test/models

Retrieve available models for all configured providers.
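No example payload is shown above. A plausible response shape, grouping model names by provider (the exact structure is an assumption):

Response (illustrative sketch):

{
  "success": true,
  "data": {
    "openai": ["gpt-3.5-turbo", "gpt-4"],
    "ollama": ["llama2", "mistral"],
    "mistral": ["mistral-small", "mistral-large"],
    "openrouter": ["anthropic/claude-2"]
  }
}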

Cost Estimation

POST /api/test/estimate

Get cost estimates before running tests.
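No example is documented above. A plausible exchange, reusing the request fields from /api/test/prompt (the response field names, such as estimatedCost, are assumptions):

Request Body (illustrative):

{
  "prompt": "Explain quantum computing",
  "provider": "openai",
  "model": "gpt-3.5-turbo",
  "maxTokens": 500
}

Response (illustrative):

{
  "success": true,
  "data": {
    "estimatedPromptTokens": 5,
    "maxCompletionTokens": 500,
    "estimatedCost": 0.00085
  }
}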

Configuration Validation

POST /api/test/validate

Validate a test configuration and get recommendations.
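Again, no example is documented above. A plausible exchange (the valid and warnings fields are assumptions, not confirmed API fields):

Request Body (illustrative):

{
  "prompt": "Explain quantum computing",
  "provider": "openai",
  "model": "gpt-3.5-turbo",
  "temperature": 1.4,
  "maxTokens": 4000
}

Response (illustrative):

{
  "success": true,
  "data": {
    "valid": true,
    "warnings": [
      "temperature above 1.0 may produce less coherent output",
      "maxTokens is high for this model; consider lowering it to control cost"
    ]
  }
}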

Usage Examples

Basic Testing

  1. Navigate to the Test Models page
  2. Enter your prompt in the text area
  3. Select your preferred provider and model
  4. Adjust parameters (temperature, max tokens)
  5. Click Run to execute the test
  6. View results including response, metrics, and costs

Compare Mode

  1. Enable Compare Mode in the header
  2. Add multiple test configurations using Add Test
  3. Configure different providers/models for each test
  4. Use the same prompt across all configurations
  5. Click Run All Tests to execute them simultaneously
  6. Compare results side-by-side

Using Examples

  1. Click the Load example... dropdown
  2. Select from pre-built prompts like:
    • Creative Writing
    • Code Review
    • Data Analysis
    • Technical Documentation
    • Problem Solving
  3. The prompt and an appropriate system message will be loaded
  4. Modify as needed and run the test

Saving Prompts

  1. Enter or load a prompt you want to save
  2. Click Save Prompt below the text area
  3. Enter a descriptive name
  4. Access saved prompts via Load saved... dropdown

Best Practices

Prompt Design

  • Be Specific: Clear, detailed prompts generally produce better results
  • Set Context: Use system messages to establish the assistant's role
  • Test Variations: Try different phrasings to find optimal results

Model Selection

  • Start Simple: Begin with faster, cheaper models like GPT-3.5-turbo
  • Compare Performance: Use compare mode to evaluate different models
  • Consider Cost: Balance quality needs with budget constraints

Parameter Tuning

  • Temperature:
    • 0.0-0.3: Focused, deterministic responses
    • 0.4-0.7: Balanced creativity and coherence
    • 0.8-1.0: More creative, varied responses
  • Max Tokens: Set appropriate limits to control response length and cost (see the example configurations below)
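
For example, the same model configured for a focused task versus a creative one (prompts and values are illustrative only):

Focused, deterministic output:

{
  "prompt": "Summarize the following changelog in three bullet points.",
  "provider": "openai",
  "model": "gpt-3.5-turbo",
  "temperature": 0.2,
  "maxTokens": 300
}

Creative, varied output:

{
  "prompt": "Write a short story about a robot learning to paint.",
  "provider": "openai",
  "model": "gpt-3.5-turbo",
  "temperature": 0.9,
  "maxTokens": 800
}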

Testing Strategy

  1. Validate First: Use the validation endpoint to check configurations
  2. Estimate Costs: Preview costs for expensive models or long prompts
  3. Iterative Testing: Start with basic prompts and refine based on results
  4. Save Good Prompts: Build a library of effective prompts for reuse

Integration with Monitoring

All test requests are automatically logged in the monitoring system with:

  • Special Tagging: Tests are marked with isTest: true metadata (see the sketch below)
  • Full Tracking: Same metrics as production requests (latency, costs, tokens)
  • Error Logging: Failed tests are captured for debugging
  • Analytics Integration: Test data appears in analytics dashboards
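
A sketch of how a test request might appear in the logs. Only the isTest: true tag is documented above; the surrounding fields are assumed to match the test response format:

{
  "requestId": "test-123456789",
  "provider": "openai",
  "model": "gpt-3.5-turbo",
  "metadata": { "isTest": true },
  "tokenUsage": { "promptTokens": 15, "completionTokens": 25, "totalTokens": 40 },
  "cost": 0.0012,
  "duration": 1250
}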

Troubleshooting

Common Issues

"No models available"

  • Check provider configuration in Settings
  • Ensure API keys are properly set
  • Verify provider services are running (especially Ollama)

"Test failed" errors

  • Check network connectivity
  • Verify API keys and provider settings
  • Review error messages in the response (a possible error shape is sketched below)
  • Check provider-specific documentation
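
Failed tests presumably return success: false, matching the success flag in the documented response; the error object layout below is an assumption, not a documented format:

{
  "success": false,
  "error": {
    "message": "Invalid API key for provider 'openai'",
    "code": "AUTH_ERROR"
  }
}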

High costs warning

  • Review max tokens setting
  • Consider using smaller/cheaper models for testing
  • Use cost estimation before running expensive tests

Performance Tips

  • Use smaller models for rapid iteration
  • Set reasonable max token limits
  • Test with shorter prompts first
  • Monitor token usage to optimize costs

Security Considerations

  • API Key Safety: Test prompts use the same API keys as production
  • Data Privacy: Test prompts are logged and stored
  • Rate Limiting: Respect provider rate limits during testing
  • Cost Control: Monitor spending, especially with expensive models

Future Enhancements

Planned features for future releases:

  • Batch Testing: Upload CSV files with multiple prompts
  • A/B Testing Framework: Systematic comparison tools
  • Prompt Templates: Reusable prompt structures
  • Export Results: Download test results as CSV/JSON
  • Scheduling: Automated recurring tests
  • Benchmarking Suites: Standard evaluation datasets