
Performance Tuning Guide

How to make your local coding assistant faster.

Quick Performance Comparison

| Model | Quantization | Size | Speed (tokens/sec) | Quality | Best For |
|-------|--------------|------|--------------------|---------|----------|
| qwen2.5-coder-7b | Q4_K_M | 4GB | 5-15 | Excellent | Best quality, slower |
| qwen2.5-coder-3b | Q4_K_M | 2GB | 15-30 | Very Good | Recommended - fast & good |
| qwen2.5-coder-7b | Q3_K_M | 3GB | 8-20 | Good | Faster, lower quality |
| deepseek-coder-6.7b | Q4_K_M | 4GB | 6-16 | Excellent | Alternative to Qwen |
| codellama-7b | Q4_K_M | 4GB | 5-15 | Very Good | Good at following instructions |

Speed tested on i7-8665U (8 threads, 31GB RAM).

Where to Adjust Parameters

All configuration is in one file: config.sh

nano ~/PycharmProjects/coding-assistant/config.sh

After any change, restart the server: Ctrl+C then ./server.sh

Parameter Guide

1. ACTIVE_MODEL (Biggest Impact)

What it does: Selects which model to use.

Location: config.sh line 11

ACTIVE_MODEL="qwen2.5-coder-7b"

Available options:

# Fastest (Recommended for speed)
ACTIVE_MODEL="qwen2.5-coder-3b"

# Best quality (Default)
ACTIVE_MODEL="qwen2.5-coder-7b"

# Alternative models
ACTIVE_MODEL="deepseek-coder-6.7b"
ACTIVE_MODEL="codellama-7b"

To try a different model:

# 1. Download it first
./download-model.sh qwen2.5-coder-3b

# 2. Edit config.sh
nano config.sh
# Change ACTIVE_MODEL="qwen2.5-coder-3b"

# 3. Restart
./server.sh
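If you prefer a one-liner over opening nano, the same edit can be done with sed. This is a sketch demonstrated on a throwaway copy, so you can verify the substitution before pointing it at your real config.sh:

```shell
# Demo config file standing in for ~/PycharmProjects/coding-assistant/config.sh
printf 'ACTIVE_MODEL="qwen2.5-coder-7b"\nN_THREADS=7\n' > /tmp/config-demo.sh

# Rewrite the ACTIVE_MODEL line in place, whatever its current value
sed -i 's/^ACTIVE_MODEL=.*/ACTIVE_MODEL="qwen2.5-coder-3b"/' /tmp/config-demo.sh

# Verify the change
grep '^ACTIVE_MODEL=' /tmp/config-demo.sh
```

Once you trust the pattern, run the same sed command against the real config.sh.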

2. N_THREADS (CPU Usage)

What it does: How many CPU threads to use for generation.

Location: config.sh line 14

N_THREADS=7

Options:

N_THREADS=7   # Default - leaves 1 thread for system
N_THREADS=8   # Max - use all threads (may slow down system)
N_THREADS=6   # Conservative - keeps system more responsive
N_THREADS=4   # Light usage - for background running

Rule of thumb: leave one thread free for the system. On an 8-thread CPU, use 7 for best performance, or fewer if you need the system to stay responsive.
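The "all but one" rule can be computed for any machine instead of hard-coding 7. A small sketch using nproc (GNU coreutils), which reports the number of available CPU threads:

```shell
# Derive a sensible N_THREADS: total threads minus one, never below 1
TOTAL=$(nproc)
SUGGESTED=$((TOTAL - 1))
if [ "$SUGGESTED" -lt 1 ]; then SUGGESTED=1; fi
echo "nproc reports $TOTAL threads; suggested N_THREADS=$SUGGESTED"
```

Copy the suggested value into config.sh by hand; the script only prints it.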

3. CONTEXT_SIZE (Memory & Speed)

What it does: How much conversation history the model remembers.

Location: config.sh line 15

CONTEXT_SIZE=8192

Options:

CONTEXT_SIZE=8192   # Default - full context, slower
CONTEXT_SIZE=4096   # Half - faster, still good for most tasks
CONTEXT_SIZE=2048   # Quarter - fastest, limited context
CONTEXT_SIZE=16384  # Double - very slow, only for long conversations

Impact:

  • Larger = slower processing, more RAM, better long conversations
  • Smaller = faster, less RAM, might forget earlier context

Recommendation: Start with 4096 for speed, increase if you need longer context.

4. TEMPERATURE (Output Randomness)

What it does: Controls creativity vs consistency.

Location: config.sh line 16

TEMPERATURE=0.7

Options:

TEMPERATURE=0.3   # Very focused, deterministic, slightly faster
TEMPERATURE=0.5   # Balanced, good for code
TEMPERATURE=0.7   # Default - creative, natural responses
TEMPERATURE=0.9   # More creative, less predictable

Impact: Lower = slightly faster generation, more predictable output.

Recommendation: Use 0.5 for coding tasks.

5. TOP_P (Sampling Strategy)

What it does: Nucleus sampling. It works alongside temperature to control randomness by restricting sampling to the smallest set of tokens whose cumulative probability reaches TOP_P.

Location: config.sh line 17

TOP_P=0.95

Options:

TOP_P=0.90   # More focused
TOP_P=0.95   # Default - balanced
TOP_P=1.00   # Maximum diversity

Impact: Minimal performance impact. Lower = more focused responses.

6. REPEAT_PENALTY (Avoid Repetition)

What it does: Penalizes repeated tokens.

Location: config.sh line 18

REPEAT_PENALTY=1.1

Options:

REPEAT_PENALTY=1.0   # No penalty (might repeat)
REPEAT_PENALTY=1.1   # Default - balanced
REPEAT_PENALTY=1.2   # Strong penalty (more diverse, might be less coherent)

Impact: Minimal performance impact.

7. SERVER_HOST / SERVER_PORT (Network)

What it does: Sets the address and port the API server listens on.

Location: config.sh lines 21-22

SERVER_HOST="127.0.0.1"
SERVER_PORT=8080

Change if: Port 8080 is already in use.

SERVER_PORT=8081   # Alternative port
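A quick way to check whether something is already listening on 8080 before switching ports. This sketch uses ss, which ships with iproute2 on most Linux systems:

```shell
PORT=8080
# List listening TCP sockets and look for the port (trailing space avoids
# matching 8080 as a prefix of e.g. 80801)
if ss -ltn 2>/dev/null | grep -q ":$PORT "; then
  echo "Port $PORT is in use - set SERVER_PORT=8081 in config.sh"
else
  echo "Port $PORT looks free"
fi
```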

Quantization Explained

What is quantization? Compressing model weights to use less memory and run faster.

Available quantizations (from HuggingFace):

| Quantization | Size vs Original | Quality | Speed | When to Use |
|--------------|------------------|---------|-------|-------------|
| Q4_K_M | ~25% | Excellent | Good | Recommended - best balance |
| Q3_K_M | ~20% | Good | Better | When you need more speed |
| Q2_K | ~15% | Fair | Best | Only if desperate for speed |
| Q5_K_M | ~30% | Excellent+ | Slower | When quality > speed |
| Q6_K | ~40% | Near-perfect | Slowest | Max quality, slow |

For coding: Stick with Q4_K_M - it's the sweet spot.

To try different quantization:

Finding Available Quantizations

  1. Go to the model's HuggingFace repo (check models.conf for the repo name)
  2. Click "Files and versions" tab
  3. Look for .gguf files - the quantization is in the filename
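The same listing can be pulled from the command line. This sketch assumes HuggingFace's public model API (the /api/models endpoint, which returns the repo's file list under "siblings"); the repo name is the one used elsewhere in this guide:

```shell
# List the .gguf files available in a HuggingFace GGUF repo
curl -s "https://huggingface.co/api/models/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF" \
  | grep -o '"rfilename": *"[^"]*\.gguf"' \
  | sed 's/.*"\([^"]*\)"$/\1/'
```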


Common filenames you'll see:

qwen2.5-coder-7b-instruct-q2_k.gguf      # Smallest, fastest, lowest quality
qwen2.5-coder-7b-instruct-q3_k_m.gguf    # Small, fast
qwen2.5-coder-7b-instruct-q4_k_m.gguf    # <-- Default, best balance
qwen2.5-coder-7b-instruct-q5_k_m.gguf    # Larger, better quality
qwen2.5-coder-7b-instruct-q6_k.gguf      # Large, high quality
qwen2.5-coder-7b-instruct-q8_0.gguf      # Largest, near-original quality

Adding a Different Quantization

Add to models.conf:

qwen2.5-coder-7b-q3|Qwen/Qwen2.5-Coder-7B-Instruct-GGUF|qwen2.5-coder-7b-instruct-q3_k_m.gguf|Q3_K_M|3GB
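The pipe-delimited fields in that line split as follows (the field names are inferred from the MODEL_ID|HF_REPO|FILENAME|QUANTIZATION|SIZE format documented later in this guide):

```shell
# Split one models.conf entry into its five fields
IFS='|' read -r MODEL_ID HF_REPO FILENAME QUANT SIZE <<EOF
qwen2.5-coder-7b-q3|Qwen/Qwen2.5-Coder-7B-Instruct-GGUF|qwen2.5-coder-7b-instruct-q3_k_m.gguf|Q3_K_M|3GB
EOF
echo "id=$MODEL_ID  file=$FILENAME  quant=$QUANT  size=$SIZE"
```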

Then download and switch:

./download-model.sh qwen2.5-coder-7b-q3
nano config.sh  # Change ACTIVE_MODEL
./server.sh

Recommended Configurations

Fast & Good (Recommended)

Best balance of speed and quality:

ACTIVE_MODEL="qwen2.5-coder-3b"
N_THREADS=7
CONTEXT_SIZE=4096
TEMPERATURE=0.5

Expected: 15-30 tokens/sec, good code quality.

Maximum Quality (Slower)

When quality matters more than speed:

ACTIVE_MODEL="qwen2.5-coder-7b"
N_THREADS=7
CONTEXT_SIZE=8192
TEMPERATURE=0.7

Expected: 5-15 tokens/sec, excellent code quality.

Speed Demon (Fastest)

Maximum speed, acceptable quality:

ACTIVE_MODEL="qwen2.5-coder-3b"
N_THREADS=8
CONTEXT_SIZE=2048
TEMPERATURE=0.3

Expected: 20-35 tokens/sec, good enough for most tasks.

Balanced (Middle Ground)

Default configuration:

ACTIVE_MODEL="qwen2.5-coder-7b"
N_THREADS=7
CONTEXT_SIZE=4096
TEMPERATURE=0.5

Expected: 8-18 tokens/sec, very good quality.

Step-by-Step: Try Something Different

Example: Switch to Faster 3B Model

# 1. Stop current server (Ctrl+C if running)

# 2. Download the 3B model
cd ~/PycharmProjects/coding-assistant
./download-model.sh qwen2.5-coder-3b

# 3. Edit config
nano config.sh
# Find line: ACTIVE_MODEL="qwen2.5-coder-7b"
# Change to: ACTIVE_MODEL="qwen2.5-coder-3b"
# Save: Ctrl+O, Enter, Ctrl+X

# 4. Optionally reduce context for more speed
# Find line: CONTEXT_SIZE=8192
# Change to: CONTEXT_SIZE=4096

# 5. Restart server
./server.sh

# 6. Test in OpenCode - should be noticeably faster

Example: Try DeepSeek Model

# 1. Download DeepSeek
./download-model.sh deepseek-coder-6.7b

# 2. Edit config
nano config.sh
# Change: ACTIVE_MODEL="deepseek-coder-6.7b"

# 3. Restart
./server.sh

Example: Maximize Speed

# 1. Get 3B model if not already downloaded
./download-model.sh qwen2.5-coder-3b

# 2. Edit config.sh - set all these:
nano config.sh

ACTIVE_MODEL="qwen2.5-coder-3b"
N_THREADS=8
CONTEXT_SIZE=2048
TEMPERATURE=0.3

# 3. Restart
./server.sh

Measuring Performance

Check Speed

The model outputs tokens/sec at the end of each response:

[ Prompt: 3.5 t/s | Generation: 15.2 t/s ]

What's good:

  • 5-10 t/s = Slow but usable
  • 10-20 t/s = Good
  • 20-30 t/s = Fast
  • 30+ t/s = Very fast

Compare Configurations

Test with the same prompt:

# In chat or via API:
"Write a Python function to reverse a string"

Note the generation speed and compare.
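A rough wall-clock timing sketch for the comparison above. It assumes the server exposes an OpenAI-style /v1/chat/completions endpoint on the host/port from config.sh; adjust the URL if your server's API differs:

```shell
# Send the same fixed prompt and measure total wall-clock time.
# Comparable across configurations as long as the prompt stays the same.
PROMPT="Write a Python function to reverse a string"
START=$(date +%s)
curl -s --max-time 300 http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"$PROMPT\"}]}" > /dev/null || true
END=$(date +%s)
echo "Elapsed: $((END - START))s"
```

Wall-clock time includes prompt processing, so prefer the server's own [ Generation: X.X t/s ] figure when it is available.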

Troubleshooting

Still too slow after switching to 3B

  1. Reduce context: CONTEXT_SIZE=2048
  2. Use all threads: N_THREADS=8
  3. Close other applications
  4. Check CPU usage with htop (should be ~700% when generating with N_THREADS=7)
  5. Try a smaller quantization (Q3_K_M) - see Quantization Explained

Model quality not good enough

  1. Switch back to 7B: ACTIVE_MODEL="qwen2.5-coder-7b"
  2. Increase context: CONTEXT_SIZE=8192
  3. Try different model: deepseek-coder-6.7b

Out of memory

  1. Use smaller model: 3B instead of 7B
  2. Reduce context: CONTEXT_SIZE=2048
  3. Try a smaller quantization (Q3_K_M or Q2_K) - see Quantization Explained
  4. Close other applications

Responses feel "mechanical"

  1. Increase temperature: TEMPERATURE=0.7 or 0.9
  2. Increase top_p: TOP_P=0.98

Quick Reference

Config file: ~/PycharmProjects/coding-assistant/config.sh

After changes: Restart the server (exit with x then Enter, or Ctrl+C, then run ./server.sh)

Download models: ./download-model.sh <model-id>

List models: ./download-model.sh

Check current config: cat config.sh

Test speed: Watch the [ Generation: X.X t/s ] output
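The quick-reference items above can be combined into a one-shot status check (the parameter names are the ones documented in this guide; the fallback message fires if the file is not where this guide expects it):

```shell
# Print the current tuning parameters from config.sh at a glance
CONFIG=~/PycharmProjects/coding-assistant/config.sh
grep -E '^(ACTIVE_MODEL|N_THREADS|CONTEXT_SIZE|TEMPERATURE|TOP_P|REPEAT_PENALTY|SERVER_PORT)=' "$CONFIG" 2>/dev/null \
  || echo "config.sh not found at $CONFIG"
```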

Adding New Models

To add a model not in models.conf:

  1. Find GGUF file on HuggingFace
  2. Add to models.conf:
    MODEL_ID|HF_REPO|FILENAME|QUANTIZATION|SIZE
    
  3. Download: ./download-model.sh MODEL_ID
  4. Edit config.sh: ACTIVE_MODEL="MODEL_ID"
  5. Restart: ./server.sh

Example - Adding Starcoder2:

# Add to models.conf:
starcoder2-7b|bigcode/starcoder2-7b-GGUF|starcoder2-7b.Q4_K_M.gguf|Q4_K_M|4GB

# Download and use:
./download-model.sh starcoder2-7b
nano config.sh  # Set ACTIVE_MODEL="starcoder2-7b"
./server.sh

Summary

For faster responses:

  1. Switch to 3B model (biggest impact)
  2. Reduce CONTEXT_SIZE to 4096
  3. Use N_THREADS=8

For best quality:

  1. Use 7B model
  2. Keep CONTEXT_SIZE=8192
  3. Use TEMPERATURE=0.7

The magic formula for your hardware:

ACTIVE_MODEL="qwen2.5-coder-3b"
N_THREADS=7
CONTEXT_SIZE=4096
TEMPERATURE=0.5

Try it - you should see roughly a 2-3x speed improvement while keeping excellent quality.