Skip to content

Latest commit

 

History

History
124 lines (89 loc) · 4.5 KB

File metadata and controls

124 lines (89 loc) · 4.5 KB

SmugVision Performance Analysis

Summary

Based on the log output analysis, here's where time is being spent during image processing:

Timing Breakdown (from logs)

Phase Duration Percentage Issue
GPS Reverse Geocoding #1 ~47s 35% ⚠️ MAJOR BOTTLENECK
User venue selection (interactive) ~28s 21% Expected (user input)
Face detection ~6s 4.5% Reasonable
GPS Reverse Geocoding #2 ~47s 35% ⚠️ MAJOR BOTTLENECK
Caption generation (Llama) ~18s 13% Expected (LLM inference)
Tags generation (Llama) ~6s 4.5% Expected (LLM inference)
Face recognizer init ~1s <1% Reasonable
Model initialization <1s <1% Reasonable

Total Processing Time: ~153s (2.5 minutes) Actual Processing (excluding user input): ~125s

Root Cause Analysis

🔴 Critical Issue: Reverse Geocoding Taking 47 Seconds

Location: smugvision/utils/exif.py, lines 390-426

Problem: The reverse_geocode() function has a catastrophically inefficient implementation:

  1. It iterates through ~40 different venue types (restaurant, cafe, theater, school, etc.)
  2. For each venue type, it makes a separate API call to Nominatim geocoding service
  3. Each API call has a 5-second timeout
  4. If even half the venue types are tried, that's 20+ API calls × 5 seconds = 100+ seconds potential

Code snippet causing the issue:

# Lines 373-385: Comprehensive venue type list (~40 types)
all_venue_types = [
    'restaurant', 'cafe', 'coffee', 'bar', 'pub', 'brewery',
    'theater', 'theatre', 'cinema', 'venue', 'hall', 'auditorium',
    'museum', 'gallery', 'library',
    # ... 40+ types total
]

# Lines 390-426: Loop making API call for EACH type
for search_term in all_venue_types:
    query = f"{search_term} near {latitude},{longitude}"
    search_results = geolocator.geocode(
        query,
        exactly_one=False,
        limit=5,
        timeout=5  # 5 seconds per venue type!
    )

Why This Happens Twice

  1. First call (in test_vision.py): Lines 78-86, called with interactive=True for user selection
  2. Second call (in process_image): Called again inside the vision model processing

Recommendations

Immediate Fix (High Priority)

Option 1: Use Nominatim's nearby search properly Instead of searching for each venue type individually, use a single reverse() call with better parameters, or use Overpass API for nearby POI search.

Option 2: Cache results The function is being called twice with the same coordinates. Cache the result from the first call.

Option 3: Reduce timeout 5 seconds per venue type is excessive. Reduce to 2 seconds.

Option 4: Limit venue types Don't search all 40 venue types. Search only the most common ones (top 5-10).

Option 5: Use concurrent requests If multiple searches are needed, use ThreadPoolExecutor to parallelize API calls.

Proposed Optimized Implementation

Replace the sequential venue search with:

  1. Single reverse geocode call (already done at line 352)
  2. If building name not found, make a single Overpass API query for all POI types within radius
  3. Or use Nominatim's lookup endpoint for nearby POIs in one call

Expected Performance After Fix

  • GPS reverse geocoding: 47s → 2-5s (90-95% reduction)
  • Total processing time: 153s → ~35s (excluding user input)
  • Interactive mode: 153s → ~63s (including user input)

Monitoring

Run the updated test_vision.py script which now includes detailed timing breakdowns:

./test_vision.py <image_path>

The script will output a timing breakdown showing exactly where time is spent in each phase:

⏱️  TIMING BREAKDOWN
============================================================
2. EXIF Location Extraction................... 47.23s (35.2%)
4. Total Image Processing..................... 53.45s (39.8%)
3. Face Recognizer Initialization..............  0.54s ( 0.4%)
1. Model Initialization........................  0.25s ( 0.2%)
------------------------------------------------------------
TOTAL.......................................... 134.2s
============================================================

Additional Notes

  • Llama vision model inference (18s caption + 6s tags) is reasonable for local inference
  • Face detection (6s) is acceptable
  • The 94 seconds spent on GPS geocoding (2 × 47s) represents 70% of non-interactive time
  • Fixing the reverse geocoding will make the overall process 4× faster