Based on the log output analysis, here's where time is being spent during image processing:
| Phase | Duration | Percentage | Issue |
|---|---|---|---|
| GPS Reverse Geocoding #1 | ~47s | 35% | **Bottleneck** |
| User venue selection (interactive) | ~28s | 21% | Expected (user input) |
| Face detection | ~6s | 4.5% | Reasonable |
| GPS Reverse Geocoding #2 | ~47s | 35% | **Bottleneck** |
| Caption generation (Llama) | ~18s | 13% | Expected (LLM inference) |
| Tags generation (Llama) | ~6s | 4.5% | Expected (LLM inference) |
| Face recognizer init | ~1s | <1% | Reasonable |
| Model initialization | <1s | <1% | Reasonable |
**Total processing time:** ~153s (~2.5 minutes)
**Actual processing (excluding user input):** ~125s
Location: `smugvision/utils/exif.py`, lines 390-426
Problem: The `reverse_geocode()` function has a catastrophically inefficient implementation:
- It iterates through ~40 different venue types (restaurant, cafe, theater, school, etc.)
- For each venue type, it makes a separate API call to Nominatim geocoding service
- Each API call has a 5-second timeout
- If even half the venue types are tried, that is 20+ API calls at up to 5 seconds each, over 100 seconds in the worst case
Code snippet causing the issue:

```python
# Lines 373-385: comprehensive venue type list (~40 types)
all_venue_types = [
    'restaurant', 'cafe', 'coffee', 'bar', 'pub', 'brewery',
    'theater', 'theatre', 'cinema', 'venue', 'hall', 'auditorium',
    'museum', 'gallery', 'library',
    # ... 40+ types total
]

# Lines 390-426: loop making an API call for EACH type
for search_term in all_venue_types:
    query = f"{search_term} near {latitude},{longitude}"
    search_results = geolocator.geocode(
        query,
        exactly_one=False,
        limit=5,
        timeout=5,  # 5 seconds per venue type!
    )
```

The function is called twice per image:
- First call (in `test_vision.py`): lines 78-86, called with `interactive=True` for user selection
- Second call (in `process_image`): called again inside the vision model processing
Option 1: Use Nominatim's nearby search properly
Instead of searching for each venue type individually, use a single reverse() call with better parameters, or use Overpass API for nearby POI search.
Option 2: Cache results The function is being called twice with the same coordinates. Cache the result from the first call.
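A minimal sketch of the caching approach: since the exact internals of `reverse_geocode()` aren't shown here, the expensive lookup is represented by a placeholder body, and `functools.lru_cache` memoizes it keyed on rounded coordinates (rounding to 4 decimal places, roughly 11 m, so near-identical GPS fixes share one cache entry).

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def reverse_geocode_cached(lat_r: float, lon_r: float) -> str:
    """Expensive lookup runs at most once per rounded coordinate pair."""
    # Placeholder for the real Nominatim call in exif.py.
    return f"venue near ({lat_r}, {lon_r})"

def reverse_geocode(latitude: float, longitude: float) -> str:
    # Round to ~11 m precision so nearly identical coordinates hit the cache.
    return reverse_geocode_cached(round(latitude, 4), round(longitude, 4))

# Both call sites now trigger only one expensive lookup:
first = reverse_geocode(37.7749295, -122.4194155)
second = reverse_geocode(37.7749301, -122.4194149)  # cache hit
```

With this in place, the second call from `process_image` would return instantly instead of repeating the ~47s search.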
Option 3: Reduce timeout 5 seconds per venue type is excessive. Reduce to 2 seconds.
Option 4: Limit venue types Don't search all 40 venue types. Search only the most common ones (top 5-10).
Option 5: Use concurrent requests
If multiple searches are needed, use ThreadPoolExecutor to parallelize API calls.
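A sketch of the parallelization idea, with the network call simulated by a short sleep (the real `geolocator.geocode()` call is not reproduced here):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def lookup_venue(search_term: str) -> tuple[str, str]:
    """Stand-in for one geolocator.geocode() call (network-bound, ~0.1s here)."""
    time.sleep(0.1)
    return search_term, f"result for {search_term}"

venue_types = ['restaurant', 'cafe', 'bar', 'theater', 'museum',
               'gallery', 'library', 'school', 'park', 'stadium']

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = dict(pool.map(lookup_venue, venue_types))
elapsed = time.perf_counter() - start
# Ten sequential 0.1s calls would take ~1s; ten workers finish in ~0.1s.
```

Caveat: the public Nominatim instance's usage policy limits clients to roughly one request per second, so parallel requests are only appropriate against a self-hosted Nominatim or a provider that permits them.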
Replace the sequential venue search with:
- A single reverse geocode call (already done at line 352)
- If the building name is not found, a single Overpass API query for all POI types within a radius
- Or Nominatim's `lookup` endpoint for nearby POIs in one call
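A sketch of the single Overpass query, which asks for every `amenity`, `leisure`, and `tourism` POI within a radius in one request instead of ~40 per-venue-type Nominatim calls. The query construction is shown without the network call; the radius and tag selection are illustrative assumptions, not values from `exif.py`.

```python
import urllib.parse

def build_overpass_query(lat: float, lon: float, radius_m: int = 100) -> str:
    """Build one Overpass QL request URL covering all nearby POI types."""
    query = f"""
    [out:json][timeout:10];
    (
      node["amenity"](around:{radius_m},{lat},{lon});
      node["leisure"](around:{radius_m},{lat},{lon});
      node["tourism"](around:{radius_m},{lat},{lon});
    );
    out 20;
    """
    return ("https://overpass-api.de/api/interpreter?data="
            + urllib.parse.quote(query))

url = build_overpass_query(37.7749, -122.4194)
```

Fetching this URL (e.g. with `requests.get`) returns a JSON list of nearby POIs in a single round trip, bounded by one timeout instead of forty.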
Expected improvements:
- GPS reverse geocoding: 47s → 2-5s (a 90-95% reduction)
- Total processing time: 153s → ~35s (excluding user input)
- Interactive mode: 153s → ~63s (including user input)
Run the updated `test_vision.py` script, which now includes detailed timing breakdowns:

```shell
./test_vision.py <image_path>
```

The script will output a timing breakdown showing exactly where time is spent in each phase:
```
⏱️ TIMING BREAKDOWN
============================================================
2. EXIF Location Extraction................... 47.23s (35.2%)
4. Total Image Processing..................... 53.45s (39.8%)
3. Face Recognizer Initialization.............. 0.54s ( 0.4%)
1. Model Initialization........................ 0.25s ( 0.2%)
------------------------------------------------------------
TOTAL.......................................... 134.2s
============================================================
```
- Llama vision model inference (18s caption + 6s tags) is reasonable for local inference
- Face detection (6s) is acceptable
- The 94 seconds spent on GPS geocoding (2 × 47s) represents roughly 75% of the ~125s of non-interactive time
- Fixing the reverse geocoding should make non-interactive processing roughly 3.5× faster (~125s → ~35s)