
06_Detection_Logic

dnzbk edited this page Jul 9, 2025 · 1 revision

Detection Logic

Understanding how RemoveSamples identifies and processes sample files and directories.

🎯 Overview

RemoveSamples uses a multi-layered detection system that combines pattern matching, size analysis, and file type recognition to accurately identify sample content while preserving legitimate media files.

🔍 Detection Process Flow

File/Directory Found
    ↓
Directory Check
    ↓ (if directory)
Directory Pattern Match → Remove if match
    ↓ (if file)
File Extension Check
    ↓
Pattern Match Check
    ↓
Size Threshold Check
    ↓
Final Decision: Remove or Preserve

📁 Directory Detection

Directory Pattern Matching

RemoveSamples uses regex patterns to identify sample directories:

directory_patterns = [
    r'\bsamples?\b',        # "sample" or "samples" as whole words
    r'^samples?$',          # Exact match "sample" or "samples"  
    r'sample[_\-\s]',       # "sample" followed by separator
    r'[_\-\s]samples?$',    # "sample(s)" at end after separator
]

Directory Examples

Will be removed:

samples/
SAMPLES/
Sample_Videos/
Movie_Samples/
preview-samples/
Sample Videos/
SAMPLE/

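Will be preserved:

Bonus_Features/         # No sample pattern
Behind_The_Scenes/      # Different content type
Extras/                 # Generic extras folder

Will also be removed (title contains "Sample" as a separate word):

Movie.Name.Sample.Title/ # "." counts as a word boundary, so \bsamples?\b matches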
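
The directory examples can be checked against the listed patterns with a short script. This is a minimal sketch, not the extension's actual code: the helper name `is_sample_directory` and the way the patterns are combined into one regex are assumptions made for illustration.

```python
import re

# Patterns copied from the directory_patterns list on this page.
DIRECTORY_PATTERNS = [
    r'\bsamples?\b',
    r'^samples?$',
    r'sample[_\-\s]',
    r'[_\-\s]samples?$',
]

# Combine into one case-insensitive regex (an illustrative choice).
_DIR_RE = re.compile(
    '|'.join(f'(?:{p})' for p in DIRECTORY_PATTERNS), re.IGNORECASE
)


def is_sample_directory(name: str) -> bool:
    """Return True if a directory name matches any listed pattern."""
    # Strip a trailing slash so "samples/" and "samples" behave the same.
    return _DIR_RE.search(name.rstrip('/\\')) is not None


for name in ['samples/', 'Sample_Videos/', 'Movie_Samples/',
             'Extras/', 'Movie.Name.Sample.Title/']:
    print(f'{name:28} {is_sample_directory(name)}')
```

Because `\b` treats "." as a boundary, a dotted title such as Movie.Name.Sample.Title/ matches the first pattern when the patterns are applied exactly as listed.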

Directory Processing

  1. Recursive scanning - Checks all subdirectories
  2. Pattern matching - Tests each directory name against the patterns
  3. Safety checks - Verifies write permissions before deletion
  4. Complete removal - Deletes the directory and all of its contents
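
The processing steps can be sketched with the standard library. This is an illustration under assumptions, not the extension's implementation: `remove_sample_directories` is a hypothetical name, the regex is the directory pattern list joined into one expression, and the permission check uses `os.access` on the parent directory.

```python
import os
import re
import shutil
import tempfile

# Assumed pattern set: the directory patterns from this page, joined with "|".
_DIR_RE = re.compile(
    r'\bsamples?\b|^samples?$|sample[_\-\s]|[_\-\s]samples?$', re.IGNORECASE
)


def remove_sample_directories(root: str) -> list:
    """Walk `root`, delete matching directories, return the removed paths."""
    removed = []
    # topdown=True lets us prune deleted directories from the walk.
    for current, dirnames, _files in os.walk(root, topdown=True):
        for name in list(dirnames):
            if not _DIR_RE.search(name):
                continue
            # Safety check: deleting an entry needs write access to its parent.
            if not os.access(current, os.W_OK):
                continue
            path = os.path.join(current, name)
            shutil.rmtree(path)    # removes the directory and all contents
            removed.append(path)
            dirnames.remove(name)  # do not descend into a deleted directory
    return removed


# Demo on a throwaway tree so nothing real is touched.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'Movie', 'Samples'))
    os.makedirs(os.path.join(root, 'Movie', 'Extras'))
    removed = remove_sample_directories(root)
    remaining = sorted(os.listdir(os.path.join(root, 'Movie')))
    print([os.path.relpath(p, root) for p in removed], remaining)
```

Pruning `dirnames` in place is what keeps `os.walk` from trying to enter a directory that has just been deleted.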

📄 File Detection

Pattern-Based Detection

Core Pattern Logic

file_patterns = [
    r'\bsample\b',          # Word boundary "sample"
    r'\.sample\.',          # ".sample." in filename
    r'_sample\.',           # "_sample." pattern
    r'-sample\.',           # "-sample." pattern  
    r'^sample\.',           # Starts with "sample."
    r'sample[_\-]',         # "sample" + separator
]

Pattern Matching Examples

✅ Detected (will be removed):

movie.sample.mkv        # .sample. pattern
sample.mp4              # Starts with sample
preview_sample.avi      # _sample. pattern
trailer-sample.wmv      # -sample. pattern
sample_clip.mov         # sample + separator
Sample.Preview.mkv      # Case insensitive

❌ Not detected by pattern:

movie.resample.mkv      # "resample" not "sample"
sampling_rate.wav       # "sampling" not "sample"
example.mkv             # Different word entirely

⚠️ Detected even though "Sample" is part of the title:

Movie.Sample.Title.mkv  # ".Sample." matches \bsample\b and \.sample\.
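
To see which pattern is responsible for a given match, the list can be tried in order. A minimal sketch, assuming the patterns are applied with `re.search` and `re.IGNORECASE` as described on this page; `first_matching_pattern` is a hypothetical helper name.

```python
import re

# Patterns copied from the file_patterns list on this page.
FILE_PATTERNS = [
    r'\bsample\b',
    r'\.sample\.',
    r'_sample\.',
    r'-sample\.',
    r'^sample\.',
    r'sample[_\-]',
]
_FILE_RES = [re.compile(p, re.IGNORECASE) for p in FILE_PATTERNS]


def first_matching_pattern(filename: str):
    """Return the source of the first pattern that matches, or None."""
    for regex in _FILE_RES:
        if regex.search(filename):
            return regex.pattern
    return None


for name in ['movie.sample.mkv', 'preview_sample.avi', 'sample_clip.mov',
             'movie.resample.mkv', 'Movie.Sample.Title.mkv']:
    print(f'{name:26} {first_matching_pattern(name)}')
```

Note that `preview_sample.avi` is caught by `_sample\.` rather than by `\bsample\b`: the underscore is a word character, so there is no word boundary in front of "sample".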

Size-Based Detection

Video Size Detection

# Default thresholds (configurable)
video_threshold = 150  # MB
video_extensions = ['.mkv', '.mp4', '.avi', '.mov', '.wmv', 
                   '.flv', '.webm', '.m4v', '.3gp', '.ts',
                   '.mpg', '.mpeg', '.vob', '.iso']

Audio Size Detection

# Default thresholds (configurable)
audio_threshold = 2    # MB
audio_extensions = ['.mp3', '.flac', '.aac', '.ogg', '.wma',
                   '.m4a', '.opus', '.wav']

Size Analysis Process

  1. Check file extension against video/audio lists
  2. Get file size in megabytes
  3. Compare against threshold for file type
  4. Combine with pattern results for final decision
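
Steps 1 to 3 can be sketched as a single helper. The name `size_check`, the return shape, and the 1 MB = 1024 × 1024 bytes conversion are assumptions for illustration; the thresholds and extension lists are the defaults listed above.

```python
import os

VIDEO_THRESHOLD_MB = 150
AUDIO_THRESHOLD_MB = 2
VIDEO_EXTENSIONS = {'.mkv', '.mp4', '.avi', '.mov', '.wmv', '.flv', '.webm',
                    '.m4v', '.3gp', '.ts', '.mpg', '.mpeg', '.vob', '.iso'}
AUDIO_EXTENSIONS = {'.mp3', '.flac', '.aac', '.ogg', '.wma',
                    '.m4a', '.opus', '.wav'}


def size_check(filename: str, size_bytes: int):
    """Return (media_type, below_threshold); (None, False) for other files."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in VIDEO_EXTENSIONS:
        media_type, threshold = 'video', VIDEO_THRESHOLD_MB
    elif ext in AUDIO_EXTENSIONS:
        media_type, threshold = 'audio', AUDIO_THRESHOLD_MB
    else:
        return None, False
    size_mb = size_bytes / (1024 * 1024)  # assumed: 1 MB = 1024 * 1024 bytes
    return media_type, size_mb < threshold


MB = 1024 * 1024
print(size_check('movie.sample.mkv', 45 * MB))
print(size_check('track.mp3', 3 * MB))
```

For a file on disk, `os.path.getsize(path)` would supply `size_bytes`.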

Combined Detection Logic

Decision Matrix

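| Pattern Match | Size Check         | Result                                |
|---------------|--------------------|---------------------------------------|
| ✅ Yes        | ✅ Below threshold | REMOVE                                |
| ✅ Yes        | ❌ Above threshold | REMOVE (pattern overrides)            |
| ❌ No         | ✅ Below threshold | PRESERVE (size alone insufficient)    |
| ❌ No         | ❌ Above threshold | PRESERVE                              |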

Special Cases

def decide(filename, filesize):
    # A pattern match always wins, regardless of size
    if pattern_matches(filename):
        return REMOVE

    # Size alone never triggers removal: a small file
    # without a pattern match is kept
    if below_threshold(filesize):
        return PRESERVE

    # No pattern match and above the threshold
    return PRESERVE
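
A self-contained version of the decision matrix might look like the sketch below. It is not the extension's code: `decide_with_reason`, the abbreviated threshold table, the condensed regex, and the rule that unrecognized extensions are preserved are all assumptions. The reason strings mirror the debug messages shown later on this page.

```python
import os
import re

# Condensed form of the file patterns on this page (assumed equivalent).
_SAMPLE_RE = re.compile(
    r'\bsample\b|[._-]sample\.|^sample\.|sample[_-]', re.IGNORECASE
)

# Abbreviated extension -> threshold table (MB), using the default values.
THRESHOLDS_MB = {'.mkv': 150, '.mp4': 150, '.avi': 150,
                 '.mp3': 2, '.flac': 2, '.wav': 2}


def decide_with_reason(filename: str, size_bytes: int):
    """Return (decision, reason) following the decision matrix."""
    ext = os.path.splitext(filename)[1].lower()
    threshold = THRESHOLDS_MB.get(ext)
    if threshold is None:
        # Assumption: files outside the media lists are never touched.
        return 'PRESERVE', 'not a recognized media extension'
    small = size_bytes / (1024 * 1024) < threshold
    if _SAMPLE_RE.search(filename):
        if small:
            return 'REMOVE', 'pattern match + below threshold'
        return 'REMOVE', 'pattern match overrides size'
    if small:
        return 'PRESERVE', 'no pattern, size alone insufficient'
    return 'PRESERVE', 'no pattern, above threshold'


MB = 1024 * 1024
for name, size in [('movie.sample.mkv', 45 * MB), ('movie.sample.mkv', 250 * MB),
                   ('movie.mkv', 45 * MB), ('movie.mkv', 4000 * MB)]:
    print(name, size // MB, decide_with_reason(name, size))
```

Returning the reason alongside the decision makes each of the four matrix rows distinguishable in logs, even though two of them share the same outcome.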

🎬 Video Detection Details

Resolution-Based Size Guidelines

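| Resolution | Typical Sample Size | Recommended Threshold |
|------------|---------------------|-----------------------|
| 480p       | 15-30 MB            | 50 MB                 |
| 720p       | 30-60 MB            | 100 MB                |
| 1080p      | 50-150 MB           | 150 MB (default)      |
| 1440p      | 100-250 MB          | 300 MB                |
| 2160p (4K) | 150-500 MB          | 500 MB                |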
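
If the library's typical resolution is known, the recommended threshold can be looked up from the guideline values. A small sketch; the function name and the rule of rounding up to the next listed resolution are assumptions, not part of the extension.

```python
# Recommended thresholds (MB) keyed by vertical resolution, from the guidelines.
RECOMMENDED_VIDEO_THRESHOLD_MB = {480: 50, 720: 100, 1080: 150,
                                  1440: 300, 2160: 500}


def recommended_threshold(height: int) -> int:
    """Return the threshold for the smallest listed resolution >= height."""
    for listed in sorted(RECOMMENDED_VIDEO_THRESHOLD_MB):
        if height <= listed:
            return RECOMMENDED_VIDEO_THRESHOLD_MB[listed]
    # Above 2160p: fall back to the largest listed value.
    return RECOMMENDED_VIDEO_THRESHOLD_MB[2160]


print(recommended_threshold(1080))
print(recommended_threshold(576))
```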

Video Format Considerations

# High compression formats (smaller samples)
high_compression = ['.mp4', '.mkv', '.webm']

# Lower compression formats (larger samples)  
lower_compression = ['.avi', '.mov', '.wmv']

# Archive formats (special handling)
archive_formats = ['.iso', '.vob']

Bitrate Impact on Sample Sizes

  • High bitrate (50+ Mbps): Samples can be 200-400 MB
  • Medium bitrate (10-30 Mbps): Samples typically 50-150 MB
  • Low bitrate (<10 Mbps): Samples usually under 50 MB
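
These ranges follow from size = bitrate × duration / 8. The sketch below works through the arithmetic, assuming a constant bitrate and 1 MB = 10^6 bytes; the clip durations are illustrative, not measured.

```python
def estimated_size_mb(bitrate_mbps: float, duration_s: float) -> float:
    """Approximate size in MB of a constant-bitrate clip."""
    return bitrate_mbps * duration_s / 8  # megabits -> megabytes


# A 60-second clip at several bitrates.
for mbps in (5, 10, 30, 50):
    print(f'{mbps:>2} Mbps x 60 s = {estimated_size_mb(mbps, 60):.0f} MB')
```

A one-minute sample at 50 Mbps comes to 375 MB, which is why high-bitrate releases can exceed the 150 MB default.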

🎵 Audio Detection Details

Audio Quality vs Sample Size

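| Format | Quality  | 30 seconds | 1 minute | Threshold |
|--------|----------|------------|----------|-----------|
| MP3    | 128 kbps | 0.5 MB     | 1 MB     | 2 MB      |
| MP3    | 320 kbps | 1.2 MB     | 2.4 MB   | 2 MB      |
| FLAC   | Lossless | 3-5 MB     | 6-10 MB  | 5-10 MB   |
| AAC    | 256 kbps | 1 MB       | 2 MB     | 2 MB      |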
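
Turning the table around, the longest clip that still fits under a threshold can be estimated from the bitrate. A minimal sketch, assuming constant bitrate, 1 kbps = 1000 bit/s, and 1 MB = 10^6 bytes; `max_sample_seconds` is a hypothetical helper.

```python
def max_sample_seconds(threshold_mb: float, bitrate_kbps: float) -> float:
    """Longest constant-bitrate clip (seconds) that stays under the threshold."""
    # threshold in kilobits (MB * 8000) divided by kilobits per second
    return threshold_mb * 8000 / bitrate_kbps


for kbps in (128, 256, 320):
    print(f'{kbps} kbps: {max_sample_seconds(2, kbps):.1f} s fit under 2 MB')
```

At 320 kbps only about 50 seconds fit under the 2 MB default, which matches the table: a one-minute 320 kbps sample (2.4 MB) is above the threshold.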

Audio Sample Patterns

Common audio sample names:

01_sample.mp3           # Track number + sample
preview.mp3             # Preview without "sample" (not matched by the patterns)
30sec_sample.flac       # Duration + sample
album_sample.aac        # Album sample
sample_track.wav        # Sample + track

🧠 Advanced Detection Features

Word Boundary Detection

# Prevents false positives
pattern = r'\bsample\b'

# Matches: "movie.sample.mkv" 
# Ignores: "movie.resample.mkv"
# Ignores: "movie.sampling.mkv"
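
The boundary behavior can be verified directly with Python's `re` module. One subtlety worth knowing: `\b` treats the underscore as a word character, so this pattern alone does not match names such as `preview_sample.avi`. The helper name below is illustrative.

```python
import re

_WORD_RE = re.compile(r'\bsample\b', re.IGNORECASE)


def has_whole_word_sample(filename: str) -> bool:
    """True if "sample" appears delimited by non-word characters or string ends."""
    return _WORD_RE.search(filename) is not None


for name in ['movie.sample.mkv', 'movie.resample.mkv', 'movie.sampling.mkv',
             'preview_sample.avi', 'my sample.mkv']:
    print(f'{name:22} {has_whole_word_sample(name)}')
```

Underscore-separated names are still covered by the separate `_sample\.` and `sample[_\-]` patterns in the file pattern list.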

Case Insensitive Matching

# All patterns use re.IGNORECASE flag
re.search(pattern, filename, re.IGNORECASE)

# Matches: sample, SAMPLE, Sample, SaMpLe

Separator Flexibility

separators = ['.', '_', '-', ' ']
# Handles: ".sample."  "_sample."  "-sample."  " sample" (space-separated)

Unicode and International Support

# Supports international characters in filenames
filename.lower()  # Handles Unicode case conversion

🔬 Detection Accuracy

False Positive Prevention

Techniques used:

  • Word boundary matching prevents substring matches
  • Extension validation ensures appropriate file types
  • Size thresholds prevent removal of full content
  • Pattern specificity targets actual sample patterns

False Negative Handling

Common missed samples:

  • Files with non-standard patterns (preview.mkv)
  • Very large samples (above threshold)
  • Unusual file extensions
  • Obfuscated sample names

Accuracy Statistics

Based on testing with common sample types:

  • Pattern detection: 95% accuracy
  • Size detection: 90% accuracy
  • Combined detection: 97% accuracy
  • False positive rate: <1%

🛠️ Debugging Detection

Enable Debug Mode

Settings → Extension Manager → RemoveSamples → Debug: Yes

Debug Output Examples

[DEBUG] Checking file: movie.sample.mkv
[DEBUG] Extension check: .mkv is video file
[DEBUG] Pattern check: 'sample' found with word boundary
[DEBUG] Size check: 45MB < 150MB threshold  
[DEBUG] Decision: REMOVE (pattern match + below threshold)
[INFO] Removing sample file: movie.sample.mkv
[DEBUG] Checking directory: samples/
[DEBUG] Directory pattern check: 'samples' matches pattern
[DEBUG] Decision: REMOVE (directory pattern match)
[INFO] Removing sample directory: samples/ (3 files)

Common Debug Scenarios

File preserved unexpectedly:

[DEBUG] Pattern check: no sample pattern found
[DEBUG] Size check: 45MB < 150MB threshold
[DEBUG] Decision: PRESERVE (no pattern, size alone insufficient)

File removed unexpectedly:

[DEBUG] Pattern check: 'sample' found in filename
[DEBUG] Size check: 250MB > 150MB threshold
[DEBUG] Decision: REMOVE (pattern match overrides size)

⚙️ Configuration Impact on Detection

Threshold Adjustments

Higher thresholds (300MB video, 10MB audio):

  • More aggressive size-based detection
  • May catch larger samples
  • Risk of false positives increases

Lower thresholds (50MB video, 1MB audio):

  • More conservative size-based detection
  • May miss larger samples
  • Safer for valuable content

Extension List Modifications

Adding extensions:

  • Enables detection for new file types
  • Useful for rare formats

Removing extensions:

  • Disables detection for specific types
  • Useful if format has no samples

🎯 Optimization Recommendations

For Different Content Types

4K/High Quality Content:

Video Size Threshold: 500 MB
Audio Size Threshold: 10 MB

Standard Definition Content:

Video Size Threshold: 100 MB
Audio Size Threshold: 1 MB

Music-Only Libraries:

Video Size Threshold: 150 MB (unchanged)
Audio Size Threshold: 1 MB (more aggressive)

Performance Considerations

  • Pattern matching: Very fast (regex optimized)
  • Size checking: Fast (single file stat call)
  • Directory scanning: Scales with file count
  • Overall impact: Minimal (<1% of processing time)

Need configuration help? See the Configuration Reference.
Having detection issues? See the Troubleshooting Guide.
