From 398bd330dbc00c7976424840c509be64ae779f66 Mon Sep 17 00:00:00 2001 From: arahangua Date: Fri, 15 Aug 2025 13:22:17 +0900 Subject: [PATCH] fixed/updated: model search function, removed outdated update-status method --- QUICKSTART.md | 8 +++---- README.md | 61 +++++++++++++++++++++------------------------------ src/cli.py | 42 +++-------------------------------- 3 files changed, 31 insertions(+), 80 deletions(-) diff --git a/QUICKSTART.md b/QUICKSTART.md index cf0d4d5..8a45966 100644 --- a/QUICKSTART.md +++ b/QUICKSTART.md @@ -26,7 +26,7 @@ OPENROUTER_MODEL=your_model #### Option B: Ollama (Local) ```bash ollama serve -ollama pull model_alias +ollama pull model_alias # or configure it via the recent Ollama GUI # Edit .env: LLM_PROVIDER=local LOCAL_LLM_TYPE=ollama @@ -79,7 +79,6 @@ scapo scrape all --dry-run # Preview what will be processed - `targeted --service NAME` - Extract tips for one service - `batch --category TYPE` - Process multiple services (limited) - `all --priority LEVEL` - Process ALL services one by one -- `update-status` - See what needs updating ## 📚 Approach 2: Legacy Sources @@ -189,9 +188,8 @@ NOT generic advice like (but sometimes we get them... sadly): ## 🚀 Next Steps 1. **Explore extracted tips**: `scapo tui` -2. **Update regularly**: `scapo scrape update-status` -3. **Track changes**: `python scripts/git_update.py --status` -4. **Contribute**: Share your findings via PR! +2. **Track changes**: `python scripts/git_update.py --status` +3. **Contribute**: Share your findings via PR! ## Need Help? 
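The `LLM_PROVIDER` / `LOCAL_LLM_TYPE` switch edited in the QUICKSTART above can be pictured with a small sketch. This is a hypothetical illustration, not SCAPO's actual internals: the helper name and the `OLLAMA_BASE_URL` variable are assumptions, though `http://localhost:11434` is Ollama's default endpoint.

```python
# Hypothetical sketch of resolving an LLM endpoint from .env-style settings.
# resolve_llm_endpoint and OLLAMA_BASE_URL are illustrative names, not SCAPO's API.
def resolve_llm_endpoint(env: dict) -> str:
    provider = env.get("LLM_PROVIDER", "openrouter")
    if provider == "local" and env.get("LOCAL_LLM_TYPE") == "ollama":
        # Ollama serves its HTTP API on port 11434 by default.
        return env.get("OLLAMA_BASE_URL", "http://localhost:11434")
    # Anything else falls back to the OpenRouter API base.
    return "https://openrouter.ai/api/v1"

print(resolve_llm_endpoint({"LLM_PROVIDER": "local", "LOCAL_LLM_TYPE": "ollama"}))
# http://localhost:11434
```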
diff --git a/README.md b/README.md index 9b44da8..b219124 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,7 @@ [![PRs Welcome](https://img.shields.io/badge/PRs-Welcome-brightgreen.svg)](CONTRIBUTING.md) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) -### 🎯 Real optimization tips from real users for AI services +### 🎯 Real usage tips from real users for AI services If you find **SCAPO** useful, please consider giving it a star on GitHub! Your support helps the project grow and reach more people. @@ -29,54 +29,51 @@ Your support helps the project grow and reach more people. **Keywords**: AI cost optimization, prompt engineering, LLM tips, OpenAI, Claude, Anthropic, Midjourney, Stable Diffusion, ElevenLabs, GitHub Copilot, reduce AI costs, AI service best practices, Reddit scraper, community knowledge base -Ever burned through credits in minutes? Searching Reddit for that one optimization tip? Getting generic advice when you need specific settings? +Ever burned through credits in minutes? Searching Reddit for that one peculiar problem you were having? Search results offering only generic advice when you need specific info? ![Scapo Intro](assets/intro.gif) -**SCAPO** extracts **specific, actionable optimization techniques** from Reddit about AI services - not generic "write better prompts" advice, but real discussions. +**SCAPO** extracts **specific usage tips and discussions** from Reddit about AI services - not generic "write better prompts" advice, but real discussions. Being crowd wisdom, they can sometimes be wrong, but they will often raise your eyebrows: "huh? ok, didn't know that..." ## ✨ Two Approaches SCAPO offers two distinct workflows: -### 1. 🎯 **Service Discovery Mode** (NEW - Recommended) - -Automatically discovers AI services and extracts specific optimization tips: - -![Scapo Discover](assets/scrape-discovery.gif) - -Discover services from GitHub Awesome lists +### 1. 
🎯 **Batch Processing via Service Discovery (recommended)** +Discovers existing AI services and caches them for reference and downstream use (see below): ```bash scapo scrape discover --update ``` + +![Scapo Discover](assets/scrape-discovery.gif) + Extract optimization tips for specific services ```bash scapo scrape targeted --service "Eleven Labs" --limit 20 ``` +![Scapo Targeted](assets/scrape-targeted.gif) -![Scapo Discover](assets/scrape-batch.gif) -Batch process multiple priority services +Batch process multiple priority services (recommended) ```bash scapo scrape batch --max-services 3 --category audio ``` +![Scapo Batch](assets/scrape-batch.gif) ### 2. 📚 **Legacy Sources Mode** Traditional approach using predefined sources from `sources.yaml`: ```bash # Scrape from configured sources scapo scrape run --sources reddit:LocalLLaMA --limit 10 ``` +![Scapo Legacy](assets/legacy.gif) + ## 🏃‍♂️ Quick Start (2 Minutes) @@ -102,6 +99,8 @@ cp .env.example .env ``` Get your API key from [openrouter.ai](https://openrouter.ai/) +* You can also use local LLMs (Ollama, LM Studio). See [QUICKSTART.md](./QUICKSTART.md) + ### 3. 
Start Extracting Optimization Tips @@ -122,7 +121,7 @@ scapo scrape batch --category video --limit 15 scapo scrape all --priority ultra --limit 20 ``` -#### Option B: Legacy Sources +#### Option B: Legacy method using the `sources.yaml` file ```bash # Use predefined sources from sources.yaml @@ -155,13 +154,6 @@ cat models/video/heygen/pitfalls.md ❌ **Generic**: "Try different settings" ✅ **Specific**: "Use 720p instead of 1080p in HeyGen to save 40% credits" -## 📊 Real Results - -From actual extractions: -- **Eleven Labs**: Found 15+ specific optimization techniques from 75 Reddit posts -- **GitHub Copilot**: Discovered exact limits and configuration tips -- **Character.AI**: Found 32,000 character limit and mobile workarounds -- **HeyGen**: Credit optimization techniques and API alternatives ## 🛠️ How It Works @@ -174,10 +166,10 @@ From actual extractions: ### Intelligent Extraction - **Specific search patterns**: "config settings", "API key", "rate limit daily", "parameters" - **Aggressive filtering**: Ignores generic advice like "be patient" -- **Batch processing**: Processes 50+ posts at once for efficiency -- **Context awareness**: Uses full 128k token windows when available +- **Batch processing**: Can process 50+ posts at once for efficiency (we recommend a minimum of 15 posts per query) +- **Context awareness**: Uses the full token window of your chosen LLM when available (for local LLMs, set your context window in .env) -### Smart Organization +### Output Organization ``` models/ ├── audio/ @@ -202,7 +194,7 @@ scapo scrape discover --show-all # List all services # Target specific services scapo scrape targeted \ - --service "Eleven Labs" \ # Service name (handles variations) + --service "Eleven Labs" \ # Service name (handles variations; any name works --> if there is no hit in services.json, it is created under the 'general' folder) --limit 20 \ # Posts per search (15-20 recommended) --max-queries 10 # Number of searches @@ -212,9 +204,6 @@ 
scapo scrape batch \ --max-services 3 \ # Services to process --limit 15 # Posts per search -# Check update status -scapo scrape update-status # See what needs updating -``` ### Legacy Sources Mode ```bash @@ -232,7 +221,7 @@ scapo scrape run \ # CLI commands scapo models list # List all models scapo models search "copilot" # Search models -scapo models info github-copilot --category coding +scapo models info github-copilot --category code ``` ## ⚙️ Configuration @@ -252,7 +241,7 @@ LOCAL_LLM_OPTIMAL_CHUNK=2048 # Optimal batch size (typically 1/4 of m LOCAL_LLM_TIMEOUT_SECONDS=600 # 10 minutes for slower local models LLM_TIMEOUT_SECONDS=120 # 2 minutes for cloud models -# Extraction Quality +# Extraction Quality (judged at your chosen LLM's discretion) LLM_QUALITY_THRESHOLD=0.6 # Min quality (0.0-1.0) # Scraping @@ -264,7 +253,7 @@ MAX_POSTS_PER_SCRAPE=100 # Limit per source ```bash --limit 5 # ❌ Often finds nothing (too few samples) --limit 15 # ✅ Good baseline (finds common issues) ---limit 25 # 🎯 Optimal (uncovers hidden gems & edge cases) +--limit 25 # 🎯 Will find something (as long as there is active discussion on it) ``` so, hand-wavy breakdown: With 5 posts, extraction success ~20%. With 20+ posts, success jumps to ~80%. 
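The `LLM_QUALITY_THRESHOLD` knob above amounts to a simple score gate on extracted tips. A minimal, hypothetical sketch (not SCAPO's actual code; the tip dicts and the `quality` field are assumptions):

```python
import os

# Hypothetical sketch: keep only extracted tips whose LLM-assigned quality
# score (0.0-1.0) clears LLM_QUALITY_THRESHOLD. Not SCAPO's actual code.
def filter_tips(tips: list[dict], threshold: float) -> list[dict]:
    return [t for t in tips if t.get("quality", 0.0) >= threshold]

threshold = float(os.environ.get("LLM_QUALITY_THRESHOLD", "0.6"))
tips = [
    {"text": "Use 720p instead of 1080p in HeyGen", "quality": 0.9},
    {"text": "be patient", "quality": 0.2},  # generic advice gets a low score
]
print([t["text"] for t in filter_tips(tips, threshold)])
```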
@@ -283,7 +272,7 @@ Navigate extracted tips with: ## 🔄 Git-Friendly Updates tracking AI services in the Models folder -SCAPO is designed for version control: +SCAPO is designed for version control (this applies only to tracking the models folder): ```bash # Check what changed uv run scripts/git_update.py --status diff --git a/src/cli.py b/src/cli.py index be70379..97c6a4e 100644 --- a/src/cli.py +++ b/src/cli.py @@ -746,43 +746,6 @@ async def _batch(): asyncio.run(_batch()) -@scrape.command(name="update-status") -def update_status(): - """Show which services need updating.""" - show_banner() - - from src.services.update_manager import UpdateManager - manager = UpdateManager() - status = manager.get_update_status() - - # Display update status - console.print(Panel( - f"[bold]Update Status[/bold]\n\n" - f"Total services tracked: [cyan]{status['total_services']}[/cyan]\n" - f"Last update: [yellow]{status.get('last_update', 'Never')}[/yellow]\n" - f"Update frequency: {status.get('update_frequency', 'N/A')}\n", - border_style="blue", - title="SCAPO Update Tracker" - )) - - if status['recent_updates']: - console.print("\n[green]Recently Updated:[/green]") - for service in status['recent_updates'][:10]: - console.print(f" ✓ {service}") - - if status['stale_services']: - console.print("\n[yellow]Needs Update (>30 days old):[/yellow]") - for service in status['stale_services'][:10]: - console.print(f" ⚠ {service}") - - if len(status['stale_services']) > 10: - console.print(f" ... 
and {len(status['stale_services']) - 10} more") - - # Suggest next action - if status['stale_services']: - console.print(f"\n[dim]Tip: Run 'scapo scrape batch --max-services {min(3, len(status['stale_services']))}' to update stale services[/dim]") - - @scrape.command(name="all") @click.option('-l', '--limit', default=20, help='Max posts per search (default: 20)') @click.option('-c', '--category', help='Filter by category (video, audio, code, etc)') @@ -1167,8 +1130,9 @@ def search_models(query, limit): console.print("[yellow]No models directory found. Run 'sota scrape run' first.[/yellow]") return - # Search through all categories and models - for category in ["text", "image", "video", "audio", "multimodal"]: + # Search through all categories and models dynamically + categories = [d for d in os.listdir(models_dir) if os.path.isdir(os.path.join(models_dir, d))] + for category in categories: cat_dir = os.path.join(models_dir, category) if os.path.exists(cat_dir): for model in os.listdir(cat_dir):
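The dynamic category listing this patch introduces in `search_models` boils down to "every subdirectory of `models/` is a category", so new categories like `general` are picked up without code changes. A self-contained sketch of that pattern (the directory names below are illustrative):

```python
import os
import tempfile

def list_categories(models_dir: str) -> list[str]:
    # Every subdirectory of models/ counts as a category; plain files are ignored.
    return sorted(
        d for d in os.listdir(models_dir)
        if os.path.isdir(os.path.join(models_dir, d))
    )

# Demo against a throwaway layout mimicking the models/ tree.
with tempfile.TemporaryDirectory() as root:
    for cat in ("audio", "code", "general"):
        os.makedirs(os.path.join(root, cat, "some-service"))
    print(list_categories(root))  # ['audio', 'code', 'general']
```

Compared with the old hardcoded `["text", "image", "video", "audio", "multimodal"]`, the `os.path.exists(cat_dir)` check kept in the patched loop is now redundant (every name came from `os.listdir`), but it is harmless.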