diff --git a/QUICKSTART.md b/QUICKSTART.md index cf0d4d5..8a45966 100644 --- a/QUICKSTART.md +++ b/QUICKSTART.md @@ -26,7 +26,7 @@ OPENROUTER_MODEL=your_model #### Option B: Ollama (Local) ```bash ollama serve -ollama pull model_alias +ollama pull model_alias # or configure the model via the recent Ollama GUI # Edit .env: LLM_PROVIDER=local LOCAL_LLM_TYPE=ollama @@ -79,7 +79,6 @@ scapo scrape all --dry-run # Preview what will be processed - `targeted --service NAME` - Extract tips for one service - `batch --category TYPE` - Process multiple services (limited) - `all --priority LEVEL` - Process ALL services one by one -- `update-status` - See what needs updating ## 📚 Approach 2: Legacy Sources @@ -189,9 +188,8 @@ NOT generic advice like (but sometimes we get them... sadly): ## 🚀 Next Steps 1. **Explore extracted tips**: `scapo tui` -2. **Update regularly**: `scapo scrape update-status` -3. **Track changes**: `python scripts/git_update.py --status` -4. **Contribute**: Share your findings via PR! +2. **Track changes**: `python scripts/git_update.py --status` +3. **Contribute**: Share your findings via PR! ## Need Help? diff --git a/README.md b/README.md index 9b44da8..b219124 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,7 @@ [![PRs Welcome](https://img.shields.io/badge/PRs-Welcome-brightgreen.svg)](CONTRIBUTING.md) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) -### 🎯 Real optimization tips from real users for AI services +### 🎯 Real usage tips from real users for AI services If you find **SCAPO** useful, please consider giving it a star on GitHub! Your support helps the project grow and reach more people. @@ -29,54 +29,51 @@ Your support helps the project grow and reach more people. 
**Keywords**: AI cost optimization, prompt engineering, LLM tips, OpenAI, Claude, Anthropic, Midjourney, Stable Diffusion, ElevenLabs, GitHub Copilot, reduce AI costs, AI service best practices, Reddit scraper, community knowledge base -Ever burned through credits in minutes? Searching Reddit for that one optimization tip? Getting generic advice when you need specific settings? +Ever burned through credits in minutes? Searched Reddit for one peculiar problem you were having? Search results offering only generic advice when you need specific info? ![Scapo Intro](assets/intro.gif) -**SCAPO** extracts **specific, actionable optimization techniques** from Reddit about AI services - not generic "write better prompts" advice, but real discussions. +**SCAPO** extracts **specific usage tips and discussion** from Reddit about AI services - not generic "write better prompts" advice, but real discussions. Being crowd wisdom, it can sometimes be wrong, but it will often raise your eyebrows: "huh? ok, didn't know that..." ## ✨ Two Approaches SCAPO offers two distinct workflows: -### 1. 🎯 **Service Discovery Mode** (NEW - Recommended) - -Automatically discovers AI services and extracts specific optimization tips: - -![Scapo Discover](assets/scrape-discovery.gif) - -Discover services from GitHub Awesome lists +### 1. 
🎯 **Batch Processing via Service Discovery (recommended)** Discovers existing AI services and caches them for reference and downstream usage (see below): ```bash scapo scrape discover --update ``` + +![Scapo Discover](assets/scrape-discovery.gif) + Extract optimization tips for specific services ```bash scapo scrape targeted --service "Eleven Labs" --limit 20 ``` +![Scapo Discover](assets/scrape-targeted.gif) -![Scapo Discover](assets/scrape-batch.gif) Batch process multiple priority services (Recommended) ```bash scapo scrape batch --max-services 3 --category audio ``` -![Scapo Discover](assets/scrape-batch.gif) ### 2. 📚 **Legacy Sources Mode** Traditional approach using predefined sources from `sources.yaml`: ```bash # Scrape from configured sources scapo scrape run --sources reddit:LocalLLaMA --limit 10 ``` +![Scapo Batch](assets/legacy.gif) + ## 🏃‍♂️ Quick Start (2 Minutes) @@ -102,6 +99,8 @@ cp .env.example .env ``` Get your API key from [openrouter.ai](https://openrouter.ai/) +* You can also use local LLMs (Ollama, LM Studio). See [QUICKSTART.md](./QUICKSTART.md) + ### 3. 
Start Extracting Optimization Tips @@ -122,7 +121,7 @@ scapo scrape batch --category video --limit 15 scapo scrape all --priority ultra --limit 20 ``` -#### Option B: Legacy Sources +#### Option B: Legacy method using the sources.yaml file ```bash # Use predefined sources from sources.yaml @@ -155,13 +154,6 @@ cat models/video/heygen/pitfalls.md ❌ **Generic**: "Try different settings" ✅ **Specific**: "Use 720p instead of 1080p in HeyGen to save 40% credits" -## 📊 Real Results - -From actual extractions: - **Eleven Labs**: Found 15+ specific optimization techniques from 75 Reddit posts - **GitHub Copilot**: Discovered exact limits and configuration tips - **Character.AI**: Found 32,000 character limit and mobile workarounds - **HeyGen**: Credit optimization techniques and API alternatives ## 🛠️ How It Works @@ -174,10 +166,10 @@ From actual extractions: ### Intelligent Extraction - **Specific search patterns**: "config settings", "API key", "rate limit daily", "parameters" - **Aggressive filtering**: Ignores generic advice like "be patient" -- **Batch processing**: Processes 50+ posts at once for efficiency -- **Context awareness**: Uses full 128k token windows when available +- **Batch processing**: Can process 50+ posts at once for efficiency (we recommend a minimum of 15 posts per query) +- **Context awareness**: Uses the full token window of your chosen LLM when available (for local LLMs, set your context window in .env) -### Smart Organization +### Output Organization ``` models/ ├── audio/ @@ -202,7 +194,7 @@ scapo scrape discover --show-all # List all services # Target specific services scapo scrape targeted \ - --service "Eleven Labs" \ # Service name (handles variations) + --service "Eleven Labs" \ # Service name (handles variations; any name works: if there is no hit in services.json, it is filed under the 'general' folder) --limit 20 \ # Posts per search (15-20 recommended) --max-queries 10 # Number of searches @@ 
-212,9 +204,6 @@ scapo scrape batch \ --max-services 3 \ # Services to process --limit 15 # Posts per search -# Check update status -scapo scrape update-status # See what needs updating -``` ### Legacy Sources Mode ```bash @@ -232,7 +221,7 @@ scapo scrape run \ # CLI commands scapo models list # List all models scapo models search "copilot" # Search models -scapo models info github-copilot --category coding +scapo models info github-copilot --category code ``` ## ⚙️ Configuration @@ -252,7 +241,7 @@ LOCAL_LLM_OPTIMAL_CHUNK=2048 # Optimal batch size (typically 1/4 of m LOCAL_LLM_TIMEOUT_SECONDS=600 # 10 minutes for slower local models LLM_TIMEOUT_SECONDS=120 # 2 minutes for cloud models -# Extraction Quality +# Extraction Quality (as judged by your chosen LLM) LLM_QUALITY_THRESHOLD=0.6 # Min quality (0.0-1.0) # Scraping @@ -264,7 +253,7 @@ MAX_POSTS_PER_SCRAPE=100 # Limit per source ```bash --limit 5 # ❌ Often finds nothing (too few samples) --limit 15 # ✅ Good baseline (finds common issues) ---limit 25 # 🎯 Optimal (uncovers hidden gems & edge cases) +--limit 25 # 🎯 Will find something (as long as there is active discussion on it) ``` So, a hand-wavy breakdown: with 5 posts, extraction success is ~20%; with 20+ posts, it jumps to ~80%. 
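A quick way to keep these knobs honest in scripts is to read them from the environment with the documented defaults. A minimal sketch, assuming only the variable names shown in the .env excerpt above (the loader itself is illustrative, not SCAPO's actual config code):

```python
import os

def load_extraction_settings(env=None):
    """Read SCAPO extraction settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    settings = {
        "quality_threshold": float(env.get("LLM_QUALITY_THRESHOLD", "0.6")),
        "max_posts": int(env.get("MAX_POSTS_PER_SCRAPE", "100")),
        "local_timeout_s": int(env.get("LOCAL_LLM_TIMEOUT_SECONDS", "600")),
        "cloud_timeout_s": int(env.get("LLM_TIMEOUT_SECONDS", "120")),
    }
    # The threshold is documented as 0.0-1.0; fail fast on typos like "6" or "60".
    if not 0.0 <= settings["quality_threshold"] <= 1.0:
        raise ValueError("LLM_QUALITY_THRESHOLD must be between 0.0 and 1.0")
    return settings
```

Passing a plain dict instead of `os.environ` makes the defaults easy to unit-test without touching the real environment.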
@@ -283,7 +272,7 @@ Navigate extracted tips with: ## 🔄 Git-Friendly Updates tracking AI services in the Models folder -SCAPO is designed for version control: +SCAPO is designed for version control (this is only for tracking the models folder): ```bash # Check what changed uv run scripts/git_update.py --status diff --git a/models/image/midjourney/cost_optimization.md b/models/image/midjourney/cost_optimization.md new file mode 100644 index 0000000..c99fd13 --- /dev/null +++ b/models/image/midjourney/cost_optimization.md @@ -0,0 +1,12 @@ +# Midjourney - Cost Optimization Guide + +*Last updated: 2025-08-15* + +## Cost & Pricing Information + +- 200 image limit +- I use Midjourney quite a bit for graphic design (mainly to generate assets for thumbnails to save trawling through hundreds of pages of stock images). But the tier I use is £30 a month. +- $10 version +- The company's image AI service, accessible through Discord, stands out with a diverse range of packages priced between $10 and $120 per month. 
+- $4 additional rollover GPU time + diff --git a/models/image/midjourney/metadata.json b/models/image/midjourney/metadata.json index 7e4189d..883ee88 100644 --- a/models/image/midjourney/metadata.json +++ b/models/image/midjourney/metadata.json @@ -1,13 +1,13 @@ { "service": "Midjourney", "category": "image", - "last_updated": "2025-08-11T23:01:57.902430", - "extraction_timestamp": "2025-08-11T23:01:57.902430", + "last_updated": "2025-08-15T14:50:42.037636", + "extraction_timestamp": "2025-08-15T14:50:34.751518", "data_sources": [ "Reddit API", "Community discussions" ], - "posts_analyzed": 0, + "posts_analyzed": 113, "confidence": "medium", "version": "1.0.0" } \ No newline at end of file diff --git a/models/image/midjourney/parameters.json b/models/image/midjourney/parameters.json new file mode 100644 index 0000000..855da06 --- /dev/null +++ b/models/image/midjourney/parameters.json @@ -0,0 +1,14 @@ +{ + "service": "Midjourney", + "last_updated": "2025-08-15T14:50:41.953959", + "recommended_settings": {}, + "cost_optimization": { + "tip_0": "200 image limit", + "tip_1": "I use Midjourney quite a bit for graphic design (mainly to generate assets for thumbnails to save trawling through hundreds of pages of stock images). But the tier I use is \u00a330 a month.", + "pricing": "$4 additional rollover GPU time" + }, + "sources": [ + "Reddit community", + "User reports" + ] +} \ No newline at end of file diff --git a/models/image/midjourney/pitfalls.md b/models/image/midjourney/pitfalls.md new file mode 100644 index 0000000..97246fa --- /dev/null +++ b/models/image/midjourney/pitfalls.md @@ -0,0 +1,24 @@ +# Midjourney - Common Pitfalls & Issues + +*Last updated: 2025-08-15* + +## Technical Issues + +### ⚠️ It's possible to queue 12 image generations/upscales in the Pro plan, but usually this is really annoying when I'm batch-upscaling images for later. Is there any way to bypass this 12 image queue limit? 
I don't want to have to go to Discord every few minutes to add more items to the queue (also it's really buggy and sometimes it's impossible to tell if an image has been added to the queue since the button doesn't get pressed) + +### ⚠️ TLDR: MJ is the best for artistic generations compared to other models, but is artificially limiting its use-cases by not offering an API to artists who want to create dynamic, interactive, artworks. I suggest a personal API tier to allow artists to use MJ in this way. + +--- + +I want to start by saying I understand there are many reasons why MJ would not want to offer an API. They are totally reasonable, especially from a business perspective. + +I want to present a case as to why I feel the lack of + +## Cost & Limits + +### 💰 Currently on the $10 version of midjourney, was curious if I hit the 200 image mark, then purchase the $4 additional rollover GPU time, if that will let me go over the limit? + +Thanks + +### 💰 200 image limit + diff --git a/models/image/midjourney/prompting.md b/models/image/midjourney/prompting.md index d51e67a..935f06c 100644 --- a/models/image/midjourney/prompting.md +++ b/models/image/midjourney/prompting.md @@ -1,10 +1,13 @@ # Midjourney Prompting Guide -*Last updated: 2025-08-11* +*Last updated: 2025-08-15* -## Usage Tips +## Tips & Techniques -- Try using the --raw parameter with Midjourney's Video Model +- Use the midjourney-python-api, an open-source Python client built for the unofficial MidJourney API, leveraging a Discord self bot and the Merubokkusu/Discord-S.C.U.M library. Key features include info retrieval, imagine prompt, image upscale and vectorization by lab. +- switch to fast mode +- upgrade your plan +- You can build a simple interface on Wix that uses open-source GitHub APIs to connect to Midjourney, sending image and text prompts and storing output images in a gallery after receiving their links. It required about 70 lines of code. 
## Sources diff --git a/src/cli.py b/src/cli.py index be70379..88a7ec6 100644 --- a/src/cli.py +++ b/src/cli.py @@ -389,7 +389,8 @@ async def _discover(): @click.option("--all", "run_all", is_flag=True, help="Run all generated queries") @click.option("--max-queries", "-m", default=10, help="Maximum queries to run (default: 10)") @click.option("--parallel", "-p", default=3, help="Number of parallel scraping tasks") -def targeted_scrape(service, category, limit, batch_size, dry_run, run_all, max_queries, parallel): +@click.option("--use-all-patterns", is_flag=True, help="Use ALL 20 search patterns instead of just 5 (uses all 4 patterns from each category: cost, optimization, technical, workarounds, bugs)") +def targeted_scrape(service, category, limit, batch_size, dry_run, run_all, max_queries, parallel, use_all_patterns): """Run targeted searches for specific AI services.""" show_banner() @@ -404,7 +405,7 @@ async def _targeted(): from datetime import datetime # Access outer scope variables - nonlocal service, category, limit, batch_size, dry_run, run_all, max_queries, parallel + nonlocal service, category, limit, batch_size, dry_run, run_all, max_queries, parallel, use_all_patterns # Generate targeted searches generator = TargetedSearchGenerator() @@ -413,7 +414,9 @@ async def _targeted(): if service and not category: # Just generate queries for the requested service - don't generate for all services first console.print(f"[cyan]Generating queries for {service}...[/cyan]") - queries = generator.generate_queries_for_service(service, max_queries=max_queries) + if use_all_patterns: + console.print(f"[yellow]Using ALL patterns (20 total search queries)[/yellow]") + queries = generator.generate_queries_for_service(service, max_queries=max_queries, use_all_patterns=use_all_patterns) if not queries: console.print(f"[red]Could not generate queries for service: {service}[/red]") @@ -422,7 +425,8 @@ async def _targeted(): # Generate queries based on category or all services 
all_queries = generator.generate_queries( max_queries=100 if run_all else max_queries, - category_filter=category if category else None + category_filter=category if category else None, + use_all_patterns=use_all_patterns ) queries = all_queries @@ -746,43 +750,6 @@ async def _batch(): asyncio.run(_batch()) -@scrape.command(name="update-status") -def update_status(): - """Show which services need updating.""" - show_banner() - - from src.services.update_manager import UpdateManager - manager = UpdateManager() - status = manager.get_update_status() - - # Display update status - console.print(Panel( - f"[bold]Update Status[/bold]\n\n" - f"Total services tracked: [cyan]{status['total_services']}[/cyan]\n" - f"Last update: [yellow]{status.get('last_update', 'Never')}[/yellow]\n" - f"Update frequency: {status.get('update_frequency', 'N/A')}\n", - border_style="blue", - title="SCAPO Update Tracker" - )) - - if status['recent_updates']: - console.print("\n[green]Recently Updated:[/green]") - for service in status['recent_updates'][:10]: - console.print(f" βœ“ {service}") - - if status['stale_services']: - console.print("\n[yellow]Needs Update (>30 days old):[/yellow]") - for service in status['stale_services'][:10]: - console.print(f" ⚠ {service}") - - if len(status['stale_services']) > 10: - console.print(f" ... and {len(status['stale_services']) - 10} more") - - # Suggest next action - if status['stale_services']: - console.print(f"\n[dim]Tip: Run 'scapo scrape batch --max-services {min(3, len(status['stale_services']))}' to update stale services[/dim]") - - @scrape.command(name="all") @click.option('-l', '--limit', default=20, help='Max posts per search (default: 20)') @click.option('-c', '--category', help='Filter by category (video, audio, code, etc)') @@ -1167,8 +1134,9 @@ def search_models(query, limit): console.print("[yellow]No models directory found. 
Run 'sota scrape run' first.[/yellow]") return - # Search through all categories and models - for category in ["text", "image", "video", "audio", "multimodal"]: + # Search through all categories and models dynamically + categories = [d for d in os.listdir(models_dir) if os.path.isdir(os.path.join(models_dir, d))] + for category in categories: cat_dir = os.path.join(models_dir, category) if os.path.exists(cat_dir): for model in os.listdir(cat_dir): diff --git a/src/scrapers/targeted_search_generator.py b/src/scrapers/targeted_search_generator.py index 6b54aaa..58ae19b 100644 --- a/src/scrapers/targeted_search_generator.py +++ b/src/scrapers/targeted_search_generator.py @@ -77,41 +77,98 @@ def load_services(self): data = json.load(f) self.services = data.get('services', {}) - def generate_queries_for_service(self, service_name: str, max_queries: int = 10) -> List[Dict]: - """Generate queries for a specific service""" + def generate_queries_for_service(self, service_name: str, max_queries: int = 10, use_all_patterns: bool = False) -> List[Dict]: + """Generate queries for a specific service + + Args: + service_name: The service to generate queries for + max_queries: Maximum number of queries to generate + use_all_patterns: If True, use ALL patterns (20 total), not just first from each category + """ queries = [] - # Generate queries for each problem pattern - patterns_to_use = list(self.problem_patterns.keys()) - queries_per_pattern = max(1, max_queries // len(patterns_to_use)) + # Try to get service info from alias manager + service_info = self.alias_manager.match_service(service_name) + if service_info: + service_category = service_info['category'] + service_key = service_info['service_key'] + else: + service_category = 'general' + service_key = service_name.lower().replace(' ', '-') - for pattern_type in patterns_to_use[:max_queries]: - pattern_list = self.problem_patterns[pattern_type] - # Use first pattern from each type - if pattern_list: - pattern = 
pattern_list[0] - query_text = pattern.replace('{service}', service_name) - query_url = f'https://old.reddit.com/search?q={query_text.replace(" ", "+").replace('"', "%22")}' - - query = { - 'service': service_name, - 'service_key': service_name.lower().replace(' ', '-'), - 'category': 'general', # Default category for custom queries - 'query': query_text, - 'query_url': query_url, - 'pattern': query_text, - 'pattern_type': pattern_type, - 'priority': 'custom', - 'generated': datetime.now().isoformat() - } - queries.append(query) + # If category is 'general', try to determine it from known keywords + if service_category == 'general': + service_name_lower = service_name.lower() + if any(keyword in service_name_lower for keyword in ['midjourney', 'dall-e', 'stable diffusion', 'leonardo', 'ideogram']): + service_category = 'image' + elif any(keyword in service_name_lower for keyword in ['runway', 'pika', 'luma', 'kaiber', 'genmo', 'haiper']): + service_category = 'video' + elif any(keyword in service_name_lower for keyword in ['elevenlabs', 'eleven labs', 'murf', 'play.ht', 'wellsaid', 'descript']): + service_category = 'audio' + elif any(keyword in service_name_lower for keyword in ['gpt', 'claude', 'llama', 'gemini', 'mistral']): + service_category = 'text' + elif any(keyword in service_name_lower for keyword in ['copilot', 'cursor', 'codeium', 'tabnine']): + service_category = 'code' + + if use_all_patterns: + # Use ALL patterns from each category + for pattern_type, pattern_list in self.problem_patterns.items(): + for pattern in pattern_list: + query_text = pattern.replace('{service}', service_name) + query_url = f'https://old.reddit.com/search?q={query_text.replace(" ", "+").replace('"', "%22")}' + + query = { + 'service': service_name, + 'service_key': service_key, + 'category': service_category, + 'query': query_text, + 'query_url': query_url, + 'pattern': query_text, + 'pattern_type': pattern_type, + 'priority': 'custom', + 'generated': 
datetime.now().isoformat() + } + queries.append(query) + + if len(queries) >= max_queries: + return queries[:max_queries] + else: + # Original behavior: use first pattern from each type + patterns_to_use = list(self.problem_patterns.keys()) + queries_per_pattern = max(1, max_queries // len(patterns_to_use)) + + for pattern_type in patterns_to_use[:max_queries]: + pattern_list = self.problem_patterns[pattern_type] + # Use first pattern from each type + if pattern_list: + pattern = pattern_list[0] + query_text = pattern.replace('{service}', service_name) + query_url = f'https://old.reddit.com/search?q={query_text.replace(" ", "+").replace('"', "%22")}' + + query = { + 'service': service_name, + 'service_key': service_key, + 'category': service_category, + 'query': query_text, + 'query_url': query_url, + 'pattern': query_text, + 'pattern_type': pattern_type, + 'priority': 'custom', + 'generated': datetime.now().isoformat() + } + queries.append(query) return queries[:max_queries] - def generate_queries(self, max_queries: int = 100, category_filter: str = None) -> List[Dict]: + def generate_queries(self, max_queries: int = 100, category_filter: str = None, use_all_patterns: bool = False) -> List[Dict]: """ Generate targeted search queries for discovered services Returns list of query dicts with: service, query_url, pattern_type, priority + + Args: + max_queries: Maximum number of queries to generate + category_filter: Filter services by category + use_all_patterns: If True, use ALL patterns (20 total), not just first from each category """ queries = [] @@ -152,7 +209,12 @@ def generate_queries(self, max_queries: int = 100, category_filter: str = None) prioritized_services.append((service_data, 'medium')) # Calculate services to process - services_to_process = min(len(prioritized_services), max(1, max_queries // 5)) # At least 1 service + if use_all_patterns: + # When using all patterns, we generate 20 queries per service + services_to_process = 
min(len(prioritized_services), max(1, max_queries // 20)) + else: + # Original behavior: 5 queries per service (one from each category) + services_to_process = min(len(prioritized_services), max(1, max_queries // 5)) # Generate queries for prioritized services for service_data, priority in prioritized_services[:services_to_process]: @@ -160,7 +222,10 @@ def generate_queries(self, max_queries: int = 100, category_filter: str = None) # Generate queries for each pattern type for pattern_type, patterns in self.problem_patterns.items(): - for pattern in patterns[:1]: # Take first pattern of each type to avoid explosion + # Use all patterns or just first one based on flag + patterns_to_use = patterns if use_all_patterns else patterns[:1] + + for pattern in patterns_to_use: query = pattern.replace('{service}', service_name) query_url = f'https://old.reddit.com/search?q={query.replace(" ", "+").replace('"', "%22")}' diff --git a/src/services/model_entry_generator.py b/src/services/model_entry_generator.py index 5f90d4f..f6077b0 100644 --- a/src/services/model_entry_generator.py +++ b/src/services/model_entry_generator.py @@ -34,11 +34,25 @@ def categorize_service(self, service_name: str) -> str: alias_manager = ServiceAliasManager() service_match = alias_manager.match_service(service_name) + category = 'general' if service_match: - return service_match.get('category', 'general') + category = service_match.get('category', 'general') - # Default to general if not found - return 'general' + # If category is 'general', try to determine it from known keywords + if category == 'general': + service_name_lower = service_name.lower() + if any(keyword in service_name_lower for keyword in ['midjourney', 'dall-e', 'stable diffusion', 'leonardo', 'ideogram']): + category = 'image' + elif any(keyword in service_name_lower for keyword in ['runway', 'pika', 'luma', 'kaiber', 'genmo', 'haiper']): + category = 'video' + elif any(keyword in service_name_lower for keyword in ['elevenlabs', 
'eleven labs', 'murf', 'play.ht', 'wellsaid', 'descript']): + category = 'audio' + elif any(keyword in service_name_lower for keyword in ['gpt', 'claude', 'llama', 'gemini', 'mistral']): + category = 'text' + elif any(keyword in service_name_lower for keyword in ['copilot', 'cursor', 'codeium', 'tabnine']): + category = 'code' + + return category def normalize_service_name(self, service_name: str) -> str: """Normalize service name for file/folder naming"""
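Review note on the Python hunks above: the keyword-to-category fallback is now duplicated verbatim in `targeted_search_generator.py` and `model_entry_generator.py`, and the query URLs are built by nesting quote characters inside single-quoted f-strings, which only parses on Python 3.12+ (PEP 701). A possible consolidation, sketched with hypothetical names (`CATEGORY_KEYWORDS`, `infer_category`, and `build_reddit_search_url` are not existing project APIs); the keyword lists are copied from this diff, and `urllib.parse.quote_plus` replaces the manual `.replace()` encoding:

```python
from urllib.parse import quote_plus

# Keyword lists copied from the category-inference blocks added in this diff.
CATEGORY_KEYWORDS = {
    "image": ["midjourney", "dall-e", "stable diffusion", "leonardo", "ideogram"],
    "video": ["runway", "pika", "luma", "kaiber", "genmo", "haiper"],
    "audio": ["elevenlabs", "eleven labs", "murf", "play.ht", "wellsaid", "descript"],
    "text": ["gpt", "claude", "llama", "gemini", "mistral"],
    "code": ["copilot", "cursor", "codeium", "tabnine"],
}

def infer_category(service_name: str, default: str = "general") -> str:
    """Substring-match a service name against known keywords; fall back to `default`."""
    name = service_name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return category
    return default

def build_reddit_search_url(query: str) -> str:
    """Encode a search query for old.reddit.com; quote_plus escapes spaces, quotes, etc."""
    return f"https://old.reddit.com/search?q={quote_plus(query)}"
```

Keeping one table would let both call sites stay in sync, and `quote_plus` also covers reserved characters (`&`, `#`, `+`) that the two hand-written `.replace()` calls silently leave unencoded.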