
Commit 056cf45

feat: Add comprehensive Apache Cassandra database integration
- Implement CassandraManager with full CRUD operations and analytics
- Add distributed data storage with automatic URL and content deduplication
- Create dynamic seed URL management system stored in database
- Implement time-series crawl statistics and performance tracking
- Add content versioning and change detection capabilities
- Create CassandraParserManager extending existing parser with database features
- Add Docker Compose setup for complete Cassandra deployment
- Implement comprehensive test suite with 15+ test cases for database operations
- Add demo script showcasing all Cassandra integration features
- Support graceful fallback when database unavailable

Database Architecture:
- Articles table: Main content storage with time-series partitioning
- URL tracker: Deduplication and processing history management
- Seeds table: Dynamic crawler target management with prioritization
- Statistics: Performance metrics and monitoring counters
- History: Content versioning and change detection tracking

Technical Features:
- High-throughput write optimization for large-scale scraping
- Horizontal scaling support across multiple Cassandra nodes
- Intelligent deduplication using URL and content hashing
- Database-driven seed management replacing file-based approach
- Async/await integration with existing scraper architecture
- Production-ready error handling and connection management
- Time-series analytics for crawl performance monitoring

Infrastructure:
- Complete Docker Compose stack with Cassandra cluster
- Web UI for database administration and monitoring
- Health checks and automated service orchestration
- Volume persistence and network configuration
- Integration with existing Kubernetes deployment
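The "intelligent deduplication using URL and content hashing" above can be sketched as two stable hash keys, one per URL and one per article body. This is a minimal illustration of the idea, not the repo's actual `CassandraManager` code; the helper names and the simplified URL normalization are assumptions:

```python
import hashlib


def url_hash(url: str) -> str:
    """Stable key for URL-level deduplication (normalization simplified here)."""
    normalized = url.strip().lower().rstrip("/")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def content_hash(text: str) -> str:
    """Stable key for content-level deduplication; whitespace is collapsed
    so trivial reflows of the same article don't register as changes."""
    canonical = " ".join(text.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# A duplicate check then becomes two lookups against the URL-tracker table:
# skip the article if its url_hash was already seen; if the URL is known but
# content_hash differs, record a new content version in the history table.
```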
1 parent bea3843 commit 056cf45

9 files changed

Lines changed: 1233 additions & 1 deletion

README.md

Lines changed: 11 additions & 1 deletion
```diff
@@ -31,6 +31,7 @@ This project demonstrates enterprise-level software design patterns:
 ## 🚀 Technical Stack
 
 - **AI/ML Framework**: LangChain with prompt engineering and structured output parsing
+- **Database**: Apache Cassandra for distributed data storage and deduplication
 - **Browser Automation**: Crawlee + Playwright for sophisticated queue management
 - **Data Validation**: Pydantic v2 with advanced type checking and serialization
 - **Content Extraction**: Multi-method approach (Newspaper3k + Trafilatura + custom)
@@ -72,6 +73,14 @@ This project demonstrates enterprise-level software design patterns:
 - **Security**: Pod security standards, RBAC, and minimal privilege execution
 - **Reliability**: Failure recovery, resource cleanup, and graceful degradation
 
+### Distributed Database Engineering
+- **Cassandra Integration**: High-performance, scalable NoSQL database for web scraping data
+- **Data Deduplication**: Intelligent URL and content duplicate detection and prevention
+- **Dynamic Seed Management**: Database-driven crawler seed URL management and prioritization
+- **Time-Series Analytics**: Crawl statistics, performance metrics, and historical data tracking
+- **Content Versioning**: Track article changes over time with automated change detection
+- **Horizontal Scaling**: Distributed architecture supporting multi-node deployments
+
 ### Testing & Quality Assurance
 - **Test-Driven Development**: 26+ automated tests covering multiple scenarios
 - **Integration Testing**: End-to-end workflow validation
@@ -92,6 +101,7 @@ This project demonstrates enterprise-level software design patterns:
 ## 🏛️ Core Features
 
 - **AI-Enhanced Content Analysis**: LangChain-powered summarization, sentiment analysis, and topic classification
+- **Distributed Database Storage**: Cassandra integration with deduplication and analytics
 - **Multi-Parser Architecture**: Automatic parser selection based on URL fingerprinting
 - **Advanced Kubernetes Orchestration**: Enterprise-grade batch processing with auto-scaling
 - **Production Pipelines**: Complete CI/CD with automated testing, building, and deployment
@@ -306,4 +316,4 @@ jobs:
 
 ## 🔧 Technical Keywords
 
-`Python` • `LangChain` • `AI/ML Engineering` • `Async/Await` • `Pydantic` • `Playwright` • `Docker` • `Kubernetes` • `GitHub Actions` • `Test-Driven Development` • `Clean Architecture` • `Design Patterns` • `Type Safety` • `CI/CD` • `Container Orchestration` • `Web Scraping` • `Parser Registry` • `Strategy Pattern` • `Prompt Engineering` • `Content Analysis`
+`Python` • `Apache Cassandra` • `Distributed Systems` • `LangChain` • `AI/ML Engineering` • `Async/Await` • `Pydantic` • `Playwright` • `Docker` • `Kubernetes` • `GitHub Actions` • `Test-Driven Development` • `Clean Architecture` • `Design Patterns` • `Type Safety` • `CI/CD` • `Container Orchestration` • `Web Scraping` • `Parser Registry` • `Strategy Pattern` • `Prompt Engineering` • `Content Analysis` • `Database Engineering` • `Data Deduplication` • `Time-Series Analytics`
```
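The "Time-Series Analytics" bullet above is typically backed by bucketing rows into coarse time windows that form part of the Cassandra partition key. A hedged sketch of that pattern follows; the daily granularity and key layout are assumptions for illustration, not the project's actual schema:

```python
from datetime import datetime, timezone


def day_bucket(ts: datetime) -> str:
    """Coarse time bucket used as a partition-key component, so one day's
    crawl data lands in one partition and range scans stay bounded."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


# A partition key of (parser, day_bucket) with clustering on the scrape
# timestamp spreads writes across parsers while keeping per-day analytics
# queries cheap (one partition read per parser per day).
key = ("generic_news", day_bucket(datetime(2024, 5, 17, 8, 30, tzinfo=timezone.utc)))
```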

demo_cassandra_integration.py

Lines changed: 172 additions & 0 deletions
```python
#!/usr/bin/env python3
"""
Cassandra Database Integration Demo
Demonstrates distributed data storage, deduplication, and seed management.
"""

import asyncio
from datetime import datetime

from src.database.cassandra_manager import CassandraManager, CassandraConfig
from src.schemas.news import NewsArticle


async def main():
    """Demonstrate Cassandra database integration capabilities."""
    print("🗄️ Cassandra Database Integration Demo")
    print("=" * 50)

    # Configuration
    config = CassandraConfig(
        hosts=["localhost"],
        keyspace="web_scraper_demo",
        replication_factor=1
    )

    try:
        # Initialize database connection
        print("\n🔌 Connecting to Cassandra...")
        manager = CassandraManager(config)
        await manager.connect()
        print("✅ Connected successfully!")

        # Demo 1: Store sample articles
        print("\n📰 Storing sample articles...")

        sample_articles = [
            # Only title and url are required - all other fields are Optional
            NewsArticle(  # type: ignore
                title="Revolutionary AI Breakthrough in Healthcare",
                content="Researchers have developed an AI system that can diagnose diseases "
                        "with 95% accuracy, potentially transforming medical care worldwide.",
                url="https://example.com/ai-healthcare-breakthrough",
                author="Dr. Jane Smith"
            ),
            NewsArticle(  # type: ignore
                title="Climate Change Solutions: New Carbon Capture Technology",
                content="Scientists unveil innovative carbon capture technology that could "
                        "remove millions of tons of CO2 from the atmosphere annually.",
                url="https://example.com/carbon-capture-tech"
            ),
            NewsArticle(  # type: ignore
                title="Quantum Computing Milestone Achieved",
                content="Tech giant announces quantum computer with 1000+ qubits, bringing "
                        "practical quantum computing closer to reality.",
                url="https://example.com/quantum-milestone",
                author="Tech Reporter"
            )
        ]

        stored_count = 0
        duplicate_count = 0

        for article in sample_articles:
            was_stored = await manager.store_article(article, "generic_news")
            if was_stored:
                stored_count += 1
                print(f"   ✅ Stored: {article.title[:50]}...")
            else:
                duplicate_count += 1
                print(f"   ⚠️ Duplicate: {article.title[:50]}...")

        print(f"\n📊 Storage Results: {stored_count} stored, {duplicate_count} duplicates")

        # Demo 2: Test deduplication
        print("\n🔄 Testing deduplication...")

        # Try to store the same article again
        duplicate_article = sample_articles[0]  # First article again
        was_stored = await manager.store_article(duplicate_article, "generic_news")

        if not was_stored:
            print("✅ Deduplication working correctly - duplicate detected and skipped")
        else:
            print("❌ Deduplication failed - duplicate was stored")

        # Demo 3: Add seed URLs
        print("\n🌱 Managing seed URLs...")

        seed_urls = [
            {
                "url": "https://techcrunch.com",
                "label": "h2 a",
                "parser": "news",
                "priority": 8
            },
            {
                "url": "https://news.ycombinator.com",
                "label": "a.storylink",
                "parser": "news",
                "priority": 6
            },
            {
                "url": "https://reddit.com/r/technology",
                "label": "a[data-click-id='body']",
                "parser": "generic_news",
                "priority": 5
            }
        ]

        for seed in seed_urls:
            await manager.add_seed_url(
                url=seed["url"],
                label=seed["label"],
                parser=seed["parser"],
                priority=seed["priority"]
            )
            print(f"   ✅ Added seed: {seed['url']}")

        # Demo 4: Retrieve seeds from database
        print("\n📋 Retrieving seeds from database...")

        seeds = await manager.get_seed_urls(limit=10)
        print(f"Found {len(seeds)} active seeds:")

        for i, seed in enumerate(seeds, 1):
            print(f"   {i}. {seed['url']}")
            print(f"      Label: {seed['label']}")
            print(f"      Parser: {seed['parser']}")

        # Demo 5: Get crawl statistics
        print("\n📈 Crawl Statistics...")

        stats = await manager.get_crawl_statistics(days=1)
        if stats:
            for metric, count in stats.items():
                print(f"   {metric}: {count}")
        else:
            print("   No statistics available yet")

        print("\n🎯 Key Features Demonstrated:")
        print("   ✅ Distributed data storage with Cassandra")
        print("   ✅ Automatic URL and content deduplication")
        print("   ✅ Dynamic seed URL management from database")
        print("   ✅ Time-series data tracking and statistics")
        print("   ✅ Scalable architecture for high-volume scraping")
        print("   ✅ Content versioning and change tracking")

        print("\n🔧 Database Architecture:")
        print("   • Articles table: Main content storage with partitioning")
        print("   • URL tracker: Deduplication and processing history")
        print("   • Seeds table: Dynamic crawl target management")
        print("   • Statistics: Performance metrics and monitoring")
        print("   • History: Content versioning and change detection")

        print("\n🚀 Production Benefits:")
        print("   • High write throughput for large-scale scraping")
        print("   • Horizontal scaling across multiple nodes")
        print("   • No single point of failure with replication")
        print("   • Efficient time-series data for analytics")
        print("   • Schema flexibility for varying content structures")

        # Cleanup
        await manager.close()
        print("\n✨ Demo completed successfully!")

    except Exception as e:
        print(f"\n❌ Demo failed: {e}")
        print("\n💡 Make sure Cassandra is running:")
        print("   docker-compose -f docker-compose.cassandra.yml up -d cassandra")


if __name__ == "__main__":
    asyncio.run(main())
```
Lines changed: 25 additions & 0 deletions
```sql
-- Cassandra initialization script
-- Creates keyspace and initial seed data

-- Create keyspace
CREATE KEYSPACE IF NOT EXISTS web_scraper
WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 1
};

USE web_scraper;

-- Sample seed URLs
INSERT INTO seeds (seed_id, url, label, parser, priority, added_at, success_count, failure_count, status, metadata)
VALUES (uuid(), 'https://example.com/news', 'a.article-link', 'news', 5, toTimestamp(now()), 0, 0, 'active', {});

INSERT INTO seeds (seed_id, url, label, parser, priority, added_at, success_count, failure_count, status, metadata)
VALUES (uuid(), 'https://techcrunch.com', 'a[data-module="ArticleCard"]', 'news', 8, toTimestamp(now()), 0, 0, 'active', {});

INSERT INTO seeds (seed_id, url, label, parser, priority, added_at, success_count, failure_count, status, metadata)
VALUES (uuid(), 'https://news.ycombinator.com', 'a.storylink', 'news', 6, toTimestamp(now()), 0, 0, 'active', {});

-- Sample configuration
INSERT INTO seeds (seed_id, url, label, parser, priority, added_at, success_count, failure_count, status, metadata)
VALUES (uuid(), 'https://reddit.com/r/technology', 'a[data-click-id="body"]', 'generic_news', 4, toTimestamp(now()), 0, 0, 'active', {'source': 'reddit'});
```

docker-compose.cassandra.yml

Lines changed: 71 additions & 0 deletions
```yaml
version: '3.8'

services:
  cassandra:
    image: cassandra:4.1
    container_name: web-scraper-cassandra
    ports:
      - "9042:9042"
      - "9160:9160"  # Thrift port (optional)
    environment:
      - CASSANDRA_CLUSTER_NAME=WebScraperCluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_RACK=rack1
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - CASSANDRA_NUM_TOKENS=256
      - MAX_HEAP_SIZE=1G
      - HEAP_NEWSIZE=200M
    volumes:
      - cassandra_data:/var/lib/cassandra
      - ./deployment/cassandra/cassandra.yaml:/etc/cassandra/cassandra.yaml
      - ./deployment/cassandra/init-scripts:/docker-entrypoint-initdb.d
    networks:
      - scraper-network
    healthcheck:
      test: ["CMD-SHELL", "cqlsh -e 'describe cluster'"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 2m

  cassandra-web:
    image: markusgulden/cassandra-web:latest
    container_name: cassandra-web-ui
    ports:
      - "3000:3000"
    environment:
      - CASSANDRA_HOST=cassandra
      - CASSANDRA_PORT=9042
    depends_on:
      cassandra:
        condition: service_healthy
    networks:
      - scraper-network

  web-scraper:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: web-scraper-app
    environment:
      - CASSANDRA_HOSTS=cassandra
      - CASSANDRA_KEYSPACE=web_scraper
      - CASSANDRA_PORT=9042
      - PYTHONPATH=/app
    volumes:
      - ./storage:/app/storage
      - ./logs:/app/logs
    depends_on:
      cassandra:
        condition: service_healthy
    networks:
      - scraper-network
    command: ["python", "src/main.py", "--file", "seeds.txt"]

volumes:
  cassandra_data:
    driver: local

networks:
  scraper-network:
    driver: bridge
```

requirements.txt

Lines changed: 4 additions & 0 deletions
```diff
@@ -12,6 +12,10 @@ aiofiles>=23.0.0
 kubernetes>=29.0.0
 structlog>=23.2.0
 
+# Database dependencies
+cassandra-driver>=3.28.0
+pandas>=2.0.0  # For data analysis and export
+
 # LangChain AI/ML dependencies
 langchain>=0.3.27
 langchain-community>=0.3.29
```
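requirements.txt pins pandas "for data analysis and export". A minimal sketch of what that export path might look like; the row shape is an assumption (a hypothetical `get_articles()`-style result), with field names borrowed from the demo script:

```python
import pandas as pd

# Rows as they might come back from a Cassandra articles query (assumed shape).
rows = [
    {"url": "https://example.com/ai-healthcare-breakthrough",
     "title": "Revolutionary AI Breakthrough in Healthcare",
     "parser": "generic_news", "scraped_at": "2024-05-17T08:30:00Z"},
    {"url": "https://example.com/quantum-milestone",
     "title": "Quantum Computing Milestone Achieved",
     "parser": "generic_news", "scraped_at": "2024-05-17T09:00:00Z"},
]

df = pd.DataFrame(rows)
df["scraped_at"] = pd.to_datetime(df["scraped_at"])

# Per-parser daily counts - the kind of time-series rollup the stats table feeds.
daily = df.groupby([df["scraped_at"].dt.date, "parser"]).size()

# Export for offline analysis.
df.to_csv("articles_export.csv", index=False)
```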

src/database/__init__.py

Lines changed: 12 additions & 0 deletions
```python
"""
Database package initialization.
Provides unified access to database components.
"""

from .cassandra_manager import CassandraManager, CassandraConfig, create_cassandra_manager

__all__ = [
    'CassandraManager',
    'CassandraConfig',
    'create_cassandra_manager'
]
```
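The commit notes "graceful fallback when database unavailable". Here is a self-contained sketch of that pattern using a stand-in manager class; the real `CassandraManager`/`create_cassandra_manager` and the file-based fallback live in the repo and may differ:

```python
import asyncio


class UnavailableManager:
    """Stand-in for CassandraManager when no cluster is reachable."""

    async def connect(self):
        raise ConnectionError("no Cassandra hosts reachable")


async def load_seeds(manager, fallback_file_seeds):
    """Prefer database-driven seeds; degrade to the file-based list on failure."""
    try:
        await manager.connect()
        return await manager.get_seed_urls(limit=100)
    except Exception:
        # Graceful fallback: keep crawling with the static seed file.
        return fallback_file_seeds


seeds = asyncio.run(load_seeds(UnavailableManager(), ["https://example.com/news"]))
```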
