📄 Full Document Processing Guide

🎯 Enable Processing of All Pages

Since you want to process all pages instead of just the first page, here's how to enable full document processing while still avoiding timeouts.

🔧 Quick Setup for App Platform

Add this environment variable to your App Platform settings:

PROCESS_FULL_DOCUMENT=true

Complete environment variables for full processing:

DEPLOY_PLATFORM=app_platform
AGGRESSIVE_OPTIMIZATION=true
ENVIRONMENT=production
PROCESS_FULL_DOCUMENT=true
OPENAI_API_KEY=your-openai-key
GOOGLE_API_KEY=your-google-key
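
The guide does not show how the app reads these variables. As a rough sketch only, a Python service might parse the boolean flags as below. The helper names `env_flag` and `load_settings`, the set of accepted truthy strings, and the default of `False` when a flag is unset are all assumptions, not the app's actual code.

```python
import os

# Strings treated as "enabled"; this set is an assumption for illustration.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Return True when the named variable holds a common truthy string."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def load_settings(environ=None):
    """Collect the non-secret settings listed in the guide (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return {
        "deploy_platform": environ.get("DEPLOY_PLATFORM", ""),
        "environment": environ.get("ENVIRONMENT", ""),
        "aggressive_optimization": env_flag("AGGRESSIVE_OPTIMIZATION", environ=environ),
        "process_full_document": env_flag("PROCESS_FULL_DOCUMENT", environ=environ),
    }
```

Passing a plain dict as `environ` makes the parsing easy to check locally before redeploying.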

🧠 Smart Processing Strategy

The system will now use smart page sampling for large PDFs to avoid timeouts:

For Large PDFs (>50MB like your 290-page PDF):

  • Strategy: Smart sampling of ~30 important pages
  • Pages Selected:
    • ✅ First 5 pages (introduction)
    • ✅ Pages from each quartile (25%, 50%, 75%)
    • ✅ Last 3 pages (conclusion)
    • ✅ Strategic sampling from middle sections
  • Processing Time: 3-8 minutes
  • Content Coverage: Representative overview of the entire document (about 30 sampled pages, not the complete text)
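
The selection rules above can be illustrated with a short sketch. This is not the app's real sampler: the function name `sample_pages`, the evenly spaced fill for the "middle sections", and the rounding of quartile positions are assumptions chosen to match the bullets.

```python
def sample_pages(total_pages, target=30, head=5, tail=3):
    """Pick 1-based page numbers: the first `head` pages, the quartile pages,
    the last `tail` pages, then evenly spaced middle pages up to `target`."""
    if total_pages <= target:
        # Small enough to take every page.
        return list(range(1, total_pages + 1))

    chosen = set(range(1, head + 1))                                # introduction
    chosen.update(range(total_pages - tail + 1, total_pages + 1))   # conclusion
    for fraction in (0.25, 0.50, 0.75):                             # quartile pages
        chosen.add(max(1, round(total_pages * fraction)))

    # Spend the remaining budget on evenly spaced pages between head and tail.
    remaining = target - len(chosen)
    if remaining > 0:
        start, end = head + 1, total_pages - tail
        step = (end - start) / (remaining + 1)
        for i in range(1, remaining + 1):
            chosen.add(start + round(i * step))

    # A fill page can coincide with a quartile page, so the result may be
    # slightly shorter than `target`.
    return sorted(chosen)
```

For example, `sample_pages(290)` returns at most 30 page numbers that always include pages 1-5 and 288-290.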

For Smaller PDFs (<50MB):

  • Strategy: Process ALL pages using chunking
  • Processing Time: 2-5 minutes
  • Content Coverage: Complete document
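
The choice between the two strategies comes down to one size comparison, sketched below. The function and strategy names are hypothetical. The guide does not say whether "50MB" means 50 × 1024² bytes, or what happens at exactly 50MB, so both choices here are assumptions.

```python
# 50MB threshold from the guide; interpreting MB as 1024 * 1024 bytes is an assumption.
LARGE_PDF_BYTES = 50 * 1024 * 1024


def choose_strategy(file_size_bytes, process_full_document=True):
    """Return a label for the processing strategy (hypothetical names)."""
    if not process_full_document:
        # Without PROCESS_FULL_DOCUMENT the guide implies only page 1 is read.
        return "first_page_only"
    if file_size_bytes > LARGE_PDF_BYTES:
        return "smart_sampling"
    # Files at or under the threshold are processed completely, in chunks.
    return "all_pages_chunked"
```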

📊 Expected Results for Your 290-Page PDF

{
  "success": true,
  "processing_time": 280.5,
  "text_length": 45000,
  "extracted_text": "=== FULL DOCUMENT EXTRACTION ===\nDocument: shrek.pdf\nTotal pages: 290\nPages processed: 30\nProcessing strategy: Smart sampling\n\n=== INTRODUCTION (Page 1) ===\n[Content from page 1]\n\n=== EARLY CONTENT (Page 15) ===\n[Content from page 15]\n\n=== FIRST HALF (Page 72) ===\n[Content from page 72]\n\n=== SECOND HALF (Page 145) ===\n[Content from page 145]\n\n=== LATE CONTENT (Page 220) ===\n[Content from page 220]\n\n=== CONCLUSION (Page 290) ===\n[Content from page 290]",
  "processor_used": "OpenAIProcessor"
}
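
To confirm how much of the document was actually covered, the `Total pages` and `Pages processed` lines in `extracted_text` can be read back out. The sketch below assumes the header lines always look like the sample above. The function name `coverage_from_response` is illustrative and not part of the service.

```python
import json
import re


def coverage_from_response(body):
    """Return (pages_processed, total_pages) parsed from the response body,
    or None when the header lines are missing."""
    data = json.loads(body)
    text = data.get("extracted_text", "")
    total = re.search(r"^Total pages:\s*(\d+)", text, re.MULTILINE)
    done = re.search(r"^Pages processed:\s*(\d+)", text, re.MULTILINE)
    if not (total and done):
        return None
    return int(done.group(1)), int(total.group(1))
```

With the sample response above this yields `(30, 290)`, which makes it clear that the large-PDF path returns a sample, not every page.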

🎚️ Processing Options

Option 1: Smart Sampling (Recommended for App Platform)

  • Best for: Large PDFs on App Platform
  • Pages: ~30 strategically selected pages
  • Time: 3-8 minutes
  • Reliability: High (no timeouts)

Option 2: Complete Processing (For Smaller PDFs)

  • Best for: PDFs under 50MB
  • Pages: All pages
  • Time: 2-5 minutes
  • Reliability: High

🔄 How It Works

  1. File Size Check: The system detects that your 63.96MB PDF is over the 50MB threshold
  2. Smart Sampling: Selects the ~30 most important pages
  3. Mixed Processing: Uses text extraction, plus the Vision API where needed
  4. Structured Output: Organizes content by document section
  5. Complete Result: Returns a comprehensive overview in 3-8 minutes

📈 Content Quality

Smart sampling ensures you get:

  • ✅ Introduction: Key concepts and overview
  • ✅ Early Content: Foundation material
  • ✅ Mid-sections: Core topics from each quarter
  • ✅ Late Content: Advanced topics
  • ✅ Conclusion: Summary and key takeaways
  • ✅ Total Coverage: Representative content from entire document

🚀 Deploy Instructions

  1. Add environment variable: PROCESS_FULL_DOCUMENT=true
  2. Redeploy your app in App Platform
  3. Test with your PDF - should complete in 3-8 minutes
  4. Verify results - you'll get content from across the entire document

🧪 Test the New Behavior

curl -X POST "https://your-app.ondigitalocean.app/extract/url" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://daniel.com.pt/shrek.pdf"}'
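
If you prefer to test from a script, the same request can be sent with Python's standard library. This is a minimal sketch: the helper names are made up for illustration, and the 600-second client timeout is an assumption chosen to sit above the 3-8 minute processing window.

```python
import json
import urllib.request


def build_extract_request(base_url, pdf_url):
    """Build the same POST request as the curl command above."""
    payload = json.dumps({"url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/extract/url",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract(base_url, pdf_url, timeout_seconds=600):
    """Send the request and return the decoded JSON response."""
    request = build_extract_request(base_url, pdf_url)
    # Wait longer than the expected 3-8 minutes so the client does not give up first.
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.load(response)
```

Usage would look like `result = extract("https://your-app.ondigitalocean.app", "https://daniel.com.pt/shrek.pdf")`, after which `result["processing_time"]` and `result["text_length"]` can be compared with the values expected below.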

You should now get:

  • ✅ Content from ~30 pages across the document
  • ✅ Processing time: 3-8 minutes
  • ✅ Text length: 40,000+ characters
  • ✅ No timeout errors

💡 Pro Tips

  1. For Even More Content: If you need more pages, increase the sampling by setting:

    PDF_MAX_PAGES_PER_CHUNK=3

  2. For Faster Processing: If processing is still too slow, reduce the sampling by setting:

    PDF_MAX_CHUNKS_TO_PROCESS=20

  3. For Complete Processing: For smaller PDFs (under 50MB), the system automatically processes all pages.

This gives you comprehensive document coverage while staying within App Platform's timeout limits! 🎉