Releases: raintree-technology/docpull
v2.2.0: Resume, Auth, JSON/SQLite output
New Features
- Resume capability (`--resume`): Continue interrupted fetches
- URL preview mode (`--preview-urls`): See discovered URLs before fetching
- Authentication support: `--auth-bearer`, `--auth-basic`, `--auth-cookie`, `--auth-header`
- Env var expansion for auth tokens (`$VAR` and `${VAR}` syntax)
- Adaptive rate limiting (`--adaptive-rate-limit`): Auto-adjust based on 429 responses
- JSON output (`--format json`): Stream documents to a single JSON file
- SQLite output (`--format sqlite`): Save to a SQLite database
- Skip reason tracking: Better progress feedback
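As a rough illustration of adjusting the request rate from 429 responses, the sketch below doubles the pause between requests on each 429 and relaxes it slowly on success. The `AdaptiveDelay` class, its field names, and the specific factors are assumptions made for this example; the release notes do not describe how docpull implements `--adaptive-rate-limit`.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveDelay:
    """Hypothetical helper: tracks the pause to insert between requests."""

    delay: float = 0.5       # current pause in seconds (assumed default)
    min_delay: float = 0.1   # floor reached after sustained success
    max_delay: float = 30.0  # ceiling reached after repeated 429s

    def record(self, status: int) -> float:
        """Update the delay from one response status and return the new value."""
        if status == 429:
            # Back off multiplicatively when the server reports throttling.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Recover gradually while requests keep succeeding.
            self.delay = max(self.delay * 0.9, self.min_delay)
        return self.delay


limiter = AdaptiveDelay()
for status in (200, 429, 429, 200):
    limiter.record(status)
```

A caller would sleep for `limiter.delay` seconds before each request. Doubling on failure and shrinking by 10% on success is one common choice, not necessarily the one docpull uses.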
Breaking Changes
- Requires Python 3.10+ (dropped 3.9 support)
Install
pip install docpull --upgrade
v2.0.0 - Complete Architecture Rewrite
Breaking Changes
- New Python API: `Fetcher` class with async context manager and streaming events
- src/ layout: PEP 517/518 compliant package structure
- Pydantic models: Configuration via `DocpullConfig` instead of dictionaries
- Removed v1.x modules: All deprecated code removed
New Features
- Streaming Event API: `AsyncIterator[FetchEvent]` for real-time progress
- Pipeline Architecture: Composable steps (Validate, Fetch, Convert, Dedup, Save)
- CacheManager: O(1) lookups with batched writes and TTL eviction
- StreamingDeduplicator: Real-time content deduplication via SHA-256
- JavaScript Rendering: Browser-based fetching via Playwright
- Profile Presets: RAG, MIRROR, QUICK for common use cases
- Rate Limiting: Per-host concurrent request limits
- Security: robots.txt respect and URL validation
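To make the idea of real-time deduplication via SHA-256 concrete, here is a minimal sketch that filters a stream of documents by content hash. The function name `dedupe_stream` and the `(url, content)` tuple shape are assumptions for illustration; they are not the `StreamingDeduplicator` interface.

```python
import hashlib
from typing import Iterable, Iterator


def dedupe_stream(documents: Iterable[tuple[str, str]]) -> Iterator[tuple[str, str]]:
    """Yield (url, content) pairs, skipping content that was already seen."""
    seen: set[str] = set()
    for url, content in documents:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical body already emitted under another URL
        seen.add(digest)
        yield url, content
```

Because it is a generator, each document is checked as it arrives and only the digests are kept in memory.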
Quick Start
```bash
# CLI
docpull https://docs.example.com --profile rag

# Python API
import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    async with Fetcher(DocpullConfig(url="https://docs.example.com", profile=ProfileName.RAG)) as f:
        async for event in f.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}")

asyncio.run(main())
```
Full Changelog
See CHANGELOG.md
v1.5.0
Release v1.5.0: Major Simplification and Modernization
Breaking Changes
- Removed legacy profile system (stripe-specific profiles)
- Removed deprecated `requirements.txt` (use `pyproject.toml` instead)
Changes
- Simplified architecture: Consolidated utils into main package
- Reorganized documentation: Moved CONTRIBUTING.md and SECURITY.md to `.github/`
- Added GitHub issue templates configuration
- Cleaner fetcher architecture: Removed stripe-specific fetcher
- Updated tests for new structure
Removed Files
- `CHANGELOG.md` - Deprecated in favor of GitHub releases
- `MANIFEST.in` - No longer needed with modern packaging
- `TROUBLESHOOTING.md` - Content moved to README
- `requirements.txt` - Dependencies now in pyproject.toml
- Legacy profile system files
- Legacy utils directory
Installation
pip install docpull
Or install from source:
pip install git+https://github.com/raintree-technology/docpull.git
v1.3.0: Rich Metadata Extraction & Simplified Profiles
Highlights
docpull v1.3.0 adds rich structured metadata extraction for enhanced AI/RAG integration and simplifies the profile system by focusing on the excellent generic fetcher.
New Features
Rich Metadata Extraction
- Structured Metadata: Extract Open Graph, JSON-LD, and microdata during fetch
- Enhanced Frontmatter: Adds author, description, keywords, images, publish dates, and more
- AI/RAG Ready: Richer context for embeddings and retrieval systems
- Opt-in Feature: Enabled with the `--rich-metadata` flag or `rich_metadata: true` in config
- Powered by extruct: Uses the battle-tested extruct library for extraction
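For a sense of what structured metadata extraction involves, the sketch below pulls Open Graph tags out of a page using only the standard library. docpull itself uses extruct for this; the `OpenGraphParser` class and `extract_open_graph` function are hypothetical names for this illustration, and JSON-LD and microdata are not covered.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:*" content="..."> pairs from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Store "og:title" under the key "title", and so on.
            self.metadata[prop[3:]] = content


def extract_open_graph(html: str) -> dict[str, str]:
    """Return the Open Graph properties found in the HTML, without the og: prefix."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.metadata
```

The resulting dictionary maps naturally onto frontmatter fields such as `title`, `image`, and `site_name`.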
Simplified Profile System
- Streamlined Architecture: Removed 7 built-in profiles (React, Next.js, D3, Plaid, Tailwind, Bun, Turborepo)
- Kept Stripe: Retained as reference implementation for custom profiles
- Generic Fetcher Excellence: Works excellently for all documentation sites
- Reduced Complexity: Less maintenance burden, simpler codebase
- Easy Customization: Users can create custom profiles as needed
Technical Details
New Dependencies
- Added `extruct>=0.15.0` for structured metadata extraction
New Files
- `docpull/metadata_extractor.py` - Rich metadata extraction module
- `tests/test_metadata_extractor.py` - Comprehensive test suite (13 tests)
Updated Files
- `docpull/fetchers/base.py` - Integrated rich metadata extraction
- `docpull/fetchers/generic_async.py` - Added `use_rich_metadata` parameter
- `docpull/config.py` - Added `rich_metadata` configuration option
- `docpull/sources_config.py` - Added `rich_metadata` field
- `docpull/cli.py` - Added `--rich-metadata` CLI flag
- `docpull/profiles/__init__.py` - Simplified to single Stripe profile
Removed Files
- 7 profile files (react.py, nextjs.py, d3.py, plaid.py, tailwind.py, bun.py, turborepo.py)
- 7 fetcher implementation files (same names)
Version & Testing
- Bumped version from `1.2.1` to `1.3.0`
- All 107 tests passing ✅
- Zero mypy type errors ✅
- All lint checks passing ✅
Example Usage
Rich Metadata Extraction
# Extract rich metadata during fetch
docpull https://docs.anthropic.com --rich-metadata
# Combine with other features
docpull https://stripe.com/docs --rich-metadata --create-index --language en
# Multi-source configuration
docpull --sources-file config.yaml
Enhanced Frontmatter Output
---
url: https://docs.example.com/guide
fetched: 2025-11-20
title: Getting Started Guide
description: Learn the basics of our platform
author: John Doe
keywords: [tutorial, guide, api]
image: https://docs.example.com/og-image.png
type: article
site_name: Example Docs
published_time: 2024-01-15T10:00:00Z
modified_time: 2024-01-20T15:30:00Z
---
Multi-Source Configuration with Rich Metadata
sources:
  anthropic:
    url: https://docs.anthropic.com
    rich_metadata: true  # Enable rich metadata extraction
    language: en
    create_index: true
  stripe:
    url: https://stripe.com/docs
    rich_metadata: true
    max_file_size: 200kb
Backward Compatibility
All existing workflows continue to work unchanged. Rich metadata extraction is opt-in, and the generic fetcher handles all documentation sites that previously used specific profiles.
Installation
pip install --upgrade docpull
Links
Stats: 30 files changed, +765/-867 lines
v1.2.1 - Critical Bug Fixes & Type Checking
🐛 Bug Fixes
This patch release fixes critical issues found in v1.2.0:
Type Checking & Code Quality
- Fixed all 60 mypy type errors - achieved zero type errors ✅
- Added proper type annotations throughout the codebase
- Improved type safety in processors, formatters, and orchestrator modules
- All lint checks now passing (mypy, ruff, black)
Test Fixes
- Fixed test failure in `test_orchestrator.py` (archive_format parameter)
- Fixed 9 SourcesConfiguration test failures
- All 101 tests now passing ✅
Code Cleanup
- Removed deprecated files (EMOJI_CLEANUP.md)
- Fixed Black formatting issues
- Added specific error codes to type: ignore comments
📝 Technical Details
Files Updated
- `docpull/processors/content_filter.py`: More specific return types
- `docpull/formatters/`: Proper type annotations for nested functions
- `docpull/orchestrator.py`: Correct parameter naming and type hints
- `docpull/cli.py`: Better handling of Optional[str] types
- `docpull/processors/language_filter.py`: Fixed config type assignments
- `docpull/processors/deduplicator.py`: Fixed config type assignments
CI/CD
This release ensures the codebase passes all CI checks and maintains high code quality standards.
📦 Installation
pip install --upgrade docpull
🔗 Links
v1.2.0: 15 Major Features - 58% Size Reduction
Highlights
docpull v1.2.0 delivers 15 major features that dramatically improve documentation fetching efficiency. Real-world testing shows 58% size reduction (31 MB → 13 MB) when processing 1,914 documentation files.
New Features
Phase 1: Core Optimization
- Language Filtering: Auto-detect and filter by programming language
- Deduplication: SHA-256 based duplicate detection with flexible keep strategies
- Auto-Index Generation: Tree view, TOC, category-based, and statistics indexes
- Size Limits: Enforce per-file and total size constraints
- Multi-Source Configuration: YAML-based configuration for multiple documentation sources
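The per-file and total size constraints can be pictured with the sketch below, which parses values like `200kb` or `20MB` and applies both caps. The helper names are hypothetical, and treating units as powers of 1024 is an assumption; the release notes do not say whether docpull uses binary or decimal units.

```python
import re

# Assumption: binary multiples. docpull may interpret these units differently.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def parse_size(text: str) -> int:
    """Convert strings such as '200kb' or '20MB' to a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)?\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    unit = (unit or "b").lower()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in size: {text!r}")
    return int(float(number) * _UNITS[unit])


def within_limits(file_sizes, max_file, max_total):
    """Keep files under the per-file cap that still fit in the total budget."""
    kept, total = [], 0
    for name, size in file_sizes:
        if size > max_file or total + size > max_total:
            continue  # too large on its own, or would exceed the total
        kept.append(name)
        total += size
    return kept
```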
Phase 2: Advanced Processing
- Selective Crawling: Include/exclude patterns for precise control
- Content Filtering: Remove unwanted sections from documentation
- Format Conversion: Output in Markdown, TOON, JSON, or SQLite
- Smart Naming: 4 naming strategies (full, short, flat, hierarchical)
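Include/exclude filtering for selective crawling might look like the following sketch, which uses shell-style wildcards from the standard library. The function name `should_crawl`, the pattern syntax, and the rule that excludes take precedence are assumptions for illustration rather than docpull's documented behavior.

```python
from fnmatch import fnmatchcase


def should_crawl(url: str, include: list[str], exclude: list[str]) -> bool:
    """Apply exclude patterns first, then require an include match if any are given."""
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    if not include:
        return True  # no include list means everything not excluded is allowed
    return any(fnmatchcase(url, pattern) for pattern in include)
```

With `include=["*/api/*"]`, only URLs containing an `/api/` path segment would be fetched.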
Phase 3: Efficiency
- Metadata Extraction: Automatic metadata collection and JSON storage
- Update Detection: Skip unchanged files based on checksums
- Incremental Mode: Update only changed documentation
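Checksum-based update detection can be sketched as a comparison against a stored manifest of digests. The `changed_pages` function and the in-memory `manifest` dictionary are hypothetical; how docpull persists checksums is not described in these notes, and SHA-256 is assumed here only because the deduplication feature uses it.

```python
import hashlib


def checksum(content: str) -> str:
    """Return the SHA-256 hex digest of a page's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def changed_pages(pages: dict[str, str], manifest: dict[str, str]) -> list[str]:
    """Return URLs whose content differs from the manifest, updating it in place."""
    changed = []
    for url, content in pages.items():
        digest = checksum(content)
        if manifest.get(url) != digest:
            changed.append(url)
            manifest[url] = digest  # remember the new version for the next run
    return changed
```

On a second run with identical content the function returns an empty list, so nothing needs to be rewritten.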
Phase 4: Integration
- Hooks/Plugins: Decorator-based plugin system for custom processing
- Git Integration: Automatic commits with templated messages
- Archive Mode: Create compressed archives (tar.gz, tar.bz2, tar.xz, zip)
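A decorator-based hook system generally follows the registry pattern sketched below. The names `hook`, `run_hooks`, and the `post_convert` event are invented for this example and should not be read as docpull's plugin API.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical registry mapping an event name to its registered callbacks.
_HOOKS: dict[str, list[Callable[[str], str]]] = defaultdict(list)


def hook(event: str):
    """Register the decorated function to run for the named event."""
    def decorator(func):
        _HOOKS[event].append(func)
        return func
    return decorator


def run_hooks(event: str, content: str) -> str:
    """Pass content through every hook registered for the event, in order."""
    for func in _HOOKS[event]:
        content = func(content)
    return content


@hook("post_convert")
def strip_trailing_whitespace(content: str) -> str:
    return "\n".join(line.rstrip() for line in content.splitlines())
```

Returning the original function from the decorator keeps it callable on its own, which makes hooks easy to unit test.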
Technical Details
- 20 new modules (3,886 lines of code)
- Full backward compatibility with v1.1.0
- All features integrated into CLI
- 145+ unit tests
- Zero syntax errors, zero linting issues
Installation
pip install --upgrade docpull
Quick Example
# Fetch Python docs with optimization
docpull https://docs.python.org/3/ ./python-docs \
--language python \
--deduplicate \
--create-index \
--max-total-size 20MB
# Multi-source with YAML config
docpull --sources-file sources.yaml
See the CHANGELOG for complete details.
v1.1.0 - Diagnostic Tools and Improved Error Handling
What's New in v1.1.0
Added
- `--doctor` command for diagnosing installation and dependency issues
  - Checks all core dependencies (requests, beautifulsoup4, html2text, defusedxml, aiohttp, rich)
  - Checks optional dependencies (PyYAML, Playwright) with installation suggestions
  - Tests network connectivity
  - Verifies output directory write permissions
  - Works even when dependencies are missing
- `requirements.txt` file for transparent dependency listing
- Comprehensive TROUBLESHOOTING.md documentation with:
  - Installation troubleshooting (missing dependencies, pipx issues)
  - Runtime issue solutions (YAML config errors, JavaScript rendering)
  - Diagnostic tools usage guide
  - Common error messages reference table
  - Quick reference commands
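A dependency check that works even when packages are missing can be built on `importlib.util.find_spec`, which locates a module without importing it. The sketch below is an illustration under that assumption, not the `--doctor` implementation; the function names are hypothetical. Import names differ from package names in two cases: beautifulsoup4 is imported as `bs4` and PyYAML as `yaml`.

```python
from importlib.util import find_spec

# Import names for the dependencies listed in the release notes.
CORE = ["requests", "bs4", "html2text", "defusedxml", "aiohttp", "rich"]
OPTIONAL = ["yaml", "playwright"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the import names that cannot be located, without importing them."""
    return [name for name in names if find_spec(name) is None]


def report() -> str:
    """Summarize missing core and optional dependencies as plain text."""
    lines = []
    for name in missing_modules(CORE):
        lines.append(f"missing core dependency: {name}")
    for name in missing_modules(OPTIONAL):
        lines.append(f"missing optional dependency: {name}")
    return "\n".join(lines) or "all dependencies found"
```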
Changed
- Improved error handling for missing dependencies
  - Early dependency checking at CLI entry point
  - Clear, actionable error messages with installation instructions
  - Specific recommendations for pipx, pip, and development installations
- Enhanced YAML configuration error handling
  - Auto-fallback to JSON when PyYAML is not installed
  - Clear error messages for YAML-related import errors
  - Helpful suggestions for installing optional dependencies
- Updated README.md with:
  - `--doctor` command in Quick Start section
  - Reference to TROUBLESHOOTING.md
  - Better troubleshooting guidance
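The fallback to JSON when PyYAML is not installed could be approximated as in the sketch below. The `load_config` function and its error message are assumptions for illustration; only `yaml.safe_load` and `json.loads` are real library calls.

```python
import json

try:
    import yaml  # PyYAML is optional
except ImportError:
    yaml = None


def load_config(text: str) -> dict:
    """Parse config text as YAML when PyYAML is available, otherwise as JSON."""
    if yaml is not None:
        return yaml.safe_load(text)
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Without PyYAML only JSON can be read, so point the user at the fix.
        raise RuntimeError(
            "Could not parse the config as JSON. "
            "Install PyYAML to use YAML files: pip install pyyaml"
        ) from exc
```

Chaining the original exception with `from exc` keeps the parse position available for debugging while the top-level message stays actionable.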
Fixed
- Improved user experience when dependencies are missing (no more confusing tracebacks)
- Better handling of optional dependency errors (PyYAML, Playwright)
Installation
```bash
pip install docpull
docpull --doctor # Verify installation
```
Full Changelog
https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md#110---2025-11-14