Releases: raintree-technology/docpull

v2.2.0: Resume, Auth, JSON/SQLite output

15 Dec 21:00

New Features

  • Resume capability (--resume): Continue interrupted fetches
  • URL preview mode (--preview-urls): See discovered URLs before fetching
  • Authentication support: --auth-bearer, --auth-basic, --auth-cookie, --auth-header
  • Env var expansion for auth tokens ($VAR and ${VAR} syntax)
  • Adaptive rate limiting (--adaptive-rate-limit): Auto-adjust based on 429 responses
  • JSON output (--format json): Stream documents to a single JSON file
  • SQLite output (--format sqlite): Save to SQLite database
  • Skip reason tracking: Better progress feedback
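
The `$VAR` and `${VAR}` expansion for auth tokens can be pictured with a small standard-library sketch. This is not docpull's implementation: the `expand_env` helper, its regular expression, and the choice to raise on an unset variable are all assumptions made for illustration.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} (group 1) or $NAME (group 2).
_VAR_PATTERN = re.compile(r"\$(?:\{(\w+)\}|(\w+))")


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace $VAR and ${VAR} references with values from env (hypothetical helper)."""
    env = os.environ if env is None else env

    def _substitute(match: "re.Match[str]") -> str:
        name = match.group(1) or match.group(2)
        if name not in env:
            # Assumption: fail loudly rather than send a literal "$TOKEN" as a credential.
            raise KeyError(f"environment variable {name!r} is not set")
        return env[name]

    return _VAR_PATTERN.sub(_substitute, value)
```

Passing an explicit mapping instead of `os.environ` keeps the helper easy to test; a value such as `"$DOCS_TOKEN"` would then resolve before being handed to a flag like `--auth-bearer`.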

Breaking Changes

  • Requires Python 3.10+ (dropped 3.9 support)

Install

pip install docpull --upgrade

v2.0.0 - Complete Architecture Rewrite

29 Nov 23:26

Breaking Changes

  • New Python API: Fetcher class with async context manager and streaming events
  • src/ layout: PEP 517/518 compliant package structure
  • Pydantic models: Configuration via DocpullConfig instead of dictionaries
  • Removed v1.x modules: All deprecated code removed

New Features

  • Streaming Event API: AsyncIterator[FetchEvent] for real-time progress
  • Pipeline Architecture: Composable steps (Validate, Fetch, Convert, Dedup, Save)
  • CacheManager: O(1) lookups with batched writes and TTL eviction
  • StreamingDeduplicator: Real-time content deduplication via SHA-256
  • JavaScript Rendering: Browser-based fetching via Playwright
  • Profile Presets: RAG, MIRROR, QUICK for common use cases
  • Rate Limiting: Per-host concurrent request limits
  • Security: robots.txt respect and URL validation
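
To make "O(1) lookups with TTL eviction" concrete, here is a minimal dictionary-backed sketch. It is not the actual `CacheManager`: the `TTLCache` class, its method names, and the lazy eviction on read are assumptions, and batched writes are left out.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical cache: average O(1) lookups, entries expire after `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Store the expiry time alongside the value.
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazy eviction: drop the stale entry on read
            return None
        return value
```

Evicting on read keeps the sketch short; a real cache would probably also sweep expired entries periodically so that unread keys do not accumulate.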

Quick Start

CLI

```bash
docpull https://docs.example.com --profile rag
```

Python API

```python
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main():
    config = DocpullConfig(url="https://docs.example.com", profile=ProfileName.RAG)
    # The Fetcher is an async context manager that streams events as it runs.
    async with Fetcher(config) as f:
        async for event in f.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}")


asyncio.run(main())
```

Full Changelog

See CHANGELOG.md

v1.5.0

29 Nov 03:55

Release v1.5.0: Major Simplification and Modernization

Breaking Changes

  • Removed legacy profile system (stripe-specific profiles)
  • Removed deprecated requirements.txt (use pyproject.toml instead)

Changes

  • Simplified architecture: Consolidated utils into main package
  • Reorganized documentation: Moved CONTRIBUTING.md and SECURITY.md to .github/
  • Added GitHub issue templates configuration
  • Cleaner fetcher architecture: Removed stripe-specific fetcher
  • Updated tests for new structure

Removed Files

  • CHANGELOG.md - Deprecated in favor of GitHub releases
  • MANIFEST.in - No longer needed with modern packaging
  • TROUBLESHOOTING.md - Content moved to README
  • requirements.txt - Dependencies now in pyproject.toml
  • Legacy profile system files
  • Legacy utils directory

Installation

pip install docpull

Or install from source:

pip install git+https://github.com/raintree-technology/docpull.git

v1.3.0: Rich Metadata Extraction & Simplified Profiles

20 Nov 19:30

Highlights

docpull v1.3.0 adds structured metadata extraction for AI/RAG integration and simplifies the profile system by relying on the generic fetcher instead of site-specific profiles.

New Features

Rich Metadata Extraction

  • Structured Metadata: Extract Open Graph, JSON-LD, and microdata during fetch
  • Enhanced Frontmatter: Adds author, description, keywords, images, publish dates, and more
  • AI/RAG Ready: Richer context for embeddings and retrieval systems
  • Opt-in Feature: Enabled with --rich-metadata flag or rich_metadata: true in config
  • Powered by extruct: Uses the battle-tested extruct library for extraction
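
As a rough picture of what Open Graph extraction involves, the sketch below collects `og:*` meta tags with the standard library's `html.parser`. It deliberately does not use extruct, which the release relies on, and it ignores JSON-LD and microdata; `OpenGraphParser` and `extract_open_graph` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Dict


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.properties: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property, without the "og:" prefix.
            self.properties.setdefault(prop[3:], content)


def extract_open_graph(html: str) -> Dict[str, str]:
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties
```

Keys such as `title`, `type`, or `site_name` obtained this way correspond to the kind of fields that appear in the enhanced frontmatter.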

Simplified Profile System

  • Streamlined Architecture: Removed 7 built-in profiles (React, Next.js, D3, Plaid, Tailwind, Bun, Turborepo)
  • Kept Stripe: Retained as reference implementation for custom profiles
  • Generic Fetcher: Handles the documentation sites that previously used the removed profiles
  • Reduced Complexity: Less maintenance burden, simpler codebase
  • Easy Customization: Users can create custom profiles as needed

Technical Details

New Dependencies

  • Added extruct>=0.15.0 for structured metadata extraction

New Files

  • docpull/metadata_extractor.py - Rich metadata extraction module
  • tests/test_metadata_extractor.py - Comprehensive test suite (13 tests)

Updated Files

  • docpull/fetchers/base.py - Integrated rich metadata extraction
  • docpull/fetchers/generic_async.py - Added use_rich_metadata parameter
  • docpull/config.py - Added rich_metadata configuration option
  • docpull/sources_config.py - Added rich_metadata field
  • docpull/cli.py - Added --rich-metadata CLI flag
  • docpull/profiles/__init__.py - Simplified to single Stripe profile

Removed Files

  • 7 profile files (react.py, nextjs.py, d3.py, plaid.py, tailwind.py, bun.py, turborepo.py)
  • 7 fetcher implementation files (same names)

Version & Testing

  • Bumped version from 1.2.1 to 1.3.0
  • All 107 tests passing ✅
  • Zero mypy type errors ✅
  • All lint checks passing ✅

Example Usage

Rich Metadata Extraction

# Extract rich metadata during fetch
docpull https://docs.anthropic.com --rich-metadata

# Combine with other features
docpull https://stripe.com/docs --rich-metadata --create-index --language en

# Multi-source configuration
docpull --sources-file config.yaml

Enhanced Frontmatter Output

---
url: https://docs.example.com/guide
fetched: 2025-11-20
title: Getting Started Guide
description: Learn the basics of our platform
author: John Doe
keywords: [tutorial, guide, api]
image: https://docs.example.com/og-image.png
type: article
site_name: Example Docs
published_time: 2024-01-15T10:00:00Z
modified_time: 2024-01-20T15:30:00Z
---

Multi-Source Configuration with Rich Metadata

sources:
  anthropic:
    url: https://docs.anthropic.com
    rich_metadata: true  # Enable rich metadata extraction
    language: en
    create_index: true

  stripe:
    url: https://stripe.com/docs
    rich_metadata: true
    max_file_size: 200kb

Backward Compatibility

All existing workflows continue to work unchanged. Rich metadata extraction is opt-in, and the generic fetcher handles all documentation sites that previously used specific profiles.

Installation

pip install --upgrade docpull

Stats: 30 files changed, +765/-867 lines

v1.2.1 - Critical Bug Fixes & Type Checking

17 Nov 01:19

🐛 Bug Fixes

This patch release fixes critical issues found in v1.2.0:

Type Checking & Code Quality

  • Fixed all 60 mypy type errors - achieved zero type errors ✅
  • Added proper type annotations throughout the codebase
  • Improved type safety in processors, formatters, and orchestrator modules
  • All lint checks now passing (mypy, ruff, black)

Test Fixes

  • Fixed test failure in test_orchestrator.py (archive_format parameter)
  • Fixed 9 SourcesConfiguration test failures
  • All 101 tests now passing ✅

Code Cleanup

  • Removed deprecated files (EMOJI_CLEANUP.md)
  • Fixed Black formatting issues
  • Added specific error codes to type: ignore comments

📝 Technical Details

Files Updated

  • docpull/processors/content_filter.py: More specific return types
  • docpull/formatters/: Proper type annotations for nested functions
  • docpull/orchestrator.py: Correct parameter naming and type hints
  • docpull/cli.py: Better handling of Optional[str] types
  • docpull/processors/language_filter.py: Fixed config type assignments
  • docpull/processors/deduplicator.py: Fixed config type assignments

CI/CD

With these fixes, the codebase passes all CI checks (mypy, ruff, black, and the test suite).

📦 Installation

pip install --upgrade docpull

v1.2.0: 15 Major Features - 58% Size Reduction

16 Nov 22:12

Highlights

docpull v1.2.0 delivers 15 major features aimed at making documentation fetching more efficient. In real-world testing on 1,914 documentation files, output size dropped by 58% (31 MB → 13 MB).

New Features

Phase 1: Core Optimization

  • Language Filtering: Auto-detect and filter by programming language
  • Deduplication: SHA-256 based duplicate detection with flexible keep strategies
  • Auto-Index Generation: Tree view, TOC, category-based, and statistics indexes
  • Size Limits: Enforce per-file and total size constraints
  • Multi-Source Configuration: YAML-based configuration for multiple documentation sources
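
SHA-256 based deduplication with a keep strategy can be sketched as follows. The `deduplicate` function and the strategy names `"first"` and `"last"` are assumptions for illustration; the release notes do not list the strategies docpull actually offers.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def deduplicate(docs: Iterable[Tuple[str, str]], keep: str = "first") -> List[str]:
    """Return one path per distinct content hash (hypothetical helper).

    docs yields (path, content) pairs; keep selects which duplicate survives.
    """
    if keep not in ("first", "last"):
        raise ValueError(f"unknown keep strategy: {keep!r}")
    kept: Dict[str, str] = {}
    for path, content in docs:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        # "first": only record a digest once; "last": overwrite with later paths.
        if keep == "last" or digest not in kept:
            kept[digest] = path
    return list(kept.values())
```

Hashing the full content means only byte-identical documents are treated as duplicates; near-duplicates would need normalization before hashing.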

Phase 2: Advanced Processing

  • Selective Crawling: Include/exclude patterns for precise control
  • Content Filtering: Remove unwanted sections from documentation
  • Format Conversion: Output in Markdown, TOON, JSON, or SQLite
  • Smart Naming: 4 naming strategies (full, short, flat, hierarchical)

Phase 3: Efficiency

  • Metadata Extraction: Automatic metadata collection and JSON storage
  • Update Detection: Skip unchanged files based on checksums
  • Incremental Mode: Update only changed documentation
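
Checksum-based update detection amounts to comparing a stored digest with the digest of freshly fetched content. The sketch below assumes SHA-256 and an in-memory manifest keyed by URL; both are illustrative choices, and `has_changed` is a hypothetical name.

```python
import hashlib
from typing import Dict


def has_changed(manifest: Dict[str, str], url: str, content: bytes) -> bool:
    """Return True (and record the new checksum) if content differs from the last run."""
    digest = hashlib.sha256(content).hexdigest()
    if manifest.get(url) == digest:
        return False  # unchanged: the caller can skip rewriting this file
    manifest[url] = digest
    return True
```

In an incremental run, the manifest would be loaded from disk before fetching and saved afterwards, so that only pages returning `True` are written again.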

Phase 4: Integration

  • Hooks/Plugins: Decorator-based plugin system for custom processing
  • Git Integration: Automatic commits with templated messages
  • Archive Mode: Create compressed archives (tar.gz, tar.bz2, tar.xz, zip)
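
A decorator-based plugin system like the one listed under Hooks/Plugins is often built around a registry of callables. The following is a generic sketch of that pattern, not docpull's API: `hook`, `run_hooks`, and the `"post_convert"` event name are invented for the example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# Registry mapping an event name to the functions registered for it.
_hooks: DefaultDict[str, List[Callable[[str], str]]] = defaultdict(list)


def hook(event: str):
    """Decorator that registers a function for the given event (hypothetical)."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        _hooks[event].append(func)
        return func  # return the function unchanged so it stays directly callable
    return register


def run_hooks(event: str, content: str) -> str:
    """Pass content through every hook registered for event, in registration order."""
    for func in _hooks[event]:
        content = func(content)
    return content


@hook("post_convert")
def strip_trailing_whitespace(content: str) -> str:
    return "\n".join(line.rstrip() for line in content.splitlines())
```

Each hook receives the output of the previous one, so registration order determines the processing order.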

Technical Details

  • 20 new modules (3,886 lines of code)
  • Full backward compatibility with v1.1.0
  • All features integrated into CLI
  • 145+ unit tests
  • Zero syntax errors, zero linting issues

Installation

pip install --upgrade docpull

Quick Example

# Fetch Python docs with optimization
docpull https://docs.python.org/3/ ./python-docs \
  --language python \
  --deduplicate \
  --create-index \
  --max-total-size 20MB

# Multi-source with YAML config
docpull --sources-file sources.yaml

See the CHANGELOG for complete details.

v1.1.0 - Diagnostic Tools and Improved Error Handling

14 Nov 23:47

What's New in v1.1.0

Added

  • --doctor command for diagnosing installation and dependency issues
    • Checks all core dependencies (requests, beautifulsoup4, html2text, defusedxml, aiohttp, rich)
    • Checks optional dependencies (PyYAML, Playwright) with installation suggestions
    • Tests network connectivity
    • Verifies output directory write permissions
    • Works even when dependencies are missing
  • requirements.txt file for transparent dependency listing
  • Comprehensive TROUBLESHOOTING.md documentation with:
    • Installation troubleshooting (missing dependencies, pipx issues)
    • Runtime issue solutions (YAML config errors, JavaScript rendering)
    • Diagnostic tools usage guide
    • Common error messages reference table
    • Quick reference commands
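
A dependency check that "works even when dependencies are missing" has to avoid importing the packages it inspects. One way to do that with the standard library is `importlib.util.find_spec`, sketched below. This is an assumption about the approach, not the `--doctor` implementation, and `check_dependencies` is a hypothetical name.

```python
from importlib.util import find_spec
from typing import Dict, Iterable


def check_dependencies(modules: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be located, without importing it."""
    return {name: find_spec(name) is not None for name in modules}


# Import names can differ from package names: beautifulsoup4 installs "bs4",
# and PyYAML installs "yaml".
CORE_MODULES = ["requests", "bs4", "html2text", "defusedxml", "aiohttp", "rich"]
OPTIONAL_MODULES = ["yaml", "playwright"]
```

Calling `check_dependencies(CORE_MODULES)` yields a mapping that a diagnostic command could turn into per-package installation suggestions.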

Changed

  • Improved error handling for missing dependencies
    • Early dependency checking at CLI entry point
    • Clear, actionable error messages with installation instructions
    • Specific recommendations for pipx, pip, and development installations
  • Enhanced YAML configuration error handling
    • Auto-fallback to JSON when PyYAML is not installed
    • Clear error messages for YAML-related import errors
    • Helpful suggestions for installing optional dependencies
  • Updated README.md with:
    • --doctor command in Quick Start section
    • Reference to TROUBLESHOOTING.md
    • Better troubleshooting guidance
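
The JSON fallback for configuration files can be illustrated with an optional import. This is a sketch of the general pattern under the assumption that the config is read as text; `load_config_text` and its error message are hypothetical, not taken from docpull.

```python
import json

try:
    import yaml  # provided by the optional PyYAML package
except ImportError:
    yaml = None


def load_config_text(text: str) -> dict:
    """Parse config text as YAML when PyYAML is available, otherwise as JSON."""
    if yaml is not None:
        return yaml.safe_load(text)
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Without PyYAML, only JSON can be read; point the user at the fix.
        raise RuntimeError(
            "Config is not valid JSON and PyYAML is not installed; "
            "try: pip install pyyaml"
        ) from exc
```

Because the import failure is caught once at module load, the rest of the program can run without PyYAML and only reports a problem when a non-JSON config is actually encountered.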

Fixed

  • Improved user experience when dependencies are missing (no more confusing tracebacks)
  • Better handling of optional dependency errors (PyYAML, Playwright)

Installation

```bash
pip install docpull
docpull --doctor # Verify installation
```

Full Changelog

https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md#110---2025-11-14