Update documentation with PDF outline feature details

dashed · dashed · commit f6cf0f319265 · 2025-08-03T21:47:00.000-04:00
- Add comprehensive PDF outline tool documentation to CHANGELOG.md
- Update README.md with get_pdf_outline in multiple sections:
  - Add to PDF information tools list
  - Include natural language usage examples
  - Add CLI command examples with all parameters
  - Create detailed tool documentation section with examples
- Document all outline features: hierarchical structure, simple/detailed modes, max_depth, fuzzy_filter
- Add docs-updater agent configuration for future documentation updates
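
For reference, the simple-mode outline documented in this commit is a flat list of `[level, title, page, page_label]` rows. Below is a minimal sketch of how a client might render those rows as an indented tree. `render_outline` is a hypothetical helper for illustration only, not part of `mcp_fuzzy_search.py`, and the sample rows are copied from the README example in this diff.

```python
def render_outline(rows, indent="  "):
    """Render [level, title, page, page_label] rows as indented text lines."""
    lines = []
    for row in rows:
        # Detailed mode appends a 5th element (link info); ignore it here.
        level, title, page, label = row[:4]
        # Levels are 1-based, so top-level entries get no indentation.
        lines.append(f"{indent * (level - 1)}{title} (p. {page}, label {label!r})")
    return "\n".join(lines)


# Sample rows taken from the README example (simple mode).
sample = [
    [1, "Chapter 2: Methods", 15, "11"],
    [2, "2.1 Data Collection", 16, "12"],
    [3, "2.1.1 Sources", 17, "13"],
]

print(render_outline(sample))
```

Because only the first four elements of each row are read, the same helper should also accept detailed-mode rows, assuming they keep the documented element order.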
diff --git a/.claude/agents/docs-updater.md b/.claude/agents/docs-updater.md
@@ -0,0 +1,48 @@
+---
+name: docs-updater
+description: Use this agent when you need to update project documentation files, specifically CHANGELOG.md and README.md, to reflect recent code changes, new features, or implementation updates. This agent should be used after significant code changes or feature additions to ensure documentation stays synchronized with the codebase.\n\nExamples:\n- <example>\n  Context: The user has just implemented a new feature or made significant changes to the codebase.\n  user: "I've finished implementing the new authentication system"\n  assistant: "Great! Now let me use the docs-updater agent to update the CHANGELOG.md and README.md to reflect these changes"\n  <commentary>\n  Since new features have been implemented, use the docs-updater agent to ensure documentation is updated accordingly.\n  </commentary>\n</example>\n- <example>\n  Context: The user explicitly asks for documentation updates.\n  user: "Update CHANGELOG.md and README.md to reflect the new API endpoints"\n  assistant: "I'll use the docs-updater agent to update both documentation files with the new API endpoint information"\n  <commentary>\n  The user is explicitly requesting documentation updates, so use the docs-updater agent.\n  </commentary>\n</example>
+tools: Task, Bash, Glob, Grep, LS, ExitPlanMode, Read, Edit, MultiEdit, Write, NotebookRead, NotebookEdit, WebFetch, TodoWrite, WebSearch, mcp__file-search__search_files, mcp__file-search__filter_files, ListMcpResourcesTool, ReadMcpResourceTool, mcp__sequential_thinking__sequentialthinking, mcp__playwright__browser_close, mcp__playwright__browser_resize, mcp__playwright__browser_console_messages, mcp__playwright__browser_handle_dialog, mcp__playwright__browser_evaluate, mcp__playwright__browser_file_upload, mcp__playwright__browser_install, mcp__playwright__browser_press_key, mcp__playwright__browser_type, mcp__playwright__browser_navigate, mcp__playwright__browser_navigate_back, mcp__playwright__browser_navigate_forward, mcp__playwright__browser_network_requests, mcp__playwright__browser_take_screenshot, mcp__playwright__browser_snapshot, mcp__playwright__browser_click, mcp__playwright__browser_drag, mcp__playwright__browser_hover, mcp__playwright__browser_select_option, mcp__playwright__browser_tab_list, mcp__playwright__browser_tab_new, mcp__playwright__browser_tab_select, mcp__playwright__browser_tab_close, mcp__playwright__browser_wait_for, mcp__sqlite__query, mcp__sqlite__execute, mcp__sqlite__list_tables, mcp__sqlite__describe_table, mcp__sqlite__create_table, mcp__fuzzy-search__extract_pdf_pages, mcp__fuzzy-search__get_pdf_page_labels, mcp__fuzzy-search__get_pdf_page_count, mcp__fuzzy-search__get_pdf_outline, mcp__fuzzy-search__fuzzy_search_files, mcp__fuzzy-search__fuzzy_search_content, mcp__fuzzy-search__fuzzy_search_documents
+model: sonnet
+---
+
+You are a meticulous documentation specialist focused on maintaining accurate and up-to-date project documentation. Your primary responsibility is updating CHANGELOG.md and README.md files to reflect the current state of the codebase.
+
+When updating documentation:
+
+1. **Analyze Recent Changes**: Examine the codebase to identify what has changed, been added, or removed. Focus on:
+   - New features or functionality
+   - Breaking changes
+   - Bug fixes
+   - Performance improvements
+   - Dependency updates
+   - API changes
+
+2. **Update CHANGELOG.md**:
+   - Follow the Keep a Changelog format (if already in use) or maintain consistency with the existing format
+   - Add entries under the appropriate version section or create a new version section if needed
+   - Use clear, concise descriptions that explain what changed and why it matters to users
+   - Include dates for releases
+   - Categorize changes appropriately (Added, Changed, Deprecated, Removed, Fixed, Security)
+
+3. **Update README.md**:
+   - Ensure all features are accurately documented
+   - Update installation instructions if dependencies or setup process changed
+   - Revise usage examples to reflect current API or interface
+   - Update configuration options if any were added or modified
+   - Ensure all code examples are working with the current implementation
+   - Update any outdated links or references
+
+4. **Quality Checks**:
+   - Verify all technical details are accurate
+   - Ensure consistency in formatting and style with the rest of the documentation
+   - Check that version numbers are correct and consistent
+   - Confirm that all new features mentioned in CHANGELOG are properly documented in README
+
+5. **Best Practices**:
+   - Write from the user's perspective - focus on impact rather than implementation details
+   - Be concise but comprehensive
+   - Use clear, simple language
+   - Include examples where they add clarity
+   - Maintain chronological order in CHANGELOG (newest first)
+
+You should ONLY edit the existing CHANGELOG.md and README.md files. Do not create new documentation files, including these two if they do not already exist in the project. Focus exclusively on updating these two files to accurately reflect the current state of the implementation.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -30,6 +30,13 @@
 - **New PDF Information Tools**:
   - `get_pdf_page_labels`: Returns mapping of all page indices to their labels
   - `get_pdf_page_count`: Returns total number of pages in a PDF
+  - `get_pdf_outline`: Extracts table of contents/bookmarks from PDFs
+    - Returns hierarchical outline structure with levels, titles, page numbers, and page labels
+    - Supports `simple` mode (default) for basic info, or detailed mode (`simple: false`) with link information
+    - Optional `max_depth` parameter to limit traversal depth for deep hierarchies
+    - Optional `fuzzy_filter` parameter to search outline entries by title using fzf
+    - Handles PDFs without outlines gracefully by returning empty structure
+    - Available in both CLI (`pdf-outline` command) and MCP tool interface
 - **Type Checking**: Added support for `ty` type checker
   - Updated Makefile to use `ty check --exclude git-repos`
   - Added `ty>=0.0.1a16` as dev dependency
@@ -56,6 +63,7 @@
   - Added `TYPE_CHECKING` imports and type annotations for conditional imports
   - Fixed `mcp.context` attribute issues with type: ignore comments
 - **Test Assertion**: Added missing assertion for `proc.stdin` in CLI tests
+- **Test Tool Count**: Updated `test_list_tools` to expect 7 tools with the addition of `get_pdf_outline`
 
 ### Dependencies
 - Replaced `pdfminer.six>=20221105` with `PyMuPDF>=1.23.0` for PDF operations
diff --git a/README.md b/README.md
@@ -23,9 +23,10 @@ Advanced search with both file name and content capabilities using `ripgrep` and
   - `fuzzy_search_content`: Search file CONTENTS with path+content matching by default
 - **PDF and document search** (optional) - search through PDFs, Office docs, and archives using `ripgrep-all`
 - **PDF page extraction** (optional) - extract specific pages from PDFs using PyMuPDF with page label support
-- **PDF information tools** (optional) - get page labels and page count from PDF files:
+- **PDF information tools** (optional) - get page labels, page count, and table of contents from PDF files:
   - `get_pdf_page_labels`: Get all page labels from a PDF file
   - `get_pdf_page_count`: Get the total number of pages in a PDF file
+  - `get_pdf_outline`: Extract table of contents/bookmarks from a PDF file
 - **Simplified interface** - just provide fuzzy search terms (NO regex support)
 - **Multiline record processing** for complex pattern matching
 - **Standalone CLI** for testing and direct usage
@@ -339,6 +340,8 @@ Once configured in Claude Desktop, you can use natural language for advanced sea
 - "Search for 'vector' in PDF documents" (requires ripgrep-all)
 - "Find all references to 'machine learning' in PDFs and Word documents"
 - "Extract pages 5-10 from the user manual PDF"
+- "Get the table of contents from the research paper PDF"
+- "Show me the outline of chapters in the user manual"
 
 #### CLI Usage
 
@@ -385,6 +388,10 @@ The fuzzy search server also works as a standalone CLI tool:
 ./mcp_fuzzy_search.py page-labels manual.pdf  # List all page labels
 ./mcp_fuzzy_search.py page-labels manual.pdf --start 100 --limit 20  # Get labels for pages 100-119
 ./mcp_fuzzy_search.py page-count manual.pdf  # Get total page count
+./mcp_fuzzy_search.py pdf-outline manual.pdf  # Get table of contents
+./mcp_fuzzy_search.py pdf-outline manual.pdf --max-depth 2  # Limit to 2 levels
+./mcp_fuzzy_search.py pdf-outline manual.pdf --fuzzy-filter "chapter"  # Filter by title
+./mcp_fuzzy_search.py pdf-outline manual.pdf --no-simple  # Detailed output with links
 ```
 
 ### SQLite Server
@@ -965,6 +972,92 @@ Get the total number of pages in a PDF file.
 }
 ```
 
+#### `get_pdf_outline`
+Extract the table of contents (outline/bookmarks) from a PDF file.
+
+**Purpose:** Returns the hierarchical outline structure with levels, titles, page numbers, and page labels, helpful for navigating complex PDFs and understanding document structure.
+
+**Parameters:**
+- `file` (required): Path to PDF file
+- `simple` (optional): If true (default), return basic info; if false, return detailed info with link data
+- `max_depth` (optional): Maximum depth to traverse in the outline hierarchy (default: unlimited)
+- `fuzzy_filter` (optional): Fuzzy search string to filter outline entries by title using fzf
+
+**Example:**
+```python
+{
+  "file": "research_paper.pdf"
+}
+
+# Returns (simple mode):
+{
+  "outline": [
+    [1, "Introduction", 1, "i"],
+    [1, "Chapter 1: Background", 5, "1"],
+    [2, "1.1 History", 6, "2"],
+    [2, "1.2 Related Work", 10, "6"],
+    [1, "Chapter 2: Methods", 15, "11"],
+    [2, "2.1 Data Collection", 16, "12"],
+    [3, "2.1.1 Sources", 17, "13"],
+    [2, "2.2 Analysis", 20, "16"]
+  ],
+  "total_entries": 8,
+  "max_depth_found": 3
+}
+
+# Example with filtering:
+{
+  "file": "research_paper.pdf",
+  "fuzzy_filter": "chapter"
+}
+
+# Returns:
+{
+  "outline": [
+    [1, "Chapter 1: Background", 5, "1"],
+    [1, "Chapter 2: Methods", 15, "11"]
+  ],
+  "total_entries": 8,
+  "max_depth_found": 3,
+  "filtered_count": 2
+}
+
+# Example with detailed output:
+{
+  "file": "research_paper.pdf",
+  "simple": false,
+  "max_depth": 2
+}
+
+# Returns:
+{
+  "outline": [
+    [1, "Introduction", 1, "i", {
+      "page": 1,
+      "uri": "#page=1&zoom=100,0,0",
+      "is_external": false,
+      "is_open": true,
+      "dest": {
+        "kind": 1,
+        "page": 0,
+        "uri": "#page=1&zoom=100,0,0"
+      }
+    }],
+    # ... more entries with link details
+  ],
+  "total_entries": 8,
+  "max_depth_found": 2
+}
+```
+
+**Outline Format:**
+- Simple mode returns: `[level, title, page, page_label]`
+  - `level`: Hierarchy level (1-based, 1 = top level)
+  - `title`: The bookmark/outline entry title
+  - `page`: Page number (1-based)
+  - `page_label`: Page label as shown in PDF readers (e.g., "i", "ii", "1", "ToC")
+- Detailed mode adds a 5th element with link information including destination details
+
 ### SQLite Server Tools
 
 #### `query`