Multi-file parallelism for read_xml and read_html #72

@teaguesterling

Summary

When reading multiple XML/HTML files (via a glob pattern or an array of paths), each file could be processed by a separate thread. Currently, both the DOM and SAX paths are single-threaded (MaxThreads() returns 1).

Design

  • Move per-file state from XMLReadGlobalState into a new XMLReadLocalState : LocalTableFunctionState
  • Global state provides mutex-protected file index assignment
  • Each thread gets its own DOM or SAX resources
  • MaxThreads() returns files.size()
  • Register init_local callback on all read_xml / read_html table functions
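
The hand-off between global and local state described above can be sketched in plain C++ without any DuckDB or libxml2 dependencies. This is only an illustration under stated assumptions: `GlobalStateSketch`, `LocalStateSketch`, `NextFile`, and `RunWorkers` are hypothetical names, not the extension's actual types, and the per-file parsing is replaced by recording which file index each worker claimed.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the shared global state: it owns the file list
// and hands out the next unclaimed file index under a mutex.
struct GlobalStateSketch {
    std::vector<std::string> files;
    std::mutex lock;
    std::size_t next_file = 0;

    explicit GlobalStateSketch(std::vector<std::string> f) : files(std::move(f)) {}

    // Mirrors the proposal: one potential thread per file.
    std::size_t MaxThreads() const { return files.size(); }

    // Returns true and sets `index` if a file remains; false once all are claimed.
    bool NextFile(std::size_t &index) {
        std::lock_guard<std::mutex> guard(lock);
        if (next_file >= files.size()) {
            return false;
        }
        index = next_file++;
        return true;
    }
};

// Hypothetical stand-in for the per-thread local state. In the real design this
// would hold the thread's own DOM document or SAX push parser context.
struct LocalStateSketch {
    std::vector<std::size_t> processed; // indices of files this thread handled
};

// Spawns MaxThreads() workers; each repeatedly claims a file index until none remain.
std::vector<LocalStateSketch> RunWorkers(GlobalStateSketch &global) {
    const std::size_t n = global.MaxThreads();
    std::vector<LocalStateSketch> locals(n);
    std::vector<std::thread> threads;
    threads.reserve(n);
    for (std::size_t t = 0; t < n; ++t) {
        threads.emplace_back([&global, &locals, t] {
            std::size_t index = 0;
            while (global.NextFile(index)) {
                // Real code would parse global.files[index] here using
                // resources owned exclusively by locals[t].
                locals[t].processed.push_back(index);
            }
        });
    }
    for (auto &th : threads) {
        th.join();
    }
    return locals;
}
```

Because the mutex only guards the index counter, workers never contend while parsing. A fast thread may claim several files while a slow one handles a single large file, but each file is claimed exactly once.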

Considerations

  • Schema inference happens at bind time (single-threaded) — only extraction parallelizes
  • union_by_name may need special handling since schema merging happens during bind
  • DOM path: each thread holds its own XMLDocRAII — straightforward
  • SAX path: each thread holds its own push parser context — straightforward
  • libxml2 is thread-safe for independent parser contexts (no shared global state per CLAUDE.md guidelines), assuming the library is initialized once, e.g. via xmlInitParser(), before worker threads start

Impact

For single-file reads: no change (1 thread).
For multi-file reads: up to N threads for N files.

Labels: enhancement