Releases: teaguesterling/duckdb_webbed
v1.5.0
New Features
- datetime_format parameter - Control date/time detection and parsing in read_xml, read_html, parse_xml, and parse_html. Supports preset names (auto, none, us, eu, iso, etc.), custom strftime format strings, and lists of formats. Replaces regex-based temporal detection with DuckDB's StrpTimeFormat candidate elimination. (#38)
- nullstr parameter - Custom NULL value representation for XML/HTML parsing (#40)
- Lazy DOM extraction - Records are now extracted one at a time directly from the DOM instead of caching all rows, reducing peak memory usage (#17, Phase 1)
- Type inference for elements with attributes - The #text field now infers proper types (DOUBLE, INTEGER, DATE, BOOLEAN) instead of defaulting to VARCHAR (#49, #46)
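As a rough illustration of the new parameters, the queries below pass datetime_format as a preset name, as a custom strftime string, and as a list of formats, and then set nullstr. This is a sketch rather than output from the extension's test suite: the sample documents are made up, and it assumes that nullstr takes a single string and that parse_xml accepts it.

```sql
-- Assumes the webbed extension is already installed and loaded.

-- Preset name
SELECT *
FROM parse_xml('<rows><row><day>2024-01-15</day></row></rows>',
               datetime_format := 'iso');

-- Custom strftime format string
SELECT *
FROM parse_xml('<rows><row><day>15/01/2024</day></row></rows>',
               datetime_format := '%d/%m/%Y');

-- List of candidate formats
SELECT *
FROM parse_xml('<rows><row><day>15/01/2024</day></row></rows>',
               datetime_format := ['%d/%m/%Y', '%Y-%m-%d']);

-- Assumption: nullstr is a single string that should be read as NULL
SELECT *
FROM parse_xml('<rows><row><score>N/A</score></row></rows>',
               nullstr := 'N/A');
```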
Improvements
- Increased default maximum_file_size from 16MB to 128MB (#66)
Bug Fixes
- Fixed read_xml returning NULL for non-Latin text content: Cyrillic, CJK, and other multi-byte UTF-8 characters were being stripped by whitespace trimming (#64)
v1.4.1
Update for DuckDB v1.5 compatibility.
webbed v1.4.0
webbed v1.4.0 Release Notes
Overview
This release introduces the new parse_xml and parse_html functions for parsing XML/HTML content directly from strings, complementing the existing file-based read_xml and read_html functions. It also includes a bug fix for CDATA section handling.
New Features
String-based XML/HTML Parsing
New table functions for parsing XML/HTML content from strings instead of files:
- parse_xml_objects(xml_string) - Parse XML string and return raw content as XMLType
- parse_html_objects(html_string) - Parse HTML string and return raw content as HTMLType
- parse_xml(xml_string, [options]) - Parse XML string with schema inference
- parse_html(html_string, [options]) - Parse HTML string with schema inference
Basic Usage:
-- Parse XML string to raw content
SELECT * FROM parse_xml_objects('<root><item>test</item></root>');
-- Parse XML with schema inference
SELECT title, price
FROM parse_xml('<catalog><book><title>DuckDB</title><price>29.99</price></book></catalog>');
-- Parse with explicit schema
SELECT * FROM parse_xml('<root><item>42</item></root>', columns := {item: 'INTEGER'});
-- Error handling
SELECT * FROM parse_xml_objects('invalid xml', ignore_errors := true);
Supported Parameters:
All parse_* functions support:
- ignore_errors (BOOLEAN) - Return empty result instead of failing on invalid input
parse_xml and parse_html support all schema inference parameters from their read_* counterparts:
- root_element, record_element, force_list
- attr_mode, attr_prefix, text_key
- namespaces, empty_elements
- auto_detect, max_depth
- unnest_as, all_varchar, columns
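A short sketch of how these options might be combined on a string input. The sample XML is invented for illustration, and the assumption that record_element takes a bare element name is mine, not something these notes state.

```sql
-- Assumption: record_element takes an element name that marks one row per match.
SELECT *
FROM parse_xml(
    '<catalog><book><title>DuckDB</title><price>29.99</price></book></catalog>',
    record_element := 'book',
    all_varchar := true   -- skip type inference and keep every column as VARCHAR
);

-- ignore_errors is listed for all parse_* functions, including the
-- schema-inferring variants
SELECT * FROM parse_xml('invalid xml', ignore_errors := true);
```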
Bug Fixes
- Fixed CDATA sections converted to empty objects in xml_to_json (#63) - CDATA content is now properly preserved when converting XML to JSON
Testing
- Added comprehensive test suite for parse_xml and parse_html functions
- 17 new test assertions for string-based parsing
v1.3.2 - Bugfixes
Fixing #61 - Filename parameters, better parameter handling tests, and documentation
v1.3.0
webbed v1.3.0 Release Notes
Overview
This release introduces the duck_blocks document processing system, significantly improved XML namespace handling, and several bug fixes. The duck_blocks API is designed to integrate seamlessly with the duck_block_utils extension for full Markdown support.
New Features
Duck Blocks Document Processing
New functions for converting HTML documents to and from structured block representations:
- html_to_duck_blocks(html) - Parse HTML into a list of structured duck_block elements
- duck_blocks_to_html(blocks) - Convert duck_block elements back to HTML
The duck_block structure provides a standardized representation:
STRUCT(kind, element_type, content, level, encoding, attributes, element_order)
Supported block types include: paragraph, heading, code_block, blockquote, list_item, horizontal_rule, table, image, metadata, and inline elements (text, bold, italic, code, link, image, linebreak, etc.).
Features:
- Round-trip HTML preservation (HTML -> duck_blocks -> HTML)
- Inline element support with structured parsing (#59)
- Table rendering with proper <thead>/<tbody> structure (#57)
- Frontmatter preservation using <script type="application/vnd.frontmatter+yaml"> tags (#56)
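The two functions above can be sketched in use as follows. The HTML snippets are made up, and the sketch assumes a plain VARCHAR literal is accepted for the html argument; no output is shown because the exact block contents are not spelled out in these notes.

```sql
-- Assumption: a VARCHAR literal is accepted as the html argument.
SELECT html_to_duck_blocks('<h1>Title</h1><p>Hello <b>world</b></p>') AS blocks;

-- One row per block, using DuckDB's unnest on the returned list
SELECT unnest(html_to_duck_blocks('<h1>Title</h1><p>Hello</p>')) AS block;

-- Round trip: HTML -> duck_blocks -> HTML
SELECT duck_blocks_to_html(
    html_to_duck_blocks('<h1>Title</h1><p>Hello</p>')
) AS html;
```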
XML Namespace Improvements (#60)
New namespace modes for XPath functions:
- 'auto' (recommended) - Automatically detects undeclared prefixes and either looks up common URIs or creates mock URIs
- 'strict' - Requires all namespaces to be explicitly declared
- 'ignore' - Ignores namespace declarations
New helper functions:
- xml_find_undefined_prefixes(xml, xpath) - Find undeclared namespace prefixes in XPath
- xml_add_namespace_declarations(xml, map) - Inject namespace declarations into XML
- xml_lookup_namespace(prefix) - Look up common namespace URIs (gml, svg, xlink, dc, etc.)
Updated functions to support namespace configuration:
- xml_extract_text
- xml_extract_elements
- xml_extract_elements_string
- xml_extract_attributes
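The helper functions can be tried on their own, as in the sketch below. It deliberately leaves out the namespace mode argument of the extraction functions, since these notes do not show how that argument is passed. The sample document is invented, and the map passed to xml_add_namespace_declarations is assumed to go from prefix to URI.

```sql
-- Which prefixes does the XPath use that the document never declares?
SELECT xml_find_undefined_prefixes(
    '<root><gml:pos>1 2</gml:pos></root>',
    '//gml:pos'
) AS missing_prefixes;

-- Look up a well-known URI for a prefix
SELECT xml_lookup_namespace('gml') AS gml_uri;

-- Assumption: the map argument maps prefix -> namespace URI
SELECT xml_add_namespace_declarations(
    '<root><gml:pos>1 2</gml:pos></root>',
    MAP {'gml': 'http://www.opengis.net/gml'}
) AS patched_xml;
```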
Implicit Type Casting
XML and HTML types now implicitly cast to VARCHAR (cost 1), allowing string functions to work on XML/HTML values while preferring direct XML/HTML function overloads.
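A minimal sketch of what the implicit cast allows, assuming the types are exposed under the names XML and HTML and that a VARCHAR literal can be cast to them explicitly:

```sql
-- length() and upper() are ordinary VARCHAR functions; with the implicit
-- cast they should bind to XML/HTML values without an explicit ::VARCHAR.
SELECT length('<root><item>42</item></root>'::XML) AS n_chars;
SELECT upper('<p>hello</p>'::HTML) AS upper_html;
```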
Duck Blocks API Rename
Functions have been created to convert HTML to duck_blocks for integration with duck_block_utils:
- html_to_duck_blocks
- duck_blocks_to_html
These can be used to convert to other document types (e.g., markdown) or integrate with pandoc.
HTML Namespace Parameter Removed
The namespace parameter has been removed from html_extract_text. HTML5 parsing doesn't support XML namespace declarations (prefixed elements are treated as literal names with colons).
Bug Fixes
Documentation
- Added comprehensive documentation for duck_block functions
- Added namespace mode documentation with 'auto' mode recommendation
- Added integration examples with duck_block_utils extension
- Updated quick reference tables
Testing
- 54 new assertions for duck_block API
- 37 new assertions for table round-trips
- Comprehensive namespace mode tests
v1.2.1 - Bug fixes and XPaths
Release Notes: webbed v1.2.1
Overview
This release brings significant improvements to XPath functionality, namespace handling, HTML/XML feature parity, and comprehensive documentation. It includes several breaking changes to align with PostgreSQL's xpath() semantics.
Breaking Changes
XPath Functions Now Return LIST of All Matches (Issue #53)
All XPath extraction functions now return a LIST of all matching results instead of just the first match. This aligns with PostgreSQL's xpath() behavior.
| Function | Old Return Type | New Return Type |
|---|---|---|
| xml_extract_text(xml, xpath) | VARCHAR | LIST(VARCHAR) |
| html_extract_text(html, xpath) | VARCHAR | LIST(VARCHAR) |
| xml_extract_elements(xml, xpath) | XMLFragment | LIST(XMLFragment) |
Migration: Use list indexing [1] to get single values:
-- Before (v1.2.0)
SELECT xml_extract_text(xml, '//title');
-- After (v1.2.1)
SELECT xml_extract_text(xml, '//title')[1]; -- Get first match
SELECT xml_extract_text(xml, '//title'); -- Get all matches as LIST
xml_namespaces() Returns MAP Instead of LIST (Uncommitted)
The xml_namespaces() function now returns MAP(VARCHAR, VARCHAR) instead of LIST(STRUCT(prefix, uri)) for easier namespace lookups and merging.
-- New: Direct key access
SELECT map_extract_value(xml_namespaces(xml), 'gml');
-- New: Easy merging with map_concat
SELECT map_concat(xml_namespaces(xml), xml_common_namespaces());
New Features
XPath Namespace Prefix Auto-Registration (Issue #4)
XPath expressions with namespace prefixes now work when the namespace is declared in the document:
-- Now works! (was broken in v1.2.0)
SELECT xml_extract_text(
'<root xmlns:gml="http://www.opengis.net/gml"><gml:posList>1 2 3</gml:posList></root>',
'//gml:posList'
);
-- Returns: ['1 2 3']
Technical Details: libxml2's XPath engine requires explicit namespace registration. We now auto-register all xmlns: declarations from the document into the XPath context using xmlGetNsList() and xmlXPathRegisterNs().
New Namespace Helper Functions
| Function | Description |
|---|---|
| xml_common_namespaces() | Returns MAP of ~25 well-known namespace prefixes (xsd, svg, gml, rdf, dc, soap, etc.) |
| xml_detect_prefixes(xpath) | Parses XPath expression and returns LIST of namespace prefixes used |
| xml_mock_namespaces(prefixes) | Creates mock URIs (urn:mock:prefix) for a list of prefixes |
-- Get common namespaces
SELECT xml_common_namespaces();
-- Returns: {xml=..., xsd=..., svg=..., gml=..., rdf=..., dc=..., ...}
-- Detect prefixes in XPath
SELECT xml_detect_prefixes('//gml:posList | //svg:path');
-- Returns: ['gml', 'svg']
-- Create mock namespaces for testing
SELECT xml_mock_namespaces(['custom', 'app']);
-- Returns: {custom='urn:mock:custom', app='urn:mock:app'}
HTML/XML Feature Parity (Issue #18)
read_html() now supports the same parameters as read_xml():
- max_depth - Control nesting depth
- all_varchar - Force all columns to VARCHAR
- force_list - Force specific elements to be LIST type
- union_by_name - Combine multiple files with different schemas
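A hedged sketch of these parameters on a multi-file read. The glob 'pages/*.html' is a placeholder, the max_depth value is arbitrary, and force_list is left out because these notes do not say what form its value takes.

```sql
-- 'pages/*.html' is a hypothetical path; substitute real files.
SELECT *
FROM read_html(
    'pages/*.html',
    union_by_name := true,  -- combine files whose inferred schemas differ
    all_varchar := true,    -- force all columns to VARCHAR
    max_depth := 3          -- assumed to be an integer nesting limit
);
```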
Cross-Record Schema Inference (Issues #33, #50)
Schema inference now correctly examines ALL records to detect:
- Repeated elements - Elements that appear multiple times become LIST type
- Nested attributes - Attributes discovered in later records create proper STRUCT types
-- Elements with attributes now properly typed
-- <phone type="mobile">555-1234</phone> becomes:
-- STRUCT("#text" VARCHAR, "type" VARCHAR)
Opaque Type Parameterization (Issue #18)
HTML files now correctly show "HTML" type instead of "XML" for opaque elements in schema inference.
Bug Fixes
- UTF-8 Encoding (Issue #53): Fixed html_extract_text() to properly handle UTF-8 encoded content
- Documentation (Issue #54): Fixed README examples to match actual function behavior
- LIST Extraction: Fixed extraction of LIST values in multi-record schemas
- Record Serialization: Fixed serialization of record elements in ExtractDataWithSchema
- HTML Parser Selection: Fixed parser selection for multi-file reads to consistently use HTML parser for HTML files
Documentation
Read the Docs Integration
Comprehensive documentation is now available at Read the Docs (link TBD):
- Installation Guide - Platform-specific setup instructions
- Quick Start - Common usage patterns
- Function Reference - Complete API documentation
- XPath Guide - XPath expression syntax and examples
- Namespace Handling - Working with XML namespaces
- Schema Inference - How automatic type detection works
- Parameters Reference - All configuration options
Test Suite Improvements
- Added comprehensive test suites for GitHub Issues #4, #17, #33, #50, #53, #54, #55
- Added HTML feature parity tests (Priority 1 & 2)
- Added cross-record attribute discovery tests
- Total test coverage: 55 test files, 1648+ assertions
GitHub Issues Addressed
| Issue | Title | Status |
|---|---|---|
| #4 | Namespace extraction and XPath with namespaces | ✅ Fixed |
| #17 | Large file streaming | ✅ Tests added |
| #18 | HTML/XML feature parity | ✅ Implemented |
| #33 | Repeated element data loss | ✅ Fixed |
| #50 | Cross-record attribute discovery | ✅ Fixed |
| #51 | Custom HTML elements not recognized | ✅ Fixed |
| #53 | XPath functions return all matches + UTF-8 fix | ✅ Fixed |
| #54 | Documentation mismatches | ✅ Fixed |
| #55 | xml_extract_attributes investigation | 🔄 Tests added |
Upgrade Guide
- Update XPath queries that expect single values to use [1] indexing
- Update xml_namespaces() usage from struct access to MAP functions:
  - Old: xml_namespaces(xml)[1].prefix
  - New: map_keys(xml_namespaces(xml))[1]
- Test namespace-aware XPath - queries using declared prefixes should now work without workarounds
Known Issues
See TODO.md for tracked issues:
- Empty file list should return empty result (not error)
- ignore_errors doesn't prevent "No files found" error
- Invalid XPath returns 0 rows instead of error
- Parameter validation not fully implemented
v1.2.0 - Lots of Fixes for DuckDB v1.4.2
What's Changed
- Format all files with clang-format by @onnimonni in #12
- Add devenv and claude configs by @onnimonni in #11
- add array support for xml_read() function by @onnimonni in #14
- Bump to DuckDB v1.4.1 by @teaguesterling in #15
- Fix schema issues identified in #13 and #8 by @teaguesterling in #16
- Fix race conditions and error handling in multiple threads. by @teaguesterling in #19
- Api consistency updates by @teaguesterling in #20
- Api consistency updates by @teaguesterling in #23
- Fix XPath namespace warning with thread-safe per-context error handler by @teaguesterling in #24
- Fix XPath namespace warning with thread-safe per-context error handler by @teaguesterling in #25
- Add union_by_name parameter for read_xml() and read_html() by @onnimonni in #28
- Add html_escape and html_unescape functions with UTF-8 support by @onnimonni in #30
- Add support to read more elements than STANDARD_VECTOR_SIZE of 2048 by @onnimonni in #31
- Use function_name variable in all_varchar conflict error message by @Copilot in #37
- Add GitHub Action for pre-commit checks by @onnimonni in #29
- Add all_varchar parameter to force scalar types to VARCHAR by @teaguesterling in #34
- Fix type inference order to prioritize numeric types over boolean by @teaguesterling in #36
New Contributors
- @onnimonni made their first contribution in #12
- @Copilot made their first contribution in #37
Full Changelog: v1.1.0...v1.2.0
v1.1.1 - Fix schema introspection issues.
What's Changed
- Format all files with clang-format by @onnimonni in #12
- Add devenv and claude configs by @onnimonni in #11
- add array support for xml_read() function by @onnimonni in #14
- Issues/7 investigation by @teaguesterling in #21
New Contributors
- @onnimonni made their first contribution in #12
Full Changelog: v1.1.0...v1.1.1
v1.1.0 - DuckDB 1.4.0 Compatibility & More
- Added support for DuckDB v1.4.0+ (#6)
- Updated README to provide better examples of extracting tables from HTML (#5)
- Improved namespace handling in xml_to_json (#4, partially)
- Added parameters to match xmltodict style (with example macro). (#2, partially)
Full Changelog: v1.0.2...v1.1.0
Builds for all platforms
This fixes some dependency issues to allow building on all platforms. No feature changes.