23 Mar 17:01

teaguesterling

66eafc1

v1.5.0 Latest

New Features

datetime_format parameter — Control date/time detection and parsing in read_xml, read_html, parse_xml, and parse_html. Supports preset names (auto, none, us, eu, iso, etc.), custom strftime format strings, and lists of formats. Replaces regex-based temporal detection with DuckDB's StrpTimeFormat candidate elimination. (#38)
nullstr parameter — Custom NULL value representation for XML/HTML parsing (#40)
Lazy DOM extraction — Records are now extracted one at a time directly from the DOM instead of caching all rows, reducing peak memory usage (#17, Phase 1)
Type inference for elements with attributes — #text field now infers proper types (DOUBLE, INTEGER, DATE, BOOLEAN) instead of defaulting to VARCHAR (#49, #46)
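
The new parameters can be passed with the same named-parameter syntax (:=) used in the v1.4.0 examples. The sketch below is illustrative only: the file name events.xml and the sample strings are hypothetical, and it assumes nullstr accepts a single string value, which these notes do not spell out.

```sql
-- Preset name: restrict temporal detection to the 'iso' preset
SELECT * FROM read_xml('events.xml', datetime_format := 'iso');

-- Preset 'none' (presumably turns date/time detection off; check the docs)
SELECT * FROM read_xml('events.xml', datetime_format := 'none');

-- Custom strftime-style format string
SELECT *
FROM parse_xml(
    '<log><entry><day>23.03.2025</day></entry></log>',
    datetime_format := '%d.%m.%Y'
);

-- List of candidate formats
SELECT * FROM read_xml('events.xml', datetime_format := ['%d.%m.%Y', '%Y-%m-%d']);

-- Treat a sentinel string as NULL (assumed form: one VARCHAR value)
SELECT *
FROM parse_xml(
    '<rows><row><qty>N/A</qty></row><row><qty>3</qty></row></rows>',
    nullstr := 'N/A'
);
```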

Improvements

Increased default maximum_file_size from 16MB to 128MB (#66)

Bug Fixes

Fixed read_xml returning NULL for non-Latin text content — Cyrillic, CJK, and other multi-byte UTF-8 characters were being stripped by whitespace trimming (#64)

23 Feb 23:42

teaguesterling

v1.4.1

7d264c0

v1.4.1

Updated for DuckDB v1.5 compatibility.

18 Feb 17:54

teaguesterling

v1.4.0

751f2f3

webbed v1.4.0

webbed v1.4.0 Release Notes

Overview

This release introduces new parse_xml and parse_html functions for parsing XML/HTML content directly from strings, complementing the existing file-based read_xml and read_html functions. It also includes a bug fix for CDATA section handling.

New Features

String-based XML/HTML Parsing

New table functions for parsing XML/HTML content from strings instead of files:

parse_xml_objects(xml_string) - Parse XML string and return raw content as XMLType
parse_html_objects(html_string) - Parse HTML string and return raw content as HTMLType
parse_xml(xml_string, [options]) - Parse XML string with schema inference
parse_html(html_string, [options]) - Parse HTML string with schema inference

Basic Usage:

-- Parse XML string to raw content
SELECT * FROM parse_xml_objects('<root><item>test</item></root>');

-- Parse XML with schema inference
SELECT title, price
FROM parse_xml('<catalog><book><title>DuckDB</title><price>29.99</price></book></catalog>');

-- Parse with explicit schema
SELECT * FROM parse_xml('<root><item>42</item></root>', columns := {item: 'INTEGER'});

-- Error handling
SELECT * FROM parse_xml_objects('invalid xml', ignore_errors := true);

Supported Parameters:

All parse_* functions support:

ignore_errors (BOOLEAN) - Return empty result instead of failing on invalid input

parse_xml and parse_html support all schema inference parameters from their read_* counterparts:

root_element, record_element, force_list
attr_mode, attr_prefix, text_key
namespaces, empty_elements
auto_detect, max_depth
unnest_as, all_varchar, columns
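
As a rough sketch of how these options carry over to string input, the queries below use only parameters named in the list above. The input strings are made up for illustration, and the value types (a BOOLEAN for all_varchar, an integer for max_depth) are assumptions based on how the parameters are described elsewhere in these notes.

```sql
-- Keep every inferred column as VARCHAR instead of inferring types
SELECT *
FROM parse_xml(
    '<catalog><book><title>DuckDB</title><price>29.99</price></book></catalog>',
    all_varchar := true
);

-- Limit how deep schema inference descends into nested elements
SELECT *
FROM parse_xml('<root><a><b><c>1</c></b></a></root>', max_depth := 2);

-- Return an empty result rather than an error for unparseable input
SELECT * FROM parse_xml('not xml at all', ignore_errors := true);

-- HTML counterpart of parse_xml_objects: raw content from a string
SELECT * FROM parse_html_objects('<html><body><p>Hello</p></body></html>');
```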

Bug Fixes

Fixed CDATA sections being converted to empty objects in xml_to_json (#63): CDATA content is now preserved when converting XML to JSON

Testing

Added comprehensive test suite for parse_xml and parse_html functions
17 new test assertions for string-based parsing

03 Jan 22:16

teaguesterling

v1.3.2

6ffff99

v1.3.2 - Bugfixes

Fixes #61: filename parameters, better parameter-handling tests, and documentation

02 Jan 20:30

teaguesterling

v1.3.0

ad79f6b

v1.3.0

webbed v1.3.0 Release Notes

Overview

This release introduces the duck_blocks document processing system, significantly improved XML namespace handling, and several bug fixes. The duck_blocks API is designed to integrate seamlessly with the duck_block_utils extension for full Markdown support.

New Features

Duck Blocks Document Processing

New functions for converting HTML documents to and from structured block representations:

html_to_duck_blocks(html) - Parse HTML into a list of structured duck_block elements
duck_blocks_to_html(blocks) - Convert duck_block elements back to HTML

The duck_block structure provides a standardized representation:

STRUCT(kind, element_type, content, level, encoding, attributes, element_order)

Supported block types include: paragraph, heading, code_block, blockquote, list_item, horizontal_rule, table, image, metadata, and inline elements (text, bold, italic, code, link, image, linebreak, etc.).

Features:

Round-trip HTML preservation (HTML -> duck_blocks -> HTML)
Inline element support with structured parsing (#59)
Table rendering with proper <thead>/<tbody> structure (#57)
Frontmatter preservation using <script type="application/vnd.frontmatter+yaml"> tags (#56)
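
A minimal round-trip sketch using the two functions above. The HTML fragment is invented for illustration, and the example assumes a plain string literal is accepted where an HTML value is expected; if it is not, an explicit cast may be needed.

```sql
-- Inspect the blocks produced for a small fragment, one row per block
SELECT unnest(html_to_duck_blocks('<h1>Title</h1><p>Hello <b>world</b></p>')) AS block;

-- Round trip: HTML -> duck_blocks -> HTML
SELECT duck_blocks_to_html(
    html_to_duck_blocks('<h1>Title</h1><p>Hello <b>world</b></p>')
) AS html;
```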

XML Namespace Improvements (#60)

New namespace modes for XPath functions:

'auto' (recommended) - Automatically detects undeclared prefixes and either looks up common URIs or creates mock URIs
'strict' - Requires all namespaces to be explicitly declared
'ignore' - Ignores namespace declarations

New helper functions:

xml_find_undefined_prefixes(xml, xpath) - Find undeclared namespace prefixes in XPath
xml_add_namespace_declarations(xml, map) - Inject namespace declarations into XML
xml_lookup_namespace(prefix) - Look up common namespace URIs (gml, svg, xlink, dc, etc.)
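
The helpers can be tried in isolation. The calls below follow the signatures listed above; the sample XML is hypothetical, and passing the declarations as a MAP from prefix to URI is an assumption based on the map argument name.

```sql
-- Look up the URI registered for a well-known prefix
SELECT xml_lookup_namespace('gml');

-- Which prefixes used by the XPath are not declared in the document?
SELECT xml_find_undefined_prefixes(
    '<root><gml:pos>1 2</gml:pos></root>',
    '//gml:pos'
);

-- Inject the missing declaration (assumed form: MAP of prefix to URI)
SELECT xml_add_namespace_declarations(
    '<root><gml:pos>1 2</gml:pos></root>',
    MAP {'gml': 'http://www.opengis.net/gml'}
);
```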

Updated functions to support namespace configuration:

xml_extract_text
xml_extract_elements
xml_extract_elements_string
xml_extract_attributes

Implicit Type Casting

XML and HTML types now implicitly cast to VARCHAR (cost 1), allowing string functions to work on XML/HTML values while preferring direct XML/HTML function overloads.

Duck Blocks API Rename

Functions have been created to convert HTML to duck_blocks for integration with duck_block_utils:

html_to_duck_blocks
duck_blocks_to_html

These can be used to convert to other document types (e.g., markdown) or integrate with pandoc.

HTML Namespace Parameter Removed

The namespace parameter has been removed from html_extract_text. HTML5 parsing doesn't support XML namespace declarations (prefixed elements are treated as literal names with colons).

Bug Fixes

Fixed namespace mode handling in XPath functions (#60)
Fixed HTML union_by_name bug (#48)

Documentation

Added comprehensive documentation for duck_block functions
Added namespace mode documentation with 'auto' mode recommendation
Added integration examples with duck_block_utils extension
Updated quick reference tables

Testing

54 new assertions for duck_block API
37 new assertions for table round-trips
Comprehensive namespace mode tests

01 Jan 00:53

teaguesterling

v1.2.1

e65b780

v1.2.1 - Bug fixes and XPaths

Release Notes: webbed v1.2.1

Overview

This release brings significant improvements to XPath functionality, namespace handling, HTML/XML feature parity, and comprehensive documentation. It includes several breaking changes to align with PostgreSQL's xpath() semantics.

Breaking Changes

XPath Functions Now Return LIST of All Matches (Issue #53)

All XPath extraction functions now return a LIST of all matching results instead of just the first match. This aligns with PostgreSQL's xpath() behavior.

Function	Old Return Type	New Return Type
`xml_extract_text(xml, xpath)`	`VARCHAR`	`LIST(VARCHAR)`
`html_extract_text(html, xpath)`	`VARCHAR`	`LIST(VARCHAR)`
`xml_extract_elements(xml, xpath)`	`XMLFragment`	`LIST(XMLFragment)`

Migration: Use list indexing [1] to get single values:

-- Before (v1.2.0)
SELECT xml_extract_text(xml, '//title');

-- After (v1.2.1)
SELECT xml_extract_text(xml, '//title')[1];  -- Get first match
SELECT xml_extract_text(xml, '//title');     -- Get all matches as LIST
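
Because the result is an ordinary DuckDB LIST, standard list operations apply. The following sketch uses made-up inline XML and relies on DuckDB's usual behavior that indexing a list out of range yields NULL.

```sql
-- One row per match instead of a single LIST value
SELECT unnest(
    xml_extract_text('<lib><title>A</title><title>B</title></lib>', '//title')
) AS title;

-- Count the matches
SELECT len(
    xml_extract_text('<lib><title>A</title><title>B</title></lib>', '//title')
) AS n_titles;

-- Fall back to a default when nothing matches ([1] is NULL in that case)
SELECT coalesce(xml_extract_text('<lib/>', '//title')[1], 'untitled') AS title;
```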

`xml_namespaces()` Returns MAP Instead of LIST (Uncommitted)

The xml_namespaces() function now returns MAP(VARCHAR, VARCHAR) instead of LIST(STRUCT(prefix, uri)) for easier namespace lookups and merging.

-- New: Direct key access
SELECT map_extract_value(xml_namespaces(xml), 'gml');

-- New: Easy merging with map_concat
SELECT map_concat(xml_namespaces(xml), xml_common_namespaces());

New Features

XPath Namespace Prefix Auto-Registration (Issue #4)

XPath expressions with namespace prefixes now work when the namespace is declared in the document:

-- Now works! (was broken in v1.2.0)
SELECT xml_extract_text(
  '<root xmlns:gml="http://www.opengis.net/gml"><gml:posList>1 2 3</gml:posList></root>',
  '//gml:posList'
);
-- Returns: ['1 2 3']

Technical Details: libxml2's XPath engine requires explicit namespace registration. We now auto-register all xmlns: declarations from the document into the XPath context using xmlGetNsList() and xmlXPathRegisterNs().

New Namespace Helper Functions

Function	Description
`xml_common_namespaces()`	Returns MAP of ~25 well-known namespace prefixes (xsd, svg, gml, rdf, dc, soap, etc.)
`xml_detect_prefixes(xpath)`	Parses XPath expression and returns LIST of namespace prefixes used
`xml_mock_namespaces(prefixes)`	Creates mock URIs (`urn:mock:prefix`) for a list of prefixes

-- Get common namespaces
SELECT xml_common_namespaces();
-- Returns: {xml=..., xsd=..., svg=..., gml=..., rdf=..., dc=..., ...}

-- Detect prefixes in XPath
SELECT xml_detect_prefixes('//gml:posList | //svg:path');
-- Returns: ['gml', 'svg']

-- Create mock namespaces for testing
SELECT xml_mock_namespaces(['custom', 'app']);
-- Returns: {custom='urn:mock:custom', app='urn:mock:app'}

HTML/XML Feature Parity (Issue #18)

read_html() now supports the same parameters as read_xml():

max_depth - Control nesting depth
all_varchar - Force all columns to VARCHAR
force_list - Force specific elements to be LIST type
union_by_name - Combine multiple files with different schemas
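
A short sketch of these parameters on read_html(). The paths are hypothetical, and the BOOLEAN and integer values are assumptions based on the descriptions above; force_list is left out because its value format is not described here.

```sql
-- Combine several HTML files whose inferred schemas differ
SELECT * FROM read_html('pages/*.html', union_by_name := true);

-- Keep all columns as VARCHAR and cap the nesting depth
SELECT * FROM read_html('page.html', all_varchar := true, max_depth := 2);
```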

Cross-Record Schema Inference (Issues #33, #50)

Schema inference now correctly examines ALL records to detect:

Repeated elements - Elements that appear multiple times become LIST type
Nested attributes - Attributes discovered in later records create proper STRUCT types

-- Elements with attributes now properly typed
-- <phone type="mobile">555-1234</phone> becomes:
-- STRUCT("#text" VARCHAR, "type" VARCHAR)
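
One way to see the effect is to inspect the inferred schema for a file where the attribute and the repetition only show up in a later record. The file contacts.xml below is hypothetical, and the exact column types reported will depend on the extension version.

```sql
-- contacts.xml (hypothetical contents):
--   <contacts>
--     <contact><name>Ann</name><phone>555-0000</phone></contact>
--     <contact>
--       <name>Bob</name>
--       <phone type="mobile">555-1234</phone>
--       <phone>555-9999</phone>
--     </contact>
--   </contacts>

-- Show the schema inferred across all records
DESCRIBE SELECT * FROM read_xml('contacts.xml');
```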

Opaque Type Parameterization (Issue #18)

HTML files now correctly show "HTML" type instead of "XML" for opaque elements in schema inference.

Bug Fixes

UTF-8 Encoding (Issue #53): Fixed html_extract_text() to properly handle UTF-8 encoded content
Documentation (Issue #54): Fixed README examples to match actual function behavior
LIST Extraction: Fixed extraction of LIST values in multi-record schemas
Record Serialization: Fixed serialization of record elements in ExtractDataWithSchema
HTML Parser Selection: Fixed parser selection for multi-file reads to consistently use HTML parser for HTML files

Documentation

Read the Docs Integration

Comprehensive documentation is now available at Read the Docs (link TBD):

Installation Guide - Platform-specific setup instructions
Quick Start - Common usage patterns
Function Reference - Complete API documentation
XPath Guide - XPath expression syntax and examples
Namespace Handling - Working with XML namespaces
Schema Inference - How automatic type detection works
Parameters Reference - All configuration options

Test Suite Improvements

Added comprehensive test suites for GitHub Issues #4, #17, #33, #50, #53, #54, #55
Added HTML feature parity tests (Priority 1 & 2)
Added cross-record attribute discovery tests
Total test coverage: 55 test files, 1648+ assertions

GitHub Issues Addressed

Issue	Title	Status
#4	Namespace extraction and XPath with namespaces	✅ Fixed
#17	Large file streaming	✅ Tests added
#18	HTML/XML feature parity	✅ Implemented
#33	Repeated element data loss	✅ Fixed
#50	Cross-record attribute discovery	✅ Fixed
#51	Custom HTML elements not recognized	✅ Fixed
#53	XPath functions return all matches + UTF-8 fix	✅ Fixed
#54	Documentation mismatches	✅ Fixed
#55	xml_extract_attributes investigation	🔄 Tests added

Upgrade Guide

Update XPath queries that expect single values to use [1] indexing
Update xml_namespaces() usage from struct access to MAP functions:
- Old: xml_namespaces(xml)[1].prefix
- New: map_keys(xml_namespaces(xml))[1]
Test namespace-aware XPath - queries using declared prefixes should now work without workarounds

Known Issues

See TODO.md for tracked issues:

Empty file list should return empty result (not error)
ignore_errors doesn't prevent "No files found" error
Invalid XPath returns 0 rows instead of error
Parameter validation not fully implemented

27 Nov 05:24

teaguesterling

v1.2.0

160b8e8

v1.2.0 - Lots of Fixes for DuckDB v1.4.2

What's Changed

Format all files with clang-format by @onnimonni in #12
Add devenv and claude configs by @onnimonni in #11
add array support for xml_read() function by @onnimonni in #14
Bump to DuckDB v1.4.1 by @teaguesterling in #15
Fix schema issues identified in #13 and #8 by @teaguesterling in #16
Fix race conditions and error handling in multiple threads. by @teaguesterling in #19
Api consistency updates by @teaguesterling in #20
Api consistency updates by @teaguesterling in #23
Fix XPath namespace warning with thread-safe per-context error handler by @teaguesterling in #24
Fix XPath namespace warning with thread-safe per-context error handler by @teaguesterling in #25
Add union_by_name parameter for read_xml() and read_html() by @onnimonni in #28
Add html_escape and html_unescape functions with UTF-8 support by @onnimonni in #30
Add support to read more elements than STANDARD_VECTOR_SIZE of 2048 by @onnimonni in #31
Use function_name variable in all_varchar conflict error message by @Copilot in #37
Add GitHub Action for pre-commit checks by @onnimonni in #29
Add all_varchar parameter to force scalar types to VARCHAR by @teaguesterling in #34
Fix type inference order to prioritize numeric types over boolean by @teaguesterling in #36

New Contributors

@onnimonni made their first contribution in #12
@Copilot made their first contribution in #37

Full Changelog: v1.1.0...v1.2.0

Contributors

teaguesterling and onnimonni

12 Oct 19:19

teaguesterling

v1.1.1

f2e49ff

v1.1.1 - Fix schema introspection issues.

What's Changed

Format all files with clang-format by @onnimonni in #12
Add devenv and claude configs by @onnimonni in #11
add array support for xml_read() function by @onnimonni in #14
Issues/7 investigation by @teaguesterling in #21

New Contributors

@onnimonni made their first contribution in #12

Full Changelog: v1.1.0...v1.1.1

Contributors

teaguesterling and onnimonni

05 Oct 00:43

teaguesterling

v1.1.0

487578a

v1.1.0 - DuckDB 1.4.0 Compatibility & More

Added support for DuckDB v1.4.0+ (#6)
Updated README to provide better examples of extracting tables from HTML (#5)
Improved namespace handling in xml_to_json (#4, partially)
Added parameters to match xmltodict style, with an example macro (#2, partially)

Full Changelog: v1.0.2...v1.1.0

13 Aug 01:40

teaguesterling

v1.0.2

d17300c

Builds for all platforms

This fixes some dependency issues to allow building on all platforms. No feature changes.

Releases: teaguesterling/duckdb_webbed
