Releases: teaguesterling/duckdb_webbed

v1.5.0

23 Mar 17:01

New Features

  • datetime_format parameter — Control date/time detection and parsing in read_xml, read_html, parse_xml, and parse_html. Supports preset names (auto, none, us, eu, iso, etc.), custom strftime format strings, and lists of formats. Replaces regex-based temporal detection with DuckDB's StrpTimeFormat candidate elimination. (#38)
  • nullstr parameter — Custom NULL value representation for XML/HTML parsing (#40)
  • Lazy DOM extraction — Records are now extracted one at a time directly from the DOM instead of caching all rows, reducing peak memory usage (#17, Phase 1)
  • Type inference for elements with attributes: the #text field now infers proper types (DOUBLE, INTEGER, DATE, BOOLEAN) instead of defaulting to VARCHAR (#49, #46)
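
The two new parameters can be tried on an inline string, in the same style as the parse_xml examples in the v1.4.0 notes. This is a minimal sketch, not taken from the release: the sample document and the 'N/A' sentinel are invented, and the comments on how each value behaves are assumptions. Only the parameter names and the accepted kinds of value (preset name, strftime string, list of formats) come from the notes above. It assumes the extension is already installed and loaded.

```sql
-- Sample data below is invented for illustration.

-- Preset name ('eu' is assumed here to mean day-first dates),
-- plus a custom NULL marker: the literal text 'N/A' should be read as NULL.
SELECT *
FROM parse_xml(
  '<orders><order><id>1</id><placed>23/03/2025</placed><note>N/A</note></order></orders>',
  datetime_format := 'eu',
  nullstr := 'N/A'
);

-- A custom strftime format string instead of a preset.
SELECT *
FROM parse_xml(
  '<orders><order><id>1</id><placed>23/03/2025</placed></order></orders>',
  datetime_format := '%d/%m/%Y'
);

-- A list of candidate formats.
SELECT *
FROM parse_xml(
  '<orders><order><id>1</id><placed>2025-03-23</placed></order></orders>',
  datetime_format := ['%Y-%m-%d', '%d/%m/%Y']
);

-- The 'none' preset is assumed to turn temporal detection off,
-- leaving date-like text as VARCHAR.
SELECT *
FROM parse_xml(
  '<orders><order><id>1</id><placed>2025-03-23</placed></order></orders>',
  datetime_format := 'none'
);
```

Running DESCRIBE on each query is one way to check which column types were actually inferred.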

Improvements

  • Increased default maximum_file_size from 16MB to 128MB (#66)
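
Files above the new default still need the limit raised explicitly. The sketch below assumes that maximum_file_size is accepted as a named parameter of read_xml and that its value is a byte count; neither detail is stated in these notes, and the file name is hypothetical.

```sql
-- Hypothetical file name; the value is assumed to be in bytes.
-- 268435456 = 256 * 1024 * 1024
SELECT *
FROM read_xml('large_export.xml', maximum_file_size := 268435456);
```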

Bug Fixes

  • Fixed read_xml returning NULL for non-Latin text content — Cyrillic, CJK, and other multi-byte UTF-8 characters were being stripped by whitespace trimming (#64)

v1.4.1

23 Feb 23:42

Update for DuckDB v1.5 compatibility.

webbed v1.4.0

18 Feb 17:54

webbed v1.4.0 Release Notes

Overview

This release introduces new parse_xml and parse_html functions for parsing XML/HTML content directly from strings, complementing the existing file-based read_xml and read_html functions. It also includes a bug fix for CDATA section handling.

New Features

String-based XML/HTML Parsing

New table functions for parsing XML/HTML content from strings instead of files:

  • parse_xml_objects(xml_string) - Parse XML string and return raw content as XMLType
  • parse_html_objects(html_string) - Parse HTML string and return raw content as HTMLType
  • parse_xml(xml_string, [options]) - Parse XML string with schema inference
  • parse_html(html_string, [options]) - Parse HTML string with schema inference

Basic Usage:

-- Parse XML string to raw content
SELECT * FROM parse_xml_objects('<root><item>test</item></root>');

-- Parse XML with schema inference
SELECT title, price
FROM parse_xml('<catalog><book><title>DuckDB</title><price>29.99</price></book></catalog>');

-- Parse with explicit schema
SELECT * FROM parse_xml('<root><item>42</item></root>', columns := {item: 'INTEGER'});

-- Error handling
SELECT * FROM parse_xml_objects('invalid xml', ignore_errors := true);

Supported Parameters:

All parse_* functions support:

  • ignore_errors (BOOLEAN) - Return empty result instead of failing on invalid input

parse_xml and parse_html support all schema inference parameters from their read_* counterparts:

  • root_element, record_element, force_list
  • attr_mode, attr_prefix, text_key
  • namespaces, empty_elements
  • auto_detect, max_depth
  • unnest_as, all_varchar, columns
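
The inference parameters listed above are passed as named arguments, in the same way as columns and ignore_errors in the Basic Usage examples. A small sketch follows; the sample document is invented, and it assumes record_element accepts a plain element name.

```sql
-- Treat each <book> as one record and skip type inference.
-- Assumes record_element takes a bare tag name.
SELECT *
FROM parse_xml(
  '<catalog><book id="1"><title>DuckDB</title><price>29.99</price></book></catalog>',
  record_element := 'book',
  all_varchar := true
);
```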

Bug Fixes

  • Fixed CDATA sections converted to empty objects in xml_to_json (#63) - CDATA content is now properly preserved when converting XML to JSON
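
A quick way to check the CDATA fix on an installed version is to convert a small document whose text lives entirely in a CDATA section. The sample input is invented, and the exact shape of the JSON output is not specified in these notes, so none is shown.

```sql
-- Before the fix, the <body> element would have come back as an empty object.
SELECT xml_to_json('<note><body><![CDATA[5 < 6 & 7 > 2]]></body></note>');
```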

Testing

  • Added comprehensive test suite for parse_xml and parse_html functions
  • 17 new test assertions for string-based parsing

v1.3.2 - Bugfixes

03 Jan 22:16

Fixing #61 - Filename parameters, better parameter handling tests, and documentation

v1.3.0

02 Jan 20:30

webbed v1.3.0 Release Notes

Overview

This release introduces the duck_blocks document processing system, significantly improved XML namespace handling, and several bug fixes. The duck_blocks API is designed to integrate seamlessly with the duck_block_utils extension for full Markdown support.

New Features

Duck Blocks Document Processing

New functions for converting HTML documents to and from structured block representations:

  • html_to_duck_blocks(html) - Parse HTML into a list of structured duck_block elements
  • duck_blocks_to_html(blocks) - Convert duck_block elements back to HTML

The duck_block structure provides a standardized representation:

STRUCT(kind, element_type, content, level, encoding, attributes, element_order)

Supported block types include: paragraph, heading, code_block, blockquote, list_item, horizontal_rule, table, image, metadata, and inline elements (text, bold, italic, code, link, image, linebreak, etc.).

Features:

  • Round-trip HTML preservation (HTML -> duck_blocks -> HTML)
  • Inline element support with structured parsing (#59)
  • Table rendering with proper <thead>/<tbody> structure (#57)
  • Frontmatter preservation using <script type="application/vnd.frontmatter+yaml"> tags (#56)
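
The round trip described above can be sketched with the two functions and the field names from the STRUCT definition. The HTML snippet is invented, and the sketch assumes html_to_duck_blocks accepts a plain string literal; an explicit cast to the HTML type may be needed.

```sql
-- Inspect a few fields of each block.
SELECT b.kind, b.element_type, b.content, b.level
FROM (
  SELECT unnest(html_to_duck_blocks('<h1>Title</h1><p>Some <b>bold</b> text.</p>')) AS b
);

-- HTML -> duck_blocks -> HTML
SELECT duck_blocks_to_html(
  html_to_duck_blocks('<h1>Title</h1><p>Some <b>bold</b> text.</p>')
) AS html_again;
```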

XML Namespace Improvements (#60)

New namespace modes for XPath functions:

  • 'auto' (recommended) - Automatically detects undeclared prefixes and either looks up common URIs or creates mock URIs
  • 'strict' - Requires all namespaces to be explicitly declared
  • 'ignore' - Ignores namespace declarations

New helper functions:

  • xml_find_undefined_prefixes(xml, xpath) - Find undeclared namespace prefixes in XPath
  • xml_add_namespace_declarations(xml, map) - Inject namespace declarations into XML
  • xml_lookup_namespace(prefix) - Look up common namespace URIs (gml, svg, xlink, dc, etc.)
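
The helper functions can be combined to repair a document that uses a prefix without declaring it. The sketch below follows the signatures listed above. The sample XML is invented, and passing the declarations as a MAP(VARCHAR, VARCHAR) of prefix to URI is an assumption about the map argument.

```sql
-- Look up the URI known for a common prefix.
SELECT xml_lookup_namespace('gml');

-- Find prefixes used in the XPath that the document does not declare.
SELECT xml_find_undefined_prefixes(
  '<root><gml:pos>1 2</gml:pos></root>',
  '//gml:pos'
);

-- Inject the missing declaration (map shape assumed: prefix -> URI).
SELECT xml_add_namespace_declarations(
  '<root><gml:pos>1 2</gml:pos></root>',
  MAP {'gml': 'http://www.opengis.net/gml'}
);
```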

Updated functions to support namespace configuration:

  • xml_extract_text
  • xml_extract_elements
  • xml_extract_elements_string
  • xml_extract_attributes

Implicit Type Casting

XML and HTML types now implicitly cast to VARCHAR (cost 1), allowing string functions to work on XML/HTML values while preferring direct XML/HTML function overloads.
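
In practice this means ordinary string functions can be applied to XML/HTML values without an explicit cast. A minimal sketch, assuming the types are registered under the names XML and HTML:

```sql
-- upper() and length() are VARCHAR functions; the implicit cast makes them applicable.
SELECT upper('<a>hello</a>'::XML)  AS shouted,
       length('<p>hi</p>'::HTML)   AS n_chars;
```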

Duck Blocks API Rename

Functions have been created to convert HTML to duck_blocks for integration with duck_block_utils:

  • html_to_duck_blocks
  • duck_blocks_to_html

These can be used to convert to other document types (e.g., markdown) or integrate with pandoc.

HTML Namespace Parameter Removed

The namespace parameter has been removed from html_extract_text. HTML5 parsing doesn't support XML namespace declarations (prefixed elements are treated as literal names with colons).

Bug Fixes

  • Fixed namespace mode handling in XPath functions (#60)
  • Fixed HTML union_by_name bug (#48)

Documentation

  • Added comprehensive documentation for duck_block functions
  • Added namespace mode documentation with 'auto' mode recommendation
  • Added integration examples with duck_block_utils extension
  • Updated quick reference tables

Testing

  • 54 new assertions for duck_block API
  • 37 new assertions for table round-trips
  • Comprehensive namespace mode tests

v1.2.1 - Bug fixes and XPaths

01 Jan 00:53

Release Notes: webbed v1.2.1

Overview

This release brings significant improvements to XPath functionality, namespace handling, HTML/XML feature parity, and comprehensive documentation. It includes several breaking changes to align with PostgreSQL's xpath() semantics.

Breaking Changes

XPath Functions Now Return LIST of All Matches (Issue #53)

All XPath extraction functions now return a LIST of all matching results instead of just the first match. This aligns with PostgreSQL's xpath() behavior.

Function | Old Return Type | New Return Type
xml_extract_text(xml, xpath) | VARCHAR | LIST(VARCHAR)
html_extract_text(html, xpath) | VARCHAR | LIST(VARCHAR)
xml_extract_elements(xml, xpath) | XMLFragment | LIST(XMLFragment)

Migration: Use list indexing [1] to get single values:

-- Before (v1.2.0)
SELECT xml_extract_text(xml, '//title');

-- After (v1.2.1)
SELECT xml_extract_text(xml, '//title')[1];  -- Get first match
SELECT xml_extract_text(xml, '//title');     -- Get all matches as LIST

xml_namespaces() Returns MAP Instead of LIST (Uncommitted)

The xml_namespaces() function now returns MAP(VARCHAR, VARCHAR) instead of LIST(STRUCT(prefix, uri)) for easier namespace lookups and merging.

-- New: Direct key access
SELECT map_extract_value(xml_namespaces(xml), 'gml');

-- New: Easy merging with map_concat
SELECT map_concat(xml_namespaces(xml), xml_common_namespaces());

New Features

XPath Namespace Prefix Auto-Registration (Issue #4)

XPath expressions with namespace prefixes now work when the namespace is declared in the document:

-- Now works! (was broken in v1.2.0)
SELECT xml_extract_text(
  '<root xmlns:gml="http://www.opengis.net/gml"><gml:posList>1 2 3</gml:posList></root>',
  '//gml:posList'
);
-- Returns: ['1 2 3']

Technical Details: libxml2's XPath engine requires explicit namespace registration. We now auto-register all xmlns: declarations from the document into the XPath context using xmlGetNsList() and xmlXPathRegisterNs().

New Namespace Helper Functions

Function | Description
xml_common_namespaces() | Returns MAP of ~25 well-known namespace prefixes (xsd, svg, gml, rdf, dc, soap, etc.)
xml_detect_prefixes(xpath) | Parses XPath expression and returns LIST of namespace prefixes used
xml_mock_namespaces(prefixes) | Creates mock URIs (urn:mock:prefix) for a list of prefixes

-- Get common namespaces
SELECT xml_common_namespaces();
-- Returns: {xml=..., xsd=..., svg=..., gml=..., rdf=..., dc=..., ...}

-- Detect prefixes in XPath
SELECT xml_detect_prefixes('//gml:posList | //svg:path');
-- Returns: ['gml', 'svg']

-- Create mock namespaces for testing
SELECT xml_mock_namespaces(['custom', 'app']);
-- Returns: {custom='urn:mock:custom', app='urn:mock:app'}

HTML/XML Feature Parity (Issue #18)

read_html() now supports the same parameters as read_xml():

  • max_depth - Control nesting depth
  • all_varchar - Force all columns to VARCHAR
  • force_list - Force specific elements to be LIST type
  • union_by_name - Combine multiple files with different schemas
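
A sketch of these parameters on a multi-file read follows. The glob path is hypothetical, and the values are chosen only for illustration.

```sql
-- Hypothetical path; combines files whose inferred schemas differ.
SELECT *
FROM read_html(
  'pages/*.html',
  union_by_name := true,
  all_varchar := true,
  max_depth := 3
);
```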

Cross-Record Schema Inference (Issues #33, #50)

Schema inference now correctly examines ALL records to detect:

  • Repeated elements - Elements that appear multiple times become LIST type
  • Nested attributes - Attributes discovered in later records create proper STRUCT types

-- Elements with attributes now properly typed
-- <phone type="mobile">555-1234</phone> becomes:
-- STRUCT("#text" VARCHAR, "type" VARCHAR)

Opaque Type Parameterization (Issue #18)

HTML files now correctly show "HTML" type instead of "XML" for opaque elements in schema inference.

Bug Fixes

  • UTF-8 Encoding (Issue #53): Fixed html_extract_text() to properly handle UTF-8 encoded content
  • Documentation (Issue #54): Fixed README examples to match actual function behavior
  • LIST Extraction: Fixed extraction of LIST values in multi-record schemas
  • Record Serialization: Fixed serialization of record elements in ExtractDataWithSchema
  • HTML Parser Selection: Fixed parser selection for multi-file reads to consistently use HTML parser for HTML files

Documentation

Read the Docs Integration

Comprehensive documentation is now available at Read the Docs (link TBD):

  • Installation Guide - Platform-specific setup instructions
  • Quick Start - Common usage patterns
  • Function Reference - Complete API documentation
  • XPath Guide - XPath expression syntax and examples
  • Namespace Handling - Working with XML namespaces
  • Schema Inference - How automatic type detection works
  • Parameters Reference - All configuration options

Test Suite Improvements

  • Added comprehensive test suites for GitHub Issues #4, #17, #33, #50, #53, #54, #55
  • Added HTML feature parity tests (Priority 1 & 2)
  • Added cross-record attribute discovery tests
  • Total test coverage: 55 test files, 1648+ assertions

GitHub Issues Addressed

Issue | Title | Status
#4 | Namespace extraction and XPath with namespaces | ✅ Fixed
#17 | Large file streaming | ✅ Tests added
#18 | HTML/XML feature parity | ✅ Implemented
#33 | Repeated element data loss | ✅ Fixed
#50 | Cross-record attribute discovery | ✅ Fixed
#51 | Custom HTML elements not recognized | ✅ Fixed
#53 | XPath functions return all matches + UTF-8 fix | ✅ Fixed
#54 | Documentation mismatches | ✅ Fixed
#55 | xml_extract_attributes investigation | 🔄 Tests added

Upgrade Guide

  1. Update XPath queries that expect single values to use [1] indexing
  2. Update xml_namespaces() usage from struct access to MAP functions:
    • Old: xml_namespaces(xml)[1].prefix
    • New: map_keys(xml_namespaces(xml))[1]
  3. Test namespace-aware XPath - queries using declared prefixes should now work without workarounds

Known Issues

See TODO.md for tracked issues:

  • Empty file list should return empty result (not error)
  • ignore_errors doesn't prevent "No files found" error
  • Invalid XPath returns 0 rows instead of error
  • Parameter validation not fully implemented

v1.2.0 - Lots of Fixes for DuckDB v1.4.2

27 Nov 05:24

New Contributors

  • @onnimonni made their first contribution in #12
  • @Copilot made their first contribution in #37

Full Changelog: v1.1.0...v1.2.0

v1.1.1 - Fix schema introspection issues.

12 Oct 19:19
f2e49ff

Full Changelog: v1.1.0...v1.1.1

v1.1.0 - DuckDB 1.4.0 Compatibility & More

05 Oct 00:43

  • Added support for DuckDB v1.4.0+ (#6)
  • Updated README to provide better examples of extracting tables from HTML (#5)
  • Improved namespace handling in xml_to_json (#4, partially)
  • Added parameters to match xmltodict style (with example macro). (#2, partially)

Full Changelog: v1.0.2...v1.1.0

Builds for all platforms

13 Aug 01:40

This fixes some dependency issues to allow building on all platforms. No feature changes.