The structured data materialization pipeline was refactored to be generic and mapping-preserving.
The refactor removes the following:

- Synthetic remapping of authored mappings into internal `ex:*` structures.
- Implicit coercion/defaulting toward `Review`/`Thing` in the generic materialization path.
- Legacy JS transpile path (`yarrrml-parser` -> temporary `RML.ttl`) in the materialization pipeline.
- Review-specific postprocessing side effects in core pipeline execution:
  - `_dedupe_review_notes`
  - `_ensure_review_url`
  - review author/rating injection/pruning hooks

`MaterializationPipeline` no longer accepts `target_type` in its generic materialization flow methods:

- `normalize(self, yarrrml: str, url: str, xhtml_path: Path, response: object | None = None) -> tuple[str, list[dict]]`
- `postprocess(self, jsonld_raw: dict, mappings: list[dict], cleaned_xhtml: str, dataset_uri: str, url: str) -> dict`
- `run(self, yarrrml: str, url: str, cleaned_xhtml: str, dataset_uri: str, xhtml_path: Path, workdir: Path, response: object | None = None, strict_url_token: bool = False) -> tuple[dict, list[dict]]`

`postprocess_jsonld(...)` in the generic YARRRML pipeline also no longer accepts `target_type`.
Mappings can use runtime tokens, which are resolved before materialization:

- `__XHTML__`: replaced with the local XHTML source path.
- `__URL__`: resolved from `response.web_page.url` first, then from the explicit `url` argument.
- `__ID__`: resolved from `response.id`.

URL token resolution policy:

- strict mode (`strict_url_token=True`): fail if unresolved
- default mode: warn and keep `__URL__` unchanged

ID token resolution policy:

- fail-closed if `__ID__` appears and no runtime ID is available

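
The token rules above can be sketched as a small substitution function. This is an illustrative sketch, not the SDK's implementation: `resolve_tokens` and its parameters `response_url` and `response_id` are hypothetical stand-ins for `response.web_page.url` and `response.id`, and the choice of `ValueError` for failures is an assumption.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_tokens(
    yarrrml: str,
    xhtml_path: str,
    url: Optional[str] = None,
    response_url: Optional[str] = None,
    response_id: Optional[str] = None,
    strict_url_token: bool = False,
) -> str:
    """Hypothetical helper: substitute runtime tokens in a YARRRML string."""
    # __XHTML__ always maps to the local XHTML source path.
    resolved = yarrrml.replace("__XHTML__", xhtml_path)

    # __URL__: prefer the URL carried by the response, then the explicit argument.
    if "__URL__" in resolved:
        effective_url = response_url or url
        if effective_url:
            resolved = resolved.replace("__URL__", effective_url)
        elif strict_url_token:
            # Strict mode: an unresolved URL token is an error.
            raise ValueError("__URL__ is present but no URL could be resolved")
        else:
            # Default mode: warn and leave the token in place.
            logger.warning("__URL__ is unresolved; leaving the token unchanged")

    # __ID__: fail closed when the token appears without a runtime ID.
    if "__ID__" in resolved:
        if not response_id:
            raise ValueError("__ID__ is present but no runtime ID is available")
        resolved = resolved.replace("__ID__", response_id)

    return resolved
```

Note that a mapping without `__ID__` passes through even when no runtime ID exists; the failure only triggers when the token is actually used.
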
The SDK now passes YARRRML directly to morph-kgc.
What to update from the legacy transpile flow:

- Remove operational dependencies on `yarrrml-parser` in your runtime.
- Validate mappings against `morph-kgc` native YARRRML behavior.
- Update XPath/function expressions that relied on parser-specific transpile behavior.

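
To check a mapping against `morph-kgc` outside the SDK, a minimal sketch could look like the following. It assumes a `morph-kgc` version with native YARRRML support and uses the library's `materialize` entry point with an INI-style config string; the helper names `build_morph_kgc_config` and `materialize_yarrrml` are hypothetical, and how the SDK itself invokes `morph-kgc` is not shown here.

```python
from pathlib import Path


def build_morph_kgc_config(mapping_path: Path) -> str:
    """Return a minimal morph-kgc INI config pointing at a YARRRML mapping file."""
    return f"[DataSource1]\nmappings: {mapping_path}\n"


def materialize_yarrrml(mapping_path: Path):
    """Materialize a YARRRML mapping with morph-kgc, with no transpile step."""
    # Imported lazily so the config helper stays usable without morph-kgc installed.
    import morph_kgc

    return morph_kgc.materialize(build_morph_kgc_config(mapping_path))
```

Running your existing mappings through a check like this is one way to surface expressions that previously depended on `yarrrml-parser` transpile behavior.
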
If your integration relied on previous review-specific behavior, move that behavior to an adapter outside SDK core:
- Materialize using the SDK generic pipeline.
- Apply your review-specific transformations in your own postprocessing layer.
- Validate transformed output against the shapes you require.
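
The three steps above might be wired together as in the sketch below. Everything here is hypothetical: the SDK call is injected as a plain callable rather than imported, the JSON-LD is assumed to carry an `@graph` list whose nodes have a string `@type`, and `ensure_review_url` is only an example of a review-specific transform you might own.

```python
from typing import Any, Callable

JsonLd = dict[str, Any]


def ensure_review_url(jsonld: JsonLd, url: str) -> JsonLd:
    """Hypothetical transform: add a url to Review nodes that lack one."""
    # Copy nodes so the SDK output is not mutated in place.
    nodes = [dict(node) for node in jsonld.get("@graph", [])]
    for node in nodes:
        if node.get("@type") == "Review" and "url" not in node:
            node["url"] = url
    return {**jsonld, "@graph": nodes}


def validate_reviews(jsonld: JsonLd) -> None:
    """Hypothetical validation: every Review node must carry a url."""
    for node in jsonld.get("@graph", []):
        if node.get("@type") == "Review" and "url" not in node:
            raise ValueError(f"Review node without url: {node.get('@id')}")


def run_with_review_adapter(materialize: Callable[[], JsonLd], url: str) -> JsonLd:
    generic = materialize()                    # 1. generic SDK materialization
    adapted = ensure_review_url(generic, url)  # 2. project-owned transformation
    validate_reviews(adapted)                  # 3. project-owned validation
    return adapted
```

Injecting the materialization step as a callable keeps the adapter testable without the SDK and makes the boundary between generic and review-specific logic explicit.
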
This keeps the SDK customer-agnostic while preserving custom behavior in your project.
The SDK now provides a formal 2-axis ingestion model:

- Source axis (`INGEST_SOURCE`): `auto|urls|sitemap|sheets|local`
- Loader axis (`INGEST_LOADER`): `auto|simple|proxy|playwright|premium_scraper|web_scrape_api|passthrough`

- Global default loader is now `web_scrape_api`.
- If `INGEST_LOADER` and `WEB_PAGE_IMPORT_MODE` are both unset, the loader resolves to `web_scrape_api`.


| Legacy `WEB_PAGE_IMPORT_MODE` | New behavior |
|---|---|
| `default` | `web_scrape_api` |
| `proxy` | `proxy` |
| `premium_scraper` | `premium_scraper` |

- `INGEST_*` keys win over legacy keys.
- If `INGEST_*` is unset, the SDK resolves using legacy keys.
- If both are set and disagree, the SDK uses `INGEST_*` and emits the machine-parseable warning `INGEST_CFG_CONFLICT`.

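
The precedence rules for the loader axis can be illustrated with a short sketch. `resolve_loader` is a hypothetical function, not an SDK API. It makes three assumptions: "disagree" means the legacy value maps to a different loader than `INGEST_LOADER`, unknown legacy values fall back to the default, and the warning is a string starting with the `INGEST_CFG_CONFLICT` code. It does not model how `auto` is expanded.

```python
from typing import Mapping

# Legacy WEB_PAGE_IMPORT_MODE values and the loaders they resolve to.
LEGACY_MODE_TO_LOADER = {
    "default": "web_scrape_api",
    "proxy": "proxy",
    "premium_scraper": "premium_scraper",
}
DEFAULT_LOADER = "web_scrape_api"


def resolve_loader(env: Mapping[str, str]) -> tuple[str, list[str]]:
    """Return (loader, warnings) following the documented precedence."""
    ingest = env.get("INGEST_LOADER") or None
    legacy = env.get("WEB_PAGE_IMPORT_MODE") or None
    warnings: list[str] = []

    # Neither key set: global default.
    if ingest is None and legacy is None:
        return DEFAULT_LOADER, warnings

    # Only the legacy key set: translate it (unknown values fall back, by assumption).
    if ingest is None:
        return LEGACY_MODE_TO_LOADER.get(legacy, DEFAULT_LOADER), warnings

    # Both set and disagreeing: INGEST_* wins and a conflict warning is recorded.
    if legacy is not None and LEGACY_MODE_TO_LOADER.get(legacy) != ingest:
        warnings.append(
            f"INGEST_CFG_CONFLICT INGEST_LOADER={ingest} WEB_PAGE_IMPORT_MODE={legacy}"
        )
    return ingest, warnings
```

Passing `os.environ` as `env` applies the same logic to the process environment.
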
If an item includes embedded HTML and `INGEST_PASSTHROUGH_WHEN_HTML=true`, the SDK uses `passthrough` before network loaders.
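
As a rough illustration of the passthrough rule, the sketch below picks a loader per item. The function name `choose_loader` and the `html` item key are hypothetical; the SDK's actual item shape is not described here.

```python
from typing import Mapping


def choose_loader(
    item: Mapping[str, object],
    resolved_loader: str,
    passthrough_when_html: bool,
) -> str:
    """Prefer passthrough for items that already carry HTML (hypothetical 'html' key)."""
    html = item.get("html")
    if passthrough_when_html and isinstance(html, str) and html.strip():
        # Embedded HTML is available, so no network loader is needed.
        return "passthrough"
    return resolved_loader
```
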