diff --git a/README.md b/README.md
index 906e807..408d19e 100644
--- a/README.md
+++ b/README.md
@@ -1,12 +1,67 @@
+

+ + DocLang + +

+
 # DocLang
 
-Specification and reference validator for the [DocLang](https://www.doclang.ai/) document format.
+[![PyPI version](https://img.shields.io/pypi/v/doclang)](https://pypi.org/project/doclang/)
+![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13%20%7C%203.14-blue)
+[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)
+[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
+[![Checked with mypy](https://www.mypy-lang.org/static/mypy_badge.svg)](https://mypy-lang.org/)
+[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+[![License Apache 2.0](https://img.shields.io/github/license/doclang-project/doclang)](https://opensource.org/licenses/Apache-2.0)
+
+**[DocLang](https://www.doclang.ai/) is the AI-native markup format for unstructured content**, including documents, images, and more. It maps cleanly to LLM tokens while preserving structure, semantics, layout, and geometry in a single, unambiguous representation.
+
+This repository is the home of the normative specification and the reference validator for DocLang. If you build with LLMs and VLMs on real-world content, this is where the standard lives.
+
+## Specification
+
+The source of the specification is available in [spec.md](https://github.com/doclang-project/doclang/blob/main/spec.md),
+and exports to different formats can be found in the [exports/](https://github.com/doclang-project/doclang/tree/main/exports)
+directory.
+
+## Reference Validator
+
+You can install the validator from PyPI:
+
+```bash
+pip install doclang
+```
+
+You can then validate a DocLang document as follows:
+
+```bash
+doclang validate -n my_document.dclg.xml
+```
+
+For more details, see [doclang/README.md](https://github.com/doclang-project/doclang/blob/main/doclang/README.md).
+
+## Citation
+
+If you use DocLang in academic or technical work, please cite the specification:
+
+```bibtex
+@misc{doclang_2026,
+  title = {DocLang: Universal AI Document Format},
+  author = {{DocLang Project}},
+  year = {2026},
+  version = {main},
+  howpublished = {\url{https://github.com/doclang-project/doclang}},
+}
+```
+
+## Development
+
+To work on this repository (setup, tests, reference generation, releases), see [CONTRIBUTING.md](https://github.com/doclang-project/doclang/blob/main/doclang/CONTRIBUTING.md).
 
-This repository contains:
+## We ❤️ Open Source AI
 
-- **[spec.md](./spec.md)** — normative specification
-- **`doclang/`** — reference validator (XSD + Schematron, `doclang` CLI on PyPI)
+DocLang is developed in the open and supported by the [LF AI & Data Foundation](https://lfaidata.foundation/projects/). Learn more about the project at [doclang-project](https://github.com/doclang-project).
 
-For validator usage (install, CLI, XSD/Schematron), see [doclang/README.md](./doclang/README.md).
+## License
 
-To work on this repository — setup, tests, reference generation, releases — see [CONTRIBUTING.md](./CONTRIBUTING.md).
+DocLang is licensed under the Apache License 2.0. See [LICENSE](https://github.com/doclang-project/doclang/blob/main/LICENSE) for details.
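Note on the `doclang validate` command added to the README: a minimal Python sketch of calling it from a script. The helper names `build_validate_command` and `is_valid` are hypothetical and not part of the `doclang` package. The sketch also assumes the CLI exits with status 0 for a valid document and non-zero otherwise, which the README does not state. The `-n` flag is copied verbatim from the README example without interpreting it.

```python
import subprocess
from pathlib import Path
from typing import Callable


def build_validate_command(document: Path) -> list[str]:
    """Return the argv for the CLI call shown in the README."""
    # `-n` is taken as-is from the README example; its meaning is not
    # documented there, so this sketch passes it through unchanged.
    return ["doclang", "validate", "-n", str(document)]


def is_valid(
    document: Path,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Run the validator and report whether it succeeded.

    Assumption (not confirmed by the README): exit status 0 means the
    document is valid, any other status means it is not.
    """
    result = run(build_validate_command(document), capture_output=True, text=True)
    return result.returncode == 0
```

The `run` parameter exists only so the wrapper can be exercised without the validator installed. In normal use, `is_valid(Path("my_document.dclg.xml"))` would shell out to the real CLI.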
diff --git a/pyproject.toml b/pyproject.toml
index b88f85b..76050a4 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,15 +5,37 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "doclang"
 version = "0.4.0" # DO NOT EDIT MANUALLY, updated automatically
-description = "DocLang specification and reference validator (XSD, Schematron, CLI)"
+description = "DocLang reference validator"
 readme = "README.md"
 requires-python = ">=3.10"
+license = "Apache-2.0"
+license-files = ["LICENSE"]
+authors = [{ name = "DocLang Project" }]
+keywords = ["doclang", "xml", "validation", "xsd", "schematron", "documents", "llm"]
+classifiers = [
+  "Intended Audience :: Developers",
+  "Operating System :: OS Independent",
+  "Programming Language :: Python :: 3",
+  "Programming Language :: Python :: 3.10",
+  "Programming Language :: Python :: 3.11",
+  "Programming Language :: Python :: 3.12",
+  "Programming Language :: Python :: 3.13",
+  "Topic :: Software Development :: Libraries",
+  "Topic :: Text Processing :: Markup :: XML",
+]
 dependencies = [
   "lxml>=6.0.2",
   "saxonche>=12.9.0",
   "typer>=0.15.1",
 ]
 
+[project.urls]
+Homepage = "https://www.doclang.ai/"
+Documentation = "https://github.com/doclang-project/doclang/blob/main/README.md"
+Repository = "https://github.com/doclang-project/doclang"
+Changelog = "https://github.com/doclang-project/doclang/blob/main/CHANGELOG.md"
+Issues = "https://github.com/doclang-project/doclang/issues"
+
 [project.scripts]
 doclang = "doclang.cli:app"
 
diff --git a/resources/logo.png b/resources/logo.png
new file mode 100644
index 0000000..19322b4
Binary files /dev/null and b/resources/logo.png differ
diff --git a/spec.md b/spec.md
index 82f01ee..8dbf9e7 100644
--- a/spec.md
+++ b/spec.md
@@ -260,7 +260,7 @@ In the simplest document example, document elements are in a flat list,
 ```xml
-Research Paper Title
+Research Paper Title
 Abstract This paper presents...
 Introduction
@@ -274,7 +274,7 @@ In case of page-layout information, the coordinates are provided only at the sem
 ```xml
-
+
 Research Paper Title
@@ -768,7 +768,7 @@ Field region with headings and complex layout:
 ```xml
-Personal Information
+Personal Information
 Full Name:
@@ -857,7 +857,7 @@ Field region with mixed content:
-Product Specifications
+Product Specifications
 The following specifications apply to Model XYZ-2000:
@@ -1114,8 +1114,8 @@ Field region with mixed content:
 ```xml
-M31
-REDDITI DI CAPITALE SOGGETTI AD IMPOSIZIONE SOSTITUTIVA
+M31
+REDDITI DI CAPITALE SOGGETTI AD IMPOSIZIONE SOSTITUTIVA
 1 Tipo
@@ -1151,8 +1151,8 @@ Field region with mixed content:
 Opzione tassazione ordinaria
-M32
-PROVENTI DELLE OBBLIGAZIONI NON ASSOGGETTATI A IMPOSTA SOSTITUTIVA
+M32
+PROVENTI DELLE OBBLIGAZIONI NON ASSOGGETTATI A IMPOSTA SOSTITUTIVA
 1 Ammontare reddito
@@ -1163,8 +1163,8 @@ Field region with mixed content:
 Aliquota %
-M33
-PROVENTI DERIVANTI DA DEPOSITI IN GARANZIA
+M33
+PROVENTI DERIVANTI DA DEPOSITI IN GARANZIA
 1 Ammontare reddito
@@ -1223,10 +1223,10 @@ Field region with mixed content:
 ![Form Example](examples/form/form_08.png)
 ```xml
-QUADRO W - Investimenti e...
+QUADRO W - Investimenti e...
 SEZIONE I - DATI RELATIVI...
-W1
+W1
 1 CODICE TITOLO POSSESSO
@@ -1238,7 +1238,7 @@ Field region with mixed content:
 ...
-W2
+W2
 1
@@ -1267,9 +1267,9 @@ Field region with mixed content:
 ```xml
-QUADRO C - Redditi di lavoro...
+QUADRO C - Redditi di lavoro...
 
-SEZIONE I - RE...
+SEZIONE I - RE...
 Casi particolari
@@ -1278,7 +1278,7 @@ Field region with mixed content:
 Codice Stato estero
-C1
+C1
 1 TIPO
@@ -1299,7 +1299,7 @@ Field region with mixed content:
 ALTRI DATI
-C2
+C2
 1 TIPO
@@ -1327,7 +1327,7 @@ Field region with mixed content:
 ALTRI DATI
-C3
+C3
 1 TIPO
@@ -1348,8 +1348,8 @@ Field region with mixed content:
 ALTRI DATI
-C4
-SOMME PER PREMI...
+C4
+SOMME PER PREMI...
 1
@@ -3004,21 +3004,31 @@ This appendix is informative and does not define conformance requirements.
 
 #### Pictures
 
-For the `label.value` of `<picture>` elements, we recommend using the values defined below, or `undefined` if no more specific label is applicable:
+For the `label.value` of `<picture>` elements, we recommend using the values defined below:
 
 | Context | Recommended values |
 | --- | --- |
 | `` | `bar_chart`, `box_plot`, `flow_chart`, `line_chart`, `pie_chart`, `scatter_plot` |
 | else, i.e. `` (default) | `full_page_image`, `page_thumbnail`, `photograph`, `chemistry_structure`, `bar_code`, `icon`, `logo`, `qr_code`, `signature`, `stamp`, `engineering_drawing`, `screenshot_from_computer`, `screenshot_from_manual`, `geographical_map`, `topographical_map`, `calendar`, `crossword_puzzle`, `music` |
 
+Additional special cases:
+- `other`: use when the picture was examined but does not fit any of the recommended values above (e.g. a chart that is not bar/box/flow/line/pie/scatter).
+- `undefined`: use when classification has not been performed (default [`label@value`](#label)).
+
+Note: [`picture@class="undefined"`](#picture) (default picture type) and `label@value="undefined"` (unclassified subclass) are independent.
+
 #### Code
 
-For the `label.value` of `<code>` elements, we recommend using the values defined below, or `undefined` if no more specific label is applicable:
+For the `label.value` of `<code>` elements, we recommend using the values defined below:
 
 | Context | Recommended values |
 | --- | --- |
 | `` | [Linguist](https://github.com/github-linguist/linguist/blob/v9.5.0/lib/linguist/languages.yml) v9.5.0 language keys (e.g. `Python`) |
 
+Additional special cases:
+- `other`: use when the code was examined but does not match any of the recommended values above.
+- `undefined`: use when classification has not been performed (default [`label@value`](#label)).
+
 ### Custom vocabulary naming and namespacing
 
 Content inside [`<custom>`](#custom) is implementation-defined and not governed by this standard.
@@ -3046,7 +3056,7 @@ The token vocabulary trades off size and inference cost:
 | `"/>` | end of self-closing element with attributes |
 | `` | [`text`](#text) start |
 | `` | [`text`](#text) end |
-| `` | level-1 [`heading`](#heading) start |
+| `` | level-1 [`heading`](#heading) start |
 | `` | level-2 [`heading`](#heading) start |
 | `` | level-3 [`heading`](#heading) start |
 | `` | level-4 [`heading`](#heading) start |
@@ -3079,7 +3089,7 @@ The token vocabulary trades off size and inference cost:
 | `` | [`marker`](#marker) end |
 | `` | [`group`](#group) start |
 | `` | [`group`](#group) end |
-| `` | level-1 [`field_heading`](#field_heading) start |
+| `` | level-1 [`field_heading`](#field_heading) start |
 | `` | level-2 [`field_heading`](#field_heading) start |
 | `` | level-3 [`field_heading`](#field_heading) start |
 | `` | level-4 [`field_heading`](#field_heading) start |
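
Note on the `other` / `undefined` distinction introduced in the Pictures and Code appendix hunk: a minimal sketch of how a producer might choose a `label` value. The function `label_value` and the idea of a classifier returning `None` when it was not run are assumptions made for illustration, not part of the specification. The chart labels are the ones listed in the appendix table.

```python
from typing import Optional

# Recommended chart labels, copied from the Pictures table in the appendix.
CHART_LABELS = frozenset(
    {"bar_chart", "box_plot", "flow_chart", "line_chart", "pie_chart", "scatter_plot"}
)


def label_value(prediction: Optional[str], recommended: frozenset[str]) -> str:
    """Map a hypothetical classifier output to a `label` value.

    `prediction` is None when no classification was performed.
    """
    if prediction is None:
        # Classification was never run: keep the default.
        return "undefined"
    if prediction in recommended:
        # The prediction is one of the recommended values: use it as-is.
        return prediction
    # The item was examined, but no recommended value fits.
    return "other"
```

For example, a radar chart that the classifier did examine would be labelled `other`, while a picture that was never classified keeps `undefined`.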