Merged
67 changes: 61 additions & 6 deletions README.md
@@ -1,12 +1,67 @@
<p align="center">
<a href="https://github.com/doclang-project/doclang">
<img loading="lazy" alt="DocLang" src="https://github.com/doclang-project/doclang/raw/main/resources/logo.png" width="30%"/>
</a>
</p>

# DocLang

Specification and reference validator for the [DocLang](https://www.doclang.ai/) document format.
[![PyPI version](https://img.shields.io/pypi/v/doclang)](https://pypi.org/project/doclang/)
![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13%20%7C%203.14-blue)
[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Checked with mypy](https://www.mypy-lang.org/static/mypy_badge.svg)](https://mypy-lang.org/)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License Apache 2.0](https://img.shields.io/github/license/doclang-project/doclang)](https://opensource.org/licenses/Apache-2.0)

**[DocLang](https://www.doclang.ai/) is the AI-native markup format for unstructured content** — including documents, images, and more. It maps cleanly to LLM tokens while preserving structure, semantics, layout, and geometry in a single, unambiguous representation.

This repository is the home of the normative specification and the reference validator for DocLang. If you build with LLMs and VLMs on real-world content, this is where the standard lives.

## Specification

The source of the specification is available in [spec.md](https://github.com/doclang-project/doclang/blob/main/spec.md)
and exports to different formats can be found in the [exports/](https://github.com/doclang-project/doclang/tree/main/exports)
directory.

## Reference Validator

You can install the validator from PyPI:

```bash
pip install doclang
```

You can then validate a DocLang document as follows:

```bash
doclang validate -n my_document.dclg.xml
```
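
If validation has to run from a Python script (for example in CI), one option is to shell out to the CLI call shown above. This is a minimal sketch, not part of the `doclang` package: the helper names `build_validate_command` and `validate_file` are hypothetical, and it assumes that the `doclang` executable is on `PATH` and that a non-zero exit code signals a validation failure, which this README does not state.

```python
import subprocess


def build_validate_command(path: str) -> list[str]:
    """Return the argv for the CLI call shown above (flags copied verbatim)."""
    return ["doclang", "validate", "-n", str(path)]


def validate_file(path: str) -> bool:
    """Run the validator and report success as a boolean.

    Assumption: the CLI exits with code 0 on success and non-zero otherwise.
    """
    result = subprocess.run(
        build_validate_command(path),
        capture_output=True,  # keep validator output out of the caller's stdout
        text=True,
        check=False,          # inspect the return code instead of raising
    )
    return result.returncode == 0
```

Keeping the command construction separate from the `subprocess.run` call makes the argv easy to inspect or log without running the validator.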

For more details, see the [doclang/README.md](https://github.com/doclang-project/doclang/blob/main/doclang/README.md).

## Citation

If you use DocLang in academic or technical work, please cite the specification:

```bibtex
@misc{doclang_2026,
title = {DocLang: Universal AI Document Format},
author = {{DocLang Project}},
year = {2026},
version = {main},
howpublished = {\url{https://github.com/doclang-project/doclang}},
}
```

## Development

To work on this repository — setup, tests, reference generation, releases — see [CONTRIBUTING.md](https://github.com/doclang-project/doclang/blob/main/doclang/CONTRIBUTING.md).

This repository contains:
## We ❤️ Open Source AI

- **[spec.md](./spec.md)** — normative specification
- **`doclang/`** — reference validator (XSD + Schematron, `doclang` CLI on PyPI)
DocLang is developed in the open and supported by the [LF AI & Data Foundation](https://lfaidata.foundation/projects/). Learn more about the project at [doclang-project](https://github.com/doclang-project).

For validator usage (install, CLI, XSD/Schematron), see [doclang/README.md](./doclang/README.md).
## License

To work on this repository — setup, tests, reference generation, releases — see [CONTRIBUTING.md](./CONTRIBUTING.md).
DocLang is licensed under the Apache License 2.0. See [LICENSE](https://github.com/doclang-project/doclang/blob/main/LICENSE) for details.
24 changes: 23 additions & 1 deletion pyproject.toml
@@ -5,15 +5,37 @@ build-backend = "setuptools.build_meta"
[project]
name = "doclang"
version = "0.4.0" # DO NOT EDIT MANUALLY, updated automatically
description = "DocLang specification and reference validator (XSD, Schematron, CLI)"
description = "DocLang reference validator"
readme = "README.md"
requires-python = ">=3.10"
license = "Apache-2.0"
license-files = ["LICENSE"]
authors = [{ name = "DocLang Project" }]
keywords = ["doclang", "xml", "validation", "xsd", "schematron", "documents", "llm"]
classifiers = [
"Intended Audience :: Developers",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Topic :: Software Development :: Libraries",
"Topic :: Text Processing :: Markup :: XML",
]
dependencies = [
"lxml>=6.0.2",
"saxonche>=12.9.0",
"typer>=0.15.1",
]

[project.urls]
Homepage = "https://www.doclang.ai/"
Documentation = "https://github.com/doclang-project/doclang/blob/main/README.md"
Repository = "https://github.com/doclang-project/doclang"
Changelog = "https://github.com/doclang-project/doclang/blob/main/CHANGELOG.md"
Issues = "https://github.com/doclang-project/doclang/issues"

[project.scripts]
doclang = "doclang.cli:app"

Binary file added resources/logo.png
58 changes: 34 additions & 24 deletions spec.md
@@ -260,7 +260,7 @@ In the simplest document example, document elements are in a flat list,

```xml
<doclang>
<heading level="1">Research Paper Title</heading>
<heading>Research Paper Title</heading>
<heading level="2">Abstract</heading>
<text>This paper presents...</text>
<heading level="2">Introduction</heading>
@@ -274,7 +274,7 @@ In case of page-layout information, the coordinates are provided only at the sem

```xml
<doclang>
<heading level="1">
<heading>
<location value="10"/><location value="20"/><location value="30"/><location value="40"/>
Research Paper Title
</heading>
@@ -768,7 +768,7 @@ Field region with headings and complex layout:

```xml
<field_region>
<field_heading level="1">Personal Information</field_heading>
<field_heading>Personal Information</field_heading>

<field_item>
<text><key>Full Name:</key></text>
@@ -857,7 +857,7 @@ Field region with mixed content:
<location value="50"/><location value="100"/>
<location value="500"/><location value="400"/>

<field_heading level="1">Product Specifications</field_heading>
<field_heading>Product Specifications</field_heading>

<text>The following specifications apply to Model XYZ-2000:</text>

@@ -1114,8 +1114,8 @@ Field region with mixed content:

```xml
<field_region>
<form_heading level="1">M31</field_heading>
<form_heading level="2">REDDITI DI CAPITALE SOGGETTI AD IMPOSIZIONE SOSTITUTIVA</field_heading>
<field_heading>M31</field_heading>
<field_heading level="2">REDDITI DI CAPITALE SOGGETTI AD IMPOSIZIONE SOSTITUTIVA</field_heading>
<field_item>
<marker>1</marker>
<key>Tipo</key>
@@ -1151,8 +1151,8 @@ Field region with mixed content:
<key>Opzione tassazione ordinaria</key>
<value></value>
</field_item>
<form_heading level="1">M32</field_heading>
<form_heading level="2">PROVENTI DELLE OBBLIGAZIONI NON ASSOGGETTATI A IMPOSTA SOSTITUTIVA</field_heading>
<field_heading>M32</field_heading>
<field_heading level="2">PROVENTI DELLE OBBLIGAZIONI NON ASSOGGETTATI A IMPOSTA SOSTITUTIVA</field_heading>
<field_item>
<marker>1</marker>
<key>Ammontare reddito</key>
@@ -1163,8 +1163,8 @@ Field region with mixed content:
<key>Aliquota %</key>
<value></value>
</field_item>
<form_heading level="1">M33</field_heading>
<form_heading level="2">PROVENTI DERIVANTI DA DEPOSITI IN GARANZIA</field_heading>
<field_heading>M33</field_heading>
<field_heading level="2">PROVENTI DERIVANTI DA DEPOSITI IN GARANZIA</field_heading>
<field_item>
<marker>1</marker>
<key>Ammontare reddito</key>
@@ -1223,10 +1223,10 @@ Field region with mixed content:
![Form Example](examples/form/form_08.png)

```xml
<heading level="1">QUADRO W - Investimenti e...</heading>
<heading>QUADRO W - Investimenti e...</heading>
<heading level="2">SEZIONE I - DATI RELATIVI...</heading>
<field_region>
<form_heading level="1">W1</field_heading>
<field_heading>W1</field_heading>
<field_item>
<marker>1</marker>
<key>CODICE TITOLO POSSESSO</key>
@@ -1238,7 +1238,7 @@ Field region with mixed content:
<value></value>
</field_item>
...
<form_heading level="1">W2</field_heading>
<field_heading>W2</field_heading>
<field_item>
<marker>1</marker>
<value></value>
@@ -1267,9 +1267,9 @@ Field region with mixed content:
<table><tr><td>

```xml
<heading level="1">QUADRO C - Redditi di lavoro...</heading>
<heading>QUADRO C - Redditi di lavoro...</heading>
<field_region>
<form_heading level="1">SEZIONE I - RE...</field_heading>
<field_heading>SEZIONE I - RE...</field_heading>
<field_item>
<key>Casi particolari</key>
<checkbox class="unselected"/>
@@ -1278,7 +1278,7 @@ Field region with mixed content:
<key>Codice Stato estero</key>
<value></value>
</field_item>
<form_heading level="2">C1</field_heading>
<field_heading level="2">C1</field_heading>
<field_item>
<marker>1</marker>
<key>TIPO</key>
@@ -1299,7 +1299,7 @@ Field region with mixed content:
<key>ALTRI DATI</key>
<checkbox class="unselected"/>
</field_item>
<form_heading level="2">C2</field_heading>
<field_heading level="2">C2</field_heading>
<field_item>
<marker>1</marker>
<key>TIPO</key>
@@ -1327,7 +1327,7 @@ Field region with mixed content:
<key>ALTRI DATI</key>
<checkbox class="unselected"/>
</field_item>
<form_heading level="2">C3</field_heading>
<field_heading level="2">C3</field_heading>
<field_item>
<marker>1</marker>
<key>TIPO</key>
@@ -1348,8 +1348,8 @@ Field region with mixed content:
<key>ALTRI DATI</key>
<checkbox class="unselected"/>
</field_item>
<form_heading level="2">C4</field_heading>
<form_heading level="3">SOMME PER PREMI...
<field_heading level="2">C4</field_heading>
<field_heading level="3">SOMME PER PREMI...
</field_heading>
<field_item>
<marker>1</marker>
@@ -3004,21 +3004,31 @@ This appendix is informative and does not define conformance requirements.

#### Pictures

For the `label.value` of `<picture>` elements, we recommend using the values defined below, or `undefined` if no more specific label is applicable:
For the `label.value` of `<picture>` elements, we recommend using the values defined below:

| Context | Recommended values |
| --- | --- |
| `<picture class="chart">` | `bar_chart`, `box_plot`, `flow_chart`, `line_chart`, `pie_chart`, `scatter_plot` |
| else, i.e. `<picture class="undefined">` (default) | `full_page_image`, `page_thumbnail`, `photograph`, `chemistry_structure`, `bar_code`, `icon`, `logo`, `qr_code`, `signature`, `stamp`, `engineering_drawing`, `screenshot_from_computer`, `screenshot_from_manual`, `geographical_map`, `topographical_map`, `calendar`, `crossword_puzzle`, `music` |

Additional special cases:
- `other`: use when the picture was examined but does not fit any of the recommended values above (e.g. a chart that is not bar/box/flow/line/pie/scatter).
- `undefined`: use when classification has not been performed (default [`label@value`](#label)).

Note: [`picture@class="undefined"`](#picture) (default picture type) and `label@value="undefined"` (unclassified subclass) are independent.

#### Code

For the `label.value` of `<code>` elements, we recommend using the values defined below, or `undefined` if no more specific label is applicable:
For the `label.value` of `<code>` elements, we recommend using the values defined below:

| Context | Recommended values |
| --- | --- |
| `<code>` | [Linguist](https://github.com/github-linguist/linguist/blob/v9.5.0/lib/linguist/languages.yml) v9.5.0 language keys (e.g. `Python`) |

Additional special cases:
- `other`: use when the code was examined but does not match any of the recommended values above.
- `undefined`: use when classification has not been performed (default [`label@value`](#label)).
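
The `other` / `undefined` rule is the same for pictures and code, so it can be expressed once. The sketch below is illustrative only and not part of the specification or the validator: the function names `choose_label` and `picture_label` are hypothetical, and a classifier result of `None` is assumed to mean "classification was not performed".

```python
from typing import Optional

# Recommended <picture> label values, copied from the Pictures table above.
CHART_LABELS = frozenset({
    "bar_chart", "box_plot", "flow_chart",
    "line_chart", "pie_chart", "scatter_plot",
})
DEFAULT_PICTURE_LABELS = frozenset({
    "full_page_image", "page_thumbnail", "photograph", "chemistry_structure",
    "bar_code", "icon", "logo", "qr_code", "signature", "stamp",
    "engineering_drawing", "screenshot_from_computer", "screenshot_from_manual",
    "geographical_map", "topographical_map", "calendar", "crossword_puzzle",
    "music",
})


def choose_label(predicted: Optional[str], recommended: frozenset) -> str:
    """Map a classifier result onto a label value.

    predicted is None -> classification was not performed -> "undefined"
    predicted in set  -> use the recommended value as-is
    anything else     -> examined, but no recommended value fits -> "other"
    """
    if predicted is None:
        return "undefined"
    return predicted if predicted in recommended else "other"


def picture_label(picture_class: str, predicted: Optional[str]) -> str:
    """Pick the recommended set from the picture class, then apply the rule."""
    recommended = CHART_LABELS if picture_class == "chart" else DEFAULT_PICTURE_LABELS
    return choose_label(predicted, recommended)
```

Because the picture class only selects which set of values is consulted, `picture_label("undefined", None)` and `picture_label("chart", None)` both yield `undefined`, which mirrors the independence of `picture@class` and `label@value`. For `<code>`, `choose_label` could be called with the set of Linguist language keys, which is not reproduced here.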

### Custom vocabulary naming and namespacing

Content inside [`<custom>`](#custom) is implementation-defined and not governed by this standard.
@@ -3046,7 +3056,7 @@ The token vocabulary trades off size and inference cost:
| `"/>` | end of self-closing element with attributes |
| `<text>` | [`text`](#text) start |
| `</text>` | [`text`](#text) end |
| `<heading level="1">` | level-1 [`heading`](#heading) start |
| `<heading>` | level-1 [`heading`](#heading) start |
| `<heading level="2">` | level-2 [`heading`](#heading) start |
| `<heading level="3">` | level-3 [`heading`](#heading) start |
| `<heading level="4">` | level-4 [`heading`](#heading) start |
@@ -3079,7 +3089,7 @@ The token vocabulary trades off size and inference cost:
| `</marker>` | [`marker`](#marker) end |
| `<group>` | [`group`](#group) start |
| `</group>` | [`group`](#group) end |
| `<field_heading level="1">` | level-1 [`field_heading`](#field_heading) start |
| `<field_heading>` | level-1 [`field_heading`](#field_heading) start |
| `<field_heading level="2">` | level-2 [`field_heading`](#field_heading) start |
| `<field_heading level="3">` | level-3 [`field_heading`](#field_heading) start |
| `<field_heading level="4">` | level-4 [`field_heading`](#field_heading) start |
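
The vocabulary rows above list a bare `<heading>` and `<field_heading>` as the level-1 start tokens, so a consumer has to treat a missing `level` attribute as level 1. The following is a minimal sketch of that reading using only the Python standard library. The function name `heading_outline` is hypothetical, and the sketch assumes the document uses no XML namespace.

```python
import xml.etree.ElementTree as ET

# Element names whose `level` attribute is read; both appear in the rows above.
HEADING_TAGS = {"heading", "field_heading"}


def heading_outline(xml_text: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs in document order.

    A heading without a `level` attribute is treated as level 1.
    """
    root = ET.fromstring(xml_text)
    outline = []
    for element in root.iter():
        if element.tag in HEADING_TAGS:
            level = int(element.get("level", "1"))
            # itertext() also collects text nested in child elements.
            title = "".join(element.itertext()).strip()
            outline.append((level, title))
    return outline


sample = """<doclang>
  <heading>Research Paper Title</heading>
  <heading level="2">Abstract</heading>
  <text>This paper presents...</text>
  <heading level="2">Introduction</heading>
</doclang>"""

print(heading_outline(sample))
# [(1, 'Research Paper Title'), (2, 'Abstract'), (2, 'Introduction')]
```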