Merged
8 changes: 6 additions & 2 deletions .readthedocs.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,9 @@ version: 2

# Set the OS, Python version and other tools you might need
build:
os: ubuntu-22.04
os: ubuntu-24.04
tools:
python: "3.11"
python: "3.13"
# You can also specify other tool versions:
# nodejs: "20"
# rust: "1.70"
Expand All @@ -33,3 +33,7 @@ sphinx:
python:
install:
- requirements: docs/requirements.txt
# install the checked-out source so autodoc and the version reflect this
# branch/tag rather than the released package from PyPI
- method: pip
path: .
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,11 @@
## Changelog

## 1.10.0
- maintenance: modernize typing, packaging and code
- evaluation: review and correct benchmark ground-truth labels, update and speed up alternatives
- performance: stable day-granular cache key and reduced copying
- fixes: preserve tails in element cleaning

## 1.9.4
- maintenance: remove LXML version constraint (#184)

23 changes: 12 additions & 11 deletions README.md
Expand Up @@ -54,7 +54,7 @@ $ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
YMD](https://en.wikipedia.org/wiki/ISO_8601)).
- Detection of both original and updated dates.
- Multilingual.
- Compatible with all recent versions of Python.
- Compatible with Python 3.10 and later.

### How it works

Expand All @@ -77,31 +77,32 @@ Finally, the output is validated and converted to the chosen format.

## Performance

1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10)
1000 web pages containing identifiable dates (as of 2026-06-01 on Python 3.13)

| Python Package | Precision | Recall | Accuracy | F-Score | Time |
| -------------- | --------- | ------ | -------- | ------- | ---- |
| articleDateExtractor 0.20 | 0.803 | 0.734 | 0.622 | 0.767 | 5x |
| date_guesser 2.1.4 | 0.781 | 0.600 | 0.514 | 0.679 | 18x |
| goose3 3.1.17 | 0.869 | 0.532 | 0.493 | 0.660 | 15x |
| htmldate\[all\] 1.6.0 (fast) | **0.883** | 0.924 | 0.823 | 0.903 | **1x** |
| htmldate\[all\] 1.6.0 (extensive) | 0.870 | **0.993** | **0.865** | **0.928** | 1.7x |
| newspaper3k 0.2.8 | 0.769 | 0.667 | 0.556 | 0.715 | 15x |
| news-please 1.5.35 | 0.801 | 0.768 | 0.645 | 0.784 | 34x |
| articleDateExtractor 0.20 | 0.846 | 0.745 | 0.656 | 0.792 | 3x |
| date_guesser 2.1.4 | 0.832 | 0.611 | 0.544 | 0.705 | 11x |
| goose3 3.1.21 | **0.930** | 0.568 | 0.545 | 0.706 | 14x |
| htmldate\[all\] 1.10.0 (fast) | 0.924 | 0.927 | 0.861 | 0.925 | **1x** |
| htmldate\[all\] 1.10.0 (extensive) | 0.908 | **0.993** | **0.903** | **0.949** | 1.8x |
| newspaper4k 0.9.5 | 0.912 | 0.728 | 0.680 | 0.810 | 2.5x |
| news-please 1.6.16 | 0.845 | 0.777 | 0.680 | 0.810 | 29x |

For the complete results and explanations see [evaluation
page](https://htmldate.readthedocs.io/en/latest/evaluation.html).

## Installation

Htmldate is tested on Linux, macOS and Windows systems; it is compatible
with Python 3.8 upwards. It can notably be installed with `pip` (`pip3`
with Python 3.10 upwards. It can notably be installed with `pip` (`pip3`
where applicable) from the PyPI package repository:

- `pip install htmldate`
- (optionally) `pip install htmldate[speed]`

The last version to support Python 3.6 and 3.7 is `htmldate==1.8.1`.
The last version to support Python 3.6 and 3.7 is `htmldate==1.8.1`; for
Python 3.8 and 3.9 use the `1.9.x` series.

## Documentation

2 changes: 1 addition & 1 deletion docs/conf.py
Expand Up @@ -21,7 +21,7 @@
# -- Project information -----------------------------------------------------

project = 'htmldate'
copyright = '2023, <a href="https://adrien.barbaresi.eu/">Adrien Barbaresi</a>'
copyright = '2017-2026, <a href="https://adrien.barbaresi.eu/">Adrien Barbaresi</a>'
author = 'Adrien Barbaresi'

# -- General configuration ---------------------------------------------------
23 changes: 21 additions & 2 deletions docs/evaluation.rst
Expand Up @@ -18,7 +18,7 @@ There are comparable software solutions in Python, the following date extraction
- `date_guesser <https://github.com/mitmedialab/date_guesser>`_ extracts publication dates from web pages along with an accuracy measure (not used here),
- `goose3 <https://github.com/goose3/goose3>`_ can extract information for embedded content,
- `htmldate <https://github.com/adbar/htmldate>`_ is the software package described here, it is designed to extract original and updated publication dates of web pages,
- `newspaper <https://github.com/codelucas/newspaper>`_ is mostly geared towards newspaper texts,
- `newspaper4k <https://github.com/AndyTheFactory/newspaper4k>`_ (the maintained successor of newspaper3k) is mostly geared towards newspaper texts,
- `news-please <https://github.com/fhamborg/news-please>`_ is a news crawler that extracts structured information.

Two alternative packages are not tested here but could be used in addition:
Expand All @@ -36,7 +36,7 @@ Description

**Time**: the execution time cannot be easily compared in all cases as some solutions perform a whole series of operations which are irrelevant to this task.

**Errors:** *goose3*'s output isn't always meaningful and/or in a standardized format, these cases were discarded. *news-please* seems to have trouble with some encodings (e.g. in Chinese), in which case it leads to an exception.
**Errors:** *goose3*'s output isn't always meaningful and/or in a standardized format, these cases were discarded.


Results
Expand All @@ -45,6 +45,23 @@ Results
The results below show that **date extraction is not a completely solved task** but one for which extractors have to resort to heuristics and guesses. The figures documenting recall and accuracy capture the real-world performance of the tools as the absence of a date output impacts the result.


================================ ========= ========= ========= ========= =======
1000 web pages containing identifiable dates (as of 2026-06-01 on Python 3.13)
--------------------------------------------------------------------------------
Python Package Precision Recall Accuracy F-Score Time
================================ ========= ========= ========= ========= =======
articleDateExtractor 0.20 0.846 0.745 0.656 0.792 3x
date_guesser 2.1.4 0.832 0.611 0.544 0.705 11x
goose3 3.1.21 **0.930** 0.568 0.545 0.706 14x
htmldate[all] 1.10.0 (fast) 0.924 0.927 0.861 0.925 **1x**
htmldate[all] 1.10.0 (extensive) 0.908 **0.993** **0.903** **0.949** 1.8x
newspaper4k 0.9.5 0.912 0.728 0.680 0.810 2.5x
news-please 1.6.16 0.845 0.777 0.680 0.810 29x
================================ ========= ========= ========= ========= =======

This run uses a reviewed version of the ground-truth labels (publication-date corrections) and the maintained *newspaper4k* fork in place of the now-unmaintained *newspaper3k*.


=============================== ========= ========= ========= ========= =======
1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10)
-------------------------------------------------------------------------------
Expand All @@ -62,6 +79,8 @@ news-please 1.5.35 0.801 0.768 0.645 0.784 34x

Additional data for new pages in English collected by the `Data Culture Group <https://dataculturegroup.org>`_ at Northeastern University.

The discussion below refers to the most recent run (top table), measured against a reviewed version of the publication-date labels.

Precision describes if the dates given as output are correct: *goose3* fares well precision-wise but it fails to extract dates in a large majority of cases (poor recall). The difference in accuracy between *date_guesser* and *newspaper* is consistent with tests described on the `website of the former <https://github.com/mitmedialab/date_guesser>`_.

It turns out that *htmldate* performs better than the other solutions overall. It is also noticeably faster than the strictly comparable packages (*articleDateExtractor* and most certainly *date_guesser*). Despite being measured on a sample, **the higher accuracy and faster processing time are highly significant**. Especially for smaller news outlets, websites and blogs, as well as pages written in languages other than English (in this case mostly but not exclusively German), *htmldate* greatly extends date extraction coverage without sacrificing precision.
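
As a quick consistency check on the most recent table, the F-Score column can be recomputed from the precision and recall columns. The sketch below assumes the usual definition (harmonic mean of precision and recall), which this excerpt does not state explicitly. The helper name `f_score` is introduced here for illustration only.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F-Score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the most recent results table.
rows = {
    "htmldate[all] 1.10.0 (fast)": (0.924, 0.927),
    "htmldate[all] 1.10.0 (extensive)": (0.908, 0.993),
    "newspaper4k 0.9.5": (0.912, 0.728),
}

# Recomputed F-Scores, rounded to three decimals like the table.
computed = {name: round(f_score(p, r), 3) for name, (p, r) in rows.items()}

for name, value in computed.items():
    print(f"{name}: {value}")
```

Because the published precision and recall values are themselves rounded, a recomputed score may differ from the table in the last digit for some rows.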
26 changes: 8 additions & 18 deletions docs/index.rst
Expand Up @@ -80,7 +80,7 @@ Features
- URLs, HTML files, or HTML trees are given as input (includes batch processing)
- Output as string in any date format (defaults to `ISO 8601 YMD <https://en.wikipedia.org/wiki/ISO_8601>`_)
- Detection of both original and updated dates
- Compatible with all recent versions of Python
- Compatible with Python 3.10 and later


``htmldate`` can examine markup and text. It provides the following ways to date an HTML document:
Expand All @@ -94,7 +94,7 @@ Features

The output is thoroughly verified in terms of plausibility and adequateness. If a valid date has been found the library outputs a date string corresponding to either the last update or the original publishing statement (the default), in the desired format.

Markup-based extraction is multilingual by nature, text-based refinements for better coverage currently support German, English and Turkish.
Markup-based extraction is multilingual by nature, text-based refinements for better coverage currently support English, French, German, Indonesian and Turkish.


Installation
Expand All @@ -103,16 +103,16 @@ Installation
Main package
~~~~~~~~~~~~

This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.8 upwards. It is available on the package repository `PyPI <https://pypi.org/>`_ and can notably be installed with ``pip`` or ``pipenv``:
This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.10 upwards. It is available on the package repository `PyPI <https://pypi.org/>`_ and can notably be installed with ``pip`` or ``pipenv``:

.. code-block:: bash

$ pip install htmldate # pip3 install on systems where both Python 2 and 3 are installed
$ pip install htmldate
$ pip install --upgrade htmldate # to make sure you have the latest version
$ pip install git+https://github.com/adbar/htmldate.git # latest available code (see build status above)


The last version to support Python 3.6 and 3.7 is ``htmldate==1.8.1``.
The last version to support Python 3.6 and 3.7 is ``htmldate==1.8.1``; for Python 3.8 and 3.9 use the ``1.9.x`` series.


Optional
Expand All @@ -131,16 +131,6 @@ The ``dateparser`` package is noticeably slower in its latest versions, version
*For infos on dependency management of Python packages see* `this discussion thread <https://stackoverflow.com/questions/41573587/what-is-the-difference-between-venv-pyvenv-pyenv-virtualenv-virtualenvwrappe>`_.


Experimental
~~~~~~~~~~~~

Experimental compilation with ``mypyc``, as using pre-compiled library may shorten processing speed:

1. Install ``mypy``: ``pip3 install mypy``
2. Compile the package: ``python setup.py --use-mypyc bdist_wheel``
3. Use the newly created wheel: ``pip3 install dist/...``


With Python
-----------

Expand All @@ -162,7 +152,7 @@ In case the web page features easily readable metadata in the header, the extrac
.. code-block:: python

>>> find_date('https://creativecommons.org/about/')
'2017-08-11' # has been updated since
'2017-08-11' # may change
>>> find_date('https://creativecommons.org/about/', extensive_search=False)
>>>

Expand All @@ -189,7 +179,7 @@ Change the output to a format known to Python's ``datetime`` module, the default
.. code-block:: python

>>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
'18 November 2016' # may have changed since
'18 November 2016' # may change


Original vs. updated dates
Expand All @@ -200,7 +190,7 @@ Although the time delta between original publication and "last modified" info is
.. code-block:: python

>>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True) # modified behavior
'2016-06-23'
'2016-06-23' # may change

For more information see `options page <options.html>`_.

12 changes: 5 additions & 7 deletions docs/options.rst
Expand Up @@ -27,15 +27,15 @@ An external module can be used for download, as described in versions anterior t
>>> import requests
>>> r = requests.get('https://creativecommons.org/about/')
>>> find_date(r.text)
'2017-11-28' # may have changed since
'2017-11-28' # may change
# using htmldate's own fetch_url function
>>> from htmldate.utils import fetch_url
>>> htmldoc = fetch_url('https://blog.wikimedia.org/2018/06/28/interactive-maps-now-in-your-language/')
>>> find_date(htmldoc)
'2018-06-28'
'2018-06-28' # may change
# or simply
>>> find_date('https://blog.wikimedia.org/2018/06/28/interactive-maps-now-in-your-language/') # URL detected
'2018-06-28'
'2018-06-28' # may change


Date format
Expand All @@ -46,7 +46,7 @@ Change the output to a format known to Python's ``datetime`` module, the default
.. code-block:: python

>>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
'18 November 2016' # may have changed since
'18 November 2016' # may change
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html', outputformat='%Y-%m-%dT%H:%M:%S%z')
'2016-12-23T05:11:00-0500'
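
The `outputformat` strings in these examples are ordinary `strftime` directives, so they can be tried offline with the standard library before they are passed to `find_date`. The following sketch uses only `datetime`. The two dates are hard-coded to mirror the documented outputs and are not fetched from the pages.

```python
from datetime import datetime, timedelta, timezone

# Dates hard-coded to mirror the documented outputs (no network access).
licence_date = datetime(2016, 11, 18)
release_date = datetime(
    2016, 12, 23, 5, 11, tzinfo=timezone(timedelta(hours=-5))
)

# Same format strings as in the documentation examples.
long_form = licence_date.strftime("%d %B %Y")
iso_with_offset = release_date.strftime("%Y-%m-%dT%H:%M:%S%z")

print(long_form)
print(iso_with_offset)
```

Note that `%B` depends on the active locale; the month name is in English under Python's default "C" locale.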

Expand All @@ -62,7 +62,7 @@ Although the time delta between the original publication and the "last modified"
.. code-block:: python

>>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/') # default setting
'2019-06-24'
'2019-06-24' # may change
>>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True) # modified behavior
'2016-06-23'

Expand All @@ -77,8 +77,6 @@ See ``settings.py`` file:
:show-inheritance:
:undoc-members:

The module can then be re-compiled locally to apply changes to the settings.


Clearing caches
~~~~~~~~~~~~~~~
5 changes: 2 additions & 3 deletions docs/requirements.txt
@@ -1,4 +1,3 @@
# version required
sphinx>=8.1.3
# without version specifier
htmldate
sphinx>=9.1.0
# htmldate itself is installed from the repo root (see .readthedocs.yaml)
2 changes: 1 addition & 1 deletion htmldate/__init__.py
Expand Up @@ -7,7 +7,7 @@
__author__ = "Adrien Barbaresi"
__license__ = "Apache-2.0"
__copyright__ = "Copyright 2017-present, Adrien Barbaresi"
__version__ = "1.9.4"
__version__ = "1.10.0"


import logging
4 changes: 2 additions & 2 deletions htmldate/cli.py
Expand Up @@ -81,13 +81,13 @@ def process_args(args: argparse.Namespace) -> None:
if args.URL:
htmlstring = fetch_url(args.URL)
if htmlstring is None:
sys.exit(f"No data for URL: {args.URL}" + "\n")
sys.exit(f"No data for URL: {args.URL}\n")
# unicode check
else:
try:
htmlstring = sys.stdin.read()
except UnicodeDecodeError as err:
sys.exit(f"Wrong buffer encoding: {str(err)}" + "\n")
sys.exit(f"Wrong buffer encoding: {err}\n")
result = cli_examine(htmlstring, args)
if result is not None:
sys.stdout.write(result + "\n")
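
The two message changes in this hunk look purely cosmetic. The standalone sketch below (not part of `cli.py`) checks that the old and new f-string forms build the same text and shows what `sys.exit` does with a string argument. The `UnicodeDecodeError` instance is constructed by hand for illustration.

```python
import sys

# Hand-built exception standing in for a failed stdin read.
err = UnicodeDecodeError("utf-8", b"\xff", 0, 1, "invalid start byte")

# Old form: explicit str() call plus string concatenation.
old_message = f"Wrong buffer encoding: {str(err)}" + "\n"
# New form: the f-string already formats the exception via str().
new_message = f"Wrong buffer encoding: {err}\n"

# sys.exit() with a string raises SystemExit carrying that string; if
# uncaught, Python prints it to stderr and exits with status 1.
try:
    sys.exit(new_message)
except SystemExit as exc:
    exit_code = exc.code

print(old_message == new_message)
```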
26 changes: 10 additions & 16 deletions htmldate/core.py
Expand Up @@ -28,8 +28,6 @@
FAST_PREPEND,
SLOW_PREPEND,
FREE_TEXT_EXPRESSIONS,
MAX_SEGMENT_LEN,
MIN_SEGMENT_LEN,
YEAR_PATTERN,
YMD_PATTERN,
COPYRIGHT_PATTERN,
Expand All @@ -54,11 +52,18 @@
THREE_COMP_REGEX_B,
TWO_COMP_REGEX,
)
from .settings import CACHE_SIZE, CLEANING_LIST, MAX_POSSIBLE_CANDIDATES
from .settings import (
CACHE_SIZE,
CLEANING_LIST,
MAX_POSSIBLE_CANDIDATES,
MAX_SEGMENT_LEN,
MIN_SEGMENT_LEN,
)
from .utils import Extractor, clean_html, load_html, trim_text
from .validators import (
check_extracted_reference,
compare_values,
correct_year,
filter_ymd_candidate,
get_min_date,
get_max_date,
Expand Down Expand Up @@ -563,7 +568,7 @@ def normalize_match(match: re.Match[str] | None) -> str:
and optionally expand the year from two to four digits."""
day, month, year = (g.zfill(2) for g in match.groups() if g) # type: ignore[union-attr]
if len(year) == 2:
year = f"19{year}" if year[0] == "9" else f"20{year}"
year = str(correct_year(int(year)))
return f"{year}-{month}-{day}"
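
The body of `correct_year` is not part of this diff. The sketch below therefore uses a hypothetical stand-in that keeps the pivot implied by the removed line (two-digit years starting with 9 become 19xx, all others 20xx). The `DMY` pattern is also an assumption made only to exercise the function.

```python
import re


def correct_year(year: int) -> int:
    """Hypothetical stand-in: expand a two-digit year, pivoting at 90."""
    if year < 100:
        year += 1900 if year >= 90 else 2000
    return year


def normalize_match(match: re.Match[str]) -> str:
    """Zero-pad day and month and expand the year, as in the hunk above."""
    day, month, year = (g.zfill(2) for g in match.groups() if g)
    if len(year) == 2:
        year = str(correct_year(int(year)))
    return f"{year}-{month}-{day}"


# Illustrative day.month.year pattern, not the one used by the package.
DMY = re.compile(r"(\d{1,2})\.(\d{1,2})\.(\d{2,4})")

print(normalize_match(DMY.match("3.7.98")))
print(normalize_match(DMY.match("03.07.16")))
```

Under that assumption the refactoring preserves behaviour for two-digit years while moving the pivot logic into one shared helper.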


Expand Down Expand Up @@ -852,8 +857,6 @@ def find_date(
original_date,
outputformat,
)
# unclear what this line is for and it impedes type checking:
# find_date.extensive_search = extensive_search

# URL
if url is None:
Expand Down Expand Up @@ -891,9 +894,7 @@ def find_date(
# costly deepcopy of the whole document
pruning_tree = deepcopy(tree) if isinstance(htmlobject, HtmlElement) else tree
try:
search_tree, discarded = discard_unwanted(
clean_html(pruning_tree, CLEANING_LIST)
)
search_tree = discard_unwanted(clean_html(pruning_tree, CLEANING_LIST))
# rare LXML error: no NULL bytes or control characters
except ValueError: # pragma: no cover
search_tree = tree
Expand Down Expand Up @@ -923,13 +924,6 @@ def find_date(
if result is not None:
return result

# TODO: decide on this
# search in discarded parts (e.g. archive.org-banner)
# for subtree in discarded:
# dateresult = examine_date_elements(subtree, DATE_EXPRESSIONS, options)
# if dateresult is not None:
# return dateresult

# robust conversion to string
try:
htmlstring = tostring(search_tree, pretty_print=False, encoding="unicode")