diff --git a/.readthedocs.yaml b/.readthedocs.yaml index f6a12ce2..d2386598 100644 --- a/.readthedocs.yaml +++ b/.readthedocs.yaml @@ -6,9 +6,9 @@ version: 2 # Set the OS, Python version and other tools you might need build: - os: ubuntu-22.04 + os: ubuntu-24.04 tools: - python: "3.11" + python: "3.13" # You can also specify other tool versions: # nodejs: "20" # rust: "1.70" @@ -33,3 +33,7 @@ sphinx: python: install: - requirements: docs/requirements.txt + # install the checked-out source so autodoc and the version reflect this + # branch/tag rather than the released package from PyPI + - method: pip + path: . diff --git a/CHANGELOG.md b/CHANGELOG.md index 5c527d90..61662294 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,11 @@ ## Changelog +## 1.10.0 +- maintenance: modernize typing, packaging and code +- evaluation: review and correct benchmark ground-truth labels, update and speed up alternatives +- performance: stable day-granular cache key and reduced copying +- fixes: preserve tails in element cleaning + ## 1.9.4 - maintenance: remove LXML version constraint (#184) diff --git a/README.md b/README.md index 7e9ea895..3ed5a84a 100644 --- a/README.md +++ b/README.md @@ -54,7 +54,7 @@ $ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html YMD](https://en.wikipedia.org/wiki/ISO_8601)). - Detection of both original and updated dates. - Multilingual. -- Compatible with all recent versions of Python. +- Compatible with Python 3.10 and later. ### How it works @@ -77,17 +77,17 @@ Finally, the output is validated and converted to the chosen format. ## Performance -1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10) +1000 web pages containing identifiable dates (as of 2026-06-01 on Python 3.13) | Python Package | Precision | Recall | Accuracy | F-Score | Time | | -------------- | --------- | ------ | -------- | ------- | ---- | -| articleDateExtractor 0.20 | 0.803 | 0.734 | 0.622 | 0.767 | 5x | -| date_guesser 2.1.4 | 0.781 | 0.600 | 0.514 | 0.679 | 18x | -| goose3 3.1.17 | 0.869 | 0.532 | 0.493 | 0.660 | 15x | -| htmldate\[all\] 1.6.0 (fast) | **0.883** | 0.924 | 0.823 | 0.903 | **1x** | -| htmldate\[all\] 1.6.0 (extensive) | 0.870 | **0.993** | **0.865** | **0.928** | 1.7x | -| newspaper3k 0.2.8 | 0.769 | 0.667 | 0.556 | 0.715 | 15x | -| news-please 1.5.35 | 0.801 | 0.768 | 0.645 | 0.784 | 34x | +| articleDateExtractor 0.20 | 0.846 | 0.745 | 0.656 | 0.792 | 3x | +| date_guesser 2.1.4 | 0.832 | 0.611 | 0.544 | 0.705 | 11x | +| goose3 3.1.21 | **0.930** | 0.568 | 0.545 | 0.706 | 14x | +| htmldate\[all\] 1.10.0 (fast) | 0.924 | 0.927 | 0.861 | 0.925 | **1x** | +| htmldate\[all\] 1.10.0 (extensive) | 0.908 | **0.993** | **0.903** | **0.949** | 1.8x | +| newspaper4k 0.9.5 | 0.912 | 0.728 | 0.680 | 0.810 | 2.5x | +| news-please 1.6.16 | 0.845 | 0.777 | 0.680 | 0.810 | 29x | For the complete results and explanations see [evaluation page](https://htmldate.readthedocs.io/en/latest/evaluation.html). @@ -95,13 +95,14 @@ page](https://htmldate.readthedocs.io/en/latest/evaluation.html). ## Installation Htmldate is tested on Linux, macOS and Windows systems, it is compatible -with Python 3.8 upwards. It can notably be installed with `pip` (`pip3` +with Python 3.10 upwards. It can notably be installed with `pip` (`pip3` where applicable) from the PyPI package repository: - `pip install htmldate` - (optionally) `pip install htmldate[speed]` -The last version to support Python 3.6 and 3.7 is `htmldate==1.8.1`. +The last version to support Python 3.6 and 3.7 is `htmldate==1.8.1`; for +Python 3.8 and 3.9 use the `1.9.x` series. ## Documentation diff --git a/docs/conf.py b/docs/conf.py index d4856e55..4089bcfb 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -21,7 +21,7 @@ # -- Project information ----------------------------------------------------- project = 'htmldate' -copyright = '2023, Adrien Barbaresi' +copyright = '2017-2026, Adrien Barbaresi' author = 'Adrien Barbaresi' # -- General configuration --------------------------------------------------- diff --git a/docs/evaluation.rst b/docs/evaluation.rst index f485d4ac..47d88bf2 100644 --- a/docs/evaluation.rst +++ b/docs/evaluation.rst @@ -18,7 +18,7 @@ There are comparable software solutions in Python, the following date extraction - `date_guesser `_ extracts publication dates from a web pages along with an accuracy measure (not used here), - `goose3 `_ can extract information for embedded content, - `htmldate `_ is the software package described here, it is designed to extract original and updated publication dates of web pages, -- `newspaper `_ is mostly geared towards newspaper texts, +- `newspaper4k `_ (the maintained successor of newspaper3k) is mostly geared towards newspaper texts, - `news-please `_ is a news crawler that extracts structured information. Two alternative packages are not tested here but could be used in addition: @@ -36,7 +36,7 @@ Description **Time**: the execution time cannot be easily compared in all cases as some solutions perform a whole series of operations which are irrelevant to this task. -**Errors:** *goose3*'s output isn't always meaningful and/or in a standardized format, these cases were discarded. *news-please* seems to have trouble with some encodings (e.g. in Chinese), in which case it leads to an exception. +**Errors:** *goose3*'s output isn't always meaningful and/or in a standardized format, these cases were discarded. Results @@ -45,6 +45,23 @@ Results The results below show that **date extraction is not a completely solved task** but one for which extractors have to resort to heuristics and guesses. The figures documenting recall and accuracy capture the real-world performance of the tools as the absence of a date output impacts the result. +================================ ========= ========= ========= ========= ======= +1000 web pages containing identifiable dates (as of 2026-06-01 on Python 3.13) +-------------------------------------------------------------------------------- +Python Package Precision Recall Accuracy F-Score Time +================================ ========= ========= ========= ========= ======= +articleDateExtractor 0.20 0.846 0.745 0.656 0.792 3x +date_guesser 2.1.4 0.832 0.611 0.544 0.705 11x +goose3 3.1.21 **0.930** 0.568 0.545 0.706 14x +htmldate[all] 1.10.0 (fast) 0.924 0.927 0.861 0.925 **1x** +htmldate[all] 1.10.0 (extensive) 0.908 **0.993** **0.903** **0.949** 1.8x +newspaper4k 0.9.5 0.912 0.728 0.680 0.810 2.5x +news-please 1.6.16 0.845 0.777 0.680 0.810 29x +================================ ========= ========= ========= ========= ======= + +This run uses a reviewed version of the ground-truth labels (publication-date corrections) and the maintained *newspaper4k* fork in place of the now-unmaintained *newspaper3k*. + + =============================== ========= ========= ========= ========= ======= 1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10) ------------------------------------------------------------------------------- @@ -62,6 +79,8 @@ news-please 1.5.35 0.801 0.768 0.645 0.784 34x Additional data for new pages in English collected by the `Data Culture Group `_ at Northeastern University. +The discussion below refers to the most recent run (top table), measured against a reviewed version of the publication-date labels. + Precision describes if the dates given as output are correct: *goose3* fares well precision-wise but it fails to extract dates in a large majority of cases (poor recall). The difference in accuracy between *date_guesser* and *newspaper* is consistent with tests described on the `website of the former `_. It turns out that *htmldate* performs better than the other solutions overall. It is also noticeably faster than the strictly comparable packages (*articleDateExtractor* and most certainly *date_guesser*). Despite being measured on a sample, **the higher accuracy and faster processing time are highly significant**. Especially for smaller news outlets, websites and blogs, as well as pages written in languages other than English (in this case mostly but not exclusively German), *htmldate* greatly extends date extraction coverage without sacrificing precision. diff --git a/docs/index.rst b/docs/index.rst index 16b85ead..ba9d2132 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -80,7 +80,7 @@ Features - URLs, HTML files, or HTML trees are given as input (includes batch processing) - Output as string in any date format (defaults to `ISO 8601 YMD `_) - Detection of both original and updated dates -- Compatible with all recent versions of Python +- Compatible with Python 3.10 and later ``htmldate`` can examine markup and text. It provides the following ways to date an HTML document: @@ -94,7 +94,7 @@ Features The output is thoroughly verified in terms of plausibility and adequateness. If a valid date has been found the library outputs a date string corresponding to either the last update or the original publishing statement (the default), in the desired format. -Markup-based extraction is multilingual by nature, text-based refinements for better coverage currently support German, English and Turkish. +Markup-based extraction is multilingual by nature, text-based refinements for better coverage currently support English, French, German, Indonesian and Turkish. Installation @@ -103,16 +103,16 @@ Installation Main package ~~~~~~~~~~~~ -This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.8 upwards. It is available on the package repository `PyPI `_ and can notably be installed with ``pip`` or ``pipenv``: +This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.10 upwards. It is available on the package repository `PyPI `_ and can notably be installed with ``pip`` or ``pipenv``: .. code-block:: bash - $ pip install htmldate # pip3 install on systems where both Python 2 and 3 are installed + $ pip install htmldate $ pip install --upgrade htmldate # to make sure you have the latest version $ pip install git+https://github.com/adbar/htmldate.git # latest available code (see build status above) -The last version to support Python 3.6 and 3.7 is ``htmldate==1.8.1``. +The last version to support Python 3.6 and 3.7 is ``htmldate==1.8.1``; for Python 3.8 and 3.9 use the ``1.9.x`` series. Optional @@ -131,16 +131,6 @@ The ``dateparser`` package is noticeably slower in its latest versions, version *For infos on dependency management of Python packages see* `this discussion thread `_. -Experimental -~~~~~~~~~~~~ - -Experimental compilation with ``mypyc``, as using pre-compiled library may shorten processing speed: - -1. Install ``mypy``: ``pip3 install mypy`` -2. Compile the package: ``python setup.py --use-mypyc bdist_wheel`` -3. Use the newly created wheel: ``pip3 install dist/...`` - - With Python ----------- @@ -162,7 +152,7 @@ In case the web page features easily readable metadata in the header, the extrac .. code-block:: python >>> find_date('https://creativecommons.org/about/') - '2017-08-11' # has been updated since + '2017-08-11' # may change >>> find_date('https://creativecommons.org/about/', extensive_search=False) >>> @@ -189,7 +179,7 @@ Change the output to a format known to Python's ``datetime`` module, the default .. code-block:: python >>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y') - '18 November 2016' # may have changed since + '18 November 2016' # may change Original vs. updated dates @@ -200,7 +190,7 @@ Although the time delta between original publication and "last modified" info is .. code-block:: python >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True) # modified behavior - '2016-06-23' + '2016-06-23' # may change For more information see `options page `_. diff --git a/docs/options.rst b/docs/options.rst index 71c90f46..a2754545 100644 --- a/docs/options.rst +++ b/docs/options.rst @@ -27,15 +27,15 @@ An external module can be used for download, as described in versions anterior t >>> import requests >>> r = requests.get('https://creativecommons.org/about/') >>> find_date(r.text) - '2017-11-28' # may have changed since + '2017-11-28' # may change # using htmldate's own fetch_url function >>> from htmldate.utils import fetch_url >>> htmldoc = fetch_url('https://blog.wikimedia.org/2018/06/28/interactive-maps-now-in-your-language/') >>> find_date(htmldoc) - '2018-06-28' + '2018-06-28' # may change # or simply >>> find_date('https://blog.wikimedia.org/2018/06/28/interactive-maps-now-in-your-language/') # URL detected - '2018-06-28' + '2018-06-28' # may change Date format @@ -46,7 +46,7 @@ Change the output to a format known to Python's ``datetime`` module, the default .. code-block:: python >>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y') - '18 November 2016' # may have changed since + '18 November 2016' # may change >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html', outputformat='%Y-%m-%dT%H:%M:%S%z') '2016-12-23T05:11:00-0500' @@ -62,7 +62,7 @@ Although the time delta between the original publication and the "last modified" .. code-block:: python >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/') # default setting - '2019-06-24' + '2019-06-24' # may change >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True) # modified behavior '2016-06-23' @@ -77,8 +77,6 @@ See ``settings.py`` file: :show-inheritance: :undoc-members: -The module can then be re-compiled locally to apply changes to the settings. - Clearing caches ~~~~~~~~~~~~~~~ diff --git a/docs/requirements.txt b/docs/requirements.txt index 8d0cbee9..b02618bd 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,4 +1,3 @@ # version required -sphinx>=8.1.3 -# without version specifier -htmldate +sphinx>=9.1.0 +# htmldate itself is installed from the repo root (see .readthedocs.yaml) diff --git a/htmldate/__init__.py b/htmldate/__init__.py index 19678528..c8b0023c 100644 --- a/htmldate/__init__.py +++ b/htmldate/__init__.py @@ -7,7 +7,7 @@ __author__ = "Adrien Barbaresi" __license__ = "Apache-2.0" __copyright__ = "Copyright 2017-present, Adrien Barbaresi" -__version__ = "1.9.4" +__version__ = "1.10.0" import logging diff --git a/htmldate/cli.py b/htmldate/cli.py index abba2e13..ac58f6c3 100644 --- a/htmldate/cli.py +++ b/htmldate/cli.py @@ -81,13 +81,13 @@ def process_args(args: argparse.Namespace) -> None: if args.URL: htmlstring = fetch_url(args.URL) if htmlstring is None: - sys.exit(f"No data for URL: {args.URL}" + "\n") + sys.exit(f"No data for URL: {args.URL}\n") # unicode check else: try: htmlstring = sys.stdin.read() except UnicodeDecodeError as err: - sys.exit(f"Wrong buffer encoding: {str(err)}" + "\n") + sys.exit(f"Wrong buffer encoding: {err}\n") result = cli_examine(htmlstring, args) if result is not None: sys.stdout.write(result + "\n") diff --git a/htmldate/core.py b/htmldate/core.py index f40489d6..e7baf346 100644 --- a/htmldate/core.py +++ b/htmldate/core.py @@ -28,8 +28,6 @@ FAST_PREPEND, SLOW_PREPEND, FREE_TEXT_EXPRESSIONS, - MAX_SEGMENT_LEN, - MIN_SEGMENT_LEN, YEAR_PATTERN, YMD_PATTERN, COPYRIGHT_PATTERN, @@ -54,11 +52,18 @@ THREE_COMP_REGEX_B, TWO_COMP_REGEX, ) -from .settings import CACHE_SIZE, CLEANING_LIST, MAX_POSSIBLE_CANDIDATES +from .settings import ( + CACHE_SIZE, + CLEANING_LIST, + MAX_POSSIBLE_CANDIDATES, + MAX_SEGMENT_LEN, + MIN_SEGMENT_LEN, +) from .utils import Extractor, clean_html, load_html, trim_text from .validators import ( check_extracted_reference, compare_values, + correct_year, filter_ymd_candidate, get_min_date, get_max_date, @@ -563,7 +568,7 @@ def normalize_match(match: re.Match[str] | None) -> str: and optionally expand the year from two to four digits.""" day, month, year = (g.zfill(2) for g in match.groups() if g) # type: ignore[union-attr] if len(year) == 2: - year = f"19{year}" if year[0] == "9" else f"20{year}" + year = str(correct_year(int(year))) return f"{year}-{month}-{day}" @@ -852,8 +857,6 @@ def find_date( original_date, outputformat, ) - # unclear what this line is for and it impedes type checking: - # find_date.extensive_search = extensive_search # URL if url is None: @@ -891,9 +894,7 @@ def find_date( # costly deepcopy of the whole document pruning_tree = deepcopy(tree) if isinstance(htmlobject, HtmlElement) else tree try: - search_tree, discarded = discard_unwanted( - clean_html(pruning_tree, CLEANING_LIST) - ) + search_tree = discard_unwanted(clean_html(pruning_tree, CLEANING_LIST)) # rare LXML error: no NULL bytes or control characters except ValueError: # pragma: no cover search_tree = tree @@ -923,13 +924,6 @@ def find_date( if result is not None: return result - # TODO: decide on this - # search in discarded parts (e.g. archive.org-banner) - # for subtree in discarded: - # dateresult = examine_date_elements(subtree, DATE_EXPRESSIONS, options) - # if dateresult is not None: - # return dateresult - # robust conversion to string try: htmlstring = tostring(search_tree, pretty_print=False, encoding="unicode") diff --git a/htmldate/extractors.py b/htmldate/extractors.py index 01ad9025..9e5efd87 100644 --- a/htmldate/extractors.py +++ b/htmldate/extractors.py @@ -18,9 +18,9 @@ from lxml.html import HtmlElement # own -from .settings import CACHE_SIZE +from .settings import CACHE_SIZE, MAX_SEGMENT_LEN from .utils import Extractor, trim_text -from .validators import convert_date, is_valid_date, validate_and_convert +from .validators import convert_date, correct_year, is_valid_date, validate_and_convert LOGGER = logging.getLogger(__name__) @@ -80,8 +80,6 @@ # or contains(@id, 'lastmod') or contains(@class, 'updated') FREE_TEXT_EXPRESSIONS = XPath(FAST_PREPEND + "/text()") -MIN_SEGMENT_LEN = 6 -MAX_SEGMENT_LEN = 52 # discard parts of the webpage # archive.org banner inserts @@ -209,13 +207,11 @@ SIMPLE_PATTERN = re.compile(rf"(? tuple[HtmlElement, list[HtmlElement]]: - """Delete unwanted sections of an HTML document and return them as a list""" - my_discarded = [] +def discard_unwanted(tree: HtmlElement) -> HtmlElement: + """Delete unwanted sections of an HTML document.""" for subtree in DISCARD_EXPRESSIONS(tree): - my_discarded.append(subtree) subtree.getparent().remove(subtree) - return tree, my_discarded + return tree def extract_url_date( @@ -237,13 +233,6 @@ def extract_url_date( return None -def correct_year(year: int) -> int: - """Adapt year from YY to YYYY format""" - if year < 100: - year += 1900 if year >= 90 else 2000 - return year - - def try_swap_values(day: int, month: int) -> tuple[int, int]: """Swap day and month values if it seems feasible.""" return (month, day) if month > 12 and day <= 12 else (day, month) diff --git a/htmldate/settings.py b/htmldate/settings.py index 2f1aa3d4..5e8fc1ab 100644 --- a/htmldate/settings.py +++ b/htmldate/settings.py @@ -18,6 +18,10 @@ # set an upper limit to the number of candidates MAX_POSSIBLE_CANDIDATES: int = 1000 +# Text segment length bounds (in characters) for date extraction +MIN_SEGMENT_LEN: int = 6 +MAX_SEGMENT_LEN: int = 52 + CLEANING_LIST = [ "applet", "audio", diff --git a/htmldate/utils.py b/htmldate/utils.py index 90382d53..10855f8e 100644 --- a/htmldate/utils.py +++ b/htmldate/utils.py @@ -8,6 +8,7 @@ from dataclasses import dataclass from datetime import datetime +from typing import Any import urllib3 @@ -109,15 +110,14 @@ def decode_file(filecontent: bytes | str) -> str: return htmltext or str(filecontent, encoding="utf-8", errors="replace") -def decode_response(response: urllib3.response.HTTPResponse | bytes) -> str: - """Read the urllib3 object corresponding to the server response, then - try to guess its encoding and decode it to return a unicode string""" - # urllib3 response object / bytes switch - if isinstance(response, urllib3.response.HTTPResponse): - resp_content = response.data - else: - resp_content = response - return decode_file(resp_content) +def decode_response(response: Any) -> str: + """Read the data from a response object exposing the body via ``.data`` + (e.g. urllib3 or a compatible response) or from a bytestring, then guess + its encoding and decode it to return a unicode string.""" + # accept any response-like object exposing the body via .data, or raw bytes; + # .data may be None, so guard before decoding + data = response.data if hasattr(response, "data") else response + return decode_file(data) if data else "" def fetch_url(url: str) -> str | None: diff --git a/htmldate/validators.py b/htmldate/validators.py index d8c5462d..a04062d2 100644 --- a/htmldate/validators.py +++ b/htmldate/validators.py @@ -90,6 +90,13 @@ def is_valid_format(outputformat: str) -> bool: return True +def correct_year(year: int) -> int: + """Adapt year from YY to YYYY format""" + if year < 100: + year += 1900 if year >= 90 else 2000 + return year + + def plausible_year_filter( htmlstring: str, *, @@ -114,8 +121,7 @@ def plausible_year_filter( if not incomplete: potential_year = int(lastdigits) else: - century = "19" if lastdigits[0] == "9" else "20" - potential_year = int(century + lastdigits) + potential_year = correct_year(int(lastdigits)) if not min_year <= potential_year <= max_year: LOGGER.debug("no potential year: %s", item) diff --git a/tests/comparison.py b/tests/comparison.py index a3de094c..69460a6f 100644 --- a/tests/comparison.py +++ b/tests/comparison.py @@ -4,8 +4,6 @@ import argparse import contextlib -import json -import os import sys import time @@ -15,6 +13,7 @@ from evaluation import ( + EVAL_PAGES, evaluate_result, load_document, run_htmldate_extensive, @@ -27,17 +26,6 @@ ) -TEST_DIR = os.path.abspath(os.path.dirname(__file__)) -# list the jsons containing the pages here -eval_paths = ["eval_mediacloud_2020.json", "eval_default.json"] -# load the pages here -EVAL_PAGES = {} -for each in eval_paths: - evalpath = os.path.join(TEST_DIR, each) - with open(evalpath, "r", encoding="utf-8") as f: - EVAL_PAGES.update(json.load(f)) - - PARSER = argparse.ArgumentParser(description="Run the evaluation") PARSER.add_argument( "--small", @@ -63,20 +51,19 @@ FUNC_DICT = { "htmldate_extensive": run_htmldate_extensive, "htmldate_fast": run_htmldate_fast, - **{ - key: func - for key, func in [ - ("newspaper", run_newspaper), - ("newsplease", run_newsplease), - ("articledateextractor", run_articledateextractor), - ("date_guesser", run_dateguesser), - ("goose", run_goose), - ] - if not ARGS.small - }, } +if not ARGS.small: + FUNC_DICT.update( + { + "newspaper": run_newspaper, + "newsplease": run_newsplease, + "articledateextractor": run_articledateextractor, + "date_guesser": run_dateguesser, + "goose": run_goose, + } + ) -RESULTS_DICT = {key: TEMPLATE_DICT.copy() for key, value in FUNC_DICT.items()} +RESULTS_DICT = {key: TEMPLATE_DICT.copy() for key in FUNC_DICT} def calculate_scores(name, mydict): diff --git a/tests/eval-requirements.txt b/tests/eval-requirements.txt index cc552d17..00492a8d 100644 --- a/tests/eval-requirements.txt +++ b/tests/eval-requirements.txt @@ -1,13 +1,16 @@ # package -htmldate>=1.9.2 +htmldate>=1.10.0 # alternatives articleDateExtractor==0.20 date_guesser==2.1.4 -goose3==3.1.19 -newspaper3k==0.2.8 -news-please==1.6.13 +goose3==3.1.21 +# newspaper4k succeeds the unmaintained newspaper3k. Extras = tokenizers the +# corpus needs (via news-please): nltk, tinysegmenter (ja), jieba (zh). +# Also requires NLTK data: python -m nltk.downloader punkt_tab stopwords +newspaper4k[nlp,ja,zh]==0.9.5 +news-please==1.6.16 # helpers -tabulate==0.9.0 -tqdm==4.67.0 +tabulate==0.10.0 +tqdm==4.67.3 diff --git a/tests/eval_default.json b/tests/eval_default.json index ef77e503..b4600526 100644 --- a/tests/eval_default.json +++ b/tests/eval_default.json @@ -525,11 +525,11 @@ }, "https://www.uusisuomi.fi/uutiset/sanna-marin-tapasi-angela-merkelin-myos-saksa-haluaa-pitaa-kiinni-maataloustuista-meidan-nakemyksiamme-suurimpana-nettomaksajana-ei-ole-otettu-riittavasti-huomioon/b29c11d3-9590-4045-8e2c-a568f9f24617": { "file": "uusisuomi.fi.angela.html", - "date": "2019-02-19" + "date": "2020-02-19" }, "https://yle.fi/uutiset/3-11212601": { "file": "yle.fi.3-11212601.html", - "date": "2019-02-19" + "date": "2020-02-19" }, "https://www.tofugu.com/travel/dezuka-suisan/": { "file": "tofugu.com.dezuka-suisan.html", @@ -713,7 +713,7 @@ }, "https://zahlenzauberin.wordpress.com/2012/08/22/was-zum-horen-in-den-ferien/": { "file": "zahlenzauberin.wordpress.com.ferien.html", - "date": "2010-08-22" + "date": "2012-08-22" }, "https://www.deutschlandfunk.de/die-zukunft-der-arbeit-wir-dekorieren-auf-der-titanic-die.911.de.html?dram:article_id=385022": { "file": "deutschlandfunk.de.titanic.html", @@ -877,7 +877,7 @@ }, "https://www.theplanetarypress.com/2020/01/management-of-intact-forestlands-by-indigenous-peoples-key-to-protecting-climate/": { "file": "theplanetarypress.com.forestlands.html", - "date": "2020-01-19" + "date": "2020-01-17" }, "https://wikimediafoundation.org/news/2020/01/15/access-to-wikipedia-restored-in-turkey-after-more-than-two-and-a-half-years/": { "file": "wikimediafoundation.org.turkey.html", @@ -937,7 +937,7 @@ }, "https://www.tomshardware.com/uk/news/where-and-how-to-buy-rtx-3080-3090-3070": { "file": "tomshardware.com.rtx.html", - "date": "2020-11-02" + "date": "2020-11-04" }, "https://stardewvalleywiki.com/Penny": { "file": "stardewvalleywiki.com.penny.html", @@ -985,7 +985,7 @@ }, "https://diem25.org/the-eus-green-deal-isnt-enough-save-from-climate-catastrophe/": { "file": "diem25.org.climate.html", - "date": "2020-12-12" + "date": "2020-10-12" }, "https://www.economist.com/open-future/2018/06/18/why-collaborative-thinking-beats-individual-smarts": { "file": "economist.com.thinking.html", @@ -1041,7 +1041,7 @@ }, "https://mywakenews.wordpress.com/2016/07/09/nwo-psyop-unitedwestrike-radio-marathon/": { "file": "mywakenews.wordpress.com.psyop.html", - "date": "2016-06-09" + "date": "2016-07-09" }, "https://web.archive.org/web/20130307194448/the-pain.net/2008/05/silkroad-roc-mountain-quests-und-npcs.html": { "file": "archive.org.the-pain.net.silkroad.html", @@ -1373,7 +1373,7 @@ }, "https://berkutschi.com/de/front/news/10759-marius-lindvik-gewinnt-in-willingen": { "file": "berkutschi.com-willingen.html", - "date": "2022-01-31" + "date": "2022-01-30" }, "https://www.berliner-feuerwehr.de/aktuelles/nachrichten/feuerwehr-und-katastrophenschutz-ehrenzeichen-verliehen-3896/": { "file": "berliner-feuerwehr.de-Ehrenzeichen.html", @@ -1661,7 +1661,7 @@ }, "https://www.ekiba.de/detail/nachricht-seite/id/35204-trauern-digital-am-ewigkeitssonntag/?default=true": { "file": "ekiba.de-trauer.html", - "date": "2021-12-12" + "date": "2021-11-12" }, "https://emacspeak.blogspot.com/2019/10/meta-programming-in-emacs-using.html": { "file": "emacspeak.blogspot.com.meta.html", @@ -1825,7 +1825,7 @@ }, "https://www.handwerksblatt.de/themen-specials/coronaschutz-im-betrieb/2g-3g-was-gilt-beim-friseurbesuch": { "file": "handwerksblatt.de-Friseurbesuch.html", - "date": "2022-01-01" + "date": "2022-01-14" }, "https://www.haus.de/bauen/vorsatzschalung-33656": { "file": "haus.de-Vorsatzschallung.html", @@ -2121,7 +2121,7 @@ }, "https://redtri.com/best-jokes-for-kids/slide/1": { "file": "redtri.com.jokes.html", - "date": "2020-11-03" + "date": "2021-09-19" }, "https://www.refinery29.com/de-de/vreni-frost-instagram-werbung-abmahnung": { "file": "refiner29.com-Verni.html", @@ -2169,7 +2169,7 @@ }, "https://www.selbst.de/wurmkiste-39572.html": { "file": "selbst.de-wurmkiste.html", - "date": "2022-01-22" + "date": "2021-02-22" }, "https://www.siegessaeule.de/magazin/p%C3%A4dophilie-als-politisches-machtinstrument/": { "file": "siegessaeule.de-Machtinstrument.html", @@ -2237,7 +2237,7 @@ }, "https://www.tennismagazin.de/news/zverev-zieht-ins-viertelfinale-von-montpellier-ein/": { "file": "tennismagazin.de-viertelfinale.html", - "date": "2022-02-04" + "date": "2022-02-03" }, "https://www.tennisnet.com/news/diese-ymers-zwei-ueberraschungen-an-einem-tag": { "file": "tennisnet.com-ueberraschungen.html", @@ -2265,7 +2265,7 @@ }, "https://www.tierwelt.ch/news/natur-umwelt/immer-mehr-modemarken-werden-pelzfrei-so-erkennen-sie-echtpelz-im-laden": { "file": "tierwelt.ch-plez.html", - "date": "2022-02-02" + "date": "2022-02-01" }, "https://www.tonight.de/unterhaltung/promis/daniela-buechner-danni-und-ennesto-monte-trennen-sich-arschloch_114240.html": { "file": "tonight.de-Arschloch.html", @@ -2325,7 +2325,7 @@ }, "https://www.wochenblatt.com/landwirtschaft/agrarpolitik/heinen-esser-offen-fuer-existenzgruendungspraemie-12810183.html": { "file": "wochenblatt.com-Heinen-Essen.html", - "date": "2022-02-21" + "date": "2022-01-21" }, "https://www.wolfgangmichal.de/2017/06/07/publizistische-sorgfaltspflicht-statt-netzwerkdurchsetzungsgesetz/": { "file": "wolfgangmichal.de.sorgfaltspflicht.html", @@ -3045,7 +3045,7 @@ }, "https://popkultur.de/homosexuelle-schauspieler/": { "file": "Popkultur.de-Schauspieler.html", - "date": "2023-05-06" + "date": "2023-06-05" }, "https://presse-augsburg.de/augsburger-verkehrs-und-tarifverbund-avv-erhoeht-die-oepnv-preise-deutlich/909665/": { "file": "presse-ausburg.de-Tarifverbund.html", diff --git a/tests/eval_mediacloud_2020.json b/tests/eval_mediacloud_2020.json index 1b84a29a..94166423 100644 --- a/tests/eval_mediacloud_2020.json +++ b/tests/eval_mediacloud_2020.json @@ -1 +1 @@ -{"https://zaxid.net/news/showNews.do?nastupnogo_tizhnya_na_ukrayinu_chekayut_anomalna_speka_ta_grozi&objectId=1503302": {"file": "1628285861.html", "date": "2020-06-07"}, "https://24tv.ua/yak-venediktova-zahishhala-nardepa-vid-slugi-narodu_n1419177": {"file": "1716869064.html", "date": "2020-09-21"}, "https://www.mynet.com/samsunda-sobadan-zehirlenen-2-cocuk-hastanelik-oldu-110106661445": {"file": "1783862633.html", "date": "2020-11-30"}, "http://www.detaykibris.com/yikilan-binalarkurtarma-calismalari-izmirden-goruntuler-2196g.htm": {"file": "1754856212.html", "date": "2020-10-30"}, "https://www.hatawtabloid.com/2020/02/18/aktres-sunod-sunuran-sa-aktor-bf/": {"file": "1523761669.html", "date": "2020-02-18"}, "http://auto-door16814.ttblogs.com/2028064/%E0%B8%9C-%E0%B8%9C%E0%B8%A5-%E0%B8%95-%E0%B9%81%E0%B8%A5%E0%B8%B0%E0%B8%88%E0%B8%B3%E0%B8%AB%E0%B8%99-%E0%B8%B2%E0%B8%A2%E0%B9%82%E0%B8%8B-%E0%B8%AD-%E0%B8%95%E0%B8%AA%E0%B8%B2%E0%B8%AB%E0%B8%81%E0%B8%A3%E0%B8%A3%E0%B8%A1%E0%B8%97-%E0%B8%81%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%A0%E0%B8%97-%E0%B8%97-%E0%B8%81%E0%B8%8A%E0%B8%99-%E0%B8%94": {"file": "1799801548.html", "date": NaN}, "https://www.hbl.fi/artikel/bbc-premier-league-omstart-17-juli/": {"file": "1619092544.html", "date": "2020-05-28"}, "https://avaz.ba/kantoni/republika-srpska/604722/generalni-direktor-eprs-pozitivan-na-koronavirus": {"file": "1750355638.html", "date": "2020-10-26"}, "https://www.koha.net/kronike-e-zeze/233165/arrestohen-dy-persona-ne-peje-per-organizim-te-lojerave-te-fatit/": {"file": "1682093320.html", "date": "2020-08-13"}, "http://www.standard.al/2020/04/15/dite-zie-shenohen-1-mije-e-438-viktima-nga-covid-19-ne-24-oret-e-fundit-ne-france/": {"file": "1579090372.html", "date": "2020-04-15"}, "https://oren.mk.ru/social/2020/02/18/orenburgskim-chinovnikam-kupyat-eshhe-pyat-avtomobiley.html": {"file": "1523821409.html", "date": "2020-02-18"}, "https://www.inform.kz/ru/polnost-yu-obespechit-region-myasom-pticy-namereny-v-sko_a3725092": {"file": "1784458132.html", "date": "2020-12-01"}, "https://www.ukrinform.ru/rubric-kyiv/3103218-policia-napravila-delo-minera-stolicnogo-metro-v-sud.html": {"file": "1716561772.html", "date": NaN}, "https://rzn.mk.ru/social/2020/01/13/v-ryazani-nagradili-luchshikh-zhurnalistov.html": {"file": "1493663910.html", "date": "2020-01-13"}, "https://www.mk.ru/politics/2020/09/23/prichinoy-taynoy-inauguracii-lukashenko-stal-styd.html": {"file": "1719330438.html", "date": "2020-09-23"}, "https://tass.ru/moskva/8836033": {"file": "1647262025.html", "date": "2020-06-28"}, "https://www.mk.ru/social/2020/12/29/prisyazhnye-opravdali-muzhchinu-kotoryy-obvinyalsya-v-ubiystve-aktivista-v-serpukhove.html": {"file": "1810559448.html", "date": "2020-12-29"}, "https://aif.ru/politics/world/otrezat_golovu_vo_imya_proroka_terakt_vo_francii_obnazhil_starye_problemy": {"file": "1742998174.html", "date": "2020-10-19"}, "https://www.unn.com.ua/ru/news/1906059-parlament-moldovi-z-podachi-dodona-obmezhiv-povnovazhennya-sandu-sche-do-yiyi-inavguratsiyi": {"file": "1786902045.html", "date": "2020-12-03"}, "https://news.yam.md/ro/story/10939686": {"file": "1716324024.html", "date": NaN}, "https://www.noticiaexata.com.br/artigo/brasileiro-goias-supera-palmeiras-com-gol-no-apagar-das-luzes": {"file": "1777110076.html", "date": "2020-11-22"}, "http://feedproxy.google.com/~r/PublicoRSS/~3/iyen1-bXCVc/covid19-trump-ja-nao-corre-risco-infectar-terceiros-medico-casa-branca-1934765": {"file": "1735359864.html", "date": "2020-10-11"}, "https://economia.ig.com.br/2020-11-26/agronegocio-quer-salvar-relacao-com-a-china-apos-acusacoes-de-eduardo-bolsonaro.html": {"file": "1781093249.html", "date": "2020-11-26"}, "https://noticias.uol.com.br/ultimas-noticias/reuters/2020/03/14/sancoes-dos-eua-dificultam-severamente-luta-do-ira-contra-coronavirus-diz-rouhani.htm": {"file": "1548435131.html", "date": "2020-03-14"}, "https://opapel.com/quintal-dentro-do-ape-tres-dicas-para-integrar-a-varanda-a-sala-de-estar/": {"file": "1756260927.html", "date": "2020-11-01"}, "https://www.uol.com.br/esporte/futebol/ultimas-noticias/2020/05/21/witzel-ignora-decreto-e-diz-que-volta-e-de-responsabilidade-dos-clubes.htm": {"file": "1613345316.html", "date": "2020-05-21"}, "https://www.otempo.com.br/diversao/benjamin-moser-fala-sobre-a-autora-que-o-conquistou-aos-19-anos-1.2423262": {"file": "1793766917.html", "date": "2020-12-10"}, "https://revistaforum.com.br/brasil/tuca-almeida-do-the-voice-kids-morre-baleado/": {"file": "1594316187.html", "date": "2020-05-01"}, "https://g1.globo.com/ro/rondonia/noticia/2020/03/08/estudante-faz-vaquinha-virtual-para-tratamento-de-pitbull-abandonado-com-deficiencia.ghtml": {"file": "1542250510.html", "date": "2020-03-08"}, "https://www.uol.com.br/carros/videos/2020/06/23/comprador-capota-vw-polo-retirando-carro-da-concessionaria-assista.htm": {"file": "1643070775.html", "date": "2020-06-23"}, "https://trojmiasto.wyborcza.pl/trojmiasto/7,35612,25651465,jest-nowy-prezes-grupy-lotos-ma-doswiadczenie-na-kierowniczych.html": {"file": "1511056469.html", "date": "2020-01-31"}, "https://tvn24.pl/swiat/koronawirus-w-chile-po-zniesieniu-obostrzen-tlumy-ludzi-w-sklepach-i-dlugie-kolejki-4668753?source=rss": {"file": "1686918031.html", "date": "2020-08-19"}, "http://alarmeringen.nl/zuid-holland/haaglanden/den-haag/34192126/p2000-ambulance-met-spoed-naar-hoge-zand-in-den-haag.html?utm_source=rss&utm_medium=nederland&utm_campaign=sharing": {"file": "1733850206.html", "date": "2020-10-09"}, "https://news.google.com/__i/rss/rd/articles/CBMibWh0dHBzOi8vd3d3Lm51Lm5sL2Zvcm11bGUtMS82MDgwMDIxL3ZlcnN0YXBwZW4tbm90ZWVydC16ZXNkZS10aWpkLWluLWRlcmRlLXRyYWluaW5nLXJ1c3Npc2NoZS1ncmFuZC1wcml4Lmh0bWzSAWxodHRwczovL3d3dy5udS5ubC9mb3JtdWxlLTEvNjA4MDAyMS92ZXJzdGFwcGVuLW5vdGVlcnQtemVzZGUtdGlqZC1pbi1kZXJkZS10cmFpbmluZy1ydXNzaXNjaGUtZ3JhbmQtcHJpeC5hbXA?oc=5": {"file": "1721884119.html", "date": "2020-06-14"}, "https://www.lrt.lt/naujienos/verslas/4/1261660/finansu-ministerija-vidaus-rinkoje-pasiskolino-30-mln-euru": {"file": "1750014643.html", "date": "2020-10-26"}, "https://www.noz.de/lokales/westerkappeln/artikel/1994799/kindergottesdienst-in-velpe-macht-teilnehmern-und-betreuerinnen-spass": {"file": "1518961308.html", "date": "2020-02-11"}, "http://m.daejonilbo.com/mnews.asp?pk_no=1412361": {"file": "1538026996.html", "date": "2020-03-04"}, "https://news.chosun.com/site/data/html_dir/2020/08/27/2020082704301.html": {"file": "1694328145.html", "date": "2020-08-27"}, "https://www.youtube.com/watch?v=Lfw9H63g0OQ": {"file": "1577247412.html", "date": "2020-04-13"}, "https://news.biglobe.ne.jp/entertainment/0215/ori_200215_7568645051.html": {"file": "1521513101.html", "date": NaN}, "https://prtimes.jp/main/html/rd/p/000000232.000009812.html": {"file": "1662376572.html", "date": "2020-07-15"}, "https://prtimes.jp/main/html/rd/p/000000088.000029713.html": {"file": "1649432346.html", "date": "2020-07-01"}, "http://oshiete.goo.ne.jp/qa/11763760.html": {"file": "1660114516.html", "date": "2020-07-13"}, "https://blog.goo.ne.jp/umaichi_news/e/52743e20f825567d4e9889be58ec06b9": {"file": "1660064054.html", "date": "2020-07-12"}, "https://blog.goo.ne.jp/jgccg115/e/5d8c31a659b95cc18a43ad75d152e80f": {"file": "1651753632.html", "date": "2020-07-03"}, "https://www.israelhayom.co.il/article/791661": {"file": "1685121845.html", "date": "2020-08-17"}, "https://www.lagazzettadelmezzogiorno.it/news/mondo/1265473/california-certifica-voto-biden-oltre-quorum-270-elettori.html": {"file": "1789111868.html", "date": "2020-12-06"}, "https://www.edilportale.com/news/2020/09/informatica/quando-la-stampante-rende-piu-smart-il-lavoro-del-progettista_78527_10.html": {"file": "1720037955.html", "date": "2020-09-24"}, "https://www.ilmattino.it/primopiano/sanita/isolamento_gli_urologi_uomini_la_pigrizia_danneggia_la_prostata_in_casa_allenatevi_cosi-5180477.html": {"file": "1582946828.html", "date": "2020-04-19"}, "http://www.ansa.it/sito/notizie/sport/calcio/2020/07/28/ghersini-dirige-cagliari-juve-massimi-lazio-brescia_c708a8fb-c2d2-4a9a-b0f3-05b8cc98389d.html": {"file": "1671960661.html", "date": "2020-07-28"}, "https://www.tribunnews.com/pendidikan/2020/05/04/jawaban-soal-apa-dampak-negatif-jika-menunda-pekerjaan-belajar-dari-rumah-sma-di-tvri": {"file": "1595907782.html", "date": "2020-05-04"}, "https://a1plus.am/hy/article/378866": {"file": "1711609490.html", "date": "2020-09-15"}, "https://news.am/arm/news/615163.html": {"file": "1778318045.html", "date": "2020-11-24"}, "https://www.sonline.hu/orszag-vilag/sokan-visszaallitanak-a-tortenelmi-magyarorszagot-2866608/": {"file": "1625266707.html", "date": "2020-06-04"}, "https://www.baon.hu/eletstilus/gyogyulas-utan-is-kiserheti-kronikus-faradtsag-es-poszttraumas-stressz-a-koronavirust-2928554/": {"file": "1686705395.html", "date": "2020-08-18"}, "https://hindi.business-standard.com//storypage.php?autono=172475": {"file": "1725202846.html", "date": "2020-09-29"}, "https://www.amarujala.com/uttar-pradesh/varanasi/gahu-city-news-vns5205298178?utm_source=rssfeed&utm_medium=Referral&utm_campaign=rssfeed": {"file": "1579139187.html", "date": "2020-04-16"}, "https://hindi.business-standard.com//storypage.php?autono=166607": {"file": "1521292139.html", "date": "2020-02-14"}, "https://www.jagran.com/uttar-pradesh/allahabad-city-21109565.html": {"file": "1782160236.html", "date": "2020-11-28"}, "https://www.divyabhaskar.co.in/local/gujarat/vadodara/news/chhetu-patel-a-resident-of-the-united-states-died-due-to-corona-wife-under-treatment-127088788.html": {"file": "1565998355.html", "date": NaN}, "https://www.divyabhaskar.co.in/local/gujarat/rajkot/news/people-who-are-scared-of-corona-call-and-say-i-see-corona-in-my-hand-and-foot-127064846.html": {"file": "1561643340.html", "date": NaN}, "https://www.lexpress.fr/actualites/1/actualite/angleterre-manchester-united-rate-la-marche-arsenal-revit_2141546.html": {"file": "1808011241.html", "date": "2020-12-26"}, "http://ici.radio-canada.ca/nouvelle/1728875/transport-scolaire-ottawa-mi-septembre-covid": {"file": "1691900941.html", "date": "2020-08-24"}, "https://www.vosgesmatin.fr/edition-la-plaine/2020/03/23/incendie-dans-une-maison-cinq-personnes-relogees": {"file": "1556056486.html", "date": "2020-03-23"}, "https://www.guineenews.org/colonel-barry-accuse-pour-vol-aggrave-vers-la-projection-de-la-video-de-toute-la-verite/": {"file": "1759902081.html", "date": "2020-11-04"}, "https://www.sudinfo.be/id300074/article/2020-12-24/il-brise-le-couvre-feu-et-est-surpris-au-volant-23h30-mellet-je-me-fiche-pas-mal": {"file": "1806102460.html", "date": "2020-12-24"}, "https://actu.fr/societe/coronavirus/solidarite-centre-hospitalier-cote-basque-lance-appel-dons-entreprises-particuliers_32590953.html": {"file": "1560424903.html", "date": "2020-03-27"}, "http://www.republicoftogo.com//Toutes-les-rubriques/Sport/Le-championnat-d-Afrique-des-Nations-n-aura-pas-lieu": {"file": "1551001635.html", "date": "2020-03-17"}, "https://yle.fi/uutiset/3-11523428?origin=rss": {"file": "1699747711.html", "date": "2020-09-02"}, "https://www.khabaronline.ir/news/1353889/\u062a\u062d\u0644\u06cc\u0644-\u0631\u0648\u0632\u0646\u0627\u0645\u0647-\u0627\u0635\u0648\u0644\u06af\u0631\u0627-\u0627\u0632-\u062f\u0639\u0648\u062a-\u0627\u0635\u0644\u0627\u062d-\u0637\u0644\u0628\u0627\u0646-\u0628\u0647-\u062d\u0636\u0648\u0631-\u0645\u0631\u062f\u0645-\u062f\u0631-\u0627\u0646\u062a\u062e\u0627\u0628\u0627\u062a": {"file": "1523806243.html", "date": "2020-02-18"}, "https://www.yjc.ir/fa/news/7349926/\u0627\u0632-\u06a9\u0634\u0641-\u06f7\u06f2-\u062f\u0633\u062a\u06af\u0627\u0647-\u0645\u0648\u062a\u0648\u0631-\u0642\u0627\u0686\u0627\u0642-\u062f\u0631-\u0645\u0647\u0631\u06cc\u0632-\u062a\u0627-\u062f\u0633\u062a\u06af\u06cc\u0631\u06cc-\u0633\u0627\u0631\u0642-\u06f1\u06f0\u06f0-\u0645\u06cc\u0644\u06cc\u0648\u0646-\u0631\u06cc\u0627\u0644\u06cc-\u0637\u0644\u0627\u062c\u0627\u062a-\u0645\u0646\u0632\u0644-\u062f\u0631-\u0628\u0627\u0641\u0642": {"file": "1602557452.html", "date": "2020-05-10"}, "http://www.aryanews.com/News/120200622120908039/\u0648\u0631\u0648\u062f-50-\u0647\u0632\u0627\u0631-\u0645\u06cc\u0644\u06cc\u0627\u0631\u062f-\u062a\u0648\u0645\u0627\u0646-\u0646\u0642\u062f\u06cc\u0646\u06af\u06cc-\u0628\u0647-\u0628\u0648\u0631\u0633-\u062f\u0631-3-\u0645\u0627\u0647": {"file": "1641304459.html", "date": "2020-07-02"}, "https://laprensafl.com/2020/02/17/tenemos-silvia-pinal-para-rato-alejandra-guzman-habla-del-estado-de-salud-de-su-mama/": {"file": "1523780090.html", "date": "2020-02-17"}, "https://listindiario.com/el-deporte/2020/12/22/649384/los-grandes-ligas-en-la-lidom": {"file": "1804096113.html", "date": "2020-12-22"}, "http://bohemia.cu/nacionales/2020/03/adoptan-medidas-organizativas-en-la-habana-para-la-venta-de-alimentos/": {"file": "1561086382.html", "date": "2020-03-27"}, "https://www.diariolibre.com/actualidad/internacional/con-silencio-y-partidos-fantasma-se-reanuda-futbol-aleman-EI18893744": {"file": "1608578024.html", "date": "2020-05-16"}, "http://feedproxy.google.com/~r/NoticiaAlDia/~3/_k8dJ5xnwYw/": {"file": "1579141678.html", "date": "2020-04-15"}, "https://www.eldia.com/nota/2020-4-15-16-5-0-en-ruta-36-y-520-activan-protocolo-de-emergencia-en-un-colectivo-de-la-linea-oeste-la-ciudad": {"file": "1579111612.html", "date": "2020-04-15"}, "https://www.lainformacion.com/mundo/opositores-partidarios-lukashenko-culminan-dias-tension-marchas/2812881/": {"file": "1684898837.html", "date": "2020-08-16"}, "https://www.noticierodigital.com/2020/10/borrell-no-aplazar-las-parlamentarias-empeorara-la-situacion-en-venezuela/": {"file": "1731743891.html", "date": "2020-10-07"}, "https://www.la-prensa.com.mx/republica/decomisa-aduana-de-tijuana-mas-de-730-mil-dolares-en-efectivo-5050444.html": {"file": "1566936994.html", "date": "2020-04-02"}, "https://junin24.com/194420/tres-muertos-en-un-choque-frontal-en-ruta-188.html": {"file": "1506220874.html", "date": "2020-01-27"}, "http://www.radionacional.com.ar/intendente-de-pilar-encontramos-obras-paralizadas-y-calles-derrumbadas/": {"file": "1493765064.html", "date": "2020-01-13"}, "https://www.farodevigo.es/deportes/2020/07/05/andres-iniesta-recuerdos-son-magicos/2309843.html?utm_source=rss": {"file": "1653186688.html", "date": "2020-05-07"}, "http://www.andaluciainformacion.es/andalucia/895957/imbroda-revela-que-padecio-y-supero-el-coronavirus-el-pasado-marzo/": {"file": "1598244597.html", "date": "2020-06-05"}, "https://www.elsoldesanjuandelrio.com.mx/local/pescadores-gestionaran-crias-de-peces-5807069.html": {"file": "1721965295.html", "date": "2020-09-25"}, "https://www.elsoldemazatlan.com.mx/finanzas/precio-del-petroleo-mexicano-cae-a-un-minimo-de-18-anos-4982125.html": {"file": "1551294044.html", "date": "2020-03-17"}, "http://www.telepinar.cu/licenciados-en-educacion-primaria-en-consolacion-del-sur-fotos-y-video/": {"file": "1670377569.html", "date": "2020-07-23"}, "https://larazon.pe/faenon-de-toledo-y-grana-le-costo-s-1400-millones-al-estado-peruano/": {"file": "1718257353.html", "date": "2020-09-22"}, "https://diariodelsur.com.co/noticias/deportes/f%C3%BAtbol/el-primero-en-hablar-sorprendente-despedida-de-juan-guillerm-647581": {"file": "1789664619.html", "date": "2020-12-06"}, "http://www.soychile.cl/Puerto-Montt/Deportes/2020/08/24/670291/Congresos-y-seminarios-sobre-actividad-fisica-y-salud-se-transmitiran-desde-Puerto-Montt.aspx": {"file": "1691780201.html", "date": "2020-08-24"}, "https://www.lavozdelafrontera.com.mx/gossip/luis-miguel-y-jose-jose-entre-la-musica-que-sono-en-la-pandemia-plataformas-digitales-coronavirus-covid-19-5821245.html": {"file": "1724245424.html", "date": "2020-09-29"}, "http://www.radionacional.com.ar/comunidad-regional-de-calamuchita-rechazo-la-idea-de-una-capsula-turistica/": {"file": "1730863222.html", "date": "2020-06-10"}, "https://boingboing.net/2020/12/12/this-deep-funk-hanukkah-song-is-a-holiday-classic-in-the-making.html": {"file": "1795915731.html", "date": "2020-12-12"}, "https://www.washingtonpost.com/politics/federal-workers-are-returning-to-the-office-some-members-of-congress-say-they-shouldnt-be/2020/07/08/c3d22ec8-c151-11ea-b4f6-cb39cd8940fb_story.html": {"file": "1676872902.html", "date": "2020-07-09"}, "http://www.marketwatch.com/news/story.asp?guid=%7B49E8785A-F1C7-11EA-B8AA-ECF03EAB1839%7D&siteid=rss&rss=1": {"file": "1705296354.html", "date": "2020-09-08"}, "https://abc7ny.com/traffic/penn-station-to-close-overnight-for-cleaning/6144149/": {"file": "1594754655.html", "date": "2020-05-01"}, "http://feeds.mashable.com/~r/Mashable/~3/9SVJRKMUwTI/": {"file": "1526526251.html", "date": "2020-02-20"}, "https://www.seattlepi.com/sports/article/Tiz-the-Law-draws-No-17-post-as-3-5-Kentucky-15530833.php": {"file": "1699121936.html", "date": NaN}, "https://twitter.com/Reuters/status/1281836879789404160/photo/1": {"file": "1676646261.html", "date": "2020-07-11"}, "https://kesq.com/news/2020/05/14/mayor-of-coachella-explains-citys-decision-to-continue-requiring-face-coverings/": {"file": "1606865668.html", "date": "2020-05-14"}, "https://tucson.com/news/national/college-football-player-arrested-on-murder-charge-in-georgia/article_c7e4b901-9d60-5895-a288-73911df10bd3.html": {"file": "1725250200.html", "date": NaN}, "http://feeds.bizjournals.com/~r/industry_12/~3/_rJ5SC99V8E/after-two-weeks-chef-says-oggies.html": {"file": "1685765130.html", "date": "2020-08-17"}, "https://www.oann.com/protesters-gather-at-paris-theater-to-confront-macron-over-pension-reform/": {"file": "1498311133.html", "date": "2020-01-18"}, "https://timesofindia.indiatimes.com/india/farmers-protests-continue-for-eleventh-day-top-developments/articleshow/79591842.cms": {"file": "1789437552.html", "date": "2020-12-06"}, "https://kdvr.com/news/auroras-violent-crime-rate-ranks-3rd-out-of-colorados-ten-largest-cities/": {"file": "1731257347.html", "date": "2020-10-06"}, "https://www.breakingsoup.com/south-park-characters-fill-empty-seats-at-denver-broncos-games/": {"file": "1732261760.html", "date": NaN}, "https://www.hutchnews.com/ZZ/news/20201123/latest-germanys-curevac-signs-contract-for-new-vaccine?rssfeed=true": {"file": "1777898598.html", "date": "2020-11-23"}, "https://www.news18.com/news/business/rbi-prescribes-five-pillared-approach-guard-against-cybersecurity-threats-for-urban-co-op-banks-2906047.html": {"file": "1720285564.html", "date": "2020-08-24"}, "https://www.stuff.co.nz/national/crime/300145150/three-men-charged-for-alleged-bank-card-skimming-at-auckland-hospitals.html": {"file": "1753142980.html", "date": "2020-10-29"}, "https://economictimes.indiatimes.com/markets/stocks/news/share-market-update-psu-bank-shares-gain-canara-bank-rises-1br/articleshow/73540940.cms": {"file": "1502375798.html", "date": "2020-01-23"}, "http://rnanews.com/young-leaders-from-canada-fiji-pakistan-uganda-win-commonwealth-youth-awards-2020/": {"file": "1545738097.html", "date": "2020-03-11"}, "https://www.seattletimes.com/nation-world/the-quiet-hand-of-conservative-groups-in-the-anti-lockdown-protests/": {"file": "1587149671.html", "date": "2020-04-21"}, "https://au.news.yahoo.com/the-two-aussie-covid-measures-that-could-never-work-in-the-us-222249752.html": {"file": "1753036930.html", "date": "2020-10-28"}, "https://www.dailymail.co.uk/sport/football/article-8297417/Man-Utd-ace-Dean-Henderson-morally-right-finish-season-Sheff-Utd-Wilder.html?ns_mchannel=rss&ns_campaign=1490&ito=1490": {"file": "1599860291.html", "date": "2020-05-07"}, "https://www.google.com/imgres?imgurl=https://i.ebayimg.com/images/g/fjoAAOSwyGZaRXKK/s-l300.jpg&imgrefurl=https://www.ebay.com/itm/Retirement-Gift-Ideas-Retired-Definition-Funny-Retirement-Coffee-Mug-Tea-Cup-/132449557566&tbnid=QBu8niz350w2PM&vet=1&docid=CTb7OqAkXHkPUM&w=300&h=265&itg=1&q=retirement+definition&hl=en-US&source=sh/x/im": {"file": "1564637733.html", "date": NaN}, "https://www.inquirer.com/news/nation-world/us-state-department-blocks-lawsuit-by-american-imprisoned-tortured-in-egypt-20200718.html": {"file": "1666031133.html", "date": "2020-07-18"}, "https://www.slobodnaevropa.org/a/30657941.html": {"file": "1641303898.html", "date": "2020-06-22"}, "https://www.laprensalatina.com/uncertain-future-for-britains-essential-workers-after-brexit/": {"file": "1632107283.html", "date": "2020-06-11"}, "https://www.monroenews.com/ZZ/news/20200516/italy-seeks-to-boost-tourism-by-opening-borders-june-3?rssfeed=true": {"file": "1608581255.html", "date": "2020-05-16"}, "https://www.malaymail.com/news/sports/2020/04/08/2022-world-athletics-championships-set-for-july-15-24/1854874": {"file": "1572542522.html", "date": "2020-04-08"}, "http://city.udn.com/67926/6950016?ch=rss_ugccitynewpost": {"file": "1580240354.html", "date": "2020-04-17"}, "https://www.moneycontrol.com/news/business/goldman-sachs-says-india\u2019s-fy21-gdp-may-plummet-tomulti-decade-low16bleakest-forecast-so-far_13654421.html": {"file": "1572317591.html", "date": NaN}, "https://wiki.d-addicts.com/index.php?title=Park_Ye_Jin&diff=591343&oldid=588001": {"file": "1571322029.html", "date": "2020-04-07"}, "https://www.forbes.com/sites/marlamilling/2020/05/15/drunkorexia-on-the-rise-among-female-university-students/": {"file": "1608572427.html", "date": "2020-05-15"}, "https://thefrontierpost.com/two-newborns-die-for-want-of-oxygen-at-bhakkar-hospital/": {"file": "1535415671.html", "date": NaN}, "https://www.theargus.co.uk/news/18701511.woman-hurt-hit-car-station-street-eastbourne/?ref=rss": {"file": "1703831611.html", "date": "2020-09-06"}, "https://chicago.suntimes.com/2020/6/24/21302329/trump-judges-nominee-federal-senate": {"file": "1643960841.html", "date": "2020-06-24"}, "https://www.dln.com/newcorporations/details/ref_index/438057": {"file": "1687351353.html", "date": NaN}, "https://www.engadget.com/amazon-luxury-stores-fashion-140141502.html": {"file": "1711803974.html", "date": "2020-09-15"}, "https://www.philstar.com/showbiz/2020/11/27/2059828/abs-cbn-nagsalita-na-sa-paglayas-ni-bea": {"file": "1781170231.html", "date": "2020-11-27"}, "https://www.kut.org/post/local-attorney-andy-brown-will-be-democratic-nominee-county-judge": {"file": "1708999888.html", "date": "2020-08-16"}, "https://sanfrancisco.cbslocal.com/2020/09/16/55th-acm-awards-winners-list/": {"file": "1713311346.html", "date": "2020-08-16"}, "https://semissourinews.com/stories/544335905-total-oasdi-disabled-beneficiaries-in-missouri-zip-63848-remains-the-same-in-2019": {"file": "1693504805.html", "date": "2020-08-26"}, "https://zitrod.com/business/we-must-do-more-what-ceos-like-tim-cook-jamie-dimon-larry-fink-say-about-racial-inequality-protests/": {"file": "1627298010.html", "date": "2020-06-01"}, "http://feeds.bizjournals.com/~r/industry_20/~3/Imon3NQaB8c/shutting-down-tampa-bay-construction-during.html": {"file": "1566929327.html", "date": "2020-04-02"}, "http://rssfeeds.usatoday.com/~/620721596/0/usatoday-newstopstories~Hurricanes-in-a-pandemic-Absolutely-thats-our-nightmare-scenario/": {"file": "1566930342.html", "date": "2020-04-02"}, "https://www.recordonline.com/news/20200316/rockland-to-declare-local-state-of-emergency-on-monday?rssfeed=true": {"file": "1549555386.html", "date": "2020-03-16"}, "https://hypixel.net/threads/what-whered-it-go.2675645/": {"file": "1552005464.html", "date": "2020-03-18"}, "http://nationalpost.com/pmn/health-pmn/frances-macron-condemns-unilateral-border-control-measures-over-coronavirus": {"file": "1549542851.html", "date": "2020-03-16"}, "https://www.news18.com/news/india/suspended-aap-councillor-tahir-hussain-arrested-in-delhi-court-over-ib-staffers-murder-2538473.html": {"file": "1549579884.html", "date": "2020-03-16"}, "http://www.asiapacificstar.com/news/263700449/australian-megablaze-brought-under-control": {"file": "1493757507.html", "date": "2020-01-13"}, "https://www.realestate.com.au/news/live-in-your-own-jurassic-park-at-this-multimilliondollar-kenthurst-estate/?rsf=syn:news:nca:news:spa:strap": {"file": "1500330653.html", "date": "2020-01-22"}, "https://carnegieendowment.org/chinafinancialmarkets/79641": {"file": "1487058344.html", "date": "2019-08-06"}, "http://feeds.reuters.com/~r/reuters/businessNews/~3/UjOBluJTi0o/volkswagens-skoda-auto-2019-deliveries-dip-to-1-24-million-cars-due-to-weaker-sales-in-china-idUSKBN1ZC1DA": {"file": "1493638362.html", "date": NaN}, "https://www.96fm.ie/": {"file": "1630687900.html", "date": NaN}, "https://www.mirror.co.uk/sport/football/transfer-news/arsenal-set-pierre-emerick-aubameyang-22002407": {"file": "1602142656.html", "date": "2020-05-09"}, "https://www.businesstimes.com.sg/companies-markets/s232m-fair-value-loss-pushes-sph-into-the-red-for-first-time": {"file": "1738268651.html", "date": "2020-10-14"}, "https://nckansasnews.com/stories/567912132-mark-dings-donates-2-800-to-tracey-robert-mann-s-campaign-committee-in-september": {"file": "1793453954.html", "date": "2020-12-08"}, "https://whnt.com/news/don-trump-jr-tests-positive-for-coronavirus/": {"file": "1775625776.html", "date": "2020-11-20"}, "https://www.hindustantimes.com/india-news/odisha-artist-spreads-awareness-on-coronavirus-with-wall-paintings/story-zMoh0EOYcRzfnhXu6NBPnM.html": {"file": "1589455043.html", "date": "2020-04-26"}, "https://www.jstor.org/stable/2669240?origin=crossref": {"file": "1587353145.html", "date": NaN}, "https://azraelsmerryland.blogspot.com/2020/07/consumers-elevate-appeal-to-president.html": {"file": "1653190016.html", "date": "2020-07-05"}, "https://kiow.com/2020/10/26/absentee-ballots-are-slow-to-return/": {"file": "1749876905.html", "date": "2020-10-26"}, "https://www.zimeye.net/2020/03/23/coronavirus-doctors-threaten-to-down-tools-due-to-govt-unpreparedness/": {"file": "1556637845.html", "date": "2020-03-23"}, "https://www.registerguard.com/news/20200413/second-suspect-in-shooting-turns-himself-in?rssfeed=true": {"file": "1577551432.html", "date": "2020-04-13"}, "https://www.thestar.com/news/world/us/2020/03/23/ap-exclusive-allen-has-new-publisher-memoir-out-monday.html": {"file": "1556635225.html", "date": "2020-03-23"}, "https://www.urdupoint.com/en/world/russian-prime-minister-mikhail-mishustin-says-906868.html": {"file": "1592133028.html", "date": "2020-04-29"}, "https://www.couriermail.com.au/news/national/98yearold-wwii-veteran-beats-covid19-receives-ovation-from-hospital-staff/video/ca6ce285879e2291307f3fc8148670aa": {"file": "1588233953.html", "date": NaN}, "https://www.eastbaytimes.com/2020/04/24/joe-biden-predicts-trump-will-try-to-delay-elections/": {"file": "1588026728.html", "date": "2020-04-24"}, "https://www.eastbaytimes.com/2020/04/24/coronavirus-how-these-bay-area-travelers-got-stranded-in-bolivia/": {"file": "1588027676.html", "date": "2020-04-24"}, "https://sputniknews.com/radio_the_critical_hour/202004281079127690-some-us-states-begin-lifting-lockdowns-vp-pence-defiantly-tours-clinic-unmasked/": {"file": "1592175395.html", "date": "2020-04-28"}, "https://www.baltimoresun.com/maryland/howard/cng-ho-permits-public-hearing-20200817-hntmbtcnhfgwjlv7vompxm4nmu-story.html#ed=rss_www.baltimoresun.com/arcio/rss/category/latest/": {"file": "1685640081.html", "date": "2020-08-17"}, "https://globalnews.ca/news/6620622/syria-turkey-strikes-conflict/": {"file": "1536545428.html", "date": "2020-03-06"}, "http://rssfeeds.detroitnews.com/~/619980292/0/detroit/home~Men-chased-her-shot-her-at-front-door-now-reward-offered-for-slaying-suspects/": {"file": "1551271741.html", "date": "2020-03-17"}, "https://timesofindia.indiatimes.com/sports/cricket/ipl/live-blog/ipl-2020-live-cricket-score-chennai-super-kings-vs-sunrisers-hyderabad-match-14-dubai/liveblog/78447140.cms": {"file": "1727473717.html", "date": "2020-10-03"}, "https://timesofindia.indiatimes.com/city/bhubaneswar/odisha-reports-first-covid-19-death-72-year-old-man-from-bhubaneswar-dies/articleshow/75026800.cms": {"file": "1571147129.html", "date": "2020-04-07"}, "https://thewest.com.au/business/public-companies/caeneus-charges-up-its-exploration-tool-kit-at-mallina-c-1317422": {"file": "1711399465.html", "date": "2020-09-15"}, "http://optimussearch.com.ph/2020/06/02/no-membership-required-best-and-free-online-dating-websites-in-los-angeles/": {"file": "1677010256.html", "date": NaN}, "https://www.businesstoday.in/current/economy-politics/coronavirus-in-bihar-record-749-cases-in-24-hours-patna-other-districts-announce-lockdown-from-july-10/story/409327.html": {"file": "1656725247.html", "date": "2020-07-09"}, "https://theweek.com/speedreads/887020/trump-visited-trumpowned-golf-course-nearly-24-percent-days-2019": {"file": "1485102321.html", "date": "2020-01-02"}, "https://globalnews.ca/news/7278919/kamala-harris-fact-check-us-vice-president/": {"file": "1684405920.html", "date": "2020-09-15"}, "https://www.baltimoresun.com/opinion/columnists/zurawik/bs-ed-zontv-media-year-20201223-cnvrlhkhnrbihcxx6wxcxt2b7y-story.html#ed=rss_www.baltimoresun.com/arcio/rss/category/latest/": {"file": "1805697156.html", "date": "2020-12-23"}, "https://www.sfgate.com/news/article/New-anthology-collects-dozens-of-poems-about-15250468.php": {"file": "1598537220.html", "date": "2020-05-06"}, "https://bizwest.com/2020/03/11/loft-clothing-store-at-twenty-ninth-street-in-boulder-to-close/loft/": {"file": "1545286960.html", "date": "2020-03-11"}, "https://www.news.az/news/azerbaijan-launches-counteroffensive-to-restore-its-territorial-integrity-pakistani-envoy": {"file": "1791336003.html", "date": "2020-10-13"}, "https://upton.wickedlocal.com/news/20200924/battle-in-congress-to-replace-ruth-bader-ginsburg-is-dashing-hopes-for-covid-19-stimulus-package?rssfeed=true": {"file": "1720539463.html", "date": "2020-09-24"}, "http://www.haniotika-nea.gr/ton-epiasan-tin-ora-poy-prospathoyse-na-klepsei-aytokinita/": {"file": "1687581723.html", "date": "2020-09-24"}, "https://www.nzz.ch/international/neue-us-sanktionen-erschuettern-die-syrische-wirtschaft-ld.1560586": {"file": "1635095370.html", "date": "2020-06-17"}, "https://www.infranken.de/ueberregional/boulevard/kultur/tatort-aus-muenchen-30-jahre-leitmayr-und-batic-art-5139057": {"file": "1808293409.html", "date": "2020-12-27"}, "https://www.presseportal.de/blaulicht/pm/43526/4791887": {"file": "1798244877.html", "date": "2020-12-15"}, "https://sn.dk/Erhverv/Forstaerket-haab-om-737-MAX-erstatning-loefter-Norwegian-aktie/artikel/900368?rss": {"file": "1484494308.html", "date": "2020-01-02"}, "https://sn.dk/Danmark/Coronavirus-rammer-trafikken-i-Danmark/artikel/922379?rss": {"file": "1544178583.html", "date": "2020-03-10"}, "https://www.novinky.cz/vase-zpravy/clanek/janovicka-knihovna-zve-na-vystavu-o-nebezpecnem-zivobyti-prevadecu-na-sumave-40311673": {"file": "1509819104.html", "date": "2020-01-30"}, "https://www.elvallenc.cat/societat/vallsconfinat-continua-amb-mes-novetats/": {"file": "1556626357.html", "date": "2020-03-23"}, "https://bn.wikipedia.org/w/index.php?title=Jim_Higgs&diff=3986629&oldid=0": {"file": "1522651525.html", "date": "2020-02-18"}, "https://www.actualno.com/haskovo/obshtina-haskovo-osiguri-komputri-i-tableti-na-deca-v-socialni-obshtejitija-news_1509983.html": {"file": "1740081165.html", "date": "2020-10-15"}, "https://vratza.com/obshtina-b-vratsa-b-specheli-proekt-za-izgrazhdaneto-na-dopalnitelen-korpus-na/": {"file": "1766087391.html", "date": NaN}, "https://www.youm7.com/story/2020/8/24/\u0648\u0632\u064a\u0631-\u0627\u0644\u0631\u0649-\u064a\u0634\u0647\u062f-\u062a\u0648\u0642\u064a\u0639-\u0639\u0642\u062f-\u062f\u0631\u0627\u0633\u0629-\u062a\u062d\u062f\u064a\u062f-\u0627\u0644\u0633\u062d\u0628-\u0627\u0644\u0622\u0645\u0646-\u0644\u0644\u062e\u0632\u0627\u0646\u0627\u062a/4943459": {"file": "1691196862.html", "date": "2020-09-24"}, "https://www.alyaum.com/articles/6291787/\u0627\u0644\u0642\u0627\u0631\u0627\u062a-\u0627\u0644\u0633\u0628\u0639/\u0637\u0647\u0631\u0627\u0646-\u062a\u062f\u0641\u0646-\u0632\u0627\u062f\u0629-\u0648\u062a\u062a\u0647\u0645-\u0627\u0644\u0645\u0639\u0627\u0631\u0636\u0629-\u0627\u0644\u0625\u064a\u0631\u0627\u0646\u064a\u0629-\u0628\u0627\u063a\u062a\u064a\u0627\u0644\u0647": {"file": "1784190729.html", "date": "2020-01-12"}, "https://www.albayan.ae/across-the-uae/news-and-reports/2020-06-03-1.3874341": {"file": "1623715966.html", "date": "2020-06-03"}, "https://akhbarelyom.com/news/newdetails/3126898/1/36-\u0639\u0627\u0645\u064b\u0627..-\u0633\u0631-\u0631\u062d\u064a\u0644-\u0646\u0639\u064a\u0645\u0629-\u0639\u0627\u0643\u0641-\u0641\u064a-\u0633\u0646-\u0645\u0628\u0643\u0631": {"file": "1731695567.html", "date": "2020-10-07"}, "https://www.almadenahnews.com/article/825076-%D8%A3%D9%85%D9%8A%D8%B1%D9%83%D8%A7-50-%D8%A7%D9%84%D9%81%D8%A7-%D8%AD%D8%B5%D9%8A%D9%84%D8%A9-%D8%A7%D9%84%D9%88%D9%81%D9%8A%D8%A7%D8%AA-%D8%A8%D8%B3%D8%A8%D8%A8-%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7": {"file": "1588107198.html", "date": "2020-04-24"}, "https://arabic.sputniknews.com/arab_world/202003171044893040-%D9%85%D8%B5%D8%B1-%D8%AA%D8%B3%D8%AC%D9%84-30-%D8%A5%D8%B5%D8%A7%D8%A8%D8%A9-%D8%AC%D8%AF%D9%8A%D8%AF%D8%A9-%D9%88%D8%AD%D8%A7%D9%84%D8%AA%D9%8A-%D9%88%D9%81%D8%A7%D8%A9-%D8%A8%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7/": {"file": "1550981615.html", "date": "2020-03-17"}, "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/": {"file": "1806793639.html", "date": "2020-12-25"}, "https://www.beirutobserver.com/2020/11/2338761/": {"file": "1764731404.html", "date": "2020-11-09"}, "https://money.udn.com/money/story/5603/4909425": {"file": "1728970162.html", "date": "2020-10-04"}, "http://www.upmedia.mg/news_info.php?SerialNo=83141": {"file": "1546021647.html", "date": "2020-03-12"}, "https://news.ltn.com.tw/news/business/breakingnews/3119452": {"file": "1564888577.html", "date": "2020-04-01"}, "https://news.sina.com.tw/article/20200121/34046328.html": {"file": "1500260110.html", "date": "2020-01-21"}} \ No newline at end of file +{"https://zaxid.net/news/showNews.do?nastupnogo_tizhnya_na_ukrayinu_chekayut_anomalna_speka_ta_grozi&objectId=1503302": {"file": "1628285861.html", "date": "2020-06-07"}, "https://24tv.ua/yak-venediktova-zahishhala-nardepa-vid-slugi-narodu_n1419177": {"file": "1716869064.html", "date": "2020-09-21"}, "https://www.mynet.com/samsunda-sobadan-zehirlenen-2-cocuk-hastanelik-oldu-110106661445": {"file": "1783862633.html", "date": "2020-11-30"}, "http://www.detaykibris.com/yikilan-binalarkurtarma-calismalari-izmirden-goruntuler-2196g.htm": {"file": "1754856212.html", "date": "2020-10-30"}, "https://www.hatawtabloid.com/2020/02/18/aktres-sunod-sunuran-sa-aktor-bf/": {"file": "1523761669.html", "date": "2020-02-18"}, "http://auto-door16814.ttblogs.com/2028064/%E0%B8%9C-%E0%B8%9C%E0%B8%A5-%E0%B8%95-%E0%B9%81%E0%B8%A5%E0%B8%B0%E0%B8%88%E0%B8%B3%E0%B8%AB%E0%B8%99-%E0%B8%B2%E0%B8%A2%E0%B9%82%E0%B8%8B-%E0%B8%AD-%E0%B8%95%E0%B8%AA%E0%B8%B2%E0%B8%AB%E0%B8%81%E0%B8%A3%E0%B8%A3%E0%B8%A1%E0%B8%97-%E0%B8%81%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%A0%E0%B8%97-%E0%B8%97-%E0%B8%81%E0%B8%8A%E0%B8%99-%E0%B8%94": {"file": "1799801548.html", "date": NaN}, "https://www.hbl.fi/artikel/bbc-premier-league-omstart-17-juli/": {"file": "1619092544.html", "date": "2020-05-28"}, "https://avaz.ba/kantoni/republika-srpska/604722/generalni-direktor-eprs-pozitivan-na-koronavirus": {"file": "1750355638.html", "date": "2020-10-26"}, "https://www.koha.net/kronike-e-zeze/233165/arrestohen-dy-persona-ne-peje-per-organizim-te-lojerave-te-fatit/": {"file": "1682093320.html", "date": "2020-08-13"}, "http://www.standard.al/2020/04/15/dite-zie-shenohen-1-mije-e-438-viktima-nga-covid-19-ne-24-oret-e-fundit-ne-france/": {"file": "1579090372.html", "date": "2020-04-15"}, "https://oren.mk.ru/social/2020/02/18/orenburgskim-chinovnikam-kupyat-eshhe-pyat-avtomobiley.html": {"file": "1523821409.html", "date": "2020-02-18"}, "https://www.inform.kz/ru/polnost-yu-obespechit-region-myasom-pticy-namereny-v-sko_a3725092": {"file": "1784458132.html", "date": "2020-12-01"}, "https://www.ukrinform.ru/rubric-kyiv/3103218-policia-napravila-delo-minera-stolicnogo-metro-v-sud.html": {"file": "1716561772.html", "date": "2020-09-20"}, "https://rzn.mk.ru/social/2020/01/13/v-ryazani-nagradili-luchshikh-zhurnalistov.html": {"file": "1493663910.html", "date": "2020-01-13"}, "https://www.mk.ru/politics/2020/09/23/prichinoy-taynoy-inauguracii-lukashenko-stal-styd.html": {"file": "1719330438.html", "date": "2020-09-23"}, "https://tass.ru/moskva/8836033": {"file": "1647262025.html", "date": "2020-06-28"}, "https://www.mk.ru/social/2020/12/29/prisyazhnye-opravdali-muzhchinu-kotoryy-obvinyalsya-v-ubiystve-aktivista-v-serpukhove.html": {"file": "1810559448.html", "date": "2020-12-29"}, "https://aif.ru/politics/world/otrezat_golovu_vo_imya_proroka_terakt_vo_francii_obnazhil_starye_problemy": {"file": "1742998174.html", "date": "2020-10-19"}, "https://www.unn.com.ua/ru/news/1906059-parlament-moldovi-z-podachi-dodona-obmezhiv-povnovazhennya-sandu-sche-do-yiyi-inavguratsiyi": {"file": "1786902045.html", "date": "2020-12-03"}, "https://news.yam.md/ro/story/10939686": {"file": "1716324024.html", "date": "2020-09-20"}, "https://www.noticiaexata.com.br/artigo/brasileiro-goias-supera-palmeiras-com-gol-no-apagar-das-luzes": {"file": "1777110076.html", "date": "2020-11-22"}, "http://feedproxy.google.com/~r/PublicoRSS/~3/iyen1-bXCVc/covid19-trump-ja-nao-corre-risco-infectar-terceiros-medico-casa-branca-1934765": {"file": "1735359864.html", "date": "2020-10-11"}, "https://economia.ig.com.br/2020-11-26/agronegocio-quer-salvar-relacao-com-a-china-apos-acusacoes-de-eduardo-bolsonaro.html": {"file": "1781093249.html", "date": "2020-11-26"}, "https://noticias.uol.com.br/ultimas-noticias/reuters/2020/03/14/sancoes-dos-eua-dificultam-severamente-luta-do-ira-contra-coronavirus-diz-rouhani.htm": {"file": "1548435131.html", "date": "2020-03-14"}, "https://opapel.com/quintal-dentro-do-ape-tres-dicas-para-integrar-a-varanda-a-sala-de-estar/": {"file": "1756260927.html", "date": "2020-11-01"}, "https://www.uol.com.br/esporte/futebol/ultimas-noticias/2020/05/21/witzel-ignora-decreto-e-diz-que-volta-e-de-responsabilidade-dos-clubes.htm": {"file": "1613345316.html", "date": "2020-05-21"}, "https://www.otempo.com.br/diversao/benjamin-moser-fala-sobre-a-autora-que-o-conquistou-aos-19-anos-1.2423262": {"file": "1793766917.html", "date": "2020-12-10"}, "https://revistaforum.com.br/brasil/tuca-almeida-do-the-voice-kids-morre-baleado/": {"file": "1594316187.html", "date": "2020-05-01"}, "https://g1.globo.com/ro/rondonia/noticia/2020/03/08/estudante-faz-vaquinha-virtual-para-tratamento-de-pitbull-abandonado-com-deficiencia.ghtml": {"file": "1542250510.html", "date": "2020-03-08"}, "https://www.uol.com.br/carros/videos/2020/06/23/comprador-capota-vw-polo-retirando-carro-da-concessionaria-assista.htm": {"file": "1643070775.html", "date": "2020-06-23"}, "https://trojmiasto.wyborcza.pl/trojmiasto/7,35612,25651465,jest-nowy-prezes-grupy-lotos-ma-doswiadczenie-na-kierowniczych.html": {"file": "1511056469.html", "date": "2020-01-31"}, "https://tvn24.pl/swiat/koronawirus-w-chile-po-zniesieniu-obostrzen-tlumy-ludzi-w-sklepach-i-dlugie-kolejki-4668753?source=rss": {"file": "1686918031.html", "date": "2020-08-19"}, "http://alarmeringen.nl/zuid-holland/haaglanden/den-haag/34192126/p2000-ambulance-met-spoed-naar-hoge-zand-in-den-haag.html?utm_source=rss&utm_medium=nederland&utm_campaign=sharing": {"file": "1733850206.html", "date": "2020-10-09"}, "https://news.google.com/__i/rss/rd/articles/CBMibWh0dHBzOi8vd3d3Lm51Lm5sL2Zvcm11bGUtMS82MDgwMDIxL3ZlcnN0YXBwZW4tbm90ZWVydC16ZXNkZS10aWpkLWluLWRlcmRlLXRyYWluaW5nLXJ1c3Npc2NoZS1ncmFuZC1wcml4Lmh0bWzSAWxodHRwczovL3d3dy5udS5ubC9mb3JtdWxlLTEvNjA4MDAyMS92ZXJzdGFwcGVuLW5vdGVlcnQtemVzZGUtdGlqZC1pbi1kZXJkZS10cmFpbmluZy1ydXNzaXNjaGUtZ3JhbmQtcHJpeC5hbXA?oc=5": {"file": "1721884119.html", "date": "2020-09-26"}, "https://www.lrt.lt/naujienos/verslas/4/1261660/finansu-ministerija-vidaus-rinkoje-pasiskolino-30-mln-euru": {"file": "1750014643.html", "date": "2020-10-26"}, "https://www.noz.de/lokales/westerkappeln/artikel/1994799/kindergottesdienst-in-velpe-macht-teilnehmern-und-betreuerinnen-spass": {"file": "1518961308.html", "date": "2020-02-11"}, "http://m.daejonilbo.com/mnews.asp?pk_no=1412361": {"file": "1538026996.html", "date": "2020-03-04"}, "https://news.chosun.com/site/data/html_dir/2020/08/27/2020082704301.html": {"file": "1694328145.html", "date": "2020-08-27"}, "https://www.youtube.com/watch?v=Lfw9H63g0OQ": {"file": "1577247412.html", "date": "2020-04-13"}, "https://news.biglobe.ne.jp/entertainment/0215/ori_200215_7568645051.html": {"file": "1521513101.html", "date": "2020-02-15"}, "https://prtimes.jp/main/html/rd/p/000000232.000009812.html": {"file": "1662376572.html", "date": "2020-07-15"}, "https://prtimes.jp/main/html/rd/p/000000088.000029713.html": {"file": "1649432346.html", "date": "2020-07-01"}, "http://oshiete.goo.ne.jp/qa/11763760.html": {"file": "1660114516.html", "date": "2020-07-13"}, "https://blog.goo.ne.jp/umaichi_news/e/52743e20f825567d4e9889be58ec06b9": {"file": "1660064054.html", "date": "2020-07-12"}, "https://blog.goo.ne.jp/jgccg115/e/5d8c31a659b95cc18a43ad75d152e80f": {"file": "1651753632.html", "date": "2020-07-03"}, "https://www.israelhayom.co.il/article/791661": {"file": "1685121845.html", "date": "2020-08-17"}, "https://www.lagazzettadelmezzogiorno.it/news/mondo/1265473/california-certifica-voto-biden-oltre-quorum-270-elettori.html": {"file": "1789111868.html", "date": "2020-12-06"}, "https://www.edilportale.com/news/2020/09/informatica/quando-la-stampante-rende-piu-smart-il-lavoro-del-progettista_78527_10.html": {"file": "1720037955.html", "date": "2020-09-24"}, "https://www.ilmattino.it/primopiano/sanita/isolamento_gli_urologi_uomini_la_pigrizia_danneggia_la_prostata_in_casa_allenatevi_cosi-5180477.html": {"file": "1582946828.html", "date": "2020-04-19"}, "http://www.ansa.it/sito/notizie/sport/calcio/2020/07/28/ghersini-dirige-cagliari-juve-massimi-lazio-brescia_c708a8fb-c2d2-4a9a-b0f3-05b8cc98389d.html": {"file": "1671960661.html", "date": "2020-07-28"}, "https://www.tribunnews.com/pendidikan/2020/05/04/jawaban-soal-apa-dampak-negatif-jika-menunda-pekerjaan-belajar-dari-rumah-sma-di-tvri": {"file": "1595907782.html", "date": "2020-05-04"}, "https://a1plus.am/hy/article/378866": {"file": "1711609490.html", "date": "2020-09-15"}, "https://news.am/arm/news/615163.html": {"file": "1778318045.html", "date": "2020-11-24"}, "https://www.sonline.hu/orszag-vilag/sokan-visszaallitanak-a-tortenelmi-magyarorszagot-2866608/": {"file": "1625266707.html", "date": "2020-06-04"}, "https://www.baon.hu/eletstilus/gyogyulas-utan-is-kiserheti-kronikus-faradtsag-es-poszttraumas-stressz-a-koronavirust-2928554/": {"file": "1686705395.html", "date": "2020-08-18"}, "https://hindi.business-standard.com//storypage.php?autono=172475": {"file": "1725202846.html", "date": "2020-09-29"}, "https://www.amarujala.com/uttar-pradesh/varanasi/gahu-city-news-vns5205298178?utm_source=rssfeed&utm_medium=Referral&utm_campaign=rssfeed": {"file": "1579139187.html", "date": "2020-04-16"}, "https://hindi.business-standard.com//storypage.php?autono=166607": {"file": "1521292139.html", "date": "2020-02-14"}, "https://www.jagran.com/uttar-pradesh/allahabad-city-21109565.html": {"file": "1782160236.html", "date": "2020-11-28"}, "https://www.divyabhaskar.co.in/local/gujarat/vadodara/news/chhetu-patel-a-resident-of-the-united-states-died-due-to-corona-wife-under-treatment-127088788.html": {"file": "1565998355.html", "date": "2020-04-02"}, "https://www.divyabhaskar.co.in/local/gujarat/rajkot/news/people-who-are-scared-of-corona-call-and-say-i-see-corona-in-my-hand-and-foot-127064846.html": {"file": "1561643340.html", "date": "2020-03-28"}, "https://www.lexpress.fr/actualites/1/actualite/angleterre-manchester-united-rate-la-marche-arsenal-revit_2141546.html": {"file": "1808011241.html", "date": "2020-12-26"}, "http://ici.radio-canada.ca/nouvelle/1728875/transport-scolaire-ottawa-mi-septembre-covid": {"file": "1691900941.html", "date": "2020-08-24"}, "https://www.vosgesmatin.fr/edition-la-plaine/2020/03/23/incendie-dans-une-maison-cinq-personnes-relogees": {"file": "1556056486.html", "date": "2020-03-23"}, "https://www.guineenews.org/colonel-barry-accuse-pour-vol-aggrave-vers-la-projection-de-la-video-de-toute-la-verite/": {"file": "1759902081.html", "date": "2020-11-04"}, "https://www.sudinfo.be/id300074/article/2020-12-24/il-brise-le-couvre-feu-et-est-surpris-au-volant-23h30-mellet-je-me-fiche-pas-mal": {"file": "1806102460.html", "date": "2020-12-24"}, "https://actu.fr/societe/coronavirus/solidarite-centre-hospitalier-cote-basque-lance-appel-dons-entreprises-particuliers_32590953.html": {"file": "1560424903.html", "date": "2020-03-27"}, "http://www.republicoftogo.com//Toutes-les-rubriques/Sport/Le-championnat-d-Afrique-des-Nations-n-aura-pas-lieu": {"file": "1551001635.html", "date": "2020-03-17"}, "https://yle.fi/uutiset/3-11523428?origin=rss": {"file": "1699747711.html", "date": "2020-09-02"}, "https://www.khabaronline.ir/news/1353889/\u062a\u062d\u0644\u06cc\u0644-\u0631\u0648\u0632\u0646\u0627\u0645\u0647-\u0627\u0635\u0648\u0644\u06af\u0631\u0627-\u0627\u0632-\u062f\u0639\u0648\u062a-\u0627\u0635\u0644\u0627\u062d-\u0637\u0644\u0628\u0627\u0646-\u0628\u0647-\u062d\u0636\u0648\u0631-\u0645\u0631\u062f\u0645-\u062f\u0631-\u0627\u0646\u062a\u062e\u0627\u0628\u0627\u062a": {"file": "1523806243.html", "date": "2020-02-18"}, "https://www.yjc.ir/fa/news/7349926/\u0627\u0632-\u06a9\u0634\u0641-\u06f7\u06f2-\u062f\u0633\u062a\u06af\u0627\u0647-\u0645\u0648\u062a\u0648\u0631-\u0642\u0627\u0686\u0627\u0642-\u062f\u0631-\u0645\u0647\u0631\u06cc\u0632-\u062a\u0627-\u062f\u0633\u062a\u06af\u06cc\u0631\u06cc-\u0633\u0627\u0631\u0642-\u06f1\u06f0\u06f0-\u0645\u06cc\u0644\u06cc\u0648\u0646-\u0631\u06cc\u0627\u0644\u06cc-\u0637\u0644\u0627\u062c\u0627\u062a-\u0645\u0646\u0632\u0644-\u062f\u0631-\u0628\u0627\u0641\u0642": {"file": "1602557452.html", "date": "2020-05-10"}, "http://www.aryanews.com/News/120200622120908039/\u0648\u0631\u0648\u062f-50-\u0647\u0632\u0627\u0631-\u0645\u06cc\u0644\u06cc\u0627\u0631\u062f-\u062a\u0648\u0645\u0627\u0646-\u0646\u0642\u062f\u06cc\u0646\u06af\u06cc-\u0628\u0647-\u0628\u0648\u0631\u0633-\u062f\u0631-3-\u0645\u0627\u0647": {"file": "1641304459.html", "date": "2020-07-02"}, "https://laprensafl.com/2020/02/17/tenemos-silvia-pinal-para-rato-alejandra-guzman-habla-del-estado-de-salud-de-su-mama/": {"file": "1523780090.html", "date": "2020-02-17"}, "https://listindiario.com/el-deporte/2020/12/22/649384/los-grandes-ligas-en-la-lidom": {"file": "1804096113.html", "date": "2020-12-22"}, "http://bohemia.cu/nacionales/2020/03/adoptan-medidas-organizativas-en-la-habana-para-la-venta-de-alimentos/": {"file": "1561086382.html", "date": "2020-03-27"}, "https://www.diariolibre.com/actualidad/internacional/con-silencio-y-partidos-fantasma-se-reanuda-futbol-aleman-EI18893744": {"file": "1608578024.html", "date": "2020-05-16"}, "http://feedproxy.google.com/~r/NoticiaAlDia/~3/_k8dJ5xnwYw/": {"file": "1579141678.html", "date": "2020-04-15"}, "https://www.eldia.com/nota/2020-4-15-16-5-0-en-ruta-36-y-520-activan-protocolo-de-emergencia-en-un-colectivo-de-la-linea-oeste-la-ciudad": {"file": "1579111612.html", "date": "2020-04-15"}, "https://www.lainformacion.com/mundo/opositores-partidarios-lukashenko-culminan-dias-tension-marchas/2812881/": {"file": "1684898837.html", "date": "2020-08-16"}, "https://www.noticierodigital.com/2020/10/borrell-no-aplazar-las-parlamentarias-empeorara-la-situacion-en-venezuela/": {"file": "1731743891.html", "date": "2020-10-07"}, "https://www.la-prensa.com.mx/republica/decomisa-aduana-de-tijuana-mas-de-730-mil-dolares-en-efectivo-5050444.html": {"file": "1566936994.html", "date": "2020-04-02"}, "https://junin24.com/194420/tres-muertos-en-un-choque-frontal-en-ruta-188.html": {"file": "1506220874.html", "date": "2020-01-27"}, "http://www.radionacional.com.ar/intendente-de-pilar-encontramos-obras-paralizadas-y-calles-derrumbadas/": {"file": "1493765064.html", "date": "2020-01-13"}, "https://www.farodevigo.es/deportes/2020/07/05/andres-iniesta-recuerdos-son-magicos/2309843.html?utm_source=rss": {"file": "1653186688.html", "date": "2020-07-05"}, "http://www.andaluciainformacion.es/andalucia/895957/imbroda-revela-que-padecio-y-supero-el-coronavirus-el-pasado-marzo/": {"file": "1598244597.html", "date": "2020-05-06"}, "https://www.elsoldesanjuandelrio.com.mx/local/pescadores-gestionaran-crias-de-peces-5807069.html": {"file": "1721965295.html", "date": "2020-09-25"}, "https://www.elsoldemazatlan.com.mx/finanzas/precio-del-petroleo-mexicano-cae-a-un-minimo-de-18-anos-4982125.html": {"file": "1551294044.html", "date": "2020-03-17"}, "http://www.telepinar.cu/licenciados-en-educacion-primaria-en-consolacion-del-sur-fotos-y-video/": {"file": "1670377569.html", "date": "2020-07-23"}, "https://larazon.pe/faenon-de-toledo-y-grana-le-costo-s-1400-millones-al-estado-peruano/": {"file": "1718257353.html", "date": "2020-09-22"}, "https://diariodelsur.com.co/noticias/deportes/f%C3%BAtbol/el-primero-en-hablar-sorprendente-despedida-de-juan-guillerm-647581": {"file": "1789664619.html", "date": "2020-12-06"}, "http://www.soychile.cl/Puerto-Montt/Deportes/2020/08/24/670291/Congresos-y-seminarios-sobre-actividad-fisica-y-salud-se-transmitiran-desde-Puerto-Montt.aspx": {"file": "1691780201.html", "date": "2020-08-24"}, "https://www.lavozdelafrontera.com.mx/gossip/luis-miguel-y-jose-jose-entre-la-musica-que-sono-en-la-pandemia-plataformas-digitales-coronavirus-covid-19-5821245.html": {"file": "1724245424.html", "date": "2020-09-29"}, "http://www.radionacional.com.ar/comunidad-regional-de-calamuchita-rechazo-la-idea-de-una-capsula-turistica/": {"file": "1730863222.html", "date": "2020-10-06"}, "https://boingboing.net/2020/12/12/this-deep-funk-hanukkah-song-is-a-holiday-classic-in-the-making.html": {"file": "1795915731.html", "date": "2020-12-12"}, "https://www.washingtonpost.com/politics/federal-workers-are-returning-to-the-office-some-members-of-congress-say-they-shouldnt-be/2020/07/08/c3d22ec8-c151-11ea-b4f6-cb39cd8940fb_story.html": {"file": "1676872902.html", "date": "2020-07-09"}, "http://www.marketwatch.com/news/story.asp?guid=%7B49E8785A-F1C7-11EA-B8AA-ECF03EAB1839%7D&siteid=rss&rss=1": {"file": "1705296354.html", "date": "2020-09-08"}, "https://abc7ny.com/traffic/penn-station-to-close-overnight-for-cleaning/6144149/": {"file": "1594754655.html", "date": "2020-05-01"}, "http://feeds.mashable.com/~r/Mashable/~3/9SVJRKMUwTI/": {"file": "1526526251.html", "date": "2020-02-20"}, "https://www.seattlepi.com/sports/article/Tiz-the-Law-draws-No-17-post-as-3-5-Kentucky-15530833.php": {"file": "1699121936.html", "date": "2020-09-01"}, "https://twitter.com/Reuters/status/1281836879789404160/photo/1": {"file": "1676646261.html", "date": "2020-07-11"}, "https://kesq.com/news/2020/05/14/mayor-of-coachella-explains-citys-decision-to-continue-requiring-face-coverings/": {"file": "1606865668.html", "date": "2020-05-14"}, "https://tucson.com/news/national/college-football-player-arrested-on-murder-charge-in-georgia/article_c7e4b901-9d60-5895-a288-73911df10bd3.html": {"file": "1725250200.html", "date": "2020-09-30"}, "http://feeds.bizjournals.com/~r/industry_12/~3/_rJ5SC99V8E/after-two-weeks-chef-says-oggies.html": {"file": "1685765130.html", "date": "2020-08-17"}, "https://www.oann.com/protesters-gather-at-paris-theater-to-confront-macron-over-pension-reform/": {"file": "1498311133.html", "date": "2020-01-18"}, "https://timesofindia.indiatimes.com/india/farmers-protests-continue-for-eleventh-day-top-developments/articleshow/79591842.cms": {"file": "1789437552.html", "date": "2020-12-06"}, "https://kdvr.com/news/auroras-violent-crime-rate-ranks-3rd-out-of-colorados-ten-largest-cities/": {"file": "1731257347.html", "date": "2020-10-06"}, "https://www.breakingsoup.com/south-park-characters-fill-empty-seats-at-denver-broncos-games/": {"file": "1732261760.html", "date": "2020-09-28"}, "https://www.hutchnews.com/ZZ/news/20201123/latest-germanys-curevac-signs-contract-for-new-vaccine?rssfeed=true": {"file": "1777898598.html", "date": "2020-11-23"}, "https://www.news18.com/news/business/rbi-prescribes-five-pillared-approach-guard-against-cybersecurity-threats-for-urban-co-op-banks-2906047.html": {"file": "1720285564.html", "date": "2020-09-24"}, "https://www.stuff.co.nz/national/crime/300145150/three-men-charged-for-alleged-bank-card-skimming-at-auckland-hospitals.html": {"file": "1753142980.html", "date": "2020-10-29"}, "https://economictimes.indiatimes.com/markets/stocks/news/share-market-update-psu-bank-shares-gain-canara-bank-rises-1br/articleshow/73540940.cms": {"file": "1502375798.html", "date": "2020-01-23"}, "http://rnanews.com/young-leaders-from-canada-fiji-pakistan-uganda-win-commonwealth-youth-awards-2020/": {"file": "1545738097.html", "date": "2020-03-11"}, "https://www.seattletimes.com/nation-world/the-quiet-hand-of-conservative-groups-in-the-anti-lockdown-protests/": {"file": "1587149671.html", "date": "2020-04-21"}, "https://au.news.yahoo.com/the-two-aussie-covid-measures-that-could-never-work-in-the-us-222249752.html": {"file": "1753036930.html", "date": "2020-10-28"}, "https://www.dailymail.co.uk/sport/football/article-8297417/Man-Utd-ace-Dean-Henderson-morally-right-finish-season-Sheff-Utd-Wilder.html?ns_mchannel=rss&ns_campaign=1490&ito=1490": {"file": "1599860291.html", "date": "2020-05-07"}, "https://www.google.com/imgres?imgurl=https://i.ebayimg.com/images/g/fjoAAOSwyGZaRXKK/s-l300.jpg&imgrefurl=https://www.ebay.com/itm/Retirement-Gift-Ideas-Retired-Definition-Funny-Retirement-Coffee-Mug-Tea-Cup-/132449557566&tbnid=QBu8niz350w2PM&vet=1&docid=CTb7OqAkXHkPUM&w=300&h=265&itg=1&q=retirement+definition&hl=en-US&source=sh/x/im": {"file": "1564637733.html", "date": NaN}, "https://www.inquirer.com/news/nation-world/us-state-department-blocks-lawsuit-by-american-imprisoned-tortured-in-egypt-20200718.html": {"file": "1666031133.html", "date": "2020-07-18"}, "https://www.slobodnaevropa.org/a/30657941.html": {"file": "1641303898.html", "date": "2020-06-22"}, "https://www.laprensalatina.com/uncertain-future-for-britains-essential-workers-after-brexit/": {"file": "1632107283.html", "date": "2020-06-11"}, "https://www.monroenews.com/ZZ/news/20200516/italy-seeks-to-boost-tourism-by-opening-borders-june-3?rssfeed=true": {"file": "1608581255.html", "date": "2020-05-16"}, "https://www.malaymail.com/news/sports/2020/04/08/2022-world-athletics-championships-set-for-july-15-24/1854874": {"file": "1572542522.html", "date": "2020-04-08"}, "http://city.udn.com/67926/6950016?ch=rss_ugccitynewpost": {"file": "1580240354.html", "date": "2020-04-17"}, "https://www.moneycontrol.com/news/business/goldman-sachs-says-india\u2019s-fy21-gdp-may-plummet-tomulti-decade-low16bleakest-forecast-so-far_13654421.html": {"file": "1572317591.html", "date": "2020-04-08"}, "https://wiki.d-addicts.com/index.php?title=Park_Ye_Jin&diff=591343&oldid=588001": {"file": "1571322029.html", "date": "2020-04-07"}, "https://www.forbes.com/sites/marlamilling/2020/05/15/drunkorexia-on-the-rise-among-female-university-students/": {"file": "1608572427.html", "date": "2020-05-15"}, "https://thefrontierpost.com/two-newborns-die-for-want-of-oxygen-at-bhakkar-hospital/": {"file": "1535415671.html", "date": "2020-03-01"}, "https://www.theargus.co.uk/news/18701511.woman-hurt-hit-car-station-street-eastbourne/?ref=rss": {"file": "1703831611.html", "date": "2020-09-06"}, "https://chicago.suntimes.com/2020/6/24/21302329/trump-judges-nominee-federal-senate": {"file": "1643960841.html", "date": "2020-06-24"}, "https://www.dln.com/newcorporations/details/ref_index/438057": {"file": "1687351353.html", "date": NaN}, "https://www.engadget.com/amazon-luxury-stores-fashion-140141502.html": {"file": "1711803974.html", "date": "2020-09-15"}, "https://www.philstar.com/showbiz/2020/11/27/2059828/abs-cbn-nagsalita-na-sa-paglayas-ni-bea": {"file": "1781170231.html", "date": "2020-11-27"}, "https://www.kut.org/post/local-attorney-andy-brown-will-be-democratic-nominee-county-judge": {"file": "1708999888.html", "date": "2020-08-16"}, "https://sanfrancisco.cbslocal.com/2020/09/16/55th-acm-awards-winners-list/": {"file": "1713311346.html", "date": "2020-09-16"}, "https://semissourinews.com/stories/544335905-total-oasdi-disabled-beneficiaries-in-missouri-zip-63848-remains-the-same-in-2019": {"file": "1693504805.html", "date": "2020-08-26"}, "https://zitrod.com/business/we-must-do-more-what-ceos-like-tim-cook-jamie-dimon-larry-fink-say-about-racial-inequality-protests/": {"file": "1627298010.html", "date": "2020-06-01"}, "http://feeds.bizjournals.com/~r/industry_20/~3/Imon3NQaB8c/shutting-down-tampa-bay-construction-during.html": {"file": "1566929327.html", "date": "2020-04-02"}, "http://rssfeeds.usatoday.com/~/620721596/0/usatoday-newstopstories~Hurricanes-in-a-pandemic-Absolutely-thats-our-nightmare-scenario/": {"file": "1566930342.html", "date": "2020-04-02"}, "https://www.recordonline.com/news/20200316/rockland-to-declare-local-state-of-emergency-on-monday?rssfeed=true": {"file": "1549555386.html", "date": "2020-03-16"}, "https://hypixel.net/threads/what-whered-it-go.2675645/": {"file": "1552005464.html", "date": "2020-03-18"}, "http://nationalpost.com/pmn/health-pmn/frances-macron-condemns-unilateral-border-control-measures-over-coronavirus": {"file": "1549542851.html", "date": "2020-03-16"}, "https://www.news18.com/news/india/suspended-aap-councillor-tahir-hussain-arrested-in-delhi-court-over-ib-staffers-murder-2538473.html": {"file": "1549579884.html", "date": "2020-03-16"}, "http://www.asiapacificstar.com/news/263700449/australian-megablaze-brought-under-control": {"file": "1493757507.html", "date": "2020-01-13"}, "https://www.realestate.com.au/news/live-in-your-own-jurassic-park-at-this-multimilliondollar-kenthurst-estate/?rsf=syn:news:nca:news:spa:strap": {"file": "1500330653.html", "date": "2020-01-22"}, "https://carnegieendowment.org/chinafinancialmarkets/79641": {"file": "1487058344.html", "date": "2019-08-06"}, "http://feeds.reuters.com/~r/reuters/businessNews/~3/UjOBluJTi0o/volkswagens-skoda-auto-2019-deliveries-dip-to-1-24-million-cars-due-to-weaker-sales-in-china-idUSKBN1ZC1DA": {"file": "1493638362.html", "date": "2020-01-13"}, "https://www.96fm.ie/": {"file": "1630687900.html", "date": NaN}, "https://www.mirror.co.uk/sport/football/transfer-news/arsenal-set-pierre-emerick-aubameyang-22002407": {"file": "1602142656.html", "date": "2020-05-09"}, "https://www.businesstimes.com.sg/companies-markets/s232m-fair-value-loss-pushes-sph-into-the-red-for-first-time": {"file": "1738268651.html", "date": "2020-10-14"}, "https://nckansasnews.com/stories/567912132-mark-dings-donates-2-800-to-tracey-robert-mann-s-campaign-committee-in-september": {"file": "1793453954.html", "date": "2020-12-08"}, "https://whnt.com/news/don-trump-jr-tests-positive-for-coronavirus/": {"file": "1775625776.html", "date": "2020-11-20"}, "https://www.hindustantimes.com/india-news/odisha-artist-spreads-awareness-on-coronavirus-with-wall-paintings/story-zMoh0EOYcRzfnhXu6NBPnM.html": {"file": "1589455043.html", "date": "2020-04-26"}, "https://www.jstor.org/stable/2669240?origin=crossref": {"file": "1587353145.html", "date": NaN}, "https://azraelsmerryland.blogspot.com/2020/07/consumers-elevate-appeal-to-president.html": {"file": "1653190016.html", "date": "2020-07-05"}, "https://kiow.com/2020/10/26/absentee-ballots-are-slow-to-return/": {"file": "1749876905.html", "date": "2020-10-26"}, "https://www.zimeye.net/2020/03/23/coronavirus-doctors-threaten-to-down-tools-due-to-govt-unpreparedness/": {"file": "1556637845.html", "date": "2020-03-23"}, "https://www.registerguard.com/news/20200413/second-suspect-in-shooting-turns-himself-in?rssfeed=true": {"file": "1577551432.html", "date": "2020-04-13"}, "https://www.thestar.com/news/world/us/2020/03/23/ap-exclusive-allen-has-new-publisher-memoir-out-monday.html": {"file": "1556635225.html", "date": "2020-03-23"}, "https://www.urdupoint.com/en/world/russian-prime-minister-mikhail-mishustin-says-906868.html": {"file": "1592133028.html", "date": "2020-04-29"}, "https://www.couriermail.com.au/news/national/98yearold-wwii-veteran-beats-covid19-receives-ovation-from-hospital-staff/video/ca6ce285879e2291307f3fc8148670aa": {"file": "1588233953.html", "date": "2020-04-24"}, "https://www.eastbaytimes.com/2020/04/24/joe-biden-predicts-trump-will-try-to-delay-elections/": {"file": "1588026728.html", "date": "2020-04-24"}, "https://www.eastbaytimes.com/2020/04/24/coronavirus-how-these-bay-area-travelers-got-stranded-in-bolivia/": {"file": "1588027676.html", "date": "2020-04-24"}, "https://sputniknews.com/radio_the_critical_hour/202004281079127690-some-us-states-begin-lifting-lockdowns-vp-pence-defiantly-tours-clinic-unmasked/": {"file": "1592175395.html", "date": "2020-04-28"}, "https://www.baltimoresun.com/maryland/howard/cng-ho-permits-public-hearing-20200817-hntmbtcnhfgwjlv7vompxm4nmu-story.html#ed=rss_www.baltimoresun.com/arcio/rss/category/latest/": {"file": "1685640081.html", "date": "2020-08-17"}, "https://globalnews.ca/news/6620622/syria-turkey-strikes-conflict/": {"file": "1536545428.html", "date": "2020-03-02"}, "http://rssfeeds.detroitnews.com/~/619980292/0/detroit/home~Men-chased-her-shot-her-at-front-door-now-reward-offered-for-slaying-suspects/": {"file": "1551271741.html", "date": "2020-03-17"}, "https://timesofindia.indiatimes.com/sports/cricket/ipl/live-blog/ipl-2020-live-cricket-score-chennai-super-kings-vs-sunrisers-hyderabad-match-14-dubai/liveblog/78447140.cms": {"file": "1727473717.html", "date": "2020-10-03"}, "https://timesofindia.indiatimes.com/city/bhubaneswar/odisha-reports-first-covid-19-death-72-year-old-man-from-bhubaneswar-dies/articleshow/75026800.cms": {"file": "1571147129.html", "date": "2020-04-07"}, "https://thewest.com.au/business/public-companies/caeneus-charges-up-its-exploration-tool-kit-at-mallina-c-1317422": {"file": "1711399465.html", "date": "2020-09-15"}, "http://optimussearch.com.ph/2020/06/02/no-membership-required-best-and-free-online-dating-websites-in-los-angeles/": {"file": "1677010256.html", "date": "2020-06-02"}, "https://www.businesstoday.in/current/economy-politics/coronavirus-in-bihar-record-749-cases-in-24-hours-patna-other-districts-announce-lockdown-from-july-10/story/409327.html": {"file": "1656725247.html", "date": "2020-07-09"}, "https://theweek.com/speedreads/887020/trump-visited-trumpowned-golf-course-nearly-24-percent-days-2019": {"file": "1485102321.html", "date": "2020-01-02"}, "https://globalnews.ca/news/7278919/kamala-harris-fact-check-us-vice-president/": {"file": "1684405920.html", "date": "2020-08-15"}, "https://www.baltimoresun.com/opinion/columnists/zurawik/bs-ed-zontv-media-year-20201223-cnvrlhkhnrbihcxx6wxcxt2b7y-story.html#ed=rss_www.baltimoresun.com/arcio/rss/category/latest/": {"file": "1805697156.html", "date": "2020-12-23"}, "https://www.sfgate.com/news/article/New-anthology-collects-dozens-of-poems-about-15250468.php": {"file": "1598537220.html", "date": "2020-05-06"}, "https://bizwest.com/2020/03/11/loft-clothing-store-at-twenty-ninth-street-in-boulder-to-close/loft/": {"file": "1545286960.html", "date": "2020-03-11"}, "https://www.news.az/news/azerbaijan-launches-counteroffensive-to-restore-its-territorial-integrity-pakistani-envoy": {"file": "1791336003.html", "date": "2020-10-13"}, "https://upton.wickedlocal.com/news/20200924/battle-in-congress-to-replace-ruth-bader-ginsburg-is-dashing-hopes-for-covid-19-stimulus-package?rssfeed=true": {"file": "1720539463.html", "date": "2020-09-24"}, "http://www.haniotika-nea.gr/ton-epiasan-tin-ora-poy-prospathoyse-na-klepsei-aytokinita/": {"file": "1687581723.html", "date": "2020-08-19"}, "https://www.nzz.ch/international/neue-us-sanktionen-erschuettern-die-syrische-wirtschaft-ld.1560586": {"file": "1635095370.html", "date": "2020-06-15"}, "https://www.infranken.de/ueberregional/boulevard/kultur/tatort-aus-muenchen-30-jahre-leitmayr-und-batic-art-5139057": {"file": "1808293409.html", "date": "2020-12-27"}, "https://www.presseportal.de/blaulicht/pm/43526/4791887": {"file": "1798244877.html", "date": "2020-12-15"}, "https://sn.dk/Erhverv/Forstaerket-haab-om-737-MAX-erstatning-loefter-Norwegian-aktie/artikel/900368?rss": {"file": "1484494308.html", "date": "2020-01-02"}, "https://sn.dk/Danmark/Coronavirus-rammer-trafikken-i-Danmark/artikel/922379?rss": {"file": "1544178583.html", "date": "2020-03-10"}, "https://www.novinky.cz/vase-zpravy/clanek/janovicka-knihovna-zve-na-vystavu-o-nebezpecnem-zivobyti-prevadecu-na-sumave-40311673": {"file": "1509819104.html", "date": "2020-01-30"}, "https://www.elvallenc.cat/societat/vallsconfinat-continua-amb-mes-novetats/": {"file": "1556626357.html", "date": "2020-03-23"}, "https://bn.wikipedia.org/w/index.php?title=Jim_Higgs&diff=3986629&oldid=0": {"file": "1522651525.html", "date": "2020-02-18"}, "https://www.actualno.com/haskovo/obshtina-haskovo-osiguri-komputri-i-tableti-na-deca-v-socialni-obshtejitija-news_1509983.html": {"file": "1740081165.html", "date": "2020-10-15"}, "https://vratza.com/obshtina-b-vratsa-b-specheli-proekt-za-izgrazhdaneto-na-dopalnitelen-korpus-na/": {"file": "1766087391.html", "date": "2020-11-10"}, "https://www.youm7.com/story/2020/8/24/\u0648\u0632\u064a\u0631-\u0627\u0644\u0631\u0649-\u064a\u0634\u0647\u062f-\u062a\u0648\u0642\u064a\u0639-\u0639\u0642\u062f-\u062f\u0631\u0627\u0633\u0629-\u062a\u062d\u062f\u064a\u062f-\u0627\u0644\u0633\u062d\u0628-\u0627\u0644\u0622\u0645\u0646-\u0644\u0644\u062e\u0632\u0627\u0646\u0627\u062a/4943459": {"file": "1691196862.html", "date": "2020-08-24"}, "https://www.alyaum.com/articles/6291787/\u0627\u0644\u0642\u0627\u0631\u0627\u062a-\u0627\u0644\u0633\u0628\u0639/\u0637\u0647\u0631\u0627\u0646-\u062a\u062f\u0641\u0646-\u0632\u0627\u062f\u0629-\u0648\u062a\u062a\u0647\u0645-\u0627\u0644\u0645\u0639\u0627\u0631\u0636\u0629-\u0627\u0644\u0625\u064a\u0631\u0627\u0646\u064a\u0629-\u0628\u0627\u063a\u062a\u064a\u0627\u0644\u0647": {"file": "1784190729.html", "date": "2020-12-01"}, "https://www.albayan.ae/across-the-uae/news-and-reports/2020-06-03-1.3874341": {"file": "1623715966.html", "date": "2020-06-03"}, "https://akhbarelyom.com/news/newdetails/3126898/1/36-\u0639\u0627\u0645\u064b\u0627..-\u0633\u0631-\u0631\u062d\u064a\u0644-\u0646\u0639\u064a\u0645\u0629-\u0639\u0627\u0643\u0641-\u0641\u064a-\u0633\u0646-\u0645\u0628\u0643\u0631": {"file": "1731695567.html", "date": "2020-10-07"}, "https://www.almadenahnews.com/article/825076-%D8%A3%D9%85%D9%8A%D8%B1%D9%83%D8%A7-50-%D8%A7%D9%84%D9%81%D8%A7-%D8%AD%D8%B5%D9%8A%D9%84%D8%A9-%D8%A7%D9%84%D9%88%D9%81%D9%8A%D8%A7%D8%AA-%D8%A8%D8%B3%D8%A8%D8%A8-%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7": {"file": "1588107198.html", "date": "2020-04-24"}, "https://arabic.sputniknews.com/arab_world/202003171044893040-%D9%85%D8%B5%D8%B1-%D8%AA%D8%B3%D8%AC%D9%84-30-%D8%A5%D8%B5%D8%A7%D8%A8%D8%A9-%D8%AC%D8%AF%D9%8A%D8%AF%D8%A9-%D9%88%D8%AD%D8%A7%D9%84%D8%AA%D9%8A-%D9%88%D9%81%D8%A7%D8%A9-%D8%A8%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7/": {"file": "1550981615.html", "date": "2020-03-17"}, "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/": {"file": "1806793639.html", "date": "2020-12-25"}, "https://www.beirutobserver.com/2020/11/2338761/": {"file": "1764731404.html", "date": "2020-11-09"}, "https://money.udn.com/money/story/5603/4909425": {"file": "1728970162.html", "date": "2020-10-04"}, "http://www.upmedia.mg/news_info.php?SerialNo=83141": {"file": "1546021647.html", "date": "2020-03-12"}, "https://news.ltn.com.tw/news/business/breakingnews/3119452": {"file": "1564888577.html", "date": "2020-04-01"}, "https://news.sina.com.tw/article/20200121/34046328.html": {"file": "1500260110.html", "date": "2020-01-21"}} \ No newline at end of file diff --git a/tests/evaluation.py b/tests/evaluation.py index 394125dd..7ecb5fb1 100644 --- a/tests/evaluation.py +++ b/tests/evaluation.py @@ -21,12 +21,11 @@ from articleDateExtractor import extractArticlePublishedDate from date_guesser import guess_date from goose3 import Goose - from newspaper import Article - from newspaper.article import ArticleDownloadState + from newspaper import Article, parsers from newsplease import NewsPlease except ImportError: extractArticlePublishedDate = guess_date = Goose = None - Article = ArticleDownloadState = NewsPlease = None + Article = parsers = NewsPlease = None TEST_DIR = os.path.abspath(os.path.dirname(__file__)) @@ -82,24 +81,38 @@ def run_htmldate_fast(htmlstring): def run_newspaper(htmlstring): - """try with the newspaper module""" - # throws error on the eval_default dataset + """try with the newspaper module (newspaper4k) + + Only the publication date is needed, so we run newspaper's cheap metadata + pass (``get_publishing_date``) and stop before ``calculate_best_node`` and + the rest of ``parse()``. This skips the per-language NLP tokenizers (much + faster, and avoids the optional language-data dependencies) and yields the + exact same date. Note: feeding the HTML via ``download(input_html=...)`` is + the correct newspaper4k entry point -- the older ``Article(html)`` hack + raised ``UnicodeEncodeError`` on non-ASCII pages, silently counting them as + misses. ``extractor.get_publishing_date`` is semi-internal: pin newspaper4k. + """ try: - myarticle = Article(htmlstring) - myarticle.html = htmlstring - myarticle.download_state = ArticleDownloadState.SUCCESS - myarticle.parse() + article = Article(url="") + article.download(input_html=htmlstring) + article.doc = parsers.fromstring(article.html) + if article.doc is None: + return None + publish_date = article.extractor.get_publishing_date(article.url, article.doc) except (UnicodeDecodeError, UnicodeEncodeError): return None - if myarticle.publish_date is None or myarticle.publish_date == "": - return None - return str(myarticle.publish_date)[0:10] + return str(publish_date)[0:10] if publish_date else None def run_newsplease(htmlstring): - """try with newsplease""" + """try with newsplease + + ``fetch_images=False`` skips image downloading/processing (and uses the + no-images newspaper extractor internally); the publication date is + unaffected and the call is ~2.5x faster. + """ try: - article = NewsPlease.from_html(htmlstring, url=None) + article = NewsPlease.from_html(htmlstring, url=None, fetch_images=False) if article.date_publish is None: return None return convert_date(article.date_publish, "%Y-%m-%d %H:%M:%S", "%Y-%m-%d") diff --git a/tests/unit_tests.py b/tests/unit_tests.py index 92b37d32..cd472a6c 100644 --- a/tests/unit_tests.py +++ b/tests/unit_tests.py @@ -153,12 +153,13 @@ def test_sanity(): assert is_valid_format("ABC") is False assert is_valid_format(123) is False assert is_valid_format(("a", "b")) is False - _, discarded = discard_unwanted( + tree = discard_unwanted( html.fromstring( '
000
AAA
' ) ) - assert len(discarded) == 1 + assert tree.find('.//div[@id="wm-ipp"]') is None # archive.org banner removed + assert "AAA" in tree.text_content() # real content kept # reset caches: examine_date_elements used above old_values = try_date_expr.cache_info() reset_caches()