Revisit handling of can_finish and high level exceptions #129

@benoit74

We have a very weird end of scrape at https://farm.openzim.org/pipeline/0736eb17-c4a6-4065-9774-e155c81300f5:

[mindtouch2zim::Thread-4 (worker)::2025-01-10 01:17:12,990] DEBUG:Fetched directly from S3 cache
[mindtouch2zim::Thread-4 (worker)::2025-01-10 01:17:12,990] DEBUG:Adding asset to bio.libretexts.org/@api/deki/files/10689/Figure_11_01_03.jpg?revision=1&size=bestfit&width=910&height=803 in the ZIM
[mindtouch2zim::Thread-5 (worker)::2025-01-10 01:17:13,041] DEBUG:Fetching from online
[mindtouch2zim::Thread-3 (worker)::2025-01-10 01:17:13,056] DEBUG:Fetched directly from S3 cache
[mindtouch2zim::Thread-3 (worker)::2025-01-10 01:17:13,056] DEBUG:Adding asset to bio.libretexts.org/@api/deki/files/11609/base_pairing_labeled.png?revision=2&size=bestfit&width=741&height=347 in the ZIM
[mindtouch2zim::Thread-10 (worker)::2025-01-10 01:17:13,057] DEBUG:Fetched directly from S3 cache
[mindtouch2zim::Thread-10 (worker)::2025-01-10 01:17:13,057] DEBUG:Adding asset to bio.libretexts.org/@api/deki/files/10691/Figure_11_01_05.jpg?revision=1&size=bestfit&width=1030&height=1445 in the ZIM
[mindtouch2zim::Thread-6 (worker)::2025-01-10 01:17:13,070] DEBUG:Fetched directly from S3 cache
[mindtouch2zim::Thread-6 (worker)::2025-01-10 01:17:13,070] DEBUG:Adding asset to bio.libretexts.org/@api/deki/files/10692/Figure_11_01_06.jpg?revision=1&size=bestfit&width=1021&height=942 in the ZIM
[mindtouch2zim::Thread-5 (worker)::2025-01-10 01:17:13,173] DEBUG:Optimizing
[mindtouch2zim::Thread-5 (worker)::2025-01-10 01:17:13,194] DEBUG:Uploading to S3
[mindtouch2zim::Thread-5 (worker)::2025-01-10 01:17:13,359] DEBUG:Adding asset to bio.libretexts.org/@api/deki/files/21513/mindtouch.page%23thumbnail?revision=1 in the ZIM
[mindtouch2zim::Thread-2 (worker)::2025-01-10 01:17:24,368] DEBUG:Request error, starting backoff of 12.4 seconds after 4 tries
[mindtouch2zim::Thread-7 (worker)::2025-01-10 01:17:28,196] DEBUG:Request error, starting backoff of 8.6 seconds after 4 tries
[mindtouch2zim::Thread-2 (worker)::2025-01-10 01:17:36,761] WARNING:Exception while processing asset from https://search.openverse.engineering/static/img/cc_icon.svg?media_id=ac219762-a26d-45fd-823d-4ff90c5f3706 used by page ID 84593 (https://bio.libretexts.org/Sandboxes/tholmberg_at_nwcc.edu/Introduction_to_Environmental_Science/11%3A_Conventional_and_Sustainable_Energy/10.2%3A_Forms_of_Energy): HTTPSConnectionPool(host='search.openverse.engineering', port=443): Max retries exceeded with url: /static/img/cc_icon.svg?media_id=ac219762-a26d-45fd-823d-4ff90c5f3706 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f4128a51490>: Failed to resolve 'search.openverse.engineering' ([Errno -5] No address associated with hostname)"))
[mindtouch2zim::MainThread::2025-01-10 01:17:36,766] INFO:  Progress 74646 / 74648
[mindtouch2zim::Thread-7 (worker)::2025-01-10 01:17:36,801] WARNING:Exception while processing asset from https://search.openverse.engineering/static/img/cc-by_icon.svg used by page ID 84593 (https://bio.libretexts.org/Sandboxes/tholmberg_at_nwcc.edu/Introduction_to_Environmental_Science/11%3A_Conventional_and_Sustainable_Energy/10.2%3A_Forms_of_Energy): HTTPSConnectionPool(host='search.openverse.engineering', port=443): Max retries exceeded with url: /static/img/cc-by_icon.svg (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f411dd181a0>: Failed to resolve 'search.openverse.engineering' ([Errno -5] No address associated with hostname)"))
[mindtouch2zim::MainThread::2025-01-10 01:17:36,809] WARNING:1413 bad assets have been ignored
[mindtouch2zim::MainThread::2025-01-10 01:17:36,819] ERROR:ZIM creation failed
[mindtouch2zim::MainThread::2025-01-10 01:17:36,819] INFO:  Progress 74648 / 74648

Looking at the code, I have no clue how this could happen (an exception should have been re-raised, since we manipulate can_finish in only one place). I also have no clue how I intended this to work (if creator.can_finish is never supposed to be false ... glad I placed this code, however ...).

Anyway, there is something to fix here.

Note that there is only one log with ERROR level in the whole task.
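
To make the problem concrete, here is a minimal sketch of one possible direction, not the actual mindtouch2zim or zimscraperlib code. It assumes a creator-like object exposing a can_finish flag; `SketchCreator`, `mark_failed`, `failure_cause`, `CannotFinishError` and `run` are all hypothetical names introduced only for illustration. The idea is to remember the first exception that flips can_finish, chain it into the high-level error, and log it with its traceback, so a lone "ZIM creation failed" line can no longer appear without a cause.

```python
import logging

logger = logging.getLogger("sketch")


class CannotFinishError(Exception):
    """Raised when finalization is refused (hypothetical name)."""


class SketchCreator:
    """Hypothetical stand-in for a creator object with a can_finish flag."""

    def __init__(self):
        self.can_finish = True
        self.failure_cause = None

    def mark_failed(self, exc):
        # Single place where can_finish is flipped; keep only the first cause.
        if self.can_finish:
            self.can_finish = False
            self.failure_cause = exc

    def finish(self):
        # Refuse to finalize silently: surface the recorded cause instead.
        if not self.can_finish:
            raise CannotFinishError("ZIM creation failed") from self.failure_cause
        return "finished"


def run(creator, work):
    """Run the work, then finalize; log any failure with its traceback."""
    try:
        work(creator)
        return creator.finish()
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback,
        # including the chained cause.
        logger.exception("ZIM creation failed")
        raise
```

With this shape, whatever code path sets can_finish to false has to go through `mark_failed`, so the ERROR log at the end of the task would show which exception was responsible.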

Labels: bug
