Skip to content

add 1-hour and 2-hour retries for all tasks except pp-starter#243

Merged
ceblanton merged 2 commits into
mainfrom
242.add-retries
Jun 11, 2026
Merged

add 1-hour and 2-hour retries for all tasks except pp-starter#243
ceblanton merged 2 commits into
mainfrom
242.add-retries

Conversation

@ceblanton

@ceblanton ceblanton commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Describe your changes

Also, set runahead limit lower due to cylc error. (In Chris's defense, cylc validate did not flag the 99999 as too high)

Issue ticket number and link (if applicable)

#242 #241

Checklist before requesting a review

  • I ran my code
  • I tried to make my code readable
  • I tried to comment my code
  • I wrote a new test, if applicable
  • I wrote new instructions/documentation, if applicable
  • I ran pytest and inspected it's output
  • I ran pylint and attempted to implement some of it's feedback
  • No print statements; all user-facing info uses logging module

Manual Pipeline Run Details

Was the manual pipeline (test_cloud_runner) triggered for this PR?

  • Yes
  • No

Result of manual pipeline run:

(Paste relevant logs, output, or a link to the workflow run here)

How to trigger the manual pipeline:

The test_cloud_runner pipeline is not automatically associated as a required check with the PR; it must be triggered to test changes in a full post-processing run.

To trigger the manual pipeline:

  1. Follow the link to the test_cloud_runner actions tab here

    • you should see "This workflow has a workflow_dispatch event trigger"
  2. Click the dropdown "Run workflow":

    a. If trying to merge from a branch on fre-workflows: choose branch from the first drop down, leave the next 2 inputs blank, and choose the fre-cli branch to test

    b. If trying to merge from a fre-workflows fork: can skip first branch selection, input the fork name (ex: [user]/fre-workflows), input the fork's branch name, and choose the fre-cli branch to test

  3. Click "Run workflow"

Note: you may need to reload the page to see your running workflow.

…ter.

Also, set runahead limit lower due to cylc error.
@ceblanton

Copy link
Copy Markdown
Contributor Author

@singhd789 thank you for holding your nose on the lack of tests for these rapid fixes. True to form, that last PR was buggy. 99999 was too high but cylc validate didn't catch it. I saw it when I tested the retries.

@ceblanton ceblanton requested a review from singhd789 June 11, 2026 18:31
Comment thread flow.cylc

[[pp-starter]]
inherit = PP-STARTER
# Note: pp-starter is a fake task and should be rephrased as a proper trigger.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait how is it a "fake" task?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does zero productive work. It's merely a trigger for the other tasks. Cylc has proper mechanisms for triggers, and we just never got there yet. You can see the modern triggers in Alex's workflows

@singhd789 singhd789 Jun 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. I guess I saw this as a task because it does work to resolve the target file. Either way, it's just a comment I suppose

Comment thread flow.cylc Outdated
[runtime]
[[root]]
# retry all tasks twice- once after an hour, second after two hours
execution retry delays = PT1H, PT2H

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, does this mean that if a task stalls/fails, the workflow will be hanging for an hour or 2 for retries?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. This is not desired behavior for the testing pipelines, is it?

@singhd789 singhd789 Jun 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It miiiight be ok for the test_cloud_runner actually because we have this: https://github.com/NOAA-GFDL/fre-workflows/blob/main/for_gh_runner/runscript.sh#L121

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just thinking that might not be desired by users or maybe for the local test, but if we need it, we can work with it

@singhd789 singhd789 Jun 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm wait, the default behavior might already be an hour

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no default retry. I do think that this would be bad feature for the test pipelines, probably wasting cloud resources.

@ceblanton ceblanton merged commit ff6df48 into main Jun 11, 2026
1 check passed
@ceblanton ceblanton deleted the 242.add-retries branch June 11, 2026 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants