[fournos_launcher] Allow launching multiple jobs concurrently #49

kpouget merged 33 commits into openshift-psap:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: — the full list of commands accepted by this bot can be found here.

Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough

Introduces thread-local script execution context and per-thread ARTIFACT_DIR handling, a Parallel context manager for concurrent tasks, a /parallel FOURNOS directive and parallel submission flow with per-job artifact suffixes, a new cleanup CLI, config/preset changes, and several CLI/env renames and export updates.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as User/CLI
    participant Parser as Directive Parser
    participant Config as Config Manager
    participant Submit as Submit Orchestrator
    participant Script as ScriptManager
    participant Worker as Job Worker Thread

    CLI->>Parser: submit with /parallel directive
    Parser->>Config: write fournos_launcher.parallel_jobs[index]=[presets]
    CLI->>Submit: trigger submission
    Submit->>Config: read fournos_launcher.parallel_jobs
    alt parallel entries present
        Submit->>Submit: build shared timestamp base name
        loop per job
            Submit->>Worker: start thread with job args (+artifact_dirname_suffix)
            Worker->>Script: start_execution_context(source_file)
            Worker->>Worker: set thread-local ARTIFACT_DIR (if dedicated)
            Worker->>Worker: call submit_and_wait(...)
            Worker-->>Script: clear_execution_context() (finally)
        end
        Submit->>Submit: aggregate failures
        Submit->>Submit: call cleanup_fjob per submitted job
    else no parallel config
        Submit->>Worker: single submit_and_wait (normal flow)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

✅ Passed checks (4 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches: 🧪 Generate unit tests (beta)
/test fournos skeleton

1 similar comment

/test fournos skeleton
Actionable comments posted: 10
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/core/dsl/runtime.py`:
- Around line 203-205: The finally block calls
script_manager.clear_execution_context() but script_manager is only assigned
inside the try, so if earlier operations raise an exception the finally will
raise UnboundLocalError and hide the original error; to fix, initialize
script_manager = None (or a sentinel) before the try and in the finally guard
the call (e.g., if script_manager is not None:
script_manager.clear_execution_context()), referencing the script_manager
variable and its clear_execution_context() method to locate where to add the
initialization and the conditional cleanup.
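A minimal sketch of the suggested sentinel-guard pattern; the `ScriptManager` stand-in and `run_script` wrapper below are illustrative, not the real code in `projects/core/dsl/runtime.py`:

```python
# Hypothetical stand-in for the real ScriptManager in projects/core/dsl.
class ScriptManager:
    def __init__(self):
        self.source_file = None

    def start_execution_context(self, source_file):
        self.source_file = source_file

    def clear_execution_context(self):
        self.source_file = None


def run_script(source_file, fail_early=False):
    script_manager = None  # sentinel assigned before the try block
    try:
        if fail_early:
            # Simulates a failure before script_manager is assigned.
            raise RuntimeError("setup failed before script_manager existed")
        script_manager = ScriptManager()
        script_manager.start_execution_context(source_file)
        # ... execute the script here ...
    finally:
        # Guarded call: an early failure no longer raises
        # UnboundLocalError in the finally block and masks the
        # original exception.
        if script_manager is not None:
            script_manager.clear_execution_context()
```

With the guard in place, an exception raised before the assignment propagates unchanged instead of being replaced by an `UnboundLocalError`.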
In `@projects/core/library/env.py`:
- Around line 37-39: The TLS setter _set_tls_artifact_dir currently has no
effect because NextArtifactDir, TempArtifactDir, and next_artifact_index still
read/write the module-global ARTIFACT_DIR/ARTIFACT_INDEX; update those helpers
to prefer thread-local storage (read _tls_artifact_dir.val and a thread-local
index like _tls_artifact_index.val) and only fall back to the module-global
ARTIFACT_DIR/ARTIFACT_INDEX when the TLS value is None, and ensure writes
(incrementing index or setting current dir) update the TLS slot so each worker
thread gets isolated artifact dirs and indexes; change references inside
NextArtifactDir, TempArtifactDir, and next_artifact_index accordingly.
- Around line 15-25: The module-level ARTIFACT_DIR currently exists so Python
never calls __getattr__ for env.ARTIFACT_DIR; remove or stop defining the
top-level ARTIFACT_DIR constant so attribute access will fall through to
__getattr__ (which checks _tls_artifact_dir and globals().get("ARTIFACT_DIR") as
a fallback). Locate and delete the module-level ARTIFACT_DIR definition (or
ensure it is removed from globals() during import) so __getattr__ can intercept
reads and return thread-local values using _tls_artifact_dir.val.
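A sketch of the thread-local-with-global-fallback lookup described above; the helper names and paths are assumptions, and the real module would expose the read via a module-level `__getattr__` (PEP 562) once the top-level `ARTIFACT_DIR` constant is removed:

```python
import threading

_tls = threading.local()
_GLOBAL_ARTIFACT_DIR = "/srv/artifacts"  # module-global fallback (hypothetical path)


def set_tls_artifact_dir(path):
    _tls.val = path


def current_artifact_dir():
    # Thread-local value wins; fall back to the module global when unset.
    return getattr(_tls, "val", None) or _GLOBAL_ARTIFACT_DIR


# Each worker thread sees its own directory; other threads are unaffected.
seen = {}

def worker(name):
    set_tls_artifact_dir(f"/srv/artifacts/{name}")
    seen[name] = current_artifact_dir()

threads = [threading.Thread(target=worker, args=(n,)) for n in ("job0", "job1")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```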
In `@projects/core/library/run.py`:
- Around line 199-202: Parallel blocks with no tasks cause
ThreadPoolExecutor(max_workers=0) to raise; update run.Parallel handling by
checking self.parallel_tasks before creating the executor and treat an empty
list as a no-op. Specifically, before computing max_workers or entering
ThreadPoolExecutor, return/exit early (or skip executor creation) when
self.parallel_tasks is empty; alternatively ensure max_workers is computed as
max(1, min(len(self.parallel_tasks), os.cpu_count() or 1)). Change the logic
around the max_workers variable and ThreadPoolExecutor usage so
run.Parallel(...) with no tasks does not instantiate ThreadPoolExecutor with 0
workers.
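A sketch of the empty-task guard, assuming tasks are stored as `(function, args, kwargs)` tuples; `ThreadPoolExecutor(max_workers=0)` raises `ValueError`, so an empty list is treated as a no-op:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_parallel(tasks):
    """tasks: list of (function, args, kwargs) tuples."""
    if not tasks:
        # No-op instead of ThreadPoolExecutor(max_workers=0), which raises.
        return []
    max_workers = max(1, min(len(tasks), os.cpu_count() or 1))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, *args, **kwargs) for fn, args, kwargs in tasks]
        return [f.result() for f in futures]
```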
- Around line 174-213: The bug is that a single parallel_dir is created once and
shared by all futures so downstream env.next_artifact_index() calls still race;
to fix, generate a unique artifact directory per task before submitting: inside
the loop over self.parallel_tasks call env.next_artifact_index() (use
env.next_artifact_index()) for each delayed_func, build and mkdir a
task-specific parallel_dir, then submit _run_with_artifact_dir with that unique
parallel_dir (instead of the shared parallel_dir); keep use of
_run_with_artifact_dir, mkdir(exist_ok=True) and pass the created dir into
executor.submit so each worker gets its own non-colliding subtree.
- Line 9: The file imports joblib unconditionally via the top-level import
joblib which will raise ModuleNotFoundError if joblib is not declared as a
dependency; either add "joblib" to the project's dependencies in pyproject.toml,
or change the import to the safe pattern used in projects/legacy/library/run.py:
wrap import joblib in try/except ImportError, set a sentinel (e.g., joblib =
None) and guard any usage of joblib (functions/classes that call joblib) so
behavior degrades gracefully; reference the top-level import joblib and the
EXTRA_PACKAGES = ["joblib"] note in run_ci.py when choosing the fix.
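The safe-import pattern referenced from `projects/legacy/library/run.py` can be sketched as follows; `parallel_map` is an illustrative helper, not an existing function:

```python
# Treat joblib as an optional dependency and degrade gracefully.
try:
    import joblib
except ImportError:
    joblib = None  # sentinel: joblib not installed


def parallel_map(fn, items, n_jobs=2):
    if joblib is None:
        # Sequential fallback when the optional dependency is missing.
        return [fn(item) for item in items]
    return joblib.Parallel(n_jobs=n_jobs)(joblib.delayed(fn)(item) for item in items)
```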
- Around line 164-165: The code incorrectly uses joblib.delayed(function)(*args,
**kwargs) which returns a tuple-like delayed object and later code expects
attributes .func/.args/.keywords; change the append site
(self.parallel_tasks.append(...)) to store the raw tuple (function, args,
kwargs) instead of using joblib.delayed, then update the execution path that
reads those entries (the logic around the lines that access .func, .args,
.keywords) to unpack the tuple as (func, args, kwargs) and call func(*args,
**kwargs); also remove the unnecessary joblib.delayed usage/import if no longer
needed so only the plain tuple is used when scheduling and running tasks.
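A sketch of the tuple-based scheduling suggested above; the class mirrors the shape of `run.Parallel` but is a simplified stand-in:

```python
class Parallel:
    def __init__(self):
        self.parallel_tasks = []

    def delayed(self, function, *args, **kwargs):
        # Store the raw (function, args, kwargs) tuple; no joblib.delayed
        # wrapper, so no .func/.args/.keywords attribute mismatch later.
        self.parallel_tasks.append((function, args, kwargs))

    def run_all(self):
        results = []
        for func, args, kwargs in self.parallel_tasks:
            results.append(func(*args, **kwargs))
        return results
```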
In `@projects/fournos_launcher/orchestration/pr_args.py`:
- Around line 241-254: The parser currently accepts any non-negative integer for
the /parallel index and returns {f"fournos_launcher.parallel_jobs.{idx}":
presets}, but config.yaml only defines slots 0–9 so indices >=10 will be ignored
later; after converting idx (and after the non-negative check) validate that idx
is within the allowed range (e.g. 0 <= idx < 10) and raise a ValueError with a
clear message like "Invalid /parallel directive: index out of range, must be
0..9, got {idx} in '{line}'"; implement this check near the existing idx parsing
block (or introduce a MAX_PARALLEL_JOBS constant) to prevent silently dropped
configs.
In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 135-148: The submit_kwargs dict in submit.py captures status_dest
as env.ARTIFACT_DIR too early, causing all parallel submissions to share the
same artifact dir; update submit_kwargs (the dict built in submit.py) to not
freeze status_dest to env.ARTIFACT_DIR — either remove the "status_dest" key or
set it to None so submit_and_wait can use the thread-local/artifact-dir chosen
by the worker/run.Parallel; ensure submit_and_wait still accepts and handles a
None status_dest fallback to the per-worker env.ARTIFACT_DIR/"artifacts".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ac92e4f5-2162-41d1-b95e-cda672e7a047
📒 Files selected for processing (7)
- projects/core/dsl/runtime.py
- projects/core/dsl/script_manager.py
- projects/core/library/env.py
- projects/core/library/run.py
- projects/fournos_launcher/orchestration/config.yaml
- projects/fournos_launcher/orchestration/pr_args.py
- projects/fournos_launcher/orchestration/submit.py
```python
try:
    idx = int(idx_str)
    if idx < 0:
        raise ValueError(
            f"Invalid /parallel directive: index must be non-negative, got {idx} in '{line}'"
        )
except ValueError as e:
    if "non-negative" in str(e):
        raise
    raise ValueError(
        f"Invalid /parallel directive: index must be a number, got '{idx_str}' in '{line}'"
    ) from None

return {f"fournos_launcher.parallel_jobs.{idx}": presets}
```
Validate /parallel indices against the configured slots.
config.yaml only declares fournos_launcher.parallel_jobs.0 through .9. A directive like /parallel 10 ... passes parsing here, but submit.py only reads the declared map, so that job config can be silently dropped.
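The suggested range check can be sketched with a `MAX_PARALLEL_JOBS` constant; 10 mirrors the slots 0–9 declared in config.yaml, and the function name is illustrative:

```python
MAX_PARALLEL_JOBS = 10  # config.yaml declares parallel_jobs.0 through .9


def parse_parallel_index(idx_str, line):
    try:
        idx = int(idx_str)
    except ValueError:
        raise ValueError(
            f"Invalid /parallel directive: index must be a number, got '{idx_str}' in '{line}'"
        ) from None
    if not 0 <= idx < MAX_PARALLEL_JOBS:
        # Reject indices that config.yaml would silently drop.
        raise ValueError(
            f"Invalid /parallel directive: index out of range, must be "
            f"0..{MAX_PARALLEL_JOBS - 1}, got {idx} in '{line}'"
        )
    return idx
```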
/test fournos skeleton

2 similar comments

/test fournos skeleton

/test fournos skeleton

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 02 seconds 🔴 • Link to the test results. • No reports index generated... • Test configuration: • Failure indicator: Empty.
Actionable comments posted: 3
🧹 Nitpick comments (3)

projects/core/dsl/script_manager.py (1)

98-116: `clear_tasks` lacks thread-safety synchronization. While `register_task` is protected by `self._lock`, `clear_tasks` mutates both `_task_registry` and `_task_results` without acquiring the lock. This could cause race conditions if tasks are being registered concurrently with a clear operation.

🔒 Proposed fix to add lock protection (wrap the method body in `with self._lock:`):

```python
def clear_tasks(self, source_file: str | None = None) -> None:
    """
    Clear tasks from registry

    Args:
        source_file: If specified, only clear tasks from this file.
            If None, clear all tasks from all files.
    """
    with self._lock:
        if source_file is None:
            # Clear all tasks from all files
            logger.debug("Clearing all tasks from script manager")
            self._task_registry.clear()
            self._task_results.clear()
        else:
            # Clear tasks from specific file
            if source_file in self._task_registry:
                tasks_to_remove = self._task_registry[source_file]

                # Clear task results for tasks from this file
                for task_info in tasks_to_remove:
                    task_name = task_info["name"]
                    if task_name in self._task_results:
                        del self._task_results[task_name]

                # Remove tasks from this file
                del self._task_registry[source_file]
                logger.debug(f"Cleared {len(tasks_to_remove)} tasks from {source_file}")
```

projects/fournos_launcher/toolbox/submit_fjob/main.py (1)

179-183: Consider moving the import to module level. The `yaml` import inside the task function works but is unconventional. Moving it to the top of the file with the other imports improves readability and makes dependencies explicit.

♻️ Proposed refactor — at the top of the file:

```diff
 import logging
 from datetime import datetime
+
+import yaml

 from projects.core.dsl import (
```

Then remove the inline import at line 179:

```diff
-    import yaml
-
     job_info_file = args.artifact_dir / "artifacts" / f"{ctx.final_job_name}-submission-info.yaml"
```

projects/fournos_launcher/orchestration/submit.py (1)

138-151: `status_dest` in `submit_kwargs` is unused by `submit_fjob`. Looking at the `submit_fjob` function signature (context snippet 1), it doesn't accept a `status_dest` parameter, so this key in `submit_kwargs` is ignored when unpacked with `**submit_kwargs` at line 181. This isn't a bug, since the `wait_fjobs` calls pass `status_dest` explicitly, but the unused key adds confusion.

♻️ Proposed fix to remove the unused key:

```diff
 submit_kwargs = {
     "cluster_name": cluster_name,
     "project": config.project.get_config("ci_job.project"),
     "variables_overrides": overrides,
     "namespace": config.project.get_config("fournos.namespace"),
     "owner": config.project.get_config("fournos.job.owner"),
     "pipeline_name": config.project.get_config("fournos.job.pipeline_name"),
     "env": env_dict,
-    "status_dest": env.ARTIFACT_DIR,
     "ci_label": config.project.get_config("fournos.job.ci_label"),
     "exclusive": config.project.get_config("fournos.job.exclusive"),
     "gpu_count": gpu_count,
     "gpu_type": gpu_type,
 }
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/fournos_launcher/toolbox/wait_fjobs/main.py`:
- Around line 334-340: The check for command output uses yaml_list.stdout which
will be empty when stdout was redirected via stdout_dest; instead, after running
shell.run(...) keep the yaml_list.success (or returncode) check and verify the
artifact file (artifact_dir / f"{job_name}-pods.yaml") exists and is non-empty
(e.g., .exists() and .stat().st_size > 0) before incrementing captured_count;
update the condition that references yaml_list.stdout to check the file on disk
and leave yaml_list (the shell.run result) for success/returncode checks.
- Around line 306-314: The code currently checks both yaml_list.success and
yaml_list.stdout and yaml_list.stdout.strip() before counting captured
PipelineRun specs; update the condition to only check yaml_list.success (as
other similar code paths do when stdout_dest is used) so replace the combined
stdout checks with a single yaml_list.success in the function that builds the
message returning "Captured PipelineRun specs..." (look for yaml_list and
ctx.target_jobs) and apply the identical simplification in the
capture_all_pod_specs location where the same triple-check appears.
- Around line 122-123: The retry decorator currently hardcodes
`@retry`(attempts=120, delay=10, backoff=1.0) which yields a 20-minute limit;
update this to be configurable instead of fixed: read attempts, delay and
backoff from environment variables or function parameters (e.g., RETRY_ATTEMPTS,
RETRY_DELAY, RETRY_BACKOFF) and apply them to the retry decorator or construct
the decorator dynamically so longer (or overridable) timeouts are possible for
FOURNOS model jobs; alternatively, increase default values to reflect multi-hour
ML runs but ensure the new values are settable at runtime.
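A sketch of an environment-configurable retry decorator; the variable names `RETRY_ATTEMPTS` / `RETRY_DELAY` / `RETRY_BACKOFF` are the suggestion from the comment above, not an existing interface:

```python
import os
import time
from functools import wraps


def retry(attempts=120, delay=10, backoff=1.0):
    # Environment overrides win over the hardcoded defaults, so long
    # FOURNOS model jobs can raise the timeout at runtime.
    attempts = int(os.environ.get("RETRY_ATTEMPTS", attempts))
    delay = float(os.environ.get("RETRY_DELAY", delay))
    backoff = float(os.environ.get("RETRY_BACKOFF", backoff))

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of attempts: propagate the last error
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator
```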
---
Nitpick comments:
In `@projects/core/dsl/script_manager.py`:
- Around line 98-116: clear_tasks mutates shared state (_task_registry and
_task_results) without acquiring the instance lock, creating a race with
register_task which uses self._lock; wrap the entire clear_tasks implementation
in the same lock used by register_task (acquire self._lock at method start and
release at end, including both the branch that clears all tasks and the branch
that clears a specific source_file) so all reads/writes to _task_registry and
_task_results are synchronized with other methods using self._lock.
In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 138-151: submit_kwargs contains a "status_dest" key that
submit_fjob does not accept and will be ignored when **submit_kwargs is used;
remove the "status_dest": env.ARTIFACT_DIR entry from the submit_kwargs dict (in
projects/fournos_launcher/orchestration/submit.py) and instead pass
env.ARTIFACT_DIR explicitly to wait_fjobs (which already expects status_dest) or
add status_dest as an explicit argument where needed—do not change submit_fjob
signature unless you intend for it to consume status_dest.
In `@projects/fournos_launcher/toolbox/submit_fjob/main.py`:
- Around line 179-183: Move the local "import yaml" out of the task function and
add it to the module-level imports at the top of the file so dependencies are
explicit; then remove the inline import on the block that writes job_info_file
(the code that constructs job_info_file using args.artifact_dir and
ctx.final_job_name and calls yaml.dump(job_info, ...)) so the function simply
uses the module-level yaml reference.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0e2b6019-d8bc-4f91-9835-6fdf68c88041
📒 Files selected for processing (8)
- projects/core/ci_entrypoint/run_ci.py
- projects/core/dsl/runtime.py
- projects/core/dsl/script_manager.py
- projects/fournos_launcher/orchestration/config.yaml
- projects/fournos_launcher/orchestration/pr_args.py
- projects/fournos_launcher/orchestration/submit.py
- projects/fournos_launcher/toolbox/submit_fjob/main.py
- projects/fournos_launcher/toolbox/wait_fjobs/main.py
✅ Files skipped from review due to trivial changes (2)
- projects/core/ci_entrypoint/run_ci.py
- projects/fournos_launcher/orchestration/config.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- projects/core/dsl/runtime.py
- projects/fournos_launcher/orchestration/pr_args.py
Force-pushed 2a9c539 to 0cfb72a (Compare)
/test fournos skeleton

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 53 seconds 🔴 • Link to the test results. • No reports index generated... • Test configuration:

/test fournos skeleton

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 20 seconds 🔴 • Link to the test results. • No reports index generated... • Test configuration:

/test fournos skeleton

🔴 Test of 'fournos_launcher submit' failed after 00 hours 01 minutes 04 seconds 🔴 • Link to the test results. • No reports index generated... • Test configuration:

/test fournos skeleton

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 02 seconds 🔴 • Link to the test results. • No reports index generated... • Test configuration:

/test fournos skeleton

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 04 seconds 🔴 • Link to the test results. • No reports index generated... • Test configuration: • Failure indicator: Empty.

/test fournos skeleton quick_test

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 20 seconds 🔴 • Link to the test results. • No reports index generated... • Test configuration:

/test fournos skeleton quick_test

🔴 Test of 'fournos_launcher submit' failed after 00 hours 01 minutes 06 seconds 🔴 • Link to the test results. • No reports index generated... • Test configuration:

🟢 Test of 'fournos_launcher submit' succeeded after 00 hours 06 minutes 32 seconds 🟢 • Link to the test results. • No reports index generated... • Test configuration:

/test fournos skeleton

🟢 Test of 'skeleton test' succeeded after 00 hours 00 minutes 10 seconds 🟢 • Link to the test results. • No reports index generated... • Test configuration:

🟢 Test of 'skeleton test' succeeded after 00 hours 00 minutes 10 seconds 🟢 • Link to the test results. • No reports index generated... • Test configuration:

🟢 Test of 'fournos_launcher submit' succeeded after 00 hours 02 minutes 37 seconds 🟢 • Link to the test results. • No reports index generated... • Test configuration:

/test fournos skeleton

🟢 Test of 'skeleton test' succeeded after 00 hours 00 minutes 10 seconds 🟢 • Link to the test results. • No reports index generated... • Test configuration:

🟢 Test of 'skeleton test' succeeded after 00 hours 00 minutes 10 seconds 🟢 • Link to the test results. • No reports index generated... • Test configuration:

🟢 Test of 'fournos_launcher submit' succeeded after 00 hours 02 minutes 33 seconds 🟢 • Link to the test results. • No reports index generated... • Test configuration:

/test fournos skeleton

🟢 Test of 'skeleton test' succeeded after 00 hours 02 minutes 00 seconds 🟢 • Link to the test results. • No reports index generated... • Test configuration:

good, merging this version!

/test fournos skeleton
Summary by CodeRabbit
New Features
Changes