
[fournos_launcher] Allow launching multiple jobs concurrently#49

Merged
kpouget merged 33 commits into openshift-psap:main from kpouget:parallel
May 4, 2026

Conversation

@kpouget
Contributor

@kpouget kpouget commented Apr 29, 2026

Summary by CodeRabbit

  • New Features

    • /parallel directive and predefined parallel job slots to launch multiple job instances.
    • Parallel execution helper to run tasks concurrently with optional per-task artifact directories.
    • New CLI: cleanup job command; new resolve subcommand for hardware stubbing.
  • Changes

    • Export command renamed to export-artifacts; export now defaults artifact directory and may update job export status.
    • Job-name env var standardized to FJOB_NAME.
    • Notifications include variable_overrides.yaml when present.
    • Test presets: added medium_test (120s) and shortened long_test (300s).

@openshift-ci

openshift-ci Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tosokin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Introduces thread-local script execution context and per-thread ARTIFACT_DIR handling, a Parallel context manager for concurrent tasks, a /parallel FOURNOS directive and parallel submission flow with per-job artifact suffixes, new cleanup CLI, config/preset changes, and several CLI/env renames and export updates.

Changes

Cohort / File(s) Summary
DSL execution & script manager
projects/core/dsl/runtime.py, projects/core/dsl/script_manager.py
execute_tasks now starts/clears a thread-local ScriptManager execution context and strips artifact_dirname_prefix/artifact_dirname_suffix from function args before execution. ScriptManager adds thread-local task_results, a Lock, and methods start_execution_context, clear_execution_context, has_execution_context; get_task_result uses the thread-local container.
Thread-local artifact dir & env helpers
projects/core/library/env.py
Adds TLS storage and accessors for ARTIFACT_DIR (__getattr__, get_tls_artifact_dir, _set_tls_artifact_dir, ensure_thread_artifact_dir), updates Next/TempArtifactDir and index logic to prefer thread-local with global fallback and initializes TLS in init().
Parallel execution framework
projects/core/library/run.py
Adds Parallel context manager with delayed() to enqueue functions, executes them via ThreadPoolExecutor on exit, optionally creates a dedicated per-context artifact subdir and sets thread-local ARTIFACT_DIR for workers, cancels remaining tasks on worker exception and handles exit behavior.
/parallel directive & config
projects/fournos_launcher/orchestration/pr_args.py, projects/fournos_launcher/orchestration/config.yaml
Adds /parallel directive handler handle_parallel_directive and registers it; config.yaml adds fournos_launcher.parallel_jobs keys "0" through "9", each initialized to [].
Parallel job submission & submit tool changes
projects/fournos_launcher/orchestration/submit.py, projects/fournos_launcher/toolbox/submit_and_wait/main.py, projects/fournos_launcher/toolbox/cleanup_fjob/main.py
submit.py can launch multiple submit_and_wait calls concurrently using fournos_launcher.parallel_jobs, aggregates per-job failures, and invokes cleanup_fjob for each submitted job. submit_and_wait.run() gains artifact_dirname_suffix. Adds new cleanup_fjob toolbox CLI to conditionally delete FOURNOS jobs (with optional shutdown check).
Export CLI & job-name env rename
projects/core/library/export.py, projects/core/ci_entrypoint/fournos_resolve.py
Renames CLI command export to export-artifacts; export defaults artifact dir to ARTIFACT_BASE_DIR and writes status back to FournosJob when FOURNOS_CI=true. Renames env/CLI var FOURNOS_JOB_NAME to FJOB_NAME and updates option wiring.
Config overrides materialization
projects/core/library/config.py
apply_config_overrides will materialize missing dotted override keys by resolving the parent config and inserting the new leaf (calls save_config()), otherwise raises as before.
Notifications message content
projects/core/notifications/send.py
get_common_message now reads ARTIFACT_DIR/000__ci_metadata/variable_overrides.yaml when present and includes it under “Test configuration”; fallback text updated to reference both variable_overrides.yaml/pr_config.txt.
CI/tooling, presets & minor CLI changes
projects/llm_d_legacy/orchestration/ci.py, projects/skeleton/orchestration/ci.py, projects/skeleton/orchestration/presets.d/presets.yaml, projects/core/ci_entrypoint/fournos.py, projects/llm_d_legacy/testing/*
Adds stub resolve_hardware_request and Fournos resolve subcommand in llm_d_legacy; removes Caliper export side-effects from skeleton CI test; adds medium_test preset and shortens long_test; replaces dynamic oc fetch with load_fjob_yaml reading fournos_fjob.yaml; large preset/config YAML additions/removals in llm_d_legacy testing.
Entrypoint/tooling signature tweaks
projects/fournos_launcher/toolbox/submit_and_wait/main.py
run(...) now accepts keyword-only artifact_dirname_suffix: str = None; removed the FOURNOS job-deletion cleanup step from that task.
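As described in the walkthrough, the Parallel helper queues work via delayed() and runs it on context exit through a ThreadPoolExecutor. The following is a hypothetical sketch of that shape, storing plain (func, args, kwargs) tuples as one of the review comments recommends; the real projects/core/library/run.py may differ in signatures and in its artifact-dir handling:

```python
from concurrent.futures import ThreadPoolExecutor


class Parallel:
    """Sketch of a queue-then-run context manager (illustrative, not the real run.Parallel)."""

    def __init__(self, max_workers=None):
        self.max_workers = max_workers
        self.parallel_tasks = []

    def delayed(self, function, *args, **kwargs):
        # Store the raw triple; execution is deferred until __exit__.
        self.parallel_tasks.append((function, args, kwargs))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is not None:
            return False  # an exception in the body skips execution and propagates
        if not self.parallel_tasks:
            return False  # empty queue is a no-op (avoids max_workers=0)
        workers = self.max_workers or len(self.parallel_tasks)
        with ThreadPoolExecutor(max_workers=workers) as executor:
            futures = [executor.submit(func, *args, **kwargs)
                       for func, args, kwargs in self.parallel_tasks]
            for fut in futures:
                fut.result()  # re-raises the first worker exception
        return False
```

Callers would then enqueue jobs inside the with block and have them all run, concurrently, when the block closes.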

Sequence Diagram

sequenceDiagram
    participant CLI as User/CLI
    participant Parser as Directive Parser
    participant Config as Config Manager
    participant Submit as Submit Orchestrator
    participant Script as ScriptManager
    participant Worker as Job Worker Thread

    CLI->>Parser: submit with /parallel directive
    Parser->>Config: write fournos_launcher.parallel_jobs[index]=[presets]
    CLI->>Submit: trigger submission
    Submit->>Config: read fournos_launcher.parallel_jobs
    alt parallel entries present
        Submit->>Submit: build shared timestamp base name
        loop per job
            Submit->>Worker: start thread with job args (+artifact_dirname_suffix)
            Worker->>Script: start_execution_context(source_file)
            Worker->>Worker: set thread-local ARTIFACT_DIR (if dedicated)
            Worker->>Worker: call submit_and_wait(...)
            Worker-->>Script: clear_execution_context() (finally)
        end
        Submit->>Submit: aggregate failures
        Submit->>Submit: call cleanup_fjob per submitted job
    else no parallel config
        Submit->>Worker: single submit_and_wait (normal flow)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 Threads hop, directories spin,

suffixes stitch to jobs within,
contexts start then vanish clean,
presets call and runners preen,
a rabbit cheers: "Concurrent win!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 77.19%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed — The title directly and clearly summarizes the main objective: enabling concurrent job launching in the fournos_launcher module, in line with the changes across multiple files for parallel job submission, context management, and related infrastructure.
  • Linked Issues check ✅ Passed — Check skipped: no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed — Check skipped: no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@kpouget
Contributor Author

kpouget commented Apr 29, 2026

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

1 similar comment
@kpouget
Contributor Author

kpouget commented Apr 29, 2026

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/core/dsl/runtime.py`:
- Around line 203-205: The finally block calls
script_manager.clear_execution_context() but script_manager is only assigned
inside the try, so if earlier operations raise an exception the finally will
raise UnboundLocalError and hide the original error; to fix, initialize
script_manager = None (or a sentinel) before the try and in the finally guard
the call (e.g., if script_manager is not None:
script_manager.clear_execution_context()), referencing the script_manager
variable and its clear_execution_context() method to locate where to add the
initialization and the conditional cleanup.
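The fix described above is the standard sentinel-and-guard pattern. A self-contained illustration follows; DummyManager and run_with_cleanup are made-up names standing in for ScriptManager and the runtime.py code path:

```python
class DummyManager:
    """Stand-in for ScriptManager, just to demonstrate the guard."""

    def __init__(self):
        self.cleared = False

    def clear_execution_context(self):
        self.cleared = True


def run_with_cleanup(factory):
    script_manager = None  # sentinel: factory() may raise before assignment
    try:
        script_manager = factory()
        return "ok"
    finally:
        # Guard the cleanup so a failure inside try (before assignment)
        # does not mask the original error with an UnboundLocalError.
        if script_manager is not None:
            script_manager.clear_execution_context()
```

If factory() raises, the original exception propagates untouched; if it succeeds, cleanup always runs.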

In `@projects/core/library/env.py`:
- Around line 37-39: The TLS setter _set_tls_artifact_dir currently has no
effect because NextArtifactDir, TempArtifactDir, and next_artifact_index still
read/write the module-global ARTIFACT_DIR/ARTIFACT_INDEX; update those helpers
to prefer thread-local storage (read _tls_artifact_dir.val and a thread-local
index like _tls_artifact_index.val) and only fall back to the module-global
ARTIFACT_DIR/ARTIFACT_INDEX when the TLS value is None, and ensure writes
(incrementing index or setting current dir) update the TLS slot so each worker
thread gets isolated artifact dirs and indexes; change references inside
NextArtifactDir, TempArtifactDir, and next_artifact_index accordingly.
- Around line 15-25: The module-level ARTIFACT_DIR currently exists so Python
never calls __getattr__ for env.ARTIFACT_DIR; remove or stop defining the
top-level ARTIFACT_DIR constant so attribute access will fall through to
__getattr__ (which checks _tls_artifact_dir and globals().get("ARTIFACT_DIR") as
a fallback). Locate and delete the module-level ARTIFACT_DIR definition (or
ensure it is removed from globals() during import) so __getattr__ can intercept
reads and return thread-local values using _tls_artifact_dir.val.

In `@projects/core/library/run.py`:
- Around line 199-202: Parallel blocks with no tasks cause
ThreadPoolExecutor(max_workers=0) to raise; update run.Parallel handling by
checking self.parallel_tasks before creating the executor and treat an empty
list as a no-op. Specifically, before computing max_workers or entering
ThreadPoolExecutor, return/exit early (or skip executor creation) when
self.parallel_tasks is empty; alternatively ensure max_workers is computed as
max(1, min(len(self.parallel_tasks), os.cpu_count() or 1)). Change the logic
around the max_workers variable and ThreadPoolExecutor usage so
run.Parallel(...) with no tasks does not instantiate ThreadPoolExecutor with 0
workers.
- Around line 174-213: The bug is that a single parallel_dir is created once and
shared by all futures so downstream env.next_artifact_index() calls still race;
to fix, generate a unique artifact directory per task before submitting: inside
the loop over self.parallel_tasks call env.next_artifact_index() (use
env.next_artifact_index()) for each delayed_func, build and mkdir a
task-specific parallel_dir, then submit _run_with_artifact_dir with that unique
parallel_dir (instead of the shared parallel_dir); keep use of
_run_with_artifact_dir, mkdir(exist_ok=True) and pass the created dir into
executor.submit so each worker gets its own non-colliding subtree.
- Line 9: The file imports joblib unconditionally via the top-level import
joblib which will raise ModuleNotFoundError if joblib is not declared as a
dependency; either add "joblib" to the project's dependencies in pyproject.toml,
or change the import to the safe pattern used in projects/legacy/library/run.py:
wrap import joblib in try/except ImportError, set a sentinel (e.g., joblib =
None) and guard any usage of joblib (functions/classes that call joblib) so
behavior degrades gracefully; reference the top-level import joblib and the
EXTRA_PACKAGES = ["joblib"] note in run_ci.py when choosing the fix.
- Around line 164-165: The code incorrectly uses joblib.delayed(function)(*args,
**kwargs) which returns a tuple-like delayed object and later code expects
attributes .func/.args/.keywords; change the append site
(self.parallel_tasks.append(...)) to store the raw tuple (function, args,
kwargs) instead of using joblib.delayed, then update the execution path that
reads those entries (the logic around the lines that access .func, .args,
.keywords) to unpack the tuple as (func, args, kwargs) and call func(*args,
**kwargs); also remove the unnecessary joblib.delayed usage/import if no longer
needed so only the plain tuple is used when scheduling and running tasks.

In `@projects/fournos_launcher/orchestration/pr_args.py`:
- Around line 241-254: The parser currently accepts any non-negative integer for
the /parallel index and returns {f"fournos_launcher.parallel_jobs.{idx}":
presets}, but config.yaml only defines slots 0–9 so indices >=10 will be ignored
later; after converting idx (and after the non-negative check) validate that idx
is within the allowed range (e.g. 0 <= idx < 10) and raise a ValueError with a
clear message like "Invalid /parallel directive: index out of range, must be
0..9, got {idx} in '{line}'"; implement this check near the existing idx parsing
block (or introduce a MAX_PARALLEL_JOBS constant) to prevent silently dropped
configs.
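A minimal version of the suggested range check; MAX_PARALLEL_JOBS and parse_parallel_index are illustrative names, not the actual pr_args.py code:

```python
MAX_PARALLEL_JOBS = 10  # config.yaml declares slots 0..9


def parse_parallel_index(idx_str, line):
    """Parse and validate the index of a /parallel directive."""
    try:
        idx = int(idx_str)
    except ValueError:
        raise ValueError(
            f"Invalid /parallel directive: index must be a number, "
            f"got '{idx_str}' in '{line}'"
        ) from None
    if not 0 <= idx < MAX_PARALLEL_JOBS:
        raise ValueError(
            f"Invalid /parallel directive: index out of range, must be "
            f"0..{MAX_PARALLEL_JOBS - 1}, got {idx} in '{line}'"
        )
    return idx
```

Rejecting out-of-range indices at parse time surfaces the error to the PR author instead of silently dropping the job config in submit.py.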

In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 135-148: The submit_kwargs dict in submit.py captures status_dest
as env.ARTIFACT_DIR too early, causing all parallel submissions to share the
same artifact dir; update submit_kwargs (the dict built in submit.py) to not
freeze status_dest to env.ARTIFACT_DIR — either remove the "status_dest" key or
set it to None so submit_and_wait can use the thread-local/artifact-dir chosen
by the worker/run.Parallel; ensure submit_and_wait still accepts and handles a
None status_dest fallback to the per-worker env.ARTIFACT_DIR/"artifacts".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ac92e4f5-2162-41d1-b95e-cda672e7a047

📥 Commits

Reviewing files that changed from the base of the PR and between 9b669c6 and ef51d9a.

📒 Files selected for processing (7)
  • projects/core/dsl/runtime.py
  • projects/core/dsl/script_manager.py
  • projects/core/library/env.py
  • projects/core/library/run.py
  • projects/fournos_launcher/orchestration/config.yaml
  • projects/fournos_launcher/orchestration/pr_args.py
  • projects/fournos_launcher/orchestration/submit.py

Comment thread projects/core/dsl/runtime.py
Comment thread projects/core/dsl/script_manager.py Outdated
Comment thread projects/core/library/env.py
Comment thread projects/core/library/env.py
Comment thread projects/core/library/run.py Outdated
Comment thread projects/core/library/run.py Outdated
Comment thread projects/core/library/run.py Outdated
Comment thread projects/core/library/run.py Outdated
Comment on lines +241 to +254
    try:
        idx = int(idx_str)
        if idx < 0:
            raise ValueError(
                f"Invalid /parallel directive: index must be non-negative, got {idx} in '{line}'"
            )
    except ValueError as e:
        if "non-negative" in str(e):
            raise
        raise ValueError(
            f"Invalid /parallel directive: index must be a number, got '{idx_str}' in '{line}'"
        ) from None

    return {f"fournos_launcher.parallel_jobs.{idx}": presets}

⚠️ Potential issue | 🟠 Major

Validate /parallel indices against the configured slots.

config.yaml only declares fournos_launcher.parallel_jobs.0 through .9. A directive like /parallel 10 ... passes parsing here, but submit.py only reads the declared map, so that job config can be silently dropped.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/orchestration/pr_args.py` around lines 241 - 254,
The parser currently accepts any non-negative integer for the /parallel index
and returns {f"fournos_launcher.parallel_jobs.{idx}": presets}, but config.yaml
only defines slots 0–9 so indices >=10 will be ignored later; after converting
idx (and after the non-negative check) validate that idx is within the allowed
range (e.g. 0 <= idx < 10) and raise a ValueError with a clear message like
"Invalid /parallel directive: index out of range, must be 0..9, got {idx} in
'{line}'"; implement this check near the existing idx parsing block (or
introduce a MAX_PARALLEL_JOBS constant) to prevent silently dropped configs.

Comment thread projects/fournos_launcher/orchestration/submit.py
@kpouget
Contributor Author

kpouget commented Apr 29, 2026

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

2 similar comments
@kpouget
Contributor Author

kpouget commented Apr 29, 2026

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

@kpouget
Contributor Author

kpouget commented Apr 29, 2026

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

@psap-forge-bot

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 02 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

• Failure indicator: Empty.
Execution logs


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (3)
projects/core/dsl/script_manager.py (1)

98-116: clear_tasks lacks thread-safety synchronization.

While register_task is protected by self._lock, clear_tasks mutates both _task_registry and _task_results without acquiring the lock. This could cause race conditions if tasks are being registered concurrently with a clear operation.

🔒 Proposed fix to add lock protection
     def clear_tasks(self, source_file: str | None = None) -> None:
         """
         Clear tasks from registry

         Args:
             source_file: If specified, only clear tasks from this file.
                         If None, clear all tasks from all files.
         """
-        if source_file is None:
-            # Clear all tasks from all files
-            logger.debug("Clearing all tasks from script manager")
-            self._task_registry.clear()
-            self._task_results.clear()
-        else:
-            # Clear tasks from specific file
-            if source_file in self._task_registry:
-                tasks_to_remove = self._task_registry[source_file]
-
-                # Clear task results for tasks from this file
-                for task_info in tasks_to_remove:
-                    task_name = task_info["name"]
-                    if task_name in self._task_results:
-                        del self._task_results[task_name]
-
-                # Remove tasks from this file
-                del self._task_registry[source_file]
-                logger.debug(f"Cleared {len(tasks_to_remove)} tasks from {source_file}")
+        with self._lock:
+            if source_file is None:
+                # Clear all tasks from all files
+                logger.debug("Clearing all tasks from script manager")
+                self._task_registry.clear()
+                self._task_results.clear()
+            else:
+                # Clear tasks from specific file
+                if source_file in self._task_registry:
+                    tasks_to_remove = self._task_registry[source_file]
+
+                    # Clear task results for tasks from this file
+                    for task_info in tasks_to_remove:
+                        task_name = task_info["name"]
+                        if task_name in self._task_results:
+                            del self._task_results[task_name]
+
+                    # Remove tasks from this file
+                    del self._task_registry[source_file]
+                    logger.debug(f"Cleared {len(tasks_to_remove)} tasks from {source_file}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/dsl/script_manager.py` around lines 98 - 116, clear_tasks
mutates shared state (_task_registry and _task_results) without acquiring the
instance lock, creating a race with register_task which uses self._lock; wrap
the entire clear_tasks implementation in the same lock used by register_task
(acquire self._lock at method start and release at end, including both the
branch that clears all tasks and the branch that clears a specific source_file)
so all reads/writes to _task_registry and _task_results are synchronized with
other methods using self._lock.
projects/fournos_launcher/toolbox/submit_fjob/main.py (1)

179-183: Consider moving import to module level.

The yaml import inside the task function works but is unconventional. Moving it to the top of the file with other imports improves readability and makes dependencies explicit.

♻️ Proposed refactor

At the top of the file:

 import logging
 from datetime import datetime
+
+import yaml
 
 from projects.core.dsl import (

Then remove line 179:

-    import yaml
-
     job_info_file = args.artifact_dir / "artifacts" / f"{ctx.final_job_name}-submission-info.yaml"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_fjob/main.py` around lines 179 -
183, Move the local "import yaml" out of the task function and add it to the
module-level imports at the top of the file so dependencies are explicit; then
remove the inline import on the block that writes job_info_file (the code that
constructs job_info_file using args.artifact_dir and ctx.final_job_name and
calls yaml.dump(job_info, ...)) so the function simply uses the module-level
yaml reference.
projects/fournos_launcher/orchestration/submit.py (1)

138-151: status_dest in submit_kwargs is unused by submit_fjob.

Looking at the submit_fjob function signature (context snippet 1), it doesn't accept a status_dest parameter. This key in submit_kwargs will be ignored when unpacked with **submit_kwargs at line 181.

This isn't a bug since wait_fjobs calls explicitly pass status_dest, but the unused key adds confusion.

♻️ Proposed fix to remove unused key
     submit_kwargs = {
         "cluster_name": cluster_name,
         "project": config.project.get_config("ci_job.project"),
         "variables_overrides": overrides,
         "namespace": config.project.get_config("fournos.namespace"),
         "owner": config.project.get_config("fournos.job.owner"),
         "pipeline_name": config.project.get_config("fournos.job.pipeline_name"),
         "env": env_dict,
-        "status_dest": env.ARTIFACT_DIR,
         "ci_label": config.project.get_config("fournos.job.ci_label"),
         "exclusive": config.project.get_config("fournos.job.exclusive"),
         "gpu_count": gpu_count,
         "gpu_type": gpu_type,
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/orchestration/submit.py` around lines 138 - 151,
submit_kwargs contains a "status_dest" key that submit_fjob does not accept and
will be ignored when **submit_kwargs is used; remove the "status_dest":
env.ARTIFACT_DIR entry from the submit_kwargs dict (in
projects/fournos_launcher/orchestration/submit.py) and instead pass
env.ARTIFACT_DIR explicitly to wait_fjobs (which already expects status_dest) or
add status_dest as an explicit argument where needed—do not change submit_fjob
signature unless you intend for it to consume status_dest.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/fournos_launcher/toolbox/wait_fjobs/main.py`:
- Around line 334-340: The check for command output uses yaml_list.stdout which
will be empty when stdout was redirected via stdout_dest; instead, after running
shell.run(...) keep the yaml_list.success (or returncode) check and verify the
artifact file (artifact_dir / f"{job_name}-pods.yaml") exists and is non-empty
(e.g., .exists() and .stat().st_size > 0) before incrementing captured_count;
update the condition that references yaml_list.stdout to check the file on disk
and leave yaml_list (the shell.run result) for success/returncode checks.
- Around line 306-314: The code currently checks both yaml_list.success and
yaml_list.stdout and yaml_list.stdout.strip() before counting captured
PipelineRun specs; update the condition to only check yaml_list.success (as
other similar code paths do when stdout_dest is used) so replace the combined
stdout checks with a single yaml_list.success in the function that builds the
message returning "Captured PipelineRun specs..." (look for yaml_list and
ctx.target_jobs) and apply the identical simplification in the
capture_all_pod_specs location where the same triple-check appears.
- Around line 122-123: The retry decorator currently hardcodes
`@retry`(attempts=120, delay=10, backoff=1.0) which yields a 20-minute limit;
update this to be configurable instead of fixed: read attempts, delay and
backoff from environment variables or function parameters (e.g., RETRY_ATTEMPTS,
RETRY_DELAY, RETRY_BACKOFF) and apply them to the retry decorator or construct
the decorator dynamically so longer (or overridable) timeouts are possible for
FOURNOS model jobs; alternatively, increase default values to reflect multi-hour
ML runs but ensure the new values are settable at runtime.
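One way to make the retry limits runtime-configurable as suggested; the RETRY_ATTEMPTS/RETRY_DELAY/RETRY_BACKOFF variable names are the ones proposed above, and the decorator itself is a sketch, not the project's actual retry helper:

```python
import functools
import os
import time


def retry(attempts=120, delay=10, backoff=1.0):
    """Retry decorator with env-var-overridable limits (illustrative)."""
    # Environment overrides are read when the decorator is applied,
    # so long-running FOURNOS jobs can raise the limits without a code change.
    attempts = int(os.environ.get("RETRY_ATTEMPTS", attempts))
    delay = float(os.environ.get("RETRY_DELAY", delay))
    backoff = float(os.environ.get("RETRY_BACKOFF", backoff))

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise  # out of attempts: propagate the last error
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator
```

With the defaults this keeps the existing 120 × 10 s budget, while `RETRY_ATTEMPTS=720` (for instance) would extend it to two hours for model jobs.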

---

Nitpick comments:
In `@projects/core/dsl/script_manager.py`:
- Around line 98-116: clear_tasks mutates shared state (_task_registry and
_task_results) without acquiring the instance lock, creating a race with
register_task which uses self._lock; wrap the entire clear_tasks implementation
in the same lock used by register_task (acquire self._lock at method start and
release at end, including both the branch that clears all tasks and the branch
that clears a specific source_file) so all reads/writes to _task_registry and
_task_results are synchronized with other methods using self._lock.

In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 138-151: submit_kwargs contains a "status_dest" key that
submit_fjob does not accept and will be ignored when **submit_kwargs is used;
remove the "status_dest": env.ARTIFACT_DIR entry from the submit_kwargs dict (in
projects/fournos_launcher/orchestration/submit.py) and instead pass
env.ARTIFACT_DIR explicitly to wait_fjobs (which already expects status_dest) or
add status_dest as an explicit argument where needed—do not change submit_fjob
signature unless you intend for it to consume status_dest.

In `@projects/fournos_launcher/toolbox/submit_fjob/main.py`:
- Around line 179-183: Move the local "import yaml" out of the task function and
add it to the module-level imports at the top of the file so dependencies are
explicit; then remove the inline import on the block that writes job_info_file
(the code that constructs job_info_file using args.artifact_dir and
ctx.final_job_name and calls yaml.dump(job_info, ...)) so the function simply
uses the module-level yaml reference.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0e2b6019-d8bc-4f91-9835-6fdf68c88041

📥 Commits

Reviewing files that changed from the base of the PR and between ef51d9a and 0202aa3.

📒 Files selected for processing (8)
  • projects/core/ci_entrypoint/run_ci.py
  • projects/core/dsl/runtime.py
  • projects/core/dsl/script_manager.py
  • projects/fournos_launcher/orchestration/config.yaml
  • projects/fournos_launcher/orchestration/pr_args.py
  • projects/fournos_launcher/orchestration/submit.py
  • projects/fournos_launcher/toolbox/submit_fjob/main.py
  • projects/fournos_launcher/toolbox/wait_fjobs/main.py
✅ Files skipped from review due to trivial changes (2)
  • projects/core/ci_entrypoint/run_ci.py
  • projects/fournos_launcher/orchestration/config.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • projects/core/dsl/runtime.py
  • projects/fournos_launcher/orchestration/pr_args.py

Comment thread projects/fournos_launcher/toolbox/wait_fjobs/main.py Outdated
Comment thread projects/fournos_launcher/toolbox/wait_fjobs/main.py Outdated
Comment thread projects/fournos_launcher/toolbox/wait_fjobs/main.py Outdated
@kpouget kpouget force-pushed the parallel branch 2 times, most recently from 2a9c539 to 0cfb72a, on April 30, 2026 at 07:21
@kpouget
Contributor Author

kpouget commented Apr 30, 2026

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

@psap-forge-bot

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 53 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

Failure indicator:

## /logs/artifacts/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/fournos_launcher/toolbox/submit_and_wait/main.py:177
~~ TASK: wait_for_job_completion: Wait for FOURNOS job to complete
~~ ARTIFACT_DIR: /logs/artifacts/001__submit_and_wait
~~ LOG_FILE: /logs/artifacts/001__submit_and_wait/task.log
~~ ARGS:
~~     cluster_name: psap-mgmt
~~     project: skeleton
~~     args: []
~~     variables_overrides:
~~       fournos_launcher.parallel_jobs.1:
~~       - quick_test
~~       fournos_launcher.parallel_jobs.2:
~~       - long_test
~~       fournos_launcher.parallel_jobs.3:
~~       - quick_test
~~       - side_testing
~~     job_name: forge-skeleton-20260430-072304
~~     namespace: psap-automation
~~     owner: kpouget
~~     display_name: skeleton
~~     pipeline_name: forge-test-only
~~     env:
~~       JOB_TYPE: presubmit
~~       JOB_NAME: pull-ci-openshift-psap-forge-main-fournos
~~       JOB_SPEC: '{"type":"presubmit","job":"pull-ci-openshift-psap-forge-main-fournos","buildid":"2049751204031893504","prowjobid":"8cd4c45a-e461-4bb5-8f5b-5fe419ed411a","refs":{"org":"openshift-psap","repo":"forge","repo_link":"https://github.com/openshift-psap/forge","base_ref":"main","base_sha":"ecd138b2539c3a73e4725be54d1a005c9c3096d7","base_link":"https://github.com/openshift-psap/forge/commit/ecd138b2539c3a73e4725be54d1a005c9c3096d7","pulls":[{"number":49,"author":"kpouget","sha":"0cfb72af362b46f48a7676e86f6b984396e3db14","title":"[fournos_launcher]
~~         Allow launching multiple jobs concurrently","head_ref":"parallel","link":"https://github.com/openshift-psap/forge/pull/49","commit_link":"https://github.com/openshift-psap/forge/pull/49/commits/0cfb72af362b46f48a7676e86f6b984396e3db14","author_link":"https://github.com/kpouget"}]},"decoration_config":{"timeout":"23h0m0s","grace_period":"15s","utility_images":{"clonerefs":"us-docker.pkg.dev/k8s-infra-prow/images/clonerefs:v20260429-8f72e9c5a","initupload":"us-docker.pkg.dev/k8s-infra-prow/images/initupload:v20260429-8f72e9c5a","entrypoint":"us-docker.pkg.dev/k8s-infra-prow/images/entrypoint:v20260429-8f72e9c5a","sidecar":"us-docker.pkg.dev/k8s-infra-prow/images/sidecar:v20260429-8f72e9c5a"},"resources":{"clonerefs":{"limits":{"memory":"3Gi"},"requests":{"cpu":"100m","memory":"500Mi"}},"initupload":{"limits":{"memory":"200Mi"},"requests":{"cpu":"100m","memory":"50Mi"}},"place_entrypoint":{"limits":{"memory":"100Mi"},"requests":{"cpu":"100m","memory":"25Mi"}},"sidecar":{"limits":{"memory":"2Gi"},"requests":{"cpu":"100m","memory":"250Mi"}}},"gcs_configuration":{"bucket":"test-platform-results","path_strategy":"single","default_org":"openshift","default_repo":"origin","mediaTypes":{"log":"text/plain"},"compress_file_types":["txt","log","json","tar","html","yaml"]},"gcs_credentials_secret":"gce-sa-credentials-gcs-publisher","skip_cloning":true,"censor_secrets":true,"censoring_options":{"minimum_secret_length":6}}}'
~~       OPENSHIFT_CI: 'true'
~~       JOB_NAME_SAFE: fournos
~~       BUILD_ID: '2049751204031893504'
~~       PULL_PULL_SHA: 0cfb72af362b46f48a7676e86f6b984396e3db14
~~       PULL_NUMBER: '49'
~~       PULL_BASE_REF: main
~~       REPO_NAME: forge
~~       REPO_OWNER: openshift-psap
~~       PULL_BASE_SHA: ecd138b2539c3a73e4725be54d1a005c9c3096d7
~~       PULL_TITLE: '[fournos_launcher] Allow launching multiple jobs concurrently'
~~       PULL_REFS: main:ecd138b2539c3a73e4725be54d1a005c9c3096d7,49:0cfb72af362b46f48a7676e86f6b984396e3db14
~~       PULL_HEAD_REF: parallel
~~     status_dest: /logs/artifacts
~~     ci_label: pr49_b2049751204031893504
~~     exclusive: true
~~     gpu_count: null
~~     gpu_type: null
~~     artifact_dir: /logs/artifacts/001__submit_and_wait
~~ CONTEXT:
~~     final_job_name: forge-skeleton-20260430-072304
~~     manifest_file: /logs/artifacts/001__submit_and_wait/src/forge-skeleton-20260430-072304-manifest.yaml
~~
~~ EXCEPTION: RuntimeError
~~     Job forge-skeleton-20260430-072304 failed: Forge resolution failed: Job has reached the specified backoff limit
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

Execution logs
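The `/parallel` directives in the test configuration above show up in the failure log as the `variables_overrides` mapping (`fournos_launcher.parallel_jobs.<slot>: [args...]`). As a minimal sketch of that translation — the helper name is hypothetical, not the actual `pr_args.py` code:

```python
def parse_parallel_directives(comment: str) -> dict:
    """Hypothetical sketch: turn '/parallel <slot> <arg...>' lines into the
    variables_overrides mapping seen in the failure logs, e.g.
    '/parallel 3 quick_test side_testing' ->
    {'fournos_launcher.parallel_jobs.3': ['quick_test', 'side_testing']}."""
    overrides = {}
    for line in comment.splitlines():
        parts = line.split()
        if not parts or parts[0] != "/parallel":
            continue  # ignore /test, /cluster, /var, and other directives
        slot, *job_args = parts[1:]
        overrides[f"fournos_launcher.parallel_jobs.{slot}"] = job_args
    return overrides
```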

@kpouget
Contributor Author

kpouget commented Apr 30, 2026

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

@psap-forge-bot

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 20 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

Failure indicator:

## /logs/artifacts/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/fournos_launcher/toolbox/submit_and_wait/main.py:177
~~ TASK: wait_for_job_completion: Wait for FOURNOS job to complete
~~ ARTIFACT_DIR: /logs/artifacts/001__submit_and_wait
~~ LOG_FILE: /logs/artifacts/001__submit_and_wait/task.log
~~ ARGS:
~~     cluster_name: psap-mgmt
~~     project: skeleton
~~     args: []
~~     variables_overrides:
~~       fournos_launcher.parallel_jobs.1:
~~       - quick_test
~~       fournos_launcher.parallel_jobs.2:
~~       - long_test
~~       fournos_launcher.parallel_jobs.3:
~~       - quick_test
~~       - side_testing
~~     job_name: forge-skeleton-20260430-073206
~~     namespace: psap-automation
~~     owner: kpouget
~~     display_name: skeleton
~~     pipeline_name: forge-test-only
~~     env:
~~       JOB_TYPE: presubmit
~~       JOB_NAME: pull-ci-openshift-psap-forge-main-fournos
~~       JOB_SPEC: '{"type":"presubmit","job":"pull-ci-openshift-psap-forge-main-fournos","buildid":"2049753557309067264","prowjobid":"5817d88e-c50e-4f4c-9582-19e76e4e5c3f","refs":{"org":"openshift-psap","repo":"forge","repo_link":"https://github.com/openshift-psap/forge","base_ref":"main","base_sha":"ecd138b2539c3a73e4725be54d1a005c9c3096d7","base_link":"https://github.com/openshift-psap/forge/commit/ecd138b2539c3a73e4725be54d1a005c9c3096d7","pulls":[{"number":49,"author":"kpouget","sha":"427cccba7d52450623e9a7b15dca4bd83bd46de0","title":"[fournos_launcher]
~~         Allow launching multiple jobs concurrently","head_ref":"parallel","link":"https://github.com/openshift-psap/forge/pull/49","commit_link":"https://github.com/openshift-psap/forge/pull/49/commits/427cccba7d52450623e9a7b15dca4bd83bd46de0","author_link":"https://github.com/kpouget"}]},"decoration_config":{"timeout":"23h0m0s","grace_period":"15s","utility_images":{"clonerefs":"us-docker.pkg.dev/k8s-infra-prow/images/clonerefs:v20260429-8f72e9c5a","initupload":"us-docker.pkg.dev/k8s-infra-prow/images/initupload:v20260429-8f72e9c5a","entrypoint":"us-docker.pkg.dev/k8s-infra-prow/images/entrypoint:v20260429-8f72e9c5a","sidecar":"us-docker.pkg.dev/k8s-infra-prow/images/sidecar:v20260429-8f72e9c5a"},"resources":{"clonerefs":{"limits":{"memory":"3Gi"},"requests":{"cpu":"100m","memory":"500Mi"}},"initupload":{"limits":{"memory":"200Mi"},"requests":{"cpu":"100m","memory":"50Mi"}},"place_entrypoint":{"limits":{"memory":"100Mi"},"requests":{"cpu":"100m","memory":"25Mi"}},"sidecar":{"limits":{"memory":"2Gi"},"requests":{"cpu":"100m","memory":"250Mi"}}},"gcs_configuration":{"bucket":"test-platform-results","path_strategy":"single","default_org":"openshift","default_repo":"origin","mediaTypes":{"log":"text/plain"},"compress_file_types":["txt","log","json","tar","html","yaml"]},"gcs_credentials_secret":"gce-sa-credentials-gcs-publisher","skip_cloning":true,"censor_secrets":true,"censoring_options":{"minimum_secret_length":6}}}'
~~       OPENSHIFT_CI: 'true'
~~       JOB_NAME_SAFE: fournos
~~       BUILD_ID: '2049753557309067264'
~~       PULL_PULL_SHA: 427cccba7d52450623e9a7b15dca4bd83bd46de0
~~       PULL_NUMBER: '49'
~~       PULL_BASE_REF: main
~~       REPO_NAME: forge
~~       REPO_OWNER: openshift-psap
~~       PULL_BASE_SHA: ecd138b2539c3a73e4725be54d1a005c9c3096d7
~~       PULL_TITLE: '[fournos_launcher] Allow launching multiple jobs concurrently'
~~       PULL_REFS: main:ecd138b2539c3a73e4725be54d1a005c9c3096d7,49:427cccba7d52450623e9a7b15dca4bd83bd46de0
~~       PULL_HEAD_REF: parallel
~~     status_dest: /logs/artifacts
~~     ci_label: pr49_b2049753557309067264
~~     exclusive: true
~~     gpu_count: null
~~     gpu_type: null
~~     artifact_dir: /logs/artifacts/001__submit_and_wait
~~ CONTEXT:
~~     final_job_name: forge-skeleton-20260430-073206
~~     manifest_file: /logs/artifacts/001__submit_and_wait/src/forge-skeleton-20260430-073206-manifest.yaml
~~
~~ EXCEPTION: RuntimeError
~~     Job forge-skeleton-20260430-073206 failed: Forge resolution failed: Job has reached the specified backoff limit
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

Execution logs

@kpouget
Contributor Author

kpouget commented Apr 30, 2026

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt
/var fournos.namespace: psap-automation-wip

@psap-forge-bot
Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 01 minutes 04 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos skeleton
/var fournos.namespace: psap-automation-wip
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

Failure indicator:

## /logs/artifacts/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/fournos_launcher/toolbox/submit_and_wait/main.py:177
~~ TASK: wait_for_job_completion: Wait for FOURNOS job to complete
~~ ARTIFACT_DIR: /logs/artifacts/001__submit_and_wait
~~ LOG_FILE: /logs/artifacts/001__submit_and_wait/task.log
~~ ARGS:
~~     cluster_name: psap-mgmt
~~     project: skeleton
~~     args: []
~~     variables_overrides:
~~       fournos_launcher.parallel_jobs.1:
~~       - quick_test
~~       fournos_launcher.parallel_jobs.2:
~~       - long_test
~~       fournos_launcher.parallel_jobs.3:
~~       - quick_test
~~       - side_testing
~~     job_name: forge-skeleton-20260430-073347
~~     namespace: psap-automation-wip
~~     owner: kpouget
~~     display_name: skeleton
~~     pipeline_name: forge-test-only
~~     env:
~~       JOB_TYPE: presubmit
~~       JOB_NAME: pull-ci-openshift-psap-forge-main-fournos
~~       JOB_SPEC: '{"type":"presubmit","job":"pull-ci-openshift-psap-forge-main-fournos","buildid":"2049753986755465216","prowjobid":"6db8deeb-5086-4d88-8638-49228c1f73b4","refs":{"org":"openshift-psap","repo":"forge","repo_link":"https://github.com/openshift-psap/forge","base_ref":"main","base_sha":"ecd138b2539c3a73e4725be54d1a005c9c3096d7","base_link":"https://github.com/openshift-psap/forge/commit/ecd138b2539c3a73e4725be54d1a005c9c3096d7","pulls":[{"number":49,"author":"kpouget","sha":"427cccba7d52450623e9a7b15dca4bd83bd46de0","title":"[fournos_launcher]
~~         Allow launching multiple jobs concurrently","head_ref":"parallel","link":"https://github.com/openshift-psap/forge/pull/49","commit_link":"https://github.com/openshift-psap/forge/pull/49/commits/427cccba7d52450623e9a7b15dca4bd83bd46de0","author_link":"https://github.com/kpouget"}]},"decoration_config":{"timeout":"23h0m0s","grace_period":"15s","utility_images":{"clonerefs":"us-docker.pkg.dev/k8s-infra-prow/images/clonerefs:v20260429-8f72e9c5a","initupload":"us-docker.pkg.dev/k8s-infra-prow/images/initupload:v20260429-8f72e9c5a","entrypoint":"us-docker.pkg.dev/k8s-infra-prow/images/entrypoint:v20260429-8f72e9c5a","sidecar":"us-docker.pkg.dev/k8s-infra-prow/images/sidecar:v20260429-8f72e9c5a"},"resources":{"clonerefs":{"limits":{"memory":"3Gi"},"requests":{"cpu":"100m","memory":"500Mi"}},"initupload":{"limits":{"memory":"200Mi"},"requests":{"cpu":"100m","memory":"50Mi"}},"place_entrypoint":{"limits":{"memory":"100Mi"},"requests":{"cpu":"100m","memory":"25Mi"}},"sidecar":{"limits":{"memory":"2Gi"},"requests":{"cpu":"100m","memory":"250Mi"}}},"gcs_configuration":{"bucket":"test-platform-results","path_strategy":"single","default_org":"openshift","default_repo":"origin","mediaTypes":{"log":"text/plain"},"compress_file_types":["txt","log","json","tar","html","yaml"]},"gcs_credentials_secret":"gce-sa-credentials-gcs-publisher","skip_cloning":true,"censor_secrets":true,"censoring_options":{"minimum_secret_length":6}}}'
~~       OPENSHIFT_CI: 'true'
~~       JOB_NAME_SAFE: fournos
~~       BUILD_ID: '2049753986755465216'
~~       PULL_PULL_SHA: 427cccba7d52450623e9a7b15dca4bd83bd46de0
~~       PULL_NUMBER: '49'
~~       PULL_BASE_REF: main
~~       REPO_NAME: forge
~~       REPO_OWNER: openshift-psap
~~       PULL_BASE_SHA: ecd138b2539c3a73e4725be54d1a005c9c3096d7
~~       PULL_TITLE: '[fournos_launcher] Allow launching multiple jobs concurrently'
~~       PULL_REFS: main:ecd138b2539c3a73e4725be54d1a005c9c3096d7,49:427cccba7d52450623e9a7b15dca4bd83bd46de0
~~       PULL_HEAD_REF: parallel
~~     status_dest: /logs/artifacts
~~     ci_label: pr49_b2049753986755465216
~~     exclusive: true
~~     gpu_count: null
~~     gpu_type: null
~~     artifact_dir: /logs/artifacts/001__submit_and_wait
~~ CONTEXT:
~~     final_job_name: forge-skeleton-20260430-073347
~~     manifest_file: /logs/artifacts/001__submit_and_wait/src/forge-skeleton-20260430-073347-manifest.yaml
~~
~~ EXCEPTION: RuntimeError
~~     Job forge-skeleton-20260430-073347 failed: Forge resolution failed: Job has reached the specified backoff limit
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

Execution logs

@kpouget
Contributor Author

kpouget commented Apr 30, 2026

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt
/var fournos.namespace: psap-automation-wip

@psap-forge-bot

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 02 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos skeleton
/var fournos.namespace: psap-automation-wip
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

Failure indicator:

## /logs/artifacts/FAILURE 
--- 📍TypeError STACKTRACE ---
--- 📍submit_job.<locals>.submit_parallel_job() takes 2 positional arguments but 3 were given

   Traceback (most recent call last):
     File "/app/forge/projects/core/library/ci.py", line 100, in wrapper
       exit_code = command_func(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/fournos_launcher/orchestration/ci.py", line 78, in submit
       return submit_mod.submit_job()
              ^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/fournos_launcher/orchestration/submit.py", line 192, in submit_job
       with run.Parallel("parallel_jobs", exit_on_exception=False) as parallel:
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/run.py", line 231, in __exit__
       raise e
     File "/app/forge/projects/core/library/run.py", line 219, in __exit__
       future.result()  # This will raise any exception that occurred
       ^^^^^^^^^^^^^^^
     File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 449, in result
       return self.__get_result()
              ^^^^^^^^^^^^^^^^^^^
     File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 401, in __get_result
       raise self._exception
     File "/usr/lib64/python3.12/concurrent/futures/thread.py", line 59, in run
       result = self.fn(*self.args, **self.kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/run.py", line 192, in _run_with_artifact_dir
       return func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
   TypeError: submit_job.<locals>.submit_parallel_job() takes 2 positional arguments but 3 were given

[...]

Execution logs
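The `TypeError` above comes from the thread-pool dispatch path in the traceback: `run.Parallel.__exit__` resolves futures whose worker wrapper forwards positional arguments to the task function, and one argument too many reaches `submit_parallel_job`. A minimal, self-contained reproduction of that arity mismatch — the names mirror the traceback but the code is an illustrative stand-in, not the actual `run.py`:

```python
from concurrent.futures import ThreadPoolExecutor

def _run_with_artifact_dir(func, artifact_dir, *args, **kwargs):
    # Stand-in for the run.py wrapper: consumes artifact_dir, forwards the rest.
    return func(*args, **kwargs)

def submit_parallel_job(index, job_args):
    # The task expects exactly two positional arguments.
    return (index, job_args)

def demo_arity_mismatch():
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Bug pattern from the traceback: one positional argument too many
        # is forwarded through the wrapper to the task function.
        future = pool.submit(_run_with_artifact_dir, submit_parallel_job,
                             "/logs/artifacts", 1, ["quick_test"], "extra")
        try:
            future.result()  # re-raises the exception from the worker thread
            return None
        except TypeError as e:
            return str(e)
```

Note that `future.result()` is what surfaces the worker's exception in the calling thread, which is why the failure is reported from `__exit__` rather than from the submission site.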

@kpouget
Contributor Author

kpouget commented Apr 30, 2026

/test fournos skeleton
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt
/var fournos.namespace: psap-automation-wip

@psap-forge-bot

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 04 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos skeleton
/var fournos.namespace: psap-automation-wip
/parallel 1 quick_test
/parallel 2 long_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

• Failure indicator: Empty.
Execution logs

@kpouget
Contributor Author

kpouget commented Apr 30, 2026

/test fournos skeleton quick_test
/cluster psap-mgmt
/var fournos.namespace: psap-automation-wip

@psap-forge-bot

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 20 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos skeleton quick_test
/var fournos.namespace: psap-automation-wip
/cluster psap-mgmt

Failure indicator:

## /logs/artifacts/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/fournos_launcher/toolbox/submit_and_wait/main.py:177
~~ TASK: wait_for_job_completion: Wait for FOURNOS job to complete
~~ ARTIFACT_DIR: /logs/artifacts/001__submit_and_wait
~~ LOG_FILE: /logs/artifacts/001__submit_and_wait/task.log
~~ ARGS:
~~     cluster_name: psap-mgmt
~~     project: skeleton
~~     args:
~~     - quick_test
~~     variables_overrides: {}
~~     job_name: forge-skeleton-20260430-075649
~~     namespace: psap-automation-wip
~~     owner: kpouget
~~     display_name: skeleton quick_test
~~     pipeline_name: forge-test-only
~~     env:
~~       JOB_TYPE: presubmit
~~       JOB_NAME: pull-ci-openshift-psap-forge-main-fournos
~~       JOB_SPEC: '{"type":"presubmit","job":"pull-ci-openshift-psap-forge-main-fournos","buildid":"2049759790367772672","prowjobid":"34ab9060-31e8-471d-bb8c-f4e57c960034","refs":{"org":"openshift-psap","repo":"forge","repo_link":"https://github.com/openshift-psap/forge","base_ref":"main","base_sha":"ecd138b2539c3a73e4725be54d1a005c9c3096d7","base_link":"https://github.com/openshift-psap/forge/commit/ecd138b2539c3a73e4725be54d1a005c9c3096d7","pulls":[{"number":49,"author":"kpouget","sha":"62352c8fa3fcf7a9f0217e3bc3d8bf803e4f1a31","title":"[fournos_launcher]
~~         Allow launching multiple jobs concurrently","head_ref":"parallel","link":"https://github.com/openshift-psap/forge/pull/49","commit_link":"https://github.com/openshift-psap/forge/pull/49/commits/62352c8fa3fcf7a9f0217e3bc3d8bf803e4f1a31","author_link":"https://github.com/kpouget"}]},"decoration_config":{"timeout":"23h0m0s","grace_period":"15s","utility_images":{"clonerefs":"us-docker.pkg.dev/k8s-infra-prow/images/clonerefs:v20260429-8f72e9c5a","initupload":"us-docker.pkg.dev/k8s-infra-prow/images/initupload:v20260429-8f72e9c5a","entrypoint":"us-docker.pkg.dev/k8s-infra-prow/images/entrypoint:v20260429-8f72e9c5a","sidecar":"us-docker.pkg.dev/k8s-infra-prow/images/sidecar:v20260429-8f72e9c5a"},"resources":{"clonerefs":{"limits":{"memory":"3Gi"},"requests":{"cpu":"100m","memory":"500Mi"}},"initupload":{"limits":{"memory":"200Mi"},"requests":{"cpu":"100m","memory":"50Mi"}},"place_entrypoint":{"limits":{"memory":"100Mi"},"requests":{"cpu":"100m","memory":"25Mi"}},"sidecar":{"limits":{"memory":"2Gi"},"requests":{"cpu":"100m","memory":"250Mi"}}},"gcs_configuration":{"bucket":"test-platform-results","path_strategy":"single","default_org":"openshift","default_repo":"origin","mediaTypes":{"log":"text/plain"},"compress_file_types":["txt","log","json","tar","html","yaml"]},"gcs_credentials_secret":"gce-sa-credentials-gcs-publisher","skip_cloning":true,"censor_secrets":true,"censoring_options":{"minimum_secret_length":6}}}'
~~       OPENSHIFT_CI: 'true'
~~       JOB_NAME_SAFE: fournos
~~       BUILD_ID: '2049759790367772672'
~~       PULL_PULL_SHA: 62352c8fa3fcf7a9f0217e3bc3d8bf803e4f1a31
~~       PULL_NUMBER: '49'
~~       PULL_BASE_REF: main
~~       REPO_NAME: forge
~~       REPO_OWNER: openshift-psap
~~       PULL_BASE_SHA: ecd138b2539c3a73e4725be54d1a005c9c3096d7
~~       PULL_TITLE: '[fournos_launcher] Allow launching multiple jobs concurrently'
~~       PULL_REFS: main:ecd138b2539c3a73e4725be54d1a005c9c3096d7,49:62352c8fa3fcf7a9f0217e3bc3d8bf803e4f1a31
~~       PULL_HEAD_REF: parallel
~~     status_dest: /logs/artifacts
~~     ci_label: pr49_b2049759790367772672
~~     exclusive: true
~~     gpu_count: null
~~     gpu_type: null
~~     artifact_dir: /logs/artifacts/001__submit_and_wait
~~ CONTEXT:
~~     final_job_name: forge-skeleton-20260430-075649
~~     manifest_file: /logs/artifacts/001__submit_and_wait/src/forge-skeleton-20260430-075649-manifest.yaml
~~
~~ EXCEPTION: RuntimeError
~~     Job forge-skeleton-20260430-075649 failed: Forge resolution failed: Job has reached the specified backoff limit
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

Execution logs

@kpouget
Contributor Author

kpouget commented Apr 30, 2026

/test fournos skeleton quick_test
/cluster psap-mgmt
/var fournos.namespace: psap-automation-wip

@psap-forge-bot

🔴 Test of 'fournos_launcher submit' failed after 00 hours 01 minutes 06 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos skeleton quick_test
/var fournos.namespace: psap-automation-wip
/cluster psap-mgmt

Failure indicator:

## /logs/artifacts/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/fournos_launcher/toolbox/submit_and_wait/main.py:178
~~ TASK: wait_for_job_completion: Wait for FOURNOS job to complete
~~ ARTIFACT_DIR: /logs/artifacts/001__submit_and_wait
~~ LOG_FILE: /logs/artifacts/001__submit_and_wait/task.log
~~ ARGS:
~~     cluster_name: psap-mgmt
~~     project: skeleton
~~     args:
~~     - quick_test
~~     variables_overrides: {}
~~     job_name: forge-skeleton-20260430-080218
~~     namespace: psap-automation-wip
~~     owner: kpouget
~~     display_name: skeleton quick_test
~~     pipeline_name: forge-test-only
~~     env:
~~       JOB_TYPE: presubmit
~~       JOB_NAME: pull-ci-openshift-psap-forge-main-fournos
~~       JOB_SPEC: '{"type":"presubmit","job":"pull-ci-openshift-psap-forge-main-fournos","buildid":"2049761157819928576","prowjobid":"0e9de590-e24e-45b3-b6e4-ac2f79c3a6e9","refs":{"org":"openshift-psap","repo":"forge","repo_link":"https://github.com/openshift-psap/forge","base_ref":"main","base_sha":"ecd138b2539c3a73e4725be54d1a005c9c3096d7","base_link":"https://github.com/openshift-psap/forge/commit/ecd138b2539c3a73e4725be54d1a005c9c3096d7","pulls":[{"number":49,"author":"kpouget","sha":"87d1f3de95faeb854c04d057291fff70793f190a","title":"[fournos_launcher]
~~         Allow launching multiple jobs concurrently","head_ref":"parallel","link":"https://github.com/openshift-psap/forge/pull/49","commit_link":"https://github.com/openshift-psap/forge/pull/49/commits/87d1f3de95faeb854c04d057291fff70793f190a","author_link":"https://github.com/kpouget"}]},"decoration_config":{"timeout":"23h0m0s","grace_period":"15s","utility_images":{"clonerefs":"us-docker.pkg.dev/k8s-infra-prow/images/clonerefs:v20260429-8f72e9c5a","initupload":"us-docker.pkg.dev/k8s-infra-prow/images/initupload:v20260429-8f72e9c5a","entrypoint":"us-docker.pkg.dev/k8s-infra-prow/images/entrypoint:v20260429-8f72e9c5a","sidecar":"us-docker.pkg.dev/k8s-infra-prow/images/sidecar:v20260429-8f72e9c5a"},"resources":{"clonerefs":{"limits":{"memory":"3Gi"},"requests":{"cpu":"100m","memory":"500Mi"}},"initupload":{"limits":{"memory":"200Mi"},"requests":{"cpu":"100m","memory":"50Mi"}},"place_entrypoint":{"limits":{"memory":"100Mi"},"requests":{"cpu":"100m","memory":"25Mi"}},"sidecar":{"limits":{"memory":"2Gi"},"requests":{"cpu":"100m","memory":"250Mi"}}},"gcs_configuration":{"bucket":"test-platform-results","path_strategy":"single","default_org":"openshift","default_repo":"origin","mediaTypes":{"log":"text/plain"},"compress_file_types":["txt","log","json","tar","html","yaml"]},"gcs_credentials_secret":"gce-sa-credentials-gcs-publisher","skip_cloning":true,"censor_secrets":true,"censoring_options":{"minimum_secret_length":6}}}'
~~       OPENSHIFT_CI: 'true'
~~       JOB_NAME_SAFE: fournos
~~       BUILD_ID: '2049761157819928576'
~~       PULL_PULL_SHA: 87d1f3de95faeb854c04d057291fff70793f190a
~~       PULL_NUMBER: '49'
~~       PULL_BASE_REF: main
~~       REPO_NAME: forge
~~       REPO_OWNER: openshift-psap
~~       PULL_BASE_SHA: ecd138b2539c3a73e4725be54d1a005c9c3096d7
~~       PULL_TITLE: '[fournos_launcher] Allow launching multiple jobs concurrently'
~~       PULL_REFS: main:ecd138b2539c3a73e4725be54d1a005c9c3096d7,49:87d1f3de95faeb854c04d057291fff70793f190a
~~       PULL_HEAD_REF: parallel
~~     status_dest: /logs/artifacts
~~     ci_label: pr49_b2049761157819928576
~~     exclusive: true
~~     gpu_count: null
~~     gpu_type: null
~~     artifact_dirname_suffix: null
~~     artifact_dir: /logs/artifacts/001__submit_and_wait
~~ CONTEXT:
~~     final_job_name: forge-skeleton-20260430-080218
~~     manifest_file: /logs/artifacts/001__submit_and_wait/src/forge-skeleton-20260430-080218-manifest.yaml
~~
~~ EXCEPTION: RuntimeError
~~     Job forge-skeleton-20260430-080218 failed: Tasks Completed: 2 (Failed: 2, Cancelled 0), Skipped: 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

Execution logs

@psap-forge-bot

psap-forge-bot Bot commented May 4, 2026

🟢 Test of 'fournos_launcher submit' succeeded after 00 hours 06 minutes 32 seconds 🟢

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos skeleton
/var fournos.namespace: psap-automation-wip
/parallel 1 quick_test
/parallel 2 medium_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

Execution logs
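The green runs above show the parallel slots executing concurrently, each with a suffixed job name (`forge-skeleton-...-1`, `-2`, `-3`). The `run.Parallel("parallel_jobs", ...)` context manager seen in the earlier traceback can be sketched roughly as follows — a minimal illustration under stated assumptions (the `delayed` method and `results` attribute are hypothetical, not the actual `run.py` API):

```python
from concurrent.futures import ThreadPoolExecutor

class Parallel:
    """Minimal sketch of a run.Parallel-style helper: collect tasks inside
    the with-block, run them concurrently on __exit__, and re-raise the
    first failure so the caller sees it (as in the earlier traceback)."""

    def __init__(self, name, exit_on_exception=True):
        self.name = name
        self.exit_on_exception = exit_on_exception
        self._tasks = []
        self.results = []

    def delayed(self, func, *args, **kwargs):
        # Queue a task; nothing runs until the with-block exits.
        self._tasks.append((func, args, kwargs))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            return False  # the with-block body itself failed; don't run tasks
        errors = []
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(f, *a, **kw) for f, a, kw in self._tasks]
            for future in futures:
                try:
                    self.results.append(future.result())
                except Exception as e:  # collect, then decide whether to raise
                    errors.append(e)
        if errors and self.exit_on_exception:
            raise errors[0]
        return False
```

Usage mirrors the call site in `submit.py`: tasks are registered inside the with-block and all run when it closes, so a single slow slot does not delay the submission of the others.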

@kpouget
Contributor Author

kpouget commented May 4, 2026

/test fournos skeleton
/var fournos.namespace: psap-automation-wip
/parallel 1 quick_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

@psap-forge-bot

psap-forge-bot Bot commented May 4, 2026

🟢 Test of 'skeleton test' succeeded after 00 hours 00 minutes 10 seconds 🟢

• Link to the test results.

• No reports index generated...

Test configuration:

ci_job.cluster: psap-mgmt
ci_job.exclusive: true
ci_job.fjob: forge-skeleton-20260504-113606-1
ci_job.hardware:
  gpuCount: 4
  gpuType: h200
ci_job.name: skeleton quick_test (job 1)
ci_job.owner: kpouget
project.args:
- quick_test
project.name: skeleton

Execution logs

@psap-forge-bot

psap-forge-bot Bot commented May 4, 2026

🟢 Test of 'skeleton test' succeeded after 00 hours 00 minutes 10 seconds 🟢

• Link to the test results.

• No reports index generated...

Test configuration:

ci_job.cluster: psap-mgmt
ci_job.exclusive: true
ci_job.fjob: forge-skeleton-20260504-113606-3
ci_job.hardware:
  gpuCount: 4
  gpuType: h200
ci_job.name: skeleton quick_test side_testing (job 3)
ci_job.owner: kpouget
project.args:
- quick_test
- side_testing
project.name: skeleton

Execution logs

@psap-forge-bot

psap-forge-bot Bot commented May 4, 2026

🟢 Test of 'fournos_launcher submit' succeeded after 00 hours 02 minutes 37 seconds 🟢

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos skeleton
/var fournos.namespace: psap-automation-wip
/parallel 1 quick_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

Execution logs

@kpouget
Contributor Author

kpouget commented May 4, 2026

/test fournos skeleton
/var fournos.namespace: psap-automation-wip
/parallel 1 quick_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

@psap-forge-bot

psap-forge-bot Bot commented May 4, 2026

🟢 Test of 'skeleton test' succeeded after 00 hours 00 minutes 10 seconds 🟢

• Link to the test results.

• No reports index generated...

Test configuration:

ci_job.cluster: psap-mgmt
ci_job.exclusive: true
ci_job.fjob: forge-skeleton-20260504-115754-1
ci_job.hardware:
  gpuCount: 4
  gpuType: h200
ci_job.name: skeleton quick_test (job 1)
ci_job.owner: kpouget
project.args:
- quick_test
project.name: skeleton

Execution logs

@psap-forge-bot

psap-forge-bot Bot commented May 4, 2026

🟢 Test of 'skeleton test' succeeded after 00 hours 00 minutes 10 seconds 🟢

• Link to the test results.

• No reports index generated...

Test configuration:

ci_job.cluster: psap-mgmt
ci_job.exclusive: true
ci_job.fjob: forge-skeleton-20260504-115754-3
ci_job.hardware:
  gpuCount: 4
  gpuType: h200
ci_job.name: skeleton quick_test side_testing (job 3)
ci_job.owner: kpouget
project.args:
- quick_test
- side_testing
project.name: skeleton

Execution logs

@psap-forge-bot

psap-forge-bot Bot commented May 4, 2026

🟢 Test of 'fournos_launcher submit' succeeded after 00 hours 02 minutes 33 seconds 🟢

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos skeleton
/var fournos.namespace: psap-automation-wip
/parallel 1 quick_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

Execution logs

@kpouget
Contributor Author

kpouget commented May 4, 2026

/test fournos skeleton
/var fournos.namespace: psap-automation-wip
/parallel 1 quick_test
/parallel 2 medium_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt

@psap-forge-bot

psap-forge-bot Bot commented May 4, 2026

🟢 Test of 'skeleton test' succeeded after 00 hours 02 minutes 00 seconds 🟢

• Link to the test results.

• No reports index generated...

Test configuration:

ci_job.cluster: psap-mgmt
ci_job.exclusive: true
ci_job.fjob: forge-skeleton-20260504-120459-2
ci_job.hardware:
  gpuCount: 4
  gpuType: h200
ci_job.name: skeleton medium_test (job 2)
ci_job.owner: kpouget
project.args:
- medium_test
project.name: skeleton

Execution logs

@kpouget
Contributor Author

kpouget commented May 4, 2026

good, merging this version!

@kpouget kpouget merged commit f8f2c7c into openshift-psap:main May 4, 2026
5 of 7 checks passed
@kpouget kpouget deleted the parallel branch May 4, 2026 12:08
@kpouget
Contributor Author

kpouget commented May 4, 2026

/test fournos skeleton
/var fournos.namespace: psap-automation-wip
/parallel 1 quick_test
/parallel 2 medium_test
/parallel 3 quick_test side_testing
/cluster psap-mgmt
