Merged
7 changes: 7 additions & 0 deletions src/dstack/_internal/core/models/runs.py
@@ -548,11 +548,15 @@ def _status_message(cls, values) -> Dict:
            retry_on_events = (
                jobs[0].job_spec.retry.on_events if jobs and jobs[0].job_spec.retry else []
            )
            job_status = (
                jobs[0].job_submissions[-1].status if jobs and jobs[0].job_submissions else None
            )
Comment on lines +551 to +553
Collaborator

For multi-job runs, it can be misleading to rely solely on the status of the first job.

Consider a two-replica service. If the first replica is pulling and the second is running, the current implementation will set the run status message to pulling.

 NAME               BACKEND          RESOURCES                        PRICE    STATUS   SUBMITTED  
 test-service                                                                  pulling  3 mins ago 
   replica=0 job=0  aws (us-west-2)  cpu=2 mem=1GB disk=100GB (spot)  $0.0025  pulling  21 sec ago 
   replica=1 job=0  aws (us-west-2)  cpu=2 mem=1GB disk=100GB (spot)  $0.0025  running  3 mins ago

However, if the first replica is running and the second is pulling, the message will be running.

 NAME               BACKEND          RESOURCES                        PRICE    STATUS   SUBMITTED  
 test-service                                                                  running  8 mins ago 
   replica=0 job=0  aws (us-west-2)  cpu=2 mem=1GB disk=100GB (spot)  $0.0025  running  6 mins ago 
   replica=1 job=0  aws (us-west-2)  cpu=2 mem=1GB disk=100GB (spot)  $0.0025  pulling  16 sec ago

This inconsistency is misleading because the replicas are supposed to have equal importance.
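
The order dependence can be reproduced with a small stand-alone sketch. The `Job` and `Submission` dataclasses and the `first_job_status` helper are hypothetical stand-ins, not dstack's actual models; only the expression inside the helper mirrors the diff.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Submission:
    # Hypothetical stand-in for a job submission; only the status matters here.
    status: str


@dataclass
class Job:
    # Hypothetical stand-in for a job and its submission history.
    job_submissions: List[Submission] = field(default_factory=list)


def first_job_status(jobs: List[Job]) -> Optional[str]:
    # Same expression as in the diff: only the first job is consulted.
    return jobs[0].job_submissions[-1].status if jobs and jobs[0].job_submissions else None


pulling = Job([Submission("pulling")])
running = Job([Submission("running")])

# The same two replicas give different results depending on their order.
print(first_job_status([pulling, running]))  # pulling
print(first_job_status([running, pulling]))  # running
```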

I suggest relying on the job status only for single-job runs. In multi-job runs, job statuses would then not affect the run status message.

Suggested change
- job_status = (
-     jobs[0].job_submissions[-1].status if jobs and jobs[0].job_submissions else None
- )
+ job_status = (
+     jobs[0].job_submissions[-1].status
+     if len(jobs) == 1 and jobs[0].job_submissions
+     else None
+ )
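
A minimal sketch of how the suggested condition behaves, using hypothetical stand-ins (`Job`, `Submission`, `single_job_status`) rather than the real dstack models:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Submission:
    # Hypothetical stand-in for a job submission.
    status: str


@dataclass
class Job:
    # Hypothetical stand-in for a job and its submission history.
    job_submissions: List[Submission] = field(default_factory=list)


def single_job_status(jobs: List[Job]) -> Optional[str]:
    # Suggested expression: consult the job status only when the run
    # has exactly one job; otherwise report no job status at all.
    return (
        jobs[0].job_submissions[-1].status
        if len(jobs) == 1 and jobs[0].job_submissions
        else None
    )


pulling = Job([Submission("pulling")])
running = Job([Submission("running")])

print(single_job_status([pulling]))           # pulling
print(single_job_status([pulling, running]))  # None
print(single_job_status([running, pulling]))  # None
```

With `None` returned for multi-job runs, the `job_status == JobStatus.PULLING` check in `_get_status_message` would not match, so the message would be decided by the checks that follow it.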

Contributor Author

Sorry, merged without this change! Will fix it separately!

            termination_reason = Run.get_last_termination_reason(jobs[0]) if jobs else None
        except KeyError:
            return values
        values["status_message"] = Run._get_status_message(
            status=status,
            job_status=job_status,
            retry_on_events=retry_on_events,
            termination_reason=termination_reason,
        )
@@ -568,9 +572,12 @@ def get_last_termination_reason(job: "Job") -> Optional[JobTerminationReason]:
    @staticmethod
    def _get_status_message(
        status: RunStatus,
        job_status: Optional[JobStatus],
        retry_on_events: List[RetryEvent],
        termination_reason: Optional[JobTerminationReason],
    ) -> str:
        if job_status == JobStatus.PULLING:
            return "pulling"
        # Currently, `retrying` is shown only for `no-capacity` events
        if (
            status in [RunStatus.SUBMITTED, RunStatus.PENDING]