This repository was archived by the owner on Dec 20, 2024. It is now read-only.

Fix mlflow authenticate bug when forking run with multiple nodes#196

Open
havardhhaugen wants to merge 2 commits intoecmwf:developfrom
metno:fix/mlflow-auth

Conversation

@havardhhaugen
Contributor

Issue:
see issue ecmwf/anemoi-core#6

Fix:
Add a check for rank zero only when authenticating etc for a forked run.
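A minimal sketch of the rank-zero guard described above, assuming the global rank is exposed via a PyTorch-style `RANK` environment variable; `authenticate_for_fork` and the client object are illustrative names, not the actual anemoi-training API:

```python
import os


def is_rank_zero() -> bool:
    # Distributed launchers (e.g. torchrun) set RANK per process;
    # default to rank 0 for single-process runs.
    return int(os.environ.get("RANK", "0")) == 0


def authenticate_for_fork(client):
    # Only the rank-0 process talks to the MLflow server when forking;
    # the other ranks skip authentication so a large multi-node job
    # does not flood the server with simultaneous requests.
    if not is_rank_zero():
        return None
    return client.authenticate()
```

With this guard, a job spanning many nodes issues a single authentication request instead of one per rank.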

@anaprietonem
Contributor

Hey @havardhhaugen, thanks for opening this fix! Does this just happen with forking, or also if you try to resume the run? The proposed fix might break mlflow functionality related to forking and resuming runs when migrating between two servers. I will try to test it to get a better idea. But I think we should try to understand why the authentication is failing with those large jobs. Maybe it's a matter of the number of server requests. @gmertes Do you have a view on that?

@havardhhaugen
Contributor Author

It only happens when forking. I have not done any testing on the fix other than seeing that it now works for the specific job that crashed earlier, so it would be great if you could do some more thorough testing. My suspicion is also that it crashes due to the large number of requests without the rank-zero statement, but as far as I can tell the error message from mlflow is empty, so it's hard to tell.

@anaprietonem
Contributor

anaprietonem commented Dec 19, 2024

I had a better look at this, and the proposed change @havardhhaugen unfortunately breaks the functionality to resume runs that are migrated across remote servers. The reason is that the function self._check_server2server_lineage(parent_run) checks the server2server lineage. If it's a migrated run, this lineage returns the new run id for all ranks, while if it's not a server2server run, rank 0 uses the new run id while the other ranks can point to the parent run to be resumed. If we introduce the rank-0-only check, the lineage breaks for the other ranks on migrated runs and the training hangs.
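To illustrate the lineage behaviour described above (this is a hypothetical sketch, not the actual anemoi-training code): migrated (server2server) runs need every rank to see the new run id, whereas a plain resume gives only rank 0 the new id and lets the other ranks point at the parent run.

```python
def resolve_run_id(rank: int, parent_run_id: str, new_run_id: str,
                   is_server2server: bool) -> str:
    # Migrated run: the lineage must return the new run id for ALL
    # ranks, so every process logs to the migrated run.
    if is_server2server:
        return new_run_id
    # Plain resume: rank 0 uses the new run id; other ranks may point
    # at the parent run being resumed.
    return new_run_id if rank == 0 else parent_run_id
```

If the lineage check only executes on rank 0, the non-zero ranks of a migrated run never receive the new run id, which is consistent with the hang described above.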

From what I have discussed with @gmertes, the logs you shared make us think this is a server error, so we could try to update some of the server settings to increase the number of authentication requests allowed. We will look at this after the Christmas break and are happy to keep you in the loop. I will have a think to see if there are other alternatives to limit the number of authentication requests without breaking the functionality for server2server runs.

