This repository was archived by the owner on Dec 20, 2024. It is now read-only.

Fix mlflow authenticate bug when forking run with multiple nodes#196

Open
havardhhaugen wants to merge 2 commits intoecmwf:developfrom
metno:fix/mlflow-auth

Conversation

@havardhhaugen
Contributor

Issue:
see issue ecmwf/anemoi-core#6

Fix:
Add a check for rank zero only when authenticating etc for a forked run.
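A minimal sketch of the rank-zero guard described above, assuming the global rank is exposed via a PyTorch-style `RANK` environment variable; `authenticate_for_fork` and the client object are illustrative names, not the actual anemoi-training API:

```python
import os


def is_rank_zero() -> bool:
    # Distributed launchers (e.g. torchrun) set RANK per process;
    # default to rank 0 for single-process runs.
    return int(os.environ.get("RANK", "0")) == 0


def authenticate_for_fork(client):
    # Only the rank-0 process talks to the MLflow server when forking;
    # the other ranks skip authentication so a large multi-node job
    # does not flood the server with simultaneous requests.
    if not is_rank_zero():
        return None
    return client.authenticate()
```

With this guard, a job spanning many nodes issues a single authentication request instead of one per rank.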

@anaprietonem
Contributor

Hey @havardhhaugen, thanks for opening this fix! Does this just happen with forking, or also if you try to resume the run? The proposed fix might break mlflow functionality related to forking and resuming runs when migrating between two servers. I will try to test it to get a better idea. But I think we should try to understand why the authentication is failing with those large jobs. Maybe it's a matter of the number of server requests. @gmertes Do you have a view on that?

@havardhhaugen
Contributor Author

It only happens when forking. I have not done any testing on the fix other than seeing that it now works for the specific job that crashed earlier, so it would be great if you could do some more thorough testing. My suspicion is also that it crashes due to the large number of requests without the rank-zero statement, but as far as I can tell the error message from mlflow is empty, so it's hard to tell.

@anaprietonem
Contributor

anaprietonem commented Dec 19, 2024

I had a better look at this, and the proposed change @havardhhaugen unfortunately breaks the functionality to resume runs that are migrated across remote servers. The reason is that the function self._check_server2server_lineage(parent_run) checks the server2server lineage. If it's a migrated run, this lineage returns the new run id for all ranks, while if it's not a server2server run, rank 0 uses the new run id while the other ranks can point to the parent run to be resumed. If we introduce the rank-0-only check, the lineage breaks for the other ranks on migrated runs and the training hangs.
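To illustrate the lineage behaviour described above (this is a hypothetical sketch, not the actual anemoi-training code): migrated (server2server) runs need every rank to see the new run id, whereas a plain resume gives only rank 0 the new id and lets the other ranks point at the parent run.

```python
def resolve_run_id(rank: int, parent_run_id: str, new_run_id: str,
                   is_server2server: bool) -> str:
    # Migrated run: the lineage must return the new run id for ALL
    # ranks, so every process logs to the migrated run.
    if is_server2server:
        return new_run_id
    # Plain resume: rank 0 uses the new run id; other ranks may point
    # at the parent run being resumed.
    return new_run_id if rank == 0 else parent_run_id
```

If the lineage check only executes on rank 0, the non-zero ranks of a migrated run never receive the new run id, which is consistent with the hang described above.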

From what I have discussed with @gmertes, the logs you shared make us think this is a server error, so we could try to update some of the server settings to increase the number of authentication requests allowed. We will look at this after the Christmas break and are happy to keep you in the loop. I will have a think to see if there are other alternatives to limit the number of authentication requests without breaking the functionality for server2server runs.

