Fix mlflow authenticate bug when forking run with multiple nodes #196
havardhhaugen wants to merge 2 commits into ecmwf:develop from
Hey @havardhhaugen, thanks for opening this fix! Does this happen only when forking, or also when you try to resume the run? The proposed fix might break mlflow functionality related to forking and resuming runs when migrating between 2 servers. I will try to test it to get a better idea. But I think we should try to understand why the authentication is failing with those large jobs. Maybe it's a matter of the number of server requests. @gmertes, do you have a view on that?
It only happens when forking. I have not tested the fix beyond seeing that it now works for the specific job that crashed earlier, so it would be great if you could do some more thorough testing. My suspicion is also that, without the rank zero statement, it crashes due to the large number of requests, but as far as I can tell the error message from mlflow is empty, so it's hard to tell.
I had a better look at this, and unfortunately the proposed change @havardhhaugen breaks the functionality to resume runs that are migrated across remote servers. The reason for this is that the function […]. From what I have discussed with @gmertes, the logs you shared make us think this is a server error, so we could try updating some of the server settings to increase the number of authentication requests. We will look at this after the Christmas break and are happy to keep you in the loop. I will think about whether there are other alternatives to limit the number of authentication requests without breaking the functionality for server2server runs.
Issue:
See issue ecmwf/anemoi-core#6.
Fix:
Add a rank-zero check so that authentication and the related steps for a forked run are performed only on rank zero.
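For illustration only, the idea behind the fix can be sketched as a generic rank-zero guard. This is a minimal sketch and not the code in this PR: `get_global_rank`, `run_on_rank_zero` and `authenticate` are hypothetical names, and reading the rank from the `RANK` or `SLURM_PROCID` environment variables is an assumption about the launcher. As noted in the conversation, restricting authentication to rank zero was reported to break resuming runs migrated across servers, so a guard like this would need more thorough testing.

```python
import os
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


def get_global_rank(env: Optional[Mapping[str, str]] = None) -> int:
    """Return this process's global rank, defaulting to 0.

    Assumption: the launcher exports RANK or SLURM_PROCID.
    """
    env = os.environ if env is None else env
    for key in ("RANK", "SLURM_PROCID"):
        value = env.get(key)
        if value is not None:
            return int(value)
    return 0


def run_on_rank_zero(fn: Callable[[], T], rank: Optional[int] = None) -> Optional[T]:
    """Call fn on rank zero only; every other rank skips it and gets None."""
    if rank is None:
        rank = get_global_rank()
    if rank != 0:
        return None
    return fn()


def authenticate() -> str:
    # Hypothetical stand-in for the real authentication call.
    return "authenticated"


if __name__ == "__main__":
    # With many nodes, only one process would contact the server.
    print(run_on_rank_zero(authenticate))
```

With a guard like this, a multi-node job sends a single authentication request instead of one per process, which is the behaviour the rank-zero check is meant to achieve.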