Create global mlflow run and use it for checkpoints #144
Open
irenedea wants to merge 3 commits into single-controller-hackathon
3e87ee0 to b46fc36
rithwik-db reviewed Aug 10, 2025
6d38c89 to 8880bc0
works comment rebased and updated supporting absolute paths and dbfs adding artifacts moved to mlflow utils file minor fix slight adjustment
8880bc0 to cb5fc0b
rithwik-db reviewed Aug 11, 2025
Comment on lines +86 to +98
# NOTE: This doesn't work yet for a few reasons:
# 1. Downloading nested mlflow artifacts doesn't work correctly due to the MlflowObjectStore
#    having issues. For instance, https://github.com/mosaicml/composer/blob/4ae29b1afec56ce2d54f6fa07a7f9578a0d364b0/composer/utils/object_store/mlflow_object_store.py#L465-L476
#    requires `tmp_path = os.path.join(tmp_dir, os.path.basename(artifact_path))` instead of what it currently
#    does. By doing that, the symlink can be loaded correctly.
# 2. If save_folder is an absolute path (e.g. /tmp/checkpoints), the symlink will be created using this
#    absolute path. This is not a valid symlink in mlflow so we need to do some os.path gymnastics to
#    support absolute paths for save_folder.
# 3. We also need to support save_folder being a dbfs path eventually.
# Proposed Approach
# - Create an MlflowCheckpointActor (allowing us to set WORLD_SIZE=1)
#   and create functions within it, based on MlflowObjectStore,
#   that safely handle dbfs paths and absolute paths.
Collaborator
This PR is NOT ready for review: saving to mlflow artifacts currently requires a lot of os.path gymnastics. I am going to keep this PR on hold until we have time to think of a more resilient solution that addresses the problems here. (cc: @irenedea @bowenyang008)
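For discussion, here is a minimal sketch of the path handling the NOTE describes. Both helper names (`to_relative_artifact_path`, `local_download_path`) are hypothetical and not part of this PR or of composer; the `dbfs:` prefix handling is an assumption about how dbfs paths would be passed in. It only uses the standard library and does not call mlflow.

```python
import os
import posixpath


def to_relative_artifact_path(save_folder: str, filename: str) -> str:
    """Hypothetical helper: build a relative artifact path from save_folder.

    Assumption: artifact paths must be relative and use forward slashes,
    so an absolute save_folder such as /tmp/checkpoints has its leading
    slash removed.
    """
    folder = save_folder
    # Assumption: dbfs paths arrive with a 'dbfs:' scheme prefix.
    if folder.startswith("dbfs:"):
        folder = folder[len("dbfs:"):]
    # Normalize separators and drop leading slashes so the path is relative.
    folder = posixpath.normpath(folder.replace(os.sep, "/")).lstrip("/")
    if folder == ".":
        folder = ""
    return posixpath.join(folder, filename) if folder else filename


def local_download_path(tmp_dir: str, artifact_path: str) -> str:
    """Place a downloaded artifact directly under tmp_dir using its basename,
    as suggested in item 1 of the NOTE."""
    return os.path.join(tmp_dir, os.path.basename(artifact_path))
```

With this, `/tmp/checkpoints` and `dbfs:/tmp/checkpoints` would both map to `tmp/checkpoints/...`, which may or may not be the layout we want; that choice is still open.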
After this PR, we should also load the experience buffer from the checkpoints so that checkpointing works correctly with async (this shouldn't be too hard). Right now it only works for sync.
https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/723944411900647/runs/fcbceb3f3c9142539744a0883575ab0a/system-metrics?o=7395834863327820
You can see the metrics / system metrics for two iterations, where the second was a resumption. This was a super small dummy run, so the loss values don't seem to show up when they repeat at 0.0... 🤷‍♀️
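For the experience buffer follow-up, a rough sketch of what saving and restoring could look like, assuming the buffer is a simple in-memory queue. `ExperienceBuffer` here is a hypothetical stand-in, not the class used in this branch, and the `state_dict` / `load_state_dict` naming is only borrowed from the usual checkpointing convention.

```python
from collections import deque
from typing import Any, Deque, Dict


class ExperienceBuffer:
    """Hypothetical in-memory buffer used only to illustrate the round trip."""

    def __init__(self, max_size: int = 1024) -> None:
        self.max_size = max_size
        self._items: Deque[Any] = deque(maxlen=max_size)

    def add(self, item: Any) -> None:
        self._items.append(item)

    def __len__(self) -> int:
        return len(self._items)

    def state_dict(self) -> Dict[str, Any]:
        # Snapshot the contents as a plain list so it can be serialized.
        return {"max_size": self.max_size, "items": list(self._items)}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        # Rebuild the queue in the same order it was saved.
        self.max_size = state["max_size"]
        self._items = deque(state["items"], maxlen=self.max_size)


# Usage: save on checkpoint, restore on resumption.
buffer = ExperienceBuffer(max_size=4)
for rollout in ["r0", "r1", "r2"]:
    buffer.add(rollout)
saved = buffer.state_dict()

resumed = ExperienceBuffer()
resumed.load_state_dict(saved)
```

In the async case the saved state would presumably need to be written alongside the model checkpoint so both come from the same iteration.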