Skip to content

saving output from different ranks#62

Merged
hanaol merged 15 commits intohanaol/test-featurefrom
hanaol/test-feature-distributed
Feb 7, 2026
Merged

saving output from different ranks#62
hanaol merged 15 commits intohanaol/test-featurefrom
hanaol/test-feature-distributed

Conversation

@hanaol
Copy link
Copy Markdown
Collaborator

@hanaol hanaol commented Jan 16, 2026

Handles logging/saving the performance metric across multiple ranks.

Comment thread src/electrai/lightning.py Outdated
Comment thread src/electrai/lightning.py Outdated
Comment thread src/electrai/lightning.py Outdated
Comment thread src/electrai/lightning.py Outdated
@hanaol hanaol requested a review from forklady42 February 4, 2026 22:44
@hanaol hanaol mentioned this pull request Feb 4, 2026
Copy link
Copy Markdown
Collaborator

@forklady42 forklady42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more suggestions. Feel free to merge after addressing.

Comment thread src/electrai/lightning.py Outdated
idx = indices[i]
pred_i = preds[i].numpy()
np.save(out_dir / f"{idx}.npy", pred_i)
np.save(self.out_dir / f"{idx}.npy", preds[i].squeeze(0).cpu().numpy())
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be safer to include self.global_rank in addition to the index in the file name. DistributedSampler pads the dataset to be evenly divisible, which can make for weirdness with the added indices.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@forklady42 do we want this to be saved/logged in the metrics (csv) file as well?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think global rank is already in the metrics file name? I'm seeing tmp_csv = self.tmp_dir / f"metrics_batch_{self.global_rank}_{batch_idx}.csv"

Copy link
Copy Markdown
Collaborator Author

@hanaol hanaol Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I meant using it when combining the temporary files into metrics.csv. If several ranks process the same index, it would appear >1 times in the file.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, yes, probably good to include the rank to make the duplicates easier to reason about.

Comment thread src/electrai/lightning.py
Comment thread src/electrai/lightning.py Outdated
Comment thread src/electrai/lightning.py Outdated
Copy link
Copy Markdown
Collaborator Author

@hanaol hanaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed the comments

@hanaol hanaol merged commit cad9722 into hanaol/test-feature Feb 7, 2026
1 check passed
@hanaol hanaol deleted the hanaol/test-feature-distributed branch April 1, 2026 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants