
docstring updates, attempt to make shape/len more precise #16

Merged: 16 commits, Oct 3, 2024

Conversation

**@ryan-williams** (Member):

No description provided.

```python
n_workers, worker_id = _get_worker_world_rank()
obs_per_proc, obs_rem = divmod(len(self._obs_joinids), world_size)
# obs rows assigned to this "distributed" process
n_proc_obs = obs_per_proc + bool(rank < obs_rem)
```
**@bkmartinjr** (Member) commented on Oct 2, 2024:

I believe this is incorrect. Every GPU gets the same number of samples (this is a hard requirement). Counts can vary across multiple data-loader workers, but each GPU worker must have exactly the same sample count. See the notes in `_create_obs_joinids_partition`, in particular step "#2".

If the partitioning across GPUs does not have the same cardinality, you get a crash or stall when using DDP.

Currently this is handled by dropping any residual obs rows. In the future, we may instead duplicate rows to round up (rather than truncate down), or give the user an option; both are commonly used methods.
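The "truncate down" strategy described above can be sketched as follows. This is an illustrative, hypothetical helper (the names and signature are not the project's actual API): each distributed rank receives exactly `len(ids) // world_size` rows, and the residual rows are dropped so every GPU sees the same sample count.

```python
# Hypothetical sketch of the "truncate down" partitioning strategy:
# every GPU rank gets exactly the same number of obs rows, and any
# residual rows are dropped so DDP neither crashes nor stalls on
# mismatched batch counts. Names are illustrative, not the real API.
def partition_obs_joinids(obs_joinids, world_size, rank):
    per_gpu = len(obs_joinids) // world_size  # rounded down, equal on every rank
    start = rank * per_gpu
    return obs_joinids[start : start + per_gpu]

ids = list(range(10))
parts = [partition_obs_joinids(ids, world_size=3, rank=r) for r in range(3)]
# parts == [[0, 1, 2], [3, 4, 5], [6, 7, 8]]; residual row 9 is dropped
```

Rounding up by duplicating rows (the alternative mentioned above) would instead pad each rank's slice back to a common ceiling, trading a few repeated samples for zero data loss.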

**@ryan-williams** (Member, Author) replied:

Thanks for explaining. I've updated it to reflect that "distributed" processes get rounded-down splits, while their child "data-loader" processes' counts can differ by up to one row (±0.5 around the mean).

I'll add some simple tests of this math as well, later. I think it's worth codifying our assumptions/intentions here.

**@bkmartinjr** (Member) left a review:

Approving, but PTAL at the style/clarity suggestion in pytorch.py.

@ryan-williams ryan-williams marked this pull request as ready for review October 3, 2024 20:53
@ryan-williams ryan-williams merged commit 7801854 into bkmartinjr/initial-non-shuffling-code Oct 3, 2024
24 checks passed
@ryan-williams ryan-williams deleted the rw/pr/6 branch October 3, 2024 20:54