docstring updates, attempt to make `shape`/`len` more precise #16
Conversation
src/tiledbsoma_ml/pytorch.py (Outdated)
n_workers, worker_id = _get_worker_world_rank()
obs_per_proc, obs_rem = divmod(len(self._obs_joinids), world_size)
# obs rows assigned to this "distributed" process
n_proc_obs = obs_per_proc + bool(rank < obs_rem)
I believe this is incorrect. Every GPU gets the same number of samples (this is a hard requirement). Counts can vary across multiple data-loader workers, but each GPU worker must have exactly the same sample count. See the notes in _create_obs_joinids_partition, and in particular step "#2".
If the partitioning across GPUs does not have the same cardinality, you get a crash or stall when using DDP.
Currently, this is handled by dropping any residual obs rows. In the future, we may actually duplicate rows to round up (rather than truncate down), or give the user an option; both are commonly used methods.
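For illustration only, here is a minimal sketch of the equal-cardinality requirement described above. The function name and signature are hypothetical (it is not the library's _create_obs_joinids_partition); it simply shows each DDP rank taking a rounded-down, equal-size slice and dropping the residual rows:

```python
# Minimal sketch, hypothetical helper: every DDP rank must receive exactly the
# same number of obs rows, so residual rows beyond an even split are dropped.
import numpy as np

def partition_obs_joinids(obs_joinids: np.ndarray, world_size: int, rank: int) -> np.ndarray:
    """Return the slice of obs_joinids assigned to this DDP rank.

    Every rank gets exactly len(obs_joinids) // world_size rows; the
    len(obs_joinids) % world_size residual rows are truncated so all ranks
    have identical cardinality (unequal counts can crash or stall DDP).
    """
    obs_per_rank = len(obs_joinids) // world_size  # rounded down, identical for all ranks
    start = rank * obs_per_rank
    return obs_joinids[start : start + obs_per_rank]
```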
Thanks for explaining; I've updated it to reflect that "distributed" processes get rounded-down, equal-size splits, while the splits across their child "data-loader" processes can vary by ±0.5.
I'll add some simple tests of this math later as well; I think it's worth codifying our assumptions/intentions here.
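A minimal sketch of the second level of that split, again with a hypothetical helper name (not the actual tiledbsoma_ml code): a rank's rounded-down share is divided among its child data-loader workers, whose per-worker counts may differ slightly.

```python
# Minimal sketch, hypothetical helper: split one DDP rank's share of obs rows
# across its DataLoader workers.
import numpy as np

def worker_obs_joinids(rank_joinids: np.ndarray, n_workers: int, worker_id: int) -> np.ndarray:
    """Return the rows of this rank's share handled by one DataLoader worker."""
    # np.array_split permits unequal parts: the first len(rank_joinids) % n_workers
    # workers get one extra row, so counts differ by at most one across workers.
    return np.array_split(rank_joinids, n_workers)[worker_id]
```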
Approving, but PTAL at the style/clarity suggestion in pytorch.py
Merged commit 7801854 into bkmartinjr/initial-non-shuffling-code