docstring updates, attempt to make `shape`/`len` more precise #16
Conversation
src/tiledbsoma_ml/pytorch.py (Outdated)
n_workers, worker_id = _get_worker_world_rank()
obs_per_proc, obs_rem = divmod(len(self._obs_joinids), world_size)
# obs rows assigned to this "distributed" process
n_proc_obs = obs_per_proc + bool(rank < obs_rem)
I believe this is incorrect. Every GPU gets the same number of samples (this is a hard requirement). Counts can vary across multiple data-loader workers, but each GPU worker must have exactly the same sample count. See the notes in _create_obs_joinids_partition, and in particular step "#2".
If the partitioning across GPUs does not have the same cardinality, you get a crash or stall when using DDP.
Currently, this is handled by dropping any residual obs rows. In the future, we may actually duplicate rows to round up (rather than truncate down), or give the user an option; both are commonly used methods.
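For illustration only, here is a minimal sketch of the equal-cardinality requirement described above. The function name and signature are hypothetical (it is not the library's _create_obs_joinids_partition); it simply shows each DDP rank taking a rounded-down, equal-size slice and dropping the residual rows:

```python
# Minimal sketch, hypothetical helper: every DDP rank must receive exactly the
# same number of obs rows, so residual rows beyond an even split are dropped.
import numpy as np

def partition_obs_joinids(obs_joinids: np.ndarray, world_size: int, rank: int) -> np.ndarray:
    """Return the slice of obs_joinids assigned to this DDP rank.

    Every rank gets exactly len(obs_joinids) // world_size rows; the
    len(obs_joinids) % world_size residual rows are truncated so all ranks
    have identical cardinality (unequal counts can crash or stall DDP).
    """
    obs_per_rank = len(obs_joinids) // world_size  # rounded down, identical for all ranks
    start = rank * obs_per_rank
    return obs_joinids[start : start + obs_per_rank]
```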
Thanks for explaining; I've updated it to reflect that "distributed" processes get rounded-down, equal-size splits, while the splits across their child "data-loader" processes can vary by ±0.5.
I'll add some simple tests of this math later as well; I think it's worth codifying our assumptions/intentions here.
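A minimal sketch of the second level of that split, again with a hypothetical helper name (not the actual tiledbsoma_ml code): a rank's rounded-down share is divided among its child data-loader workers, whose per-worker counts may differ slightly.

```python
# Minimal sketch, hypothetical helper: split one DDP rank's share of obs rows
# across its DataLoader workers.
import numpy as np

def worker_obs_joinids(rank_joinids: np.ndarray, n_workers: int, worker_id: int) -> np.ndarray:
    """Return the rows of this rank's share handled by one DataLoader worker."""
    # np.array_split permits unequal parts: the first len(rank_joinids) % n_workers
    # workers get one extra row, so counts differ by at most one across workers.
    return np.array_split(rank_joinids, n_workers)[worker_id]
```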
Approving, but PTAL at the style/clarity suggestion in pytorch.py
Merged commit 7801854 into bkmartinjr/initial-non-shuffling-code