-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Master Issue: Function Compatibility with the Different Datatypes of AnnData Fields #829
Comments
In addition to what @eroell mentioned, here are my initial—perhaps naive—thoughts on a potential strategy we could adopt:
From a design perspective, strategy 2 is quite appealing:
However, there are downsides to consider:
|
Strategy 2 is more or less what we've been doing now because in almost all cases we densify. Speaking from experience, we are leaving a loooot of performance behind if we don't support sparse matrices. Moreover, enabling all algorithms with dask arrays is also important so that we can really scale and enable out-of-core computation. I would:
We can break it up into pieces and also @aGuyLearning can potentially help here. WDYT? |
I like this hybrid approach style. I could see a classic dispatch function for the considered function ( Let's try with a quick example. def knn_impute(
adata: AnnData,
...
) -> AnnData:
if isinstance(adata.X, sparse.csr_matrix): # This datatype has a special algorithm available
return _knn_impute_on_csr(adata, ...)
if isinstance(adata.X, sparse.csc_matrix): # This datatype also
return _knn_impute_on_csc(adata, ...)
# Note Dask is not supported
# Fallback to numpy
return from_numpy(_knn_impute_on_numpy(to_numpy(adata.X)))
Then one conversion function could be: @singledispatch
def to_numpy(layer: Any) -> None:
raise ValueError("Unsupported type")
@to_numpy.register
def _(layer: np.ndarray) -> np.ndarray :
return layer # The layer is already a numpy array, then return it
@to_numpy.register
def _(layer: sparse.csr_matrix) -> np.ndarray:
return layer.toarray() # Will densify if I'm not wrong
@to_numpy.register
def _(layer: DaskArray) -> np.ndarray:
return layer.whatever() # Whatever needs to be done there So to sum up, if the user calls
|
You've spotted this right @nicolassidoux - however, optimization for sparse structures if the data is sparse is huge, can reach orders of magnitude easily depending on data sparsity. Also, out of core operations will not at all be possible without following Strategy 1 :) Agree at all points with @Zethson.
absolutely, the high level steps 1 and 2 are both huge. They should rather guide a rollout for this behavior, not "a PR" ;) step 2 is in fact open-end. I'll try to make a first set of tackeable portions soon for step 1 as sub-issues |
Rather than having to/from numpy functions singledispatched, and many names for (Imagine @singledispatch
def sum_entries(data):
"""
Sum entries
Arguments
---------
data
Data to sum entries of.
Returns
-------
result
Sum of entries.
"""
raise TypeError(f"Unsupported data type: {type(data)}")
@sum_entries.register
def _(data: np.ndarray):
print("Summing entries of a NumPy array.")
return np.sum(data)
@sum_entries.register
def _(data: sparse.spmatrix):
print("Summing entries of a SciPy sparse matrix.")
return data.sum() # not so exciting example, just illustrates this could look different for this data type To implement this step by step for (batches of) our functions together with tests would form the implementation of high-level step 1: |
Raise of TypeError for everything not explicitly ok comes in very naturally here |
Looks good to me also, but beware of the limitation of singledispatch: it overloads the function based on the first argument only. I have a suggestion for a first step, if I may: we need to list every function in ehrapy that can manipulate data so we have a clear idea of what needs to be done. |
Ok, so I created this list by checking if a function manipules -> LIST MOVED TO ISSUE DESCRIPTION |
Is e.g. This would not manipulate adata.X , but could create an expensive pd.DataFrame object right. This is more than enough reason to have it tracked on this list. Just want to double-check I understand it correctly :) |
Yes, you're right, this function doesn't manipulate
So if X is something else than Btw, maybe it would be good to implement all kind of conversions (to_ndarray, to_dataframe, etc.). Of course, once again, these features would be better placed in |
I moved the list up to issue description so everyone can amend it easily, |
Another question I have on the list - e.g. in Line 469 in 2e8240f
This function and its friends should also be on the list, right? As they call On a sidenote, to avoid this function and only pick relevant features before (e.g. survival analysis often requires just 2 or so) could be the way to have these functions "dask-ified", even if we still just call |
Description of feature
Right now AnnData fields such as
.X
can have different datatypes, being e.g. numpy dense arrays, sparse arrays, or dask arrays.Some ehrapy functions are compatible with multiple of these types, while some are not.
We can roll this out in a 2 step process:
One way to implement this could be via consistent use of single-dispatch. Should also include dedicated tests.
The order could be determined with volume of estimated usage, required effort, and typical data load (e.g. operations happening on full feature set vs lower-dimensional space)
List of affected functions
The text was updated successfully, but these errors were encountered: