-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cudastf #794
base: main
Are you sure you want to change the base?
Cudastf #794
Conversation
@@ -83,6 +88,10 @@ int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv) | |||
(maxn = matx::max(sqrt(norm))).run(exec); | |||
|
|||
exec.sync(); | |||
#if 1 | |||
ctx.finalize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is finalize
used for vs sync
? Could you hide the context in the executor so the user doesn't need it, and calling exec.sync()
calls finalize()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finalize terminates everything in the stf context, it waits for asynchronous tasks, deletes internal resources etc... you can only do it once, sync is more equivalent to a ctx.task_fence() which is a non blocking fence (it returns a CUDA stream, and waiting for that stream means everything was done).
I'd like to move finalize to the dtor of the executor, but there are some caveats if you define the executor as a static variable, is this allowed ? The caveat might be some inappropriate unload ordering of CUDA and STF libraries as usual ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I think the destructor is the right place. but does sync() work as expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sidelnik is it doing a task fence with a stream sync ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@caugonnet , sync() should be calling ctx.task_fence() now. I agree, I think we should place the ctx.finalize() inside the stf executor dtor
@@ -129,18 +138,30 @@ int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv) | |||
|
|||
} | |||
|
|||
#if 0 | |||
cudaEventRecord(stop, stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually we should mask these events behind the executor as well so the timing is the same regardless of the executor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this makes it look like the code is very different for both executors but that timing is the sole reason especially if finalize is moved to the dtor
@@ -107,7 +108,7 @@ class tensor_t : public detail::tensor_impl_t<T,RANK,Desc> { | |||
* @param rhs Object to copy from | |||
*/ | |||
__MATX_HOST__ tensor_t(tensor_t const &rhs) noexcept | |||
: detail::tensor_impl_t<T, RANK, Desc>{rhs.ldata_, rhs.desc_}, storage_(rhs.storage_) | |||
: detail::tensor_impl_t<T, RANK, Desc>{rhs.ldata_, rhs.desc_, rhs.stf_ldata_}, storage_(rhs.storage_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to understand why this extra data member is needed, because this pointer exists on the device potentially many times, so it can increase the size of the operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's where a careful review of the design is needed ... Our logical data class tracks the use of a specific piece of data, your tensor seems to be a view to some data (with shapes and so on), so it's ok to have just the pointer and shapes, but in STF we do need to keep track of the internal state of the data (who owns a copy, which tasks depend on it, etc...). This is what the logical data does on your behalf and which your tensors cannot do by merely using the pointer.
One conservative take is to say that if you slice a tensor, this is the SAME logical data, so that further concurrent write accesses are serialized. This is sub-optimal when you have non overlapping slices but we cannot do better in a simple strategy. This ensures correctness but not optimality for concurrency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cliffburdick you say it exists many times on the device, but isn't this a host only class ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tensor_t is host/device, but tensor_impl_t is device-only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then i'm even surprised a logical_data can exist in device code, or the storage for it ! But this may be a pointer to an optional logical data ... We need to improve that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh I mistyped. tensor_t is ONLY on the host. tensor_impl_t is both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, it's surprising that we allow the logical data pointer to go on a device
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ldata is local data, and is ultimately just a raw pointer that points to the data needed on the device. This may be the same as the base pointer, or it may be something like a strided/offset pointer.
include/matx/executors/stf.h
Outdated
* | ||
* @param stream CUDA stream | ||
*/ | ||
stfExecutor(cudaStream_t stream) : stream_(stream) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does a stream do here? I thought STF had its own internal streams?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cliffburdick In STF you can create nested/localized contexts & streams from existing (non-STF created) streams. This allows STF mechanisms to be correctly synchronized within the existing stream ecosystem. @caugonnet correct me if I am wrong.
@@ -177,6 +180,16 @@ class tensor_t : public detail::tensor_impl_t<T,RANK,Desc> { | |||
this->SetLocalData(storage_.data()); | |||
} | |||
|
|||
template <typename S2 = Storage, typename D2 = Desc, | |||
std::enable_if_t<is_matx_storage_v<typename remove_cvref<S2>::type> && is_matx_descriptor_v<typename remove_cvref<D2>::type>, bool> = true> | |||
tensor_t(S2 &&s, D2 &&desc, T* ldata, std::optional<stf_logicaldata_type > *stf_ldata_) : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to do something about that type ... std::optional<stf_logicaldata_type > *stf_ldata_
The rationale is to be able to define a tensor before it is associated to an executor, so the logical data might be set lazily.
} | ||
|
||
/** | ||
* Constructor for a rank-0 tensor (scalar). | ||
*/ | ||
tensor_impl_t() { | ||
|
||
auto ldptr = new std::optional<stf_logicaldata_type>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels bad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't compile anymore since we don't allow std:: types on the device. It might work with cuda::std::optional, but we don't use that anywhere currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sidelnik now that we have the notion of logical_token, i believe we might simplify that. Maybe rename stf_logicaldata_type to stf_token ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The risk with token is that if we get it wrong, it's easier to mess things : with a logical data, until you do have some "value" in it, you can't read it and you'll get runtime errors. If you have aliases which would currently use the same token under the hood, it would also have to use the same token when creating the aliased data.
template <typename DescriptorType, std::enable_if_t<is_matx_descriptor_v<typename remove_cvref<DescriptorType>::type>, bool> = true> | ||
__MATX_INLINE__ __MATX_DEVICE__ __MATX_HOST__ tensor_impl_t(T *const ldata, | ||
DescriptorType &&desc, std::optional<stf_logicaldata_type > *stf_ldata) | ||
: ldata_(ldata), desc_{std::forward<DescriptorType>(desc)}, stf_ldata_(stf_ldata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
::std::move(stf_ldata) ?
#endif | ||
|
||
if (perm == 0) { | ||
task.add_deps(ld.write()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could directly build a task_dep in CUDASTF, matching perm value with the type ... But it seems there is no such thing as a clean way to do this !
include/matx/core/tensor_impl.h
Outdated
place = getDataPlace(Data()); | ||
#endif | ||
|
||
*stf_ldata_ = ctx.logical_data(cuda::experimental::stf::void_interface()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comment would be welcome here :) This is creating a logical data with a void data interface because we don't rely on CUDASTF for transfers/allocation, it's just for sync.
Putting a value here, and not a shape of a void interface means we don't have to issue a "write" task in CUDASTF
@@ -45,6 +45,30 @@ | |||
|
|||
namespace matx { | |||
namespace detail { | |||
|
|||
#if 0 | |||
__MATX_INLINE__ cuda::experimental::stf::data_place getDataPlace(void *ptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we keep it ? Note that for void data interface it's not super critical but still ...
return data_place::current_device(); | ||
case MATX_INVALID_MEMORY: | ||
//std::cout << "Data kind is invalid: assuming managed memory\n"; | ||
return data_place::managed; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like an error
} | ||
else { | ||
//std::cout << " RANK 0 not on LHS operator = " << op.str() << '\n'; | ||
detail::matxOpT0Kernel<<<blocks, threads, 0, stream_>>>(op); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we sometimes use something without a task ? Is it coherent with STF tasks?
|
||
bool stride = detail::get_grid_dims<Op::Rank()>(blocks, threads, sizes, 256); | ||
|
||
if constexpr (Op::Rank() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we could factorize all that constexpr cascade, and move the constexpr tests into the lambda ?
@@ -54,6 +54,9 @@ namespace matx | |||
return f_(pp_get<Dim>(indices...)); | |||
} | |||
|
|||
template <typename Task> | |||
__MATX_INLINE__ void apply_dep_to_task([[maybe_unused]] Task &&task, [[maybe_unused]] int perm=1) const noexcept { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this operator is defined per operator, and is STF specific ? it's not part of the executor nor relying on overloads / traits ?
b_.apply_dep_to_task(tsk, 1); | ||
|
||
tsk->*[&](cudaStream_t s) { | ||
auto exec = cudaExecutor(s); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So create a nested MatX executor, is that legal ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be fine. The cache is ultimately what would possibly have side effects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens exactly in the dtor of the executor @cliffburdick, nothing special like a stream sync ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it doesn't do anything
include/matx/core/tensor_impl.h
Outdated
if constexpr (is_cuda_executor_v<Executor>) { | ||
return; | ||
} | ||
else if constexpr (!is_cuda_executor_v<Executor>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just else
?
@@ -1094,6 +1165,9 @@ IGNORE_WARNING_POP_GCC | |||
protected: | |||
T *ldata_; | |||
Desc desc_; | |||
|
|||
public: | |||
mutable std::optional<stf_logicaldata_type > *stf_ldata_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed before this won't work since we can't use std:: objects on the device. It might work with cuda::std::optional
, but we'd likely need to justify the overhead vs other options
@@ -55,6 +55,9 @@ namespace matx | |||
__MATX_INLINE__ __MATX_DEVICE__ __MATX_HOST__ T operator()(Is...) const { | |||
return v_; }; | |||
|
|||
template <typename Task> | |||
__MATX_INLINE__ void apply_dep_to_task([[maybe_unused]] Task &&task, [[maybe_unused]] int perm) const noexcept { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Operator members typically use camel case format
tsk.set_symbol("all_task"); | ||
|
||
output.PreRun(out_dims_, std::forward<Executor>(ex)); | ||
output.apply_dep_to_task(tsk, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't apply_dep_to_task
just part of PreRun
? It looks like it's called in the same place
include/matx/operators/fft.h
Outdated
if constexpr (std::is_same_v<FFTType, fft_t>) { | ||
fft_impl(permute(cuda::std::get<0>(out), perm_), permute(a_, perm_), fft_size_, norm_, ex); | ||
// stfexecutor case | ||
if constexpr (!is_cuda_executor_v<Executor>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want this to run for the host executor too?
output.apply_dep_to_task(tsk, 0); | ||
a_.apply_dep_to_task(tsk, 1); | ||
|
||
tsk->*[&](cudaStream_t s) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than checking if this is not a cuda executor, then creating one inside, can it somehow pull a stream from STF and just use that here?
@@ -120,15 +120,19 @@ namespace matx | |||
auto pApc = clone<VRANK>(pAp, clone_shape); | |||
|
|||
// A*X | |||
(Ap = matvec(A, X)).run(stream); | |||
//(Ap = matvec(A, X)).run(stream); | |||
(Ap = matvec(A, X)).run(exec); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that the same to call run(exec) and run(stream) when we have a "classic" executor ? (won't it trigger much more work ?)
Initial updates to the build system to get Matx working with CUDASTF