-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/GDR_COPY: Optimize memory registration in fast-path #10342
base: master
Are you sure you want to change the base?
UCT/GDR_COPY: Optimize memory registration in fast-path #10342
Conversation
fd4c701
to
b504c8e
Compare
b504c8e
to
7fbaf35
Compare
sched_yield(); | ||
} | ||
|
||
x = __sync_fetch_and_add(&lock->l, UCS_RWLOCK_READ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my curiosity: why we are using deprecated atomic functions, while there are modern family of __atomic_*
functions __atomic_fetch_add
?
Those __sync
functions always make a full barrier, and therefore come with significant overhead when only acquire or release semantics is sufficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From GCC documentation: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
They [family of __sync functions] should not be used for new code which should use the ‘__atomic’ builtins instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe consider using/extending atomic.h
?
static inline void ucs_rwlock_init(ucs_rwlock_t *lock) { | ||
lock->l = 0; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it must come with a set of unit tests.
And maybe multithread (like 32 threads) litmus test that would run for a while with some simple logic (like incrementing few counters by writer threads, acquring these counters by reader threads etc)
int x; | ||
|
||
while (1) { | ||
x = lock->l; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe reuse ucs_rwlock_write_trylock
here?
return 0; | ||
} | ||
|
||
return -EBUSY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's more logical to return 1 when lock is acquired, 0 otherwise?
/* Writer is waiting or has lock */ | ||
#define UCS_RWLOCK_READ 0x4 /* Reader increment */ | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also add #define UCS_RWLOCK_STATIC_INITIALIZER {0}
@@ -8,6 +8,7 @@ | |||
#define UCS_RCACHE_INL_ | |||
|
|||
#include "rcache_int.h" | |||
#include <ucs/profile/profile.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why needed?
* | ||
* Readers increment the counter by UCS_RWLOCK_READ (4) | ||
* Writers set the UCS_RWLOCK_WRITE bit when lock is held | ||
* and set the UCS_RWLOCK_WAIT bit while waiting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alignment
#define UCS_RWLOCK_WAIT 0x1 /* Writer is waiting */ | ||
#define UCS_RWLOCK_WRITE 0x2 /* Writer has the lock */ | ||
#define UCS_RWLOCK_MASK (UCS_RWLOCK_WAIT | UCS_RWLOCK_WRITE) | ||
/* Writer is waiting or has lock */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo, no need for this comment
sched_yield(); | ||
} | ||
|
||
x = __sync_fetch_and_add(&lock->l, UCS_RWLOCK_READ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe consider using/extending atomic.h
?
@@ -1187,6 +1187,44 @@ ucs_status_t ucp_mem_unmap(ucp_context_h context, ucp_mem_h memh) | |||
return UCS_OK; | |||
} | |||
|
|||
static ucs_status_t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we always use one md for mtype, maybe we can avoid ucp_memh as well, just use uct objects?
@@ -75,6 +75,7 @@ nobase_dist_libucs_la_HEADERS = \ | |||
type/param.h \ | |||
type/init_once.h \ | |||
type/spinlock.h \ | |||
type/rwlock.h \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move "introducing fast rwlock to ucs" to separate PR, including gtest for functionality and performance measurement (print lock overhead using UCS_TEST_MESSAGE + write in the PR desc how fast is new lock compared to glibc)
@@ -80,6 +81,17 @@ ucs_rcache_lookup_unsafe(ucs_rcache_t *rcache, void *address, size_t length, | |||
return region; | |||
} | |||
|
|||
static UCS_F_ALWAYS_INLINE ucs_rcache_region_t * | |||
ucs_rcache_lookup(ucs_rcache_t *rcache, void *address, size_t length, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we integrate this optimizations into ucs_rcache_get?
No description provided.