nsys-jax: optimise data loading and .zip creation #1193
base: main
74a3d94 to d8056e0
…ax Python code clearer
…n profiles clearer
Plumb through `prefix` so it's more convenient to explicitly set the input data path.
…v with Python's lzma.
61357e9 to 4bf5946
LGTM, a few questions inline.
.transform(remove_program_id_and_name, axis="columns")
)
compile_df = compile_df.drop(columns=["EndMs"]).astype({"ProgramId": np.int32})
if len(compile_df):
This breaks when `compile_df` is empty?
Yes... for non-zero `len(compile_df)` the return value is a Series of that length, but for zero-length it changes to being an empty DataFrame. There is a `result_type` argument that can be used to get the right behaviour, but I thought it was less clear (from the docs it seems like it might be accidental that it helps anyway).
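For anyone reading along, here is a minimal sketch of the behaviour described above. It uses `DataFrame.apply` rather than the exact call chain in this PR, and `shout` is a hypothetical stand-in for the real row-wise callback, so treat it as an illustration under those assumptions.

```python
import pandas as pd


def shout(row):
    # Hypothetical row-wise callback; not the function used in the PR.
    return row["Name"].upper()


full = pd.DataFrame({"Name": ["a", "b"]})
empty = full.iloc[:0]  # same columns, zero rows

# Non-empty input: one scalar per row, collected into a Series.
per_row = full.apply(shout, axis="columns")

# Empty input: pandas cannot infer the result shape and hands back
# an empty DataFrame instead of an empty Series.
per_row_empty = empty.apply(shout, axis="columns")

# result_type="reduce" asks for a Series even when there are no rows.
reduced = empty.apply(shout, axis="columns", result_type="reduce")

# The explicit guard is arguably easier to read than relying on result_type.
if len(empty):
    empty = empty.assign(Upper=empty.apply(shout, axis="columns"))

print(type(per_row).__name__, type(per_row_empty).__name__, type(reduced).__name__)
```

The `if len(...)` guard simply skips the row-wise step when there is nothing to process, which sidesteps the shape change entirely.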
thunk_df["Communication"] = thunk_df.loc[:, ("Name",)].apply(
is_communication, axis=1
thunk_df["Communication"] = pd.Series(
data=map(is_communication, thunk_df["Name"].items()),
I'm curious why `map(f, items)` is preferable to `df.apply(f)` here.
I don't have the profile data to hand anymore, but IIRC the `apply` approach had non-negligible overhead constructing the temporary passed to the mapped function 🤔
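A rough sketch of the two approaches side by side, assuming a toy `thunk_df` and a made-up `is_communication` rule (the real predicate and column contents are not shown in this thread). With `axis=1`, `apply` builds a temporary Series for every row, whereas `Series.items()` yields plain `(index, value)` tuples.

```python
import pandas as pd

# Toy data; the thunk names are invented for illustration.
thunk_df = pd.DataFrame({"Name": ["all-reduce.1", "fusion.2", "all-gather.3"]})


def is_communication(item):
    # Hypothetical rule: item is an (index, name) tuple from Series.items().
    _, name = item
    return name.startswith(("all-reduce", "all-gather"))


def is_communication_row(row):
    # Row-wise variant: receives a one-element Series per row.
    return is_communication((row.name, row["Name"]))


# apply with axis=1 constructs a Series object for each row.
via_apply = thunk_df[["Name"]].apply(is_communication_row, axis=1)

# map over items() passes tuples, so no per-row Series is created.
# Passing index explicitly is an assumption made here to keep alignment obvious.
via_map = pd.Series(
    data=map(is_communication, thunk_df["Name"].items()),
    index=thunk_df.index,
)

thunk_df["Communication"] = via_map
print(thunk_df)
```

Both variants give the same boolean column; only the cost of getting there differs, and how much it matters would need to be measured on a real profile.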
for comm_thunk in overlap_df.loc[
overlap_df["Communication"], ("ProjStartMs", "ProjEndMs")
].itertuples():
local_df = compute_df.loc[
I guess we could call this "fully overlapped by"?
Not sure quite what you mean by "fully" there? `local_df` contains all of the compute thunks whose execution at least partially overlaps with `comm_thunk`.
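To make the "partially" versus "fully" distinction concrete, here is a small sketch. The selection condition in the PR is cut off above, so the masks below are an assumption about what such a filter could look like, using invented intervals and the `ProjStartMs`/`ProjEndMs` column names from the diff.

```python
import pandas as pd

# Invented compute-thunk intervals in milliseconds.
compute_df = pd.DataFrame(
    {
        "ProjStartMs": [0.0, 2.0, 3.5, 5.0, 9.0],
        "ProjEndMs": [1.0, 4.0, 4.5, 8.0, 10.0],
    }
)

# Invented communication thunk interval.
comm_start, comm_end = 3.0, 6.0

# Partial overlap: the two intervals share at least some time.
partial_mask = (compute_df["ProjStartMs"] < comm_end) & (
    compute_df["ProjEndMs"] > comm_start
)
local_df = compute_df.loc[partial_mask]

# Full containment: the compute thunk runs entirely inside the comm thunk.
contained_mask = (compute_df["ProjStartMs"] >= comm_start) & (
    compute_df["ProjEndMs"] <= comm_end
)
contained_df = compute_df.loc[contained_mask]

print(local_df.index.tolist(), contained_df.index.tolist())
```

The partial-overlap set is a superset of the fully contained set, which is why "fully overlapped by" would describe a narrower selection than the one `local_df` is said to hold.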
Some rough measurements on vanilla `jax-nccl-test` and 8xH100:
The differences are more pronounced on larger workloads with more activity.
The two bigger changes are:
`.csv` to `.parquet` as part of `nsys-jax` to avoid compressing `.csv` with Python's `lzma` module, which is slow and single-threaded. This speeds up `nsys-jax` and subsequent data-loading.
Otherwise there are some tweaks to pandas usage and minor reorganisations to make Python profiles more informative, and minor bugfixes in the example Jupyter notebook.