nsys-jax: optimise data loading and .zip creation #1193

olupton · 2024-12-10T11:21:59Z

Some rough measurements on vanilla jax-nccl-test and 8xH100:

Profile collection, whole execution: 52s (nsys), 58s (nsys-jax with this PR), 65s (nsys-jax without this PR)
Profile collection, restricted range: 46s (nsys), 50s (nsys-jax with this PR), 55s (nsys-jax without this PR)
Communication analysis, whole execution: 1.1s (with this PR), 2.1s (without this PR)
Communication analysis, restricted range: 1.0s (with this PR), 1.7s (without this PR)

The differences are more pronounced on larger workloads with more activity.

The two bigger changes are:

Convert .csv to .parquet as part of nsys-jax to avoid compressing .csv with Python's lzma module, which is slow and single-threaded. This speeds up nsys-jax and subsequent data-loading.
A new algorithm for calculating the hidden/exposed time of communication kernels when loading profile data. Essentially, this adds a fast, pandas-friendly pass that identifies most non-overlapping kernels and skips the relatively slow, pandas-unfriendly overlap calculation for them. It also removes the assumption that there is no compute-compute overlap.
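For illustration, the first change boils down to something like the following. This is only a sketch, not the code in this PR: `csv_to_parquet` and its arguments are hypothetical names, and it assumes a Parquet engine such as pyarrow is installed alongside pandas.

```python
import pandas as pd


def csv_to_parquet(csv_path, parquet_path):
    """Hypothetical helper: re-encode one CSV report as Parquet.

    Parquet is a compressed, columnar format, so the output can be added to
    the archive as-is instead of compressing the raw .csv with lzma.
    """
    df = pd.read_csv(csv_path)
    # Requires a Parquet engine (pyarrow or fastparquet); pandas raises
    # ImportError if none is available.
    df.to_parquet(parquet_path, index=False)
    return df


# Example usage (paths are placeholders):
# csv_to_parquet("report.csv", "report.parquet")
# df = pd.read_parquet("report.parquet")
```

Loading the result with `pd.read_parquet` also skips CSV parsing and type inference, which is presumably where the data-loading speedup comes from.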

Otherwise there are some tweaks to pandas usage and minor reorganisations to make Python profiles more informative, and minor bugfixes in the example Jupyter notebook.

gspschmid

LGTM, a few questions inline.

gspschmid · 2025-01-10T15:47:35Z

.github/container/nsys_jax/nsys_jax/data_loaders.py

-        .transform(remove_program_id_and_name, axis="columns")
-    )
+    compile_df = compile_df.drop(columns=["EndMs"]).astype({"ProgramId": np.int32})
+    if len(compile_df):


This breaks when compile_df is empty?

olupton · 2025-01-15

Yes. For non-zero len(compile_df) the return value is a Series of that length, but for a zero-length input it becomes an empty DataFrame. There is a result_type argument that can be used to get the right behaviour, but I thought it was less clear (from the docs it seems like it might be accidental that it helps anyway).
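A small sketch of the guard discussed above, for readers following along. The `clean_name` function is a hypothetical stand-in for the real per-row cleanup, and the exact type returned by `apply` on an empty frame may depend on the pandas version, so the helper simply avoids calling it in that case.

```python
import pandas as pd


def clean_name(row):
    # Hypothetical stand-in for the real per-row cleanup function.
    return row["Name"].split(":")[0]


def clean_names(df):
    """Return one cleaned name per row, always as a Series."""
    if len(df):
        # Non-empty input: row-wise apply yields a Series of len(df).
        return df.apply(clean_name, axis="columns")
    # Empty input: row-wise apply may hand back an empty DataFrame
    # instead, so build the empty Series explicitly.
    return pd.Series([], index=df.index, dtype=object)


full = pd.DataFrame({"ProgramId": [1, 2], "Name": ["jit_f:1", "jit_g:2"]})
empty = full.iloc[:0]

full_names = clean_names(full)
empty_names = clean_names(empty)
```

Both results are Series, so downstream code does not need to special-case empty profiles.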

gspschmid · 2025-01-10T15:58:14Z

.github/container/nsys_jax/nsys_jax/data_loaders.py

-    thunk_df["Communication"] = thunk_df.loc[:, ("Name",)].apply(
-        is_communication, axis=1
+    thunk_df["Communication"] = pd.Series(
+        data=map(is_communication, thunk_df["Name"].items()),


I'm curious why map(f, items) is preferable to df.apply(f) here.

olupton · 2025-01-15

I don't have the profile data to hand anymore, but IIRC the apply approach had non-negligible overhead from constructing the temporary object passed to the mapped function 🤔
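To make the comparison concrete, here is a minimal sketch of the two styles on toy data. The classification rule and `COMM_PREFIXES` are invented for the example and are not the real `is_communication` logic.

```python
import pandas as pd

# Hypothetical rule: treat these name prefixes as communication.
COMM_PREFIXES = ("all-reduce", "all-gather", "reduce-scatter")


def is_communication(item):
    # Series.items() yields (index label, value) pairs.
    _thunk_index, name = item
    return name.startswith(COMM_PREFIXES)


thunk_df = pd.DataFrame({"Name": ["all-reduce.1", "fusion.7", "all-gather.3"]})

# Row-wise apply: pandas builds a temporary Series for every row.
via_apply = thunk_df[["Name"]].apply(
    lambda row: is_communication((row.name, row["Name"])), axis=1
)

# Plain map over (index, value) tuples: no per-row Series is created.
via_map = pd.Series(
    data=map(is_communication, thunk_df["Name"].items()),
    index=thunk_df.index,
)
```

Both produce the same boolean column; the second form just avoids constructing a pandas object per row.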

gspschmid · 2025-01-10T16:04:54Z

.github/container/nsys_jax/nsys_jax/data_loaders.py

+        for comm_thunk in overlap_df.loc[
+            overlap_df["Communication"], ("ProjStartMs", "ProjEndMs")
+        ].itertuples():
+            local_df = compute_df.loc[


I guess we could call this "fully overlapped by"?

olupton · 2025-01-15

I'm not quite sure what you mean by "fully" there. local_df contains all of the compute thunks whose execution at least partially overlaps with comm_thunk.
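The idea in this thread can be sketched as below. This is an illustrative reimplementation under assumptions, not the code in this PR: `hidden_ms` is a hypothetical name, only the `ProjStartMs`/`ProjEndMs` column names come from the diff, and the fast pass shown here is one possible vectorised filter.

```python
import numpy as np
import pandas as pd


def hidden_ms(comm_df, compute_df):
    """Time (ms) each communication thunk is overlapped by any compute thunk."""
    hidden = pd.Series(0.0, index=comm_df.index)
    if len(compute_df) == 0:
        return hidden
    compute_df = compute_df.sort_values("ProjStartMs")
    starts = compute_df["ProjStartMs"].to_numpy()
    # Latest end time among the first k compute thunks (by start time).
    running_max_end = np.maximum.accumulate(compute_df["ProjEndMs"].to_numpy())

    # Fast, vectorised pass: a comm thunk can only overlap compute thunks that
    # start before it ends; if none of those ends after it starts, skip it.
    n_before = np.searchsorted(starts, comm_df["ProjEndMs"].to_numpy(), side="left")
    latest_end = np.where(
        n_before > 0, running_max_end[np.maximum(n_before - 1, 0)], -np.inf
    )
    may_overlap = latest_end > comm_df["ProjStartMs"].to_numpy()

    # Slow pass, only for the remaining comm thunks.
    for comm in comm_df.loc[may_overlap, ["ProjStartMs", "ProjEndMs"]].itertuples():
        # Compute thunks that at least partially overlap this comm thunk.
        local_df = compute_df.loc[
            (compute_df["ProjStartMs"] < comm.ProjEndMs)
            & (compute_df["ProjEndMs"] > comm.ProjStartMs)
        ]
        # Sweep in start order so overlapping compute thunks count only once.
        covered, cursor = 0.0, comm.ProjStartMs
        for start, end in zip(local_df["ProjStartMs"], local_df["ProjEndMs"]):
            start, end = max(start, cursor), min(end, comm.ProjEndMs)
            if end > start:
                covered += end - start
                cursor = end
        hidden.loc[comm.Index] = covered
    return hidden


compute_df = pd.DataFrame(
    {"ProjStartMs": [0.0, 5.0, 30.0], "ProjEndMs": [10.0, 15.0, 40.0]}
)
comm_df = pd.DataFrame(
    {"ProjStartMs": [8.0, 20.0, 35.0], "ProjEndMs": [20.0, 30.0, 50.0]}
)
hidden = hidden_ms(comm_df, compute_df)
```

In this toy data the first two compute thunks overlap each other, and the sweep counts their union only once. The exposed time of each communication thunk is then its duration minus the hidden time.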

olupton force-pushed the olupton/nsys-jax-python-opt branch 6 times, most recently from 74a3d94 to d8056e0, on December 10, 2024 15:58

olupton requested a review from gspschmid on December 11, 2024 09:56

olupton added 12 commits on January 10, 2025 11:06

Reduce pandas overhead in message size calculation

9ca330e

Optimise calculation of hidden communication time

96392dd

Reduce pandas overhead in communication classification

a3d15e0

Reorganise I/O into separate functions to make profiles of the nsys_jax Python code clearer

2b2d0b3

Note that CUDA graph support is impaired in 2024.6 as well as 2024.5

1dcc817

minor optimisation, comment cleanup

3dd0173

Relax IPython dependency

a33ea66

Skip multiprocessing.Pool if no parallelism is available; makes Python profiles clearer

ef9d9fe

Reduce pandas overhead in compilation range name cleanup

1e82faa

Minor bug fixes in example notebook

8ecf6df

Plumb through `prefix` so it's more convenient to explicitly set the input data path.

Convert .csv to .parquet in nsys-jax to avoid compressing a large .csv with Python's lzma.

d44a9b6

ruff 0.9.0 format

4bf5946

olupton force-pushed the olupton/nsys-jax-python-opt branch from 61357e9 to 4bf5946 on January 10, 2025 11:09

gspschmid approved these changes Jan 10, 2025
