Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(fluids): Update tests for fixed ts_type alpha #1681

Merged
merged 2 commits into from
Oct 8, 2024

Conversation

jrwrigh
Copy link
Collaborator

@jrwrigh jrwrigh commented Oct 4, 2024

The startup logic for ts_type: alpha was fixed in petsc/petsc!7816, so the reference solutions needed to be fixed as well.

The startup logic for `ts_type: alpha` was fixed in petsc/petsc!7816, so
the reference solutions needed to be fixed as well.
Copy link
Member

@jeremylt jeremylt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much appreciated!

@jeremylt
Copy link
Member

jeremylt commented Oct 4, 2024

hmm, CUDA is unhappy on Noether

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

I have no idea why CUDA is failing, but everybody else is happy...

It's definitely related to ts_type alpha, because all the failed tests are with ts_type alpha. But those tests didn't need to have their values updated to pass on CPU (and ROCM evidently).

@jeremylt
Copy link
Member

jeremylt commented Oct 4, 2024

I'm doing a pull and rebuild just in case, but I don't think that will change things

Edit: No dice, same issue

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

I'm going back to the original alpha fix commit to see if it's maybe something unrelated in PETSc's CUDA implementation. If that's not the case, I'm really confused.

@jeremylt
Copy link
Member

jeremylt commented Oct 4, 2024

Yeah, this seems strange unless its exposing some bad assumption somewhere in one of the CUDA impl files 🤷

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

Yep, going back to 3f1b2ee8 works fine. Time for git bisect... 🙃

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

@knepley
Copy link
Collaborator

knepley commented Oct 4, 2024

@jrwrigh Is the commit wrong? I can't see an error.

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

I ran a git bisect on PETSc and it points to that commit (and MR) as the first bad commit.

As for what is wrong about the commit, I don't know; I'm not familiar with the code and how it interfaces with the fluids example or HONEE.

@jeremylt
Copy link
Member

jeremylt commented Oct 4, 2024

I sorta suspect that the commit in question exposes some oops elsewhere for CUDA vectors, as the logic there looks correct.

@jeremylt
Copy link
Member

jeremylt commented Oct 4, 2024

Jed had a theory of maybe a kernel synchronicity issue? We are only seeing an issue when writing out to binary, right?

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

We are only seeing an issue when writing out to binary, right?

Not just when writing/reading to binary. Every fluids test reads binary files, but only some of the tests that use ts_type: alpha fail.

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

The fact it's only alpha failing makes me think it's something to do with the alpha restart, which does several VecCopy commands.

@jeremylt
Copy link
Member

jeremylt commented Oct 4, 2024

And those are somehow async copies involving offloading to the host, which isn't great

@jeremylt
Copy link
Member

jeremylt commented Oct 4, 2024

Are we setting the DM vec type on every DM?

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

Yes, indirectly via DMClone.

i.e. we do it once for the primary DM, then DMClone copies that setting over.

@jeremylt
Copy link
Member

jeremylt commented Oct 4, 2024

And is the TSAlpha grabbing all work vectors from the DM?

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

It's using VecDuplicate.

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

Clarification: Not every test using ts_type: alpha is failing. But ones that are failing are all using advection diffusion equations and use ts_type: alpha. I'm not sure how anything done in advection-diffusion would be causing issues not caught in navier-stokes.

@jeremylt
Copy link
Member

jeremylt commented Oct 4, 2024

To make it worse - this does not appear on my local machine User error - I'm seeing it locally now too

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

Ok, so in gdb I did a backtrace for every call to CopyAsync. One possible explanation behind only advection-diffusion is that they use the MFFD jacobian for their implicit timestepping, which does feature calls to VecCopy.

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 4, 2024

gdb_copyasync.output.txt

Here are those backtraces, in case that's of interest.

@jeremylt
Copy link
Member

jeremylt commented Oct 4, 2024

It doesn't help, but that block is more complicated than it needs to be

$ git diff
diff --git a/src/vec/vec/impls/seq/cupm/vecseqcupm_impl.hpp b/src/vec/vec/impls/seq/cupm/vecseqcupm_impl.hpp
index 719ad054f24..682554196fe 100644
--- a/src/vec/vec/impls/seq/cupm/vecseqcupm_impl.hpp
+++ b/src/vec/vec/impls/seq/cupm/vecseqcupm_impl.hpp
@@ -1394,26 +1394,15 @@ inline PetscErrorCode VecSeq_CUPM<T>::CopyAsync(Vec xin, Vec yout, PetscDeviceCo
     // translate from PetscOffloadMask to cupmMemcpyKind
     PetscCall(PetscDeviceContextGetOptionalNullContext_Internal(&dctx));
     switch (const auto ymask = yout->offloadmask) {
-    case PETSC_OFFLOAD_UNALLOCATED: {
-      PetscBool yiscupm;
-
-      PetscCall(PetscObjectTypeCompareAny(PetscObjectCast(yout), &yiscupm, VECSEQCUPM(), VECMPICUPM(), ""));
-      if (yiscupm) {
-        mode = PetscOffloadDevice(xmask) ? cupmMemcpyDeviceToDevice : cupmMemcpyHostToHost;
-        break;
-      }
-    } // fall-through if unallocated and not cupm
-#if PETSC_CPP_VERSION >= 17
-      [[fallthrough]];
-#endif
+    case PETSC_OFFLOAD_UNALLOCATED:
     case PETSC_OFFLOAD_CPU: {
       PetscBool yiscupm;
 
       PetscCall(PetscObjectTypeCompareAny(PetscObjectCast(yout), &yiscupm, VECSEQCUPM(), VECMPICUPM(), ""));
       if (yiscupm) {
-        mode = PetscOffloadHost(xmask) ? cupmMemcpyHostToDevice : cupmMemcpyDeviceToDevice;
+        mode = PetscOffloadDevice(xmask) ? cupmMemcpyDeviceToDevice : cupmMemcpyHostToDevice;
       } else {
-        mode = PetscOffloadHost(xmask) ? cupmMemcpyHostToHost : cupmMemcpyDeviceToHost;
+        mode = PetscOffloadDevice(xmask) ? cupmMemcpyDeviceToHost : cupmMemcpyHostToHost;
       }
       break;
     }

@jeremylt
Copy link
Member

jeremylt commented Oct 4, 2024

With this diff (we should maybe apply here

$ git diff
diff --git a/examples/fluids/src/misc.c b/examples/fluids/src/misc.c
index 4510be1f..ef754c77 100644
--- a/examples/fluids/src/misc.c
+++ b/examples/fluids/src/misc.c
@@ -142,7 +142,7 @@ PetscErrorCode LoadFluidsBinaryVec(MPI_Comm comm, PetscViewer viewer, Vec Q, Pet
 PetscErrorCode RegressionTest(AppCtx app_ctx, Vec Q) {
   Vec         Qref;
   PetscViewer viewer;
-  PetscReal   error, Qrefnorm;
+  PetscReal   error, Q_norm, Q_ref_norm, Q_err_norm;
   MPI_Comm    comm = PetscObjectComm((PetscObject)Q);
 
   PetscFunctionBeginUser;
@@ -153,14 +153,17 @@ PetscErrorCode RegressionTest(AppCtx app_ctx, Vec Q) {
   PetscCall(LoadFluidsBinaryVec(comm, viewer, Qref, NULL, NULL));
 
   // Compute error with respect to reference solution
+  PetscCall(VecNorm(Q, NORM_MAX, &Q_norm));
+  PetscCall(VecNorm(Qref, NORM_MAX, &Q_ref_norm));
   PetscCall(VecAXPY(Q, -1.0, Qref));
-  PetscCall(VecNorm(Qref, NORM_MAX, &Qrefnorm));
-  PetscCall(VecScale(Q, 1. / Qrefnorm));
+  PetscCall(VecNorm(Qref, NORM_MAX, &Q_err_norm));
+  PetscCall(VecScale(Q, 1. / Q_err_norm));
   PetscCall(VecNorm(Q, NORM_MAX, &error));
 
   // Check error
   if (error > app_ctx->test_tol) {
-    PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Test failed with error norm %g\n", (double)error));
+    PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Test failed with error norm %g\nReference solution norm %g; Computed solution norm %g\n", (double)error,
+                          (double)Q_ref_norm, (double)Q_norm));
   }
 
   // Cleanup

I get this, which looks fishy

$ build/fluids-navierstokes -ceed /gpu/cuda/ref -test_type solver -options_file examples/fluids/advection.yaml -ts_max_steps 5 -wind_type translation -wind_translation -0.5547002,0.83205029,0 -advection_ic_type skew -dm_plex_box_faces 2,1,1 -degree 2 -stab supg -stab_tau advdiff_shakib -Ctau_a 4 -ksp_type gmres -diffusion_coeff 5e-4 -compare_final_state_atol 7e-10 -compare_final_state_filename examples/fluids/tests-output/fluids-navierstokes-adv-skew.bin
Test failed with error norm 0.660736
Reference solution norm 1.22904; Computed solution norm 1.

@jeremylt jeremylt force-pushed the jrwrigh/fix_fluids_alpha branch from 2c6bb1e to 8535f0a Compare October 7, 2024 17:34
@jeremylt
Copy link
Member

jeremylt commented Oct 7, 2024

The error appears to be strictly in the 5th component?

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 7, 2024

That's expected. The advection-diffusion problems are setup so the 5th component ("energy") is the transported scalar. The other 4 are always constant.

@jeremylt
Copy link
Member

jeremylt commented Oct 8, 2024

We should add an issue to undo our reversion of the offending commit when the bug is patched, but we can merge this now

@jrwrigh
Copy link
Collaborator Author

jrwrigh commented Oct 8, 2024

Not sure what the GitLab CI is saying the pipeline is still running, but all the jobs passed.

@jeremylt jeremylt merged commit 68ce662 into main Oct 8, 2024
22 of 23 checks passed
@jeremylt jeremylt deleted the jrwrigh/fix_fluids_alpha branch October 8, 2024 20:24
@jeremylt
Copy link
Member

jeremylt commented Oct 8, 2024

Bad script, no cookies for it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants