Add non-tensor shared #1721

Merged: jeremylt merged 5 commits into main from jeremy/shared-nontensor on Jan 9, 2025

Conversation

jeremylt (Member) commented Jan 3, 2025

This is a prerequisite for the long-awaited */gen non-tensor support.

@jeremylt jeremylt self-assigned this Jan 3, 2025
@jeremylt jeremylt force-pushed the jeremy/shared-nontensor branch 3 times, most recently from 5d36201 to 3e42d60 on January 3, 2025 23:48
@jeremylt jeremylt force-pushed the jeremy/shared-nontensor branch 4 times, most recently from 324c97e to c21b7ff on January 6, 2025 22:06
jeremylt (Member, Author) commented Jan 6, 2025

Edit: This is solved; it was a bad grid size for the weights kernel.

Investigating error on PETSc BP3:

Test: petsc-bps BP3, tet elements
  $ build/petsc-bps -ceed /gpu/cuda/shared -test -problem bp3 -degree 3 -ksp_max_it_clip 50,50 -simplex
ERROR: returncode = 15
Output: 
NO MESSAGE
FAIL: stderr
Output: 
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
Abort(59) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
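
For context, here is a minimal C sketch of how a weights-kernel launch can hit a host-side integer divide by zero like the one above. The variable names and sizing heuristic are hypothetical and are not taken from the libCEED source; this only illustrates the class of bug described in the edit note.

  #include <stdio.h>

  /* Hypothetical sketch: sizing the launch grid for a quadrature-weights kernel.
   * If elems_per_block is derived by integer division and the number of
   * quadrature points exceeds the thread-block size, it silently becomes 0 and
   * the ceiling division below raises SIGFPE (integer divide by zero). */
  static int ceil_div(int num, int den) { return (num + den - 1) / den; }

  int main(void) {
    const int num_elem = 1000, num_qpts = 512, block_size = 256;

    int elems_per_block = block_size / num_qpts;  /* 0 when num_qpts > block_size */
    if (elems_per_block < 1) elems_per_block = 1; /* guard that avoids the FPE */

    int grid_size = ceil_div(num_elem, elems_per_block);
    printf("grid size = %d, elements per block = %d\n", grid_size, elems_per_block);
    return 0;
  }

Without the guard, ceil_div is called with a zero denominator and the process dies with signal 8 (FPE), matching the PETSc trace above.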

jeremylt (Member, Author) commented Jan 7, 2025

Edit: This happens on my machine on main too, so it's not related to this PR.

Also, t354 is failing on my machine but not in CI, which is odd. I thought my CUDA update was causing trouble, but the failure persists after restarting.

@jeremylt jeremylt force-pushed the jeremy/shared-nontensor branch from c21b7ff to 7c2945f on January 7, 2025 20:14
@jeremylt jeremylt force-pushed the jeremy/shared-nontensor branch from 7c2945f to a3fb7fa on January 7, 2025 21:16
jeremylt (Member, Author) commented Jan 7, 2025

OK, I still need to actually test the HIP code, but everything is working on the CUDA side of the house.

Edit: Confirmed the new code compiles on Noether, but I'm still bringing my local dev machine back up to date to test.

The following passes locally for Ratel with this branch:

  $ make prove -j CEED_BACKENDS=/gpu/cuda/shared

@jeremylt jeremylt force-pushed the jeremy/shared-nontensor branch from a3fb7fa to 1f6c24f on January 8, 2025 18:37
@jeremylt jeremylt added 1-In Review and removed 0-WIP labels Jan 8, 2025
@jeremylt jeremylt merged commit 1a63be7 into main Jan 9, 2025
28 checks passed
@jeremylt jeremylt deleted the jeremy/shared-nontensor branch January 9, 2025 22:51