
Joey xrays #20

Closed
wants to merge 36 commits into from

Conversation

@Joeybraspenning

Integrating the x-ray interpolation code in SOAP. (This is branched off the SOAP_additions_joey_roi branch, as that has some fixes that are needed to pass the tests.)

jchelly and others added 30 commits January 20, 2023 11:05
This should detect the bug that's causing all xray bands to
contain the same value in the output.
Fix bug that causes all X-ray bands to have the same value
This is necessary to make the documentation script work, because
we need to be able to extract the property list for each class
without creating an instance.
If a calculation causes an overflow, a division by zero, or an
invalid operation, the code will report the halo ID and
abort. Then we can reproduce the issue with the --halo-id
flag to debug it.
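
For reference, a minimal sketch of how this kind of trapping can be done with numpy, assuming the calculations go through numpy arrays; compute_properties and the exact error wording are hypothetical, not SOAP's actual code:

import numpy as np

def compute_properties(data):
    # Hypothetical stand-in for a halo property calculation that can
    # overflow or divide by zero for pathological inputs.
    return np.exp(data) / np.sum(data)

def process_halo(halo_id, data):
    # Promote overflow, divide-by-zero and invalid operations from
    # silent warnings to exceptions, so the offending halo ID can be
    # reported before aborting.
    with np.errstate(over="raise", divide="raise", invalid="raise"):
        try:
            return compute_properties(data)
        except FloatingPointError as err:
            raise RuntimeError(f"Floating point error in halo {halo_id}: {err}")

# The failure can then be reproduced for a single halo, e.g.
# process_halo(42, np.array([1000.0, 0.5])) raises a RuntimeError naming halo 42.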
@jchelly (Owner)

jchelly commented Mar 29, 2023

If we get the SOAP_additions_joey_roi branch merged into master first, this pull request will become much smaller and easier to review. Maybe that's the way to go?



# Add each metal contribution individually
f_n_T_Z_temp = np.power(10, f_n_T[-1])
Contributor

In a past life I would have suggested writing this as exp(f_n_T[-1] * ln(10)), as it is faster than pow(), but the JIT compilation might be helping already.
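
For reference, a minimal sketch of the suggested rewrite outside of any JIT context; the names are illustrative:

import numpy as np

LN10 = np.log(10.0)

def pow10_direct(x):
    return np.power(10.0, x)

def pow10_via_exp(x):
    # exp() with a precomputed ln(10) factor; historically faster than a
    # general-purpose pow(), although numba's JIT may close the gap.
    return np.exp(x * LN10)

x = np.linspace(-3.0, 3.0, 7)
assert np.allclose(pow10_direct(x), pow10_via_exp(x))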

for j in range(len(f_n_T) - 1):
    f_n_T_Z_temp += np.power(10, f_n_T[j]) * abundance_to_solar[i, j]

f_n_T_Z[i] = np.log10(f_n_T_Z_temp)
Contributor

Any reason you log10 this? You seem to pow(10,xxx) that result in the next function.

Author

Historical reasons, mostly to do with keeping within the bounds of a float, and at one point we put the log() in the snapshot. But indeed, it wouldn't be needed anymore.
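
A small sketch of the round-trip being discussed; the function names are illustrative, not SOAP's:

import numpy as np

def combine_metals_logged(f_n_T_Z_temp):
    # Current behaviour: return log10, which the caller then undoes
    # with np.power(10, ...).
    return np.log10(f_n_T_Z_temp)

def combine_metals_linear(f_n_T_Z_temp):
    # Suggested simplification: pass the linear value straight through,
    # provided it stays within float range.
    return f_n_T_Z_temp

x = np.array([1.0e-30, 1.0e-10, 1.0])
assert np.allclose(np.power(10, combine_metals_logged(x)), combine_metals_linear(x))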


@staticmethod
@jit(nopython=True)
def get_table_interp(dn, dT, dx_T, dx_n, idx_T, idx_n, idx_he, dx_he, idx_z, dx_z, X_Ray, abundance_to_solar):
Contributor

Add comment that this is returning a logged value (unless you change that)

volumes = (masses / ((1 / scale_factor**3) * densities)).to(cm**3)

if bands is None:
Contributor

Do we need these tests done for every single call?

Author

If you want to make sure it never goes wrong, then yes, I would say so. If you trust our ability to do things correctly, you could get rid of them, I guess.
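
One common compromise is to validate once up front and keep the hot path check-free; a hypothetical sketch, not SOAP's actual class:

class XrayCalculator:
    def __init__(self, bands, observing_types):
        # Validate once, at construction time.
        if bands is None:
            raise ValueError("Please specify the X-ray bands to compute")
        if observing_types is None:
            raise ValueError("Please specify the observing types")
        if len(bands) != len(observing_types):
            raise ValueError("bands and observing_types must match in length")
        self.bands = bands
        self.observing_types = observing_types

    def interpolate(self, densities, temperatures):
        # Hot path: the arguments were validated up front, so no
        # per-call checks are needed here.
        ...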


observing_types = ['photons', 'photons', 'photons']
particle_data[ptype]["XrayPhotonLuminosities_new"] = interpolate_X_Ray(
    table_name,
    particle_data[ptype]['Densities'],
    particle_data[ptype]['Temperatures'],
    particle_data[ptype]['SmoothedElementMassFractions'],
    particle_data[ptype]['Masses'],
    bands=xray_bands,
    observing_types=observing_types,
    fill_value=0,
)

Contributor

Isn't this going to be run once per over-density rather than once per halo?

Owner

No, the target_density here is the minimum density required by any of the calculations. At this point the code has found the particles in a sphere large enough to do all of the overdensities and apertures. The loop over the different calculations to do is just after this.

Contributor

Ok. Thanks for confirming it's fine.

@jchelly (Owner)

jchelly commented Mar 29, 2023

Do we need to write our own interpolation code here instead of using something like scipy.interpolate.RegularGridInterpolator?
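
For reference, a sketch of what the scipy route could look like; the 2D table here is illustrative, while the real tables interpolate over more axes (temperature, density, helium abundance, redshift, per the get_table_interp signature above):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

log_n = np.linspace(-8.0, 2.0, 51)   # log10 density grid (made up)
log_T = np.linspace(3.0, 9.0, 61)    # log10 temperature grid (made up)
table = np.random.rand(51, 61)       # placeholder emissivity table

interp = RegularGridInterpolator(
    (log_n, log_T), table, method="linear", bounds_error=False, fill_value=0.0
)

# Evaluate a batch of particle (density, temperature) pairs in one call.
points = np.column_stack([
    np.array([-3.0, -2.5, -4.0]),    # log10 densities
    np.array([5.0, 6.0, 7.0]),       # log10 temperatures
])
values = interp(points)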

@Joeybraspenning (Author)

This is identical (essentially line-by-line) to what we do in SWIFT, hence this code. I agree you could probably get away with a library function, but from what I remember of some testing long ago, that's not faster.

for the smallest index i such
that bins[i] < x
'''
for j in range(len(bins)):
Owner

It might be possible to speed this up a bit using bisect (https://docs.python.org/3/library/bisect.html) so it doesn't loop over all of the elements.
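
A sketch of the linear scan next to the binary-search equivalent; names are illustrative:

import numpy as np
from bisect import bisect_right

bins = np.linspace(0.0, 10.0, 11)

def find_index_linear(bins, x):
    # Current approach: O(n) scan for the first bin edge above x.
    for j in range(len(bins)):
        if bins[j] > x:
            return j - 1
    return len(bins) - 1

def find_index_bisect(bins, x):
    # bisect_right returns the first index j with bins[j] > x, in O(log n).
    # Inside @jit(nopython=True) code, np.searchsorted offers the same
    # binary search, since the bisect module itself isn't numba-compiled.
    return bisect_right(bins, x) - 1

for x in (0.2, 3.0, 3.7, 9.9):
    assert find_index_linear(bins, x) == find_index_bisect(bins, x)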

Author

Okay, I didn't know this existed; I'll have a look to see how to use it.

@Joeybraspenning (Author)

What about the general structure of putting it into halo_properties.py? Do you think this will work? I'd like to get it into SOAP in a form that at least runs before we try to optimise the x-ray interpolation code. Then we also get an idea of how fast things currently are, as a benchmark.

@MatthieuSchaller (Contributor)

I think that's a sensible choice, yes.

@Joeybraspenning (Author)

Do we have something small I could try it on (maybe something that just runs in a few minutes even)? And what would be the commands to actually run it on cosma?

@MatthieuSchaller (Contributor)

Try the fiducial 100Mpc.


@jchelly (Owner)

jchelly commented Mar 30, 2023

These might help: SWIFTSIM#15. See SOAP/scripts/FLAMINGO/small_test/run_L1000N1800_HYDRO.sh.

I should probably add scripts for the 100Mpc box too.

@Joeybraspenning (Author)

Ah! That seems very useful, but it doesn't seem to work at the moment:

*** EXCEPTION ***
Task has zero particles?! on rank 0

Traceback (most recent call last):
  File "/cosma8/data/dp004/dc-bras1/SOAP/SOAP/./compute_halo_properties.py", line 480, in <module>
    compute_halo_properties()
  File "/cosma8/data/dp004/dc-bras1/SOAP/SOAP/./compute_halo_properties.py", line 417, in compute_halo_properties
    metadata = task_queue.execute_tasks(
  File "/cosma8/data/dp004/dc-bras1/SOAP/SOAP/task_queue.py", line 195, in execute_tasks
    result.append(task(*args))
  File "/cosma8/data/dp004/dc-bras1/SOAP/SOAP/chunk_tasks.py", line 223, in __call__
    raise Exception("Task has zero particles?!")
Exception: Task has zero particles?!

@jchelly (Owner)

jchelly commented Mar 30, 2023

For the "task has zero particles" error, is it possible that the number of chunks is >1 for some reason? It should say in the stdout messages.

It works for me if I just do

cd SOAP
./scripts/FLAMINGO/small_test/run_L1000N1800_HYDRO.sh

@jchelly (Owner)

jchelly commented Mar 30, 2023

The script name gets set as the job name if you forget the -J flag, and then it will be used to generate the VR path. I guess it didn't pick up your -J for some reason?

The batch script works for me if I cd to the top SOAP directory and do this:

sbatch -J HYDRO_FIDUCIAL --array=6 ./scripts/FLAMINGO/L0100N0180/group_membership_L0100N0180.sh

@jchelly (Owner)

jchelly commented Mar 30, 2023

We really need to simplify how SOAP is run. Moving a lot of the command line arguments into the config file from the colibre branch might help. And it should also be packaged up so it doesn't rely on the current working directory to find its own python files (SWIFTSIM#9).

@Joeybraspenning (Author)

For the chunks, no, there is only 1 chunk. This is the output just before the error; it seems it has found 0 cells to read:

Chunk 0 has 28 halos
Reading 28 VR halos and setting up 1 chunk(s) took 42.6s
[ 42.9s] 0: [0/1] identified 0 cells to read in 0.00s

*** EXCEPTION ***
Task has zero particles?! on rank 0

@Joeybraspenning (Author)

Trying to run SOAP now, I get a bunch of out-of-memory errors on the cosma-shm node, which is suspicious for a 100Mpc box.

numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OUT_OF_MEMORY
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OUT_OF_MEMORY
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OUT_OF_MEMORY
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OUT_OF_MEMORY

The output log is here: /cosma8/data/dp004/dc-bras1/SOAP/SOAP/logs/halo_properties_L0100N0180_HYDRO_FIDUCIAL.6.out

It seems to crash somewhere while loading swiftsimio visualisation things, but I'm not sure why SOAP needs those, and also not sure why initialising them would cause an out-of-memory error...

@jchelly (Owner)

jchelly commented Mar 31, 2023

In xray_calculator.py you have a line

from swiftsimio import load

That must be importing swiftsimio and pulling in the visualisation code. There aren't any calls to load() in the file, so maybe you can just remove the import?

@jchelly (Owner)

jchelly commented Mar 31, 2023

Regarding the zero particles issue - did you modify the script at all? If so, could you post the path to it so I can take a look?

It reads in the halos to process and then tries to identify the cells in the snapshot it needs to read to have all of the particles in the SO spheres and apertures. In this case it found no cells to read. I don't see how that can happen unless all of the halos are outside the box - e.g. if it read in a VR catalogue that was made from a different snapshot than the one it's reading.

@Joeybraspenning (Author)

I'll get back to the interactive script. For now, just trying the normal script for the 100Mpc box, it also seems to not find any particles... I have not modified these scripts except for the output directory.

There are a few thousand lines like this:
Chunk 0 has 13612 halos
Reading 13612 VR halos and setting up 1 chunk(s) took 0.8s
[ 8.7s] 0: [0/1] identified 0 cells to read in 0.01s

*** EXCEPTION ***
Task has zero particles?! on rank 127

*** EXCEPTION ***
Task has zero particles?! on rank 0

@Joeybraspenning (Author)

The group membership works just fine and finishes within 10 seconds or so, also generating an output file.

@jchelly (Owner)

jchelly commented Mar 31, 2023

Your batch script looks ok, assuming it's the one in /cosma8/data/dp004/dc-bras1/SOAP/SOAP/halo_properties_L0100N0180.sh. I don't think I tried running on the 100Mpc box with 128 cores though. Maybe the domain decomposition fails if you have lots of cores and a tiny simulation.

@jchelly (Owner)

jchelly commented Mar 31, 2023

If I copy your SOAP directory and submit your batch script it works for me!

Using Peano domain decomposition with bits=10
Chunk 0 has 13612 halos
Reading 13612 VR halos and setting up 1 chunk(s) took 0.5s
[    10.2s] 0: [0/1] identified 1728 cells to read in 0.09s
[    12.1s] 0: [0/1] read in 11783547 particles in 1.8s = 659.6MB/s (uncompressed)
[    12.3s] 0: [0/1] constructing shared mesh took 0.2s
[    12.3s] 0: [0/1] node has 3957.5GB of 4031.1GB memory free
...

Here I just copied all of /cosma8/data/dp004/dc-bras1/SOAP/SOAP, changed the output directories and submitted the scripts group_membership_L0100N0180.sh and halo_properties_L0100N0180.sh.

@jchelly (Owner)

jchelly commented Mar 31, 2023

Maybe it's something in your environment. What do you have in ~/.local/lib/python3.10/site-packages/?

@Joeybraspenning (Author)

Okay, good to know it's probably something strange on my side. This is what I have in my packages:

_multiprocess cupy_cuda111-11.6.0.dist-info more_itertools pathos-0.2.9.dist-info style unyt
cmyt cupyx more_itertools-8.14.0.dist-info pox style-1.1.0.dist-info unyt-2.9.3.dist-info
cmyt-1.1.2.dist-info dill multiprocess pox-0.3.1.dist-info swiftsimio update
colorspacious dill-0.3.5.1.dist-info multiprocess-0.70.13.dist-info pp swiftsimio-6.1.1.dist-info update-0.0.1.dist-info
colorspacious-1.1.2.dist-info easy-install.pth numba ppft tests yt
corner fastrlock numba-0.56.4.dist-info ppft-1.7.6.5.dist-info tomli yt-4.1.0.dist-info
corner-2.2.1.dist-info fastrlock-0.8.1.dist-info p_tqdm scipy tomli-2.0.1.dist-info yt.libs
cupy llvmlite p_tqdm-1.4.0.dist-info scipy-1.8.0.dist-info tomli_w
cupy_backends llvmlite-0.39.1.dist-info pathos scipy.libs tomli_w-1.0.0.dist-info

@Joeybraspenning (Author)

It's strange that a different package version could lead to such behaviour, though! I updated numba yesterday; the other packages should be a few months old at most.

@jchelly (Owner)

jchelly commented Mar 31, 2023

That looks ok. I thought there might be a conflicting h5py or mpi4py build in there.

@Joeybraspenning (Author)

But if that is okay, why does it not work? Could it be some permission issue, i.e. that I'm not allowed to read some of the FLAMINGO folders?
