Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow np.float32 precision instead of default float64 or higher #230

Open
sdancette opened this issue Jun 11, 2024 · 2 comments
Open

allow np.float32 precision instead of default float64 or higher #230

sdancette opened this issue Jun 11, 2024 · 2 comments

Comments

@sdancette
Copy link

Is your feature request related to a problem? Please describe.
I need to perform (fast) quaternion calculation on (very) large datasets with float32 precision instead of float64.

Describe the solution you'd like
I need quat.as_quat_array( q_array ) to generate an array with float32 precision. And any quaternion algebra (multiplication, conjugate etc...) based on such array to also return the result with float32 precision.

Describe alternatives you've considered
... my own functions implementing basic quaternion algebra based on np.float32 arrays... but slower (in their pure numpy version) and with a less compact syntax such as qc = q_mult(qa, qb).

Additional context
In the 3D imaging community, volumes of 1024^3 voxels are common. In the sub-community dealing with the imaging of crystals, there's a quaternion behind each voxel. (Most commercial or academic softwares store crystallographic orientations as so-called 'Bunge' Euler angles, a flavor of the ZXZ convention, but I won't debate this here...).
So, an array of 1024^3 quaternion is already 16G in memory in float32 precision. Multiply it by 3 at least to deal with the inputs and output of the example q_mult() function above, then you get the picture of the memory problem with most computers.
Now, in such community, the angular precision of experimental systems is 0.01° at best (most of the time rather in the order of 1°). So float32 precision is usually more than enough.

@moble
Copy link
Owner

moble commented Jun 14, 2024

I'm on vacation at the moment, so I can't reply in great detail — just a quick note.

This package is written mostly in C, so I'd basically have to duplicate all the code with suitable replacements. It might be straightforward, but would probably take a while.

However, I have a suggestion that would probably work out much better for you for many reasons. I made another package called quaternionic, which should be easier to use with different-precision floats (though I haven't tried such a thing in a while). At best, it might only require constructing the QuaternionicArray type with a different dtype. At worst, I might have to expand the typelists in expressions like this one — which wouldn't be too hard.

Also, if you're that memory starved, it might be worth your while to store everything as the logarithm of the quaternion (3 numbers) rather than the full quaternion (4 numbers). Modern CPUs — at least — are generally memory-bandwidth limited, so the conversion to a full quaternion probably wouldn't be a bottleneck. And of course, working in place (so that you only need two input quaternions, and use one as the output when doing operations) would be a huge savings. This should be possible with either quaternion or quaternionic.

Finally, this quaternion package is based on numpy, which is very inefficient for numerous reasons; the quaternionic package is based on numba, which lets you get closer to optimal speed. Writing your own functions with numba would also be easier with quaternionic, meaning that you would probably save a lot of time and memory.

@sdancette
Copy link
Author

Thanks for the answer.
I also had a look at the quaternionic package. Surprisingly, it is a bit slower for elementary operations such as multiplication, at least on my computer.
On two 64*1024**2 quaternion arrays, multiplication:
With quaternion: 424 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With quaternionic: 684 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is more than x1000 faster with a cupy Kernell (except the first run due to jit compilation).
But in this case, 64*1024**2 size is close to the limit I can deal with at once on my portable GPU.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants