Bug: Non-linear, very long index restore durations (python 3.10.12, usearch==2.16.0) #514

Open
3 tasks done
kennon opened this issue Oct 30, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@kennon

kennon commented Oct 30, 2024

Describe the bug

With larger usearch index sizes, restore times become impractically long. Across the index sizes we tested, restore durations range from ~10 s for an 11 GB index of 6M embeddings up to 45 minutes (!) for a 32 GB index of 20M embeddings. This happens with memory mapping on or off (i.e. view=True or view=False). For the entire load time, one CPU core is pegged at 100%. Once loaded, the index appears to behave normally.

We are running this on an EC2 instance with 64 GB of RAM, so the entire index should fit very comfortably in memory even with memory mapping turned off (view=False). The index files are loaded from ephemeral SSDs attached to the EC2 instance, so disk read time should not be a major factor.

We are running this inside Docker (ECS). However, we have not experienced similar file load issues with other software: we use a variety of Python and non-Python libraries that regularly load large files (>= 100 GB) from this same storage. So it seems unlikely to be something at the OS/Docker level 🤷 (the ECS task has access to the full amount of memory).

Steps to reproduce

import usearch.index
index = usearch.index.Index.restore(path_to_index, view=True) # also happens with view=False

The index was built with usearch using all defaults, then saved to disk via index.save(index_path). Once loaded, index functions normally.
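To tell a CPU-bound load apart from one that is waiting on disk, it may help to record wall-clock and CPU time around the restore call. The sketch below uses only the standard library; `time_call` is a hypothetical helper name, not part of usearch, and the `sum` call is just a stand-in workload so the snippet runs without an index file.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    result = fn(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


if __name__ == "__main__":
    # Stand-in workload; replace with the restore call from the steps above.
    _, wall, cpu = time_call(sum, range(1_000_000))
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

With usearch installed, the same helper could wrap the call from the reproduction steps, e.g. `time_call(usearch.index.Index.restore, path_to_index, view=True)`. If CPU time is close to wall time, the load is compute-bound rather than I/O-bound, which would match the single pegged core described above.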

Expected behavior

We would expect a somewhat linear-ish relationship between index size / embedding count and load times.
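As a rough illustration of how far the reported numbers are from linear, the arithmetic can be sketched as below. The sizes and durations are the approximate figures quoted in this report (11 GB in ~10 s, 32 GB in ~45 minutes); the function name is made up for the example.

```python
def linear_estimate_seconds(base_size_gb, base_seconds, target_size_gb):
    """Load time for target_size_gb if time scaled linearly with file size."""
    return base_seconds * target_size_gb / base_size_gb


small_gb, small_s = 11, 10        # ~10 s for the 11 GB index
large_gb, large_s = 32, 45 * 60   # ~45 minutes for the 32 GB index

expected_s = linear_estimate_seconds(small_gb, small_s, large_gb)
size_ratio = large_gb / small_gb
time_ratio = large_s / small_s

print(f"size grew {size_ratio:.1f}x, load time grew {time_ratio:.0f}x")
print(f"linear estimate: {expected_s:.0f} s, observed: {large_s} s")
print(f"observed is {large_s / expected_s:.0f}x slower than linear")
```

In other words, a roughly 3x larger file took about 270x longer to load, around two orders of magnitude worse than a linear extrapolation from the smaller index.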

Thank you for such an awesome project, we have fallen in love with usearch and hope we can figure this one out, which is currently blocking us from using it!

USearch version

v2.16.0

Operating System

Ubuntu 22.04 (dockerized ECS)

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

kballou@eezy.com

Are you open to being tagged as a contributor?

  • I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@kennon kennon added the bug Something isn't working label Oct 30, 2024
@kennon
Author

kennon commented Oct 30, 2024

An example index info for what we're building:

usearch.Index(ScalarKind.BF16 x 768, MetricKind.IP, multi: False, connectivity: 16, expansion: 128 & 64, 6,738,822 vectors in 5 levels, haswell hardware acceleration)

This one took ~10 s to load; another one with ~20M vectors took 45 minutes.
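As a sanity check on the file size, the raw vector payload for the index above can be estimated from the info string, assuming 2 bytes per BF16 scalar. This is only a back-of-the-envelope sketch: it counts vector data alone and makes no assumption about how usearch lays out the graph or metadata on disk.

```python
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit format


def vector_payload_bytes(count, dimensions, bytes_per_scalar):
    """Bytes needed for the raw vectors only (no graph links or metadata)."""
    return count * dimensions * bytes_per_scalar


# Figures taken from the index info string above.
payload = vector_payload_bytes(6_738_822, 768, BYTES_PER_BF16)
print(f"{payload:,} bytes = {payload / 1e9:.2f} GB")
```

That comes to roughly 10.35 GB of vector data, which is in line with the ~11 GB file mentioned earlier once some allowance is made for the graph structure.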

@ashvardanian
Contributor

@kennon, interesting, looking into it!

@kennon
Author

kennon commented Oct 31, 2024

@ashvardanian awesome, thanks! I don’t want to post a public url but if you drop me an email I can send you a link to the index files we are trying to load. Let me know if there is any more information I can provide, thanks!


2 participants