-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Poor false positive rate for memory used #32
Comments
Cuckoofilters are not bloom filters, so the parameters are different and the formulas for bloom filters won't work. For cuckoofilters, those parameters are the “I've been taking the first 64 bits of the SHA1 hash as the key.” - did you check how many false positives that introduces by itself, i.e. how many of thos hashes share their first 64bits? Might be beneficial to put the whole hashes into the filter. |
Thanks for the pointer to increasing fingerprint size - will give it a shot. I'm detecting false positives by generating an equal number of passwords not found in the dataset. The filter (currently) only takes a u64; I should be able to expand that to take the whole hash, though. |
I've tried setting It grew the exported filter size (gzipped) by a factor of 4, and increased the false positive rate to 99.99%. I suspect it's not as simple as changing that one constant. Any tips for where else I should be looking? |
It should be that easy. That sounds like you found a undiscovered bugs in the code. Due to having to compile a custom version, I suspect that code was seldomly tested with other values than a single byte. |
Eventually, we will be able to use const generics for the fingerprint size so that you don't have to edit the code anymore to change it, and therefore to get it some test coverage etc. |
I ran a check for collisions on the first 16 chars of the sha1 hash - none in the dataset I have. |
@DanielHeath I was just checking through all the old issues and this reminded me of one of the changes I made. This commit (deadf77) may be why changing the fingerprint size didn't have the desired impact. |
I'm using rust-cuckoofilter to check entries in the 'Have I Been Pwned' master list (551509767 SHA1 hashes).
I've been taking the first 64 bits of the SHA1 hash as the key.
According to this bloom filter calculator, a 1gb bloom filter should be able to give me a 0.1% false positive rate on the full dataset.
However, when I load this data into rust-cuckoofilter, I get >2% false positives with a 1gb cuckoo filter.
I'm happy to provide any assistance with diagnosing further.
The text was updated successfully, but these errors were encountered: