
Poor false positive rate for memory used #32

Open · DanielHeath opened this issue Mar 18, 2019 · 7 comments
@DanielHeath

I'm using rust-cuckoofilter to check entries in the 'Have I Been Pwned' master list (551,509,767 SHA-1 hashes).

I've been taking the first 64 bits of the SHA-1 hash as the key.

According to this Bloom filter calculator, a 1 GB Bloom filter should be able to give me a 0.1% false positive rate on the full dataset.

However, when I load this data into rust-cuckoofilter, I get >2% false positives with a 1 GB cuckoo filter.
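For concreteness, the load-and-measure loop looks roughly like this (simplified sketch; `known` and `absent` are stand-ins for the real hash sets, and the crate calls used are `with_capacity`, `add`, and `contains`):

```rust
use cuckoofilter::CuckooFilter;

fn measure_fpr(known: &[u64], absent: &[u64]) -> f64 {
    // Capacity sized to hold the full dataset.
    let mut filter = CuckooFilter::with_capacity(known.len());

    for key in known {
        // `add` can fail once the filter is too full to relocate entries.
        filter.add(key).expect("filter over capacity");
    }

    // Count keys that were never inserted but still test positive.
    let false_positives = absent.iter().filter(|k| filter.contains(*k)).count();
    false_positives as f64 / absent.len() as f64
}
```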

I'm happy to provide any assistance with diagnosing further.

@florianjacob
Contributor

Cuckoo filters are not Bloom filters, so the parameters are different and the formulas for Bloom filters won't work.

For cuckoo filters, those parameters are the filter's capacity, BUCKET_SIZE, and FINGERPRINT_SIZE, which all affect the utilization and the false positive rate. In rust-cuckoofilter, only the capacity is a run-time parameter; the other two are compile-time constants. I guess you used a filter with 1 Gi capacity? The 1-byte default fingerprint size is probably just too small for your target false positive rate. Due to how this Rust version is implemented, though, you need to go directly to 2 bytes, while other implementations support intermediate steps like 12 bits.
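For reference, the standard approximation for a cuckoo filter's false positive rate (from the cuckoo filter literature, not specific to this crate) is:

```
\varepsilon \approx 1 - \left(1 - 2^{-f}\right)^{2b} \approx \frac{2b}{2^f}
```

where `f` is the fingerprint size in bits and `b` is the number of entries per bucket. Assuming the common default of `b = 4`, the 1-byte default (`f = 8`) gives roughly 8/256 ≈ 3.1%, which is consistent with the >2% you observed; `f = 16` would give roughly 8/65536 ≈ 0.012%.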

> I've been taking the first 64 bits of the SHA-1 hash as the key.

Did you check how many false positives that introduces by itself, i.e. how many of those hashes share their first 64 bits? It might be beneficial to put the whole hashes into the filter.
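A quick sketch of how that check could look (hypothetical helper, reading the hashes as hex strings):

```rust
use std::collections::HashSet;

// Count how many hashes share their first 64 bits (16 hex chars)
// with an earlier hash in the list.
fn count_prefix_collisions(hex_hashes: &[String]) -> usize {
    let mut seen = HashSet::new();
    let mut collisions = 0;
    for h in hex_hashes {
        // First 16 hex characters = first 64 bits of the digest.
        let prefix = u64::from_str_radix(&h[..16], 16).expect("valid hex");
        if !seen.insert(prefix) {
            collisions += 1;
        }
    }
    collisions
}
```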

@DanielHeath
Author

Thanks for the pointer to increasing the fingerprint size - I'll give it a shot.

I'm detecting false positives by generating an equal number of passwords not found in the dataset.

The filter (currently) only takes a u64; I should be able to expand that to take the whole hash, though.
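Since `add` and `contains` are generic over `Hash`, I assume passing the raw 20-byte digest directly would look something like this (untested sketch; assumes the crate's error type is exported as `CuckooError`):

```rust
use cuckoofilter::{CuckooError, CuckooFilter};
use std::collections::hash_map::DefaultHasher;

// Untested sketch: feed the full 20-byte SHA-1 digest in as the key
// instead of its first-64-bits prefix.
fn insert_full_hash(
    filter: &mut CuckooFilter<DefaultHasher>,
    digest: &[u8; 20],
) -> Result<(), CuckooError> {
    filter.add(digest)
}
```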

@DanielHeath
Author

I've tried setting FINGERPRINT_SIZE to 2 (bytes).

It grew the exported filter size (gzipped) by a factor of 4, and increased the false positive rate to 99.99%.

I suspect it's not as simple as changing that one constant. Any tips for where else I should be looking?

@florianjacob
Contributor

florianjacob commented Mar 19, 2019

It should be that easy. That sounds like you found an undiscovered bug in the code. Since changing the value requires compiling a custom version, I suspect that code was seldom tested with values other than a single byte.

@florianjacob
Contributor

Eventually, we will be able to use const generics for the fingerprint size, so that you no longer have to edit the code to change it - which would also make it feasible to get that code path some test coverage.
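Purely as an illustration of the direction (not the crate's current API), the fingerprint width would become a type-level parameter:

```rust
// Illustrative only - not rust-cuckoofilter's actual types. With const
// generics, the fingerprint width is a compile-time parameter the caller
// chooses, instead of a constant edited in the source.
struct Fingerprint<const N: usize>([u8; N]);

struct CuckooFilter<const FINGERPRINT_SIZE: usize> {
    // Each bucket holds a fixed number of fingerprints (4 here).
    buckets: Vec<[Fingerprint<FINGERPRINT_SIZE>; 4]>,
}

// A caller would then pick the size at the type level, e.g.:
// let filter: CuckooFilter<2> = ...;
```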

@DanielHeath
Author

I ran a check for collisions on the first 16 hex chars of the SHA-1 hashes - there are none in the dataset I have.

@CosmicHorrorDev
Contributor

> I've tried setting FINGERPRINT_SIZE to 2 (bytes).
>
> It grew the exported filter size (gzipped) by a factor of 4, and increased the false positive rate to 99.99%.

@DanielHeath I was just checking through all the old issues and this reminded me of one of the changes I made. This commit (deadf77) may be why changing the fingerprint size didn't have the desired impact.
