
Poor false positive rate for memory used #32

Open · DanielHeath opened this issue Mar 18, 2019 · 7 comments
@DanielHeath

I'm using rust-cuckoofilter to check entries in the 'Have I Been Pwned' master list (551,509,767 SHA-1 hashes).

I've been taking the first 64 bits of the SHA-1 hash as the key.

According to this Bloom filter calculator, a 1 GB Bloom filter should be able to give me a 0.1% false positive rate on the full dataset.

However, when I load this data into rust-cuckoofilter, I get >2% false positives with a 1 GB cuckoo filter.
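For concreteness, the load-and-measure loop looks roughly like this (simplified sketch; `known` and `absent` are stand-ins for the real hash sets, and the crate calls used are `with_capacity`, `add`, and `contains`):

```rust
use cuckoofilter::CuckooFilter;

fn measure_fpr(known: &[u64], absent: &[u64]) -> f64 {
    // Capacity sized to hold the full dataset.
    let mut filter = CuckooFilter::with_capacity(known.len());

    for key in known {
        // `add` can fail once the filter is too full to relocate entries.
        filter.add(key).expect("filter over capacity");
    }

    // Count keys that were never inserted but still test positive.
    let false_positives = absent.iter().filter(|k| filter.contains(*k)).count();
    false_positives as f64 / absent.len() as f64
}
```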

I'm happy to provide any assistance with diagnosing further.

@florianjacob
Contributor

Cuckoo filters are not Bloom filters, so the parameters are different and the formulas for Bloom filters won't work.

For cuckoo filters, those parameters are the filter's capacity, BUCKET_SIZE, and FINGERPRINT_SIZE, which all affect the utilization and the false positive rate. In rust-cuckoofilter, only the capacity is a run-time parameter; the other two are compile-time constants. I guess you used a filter with 1 Gi capacity? The 1-byte default fingerprint size is probably just too small for your target false positive rate. Due to how this Rust version is implemented, though, you need to go directly to 2 bytes, while other implementations support intermediate steps like 12 bits.
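For reference, the standard approximation for a cuckoo filter's false positive rate (from the cuckoo filter literature, not specific to this crate) is:

```
\varepsilon \approx 1 - \left(1 - 2^{-f}\right)^{2b} \approx \frac{2b}{2^f}
```

where `f` is the fingerprint size in bits and `b` is the number of entries per bucket. Assuming the common default of `b = 4`, the 1-byte default (`f = 8`) gives roughly 8/256 ≈ 3.1%, which is consistent with the >2% you observed; `f = 16` would give roughly 8/65536 ≈ 0.012%.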

> I've been taking the first 64 bits of the SHA-1 hash as the key.

Did you check how many false positives that introduces by itself, i.e. how many of those hashes share their first 64 bits? It might be beneficial to put the whole hashes into the filter.
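A quick sketch of how that check could look (hypothetical helper, reading the hashes as hex strings):

```rust
use std::collections::HashSet;

// Count how many hashes share their first 64 bits (16 hex chars)
// with an earlier hash in the list.
fn count_prefix_collisions(hex_hashes: &[String]) -> usize {
    let mut seen = HashSet::new();
    let mut collisions = 0;
    for h in hex_hashes {
        // First 16 hex characters = first 64 bits of the digest.
        let prefix = u64::from_str_radix(&h[..16], 16).expect("valid hex");
        if !seen.insert(prefix) {
            collisions += 1;
        }
    }
    collisions
}
```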

@DanielHeath
Author

Thanks for the pointer to increasing the fingerprint size - I'll give it a shot.

I'm detecting false positives by generating an equal number of passwords not found in the dataset.

The filter (currently) only takes a u64; I should be able to expand that to take the whole hash, though.
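Since `add` and `contains` are generic over `Hash`, I assume passing the raw 20-byte digest directly would look something like this (untested sketch; assumes the crate's error type is exported as `CuckooError`):

```rust
use cuckoofilter::{CuckooError, CuckooFilter};
use std::collections::hash_map::DefaultHasher;

// Untested sketch: feed the full 20-byte SHA-1 digest in as the key
// instead of its first-64-bits prefix.
fn insert_full_hash(
    filter: &mut CuckooFilter<DefaultHasher>,
    digest: &[u8; 20],
) -> Result<(), CuckooError> {
    filter.add(digest)
}
```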

@DanielHeath
Author

I've tried setting FINGERPRINT_SIZE to 2 (bytes).

It grew the exported filter size (gzipped) by a factor of 4, and increased the false positive rate to 99.99%.

I suspect it's not as simple as changing that one constant. Any tips for where else I should be looking?

@florianjacob
Contributor

florianjacob commented Mar 19, 2019

It should be that easy. That sounds like you found an undiscovered bug in the code. Since changing the value requires compiling a custom version, I suspect that code was seldom tested with values other than a single byte.

@florianjacob
Contributor

Eventually, we will be able to use const generics for the fingerprint size, so that you no longer have to edit the code to change it - which would also make it feasible to get that code path some test coverage.
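Purely as an illustration of the direction (not the crate's current API), the fingerprint width would become a type-level parameter:

```rust
// Illustrative only - not rust-cuckoofilter's actual types. With const
// generics, the fingerprint width is a compile-time parameter the caller
// chooses, instead of a constant edited in the source.
struct Fingerprint<const N: usize>([u8; N]);

struct CuckooFilter<const FINGERPRINT_SIZE: usize> {
    // Each bucket holds a fixed number of fingerprints (4 here).
    buckets: Vec<[Fingerprint<FINGERPRINT_SIZE>; 4]>,
}

// A caller would then pick the size at the type level, e.g.:
// let filter: CuckooFilter<2> = ...;
```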

@DanielHeath
Author

I ran a check for collisions on the first 16 hex chars of the SHA-1 hashes - there are none in the dataset I have.

@CosmicHorrorDev
Contributor

> I've tried setting FINGERPRINT_SIZE to 2 (bytes).
>
> It grew the exported filter size (gzipped) by a factor of 4, and increased the false positive rate to 99.99%.

@DanielHeath I was just checking through all the old issues and this reminded me of one of the changes I made. This commit (deadf77) may be why changing the fingerprint size didn't have the desired impact.
