Increase size if needed #5

Open
seiflotfy opened this issue Jul 17, 2015 · 4 comments

@seiflotfy
Member

I discussed this with Bin Fan (original author of the paper), and he suggested the following.

To expand a cuckoo hash table in a relatively cheap way, with the assumption that doubling the table size each time is fine:

(1) Copy the old hashtable twice into a new hashtable of 2x size, so that the first half and the second half of the new, bigger hashtable are each identical to the old one. Now each item appears twice in the new hashtable---once in the top half and once in the bottom half.

(2) For each item in the old table, check the legitimate location of this item in the new table: if it should stay in the first half, delete the copy in the second half and vice versa.

After all items in the old table have been checked, the new table contains all the items in 2x buckets, without performing the expensive cuckoo insertion process.
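A minimal sketch of that copy-and-prune resize (not this repo's actual code), assuming a simplified table with one fingerprint per bucket, 0 as the empty marker, and a caller-supplied `belongsInTopHalf` decision standing in for "check the legitimate location of this item in the new table":

```go
package cuckoo

// fingerprint is a stored tag; 0 marks an empty slot. This sketch uses a
// simplified one-entry-per-bucket table, not this repo's actual layout.
type fingerprint uint8

// grow doubles the table with the copy-and-prune idea above: copy the old
// table into both halves of a 2x table, then for every item keep only the
// copy that sits in its legitimate bucket.
//
// belongsInTopHalf stands in for "check the legitimate location of this
// item in the new table"; how to compute that from only the stored
// fingerprint is what the rest of this thread discusses.
func grow(old []fingerprint, belongsInTopHalf func(oldIdx int, fp fingerprint) bool) []fingerprint {
	n := len(old)
	grown := make([]fingerprint, 2*n)
	copy(grown[:n], old) // top half == old table
	copy(grown[n:], old) // bottom half == old table

	for i, fp := range old {
		if fp == 0 {
			continue // empty slot, nothing to prune
		}
		if belongsInTopHalf(i, fp) {
			grown[i+n] = 0 // keep the top copy, drop the bottom one
		} else {
			grown[i] = 0 // keep the bottom copy, drop the top one
		}
	}
	return grown
}
```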

@seiflotfy
Member Author

Based on #6 we could also go with the following implementation suggested by Bin Fan:

Assume each entry in the cuckoo hash table stores a single bit in addition to the fingerprint. When this bit == 0, it indicates the bucket index of this entry is i1; otherwise the index is i2. Everything works like what we proposed in the cuckoo hash paper---except we spend one more bit and update this bit correspondingly when we perform a cuckoo operation (i.e., when kicking a fingerprint from its i1 bucket to its i2 bucket, set the bit to 1, and vice versa).

Now, to resize the table, we simply create a new table with more buckets. For example, originally the table has N buckets; now it has M (> N) buckets. For each fingerprint in the original table, if it is in its i1 bucket (i.e., the indicator bit is 0), then we still put this fingerprint in the i1 bucket in the new table. However, if it is in its i2 bucket in the original table (i.e., the indicator bit == 1), then we first calculate its i1 bucket index by i1 = (i2 ^ hash(fingerprint)) % N as in the paper, then calculate the new i2 bucket index in the new table by (i1 ^ hash(fingerprint)) % M, and move this fingerprint to that i2 bucket in the new table.

Doubling the table size is purely a performance optimization, so you may want to still serve queries while you are performing the resize---but it is not essential.
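A rough sketch of that resize, assuming one entry per bucket, N and M both powers of two (so the `%` plays nicely with the XOR, see below), 0 as the empty fingerprint, and a stand-in `hashFP` for hash(fingerprint):

```go
package cuckoo

import "hash/fnv"

// entry pairs a fingerprint with the indicator bit described above:
// inI2 == false means the entry currently sits in its i1 bucket,
// inI2 == true means it sits in its i2 bucket. A zero fingerprint
// marks an empty slot in this sketch.
type entry struct {
	fp   uint16
	inI2 bool
}

// hashFP is a stand-in for the hash(fingerprint) used to derive the
// alternate bucket index.
func hashFP(fp uint16) uint32 {
	h := fnv.New32a()
	h.Write([]byte{byte(fp), byte(fp >> 8)})
	return h.Sum32()
}

// resize moves every entry from a table of n buckets into a table of m
// buckets (n and m both powers of two here, so % is effectively a mask):
//   - an entry flagged as i1 keeps its bucket index in the new table,
//   - an entry flagged as i2 first recovers i1 = (i2 ^ hash(fp)) % n,
//     then lands at the new i2 = (i1 ^ hash(fp)) % m.
// One entry per bucket is assumed and collisions are ignored.
func resize(old []entry, m uint32) []entry {
	n := uint32(len(old))
	grown := make([]entry, m)
	for idx, e := range old {
		if e.fp == 0 {
			continue // empty slot
		}
		if !e.inI2 {
			grown[idx] = e // indicator bit 0: i1 does not move
			continue
		}
		i1 := (uint32(idx) ^ hashFP(e.fp)) % n // recover i1 from the old i2
		i2 := (i1 ^ hashFP(e.fp)) % m          // new i2 in the bigger table
		grown[i2] = e
	}
	return grown
}
```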

@burdges

burdges commented Aug 30, 2016

Interesting, I see why you want this annotation of i1 or i2 now.

Another option might be to dynamically increase the bucket width b? You'd still need a copy though, since otherwise you double your cache misses. I think Bin Fan's suggestion is better though, because if your filter were optimal, then f might drop below ceiling(log2(2b) - log2(epsilon)). In other words, your false positive rate increases as you fill up the larger filter. You've taken f = 8, which afaik exceeds optimal by more than 1 bit. I think that same extra bit of fingerprint you wanted to spend on i1 vs i2 will save you here, assuming you're only doubling b once.

I also think Bin Fan's suggestion is better because, after the new table has filled up for a while, you can double it a second time. It's probably not immediately perfect, because you'd expect a bias towards i2 entries living in the second half of the new table and a bias against them in the first half, but nothing catastrophic.
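For reference, a tiny helper for that bound (the fingerprint-size lower bound from the cuckoo filter paper), just to show how doubling b eats into f:

```go
package cuckoo

import "math"

// minFingerprintBits is the lower bound on the fingerprint size from the
// cuckoo filter paper: f >= ceil(log2(2b) - log2(epsilon)), where b is the
// bucket width and epsilon the target false positive rate.
func minFingerprintBits(b int, epsilon float64) int {
	return int(math.Ceil(math.Log2(2*float64(b)) - math.Log2(epsilon)))
}

// For epsilon = 1/32 (~3.1%) the bound is 8 bits at b = 4 but 9 bits at
// b = 8: doubling b without widening the 8-bit fingerprints is paid for
// in false positive rate instead.
```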

@burdges

burdges commented Aug 30, 2016

Appears Bin Fan's tweak is not necessarily optimal as described above.

We have i1 = hash(x) % N initially. As i1 values do not change, this formula remains true after switching to M, meaning i1 runs only across the first N elements, so pressure remains on this initial segment of the table, probably hurting performance and creating an artificially full effect. Also, you have i2 = (i1 ^ hash(fingerprint)) % M, so i1 = (i2 ^ hash(fingerprint)) % M requires that M be a power of 2.

A better alternative might be moving all elements to i1' = hash(fingerprint ++ sort(i1, i2)) % M or i2' = (i1' ^ hash(fingerprint)) % M, choosing randomly, which avoids #6. You then insert all new values into this new table at i1' or i2', after first computing i1 = hash(x) and i2 = i1 ^ hash(fingerprint). It's one hash operation more expensive than Bin Fan's suggestion, but it does not need the i1 vs i2 bit, and it does not unbalance the table.
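A sketch of that symmetric variant, assuming a power-of-two M and a stand-in 32-bit hash; sort(i1, i2) is taken as the (min, max) pair so the result is the same whichever of its two buckets the fingerprint currently occupies:

```go
package cuckoo

import (
	"encoding/binary"
	"hash/fnv"
)

// hash32 is a stand-in for whatever 32-bit hash the filter uses.
func hash32(data []byte) uint32 {
	h := fnv.New32a()
	h.Write(data)
	return h.Sum32()
}

// newIndices computes the pair suggested above:
//   i1' = hash(fingerprint ++ sort(i1, i2)) % M
//   i2' = (i1' ^ hash(fingerprint)) % M
// Hashing the sorted (i1, i2) pair means an element hashes to the same
// i1'/i2' whether it currently lives in its i1 or its i2 bucket, so no
// indicator bit is needed. M should be a power of two so the XOR stays
// invertible under the modulo, as noted above.
func newIndices(fp uint16, i1, i2, m uint32) (uint32, uint32) {
	lo, hi := i1, i2
	if hi < lo {
		lo, hi = hi, lo // sort(i1, i2)
	}
	var buf [10]byte
	binary.LittleEndian.PutUint16(buf[0:2], fp)
	binary.LittleEndian.PutUint32(buf[2:6], lo)
	binary.LittleEndian.PutUint32(buf[6:10], hi)

	i1New := hash32(buf[:]) % m             // hash(fingerprint ++ sort(i1, i2)) % M
	i2New := (i1New ^ hash32(buf[0:2])) % m // i1' ^ hash(fingerprint), reduced mod M
	return i1New, i2New
}
```

During migration each stored fingerprint would be moved to i1' or i2' (picked at random, falling back to the usual cuckoo kicking on collision), and later inserts compute i1 and i2 first and then go through the same function.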

In any case, any i1' and i2' computation must pass through a computation of the original filter's i1 and i2. There is simply no way to extract more entropy from the original inserted value x, since x no longer exists.

To me, this says one should carefully weigh the simpler "expand b" approach, which does not complicate the computations of i1 and i2, nor unbalance the table. In fact, it returns you to maximum insertion speed, which neither Bin Fan's scheme nor my symmetric variant achieves. You're paying a price in terms of false positives though.

Yet another scheme is simply adding a second filter while freezing the current filter for inserts, allowing only lookups on it. This increases your cache misses, but presumably you've tweaked the underlying filter so that this only needs to happen a few times.
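That freeze-and-chain idea might look something like this (hypothetical names, assuming only a minimal Lookup/Insert interface on the underlying filter):

```go
package cuckoo

// Filter is the minimal interface this sketch assumes the underlying
// cuckoo filter exposes.
type Filter interface {
	Lookup(data []byte) bool
	Insert(data []byte) bool
}

// chained freezes older, full filters for inserts and only writes into
// the most recent one; lookups must consult every filter, which is the
// extra cache-miss cost mentioned above.
type chained struct {
	frozen []Filter // older filters: lookup only
	active Filter   // current filter: receives all new inserts
}

func (c *chained) Insert(data []byte) bool {
	return c.active.Insert(data)
}

func (c *chained) Lookup(data []byte) bool {
	if c.active.Lookup(data) {
		return true
	}
	for _, f := range c.frozen {
		if f.Lookup(data) {
			return true
		}
	}
	return false
}

// grow freezes the current filter and starts inserting into a new
// (presumably larger) one.
func (c *chained) grow(next Filter) {
	c.frozen = append(c.frozen, c.active)
	c.active = next
}
```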

@burdges

burdges commented Aug 31, 2016

Appears all these schemes should impact the false positive rate similarly. It's simply that, if a false positive would happen in the original filter, then it must happen in the enlarged filter too because you cannot draw more entropy from x after insertion.

Just expanding the bucket width b pays almost no computational cost provided buckets stay cache aligned, meaning you double b every time you increase it. All these remaining schemes pay some extra hash cost, probably not much but something.
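A sketch of that in-place widening, assuming a flat slice of 1-byte fingerprints laid out bucket by bucket; the bucket count (and hence every i1/i2) is untouched, and doubling b keeps each bucket a power-of-two number of bytes so it can stay cache-line aligned:

```go
package cuckoo

// doubleBucketWidth grows each bucket from b to 2*b slots without
// changing the number of buckets, so no index or hash is recomputed:
// each old bucket is copied into the first half of its new, wider
// bucket and the second half starts out empty. The cost shows up in
// the false positive rate, as discussed above, not in hashing.
func doubleBucketWidth(table []uint8, numBuckets, b int) []uint8 {
	wider := make([]uint8, numBuckets*2*b)
	for i := 0; i < numBuckets; i++ {
		copy(wider[i*2*b:i*2*b+b], table[i*b:(i+1)*b])
	}
	return wider
}
```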

It follows that doubling b should be more efficient than the extra hash schemes if you always wish to double your filter size. If you do not wish to double then you might run into situations where some buckets straddle cache lines, so my symmetric version of Bin Fan's scheme should be more efficient, i.e. i1' = hash(fingerprint ++ sort(i1,i2)) % M.
