Eager Empty Bucket Checking #148

Gillgamesh · 2024-03-27T19:43:57Z

Implements the following changes:

The function effective_size() tracks the number of rows in a sketch column that are at or below a nonzero bucket. That is, it returns (index of deepest nonempty bucket) + 1.
There is an optional flag that enables tracking of a nonempty_buckets bit array. With it, effective_size() takes a constant number of instructions using clz (count leading zeros). The bit array is updated on calls to the three merge functions and to update().
sample() and exhaustive_sample() have been updated to check only a small constant number of buckets below effective_size() for each column. This is based on several theoretical observations about the concentration of good buckets near the "best" bucket, on both the right and left. This should speed up calls to them by a factor of 2-4x without sacrificing success probability.
merge() and range_merge() now use effective_size() to skip merging buckets that are definitely zero.
Similarly, serialize() and the deserialize constructor also use effective_size() to shrink the space needed for storing and sending sketches.

The performance testing still needs to be done thoroughly, but this should slightly speed up merging, significantly speed up querying, and slightly slow down updating.
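
For readers following along, the bit-array variant of effective_size() described in the change list can be sketched roughly as follows. This is only an illustration under assumptions: mask_t, set_bit, and clear_bit here are hypothetical stand-ins for one column's nonempty_buckets word and its helpers, bucket i is assumed to map to bit i (least significant bit first), and __builtin_clzll is the GCC/Clang builtin.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one column's nonempty_buckets word:
// bit i is set when bucket i (row i) of the column is nonempty.
using mask_t = unsigned long long;
constexpr int kMaskBits = sizeof(mask_t) * 8;

inline void set_bit(mask_t &m, int pos) {
  assert(pos >= 0 && pos < kMaskBits);
  m |= (mask_t{1} << pos);
}

inline void clear_bit(mask_t &m, int pos) {
  assert(pos >= 0 && pos < kMaskBits);
  m &= ~(mask_t{1} << pos);
}

// (index of deepest nonempty bucket) + 1, or 0 for an all-empty column.
inline uint8_t effective_size(mask_t m) {
  if (m == 0) return 0;  // __builtin_clzll(0) is undefined, so handle it first
  return static_cast<uint8_t>(kMaskBits - __builtin_clzll(m));
}
```

With this layout, a column whose only nonempty buckets are rows 0 and 5 has an effective size of 6, and clearing row 5 drops it back to 1.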

etwest · 2024-03-28T00:20:57Z

Since this pull request modifies the Sketch class, which is essential to correctness and performance, I think we should be extremely thorough in our review. This means we may end up making a lot of critiques and suggestions. I just want to set that expectation before we start.

include/bucket.h

etwest · 2024-03-28T18:59:03Z

include/sketch.h

@@ -125,6 +125,18 @@ class Sketch {

  std::mutex mutex; // lock the sketch for applying updates in multithreaded processing

+
+  /**


Line up * and provide a little more documentation. Specify that this operates per column and that all non-empty buckets are above this cutoff.

Better name?

Gillgamesh · 2024-04-02

couldn't think of one

etwest · 2024-03-28T19:10:29Z

src/sketch.cpp

@@ -18,6 +27,14 @@ Sketch::Sketch(vec_t vector_len, uint64_t seed, size_t _samples, size_t _cols) :
    buckets[i].alpha = 0;
    buckets[i].gamma = 0;
  }
+
+  #ifdef EAGER_BUCKET_CHECK


We want to have two forms of serialization:

Direct "no-copy" serialization where we pull the bucket array data out directly.

Compressed serialization which leverages the known column sizes to be more data size efficient.

To pull this off we need the buckets and flags to be stored contiguously. Suggest allocating a few extra buckets that are then used for the flags.

NOTE: We will all have to think about the right way to "receive" these serialized forms on the other side.
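
One possible shape for the compressed form is sketched below, purely as an illustration. It assumes a column-major bucket array, a hypothetical two-field Bucket, and a caller-supplied per-column effective size. The names serialize_compressed and deserialize_compressed are made up for this example and are not the PR's API, and the byte layout is host-endian.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Bucket { uint64_t alpha; uint64_t gamma; };  // hypothetical layout

// Column-major layout assumed: column c occupies buckets[c*rows .. c*rows+rows).
// sizes[c] is the column's effective size (deepest nonempty row + 1).
std::vector<uint8_t> serialize_compressed(const std::vector<Bucket> &buckets,
                                          const std::vector<uint8_t> &sizes,
                                          size_t rows) {
  assert(buckets.size() == sizes.size() * rows);
  std::vector<uint8_t> out;
  for (size_t c = 0; c < sizes.size(); ++c) {
    assert(sizes[c] <= rows);
    out.push_back(sizes[c]);  // one-byte header per column
    const auto *src = reinterpret_cast<const uint8_t *>(buckets.data() + c * rows);
    out.insert(out.end(), src, src + sizes[c] * sizeof(Bucket));
  }
  return out;
}

// Rebuilds the full bucket array; rows past each column's size stay zero.
std::vector<Bucket> deserialize_compressed(const std::vector<uint8_t> &in,
                                           size_t cols, size_t rows) {
  std::vector<Bucket> buckets(cols * rows);  // value-initialized: all empty
  size_t pos = 0;
  for (size_t c = 0; c < cols; ++c) {
    const size_t n = in.at(pos++);
    assert(n <= rows && pos + n * sizeof(Bucket) <= in.size());
    std::memcpy(buckets.data() + c * rows, in.data() + pos, n * sizeof(Bucket));
    pos += n * sizeof(Bucket);
  }
  return buckets;
}
```

The receiver only needs the column and row counts to rebuild the array, which is one way to approach the "receive" question. The direct no-copy form would instead hand out the contiguous bucket-plus-flags storage as is.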

src/sketch.cpp

etwest · 2024-03-28T19:14:10Z

src/sketch.cpp

@@ -78,6 +118,13 @@ void Sketch::update(const vec_t update_idx) {
    size_t bucket_id = i * bkt_per_col + depth;
    likely_if(depth < bkt_per_col) {
      Bucket_Boruvka::update(buckets[bucket_id], update_idx, checksum);
+      #ifdef EAGER_BUCKET_CHECK
+      likely_if(!Bucket_Boruvka::is_empty(buckets[bucket_id])) {


unlikely_if(is_empty)

Investigate the performance of the following:

unlikely_if(is_empty) set_bit()
update()
unlikely_if(is_empty) clear_bit()
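
The ordering suggested above can be sketched as follows. This is a simplified illustration, not the PR's code: Bucket, is_empty, and update_bucket are hypothetical stand-ins for the Bucket_Boruvka helpers, the update is assumed to be an XOR-style toggle, and __builtin_expect (GCC/Clang) plays the role of unlikely_if.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified bucket; the real Bucket type may differ.
struct Bucket {
  uint64_t alpha = 0;
  uint64_t gamma = 0;
};

inline bool is_empty(const Bucket &b) { return b.alpha == 0 && b.gamma == 0; }

// XOR-style toggle: applying the same update twice empties the bucket again.
inline void update_bucket(Bucket &b, uint64_t idx, uint64_t checksum) {
  b.alpha ^= idx;
  b.gamma ^= checksum;
}

// Update one bucket and keep the column's nonempty bit (bit `depth`) in sync.
inline void update_and_track(Bucket &b, uint64_t &nonempty, int depth,
                             uint64_t idx, uint64_t checksum) {
  assert(depth >= 0 && depth < 64);
  const uint64_t bit = uint64_t{1} << depth;
  if (__builtin_expect(is_empty(b), 0)) nonempty |= bit;   // was empty: mark before updating
  update_bucket(b, idx, checksum);
  if (__builtin_expect(is_empty(b), 0)) nonempty &= ~bit;  // update cancelled the contents
}
```

Both branches are hinted as unlikely, on the assumption that most updates land in buckets that are nonempty before and after. Whether that beats the single likely_if check is exactly what the benchmark would need to show.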

etwest · 2024-03-28T19:19:38Z

src/sketch.cpp

-        return {buckets[bucket_id].alpha, GOOD};
+
+  for (size_t col = first_column; col < first_column + cols_per_sample; ++col) {
+    int row = effective_size(col)-1;


int(effective_size(col))

etwest · 2024-03-28T19:46:29Z

src/sketch.cpp

+{
+  // first, check for emptyness
+  Bucket *current_row = buckets + (col_idx * bkt_per_col);
+  if (Bucket_Boruvka::is_empty(buckets[num_buckets - 1]))


I would drop this or move it into the #else. It's a more expensive call than the clzll

etwest · 2024-03-28T19:52:12Z

src/sketch.cpp

+  }
+#ifdef EAGER_BUCKET_CHECK
+  unlikely_if(nonempty_buckets[col_idx] == 0) return 0;
+  return (uint8_t)((sizeof(unsigned long long) * 8) - __builtin_clzll(nonempty_buckets[col_idx]));


Is there a convenient way to use ctzll? It's about 3 times faster than clzll

etwest · 2024-03-28T19:53:39Z

tools/benchmark/graphcc_bench.cpp

@@ -17,7 +17,11 @@

 constexpr uint64_t KB = 1024;


Please clean up this file (i.e. get rid of the HashSet etc.).

etwest · 2024-03-28T19:53:58Z

tools/benchmark/graphcc_bench.cpp

+}
+BENCHMARK(BM_Std_Set_Hash_Iterator)->RangeMultiplier(2)->Range(1, 1 << 14);
+
+BENCHMARK_MAIN();


Add a newline at the end of the file.

etwest · 2024-03-28T19:59:53Z

src/sketch.cpp

@@ -5,6 +5,15 @@
 #include <vector>
 #include <cassert>

+
+inline static void set_bit(vec_t &t, int position) {


Suggest switching the position of first bucket to most significant bit.
t |= vec_t(1) << (sizeof(vec_t) * 8 - 1 - position)
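
A rough sketch of the most-significant-bit-first layout follows, under a few assumptions: vec_t is taken to be a 64-bit unsigned integer, position is zero-based, and __builtin_ctzll is the GCC/Clang builtin. The literal is widened to vec_t before shifting so the shift is not performed on an int. With this layout the deepest nonempty bucket is the lowest set bit, so effective_size() can count trailing zeros instead of leading zeros.

```cpp
#include <cassert>
#include <cstdint>

using vec_t = uint64_t;  // assumption: vec_t is a 64-bit unsigned integer
constexpr int kBits = sizeof(vec_t) * 8;

// Bucket 0 maps to the most significant bit, bucket 63 to the least significant.
inline void set_bit(vec_t &t, int position) {
  assert(position >= 0 && position < kBits);
  t |= vec_t{1} << (kBits - 1 - position);
}

inline void clear_bit(vec_t &t, int position) {
  assert(position >= 0 && position < kBits);
  t &= ~(vec_t{1} << (kBits - 1 - position));
}

// (index of deepest nonempty bucket) + 1, or 0 if no bit is set.
inline uint8_t effective_size(vec_t t) {
  if (t == 0) return 0;  // __builtin_ctzll(0) is undefined, so handle it first
  return static_cast<uint8_t>(kBits - __builtin_ctzll(t));
}
```

For example, setting positions 0 and 5 gives an effective size of 6, the same answer as the least-significant-bit-first layout with clz.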

Gillgamesh added 2 commits March 26, 2024 22:47

first attempt at clean impl of eager checking

dadd0be

Working Impl of Eager Bucket Checking

56188ab

Gillgamesh requested review from DanielDeLayo and etwest March 27, 2024 19:43

benchmarking

4af3f9a

etwest reviewed Mar 28, 2024

etwest reviewed Mar 28, 2024

etwest requested changes Mar 28, 2024

etwest mentioned this pull request Apr 2, 2024

Updating Query and Merge Procedure for Sketches #147

Open

Gillgamesh added 4 commits April 13, 2024 09:33

first set for small refactors

ab41635

stuff

1032730

Work on making row-major vs column-major mostly modular

4b878ed

compressed serialization with row major flag

35211d2

DanielDeLayo removed their request for review September 6, 2024 18:09
