CancelAllBackgroundWork: Flush will not wait for stall conditions to clear #832
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When closing the db, CancelAllBackgroundWork is called and a flush is initiated when there are unpersisted writes (without wal - has_unpersisted_data_ is true).
This flush currently waits for some stall conditions to clear (in WaitUntilFlushWouldNotStallWrites).
The wait is controlled by FlushOptions.allow_write_stall and is false by default.
Such behaviour does not seem logical since the user simply wants to close the db and they should not be forced to wait for stall conditions.
In cases such as db_bench, this leads to some obscurity by sometimes hiding the compaction debt that was accumulated during the test.
This hiding effect is caused when the test ends in a stall condition and it has unpersisted data. As mentioned above, new compactions will be executed since CancelAllBackgroundWork is waiting until the flush ends which in turn waits for the stall conditions to clear.
A different test which has a lower compaction debt will close in much shorter time and the benefit of having a lower compaction debt will be overlooked
Solve this by allowing the flush to run despite the stall conditions and thus only currently running bg operations will be executed and not new ones.
Results:
fillup cmd:
db_bench --compression_type=None -db=/data/ -num=60000000 -value_size=1000 -key_size=16 -report_interval_seconds=1 -stats_dump_period_sec=120 -num_column_families=1 -write_buffer_size=67108864 -histogram --delayed_write_rate=536870912 -max_write_buffer_number=4 -db_write_buffer_size=0 -max_background_compactions=8 -cache_size=8388608 -max_background_flushes=4 -bloom_bits=10 -benchmark_read_rate_limit=0 -benchmark_write_rate_limit=0 -report_file=fillrandom.csv --disable_wal=true --benchmarks=fillrandom,levelstats,memstats,stats -max_write_buffer_number=8
It takes 3 secs for Shutdown (" Shutdown: canceling all background work" in the LOG) in this branch and almost 40 secs in main.
lsm state when CancelAllBackgroundWork is called in fillup:
main:
files[33 62 39 420 71 0 0]
this branch:
files[33 22 28 428 86 0 0]
lsm state at start of next test:
main:
this branch:
As can be seen, the main branch has done much compaction between the two tests which is unaccounted for in the performance measurements.