Poor compression & quality for difficult-to-compress data #87
Hi Peter, I tested that double-type dataset (its values are very random, as I observed, so the compression ratio is low).

[Table: ABS error bound, bit-rate, PSNR]

I think this result makes more sense. Best,
Sheng, thanks for the suggestion. As someone who has not played around with SZ parameter settings much, is there a way to algorithmically choose the number of quantization bins, or is this basically a trial and error process that one has to go through for each data set? As evidenced by the table you listed, the rate increase starts climbing again for the last few tolerances (ideally, the rate would not increase by more than one bit), which suggests to me that one would then have to increase the number of quantization bins even further. Is there a limit on the number of bins? Should it be a power of two? I imagine that increasing the number of bins may adversely affect the compression ratio for large tolerances, i.e., there may be no single setting that works well for a wide range of tolerances. Otherwise, would there not be a better default setting? Finally, is there a way to set this parameter without using an sz.config file? It is a bit cumbersome to have to generate such a text file when using the SZ command-line utility in shell scripts.
I'll leave the rest to Sheng.
@lindstro Not with the sz command line, but you can with the libpressio command line:
git clone https://github.com/robertu94/spack_packages robertu94_packages
spack repo add ./robertu94_packages
spack install libpressio-tools ^ libpressio+sz+zfp
# if your compiler is older (pre-C++17) you might need this instead
spack install libpressio-tools ^ libpressio+sz+zfp ^ libstdcompat+boost
spack load libpressio-tools
# doesn't matter here, but dims are in fortran order for all compressors; last 3 args print metrics
pressio -i ./chaotic_data.dat -t double -d 256 -d 256 -d 256 -b compressor=sz -o sz:max_quant_intervals=67108864 -o pressio:abs=1e-14 -m time -m size -M all
#print help for a compressor
pressio -b compressor=sz -a help
@robertu94 Thanks for pointing that out. That is very convenient and another reason why I need to take a closer look at libpressio.
Hi Peter, to check the detailed components of the compressed data in SZ, you can enable SZ's stats support. Then, when you use the 'sz' executable, you can add the option -q to print more detailed stats, such as the Huffman tree size and encoded bytes. Best,
@lindstro to elaborate on this
LibPressio's CLI is capable of outputting this information as well. It will automatically detect that you compiled 'sz+stats' and add these metrics to the output above. The entire install command would then be the same as the spack install command above, with the 'sz+stats' variant added.
Thanks for the explanation. I would argue that a tolerance of 2^-20 ≈ 10^-6 for data in (0, 1) is not all that low; it provides less accuracy than single precision. But it is good to know that this limitation is not unique to this data set, even if such difficult-to-compress data may require a larger number of quantization bins. Just to make sure I fully understand, using 2^n bins allows SZ to encode residuals in roughly n bits (depending on entropy, of course). But for very large residuals (or small n) outside the range of available bins, SZ may have to record the residual as a full floating-point value, which generally is more expensive. And for random data, SZ is likely to give poor predictions that result in frequent, large residuals. Do I have that right?
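For concreteness, here is a minimal sketch of the linear-scaling quantization idea discussed in the question above, assuming the simplified model used in this thread (bin width 2*err_bound, total coverage num_bins*2*err_bound); it is not SZ's actual implementation, and the function and variable names are placeholders:

```python
import numpy as np

def quantize_residual(value, predicted, err_bound, num_bins):
    """Map a prediction residual to a bin index, or None if it falls outside the bins.

    Simplified model: each bin is 2*err_bound wide, so num_bins bins cover a
    residual range of num_bins * 2 * err_bound centered on the prediction.
    """
    residual = value - predicted
    code = int(np.round(residual / (2 * err_bound)))
    if abs(code) < num_bins // 2:
        # Predictable: the bin index costs roughly log2(num_bins) bits before entropy coding.
        return code
    # Unpredictable: an SZ-like codec stores such values (nearly) verbatim instead,
    # which is much more expensive than an entropy-coded bin index.
    return None

# A poorly predicted value in (0, 1) at a tight tolerance:
print(quantize_residual(0.9, 0.1, err_bound=2.0**-20, num_bins=2**16))  # None: residual out of range
print(quantize_residual(0.9, 0.1, err_bound=2.0**-20, num_bins=2**26))  # large bin index, but representable
```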
That sz+stats variant seems handy. Is there an analogous setting for CMake builds?
I see. In this case, should one rerun with the number of bins reported?
Leaving the rest to @disheng222
@lindstro The corresponding setting is BUILD_STATS:BOOL=ON. That is what libpressio uses when it is built with the sz+stats variant.
Doh! Not sure how I missed that. Thanks.
@lindstro "Just to make sure I fully understand, using 2^n bins allows SZ to encode residuals in roughly n bits (depending on entropy, of course). ......."

You can use -q or -p or both in your compression operation as follows:

[sdi@localhost example]$ sz -p -q -z -d -c sz.config -i chaotic_data.dat -M ABS -A 1E-5 -3 256 256 256

The metadata printed by -p is also stored in the compressed data, so you can use -p to inspect a compressed file as follows:

[sdi@localhost example]$ sz -p -s chaotic_data.dat.sz
@disheng222 I tried your proposed fix of using 2^26 quantization bins. This does improve things a little for this type of random data at high rates (low tolerances). There's a huge jump in rate, from about 32 to 60, when halving the error tolerance. That's a bit surprising. I'm also including the results of this change in number of bins in the case of compressible data--the Miranda viscosity field from SDRBench. It doesn't seem that using more bins helps at high rates here. In fact, SZ does worse when increasing the number of bins. I'm just curious if there are other parameters that might help.
I tested SZ3 using the maximum number of quantization bins, 1048576, which can be set in sz3.config. The result (i.e., SZ3_1m) looks better than both SZ2 and SZ3 (default). Best,
@disheng222 Thanks for your suggestions. We decided to go with SZ2 because it supports a pointwise relative error bound. AFAIK, SZ3 does not. Your Miranda plots look much like the "default" curve I included and less like the S-shaped curve I got after increasing the number of bins. Do you know why there's such a large jump in rate for the more random data? |
Hi Peter,
SZ3 also supports point-wise relative error bounds, but we haven't released this function in the API yet. We can do it soon, and will let you know when it's ready.
As for the large jump, there are two situations.

If you are using the default setting (65536 bins), when the error bound is small enough, most of the data points cannot be predicted and covered within the quantization bin range (i.e., 65536*2*err_bound). In this situation, most of the values are compressed by the binary-representation analysis (e.g., truncating insignificant bits). That is, this is a different "compression method" compared with the classic SZ pipeline (prediction+quantization+Huffman+...).

If you are using a very large number of quantization bins, when the error bound is small, the quantization bin range may still cover the distance between the predicted value and the real data value. That is, the dominant method is still prediction+quantization+... However, note that the Huffman tree needs to be stored together with the compressed data. When the de facto number of quantization bins is very large, the Huffman tree size can be large. I didn't use a very compact way to store the Huffman tree because I assumed its size would be negligible relative to the overall compressed data size (which is not true for extremely small error bounds, however). So I believe the Huffman tree overhead could be a factor in the jump, especially when the original dataset has a small or medium size (such as 10 MB, 100 MB, or so, depending on how small the error bound is).
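To make the two regimes concrete, here is a rough back-of-the-envelope sketch; the residual-coverage formula follows the num_bins*2*err_bound description above, while the bytes-per-symbol figure in the Huffman-table estimate is a made-up placeholder rather than SZ's actual storage format:

```python
def covered_radius(num_bins, err_bound):
    # Linear-scaling quantization covers a residual range of num_bins * 2 * err_bound
    # in total, i.e. roughly +/- num_bins * err_bound around the predicted value.
    return num_bins * err_bound

for e in (13, 16, 20):            # absolute error bound = 2^-e
    for b in (16, 26):            # number of quantization bins = 2^b
        print(f"bins=2^{b}, tol=2^-{e}: residuals up to +/- {covered_radius(2**b, 2.0**-e):g} "
              "stay on the prediction+quantization path")

# Crude Huffman-header estimate: if k distinct bins actually occur and the serialized
# tree costs about bytes_per_symbol bytes per symbol (a placeholder figure, not SZ's
# real format), the header alone is k * bytes_per_symbol bytes.
k, bytes_per_symbol = 2**22, 8
print(f"~{k * bytes_per_symbol / 2**20:.0f} MiB of Huffman-tree overhead "
      f"for 2^22 used bins at {bytes_per_symbol} bytes/symbol")
```

With the default 65536 bins and a 2^-20 bound, for instance, only residuals within about ±0.06 stay on the quantization path, which is easy to exceed for poorly predicted random data in (0, 1).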
@disheng222 Thanks--that would be great. If you don't mind, can we leave this issue open until you have added support to SZ3? I may have some follow-up questions once that's working. |
Sure. Let's keep this issue open.
I am doing some compression studies that involve difficult-to-compress (even incompressible) data. Consider the chaotic data generated by the logistic map x_{i+1} = 4 x_i (1 - x_i).
We wouldn't expect this data to compress at all, but the inherent randomness at least suggests a predictable relationship between (RMS) error, E, and rate, R. Let σ = 1/√8 denote the standard deviation of the input data and define the accuracy gain as
α = log₂(σ / E) - R.
Then each one-bit increase in rate, R, should result in a halving of E, so that α is essentially constant. The limiting behavior is slightly different as R → 0 or E → 0, but over a large range α ought to be constant.
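For reference, here is a short sketch of how the test data and the accuracy gain α can be computed; the file name input.bin and the seed value are arbitrary placeholders, and the recurrence loop is slow in pure Python but fine for a one-off:

```python
import numpy as np

# Generate the chaotic test data: 256^3 doubles from the logistic map
# x_{i+1} = 4 x_i (1 - x_i), written as a flat binary file.
n = 256**3
x = np.empty(n, dtype=np.float64)
x[0] = 0.123456789                  # arbitrary seed in (0, 1), avoiding the fixed points 0 and 0.75
for i in range(1, n):
    x[i] = 4.0 * x[i - 1] * (1.0 - x[i - 1])
x.tofile("input.bin")

def accuracy_gain(original, decompressed, compressed_bytes):
    """alpha = log2(sigma / E) - R, with E the RMS error and R the rate in bits per value."""
    E = np.sqrt(np.mean((original - decompressed) ** 2))
    R = 8.0 * compressed_bytes / original.size
    sigma = 1.0 / np.sqrt(8.0)      # stddev of the logistic map's invariant distribution
    return np.log2(sigma / E) - R
```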
Below is a plot of α(R) for SZ 2.1.12.3 and other compressors applied to the above data interpreted as a 3D array of size 256 × 256 × 256. Here SZ's absolute error tolerance mode was used:
sz -d -3 256 256 256 -M ABS -A tolerance -i input.bin -z output.sz

The tolerance was halved for each subsequent data point, starting with tolerance = 1. The plot suggests an odd relationship between R and E, with very poor compression observed for small tolerances. For instance, when the tolerance is in {2^-13, 2^-14, 2^-15, 2^-16}, the corresponding rate is {13.9, 15.3, 18.2, 30.8}, while we would expect R to increase by one bit in each case. Is this perhaps a bug in SZ? Similar behavior is observed for other difficult-to-compress data sets (see rballester/tthresh#7).
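And a sketch of the sweep itself, driving the sz command shown above via subprocess and halving the tolerance at each step; the 20-step count and file names are arbitrary, and the rate is derived from the compressed file size rather than from sz's own output:

```python
import os
import subprocess

n = 256**3
tolerance = 1.0
for step in range(20):
    # Mirrors the sz command line above; each run overwrites output.sz.
    subprocess.run(
        ["sz", "-d", "-3", "256", "256", "256",
         "-M", "ABS", "-A", str(tolerance),
         "-i", "input.bin", "-z", "output.sz"],
        check=True,
    )
    rate = 8.0 * os.path.getsize("output.sz") / n   # bits per value
    print(f"tolerance = 2^-{step}: rate = {rate:.2f} bits/value")
    tolerance /= 2
```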