An explanation of the revised lzo_compresses function in stream.c.

The modifications to lrzip for 0.19 centered around an attempt to catch data chunks that would cause lzma compression either to take an inordinately long time or not to complete at all. The files that can cause problems for lzma are already-compressed files, multimedia files, files that contain compressed files, and files with randomized data (such as an encrypted volume or file).

The lzo_compresses function assesses the data and returns TRUE or FALSE to the lzma_compress_buf function according to whether it determined the data to be compressible. The simple test cdata < odata was used (c=compressed, o=original). Some cases were slipping through this test and caused the hangups.

Beginning with lrzip-0.19 a new option, -T (compression threshold), has been introduced. It sets configurable limits on what is considered a compressible data chunk and what is not. In addition, for very large chunks of data, a small modification was made to the initial test buffer size to make it more representative of the entire sample. To go along with this, increased verbosity was added to the function so that the user/evaluator can better see what is going on; -v or -vv can be used to increase informational output.

Functional Overview:

Data chunks are passed to the lzo_compresses function in two streams. The first is the small data set in the primary hashing bucket, which can be seen when using the -v or -vv option; this is normally a small sample. The second stream is the rest. The sizes of the streams depend on the long range analysis that is performed on the entire file and on available memory. After analysis of the data chunk, a value of TRUE or FALSE is returned and lzma compression either commences or is skipped. If skipped, the data written out to the .lrz file is simply the rzip data, which is the data reorganized by the long range analysis.

The lzo_compresses function traverses the data chunk comparing larger and larger blocks. If a suitable compression ratio is found, the function ends and returns TRUE. If not, and the largest sample block size has been reached, the function traverses deeper into the chunk and analyzes that region. Any time a compressible area is found, the function returns TRUE. If the end of the data chunk is reached and no suitably compressible block has been found, the function returns FALSE.

Under most circumstances this logic was fine. However, if the test found a chunk that could achieve only 2% compression, for example, that result could still adversely affect the lzma compression routine. Hence the concept of a limiting threshold.

The threshold option works as a limiter that forces the lzo_compresses function not merely to compare the estimated compressed size with the original, but to apply a limiting threshold as well. It ranges from a very lenient 1 to a very strict 10. A threshold of 1 means that for the function to return TRUE, the estimated compressed data size merely has to be smaller than the original; even data that only shrinks to between 90% and 100% of its original size passes, so almost no compressibility is required. A value of 2 means the data MUST compress to less than 90% of the original size; if the estimated compressed size is above 90% of the original, lzo_compresses will fail the chunk.
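To make the limiting behaviour concrete, here is a minimal C sketch of how such a threshold-limited test could be written. It is illustrative only and not the actual stream.c code: the function and variable names (passes_threshold, c_len, s_len, threshold) are made up, and it simply applies the relationship spelled out in the formula that follows.

/* Hypothetical sketch of a threshold-limited compressibility test.
 * Not the actual stream.c code: names are illustrative only.
 * c_len is the size LZO produced for the test buffer, s_len is the
 * original size, threshold is the -T value (1 to 10). */
#include <stdio.h>

static int passes_threshold(long long c_len, long long s_len, int threshold)
{
    /* At T=1 the multiplier is 1.0 (any compression at all passes);
     * each additional step demands another 10% of savings. */
    double limit = (1.1 - (double)threshold / 10.0) * (double)s_len;

    return (double)c_len < limit;  /* TRUE -> worth handing to lzma */
}

int main(void)
{
    /* Example: a chunk that LZO can only squeeze to 98.87% of its
     * original size passes at -T 1 but fails at -T 2. */
    printf("T=1: %d\n", passes_threshold(9887, 10000, 1));  /* prints 1 */
    printf("T=2: %d\n", passes_threshold(9887, 10000, 2));  /* prints 0 */
    return 0;
}

With these numbers, barely compressible data is accepted at -T 1 but rejected at -T 2, which is exactly the behaviour described for the .VOB example further below.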
Each additional threshold value increases the strictness according to the following formula:

    CDS = observed compressed data size from LZO
    ODS = original data chunk size
    T   = threshold

    To return TRUE: CDS < ODS * (1.1 - T/10)

At T=1, just 0.01% compression would be OK. At T=2, anything better than 10% compression is OK, but under 10% compression fails. At T=3, anything better than 20% is OK, but under 20% fails. ... At T=10 (I can't imagine a use for this), only better than 90% compression is OK, which implies LZO would need to achieve a 10x compression ratio.

The following actual output from the lzo_compresses function will help explain:

    22501 in primary bucket (0.805%)
    lzo testing for incompressible data...OK for chunk 43408. Compressed size = 52.58% of chunk, 1 Passes
    Progress percentage pausing during lzma compression...
    lzo testing for incompressible data...FAILED - below threshold for chunk 523245383. Compressed size = 98.87% of chunk, 50 Passes

This was for a 1GB video .VOB file compressed with a threshold of 2. -T 2 means that the estimated compressed size of a data chunk had to be less than 90% of the original size. The first chunk, the 43,408 bytes associated with the primary hash bucket, was evaluated by lzo_compresses, which estimated that the compressed data size would be 52.58% of the original. This resulted in LZMA compression occurring.

The second data chunk, which included the rest of the data in the current hash, 523,245,383 bytes, failed the test. The lzo_compresses function made 50 passes through the data using progressively larger samples until it reached the end of the chunk. It could not find better than a 1.2% compression benefit and therefore FAILED. The result was no LZMA compression, and the data chunk was written to the .lrz file in rzip format (no further compression).

The higher the threshold, the faster compression will run, because more chunks are written straight to the .lrz file without LZMA compression. However, this can also cause some chunks that are in fact compressible to be skipped. After much testing, -T 2 seems to work very well in stopping data that would cause LZMA to hang while allowing most compressible data to come through.

Peter Hyman
pete@peterhyman.com
December 2007