lrzip v0.23

Long Range ZIP or Lzma RZIP

This is a compression program optimised for large files. The larger the file
and the more memory you have, the better the compression advantage this will
provide, especially once the files are larger than 100MB. The advantage can
be chosen to be either size (much smaller than bzip2) or speed (much faster
than bzip2). Decompression is always much faster than bzip2.

Lrzip uses an extended version of rzip which does a first pass long distance
redundancy reduction. The lrzip modifications make it scale according to
memory size. The data is then either:

1. Compressed by lzma (default) which gives excellent compression at
approximately half the speed of bzip2 compression
2. Compressed by lzo which on most machines compresses faster than disk
writing, making it as fast as (or even faster than) simply copying a large
file
3. Left uncompressed and rzip prepared. This form substantially improves any
compression performed on the resulting file in both size and speed (due to
the nature of rzip preparation merging similar compressible blocks of data
and creating a smaller file).
4. Compressed by bzip2 as an rzip-like compression format.

The major disadvantages are:
1. It only works on single files. To get the best performance out of the
compression it is best to tarball all your files together.
2. It requires a lot of memory to get the best performance out of, and is
not really usable (for compression) with less than 256MB. Decompression
requires very little ram and works on small ram machines.
3. Does not work on stdin/stdout.

(See file Current-Benchmarks.txt for updated information)
Example on a 1GB ram P4 3GHz:
A tarball of a fully compiled kernel tree:

            Size (bytes)  Compression  Decompression
base file:  646963200
gzip        218071923     1:27.27      0:45.39
bzip2       192484690     4:41.62      1:41.20
bzip2 -1    215555795     3:24.08      1:21.45
bzip2 -9    192484690     4:53.18      1:31.40
lzma        112229937     11:48.07     0:56.38
lzma -9     97704505      27:18.77     ?
lrzip       88560021      10:11.28     0:57.88
lrzip -l    191415649     0:30.19      0:50.69
lrzip -M    82708048      11:45.79     1:00.75
lrzip -n    389125460     0:31.02      0:58.9

Summary:
            Ratio         Value(Ratio/Time)
gzip        2.97          2.048
bzip2       3.36          0.717
bzip2 -1    3.00          0.883
bzip2 -9    3.36          0.688
lzma        5.76          0.488
lzma -9     6.62          0.242
lrzip       7.31          0.718
lrzip -l    3.38          6.760 *
lrzip -M    7.82 *        0.666
lrzip -n    1.66          3.222

Requires:
liblzo2-dev
libbz2-dev
libz-dev
libm

To build/install:
./configure
make
make install

FAQS.
Q. How do I make a static build?
A. make static

Q. I want the absolute maximum compression I can possibly get, what do I do?
A. Try the -M option. Note it will use all available ram so expect serious
swapping to occur. It may even fail to run if you do not have enough swap
space allocated. Why? Well the more ram lrzip uses the better the
compression it can achieve.

Q. Can I use your tool for even more compression than lzma offers?
A. Yes, the rzip preparation of files makes them more compressible by every
other compression technique I have tried. Using the -n option will generate
a .lrz file smaller than the original which should be more compressible, and
since it is smaller it will compress faster than it otherwise would have.

Q. How about 64bit?
A. As of v0.15 64 bit is working well, but the lzo library may give grief
due to naming differences.

Q. Other operating systems?
A. Patches are welcome. The configure/build system works only on linux at
the moment, but a darwin specific Makefile without configure is included
that should work.

Q. Can it be made to work on stdin/stdout?
A. The rzip design basically works in a way that makes this virtually
impossible.

Q. Really why can't I use stdin/stdout?
A. Well the first compression stage (rzip) takes the largest chunk of the
file your ram can fit and completely reorders all the data in it. Then it
hands over the data in chunks to the compressor. Then it is written to disk.
So theoretically for stdin it could buffer all input till it filled the
chunk size and then start compressing. So adding stdin would not be too big
a stretch. On the other side though, with stdout, the data cannot be fed to
anything till it is completely decompressed and re-ordered into the original
chunk size. Theoretically we could decompress a whole chunk in ram, reorder
it and then start piping it to stdout. This would mean the decompression ram
requirements would be almost as big as the compression requirements, which
makes it not portable to machines with less ram. Currently lrzip uses
extraordinarily little ram on decompression, and is very fast. Adding stdout
support would cancel both of those advantages.
The other option for supporting stdin/stdout is to do each chunk to a
separate file and then feed it. None of these are particularly desirable or
practical. Since stdout support is impractical, there is no point
implementing just stdin.

Q. I still want stdin/stdout?
A. I take patches.

Q. I have another compression format that is even better than lzma, can you
use that?
A. You can use it yourself on rzip prepared files (see above). Alternatively
if the source code is compatible with the GPL license it can be added to the
lrzip source code. Libraries with functions similar to the compress() and
decompress() functions of zlib would make the process most painless. Please
tell me if you have such a library so I can include it :)

Q. What's this "Progress percentage pausing during lzma compression" message?
A. While I'm a big fan of progress percentage being visible, unfortunately
lzma compression can't currently be tracked when handing 100+MB chunks over
to the lzma library. Therefore you'll see progress percentage until each
chunk is handed over to the lzma library. lzo, bzip2 or no compression
doesn't have this problem and shows progress continuously.

Q. What's this "lzo testing for incompressible data" message?
A. The lzma compression is the slowest compression technique in lrzip, and
lzo is the fastest. To help speed up the process, lzo compression is
performed on the data first to test that the data is at all compressible. If
a small block of data is not compressible, it tests progressively larger
blocks until it has tested all the data (if it fails to compress at all). If
no compressible data is found, then lzma compression is not even attempted.
This can save a lot of time during the compression phase when there is
incompressible data. It also works around a known bug where incompressible
data gets the lzma compression library stuck in an endless loop.
Theoretically it may be possible that data is compressible by lzma and not
at all by lzo, but in practice such data achieves only minuscule amounts of
compression which are not worth pursuing. Most of the time it is clear one
way or the other that data is compressible or not.

Q. I have truckloads of ram so I can compress files much better, but can my
generated file be decompressed on machines with less ram?
A. Yes. Ram requirements for decompression go up only by the -L compression
option with lzma and are never anywhere near as large as the compression
requirements.

Q. Any plans to turn this into a complete archiver?
A. Not really. The compression format relies on being fed large files, and
tar does a good job of this already. Maybe I should include a script with
lrzip that automates what tar+lrzip does.

Q. I've changed the compression level with -L in combination with -l and the
file size doesn't vary?
A. That's right, -l only has one compression level.

Q. Help? I'm a newbie and have no idea how to turn my directory into a
tarball!
A. Here is a walkthrough for a directory called myfiles.
To compress:
    tar cf myfiles.tar myfiles
    lrzip myfiles.tar
This will create a file called myfiles.tar.lrz.
To extract:
    lrzip -d myfiles.tar.lrz
    tar xf myfiles.tar
This will create and extract everything into a directory called myfiles.

Q. Why are you including bzip2 compression?
A. To maintain a similar compression format to the original rzip (although
the other modes are more useful).

Q. What about multimedia?
A. Most multimedia is already in a heavily compressed "lossy" format which
by its very nature has very little redundancy. This means that there is not
much that can actually be compressed. If your video/audio/picture is in a
high bitrate, there will be more redundancy than in a low bitrate one,
making it more suitable for compression. None of the compression techniques
in lrzip are optimised for this sort of data. However, the nature of rzip
preparation means that you'll still get better compression than most normal
compression algorithms give you if you have very large files. ISO images of
dvds for example are best compressed directly instead of individual .VOB
files.

Q. Is this multithreaded?
A. As of version 0.21, the answer is yes for lzma compression only, thanks
to a multithreaded lzma library. However I have not found the gains to scale
well with number of cpus, but there are definite performance gains with more
cpus.

Q. This uses heaps of memory, can I make it use less?
A. Well you can by setting -w to the lowest value (1) but the huge use of
memory is what makes the compression better than ordinary compression
programs so it defeats the point. You'll still derive benefit with -w 1 but
not as much.

Q. What CFLAGS should I use?
A. With a recent enough compiler (gcc>4) setting both CFLAGS and CXXFLAGS to
    -O3 -march=$archname -fomit-frame-pointer
and putting your architecture into $archname (like pentium4) causes
noticeable speed improvements with lzma without risk of breakage. Because of
the c++ code used in lzma, -O3 actually does give demonstrable advantage
over -O2 (unlike most c programs). Newest compilers take -march=native
without needing to specify the architecture.

Q. What compiler does this work with?
A. It has been tested on gcc, ekopath and the intel compiler successfully.
Whether the commercial compilers help or not, I could not tell you.

Q. What codebase are you basing this on?
A. rzip v2.1 and lzma sdk443, but it should be possible to stay in sync with
each of these in the future.

Q. Do we really need yet another compression format?
A. It's not really a new one at all; simply a reimplementation of a few very
good performing ones that will scale with memory and file size.

Q. How do you use lrzip yourself?
A. Two basic uses. I compress large files currently on my drive with the -l
option since it is so quick to get a space saving, and when archiving data
for permanent storage I compress it with the default options.

Q. I found a file that compressed better with plain lzma. How can that be?
A. When the file is more than 5 times the size of the compression window you
have available, the efficiency of rzip preparation drops off as a means of
getting better compression. Eventually when the file is large enough, plain
lzma compression will get better ratios. The lrzip compression will be a lot
faster though. Currently I have no way around this problem without throwing
more and more ram at the compression because trying to do this off disk
(whether directly on the file or from swap) will mean the file is read a
ridiculous number of times over and over again. It presents an interesting
problem for which there is no perfect solution but it certainly has us
thinking hard about how to tackle it.

Q. Can I use swapspace as ram for lrzip with a massive window?
A. No. Making lrzip work completely from disk would mean reading the data
off disk an unrealistic number of times, over and over again. For example,
if you have 1GB of ram and a 2GB file to compress, it might read the file a
billion times off disk. Most hard drives would fail in that time :) See the
previous question.
Update: I have been informed that people have successfully done this without
destroying their hard drives and they've been _very_ patient, but it didn't
take as long as I had predicted.

Q. Why do you nice it to +19 by default? Can I speed up the compression by
changing the nice value?
A. This is a common misconception about what nice values do. They only tell
the cpu process scheduler how to prioritise workloads, and if your
application is the _only_ thing running it will be no faster at nice -20 nor
will it be any slower at +19.

Q. What is the Threshold option, -T ## (1-10)?
A. It is for adjusting the sensitivity of the LZO test that is used when
LZMA compression is selected. When highly random or already-compressed data
chunks are evaluated for LZMA compression, sometimes LZO compression
actually will create a larger chunk than the original. If this data chunk is
passed to the LZMA compressor, it will take an extremely long time or hang
until the program is aborted. The Threshold is used to determine a minimum
compression amount relative to the size of the data being evaluated. A value
of 2 is the default. This means that the compression threshold amount is >5%
of the size of the original data. If the threshold is not achieved, the LZMA
compression will not be done and the chunk will not be compressed. Values
can be from 1 (little or no compression expected, up to 5%) to 10 (maximum
compression efficiency expected). The following table can be used.

For LZO compressor test
T value  Compression %  Compression Ratio
 1       0-5%           1.00-1.05   very low compression expected
 2       5-10%          1.05-1.10   default value
 3       10-20%         1.12-1.25
 4       20-30%         1.25-1.43
 5       30-40%         1.43-1.66
 6       40-50%         1.66-2.00
 7       50-60%         2.00-2.50
 8       60-70%         2.50-3.33
 9       70-80%         3.33-5.00
10       80+%           5x+

Whenever the data chunk does not compress to the Threshold value, no LZMA
compression will be attempted. For example, if you select -T 5, LZMA
compression will be performed only if the projected compression ratio is at
least 1.43.
Otherwise, data will be written in rzip format. Setting a very high T value
will result in a lot of uncompressed data in the lrzip file. However, a lot
of time will be saved. Most people should never need to touch this.

Q. Compression and decompression progress on large archives slows down and
speeds up. There's also a jump in the percentage at the end?
A. Yes, that's the nature of the compression/decompression mechanism. The
jump is because the rzip preparation makes the amount of data that the
compression backend (lzma) needs to compress much smaller.

Q. I'm terrified that my compressed data may be corrupted and there is no
test function. How can I test the integrity of the data?
A. Use md5sum. Here is a walkthrough:
    lrzip inputfile
    lrzip -d -o test_outputfile inputfile.lrz
    md5sum inputfile
    c5f74ca56f0b4ac8b61070d11d712145  inputfile
    md5sum test_outputfile
    c5f74ca56f0b4ac8b61070d11d712145  test_outputfile
The values given are examples only. If they match, then the integrity can be
guaranteed.

Q. Tell me about patented compression algorithms, GPL, lawyers and
copyright.
A. No

LIMITATIONS
There are still some serious limitations on window size and the possible
compression performance on machines with more than 4GB of ram due to many 32
bit restrictions. These exist on 64bit builds as well for the time being.

BUGS:
Probably lots.

Links:
rzip: http://rzip.samba.org/
lzo: http://www.oberhumer.com/opensource/lzo/
lzma: http://www.7-zip.org/

Thanks to Andrew Tridgell for rzip. Thanks to Markus Oberhumer for lzo.
Thanks to Igor Pavlov for lzma. Thanks to Jean-loup Gailly and Mark Adler
for the zlib compression library. Thanks to Christian Leber for lzma compat
layer, Michael J Cohen for Darwin support, Lasse Collin for fix to
LZMALib.cpp and for Makefile.in suggestions, and everyone else who coded
along the way.
Huge thanks to Peter Hyman for most of the 0.19 changes onwards, and the
update to the multithreaded lzma library and all sorts of other features.

Con Kolivas
Fri, 21 Mar 2008