I was working on a project today where I had a directory full of bz2 files that I needed to search. Extracting all the files seemed like such a waste of time and space. I knew there had to be a tool to grep bz2 files, like there is for some other compression formats. Of course there was, but it took a little bit of searching to find it, so I decided to research this a little further and write up how to work with these files. First, the answer to my original query.
How to grep bz2 files: The bzip2 package has a binary called bzgrep which does exactly that. To grep a file just do “bzgrep pattern filename.bz2”.
It’s just that simple. Bzgrep decompresses the file and feeds the result directly to grep, so nothing is extracted to disk. There is more you can do to work with bzip2 files. Read on for more information.
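For the directory-full-of-files case that started all this, a minimal sketch could look like the one below. The file names and log contents are made up for illustration, and it assumes bzip2 and bzgrep are on your PATH.

```shell
# Throwaway directory with two small compressed logs (made-up sample data).
workdir=$(mktemp -d)
printf 'alpha\nerror: disk full\nbeta\n' > "$workdir/app1.log"
printf 'gamma\ndelta\n' > "$workdir/app2.log"
bzip2 "$workdir/app1.log" "$workdir/app2.log"   # replaces each log with a .bz2 file

# Search every .bz2 file in the directory; nothing is extracted to disk.
# "|| true" keeps the script going if bzgrep reports a non-zero status.
matches=$(bzgrep 'error' "$workdir"/*.bz2 || true)
echo "$matches"
```

When more than one file is searched, each match is prefixed with the name of the file it came from, just like with plain grep.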
What are bz2 files?
Bzip2 is a compression format that generally produces smaller files than gzip or zip, usually at the cost of slower compression. It was originally released in 1996 by Julian Seward.
Bzip2 is built around an algorithm called the Burrows-Wheeler transform. I won’t go into much detail on the algorithm, but there is some information on Wikipedia if you really want to know more. I believe it is most efficient at compressing text, since it works by grouping together characters that appear frequently. But don’t quote me on that, since I am not an expert on compression algorithms.
Just to test the format out, I did a little unscientific comparison between bzip2 and a couple of other popular compression formats. I took a random text file (my bash history file) and compressed it using bzip2, tar with gzip compression, and zip. The results weren’t great for bzip2, but zip and gzip are both very efficient.
Comparison between bz2, tar.gz and zip, using a 26 KB text file.
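If you want to run a similar unscientific comparison yourself, something like this sketch would do it. It only compares bzip2 and gzip, the sample text is generated rather than taken from a real file, and the results will differ from one input to another.

```shell
# Generate a repetitive text file to compress (made-up sample data).
workdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 5000; i++) print "line " i ": the quick brown fox" }' > "$workdir/sample.txt"

# bzip2 -k keeps the original; gzip -c writes to stdout so the original is untouched.
bzip2 -k "$workdir/sample.txt"
gzip -c "$workdir/sample.txt" > "$workdir/sample.txt.gz"

# Byte counts for each version.
orig_size=$(wc -c < "$workdir/sample.txt" | tr -d ' ')
bz2_size=$(wc -c < "$workdir/sample.txt.bz2" | tr -d ' ')
gz_size=$(wc -c < "$workdir/sample.txt.gz" | tr -d ' ')
echo "original: $orig_size  bzip2: $bz2_size  gzip: $gz_size"
```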
Extracting bz2 files
If the file is a plain bz2 file with an extension like filename.bz2, the simplest way to decompress it is with bunzip2. Bunzip2 is equivalent to bzip2 -d, where the “d” option means decompress.
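Here is a small sketch of that round trip. The file name note.txt is just a placeholder, and the -k option (keep the input file) is added so the compressed copy survives.

```shell
workdir=$(mktemp -d)
printf 'hello bzip2\n' > "$workdir/note.txt"
bzip2 "$workdir/note.txt"              # leaves only note.txt.bz2

# bunzip2 strips the .bz2 suffix; -k keeps the compressed copy as well.
bunzip2 -k "$workdir/note.txt.bz2"
restored=$(cat "$workdir/note.txt")
echo "$restored"
```

Without -k, bunzip2 removes note.txt.bz2 once note.txt has been written.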
The tar utility has bzip2 support built in. When tar determines that an archive is bzip2-compressed, it calls bzip2 to decompress it before untarring. In many cases the file will have been created with tar, and the extension is then most likely tar.bz2. In that case it is much more efficient to use the tar command directly. Otherwise you would only get a .tar file that you would still need to run through tar to get the original files. The command for that would be:
tar -xf compressed_file.tar.bz2
Here the “x” option tells tar to extract the file and the “f” option notes that the name of the archive is to follow.
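To make the one-step versus two-step difference concrete, here is a sketch that extracts the same archive both ways. The directory and file names are invented for the example, and I pass -j explicitly in case your tar does not detect the compression on its own.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/one_step" "$workdir/two_step"
printf 'first\n' > "$workdir/src/a.txt"
printf 'second\n' > "$workdir/src/b.txt"
tar -cjf "$workdir/archive.tar.bz2" -C "$workdir/src" a.txt b.txt

# One step: tar handles the bzip2 decompression itself.
tar -xjf "$workdir/archive.tar.bz2" -C "$workdir/one_step"

# Two steps: decompress to stdout and hand the plain tar stream to tar.
bzip2 -dc "$workdir/archive.tar.bz2" | tar -xf - -C "$workdir/two_step"

one=$(cat "$workdir/one_step/a.txt" "$workdir/one_step/b.txt")
two=$(cat "$workdir/two_step/a.txt" "$workdir/two_step/b.txt")
echo "$one"
```

Both directories end up with the same files; the -C option only tells tar which directory to extract into.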
Creating bz2 files
If you need to create a bz2 file, the quick way to do it is to just run bzip2 with the -z option:
bzip2 -z filename.txt
However, this will delete the original file (filename.txt in this example) and replace it with a bz2 file named filename.txt.bz2. This might not be what you want. To keep the original file you should use the -k option as well.
bzip2 -zk filename.txt
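As a quick sketch of the difference, the example below compresses a throwaway file with -k and also shows the -c option, which writes the compressed data to stdout so you can choose the output name yourself. The name copy.bz2 is just something I picked for the example.

```shell
workdir=$(mktemp -d)
printf 'keep me\n' > "$workdir/filename.txt"

# -k: compress but keep filename.txt next to filename.txt.bz2.
bzip2 -zk "$workdir/filename.txt"

# -c: write the compressed data to stdout, leaving the input untouched.
bzip2 -zc "$workdir/filename.txt" > "$workdir/copy.bz2"

ls "$workdir"
```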
As I mentioned before, the tar command has bzip2 support. You may want to use that instead. Files with a tar.bz2 extension seem more common to me, because it is easier to compress multiple files and preserve the directory structure by wrapping the files with tar first.
To create a tar archive compressed with bzip2, you only have to add the -j option. The -j option tells tar to call bzip2 for the compression; “c” creates a new archive and “v” prints each file name as it is added.
tar -jcvf compressed_file.tar.bz2 filename.txt
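The same command works on a whole directory, which is where tar really earns its keep. This sketch uses a made-up project directory and then lists the archive to show that the directory structure was preserved.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/project/docs"
printf 'readme\n' > "$workdir/project/README"
printf 'notes\n' > "$workdir/project/docs/notes.txt"

# -C switches to $workdir first, so the archive stores relative paths.
tar -cjf "$workdir/project.tar.bz2" -C "$workdir" project

# -t lists the archive contents without extracting anything.
listing=$(tar -tjf "$workdir/project.tar.bz2")
echo "$listing"
```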
How to cat bz2 files
The bzip2 package also contains a handy tool called bzcat. Bzcat decompresses the file and sends it to stdout. This is useful if you would like to pipe the contents directly to another program rather than saving them to disk.
There are numerous ways you could use this, the most obvious being to just print the file to the terminal to see what it contains.
But another use case is a bz2 file that contains a bash script you would like to run once without decompressing it to disk. For this purpose, you could send the output to bash to run it:
bzcat runscript.sh.bz2 | bash
Also, if you just want to peek into the file and see the top of it you could pipe it to the head command. This example will show you only the first 10 lines of the file.
bzcat runscript.sh.bz2 | head -n 10
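Here is a self-contained sketch of the same idea with a tiny made-up data file, piping bzcat into wc and head.

```shell
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\nfour\n' > "$workdir/data.txt"
bzip2 "$workdir/data.txt"

# Count lines and peek at the top without creating a decompressed file.
line_count=$(bzcat "$workdir/data.txt.bz2" | wc -l | tr -d ' ')
first_two=$(bzcat "$workdir/data.txt.bz2" | head -n 2)
echo "$line_count lines, starting with:"
echo "$first_two"
```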
More ways to grep bz2 files
In addition to bzgrep, egrep and fgrep are also supported: bzegrep runs egrep on the decompressed data, and bzfgrep runs fgrep on it.
To run bzegrep with a regular expression you could try something like this:
bzegrep "Some\s.*\sExpress.on" filename.txt.bz2
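To show how the two differ, this sketch searches the same made-up file with both commands. On systems with a recent grep you might see a warning on stderr about egrep and fgrep being obsolescent, but the matches still come out on stdout.

```shell
workdir=$(mktemp -d)
printf 'Some Regular Expression\nplain text\na.b literal\naxb\n' > "$workdir/sample.txt"
bzip2 "$workdir/sample.txt"

# bzegrep: extended regex, so | means "or" without any backslashes.
ere_hits=$(bzegrep 'Regular|plain' "$workdir/sample.txt.bz2")

# bzfgrep: fixed string, so the dot only matches a literal dot (not "axb").
fixed_hits=$(bzfgrep 'a.b' "$workdir/sample.txt.bz2")

echo "$ere_hits"
echo "$fixed_hits"
```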
More bzip2 commands
While writing this I found a couple more commands on my system that are wrappers around bzip2 and some other command. Here they are:
Run diff on two bz2 files to compare them
bzdiff filename.txt.bz2 filename2.txt.bz2
Extract a file and send it to the less command
bzless filename.txt.bz2
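As a sketch of bzdiff from the list above, this compares two small made-up files that differ in one line. The pager command is interactive, so it is left out here.

```shell
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$workdir/old.txt"
printf 'one\nTWO\nthree\n' > "$workdir/new.txt"
bzip2 "$workdir/old.txt" "$workdir/new.txt"

# bzdiff decompresses both files and runs diff on the result.
# diff exits non-zero when the files differ, hence the "|| true".
changes=$(bzdiff "$workdir/old.txt.bz2" "$workdir/new.txt.bz2" || true)
echo "$changes"
```

The output is ordinary diff output, with lines from the first file marked "<" and lines from the second marked ">".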
Recover a corrupt bz2 file
One more bzip2 command that deserves a mention in a tutorial on how to use bzip2 is bzip2recover.
If you find that a file you are trying to extract is damaged, and bzip2 complains that it is corrupted with an error message such as the one below, you can run bzip2recover to try to get the data that it can extract.
bunzip2: Data integrity error when decompressing.
Bzip2recover will try to split the file into blocks and save each block as its own file. This way you can at least get the parts of the file that are not corrupted. As always, it is best to back up your original bz2 file before you proceed.
cp file.txt.bz2 file.txt.bz2.backup
bzip2recover file.txt.bz2
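Here is a sketch of the whole workflow in a scratch directory. The sample file is generated and is not actually corrupt, so this only shows the mechanics: bzip2 -t checks the integrity, and bzip2recover splits the file into per-block files named like rec00001file.txt.bz2 (the naming is what I see with bzip2 1.0.x, so treat it as an assumption on other versions).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
awk 'BEGIN { for (i = 1; i <= 200; i++) print "important data, line " i }' > file.txt
bzip2 file.txt

# Always work from a backup of the original archive.
cp file.txt.bz2 file.txt.bz2.backup

# -t tests integrity without writing any output; an intact file passes.
if bzip2 -t file.txt.bz2; then status=intact; else status=damaged; fi

# bzip2recover writes one rec*file.txt.bz2 per block it finds
# (just one here, since the sample is small).
bzip2recover file.txt.bz2
recovered=$(bzcat rec*file.txt.bz2)
echo "$status: $(echo "$recovered" | wc -l | tr -d ' ') lines recovered"
```

With a really damaged file you would decompress each recovered block on its own and keep whichever ones bzip2 accepts.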
That’s it for now
What was originally going to be a short article on how to grep a bz2 file has turned into a complete tutorial on working with bzip2 files. I just hope you will find this useful the next time you need to work with bz2 files.