
How do I compress my unused data?

Most data can be safely compressed, which allows it to take up significantly less space on the filesystem. Some types of data are more compressible than others. This page details the most common ways to compress and uncompress data on Unix/Linux systems, like the Supercomputer, and which types of data compress most easily.

Note that compression is not an excuse to avoid cleaning up your data. If you truly don't need data, it's better to delete it completely than to compress it. Compressed data still takes up space, just less of it.

Note that if you're compressing as part of a multi-processor job, it might make sense to use parallel compression tools (described below). It really depends on the situation, but feel free to contact us, and we can discuss the implications with you.

How to compress data

Compressing lots of files together

In the Unix/Linux world, the most common compression file format is a compressed TAR file, otherwise known as a tarball. The TAR format itself only concatenates files together, without any compression; the resulting file is then frequently compressed using either the gzip or bzip2 compression scheme.

The easiest way to create a gzip-compressed tarball is to use syntax like that shown below. Note that it is usually considered good form to include files inside directories, and to name the file so it ends in either .tar.gz or .tgz.

tar zcvf mycompressedfile.tar.gz list_of_files_and_directories_to_include

Similarly, a bzip2-compressed tarball is most easily built using syntax like the following, and is usually named so it ends in .tar.bz2:

tar jcvf mycompressedfile.tar.bz2 list_of_files_and_directories_to_include

Note that these commands create a compressed file that contains the files specified, but do not delete the original files from the filesystem. If you are doing this to conserve space, you will have to delete the original files yourself.
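
If you do want to remove the originals once the tarball is built, a cautious pattern is to verify that the archive is readable first. Here's a minimal sketch, assuming the files you archived are in a directory named mydirectory (a hypothetical name):

#create the tarball, then delete the originals only if the archive lists back cleanly:
tar zcvf mycompressedfile.tar.gz mydirectory
tar ztvf mycompressedfile.tar.gz > /dev/null && rm -r mydirectory

The && ensures the originals are deleted only if the entire archive could be read back successfully.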

Compressing single files by themselves

If you want to compress a single file by itself, the recommended commands are as follows:

#To compress a file using gzip:
gzip originalfile
#To compress a file using bzip2:
bzip2 originalfile
#To decompress a file using gzip:
gunzip originalfile.gz
#To decompress a file using bzip2:
bunzip2 originalfile.bz2

Note that these compression commands replace originalfile with the compressed file, named originalfile.gz for gzip and originalfile.bz2 for bzip2. Similarly, the decompression commands replace the compressed version with the uncompressed file, named, in this case, originalfile.
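
If you want to keep the original file alongside the compressed copy, both tools offer a --keep option, though gzip only gained it in version 1.6; check gzip --help on your system to confirm. For example:

#compress while keeping the original (gzip 1.6 or newer):
gzip --keep originalfile
#bzip2 has supported --keep for much longer:
bzip2 --keep originalfile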

Which compression type to use (gzip vs. bzip2)

Which of the two compression types, gzip or bzip2, you choose to use is up to you. In general:

  • gzip is significantly faster to compress and decompress than bzip2
  • bzip2 frequently compresses data more than gzip, creating smaller files, but takes significantly longer (a quick way to compare the two is shown below)
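
If you're unsure which to use for a given dataset, a quick experiment settles it. The sketch below times both tools on the same file (a hypothetical bigfile.dat), keeping the original so both runs see identical input, then compares the resulting sizes:

#time gzip, keeping the original (requires gzip 1.6 or newer for --keep):
time gzip --keep bigfile.dat
#time bzip2, also keeping the original:
time bzip2 --keep bigfile.dat
#compare the compressed sizes:
ls -l bigfile.dat.gz bigfile.dat.bz2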

How to decompress data

In general, it's a good idea to get the listing of the contents of the tarball before you decompress it. This way, if the tarball contains a large number of files that aren't in a directory (considered bad form, but it does happen), you can create a subdirectory to extract into, instead of filling up your current directory with files.

To get the list of files/directories contained in a gzip-compressed tarball, use syntax like this:

tar ztvf mycompressedfile.tar.gz

Similarly, to get the list of files and directories in a bzip2-compressed tarball, use this syntax:

tar jtvf mycompressedfile.tar.bz2

When you're certain that you're ready to extract the files, use syntax like the following to do so:

#for gzip-tarballs:
tar zxvf mycompressedfile.tar.gz
#for bzip2-tarballs:
tar jxvf mycompressedfile.tar.bz2
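
If the listing showed loose files at the top level, you can extract into a fresh subdirectory instead, using tar's -C option (the directory name extracted is just an example):

#create a subdirectory and extract the tarball into it:
mkdir extracted
tar zxvf mycompressedfile.tar.gz -C extracted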

Best types of data to compress

In general, this type of compression works best on data that contains repeating patterns, especially ASCII text, which is common for output logs and many other files. You can use the file command to determine what type of data a file contains, as shown here:

> file *
anaconda-ks.cfg:        ASCII English text
install.log:            ASCII text
install.log.syslog:     ASCII text
postinstall.log:        ASCII text, with very long lines
postinstall.log.errors: empty
rocks-post.log:         empty
rocks-post.sh:          empty
rocks-pre.log:          ASCII text
rocks-pre.sh:           ASCII text
scripts:                directory

While large files that show up as ASCII text or shell scripts are almost certainly very compressible, other data types often compress well too. The best thing to do is to try it and see what happens.
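
One easy way to try it is to compress a copy and check the ratio. For example, with a hypothetical file named output.log (the --keep option requires gzip 1.6 or newer):

#compress a copy of the file, keeping the original:
gzip --keep output.log
#show the compressed size, uncompressed size, and compression ratio:
gzip --list output.log.gz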

Other file size considerations

In general, if you have large numbers of very small files, you will probably gain a space advantage by encapsulating them in a tar file, even if you don't compress it. This is because of wasted space overhead on the filesystem.
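
A quick way to check whether this applies to your data is to count the files first; here mydirectory is a hypothetical name:

#count the files in a directory tree:
find mydirectory -type f | wc -l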

Every filesystem has a minimum allocation block size: the smallest amount of space that can be allocated to a file at any given time. Most filesystems, including our user home directory filesystem, use a 4 kB allocation block. However, our scratch filesystems (both user and group) and our group home directory filesystems currently use a 64 kB allocation block, and some of the block and metadata replication characteristics will actually increase the space used as well.

What this means is relatively simple: for each file, the filesystem allocates the smallest number of blocks that will still contain the data. This often creates some wasted space, essentially rounding the file's size up to the next block boundary. This wasted space can be up to one block per file.

Here's an example: let's say you have about 30,000 files, each of them storing 5 kB of data. On a 4 kB block filesystem, each of these 30,000 files will take up 2 blocks, or 8 kB, meaning that your original 150,000 kB will actually take up 240,000 kB.

If you had those same 30,000 5 kB files on a 64 kB block filesystem, each file would fit within 1 block, but would occupy the whole block, meaning that instead of 150,000 kB, you'd be using 1,920,000 kB of space.

In either case, by just putting these into a single tar file, even without compression, you'd have a single file of approximately 150,000 kB. This file would fit in exactly 37,500 of the 4 kB allocation blocks, with no wasted space. On a 64 kB system, this file would need 2343.75 blocks, meaning we'd have to round up to 2344, wasting about a quarter of a block, or 16 kB. In either case, the most space we'd use would be 150,016 kB. Compare that to the original 240,000 or 1,920,000 kB! This represents a ratio of more than 12:1 in the worst case, and that doesn't even involve compression yet.
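
You can see this effect on your own data by comparing a directory's apparent size (the bytes the files actually contain) with the space allocated for it. With GNU du, a sketch like this works (mydirectory is again a hypothetical name):

#apparent size: the sum of the file contents, in bytes:
du -sh --apparent-size mydirectory
#allocated size: the space actually consumed on the filesystem:
du -sh mydirectory

If the allocated size is much larger than the apparent size, tarring the directory will likely reclaim a lot of space.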

If you want to concatenate files into tar files, but not compress, then use commands like these:

#to create the tar file:
tar cvf myuncompressedtarfile.tar list_of_files_and_directories_to_include
#to get the list of files in the tar file:
tar tvf myuncompressedtarfile.tar
#to extract the files from the tar file:
tar xvf myuncompressedtarfile.tar

Parallel compression/decompression tools

We do have a few parallel compression/decompression tools available, including the following:

  • pigz - A multithreaded (single-node) implementation of the gzip algorithm
  • pbzip2 - A multithreaded (single-node) implementation of the bzip2 algorithm

These parallel tools have a few implications that users should be aware of:

  • Just like the corresponding serial tools, pbzip2 generally generates smaller files than pigz, but takes significantly longer.
  • Neither pigz nor pbzip2 can utilize the resources of more than one node.
  • The resulting compressed file may not be quite as small as if you'd used the corresponding serial (single-processor) compression tool (e.g. gzip or bzip2). This is a relatively small effect, though, so it may be well worth it.
  • The compressed file format should be compatible with the serial compression/decompression tools, meaning, for example, that a file compressed with pigz can be decompressed with gzip, and vice versa.
  • If the file was compressed using the bzip2 serial (single-processor) compression tool, there will probably be no performance advantage to using the parallel decompression tool pbzip2. It will work, but it won't perform any better than a simple serial decompression like bunzip2 or bzip2 -d.
    • For reasons we don't currently (July 2015) understand, this does not seem to be the case for gzip and pigz. pigz seems capable of speeding up the decompression of any .gz file, even if it was compressed using gzip.
  • Just like we describe here for the main processing tasks, the parallel compression tools may not make a lot of sense for your situation, depending on several factors. Parallel compression speedup is definitely sublinear, so if you have a lot of compression to do, it may be more efficient to use the serial tools. But if, for example, your job stores data on the local hard drive of a node, and you want to compress it before copying it to your compute or home directory, the parallel tools would probably be good enough (see the sketch after this list).
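
As a concrete example of the last point, inside a batch job you'll usually want the thread count to match your allocation rather than a hard-coded number. Assuming a Slurm job, where the scheduler sets SLURM_CPUS_ON_NODE, a sketch might look like this (the fallback of 4 and the file path are arbitrary):

#compress node-local output using as many threads as the job was allocated:
pigz --processes ${SLURM_CPUS_ON_NODE:-4} /tmp/myjob_output.dat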

How to use pigz

To use the pigz tool, first load the corresponding environment module using syntax like this:

module load pigz

Once the module is loaded, you can run the following command to see the usage syntax:

pigz -h

While there are several options available, the following example shows the most common ones users are likely to use. This example shows the compression of a file using 4 processors:

$ pigz --verbose --keep --processes 4 FILENAME
FILENAME to FILENAME.gz

Similarly, to decompress in parallel, you can use syntax like this (again for 4 processors):

$ unpigz --verbose --processes 4 FILENAME.gz 
FILENAME.gz to FILENAME

How to use pbzip2

To use the pbzip2 tool, first load the corresponding environment module using syntax like this:

module load pbzip2

Once the module is loaded, you can run the following command to see the usage syntax:

pbzip2 -h

While there are several options available, the following example shows the most common ones users are likely to use. This example shows the compression of a file using 4 processors:

$ pbzip2 --verbose -p4 FILENAME
Parallel BZIP2 v1.1.12 [Dec 21, 2014]
By: Jeff Gilchrist [http://compression.ca]
Major contributions: Yavor Nikolov [http://javornikolov.wordpress.com]
Uses libbzip2 by Julian Seward

         # CPUs: 4
 BWT Block Size: 900 KB
File Block Size: 900 KB
 Maximum Memory: 100 MB
-------------------------------------------
         File #: 1 of 1
     Input Name: FILENAME
    Output Name: FILENAME.bz2

     Input Size: 5554996523 bytes
Compressing data...
    Output Size: 215668077 bytes
-------------------------------------------

     Wall Clock: 209.161108 seconds

Similarly, to decompress in parallel, you can use syntax like this (again for 4 processors):

$ pbzip2 -d --verbose -p4 FILENAME.bz2
Parallel BZIP2 v1.1.12 [Dec 21, 2014]
By: Jeff Gilchrist [http://compression.ca]
Major contributions: Yavor Nikolov [http://javornikolov.wordpress.com]
Uses libbzip2 by Julian Seward

         # CPUs: 4
 Maximum Memory: 100 MB
 Ignore Trailing Garbage: off
-------------------------------------------
         File #: 1 of 1
     Input Name: FILENAME.bz2
    Output Name: FILENAME

 BWT Block Size: 900k
     Input Size: 215668077 bytes
Decompressing data...
    Output Size: 5554996523 bytes
-------------------------------------------

     Wall Clock: 45.799170 seconds

Using tar with parallel compression tools

It is possible to use both tar and one of the threaded parallel compression tools (pigz or pbzip2) to compress multiple files or a directory tree in one step. Whether or not this makes sense for you, depends on the situation. In particular, it depends on the number of files vs the total size you're trying to compress. If you have a large number of small files, the tar process (which isn't going to run in parallel) will be doing most of the work, and this may not make sense to do. If you have a small number of large files, though, most of the time will be spent doing the compression, and therefore the parallel compression may make sense here.

Here's an example of how to compress a directory using tar with pigz; a similar syntax should work with pbzip2:

$ tar cvf MYDIRECTORY.tar.gz --use-compress-program=pigz MYDIRECTORY

In a similar way, you can list the contents of the tarball like this:

$ tar tvf MYDIRECTORY.tar.gz --use-compress-program=pigz

Or you can extract the tar like this:

$ tar xvf MYDIRECTORY.tar.gz --use-compress-program=pigz

These should work for both pigz and pbzip2. Due to some modifications we made, they should also respect the CPU allocation you were given inside a job. However, just like compressing or decompressing individual files, these tools can only utilize the processors within a single node, not processors across multiple nodes.
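
If you need to control the thread count explicitly (for example, to match a job allocation), recent GNU tar versions accept a compression command with arguments; older versions may require a single program name. A hypothetical sketch:

#compress with tar, telling pigz to use 4 threads:
tar cvf MYDIRECTORY.tar.gz --use-compress-program="pigz --processes 4" MYDIRECTORY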