Friday, April 22, 2011

Breakthrough in CUDA data compression

The goal of our adventurous endeavour run against conventional and approved conduct that compression algorithms can not make any substantial use of powerful vector processing capacity of CUDA-enabled GPUs was to explore the performance speedup that can be achieved by utilizing the multi-core Compute Unified Device Architecture.

Among the available data compressors we chose bzip2, which in its turn uses the Burrows–Wheeler transform (which again has been used for memory reduction in human genome indexing), move-to-front transform and Huffman coding, and adapted BWT to CUDA to exploit the advantages of GPGPU computing.

For comparison testing we have only performed Burrows–Wheeler transform over 2 pixel-based images and 2 PDF files of different size on one CPU core and on GPU and so far have not approached the Huffman encoding which in fact doesn't offer any chances of parallelization for GPU-computations due to sequential nature and requirement for synchronization between blocks since they are dependent on each previous block's output. Therefore, theoretically, a significant increase in performance is subject to making full use of both GPU and CPU main assets in terms of their computational faculty.
As is evident from the execution time figures below, there is a noticeable 2x to 4x speedup.

The results:

BMP, 540 Kb
Full bzip2 compression on CPU - 218 ms
BW Transform on CPU - 171 ms
Full bzip2 compression on GPU - 93 ms [ minus 53% ]
BW Transform on GPU - 46 ms [ minus 73% ]

BMP, 1112 Kb
Full bzip2 compression on CPU - 467 ms
BW Transform on CPU - 343 ms
Full bzip2 compression on GPU - 249 ms [ minus 46% ]
BW Transform on GPU - 140 ms [ minus 59% ]

PDF, 1919 Kb
Full bzip2 compression on CPU - 1513 ms
BW Transform on CPU - 731 ms
Full bzip2 compression on GPU - 1107 ms [ minus 26% ]
BW Transform on GPU - 311 ms [ minus 57% ]

PDF, 3425 Kb
Full bzip2 compression on CPU - 2168 ms
BW Transform on CPU - 793 ms
Full bzip2 compression on GPU - 1856 ms [ minus 14% ]
BW Transform on GPU - 481 ms [ minus 39% ]



Hardware:

Intel i7 860 2,8 GHz, 4 Gb RAM with nVidia GeForce GTX 260.

Current bzip2 application processes one bzip2-block per time. We use one CPU core versus GPU.

At the present time we are developing parallel bzip2 for complete comparison of full-power CPU and GPU.

6 comments:

  1. Impressive results! Could you describe in more detail which parts of the BWT you managed to parallelize on the GPU. And d you also have an inverse transform of the BWT running on GPU?

    You say you use an "adapted BWT" on the CUDA architecture, does this adapted version retain the same compression ratio and compability with other bzip2 implementations?

    Is this somehow related to the project at http://bzip2-cuda.github.com/ ?

    Also, on the topic of variable length encodings (like Huffman) you might want to check the PAVLE algorithm by Ana Balvic at http://tesla.rcub.bg.ac.rs/~taucet/coding.html

    Axel Eirola

    ReplyDelete
  2. Very good results. Can you share the code with the public so we can try to continue the general effort of GPU Compression and Decompression using CUDA and OpenCL?

    Raveh Gonen
    GPU Specialist

    ReplyDelete
  3. Hi Axel,
    We have only BW forward transformation. Our bzip2 implementation doesn't impact on archive format it just increases compression speed.
    Our project is not releated to other ones.

    ReplyDelete
  4. Hi, I'm half of the bzip2-cuda.github.com team. We got an average of 30% increase in speed on the BWT. The next important step of the bzip2 algorithm, MTF, is inherently serial. And for Huffman encoding, we tried to implement a parallel Huffman encoding algorithm by S. Arash Ostadzadeh, but that algorithm has architecture requirements that do not fit with CUDA as it is now.
    Our project stands shelved as it had a scope of only one year and now we have other commitments. But our code is GPLv3, and we're available by email. We want to see GPU-based data compression progress.
    Just thought I'd chip in with my experiences. :) I'd love to keep in touch with your team and see how you guys are going.

    ReplyDelete
  5. On behalf of all the users (decompression of compressed data) I'd like to say that I really wish there were more focus on benchmarking and development of lossless data compression algorithms based on the decompression speed.

    eg. There's this "great" software called 7zip that everyone loves to hate because it's really worthless, not because it's free, but because it's so terribly slow in decompression.

    Archives generally need to be compressed once and then they are decompressed possibly millions of times depending on the popularity of the archive. It's only logical that all the focus should be on the decompression speed.

    Increasing the slow decompression speeds from current CPU limited 40-80 MB/s of WinRAR to scale linearly with number of cores would be easily possible with minor changes in the compression logic. But authors of compression software do not see this as a selling point. I think they bit thick the head for that.

    ReplyDelete
  6. Clarification: I meant that decompression tends to be often limited to single thread speeds where it would be so trivial to make it limited by number of cores that it just boggles the mind why it's not done.

    If you can decompress multiple RAR's or ZIP's in parallel, logic says, you can also decompress single file inside a zip or rar in parallel by simply splitting the file and the compressing and decompressing those blocks in parallel.

    A child can develop such algorithms, but makers of most popular compression softwares can't. Think in the head as I said.

    ReplyDelete