April 30, 2007

GPU Hacking

My research group has recently been working on using GPUs to speed up digital forensics tools. Currently, we're using CUDA and the NVIDIA G80. You can read all about the G80 and CUDA on NVIDIA's web site, but essentially, the G80 takes a radically different approach to GPU design. Rather than having dedicated fragment and vertex shaders, the G80 has 128 "general purpose" stream processors that efficiently execute SIMD code. On the 8800GTX card that we're currently using, each of the 128 processors runs at 1.35GHz, and together they share a total of 768MB of RAM.

Programming is still tricky, but the situation is now much better than it was when trying to do general purpose computing on previous-generation GPUs. To get good speedup, it's necessary to pay close attention to the guidelines provided in the CUDA documentation. Specifically, the G80 is organized into 16 "multiprocessors" of 8 processors each, and each multiprocessor has only a single instruction unit. This means you have to "think SIMD": unless the 8 processors in a multiprocessor execute the same instruction stream to the greatest extent possible, their threads are interleaved rather than run concurrently. Another gotcha is that "device memory" (the largest pool of memory available, readable and writeable by both the host and the GPU) is uncached. To get good performance, data must be staged in shared memory, of which there is only 16KB per multiprocessor on the 8800GTX.
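To make the cost of not "thinking SIMD" concrete, here is a toy Python model (not CUDA code) of a group of processors that share one instruction unit. The function name `group_cost`, the branch labels, and the cycle counts are all made up for illustration. The only idea taken from the text above is that a shared instruction unit issues one instruction stream at a time, so each distinct path taken within the group is executed in turn while the lanes on other paths sit idle.

```python
def group_cost(branch_per_lane, cost_per_branch):
    """Toy cost, in cycles, for one group of lanes sharing an instruction unit.

    branch_per_lane: the branch label each lane wants to execute.
    cost_per_branch: hypothetical cycle count for each branch label.

    Every distinct branch taken by at least one lane is executed once,
    one after another, so the group pays the sum of those costs.
    """
    return sum(cost_per_branch[b] for b in set(branch_per_lane))


# Hypothetical cycle counts for two code paths.
costs = {"a": 10, "b": 10}

uniform = ["a"] * 8        # all 8 lanes follow the same path
divergent = ["a", "b"] * 4  # lanes split evenly across two paths

print(group_cost(uniform, costs))    # 10
print(group_cost(divergent, costs))  # 20
```

In this model, splitting the 8 lanes across two equally expensive paths doubles the time for the group, which is the sense in which execution becomes interleaved rather than concurrent. Real hardware scheduling is more involved than this, and the CUDA documentation is the authority on the details.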

Our recent work has been modifying Scalpel to support threading, to take advantage of both multicore CPUs and GPUs.

The punchline is that there is HUGE potential for speeding up forensics applications using GPUs (and multicore CPUs, of course), if a few things are taken into consideration. First, creative threading models are needed. Read the paper for full details, but essentially, you have to give the GPU a lot of work to do on a block of data in order to offset the cost of transferring the data to the GPU and then collecting the results. The pipe between the host and the GPU is currently about 2GB/second over PCIe x16. Second, the programming model for threading on the GPU isn't intuitive if you're a seasoned threads programmer. Our prototype uses 10 million threads to search for headers and footers in a block of 10MB. No thread pools, no thread reuse: the threads are created and destroyed by the GPU hardware for each 10MB block. Essentially, GPUs require massive threading models. Getting good speedup on multicore CPUs, in contrast, does require traditional approaches such as reusable thread pools.
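Ten million threads for a 10MB block works out to roughly one thread per byte. As a rough illustration of what such a massive threading model looks like, here is a CPU-side Python sketch in which each "thread" owns exactly one byte offset and tests every pattern at that position. This is an assumption about the general shape of the approach, not the prototype's actual code (the paper has the real design). The names `thread_body` and `search_block` are hypothetical, and the JPEG-style byte patterns are used purely as sample data.

```python
def thread_body(block, offset, patterns):
    """Work for the hypothetical thread assigned to one byte offset.

    Checks whether any pattern starts exactly at `offset` and returns
    a list of (offset, pattern_index) hits.
    """
    hits = []
    for pattern_index, pattern in enumerate(patterns):
        if block[offset:offset + len(pattern)] == pattern:
            hits.append((offset, pattern_index))
    return hits


def search_block(block, patterns):
    """Emulate launching one thread per byte of the block.

    On a GPU these bodies would run in parallel; here they are a plain
    loop, which gives the same hits but none of the speedup.
    """
    hits = []
    for offset in range(len(block)):
        hits.extend(thread_body(block, offset, patterns))
    return hits


# Sample patterns: index 0 is a header, index 1 is a footer.
patterns = [b"\xff\xd8\xff", b"\xff\xd9"]
block = b"junk" + b"\xff\xd8\xff\xe0" + b"data" + b"\xff\xd9" + b"tail"

print(search_block(block, patterns))  # [(4, 0), (12, 1)]
```

Note that every "thread" does the same small, fixed amount of work regardless of the data it sees. There is no pool and no work queue; the mapping from thread to data is just the offset.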

Another issue is that multicore processors are good at executing arbitrary code, while the G80 GPU really needs to execute SIMD code for good speedup. On the GPU, it's generally better to use a simple algorithm that doesn't cause divergent control flow than something more "efficient" whose control flow diverges on different data streams. And mapping work onto the multiple cores of the host CPU(s), the GPUs, or both (as appropriate) will require creativity. Combine this with mechanisms for overlapping computation, disk I/O, and data transfers between the host and the GPU, and there's fun programming ahead.
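To see why overlapping disk I/O, transfers, and computation is worth the trouble, here is a back-of-the-envelope Python model. Only the 10MB block size and the roughly 2GB/second host-to-GPU rate come from this post. The disk read time and GPU compute time per block are invented placeholders, and the pipeline formula is the idealized one (uniform blocks, perfect overlap, enough buffers), so treat the numbers as illustrative rather than as measurements.

```python
def transfer_seconds(n_bytes, bytes_per_second):
    """Time to move n_bytes over a link with the given throughput."""
    return n_bytes / bytes_per_second


def serial_time(n_blocks, read_s, transfer_s, compute_s):
    """Each block is read, transferred, and processed before the next starts."""
    return n_blocks * (read_s + transfer_s + compute_s)


def pipelined_time(n_blocks, read_s, transfer_s, compute_s):
    """Idealized three-stage pipeline.

    The first block pays for every stage; after that, one block
    completes per interval of the slowest stage.
    """
    stages = (read_s, transfer_s, compute_s)
    return sum(stages) + (n_blocks - 1) * max(stages)


# From the post: 10MB blocks over a link of about 2GB/second.
transfer_s = transfer_seconds(10e6, 2e9)

# Placeholders, NOT measurements: disk read and GPU time per block.
read_s = 0.1
compute_s = 0.02

print(transfer_s)
print(serial_time(100, read_s, transfer_s, compute_s))
print(pipelined_time(100, read_s, transfer_s, compute_s))
```

Under these made-up numbers, the pipelined total is dominated by whichever stage is slowest, so the faster stages are effectively hidden behind it. The same model also shows the point made earlier about transfer cost: the GPU only pays off if the work done per block is large relative to the time spent moving that block.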

The details are provided in a paper that will be published at the Digital Forensics Research Workshop in August 2007 in Pittsburgh.

The next version of Scalpel, 1.70MT, which embodies the work described above, will support G80 GPUs and multicore processors. It's tremendously faster than v1.60. It's in alpha testing now and will be released soon.

Posted by Golden at 6:43 PM