Posts from the ‘CUDA’ Category


Hiding latency in GPGPU programming (CUDA here)

A very good article by Vasily Volkov on how you can get better performance on CUDA with fewer threads. (PDF)

It could also apply to OpenCL. In my own experience, when I tried to hide latency by launching more threads, I found that launching too many threads (or warps) just results in too much memory I/O pressure and hurts performance, contrary to the idea of launching as many threads as possible!
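
One common way to reason about latency hiding is Little's law: the number of independent operations that must be in flight equals latency times throughput. The sketch below is only a back-of-the-envelope illustration of why fewer threads can be enough when each thread carries several independent operations. The function names and the figures (24 cycles of latency, 32 operations per cycle) are assumptions chosen for illustration, not measurements of any particular GPU.

```python
def required_in_flight(latency_cycles, throughput_per_cycle):
    """Little's law: independent operations needed in flight to hide latency."""
    return latency_cycles * throughput_per_cycle


def threads_needed(latency_cycles, throughput_per_cycle, ilp_per_thread):
    """Threads needed when each thread supplies `ilp_per_thread` independent ops."""
    ops = required_in_flight(latency_cycles, throughput_per_cycle)
    return -(-ops // ilp_per_thread)  # ceiling division on integers


# Illustrative (assumed) figures for one multiprocessor.
LATENCY = 24      # cycles before a result is available
THROUGHPUT = 32   # operations issued per cycle

print(required_in_flight(LATENCY, THROUGHPUT))  # 768
print(threads_needed(LATENCY, THROUGHPUT, 1))   # 768 threads with no ILP
print(threads_needed(LATENCY, THROUGHPUT, 4))   # 192 threads with 4 independent ops each
```

With these assumed numbers, a thread that exposes four independent operations lets a quarter as many threads cover the same latency, which also leaves more registers per thread.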


GPGPU supercomputing facility of the NSA

The NSA is building a huge datacenter in Utah to host its not-so-secret code-breaking facility. The program started in 2004 and was quickly based on GPGPU technologies, with the help of the government, which funded nVidia (CUDA technology) to accelerate GPGPU development and performance improvements. They are targeting the exaflops range (10^18 floating-point operations per second). That is now reachable with their 10 billion dollar budget!

A common working definition of a supercomputer is a computer, cluster of computers or set of computing devices that can sustain 1 TFlops (10^12 floating-point operations per second). That is not peak performance, but 1 TFlops of effective double-precision (64-bit) computation while running the Linpack benchmark. The Top500 list of supercomputers ranks machines by this Linpack measurement.

Sustained performance differs widely from peak performance, for two main reasons: double-precision floating point is much slower than single precision, and GPGPUs are not efficient on the Linpack benchmark (due to divergence, i.e. threads that don't follow exactly the same execution path). So modern GPUs are powerful floating-point processors, but still far from their peak potential on Linpack. A GPU with 4 TFlops of single-precision peak performance (Radeon HD 7970) delivers less than 0.5 TFlops on Linpack, which is still remarkable compared to a current CPU (0.1 TFlops).

The NSA code-breaking facility won't run Linpack; its GPUs will be used to break SHA-1, AES, Blowfish, etc. There is no floating point involved, let alone double precision. Moreover, all threads on all cores will run exactly the same code, which doesn't diverge at all, doing 32-bit integer computations. In this case modern GPUs could reach 4 x 10^12 computations per second with their 2000+ cores: the most effective code-breakers ever invented!

Reaching the exaflops barrier (10^18) is a matter of throwing 250 000 of these GPUs at a facility, since 10^18 / (4 x 10^12) = 250 000. That is a huge number at first sight, but you could put 4 of them in each 1U of a cabinet, probably around 128 per cabinet once you count the CPUs, local storage and switching space they need. That means around 2000 cabinets, which is also a big challenge, but the new facility is large enough to contain 10 000+ cabinets!
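
The sizing estimate above is simple enough to check in a few lines. This is a minimal sketch that only reuses the figures quoted in this post (10^18 operations per second as the target, 4 x 10^12 per GPU, roughly 128 GPUs per cabinet); the constant and function names are made up for the example.

```python
TARGET_OPS_PER_SEC = 10**18        # exascale target quoted in the post
OPS_PER_GPU_PER_SEC = 4 * 10**12   # per-GPU 32-bit integer throughput quoted in the post
GPUS_PER_CABINET = 128             # the post's rough per-cabinet estimate


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def gpus_needed(target_ops, ops_per_gpu):
    """Number of GPUs required to reach the target throughput."""
    return ceil_div(target_ops, ops_per_gpu)


def cabinets_needed(gpu_count, gpus_per_cabinet):
    """Number of cabinets required to house `gpu_count` GPUs."""
    return ceil_div(gpu_count, gpus_per_cabinet)


gpus = gpus_needed(TARGET_OPS_PER_SEC, OPS_PER_GPU_PER_SEC)
cabinets = cabinets_needed(gpus, GPUS_PER_CABINET)
print(gpus)      # 250000
print(cabinets)  # 1954
```

The result, a little under 2000 cabinets, matches the rough figure given above and is well within the 10 000+ cabinet capacity mentioned for the facility.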

The NSA is building a most fascinating supercomputer, using GPGPU to reach a new level of performance. There are still concerns about how this facility will be used; you may want to read the exceptional Wired article.