nVidia was the first major GPU designer to jump on the OpenCL bandwagon. OpenCL, a project initiated by Apple to enable OS- and vendor-agnostic cross-platform GPGPU development, is now maintained by the Khronos Group.
Today AMD and Intel are big players in OpenCL support, on both their CPUs and GPUs. The new AMD Radeon GCN architecture is clearly the OpenCL performance leader on complex algorithms, while the new nVidia Kepler architecture lags far behind the old Fermi architecture! Intel's 2013 CPU+GPU architecture, Haswell, is expected to beat the entry-level Kepler GT 640 in any usage (the Intel GT3 will beat it, trust me!).
nVidia is having a hard time: the ineffective and disappointing Kepler architecture is slower than AMD's new architecture for both 3D and GPGPU, and cannot even compete with nVidia's own 2010 architecture for GPGPU on the open playing field that is OpenCL.
nVidia, which was once the OpenCL leader, is now trying everything it can to stop supporting it, including removing comments and documentation from EXISTING OpenCL examples, not updating them, removing them from the SDK, etc.
The GT 640 is finally here: Kepler for everyone. It is a $109 development platform equivalent to the AMD GCN Radeon HD 7750: the new architecture at its base price, less than 75W of power draw (and thus powered entirely by the PCI Express bus), a PCI Express 3.0 16x interface, and lots of video outputs.
I am awaiting my own, to complete a CUDA/OpenCL rig sporting a Radeon HD 7750 and an nVidia GT 640. It will be great to compare the two architectures and to develop and optimize for both, and at some point, when they are ready, I will upgrade to higher-end GCN and Kepler GPUs, my PC having two PCI Express 16x slots and a great power supply. From the start, I decided to go with a high-end power supply so I could install two high-end GPUs at any point.
Sadly, the consumer version of the GT 640 uses DDR3, with only 28 GB/s of memory bandwidth, and I don't feel it will be comparable to the Radeon HD 7750, even for 3D. In OpenCL it will clearly lag behind (more news when I benchmark it!).
With the advent of the new Kepler generation and the GTX 680, nVidia decided to stop GTX 580 production, the GTX 680 being the new flagship of the green brand.
The GTX 580 was the last Fermi incarnation, with excellent gaming performance, and it was the best GPGPU card of its time, for both CUDA and OpenCL. The GTX 680 is a much faster gaming card, by a wide margin, but not a good GPGPU card. In fact, GTX 680 GPGPU performance is disappointing: easily 30% slower than the GTX 580 in many benchmarks that measure speed in real-world usage rather than chasing peak floating-point throughput!
The GTX 580 is actually preferred over the GTX 680 by a majority of CUDA and OpenCL developers, so it is sad to see this card disappear: there is no current nVidia card to challenge the AMD Radeon HD 7000 series in the GPGPU world, the $499 GTX 680 performing even worse than the $249 HD 7850 in many OpenCL benchmarks! This is clearly a no-go, and a huge shift for nVidia from GPGPU to gaming dedication.
I think many CUDA developers are feeling sad today!
In chess programming, many people seem to think that a modern chess engine finds the right move by searching the tree, but that is anything but true.
In fact, if you set any engine to a fixed depth of 1 ply, without extensions, it will usually rank the best move first, or at least a good one, in many positions.
What a chess engine is really doing when searching the tree is trying, recursively, to find an opponent move that refutes this first guess. For that it applies the same algorithm to the opponent's moves, beginning with its best guess. If everything goes well, it will take the first move on its list that is not refuted, or at least the one that seems to give the best chance of a better score (as the tree is not fully searched, there is an element of luck in that!).
Don't get me wrong: even today, with chess programs able to beat any human, including the World Chess Champion, by a huge margin, they don't have the vision of a grandmaster. But they are able to have at least some good ideas and check them deeply with their incredible computing power, far deeper than any human. In doing so they avoid moves that seem good but reveal themselves to be erroneous!
A very good article by Vasily Volkov about how you can get better performance in CUDA with fewer threads. (PDF)
It could also apply to OpenCL. In my own experience, I tried to hide latency this way, and found that launching too many threads (or warps) just creates too much memory I/O pressure and adversely affects performance, contrary to the common advice to launch the maximum number of threads!
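The core trick is instruction-level parallelism: give each thread several independent operations in flight instead of relying purely on thread count to hide latency. A minimal sketch of the idea, with an assumed SAXPY-style kernel of my own (not taken from Volkov's paper), where each thread handles `ILP` elements and the grid launches `ILP` times fewer threads:

```cuda
#define ILP 4  // independent elements per thread (assumed tuning value)

__global__ void saxpy_ilp(int n, float a, const float* x, float* y)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ILP;
    float r[ILP];

    // Issue all the loads first: several independent memory requests from
    // one thread overlap in flight, hiding latency without extra warps.
    #pragma unroll
    for (int i = 0; i < ILP; ++i)
        if (base + i < n) r[i] = x[base + i];

    // Then the independent multiply-adds, again overlapped by the unroll.
    #pragma unroll
    for (int i = 0; i < ILP; ++i)
        if (base + i < n) y[base + i] = a * r[i] + y[base + i];
}

// Launch with roughly n / ILP threads instead of n, e.g.:
// saxpy_ilp<<<(n / ILP + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```

Whether this wins depends on register pressure and the memory system of the particular GPU, so as always: benchmark on your target hardware.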