K10: why nVidia had to do it
K10 is just a downclocked, overpriced GTX 690 without any video output, sold for more than 2X the price. The $2000+ K10 cannot compete on some OpenCL benchmarks with a single $350 Radeon HD 7850. It’s totally disappointing, and happily nVidia decided to keep the Fermi Tesla cards for people serious about GPGPU (the first G meaning General).
Why launch a crippled GTX 690 called “K10” for the professional market?
Kepler is not there yet. Or is it?
GK104 has been disappointing as a GPGPU solution: slower than Fermi cards, and even slower than low-end AMD GCN cards!
nVidia addressed some of these problems in the future GK110 chip, a move probably decided after the AMD GCN launch in early 2012. That would explain a late 2012 or early 2013 launch for Kepler’s GK110, which may be faster than dual-GK104 for GPGPU, but will have to be compared to the future AMD HD 8970 (an optimized die-shrink of the current GCN 7970).
nVidia announced Kepler for 2011 and Maxwell for 2013, not the real Kepler for early 2013 (or late 2012 if we are lucky, but I bet it will be 2013).
Two years late is pretty disappointing, but I have to write that it’s mainly due to the 28nm manufacturing process being late. TSMC and the others have trouble keeping up with Intel, which has an incredible advantage over every other company and will keep it for a long time!
nVidia has to launch Kepler Tesla cards
If nVidia had announced that Kepler is not ready for the GPGPU computing world, and that everyone would have to wait another 6 months for a professional Kepler GPU card, its position would clearly be problematic given AMD’s pressure on GPGPU and OpenCL. nVidia had to release Kepler at some point, even if 18 months late! So that is the role of the K10!
K10 and GK104 are clearly not what the GPGPU world was expecting: slower than the GTX580 and the old Tesla in many GPGPU benchmarks, and no ECC. This cannot be called a “professional” card any more than an entry-level GT610 can! Nor is it a “performance” card, given that a single Radeon HD 7850 can beat it on many GPGPU benchmarks.
So nVidia initially claimed that GK104 (on the GTX680) is not meant for GPGPU computing. “NVIDIA has made it clear that they are focusing first and foremost on gaming performance with GTX 680 (GK104), and in the process are deemphasizing compute performance.”
Now nVidia is putting GK104 into a $2000+ non-video card (no video output). And they are targeting it at a few specific sub-markets where peak single-precision throughput matters more than flexibility, reliability or DP performance: it is nonsense.
The nonsense of the positioning of K10
Now nVidia explains that for 2012 we will have to use the venerable two-and-a-half-year-old Fermi, unless we are in a specific business that can do without the flexibility of the older-generation GPU, without ECC (and thus with less reliability), and with slow or no double-precision floating-point support.
Excuse me, but if you can live with GK104, without ECC, and with slow or no DP floating point, why don’t you buy a GTX690 instead?
Because the Turbo integrated in the GTX690 is variable, and will give you between 20% and 30% better performance than the K10?!? Please!
Lots of Fermi video cards unveiled during nVidia’s Kepler GTC!
This is the paradox of nVidia’s (and AMD’s) renaming game: while they promote their new architecture, they are shipping boxes with old architectures instead!
This week it’s nVidia, pushing its Kepler GK104 and GK110 while unveiling Fermi’s GT610, GT620 and GT630, low-end cards that may just about compete with AMD and Intel integrated GPUs! Their only interest for me is that they let anybody have a Fermi GPU with 2GB of memory (DDR3! lol) for a very low price tag, enabling CUDA or OpenCL development on Fermi with big datasets.
I will comment on Kepler’s K10 and K20 professional computing cards after some checks, but I can already write that I am totally disappointed!
OpenCL LuxMark2 Bench on HD 7750
With my desktop rig being upgraded, adding a HD7750 to the GTX260, I ran some OpenCL LuxMark2 benchmarks. The result is incredibly impressive for this kind of real-world, useful GPGPU workload. These are the SALA scene results, GPU only:
HD5850M 1GB GDDR5: 115 points
HD6750M 1GB GDDR5: 71 points
GTX260 896MB GDDR3: 109 points
HD7750 1GB GDDR5: 321 points
It’s clearly ahead of the pack, and being about 3X faster than a GTX260 is incredible: the GTX260 delivers 875 GFlops and 112GB/s of bandwidth, whereas the HD7750 delivers only 820 GFlops and 72GB/s. GCN is an incredible GPGPU architecture, and I wonder what is possible with a HD7970!
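To put those numbers side by side, here is a small sketch in C that computes the speedup and the LuxMark points obtained per peak GFlops, using only the figures quoted above. The struct and function names are mine and purely illustrative, and peak GFlops is a crude yardstick, so take the ratio as an indication rather than a measurement.

```c
/* Illustrative only: the type and function names are hypothetical,
   the numbers are the ones quoted in this post. */
typedef struct {
    const char *name;
    double luxmark_points;  /* LuxMark2 SALA scene, GPU only */
    double peak_gflops;     /* peak single-precision GFlops as quoted */
    double bandwidth_gbs;   /* memory bandwidth in GB/s as quoted */
} gpu_result;

const gpu_result GTX260 = { "GTX260", 109.0, 875.0, 112.0 };
const gpu_result HD7750 = { "HD7750", 321.0, 820.0, 72.0 };

/* How many times faster a is than b on the benchmark score. */
double speedup(const gpu_result *a, const gpu_result *b)
{
    return a->luxmark_points / b->luxmark_points;
}

/* Benchmark points extracted per peak GFlops: a rough efficiency figure. */
double points_per_gflops(const gpu_result *g)
{
    return g->luxmark_points / g->peak_gflops;
}
```

With these figures, speedup(&HD7750, &GTX260) comes out just under 3, and the HD7750 extracts roughly three times as many LuxMark points per peak GFlops as the GTX260, despite having less bandwidth.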
How to measure OpenCL Kernel execution time
I need to be able to measure Kernel execution time to validate some options. For a very long Kernel you may use the wall clock, but it’s not the right way to do it, because it also counts the enqueue and launch overhead. There are a few steps to accurately measure the Kernel execution time:
Create Queue with Profiling enabled
command_queue = clCreateCommandQueue(context, devices[deviceUsed], CL_QUEUE_PROFILING_ENABLE, &err);
Make sure all previously enqueued tasks have finished
clFinish(command_queue);
Launch Kernel linked to an event
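cl_event event; // receives the profiling data for this launch
err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, workGroupSize, NULL, 0, NULL, &event); // workGroupSize is passed as the global work size here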
Ensure kernel execution is finished
clWaitForEvents(1, &event);
Get the Profiling data
cl_ulong time_start, time_end;
double total_time;
// Profiling timestamps are device counters expressed in nanoseconds
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
total_time = (double)(time_end - time_start);
// nanoseconds -> milliseconds
printf("\nExecution time in milliseconds = %0.3f ms\n", total_time / 1000000.0);
That’s it
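The same event can also report when the command was queued and submitted (CL_PROFILING_COMMAND_QUEUED and CL_PROFILING_COMMAND_SUBMIT), which helps separate launch latency from pure execution time. Below is a minimal sketch of the arithmetic only. The struct and function names are hypothetical, and uint64_t stands in for cl_ulong so that the snippet compiles without the OpenCL headers; in real code you would fill the four fields with clGetEventProfilingInfo as shown above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the four profiling timestamps of one event.
   All values are in nanoseconds; uint64_t stands in for cl_ulong. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} profile_stamps;

/* Convert a nanosecond interval to milliseconds. */
double ns_to_ms(uint64_t begin, uint64_t end)
{
    assert(end >= begin);  /* timestamps are expected to be ordered */
    return (double)(end - begin) / 1.0e6;
}

/* Time the kernel actually spent executing on the device. */
double kernel_ms(const profile_stamps *p)
{
    return ns_to_ms(p->start, p->end);
}

/* Time between enqueueing the command and the device starting it. */
double launch_latency_ms(const profile_stamps *p)
{
    return ns_to_ms(p->queued, p->start);
}
```

For example, with queued = 1000000, start = 1500000 and end = 4000000, kernel_ms gives 2.5 ms and launch_latency_ms gives 0.5 ms. A large latency compared to the execution time is a hint that the wall clock would badly overestimate a short Kernel.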
OpenCL: int32 vs int64
I am checking int32 vs. int64 on the AMD VLIW5 platform (Radeon HD5850M), and the results are pretty interesting for GPGPU developers who do more than just number-crunching. I will do some more tests this weekend and report the results to you, including the GTX260 and the GCN architecture (Radeon HD7750)… Interesting week.

