
Posts from the ‘OpenCL’ Category


Donate for OpenCL Chess development

Srdja is developing a promising open-source OpenCL chess engine, Zeta-Chess. He needs your help to upgrade his current configuration; you will find a PayPal donation page here.

Srdja has worked hard over the last 3 years to develop an OpenCL chess engine; he created the first one and, AFAIK, it is currently the only one. As this work is an open-source project done in his spare time, not a commercial project, he needs community support to continue development and, in this case, to buy a Radeon HD 7970 to replace the HD 7750 he invested in early this year.

Read the zeta-chess blog, the zeta-vantage blog, zeta-dva (another chess engine), download the zeta-chess source code, try it (it’s multiplatform!) and donate to help him upgrade his rig! Thanks!


OpenCL LuxMark2 Bench on HD 7750

With my desktop rig being upgraded, adding an HD7750 alongside the GTX260, I ran some OpenCL LuxMark2 benchmarks. The HD7750 is incredibly impressive on this kind of real-world, useful GPGPU workload. These are the SALA scene results, GPU only:

HD5850M 1GB GDDR5: 115 points
HD6750M 1GB GDDR5:  71 points
GTX260 896MB GDDR3: 109 points

HD7750 1GB GDDR5: 321 points

It’s clearly ahead of the pack, and being about 3X faster than a GTX260 is incredible: the GTX260 delivers 875 GFlops and 112GB/s of bandwidth, whereas the HD7750 delivers only 820 GFlops and 72GB/s. GCN is an incredible GPGPU architecture, and I wonder what is possible with an HD7970!
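
To make the comparison explicit, here is a minimal sketch that recomputes the speed-up and the LuxMark points obtained per theoretical GFlops from the figures quoted above. The struct and function names are made up for illustration; the numbers are the ones stated in this post, not new measurements.

```c
#include <stdio.h>

/* One row of the results quoted in the post. The GFlops and bandwidth
   values are the post's stated figures, not independent measurements. */
struct gpu_result {
    const char *name;
    double luxmark_points;  /* LuxMark v2, SALA scene, GPU only */
    double peak_gflops;     /* theoretical peak, as quoted */
    double bandwidth_gbs;   /* memory bandwidth in GB/s, as quoted */
};

static const struct gpu_result GTX260 = { "GTX260", 109.0, 875.0, 112.0 };
static const struct gpu_result HD7750 = { "HD7750", 321.0, 820.0, 72.0 };

/* How many times faster card a scores than card b. */
static double speedup(const struct gpu_result *a, const struct gpu_result *b)
{
    return a->luxmark_points / b->luxmark_points;
}

/* Benchmark points obtained per theoretical GFlops. */
static double points_per_gflops(const struct gpu_result *g)
{
    return g->luxmark_points / g->peak_gflops;
}

/* Print the derived ratios for the two cards. */
static void print_comparison(void)
{
    printf("%s vs %s speed-up: %.2fx\n",
           HD7750.name, GTX260.name, speedup(&HD7750, &GTX260));
    printf("%s: %.3f points per GFlops\n",
           GTX260.name, points_per_gflops(&GTX260));
    printf("%s: %.3f points per GFlops\n",
           HD7750.name, points_per_gflops(&HD7750));
}
```

Calling print_comparison() from a main() shows that the HD7750 gets roughly three times the score out of slightly fewer theoretical GFlops, which is the point being made about GCN.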


How to measure OpenCL Kernel execution time

I need to be able to measure kernel execution time to validate some options. For a very long kernel you may use the wall clock, but it’s not the right way to do it. There are a few steps to measure kernel execution time accurately:

Create Queue with Profiling enabled
command_queue = clCreateCommandQueue(context, devices[deviceUsed], CL_QUEUE_PROFILING_ENABLE, &err);

Ensure all previously enqueued tasks have finished
clFinish(command_queue);

Launch Kernel linked to an event
err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, workGroupSize, NULL, 0, NULL, &event);

Ensure kernel execution is finished
clWaitForEvents(1 , &event);

Get the Profiling data
cl_ulong time_start, time_end;
double total_time;

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
total_time = time_end - time_start;
printf("\nExecution time in milliseconds = %0.3f ms\n", (total_time / 1000000.0) );
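
Besides START and END, OpenCL also records CL_PROFILING_COMMAND_QUEUED and CL_PROFILING_COMMAND_SUBMIT for each event, all in nanoseconds. The sketch below shows one way to turn the four timestamps into readable intervals. It is only an illustration: the struct and helper names are hypothetical, and uint64_t stands in for cl_ulong so the code compiles without the OpenCL headers. In a real host program each field would be filled by a clGetEventProfilingInfo call like the ones above.

```c
#include <stdint.h>
#include <stdio.h>

/* The four timestamps OpenCL records for a profiled command, in
   nanoseconds. uint64_t stands in for cl_ulong (an assumption made so
   this sketch builds without OpenCL headers). */
struct profile_times {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
};

/* Difference between two nanosecond timestamps, in milliseconds. */
static double ns_to_ms(uint64_t later, uint64_t earlier)
{
    return (double)(later - earlier) / 1.0e6;
}

/* Print how long the command waited and how long it actually ran. */
static void print_profile(const struct profile_times *t)
{
    printf("queued -> submit: %0.3f ms\n", ns_to_ms(t->submit, t->queued));
    printf("submit -> start : %0.3f ms\n", ns_to_ms(t->start, t->submit));
    printf("start  -> end   : %0.3f ms (kernel execution)\n",
           ns_to_ms(t->end, t->start));
}
```

Only the start-to-end interval is the kernel execution time; the other two intervals show how long the command waited on the host and in the device queue.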

That’s it :)


OpenCL: int32 vs int64

I am checking int32 vs. int64 on the AMD VLIW5 platform (Radeon HD5850M), and the results are pretty interesting for GPGPU developers who do things other than just number-crunching. I will do some more tests this weekend and report the results, including the GTX260 and the GCN architecture (Radeon HD7750)… Interesting week :)


Some thoughts on OpenCL on Mac

Macs are probably the best platform for OpenCL, for two main reasons: Apple initiated OpenCL development and then transferred it to the Khronos Group, which is currently in charge of this open standard; and every Mac sold since mid-2009 ships with OpenCL drivers, as does any Mac upgraded to OS X 10.6 or 10.7!

You don’t have to ask yourself how to install OpenCL drivers: they are included from the start on any recent Mac, or any Mac that runs a recent OS version. Every Mac has a CPU driver, and those with a discrete GPU also have a GPU driver, as do even those with a 9400M GT or 320M GT chipset GPU. Simply put: “it just works!”.

Many Mac applications use OpenCL, whether explicitly, like Photoshop CS6 or LuxMark / LuxRender, or implicitly by calling OpenCL-enabled libraries, as Pixelmator does through Core Image. OpenCL is totally integrated, transparent, and a really efficient way to push performance on many Macs.

I did a round of tests yesterday and would like to share some interesting results with you.

OS X 10.6 Snow Leopard vs 10.7 Lion
I have two identical 27-inch iMacs bought together: one runs 10.6.8 for production purposes (because of some compatibility problems), the other runs 10.7.3. Same CPU, enough RAM in both (8GB and 20GB), same HD 5850M GPU w/1GB GDDR5 (dubbed “HD 5750” by Apple to mislead consumers).

Using LuxMark v2, I saw differences in two kinds of operation: GPU alone and CPU+GPU (using both as OpenCL devices that share the work).
The CPU itself is a little slower on 10.6.8, by 3% to 6% on any scene, but that might be due to some background software on this Mac, so it’s irrelevant.

The GPU score is totally different, with 10.6.8 being 28% (SALA scene) to 3.8 times (ROOM scene) slower than 10.7.3 on the same hardware: Apple and AMD seem to have optimized the OpenCL driver on Lion to provide better performance, in a huge way!
In CPU+GPU mode, where 10.7.3 exploits 98% to 99% of each resource when they are used together, 10.6.8 could only exploit 90% (SALA) down to 73% (ROOM) of the available computing resources, not to mention that the individual results were lower by a huge margin.
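
One way to read those percentages is as the combined CPU+GPU score divided by the sum of the CPU-only and GPU-only scores. That reading is an assumption, and the helper below is a hypothetical sketch, not how LuxMark itself reports anything.

```c
/* Fraction of the summed single-device scores that the combined run
   actually achieves: 1.0 would mean perfect aggregation of CPU and GPU.
   This definition is an assumed reading of the percentages in the post. */
static double aggregation_efficiency(double cpu_score,
                                     double gpu_score,
                                     double combined_score)
{
    double ideal = cpu_score + gpu_score;  /* score if nothing were lost */

    if (ideal <= 0.0) {
        return 0.0;  /* no meaningful ratio without single-device scores */
    }
    return combined_score / ideal;
}
```

For example, with made-up scores of 100 (CPU) and 115 (GPU), a combined score of 212 gives an efficiency of about 0.99, while a combined score of 157 gives about 0.73.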

The low individual performance on 10.6.8 and the inability to fully aggregate the processing power of CPU+GPU result in overall (CPU+GPU) scores being from 25% to 2 times slower than on the same hardware running OS X Lion 10.7.3! Guess what I will choose for my OpenCL development???

Another small point is that 10.6.8 reported only the free CPU memory as available for the CPU OpenCL device, whereas 10.7.3 reported the installed memory. Not a big deal, but on the other hand, 10.6.8 reported only 16KB of Local Memory where 10.7.3 reported 32KB, which could break code developed on 10.7.3 when you try to run it on 10.6.8!
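
A defensive host program can check the local memory requirement before launching a kernel. The sketch below is a minimal illustration with hypothetical names; in real code the device value would come from clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE, which is not called here so that the example stays free of OpenCL headers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Convert KiB to bytes. */
#define KIB(n) ((size_t)(n) * 1024u)

/* Returns true if a work-group of work_group_size items, each needing
   bytes_per_item bytes of local memory, fits in device_local_bytes.
   The division form avoids overflow in the multiplication. */
static bool fits_local_mem(size_t bytes_per_item,
                           size_t work_group_size,
                           size_t device_local_bytes)
{
    if (bytes_per_item == 0) {
        return true;  /* kernel uses no local memory */
    }
    return work_group_size <= device_local_bytes / bytes_per_item;
}
```

With 256 work-items that each use 96 bytes (24KB in total), fits_local_mem(96, 256, KIB(32)) is true but fits_local_mem(96, 256, KIB(16)) is false, which is exactly the 10.7.3 vs 10.6.8 situation described above.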

512MB instead of 1GB
All the Macs I have checked for OpenCL so far had 1GB GDDR5 (27-inch iMac w/HD5850M and MacBook Pro w/HD6750M). Despite that, OpenCL reported only 512MB of available OpenCL Global Memory on both devices, and I wonder why: if OS X reserved 512MB for its own use, I don’t see how a 512MB 6750M could run any OpenCL application (0MB free?!?), let alone a 256MB 8600M GT!
So OS X doesn’t need 512MB, and probably doesn’t reserve it, but I suspect that it caps OpenCL memory usage at 512MB whatever the amount of video memory in your Mac.

Note that there are just two GPUs for Mac that sport 2GB of video memory: a BTO option on the high-end iMac (Radeon HD6970M 2GB GDDR5) and an add-in card for the Mac Pro (nVidia Quadro 4000 2GB GDDR5). All standard configurations have 1GB or less (usually less!).

512MB is really low for some OpenCL development, where people are looking for 3GB or 6GB graphics cards. Maybe it’s meant to ensure that most OpenCL software will run on most Macs, but it’s troublesome in a world where entry-level graphics cards have at least 900MB available for OpenCL!

Mac performance/price
Apple usually installs low-end to mid-range graphics hardware in truly expensive computers (sorry, I like Macs, but it has to be said!). The current MacBook Pro uses an underclocked HD 6770M that runs just a few percent faster than my HD6750M, which is a pity on a $2200+ laptop; the $1999 iMac uses an HD6970M with 1GB GDDR5 that is equivalent to a desktop HD7770 for 3D, and well below the $109 HD7750 for OpenCL. Premium price, under-performing parts. I know, I own some Macs!

Clearly Macs are cool computers to develop OpenCL on, but if you are looking for huge memory usage or real OpenCL performance, you’d better build a PC and buy a Radeon HD7970 or possibly an nVidia GTX680 (but only if you need CUDA too). You will end up with up to 3X faster OpenCL and 5X the available GPU memory for less than the price tag of an HD6970M 1GB GDDR5 iMac!

In fact, the best available OpenCL device in a standard Mac configuration can only match the Radeon HD7750, a $109 painless upgrade that will work in nearly any PC thanks to its low power consumption! If you are on a budget, go with a used PC and an HD 7750; you won’t regret it!

I plan to set up benchmarks on Windows 7 64-bit on the same iMac and MacBook Pro in the coming weeks, to compare OS X 10.7.3 and Windows 7 on the same Mac hardware. Stay tuned :)