

21 Jun

Xeon Phi: the efficiency myth

Intel stated that its Xeon Phi would be more efficient than GPGPUs for HPC, with both a better Linpack/Watt ratio and a better Linpack/peak TFlops ratio.

The first two supercomputers on the June 2013 Top500 list use Intel Xeon Phi and nVidia K20x respectively, so we can compare their metrics, especially their efficiency.

The first point is Linpack TFlops per kW. The Xeon Phi system delivers 33862 Linpack TFlops for 17808 kW (1.90 TFlops/kW), while the K20x system delivers 17590 Linpack TFlops for 8209 kW (2.14 TFlops/kW): the K20x-based supercomputer is 12.7% more power-efficient!
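The power-efficiency comparison above is simple arithmetic. This short sketch reproduces it from the Rmax and power figures quoted in this post. The dictionary keys and the `tflops_per_kw` helper are illustrative names only.

```python
# Linpack power efficiency (TFlops per kW) for the top two systems of the
# June 2013 Top500 list, using the Rmax and power figures quoted above.
systems = {
    "Xeon Phi system": {"rmax_tflops": 33862.0, "power_kw": 17808.0},
    "K20x system": {"rmax_tflops": 17590.0, "power_kw": 8209.0},
}


def tflops_per_kw(rmax_tflops: float, power_kw: float) -> float:
    """Sustained Linpack TFlops delivered per kW of measured power."""
    return rmax_tflops / power_kw


phi = tflops_per_kw(**systems["Xeon Phi system"])
k20x = tflops_per_kw(**systems["K20x system"])
advantage = (k20x / phi - 1.0) * 100.0  # K20x efficiency gain over Xeon Phi, in %

print(f"Xeon Phi: {phi:.2f} TFlops/kW")     # 1.90
print(f"K20x:     {k20x:.2f} TFlops/kW")    # 2.14
print(f"K20x advantage: {advantage:.1f}%")  # 12.7
```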

The second point is the ratio of Linpack performance to raw (peak) processing power, which measures how effectively an architecture uses its theoretical capacity. GPUs are known to be less efficient than CPUs on this metric, and Intel claimed that, thanks to its x86 cores, the Xeon Phi would be much more efficient than the nVidia K20x, delivering real-world performance closer to its theoretical peak.

Alas, on this second point too, as I expected, the Xeon Phi lags behind the K20x, with 61.7% efficiency where the nVidia GPU obtains 64.9%, showing that the Intel architecture is not mature enough and still needs more iterations.
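The efficiency figure is Rmax (sustained Linpack) divided by Rpeak (theoretical peak). The post does not quote the Rpeak values. This sketch uses the ones listed for Tianhe-2 (Xeon Phi) and Titan (K20x) on the June 2013 Top500 list, and they reproduce the 61.7% and 64.9% above. The helper name is illustrative.

```python
# Linpack efficiency = Rmax / Rpeak. Rmax values are those quoted in this post;
# Rpeak values are the theoretical peaks listed on the June 2013 Top500 list.
def linpack_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Share of the theoretical peak actually sustained on Linpack, in percent."""
    return 100.0 * rmax_tflops / rpeak_tflops


phi_eff = linpack_efficiency(33862.0, 54902.4)   # Tianhe-2, Xeon Phi based
k20x_eff = linpack_efficiency(17590.0, 27112.5)  # Titan, K20x based

print(f"Xeon Phi: {phi_eff:.1f}%")  # 61.7%
print(f"K20x:     {k20x_eff:.1f}%")  # 64.9%
```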

Clearly, the K20x consumes less power for the same Linpack work, and its architecture (hardware and software) is more mature and efficient than the Intel Xeon Phi’s!

20 Jun

High-level Object-Oriented language chess engine

I am working on a chess engine, in fact several chess engines, and I chose to use a high-level object-oriented scripting language!

Writing a chess engine in a high-level language is totally inefficient, especially in an object-oriented one (and yes, I do use classes and objects, and it’s fun!): such languages are far slower than any low-level language, and can’t compare in any way with well-optimized assembly code using AVX or AVX2!

But for development, a high-level language is much more fun and convenient: its tools and its ability to detect errors, including array index errors, help a lot. In fact, the most important thing is not how you optimize your code but the quality of the algorithms, and for chess that’s all that matters.
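To illustrate why algorithm quality outweighs low-level tuning, here is a small Python sketch, not the author’s engine, comparing plain minimax with alpha-beta pruning, the classic refinement used in chess search. It runs on a synthetic random game tree. The tree builder, the random leaf scores and the `LeafCounter` helper are hypothetical. Both searches return the same value, but alpha-beta evaluates far fewer leaves, and that gain grows with search depth.

```python
import math
import random


def build_tree(depth: int, branching: int, rng: random.Random):
    """Build a synthetic game tree: inner nodes are lists, leaves are int scores."""
    if depth == 0:
        return rng.randint(-100, 100)
    return [build_tree(depth - 1, branching, rng) for _ in range(branching)]


class LeafCounter:
    """Counts how many leaf positions a search evaluates."""

    def __init__(self) -> None:
        self.leaves = 0


def minimax(node, maximizing: bool, counter: LeafCounter) -> int:
    """Plain minimax: visits every leaf of the tree."""
    if isinstance(node, int):
        counter.leaves += 1
        return node
    values = [minimax(child, not maximizing, counter) for child in node]
    return max(values) if maximizing else min(values)


def alphabeta(node, alpha: float, beta: float, maximizing: bool,
              counter: LeafCounter) -> int:
    """Minimax with alpha-beta pruning: same root value, fewer leaves visited."""
    if isinstance(node, int):
        counter.leaves += 1
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, alpha, beta, False, counter))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # cut-off: the minimizing parent will never allow this line
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, alpha, beta, True, counter))
        beta = min(beta, best)
        if alpha >= beta:
            break  # cut-off: the maximizing parent will never allow this line
    return best


tree = build_tree(depth=6, branching=6, rng=random.Random(42))
mm, ab = LeafCounter(), LeafCounter()
v1 = minimax(tree, True, mm)
v2 = alphabeta(tree, -math.inf, math.inf, True, ab)
print(v1 == v2, mm.leaves, ab.leaves)
```

Plain minimax always evaluates all 6^6 = 46656 leaves here. Good move ordering would make alpha-beta prune even more.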

Naturally, for a product, you will port it to a low-level language (such as ANSI C, or hand-written assembly for some functions), but this is the last stage of development, not its core.

So, I am writing chess engines, and I have chosen names that are no longer in use but will speak to people in their late forties (like myself) or older: they are history, those people dreamed about them, and they probably want to play with them again. A sense of history…

It’s also a tribute to Alan Turing (war hero, marathon runner, computer scientist, AI pioneer, bad chess player, gay, genius: choose your flavor!), and to Dan & Kathe Spracklen, who wrote the first microcomputer chess program to win a contest and beat million-dollar minicomputers. They made chess-computing history through Turochamp and Sargon Chess respectively.

18 Jun

2014 Mac Pro @ 7 Tflops

This is good news: Apple is communicating about the future Mac Pro’s raw processing power in terms of OpenCL, a GPGPU technology that Apple created and then handed over to the Khronos Group. It’s a good step forward.

The step backward is that the 2008 Mac Pro supports two AMD Radeon HD 7990 cards, offering 16 TFlops of raw power, more than twice the power of the future Mac Pro. And this is disappointing, as usual with Apple announcements…
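The 16 TFlops figure follows from the usual peak-throughput formula: shader units × clock × 2, because a fused multiply-add counts as two floating-point operations. This sketch assumes Radeon HD 7990 specifications of two GPUs with 2048 stream processors each at a 1.0 GHz boost clock. Those specs are not stated in the post.

```python
def peak_tflops(shader_units: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical single-precision peak: units x GHz x flops/cycle gives Gflops; /1000 gives TFlops."""
    return shader_units * clock_ghz * flops_per_cycle / 1000.0


# Assumed Radeon HD 7990 specs: two GPUs per card, 2048 stream processors each, 1.0 GHz.
hd7990 = 2 * peak_tflops(2048, 1.0)  # about 8.19 TFlops per card
two_cards = 2 * hd7990               # about 16.4 TFlops
future_mac_pro = 7.0                 # TFlops, the figure announced by Apple

print(f"{two_cards:.1f} TFlops, {two_cards / future_mac_pro:.2f}x the new Mac Pro")
```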

27 Oct

The Mainframe Dinosaur

I was reading some interesting articles about Mainframes (aka “Big Iron”) stating that they are still relevant in our era of cloud computing, clusters of inexpensive computers, and high-performance computing available on workstations.

What makes the Mainframe special

Mainframes are specially designed and engineered to provide incredible physical uptime, with IBM claiming one outage per 50 years of use on some series. They are delivered as a complete solution, with a high price tag and usage licences for almost every software and hardware part, including the CPU.

The main selling point is the claim of incredibly high CPU utilization in real-world use, which IBM and others translate into “massive throughput”. In fact, IBM refuses to compare its mainframes with other solutions, explaining that “real-world” performance is far better than benchmark results, and rarely submits its mainframes to industry-standard tests!

How to increase your CPU usage on a cluster?

Indeed, when you build a cluster of servers to handle a task, even with large RAID bays and even SSDs, you may end up with your CPUs waiting for I/O, even if you plan to group computing/database servers with dedicated storage bays (a good starting point).

It is *NOT* because you are a dummy, nor because your storage is too slow: your computing power/storage IOPS ratio is just badly balanced, and that is where mainframes are largely superior!

To get a better-balanced computing power/storage IOPS ratio, Mainframes usually have incredibly low processing power for their cost, which keeps their expensive CPUs at incredibly high usage rates!

Want to show you could do the same with your current workload? Just downclock your CPUs, remove some CPUs from multi-CPU boards, or deactivate some cores at the BIOS or kernel level: you will end up with a perfectly balanced solution that compares with a much more expensive Mainframe ;)
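The argument above can be captured in a simple bottleneck model, which is an assumption and not a measured profile. Throughput is the smaller of what the CPUs can process and what the storage can feed them. CPU utilization is throughput divided by CPU capacity. With hypothetical numbers, halving the CPU capacity doubles the utilization while the throughput does not change at all.

```python
def cluster_stats(cpu_capacity: float, storage_iops: float, io_per_unit: float):
    """Bottleneck model of a cluster (hypothetical, for illustration).

    cpu_capacity: units of work the CPUs can process per second
    storage_iops: I/O operations per second the storage can serve
    io_per_unit:  I/O operations each unit of work needs
    Returns (throughput in work units/s, CPU utilization in %).
    """
    io_limit = storage_iops / io_per_unit     # work/s the storage can feed
    throughput = min(cpu_capacity, io_limit)  # the slower side sets the pace
    utilization = 100.0 * throughput / cpu_capacity
    return throughput, utilization


# Hypothetical numbers: the storage can feed 400 work units per second.
full_speed = cluster_stats(cpu_capacity=1000, storage_iops=20000, io_per_unit=50)
downclocked = cluster_stats(cpu_capacity=500, storage_iops=20000, io_per_unit=50)
print(full_speed)   # (400.0, 40.0)
print(downclocked)  # (400.0, 80.0)
```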

Hardware vs. Service

We are switching from hardware and software to services, whether they run locally, in remote facilities, or in the cloud (Amazon, etc.): we no longer care about the servers, and we don’t want to have to. Instead, we expect services, from low-level (e.g. Amazon block-based storage) to high-level (e.g. web-based enterprise CRM), and we just expect them to work.

How we now handle reliability

As Google, Amazon and others have demonstrated, reliability is no longer a matter of hardware but of networked services running on many servers. Instead of expecting the hardware to run flawlessly, we expect it to fail regularly.

Reliability is handled at the service level (using frameworks such as Hadoop’s MapReduce), which launch tasks on groups of servers and relaunch them when a server fails. This enables the use of far less expensive servers built with quality parts but without relying on hardware redundancy. A side effect is better electrical power efficiency than any hardware-redundant architecture (thus reducing expenses!).
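The service-level approach to reliability can be sketched as a task queue that simply retries work when a server dies. This is a hypothetical simulation, not the API of Hadoop or any other framework. Each attempt fails with probability `failure_rate`, standing in for a crashed server, and the task goes back into the queue. Every task eventually completes without any redundant hardware.

```python
import random
from collections import deque


def run_tasks(tasks, servers, failure_rate, seed=0):
    """Run every task on some server, requeueing it whenever that server fails.

    Hypothetical model: each attempt fails with probability `failure_rate`.
    Returns ({task: server that completed it}, number of relaunches).
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    rng = random.Random(seed)
    queue = deque(tasks)
    done = {}
    relaunches = 0
    attempt = 0
    while queue:
        task = queue.popleft()
        server = servers[attempt % len(servers)]  # simple round-robin placement
        attempt += 1
        if rng.random() < failure_rate:
            relaunches += 1
            queue.append(task)  # server failed: requeue, retried on a later server
        else:
            done[task] = server
    return done, relaunches


tasks = [f"task-{n}" for n in range(100)]
servers = ["srv-a", "srv-b", "srv-c", "srv-d"]
done, relaunches = run_tasks(tasks, servers, failure_rate=0.1, seed=1)
print(len(done), "tasks completed,", relaunches, "relaunches")
```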

Dinosaurs for legacy software

These Mainframe Dinos are too expensive compared to any other solution. They are not more reliable at the service level (the one that really matters today!), they are proprietary in every way possible, and they are obsolete, the same way big, expensive single disks have been replaced by RAID arrays.

The only reason they are still there is compatibility with legacy software, and since the IBM System/360 this has been the real selling point behind the official marketing line. Proprietary software vendors are locking companies into the Dinosaur era of computing, and those companies are paying a tithe to the Mainframe manufacturers every year (or even every month!)…

Proprietary closed-source software has proven to be much more expensive than anyone might have thought when big companies bought (or leased) their first Mainframes…

2 Oct

Parallella: SuperComputing for all of us?

Adapteva presented a Kickstarter project, Parallella, a $99 board that promises supercomputing for everyone, built around a highly parallel RISC engine. Designed to be programmed with OpenCL drivers and compilers, it’s an alternative to other OpenCL devices, including GPUs, the IBM Cell, etc.

This project looks interesting at first, but I wonder who could really be interested in it, even with the budget limited to $99.

GFlops/Watt
The Parallella board promises 32 Gflops peak for approximately 5W (maybe more in some cases), while a quad-core PC with an OpenCL GPU will offer you 4000 Gflops (125X) for 500W (100X). A PC is easily 25% more efficient with a single GPGPU, and could offer as much as 2X the power efficiency (Gflops/Watt) with 3 or 4 GPUs!

GFlops/$
A $1000 PC configuration with a high-end GPU will cost you 10X the price of the Parallella board while delivering up to 125X the performance: it’s 12X more cost-effective in Gflops/$!

And it gets worse if you consider a high-end PC with 3 or 4 GPUs!
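Both ratios above come straight from the figures quoted in this post: 32 Gflops, 5W and $99 for the board, against 4000 Gflops, 500W and $1000 for the PC. This sketch recomputes them. The `Platform` class is only an illustrative container.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    name: str
    gflops: float     # peak single-precision Gflops
    watts: float
    price_usd: float

    def gflops_per_watt(self) -> float:
        return self.gflops / self.watts

    def gflops_per_dollar(self) -> float:
        return self.gflops / self.price_usd


parallella = Platform("Parallella", gflops=32, watts=5, price_usd=99)
pc = Platform("Quad-core PC + 1 GPU", gflops=4000, watts=500, price_usd=1000)

power_gain = pc.gflops_per_watt() / parallella.gflops_per_watt()     # 8.0 / 6.4
cost_gain = pc.gflops_per_dollar() / parallella.gflops_per_dollar()  # 4.0 / 0.32
print(f"Gflops/W: {parallella.gflops_per_watt():.1f} vs {pc.gflops_per_watt():.1f} ({power_gain:.2f}x)")
print(f"Gflops/$: {parallella.gflops_per_dollar():.2f} vs {pc.gflops_per_dollar():.2f} ({cost_gain:.1f}x)")
```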

Alternative at $99 price point
You might consider investing $99 in an AMD Radeon HD 7750, which supports OpenCL (and some other tools) and offers at least 820 Gflops SP (25X more) while adding up to 70W of power consumption.

That’s 25X more Gflops per dollar invested, and even at 70W it’s still about 1.8X more power-efficient, if you already own a PC with an available slot! Ouch!
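For the same $99, here is how the HD 7750 compares with the board, using only the figures quoted in this post: 820 Gflops at up to 70W versus 32 Gflops at 5W. The power-efficiency ratio assumes the card really draws the full 70W.

```python
# Same $99 budget: Parallella board vs. an AMD Radeon HD 7750 added to an existing PC.
parallella_gflops, parallella_watts = 32.0, 5.0
hd7750_gflops, hd7750_watts = 820.0, 70.0
price_usd = 99.0

per_dollar_gain = (hd7750_gflops / price_usd) / (parallella_gflops / price_usd)
per_watt_gain = (hd7750_gflops / hd7750_watts) / (parallella_gflops / parallella_watts)
print(f"Gflops/$: {per_dollar_gain:.1f}x")  # 25.6x
print(f"Gflops/W: {per_watt_gain:.2f}x")    # 1.83x
```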

Development tools are available for Windows and Linux too, including open-source ones. Moreover, your code will run as-is on any PC or Mac with an OpenCL-enabled graphics card or CPU driver!

Parallella: supercomputing for whom?!?
The Adapteva chips are interesting if you plan to build embedded high-performance devices, but in no way can Parallella be considered a supercomputer, whether in performance level, in Gflops/Watt or in Gflops per dollar invested.

Still, it’s an interesting project, because new players in the parallel-computing field may be game changers in the long run. The Epiphany-IV processor, for example, is far more interesting than the Epiphany-III used in Parallella: with 3X more performance in the same power envelope, it would be more power-efficient than a GPGPU solution.
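The Epiphany-IV argument is again arithmetic on this post’s figures. This sketch assumes, as the text states, 3X the Parallella’s 32 Gflops within the same 5W envelope. It compares the result with the single-GPU PC (4000 Gflops, 500W) and the HD 7750 (820 Gflops, 70W).

```python
epiphany_iii = 32.0 / 5.0       # Gflops/W, Parallella (Epiphany-III) figures from this post
epiphany_iv = 3 * epiphany_iii  # assumption from the text: 3X performance, same power envelope
gpgpu_pc = 4000.0 / 500.0       # quad-core PC with one OpenCL GPU
hd7750 = 820.0 / 70.0           # $99 GPU added to an existing PC

print(f"Epiphany-IV {epiphany_iv:.1f}, PC+GPU {gpgpu_pc:.1f}, HD 7750 {hd7750:.1f} Gflops/W")
```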

Why not launch Parallella with the Epiphany-IV directly?!?