Earlier this month, the OpenCL specification was released by the Khronos group. Khronos is a group made up of representatives from companies in the computing industry. The group focuses on creating and managing standards for graphics, multimedia and parallel computing on everything from mobile devices to desktop and workstation computers. Part of Khronos' charge is OpenGL and all it's relatives with the Open- prefix, so naming also makes sense.

 

 

The goal of OpenCL is to make certain types of parallel programming easier and to provide vendor agnostic hardware accelerated parallel execution of code. That's a bit of a mouth full, but the bottom line is that OpenCL will give developers a common set of easy to use tools to use to take advantage of any device with an OpenCL driver (processors, graphics cards, ect.) for the processing of parallel code.

While there are already tools available that enable parallel processing, these tools are largely dedicated to task parallel models. The task parallel model is built around the idea that parallelism can be extracted by constructing threads that each have their own goal or task to complete. While most parallel programming is task parallel, there is another form of parallelism that can greatly benefit from a different model.

In contrast to the task parallel model, data parallel programming runs the same block of code on hundreds (or thousands or millions or ...) of data points. Whereas my video game may have threads for handling AI, physics, audio, game state, rendering, and possibly more finely grained tasks if I'm up to the challenge, a data parallel program to do something like image processing may spawn millions of threads to do the processing on each pixel. The way these threads are actually grouped and handled will depend on both the way the program is written and the hardware the program is running on.

 

 

As we've said many times in the past, graphics is almost infinitely parallelizable. Millions of pixels on the screen can all act (mostly) independently of each other. Light weight threads handle the calculation of everything that has to do with a particular pixel. As pixels get smaller and we pack more on screens, there is more opportunity for parallel work. Graphics cards are currently the best data parallel processing engines we have available. And once OpenCL drivers are available, developers will have access to all that horsepower for any other data parallel tasks they see fit.

Now, it won't make sense to run a word processor on your graphics card, as there just isn't enough happening at once to take advantage of the hardware. Single threaded performance on a GPU isn't that great, especially compared to a general purpose CPU, and trying to run code that isn't massively parallel just isn't going to be a great idea. But there are plenty of things that can benefit from the GPU. Basically any multimedia processing can benefit, from video and audio decoding, editing, and encoding, to image manipulation, to helping speed up your math homework (brute force computation ala Maple, Matlab, and Mathematica could certainly benefit from the GPU). There could be some interesting encryption and/or compression techniques that are born out of the data parallel approach as well.

The best applications of data parallel computing have likely not been seriously considered at this point, as it takes time to get from the availability of tools to the finished product, let alone the conception of ideas that have heretofore been precluded by the realities of parallel programming. But OpenCL isn't a miracle that will make everything speed up. Rather it is a vehicle by which developers will be able to make a small subset of tasks orders of magnitude faster using hardware that is already in most people's computers. Which is certainly nice. But let's take a closer look.

Parallel Computing: Why We Need OpenCL
POST A COMMENT

37 Comments

View All Comments

  • v12v12 - Wednesday, January 07, 2009 - link

    Testing123, ignore plz Reply
  • corporategoon - Tuesday, January 06, 2009 - link

    Did this article go through an editor? Reply
  • chizow - Friday, January 02, 2009 - link

    Kind of surprising you didn't directly address this given the amount of FUD being thrown around with regards to PhysX, particularly from AMD and its supporters. You indirectly answered what I had already suspected however, that given Nvidia has stated they plan CUDA to be fully portable to both OpenCL and DX11 there should also be no portability issues for AMD and Brook+:

    quote:

    AMD could make an investment in the CUDA for C language and create either their own compiler (nothing is stopping them). But then you still have the same problem of interoperability as if NVIDIA implemented Brook+. If NVIDIA or AMD want to make their solution work with the other guy, they would need to write a wrapper to translate CAL to PTX or PTX to CAL.


    I'm guessing the unfinished thought from the first sentence should read something like "or write a CUDA to Brook+ wrapper" as thats essentially what the last part suggests. Since both vendors will write wrappers for their code to OpenCL, perhaps this wrapper could pull double duty, although it would double the amount of transcoding needed. Less than efficient for sure, but certainly better than a complete impasse due to incompatibility.


    Reply
  • ltcommanderdata - Friday, January 02, 2009 - link

    Are you suggesting that hardware PhysX acceleration will come to AMD GPUs as soon as nVidia and AMD enable hardware OpenCL support? Because I don't think it's that simple.

    nVidia seems to have rebranded the meaning of CUDA. Maybe it's all just marketing speak, but CUDA before seemed to mean using nVidia GPUs for GPGPUs operation in general. But now since OpenCL, CUDA seems to more specifically related to the GPGPU interface to nVidia GPUs with languages being separate on top, namely OpenCL, DX11 and C for CUDA. If PhysX is written in C for CUDA, which it no doubt is seeing there wasn't anything else available up to now, then adding support for the OpenCL language in the CUDA interface layer won't help get PhysX supported on AMD GPUs. PhysX will still be written in nVidia's proprietary language which AMD GPUs can't understand. To support AMD GPUs, either nVidia will have to rewrite PhysX from C for CUDA to OpenCL, which would be awfully generous of them or AMD will have to make a C for CUDA to CAL translator and hope PhysX doesn't have any nVidia hardware specific optimizations, which it no doubt has, to mess things up.
    Reply
  • apanloco - Friday, January 02, 2009 - link


    Anyone knows if multiple applications can take advantage of OpenCL at the same time? I think OpenGL is exclusive to one application, but if OpenCL is used by regular applications this could be a problem?
    Reply
  • yyrkoon - Thursday, January 01, 2009 - link

    "With R580 AMD (then ATI) actually published part of their ISA and called the initiative CTM (for Close to Metal). Before we had a beta version of CUDA, we had folding@home GPU accelerated on R520 and R580"

    I also read an interview through gamedev.net where ATI was emulating Direct 3D 10 calls in hardware on one of their x1900xtx's ( Direct 3D 9 hardware )long before I heard about folding@home on the GPU. I remember being so impressed with the technology, that I could not wait until Vista + Directx 10 titles became available. Too bad that there are so few ( if any ) titles that currently take advantage of this technology in the ways I had hoped. Hopefully that will change soon.
    Reply
  • ltcommanderdata - Thursday, January 01, 2009 - link

    http://www.tgdaily.com/content/view/38764/140/">http://www.tgdaily.com/content/view/38764/140/

    It's interesting that you mentioned that AMD and nVidia look to be continuing to push their proprietary GPGPU solutions, but AMD has actually made statements they are abandoning their proprietary CTM GPGPU implementation and are moving fully to OpenCL. Admittedly, its probably just a realization that CTM isn't taking off as fast as CUDA and it's in their best interest to push OpenCL. In comparison, nVidia will continue to develop their own CUDA implementation alongside OpenCL.

    I wonder if you can get a statement from nVidia whether they will move PhysX to OpenCL? Right now I believe PhysX is written in C for CUDA and of course requires nVidia GPUs for hardware acceleration. If they moved to OpenCL, then AMD GPUs would support it as well. Although perhaps nVidia prefers to keep PhysX to themselves as a product differentiator.

    It'd also be interesting if you could ask AMD whether older GPUs like the X1600, X1800, and X1900 will be supported in OpenCL? You already pointed out in your article that the RV530, R520, and R580 had GPGPU folding@home clients so they are certainly capable of GPGPU operation. It'd probably be in ATI's own interest to have as large an OpenCL base as possible and ATI's original FireStream dedicated GPGPU card was R580 based as well. Apple could probably help them as well seeing the number of X1600 and X1900 used in various iMac, MacBook Pro, and Mac Pro generations that could use support for OpenCL in Snow Leopard.

    And I agree with melgross that it's strange Apple got no mention in the article seeing that they pretty much developed OpenCL, then submitted it to Khronos, and was no doubt a major driving force behind the quick ratification in order to get it ready for Snow Leopard. And I believe Apple's Aaftab Munshi was the chair of the OpenCL working group.
    Reply
  • danger22 - Thursday, January 01, 2009 - link

    i am looking forward to the day when I can run my finite element simulations on my GPU. come on Ansys its time for a GPGPU Multiphysics! Reply
  • Amiga500 - Thursday, January 01, 2009 - link

    Same boat, same boat... with both CFD and FEA.

    Have you heard of FEAST-GPU (from Dortmund university)?

    Its a GPU accelerated FE package - unfortunately it isn't out in the public domain yet.



    Anyhow - from my own digging, I'm not sure if the CPU is a major bottleneck for FE simulations - a lot of what I see tends to point towards the hard-drive and I/O performance.
    Reply
  • Sheep100 - Sunday, January 04, 2009 - link

    If you provide enough RAM to the analysis you definitely end up CPU limited for single core runs. We have 24 - 32 GB per node for Abaqus and Nastran analyses. The nodes get RAM - bandwidth limited when stepping up the number of cores used or the number of concurrent runs on a node. We are looking forward to the core i7/Nehalem Xeon systems coming soon that will provide a big improvement here. (These codes run slower on Opteron cores.)

    GPGPU versions of Abaqus, Nastran & Ansys would be very interesting given the large memory bandwidth available on the high end cards. I suspect that re-writing & validating the various solver algorithms to target OpenCL would be a long process. I'm also unsure how possible it is to get data parallelism out of them since the scaling rate of Abaqus, for example, on multi-core systems, even with good bandwidth, is not anywhere near linear. Although this might just highlight the deficiency of the current method of extracting parallelism.
    Reply

Log in

Don't have an account? Sign up now