A Look at Altera's OpenCL SDK for FPGAs

by Rahul Garg on 10/9/2013 8:00 AM EST

56 Comments


  • BryanC - Wednesday, October 09, 2013 - link

    Thanks for the article. Are you planning a follow up where you write some programs and measure performance? I'm curious to see how it compares when you actually try to use it. Reply
  • rahulgarg - Wednesday, October 09, 2013 - link

    It is on my to-do list. Will have to ask Altera if they are up for it. Not sure if they are used to being covered and benched by websites such as Anandtech :P. I think it is likely new territory for both us and them.

    Also, the experimental design will have to be careful. Doing an experiment would involve tuning the kernels for each device first. So even if we assume that I do get some hardware, it will certainly be a time-consuming process.
    Reply
  • Kevin G - Wednesday, October 09, 2013 - link

    I'd be curious to see the raw initial result. Knowing what you can get by recycling your OpenCL code is of interest to parties that don't have the resources to do a good port. Reply
  • rahulgarg - Wednesday, October 09, 2013 - link

    Thanks for the feedback. If I do testing, I will keep that in mind. Reply
  • toyotabedzrock - Wednesday, October 09, 2013 - link

    If you do a followup could you explain vectorization in more detail? Your other explanations were very understandable. Reply
  • vladx - Friday, October 11, 2013 - link

    Vectorization, in his example, simply means adding all of the vector's elements at once instead of doing it iteratively with a loop, so the algorithm's time is constant (1) instead of linear (n). Reply
  • Brutalizer - Tuesday, October 15, 2013 - link

    Vectorization is done like this. Compare this non-vectorized code:
    for (i = 0; i < 10000; i++)
        A[i] = B[i] + C[i];

    To this vectorized code:
    A = B + C
    Here, A, B and C are vectors, so you can add every element at once. You don't have to add one element at a time; you add whole vectors in one operation instead of a lot of scalars. Each of the GPU's many processors adds one element, all at the same time, and thus you have vectorized code.
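A minimal Python sketch of the comparison above, meant as a model and not as real GPU or FPGA code. `add_scalar` handles one element per step, while `add_vectorized` handles `width` elements per step, the way a group of parallel lanes would. The function names and the `width` parameter are assumptions made for illustration. Under this model the step count falls from n to roughly n / width, and it only reaches a single step when the hardware is as wide as the vector.

```python
def add_scalar(b, c):
    """Add two equal-length lists one element at a time; return (result, steps)."""
    if len(b) != len(c):
        raise ValueError("inputs must have the same length")
    a = []
    steps = 0
    for i in range(len(b)):
        a.append(b[i] + c[i])
        steps += 1  # one element is produced per step
    return a, steps


def add_vectorized(b, c, width):
    """Model a vector unit that adds `width` elements per step.

    `width` is a hypothetical lane count, not a property of any real device.
    """
    if len(b) != len(c):
        raise ValueError("inputs must have the same length")
    if width < 1:
        raise ValueError("width must be at least 1")
    a = []
    steps = 0
    for start in range(0, len(b), width):
        # In hardware these `width` additions would happen simultaneously;
        # the comprehension below only models that single wide step.
        chunk_b = b[start:start + width]
        chunk_c = c[start:start + width]
        a.extend(x + y for x, y in zip(chunk_b, chunk_c))
        steps += 1
    return a, steps
```

Both functions return the same sums; only the number of modeled steps differs, which is the part that maps to execution time.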
    Reply
  • GNUminex - Wednesday, October 16, 2013 - link

    Your post somewhat goes against my knowledge of FPGAs. FPGA performance is a result of the number of slices of your FPGA, the max frequency, the HDL compiler's optimization capabilities and your code. What exactly could you test other than the performance of OpenCL versus traditional hardware description languages on the same FPGA? If you are comparing an FPGA to a GPU you might as well also compare them to a CPU, because the optimal applications of each piece of hardware are completely different. Reply
  • esoel - Wednesday, October 09, 2013 - link

    Interesting stuff but the article would be _so_much_ better with some hands on and benchmarks… Altera don't be cheap, send this guy a review unit! ;-) Reply
  • rahulgarg - Wednesday, October 09, 2013 - link

    Yes, I do think doing actual benchmarking should ideally be next on the list but do see my reply above. Reply
  • kishonti - Wednesday, October 09, 2013 - link

    We've tested Altera's OpenCL SDK : https://twitter.com/KishontiI/status/3647712482716...
    Most of the more complex tests (with multiple kernels) had size issues to fit on the chip. As compilation takes literally hours, the compile/debug cycles are much harder to manage.
    Reply
  • rahulgarg - Wednesday, October 09, 2013 - link

    Thanks for the extremely informative datapoints! Reply
  • Todd Thompson - Wednesday, October 09, 2013 - link

    kishonti, thanks for posting/tweeting about your benchmark...you mention a long compile time...is this something that you think could be pushed to the cloud and compiled on more robust hardware...if so, is this something that you would actually do? I might be able to help if you are interested...thanks again Reply
  • kishonti - Wednesday, October 09, 2013 - link

    Currently the OpenCL SDK runs within Altera's Quartus toolchain, so I don't think it is possible to run this in the cloud. We used a relatively powerful 2 x 8 core Xeon workstation, but the compile process did not scale much - it used 1 or 2 cores most of the time.
    Obviously we tested the code first on GPUs and CPUs (hundreds of them, actually) but it was still a trial and error process, because only after several hours of number crunching do we learn whether our kernel fits or not. This could be still faster than building a VHDL model from scratch...
    Reply
  • Jaybus - Thursday, October 10, 2013 - link

    It is a decision problem, similar to the problem of routing traces on a chip such that the length of the longest trace is minimized, also known as the traveling salesman problem. So it belongs to a class of problems known as NP-complete. The NP stands for Non-deterministic Polynomial time. We express the complexity of most algorithms using "Big O" terminology, but we cannot do so for these problems due to their non-deterministic nature. Actually, whether or not it is even possible to solve these problems quickly is one of the principal unsolved problems of computer science. I'm not saying that the compiler doesn't come up with a correct solution, only that it must do so basically by brute force trial and error. Deterministic problems can be broken into independent parts and processed in parallel. Not so for non-deterministic problems, and so it doesn't scale.

    That said, it is possible to break it into parts, calculate in parallel, then check for conflicts. If a conflict is found, throw that one out and repeat until you find one without conflicts. There still is no way to determine if it is optimal, but you can repeat the process until you find N solutions and pick the best one. Currently, trial and error is the best solution. It could even be the only solution. Some very smart people are working on the problem, but nobody has a solution yet.
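The "find N solutions and pick the best one" approach can be sketched with a toy placement problem in Python. This is only an illustration under stated assumptions: blocks are dropped onto distinct cells of a small grid, the cost is the total Manhattan wire length of two-pin nets, and all names (`wirelength`, `random_placement`, `best_of_n`) are hypothetical. It is not how Quartus or any real place-and-route tool works.

```python
import random


def wirelength(placement, nets):
    """Total Manhattan distance over all two-pin nets."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total


def random_placement(blocks, grid_w, grid_h, rng):
    """Assign each block a distinct grid cell, so no two blocks conflict."""
    cells = [(x, y) for x in range(grid_w) for y in range(grid_h)]
    if len(blocks) > len(cells):
        raise ValueError("design does not fit on the grid")
    chosen = rng.sample(cells, len(blocks))
    return dict(zip(blocks, chosen))


def best_of_n(blocks, nets, grid_w, grid_h, seeds):
    """Try one random placement per seed and keep the cheapest one."""
    best_cost, best_placement = None, None
    for seed in seeds:
        rng = random.Random(seed)  # each attempt is independent of the others
        placement = random_placement(blocks, grid_w, grid_h, rng)
        cost = wirelength(placement, nets)
        if best_cost is None or cost < best_cost:
            best_cost, best_placement = cost, placement
    if best_placement is None:
        raise ValueError("at least one seed is required")
    return best_cost, best_placement
```

Because every attempt depends only on its own seed, the attempts could run on separate cores. More seeds can only keep or lower the best cost found, but nothing here tells you whether that cost is optimal.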
    Reply
  • Alexey.Martin - Friday, November 08, 2013 - link

    kishonti, do you have any actual results from Altera's OpenCL testing? Reply
  • chowyuncat - Wednesday, October 09, 2013 - link

    Is it viable to iteratively test on a GPU and only compile once at the end for an FPGA? Reply
  • dneto - Wednesday, October 09, 2013 - link

    Yes. See another of my comments. Reply
  • tuxfool - Wednesday, October 09, 2013 - link

    Not really. The GPU is running software. An FPGA, however, is effectively generating hardware to process a particular algorithm.

    The generation of this hardware is subject to a great deal of optimization in terms of clock signals available, availability of logic cells etc.
    Reply
  • tuxfool - Wednesday, October 09, 2013 - link

    Well, apparently you can. But what happens when your program uses a kernel that is unsynthesizable in the user's FPGA? Any further iteration will need to be done using the FPGA....right? Reply
  • Atiom - Wednesday, October 09, 2013 - link

    Great article. I was thinking about using FPGAs in my projects, where I mainly use microcontrollers, but I still haven't done it because I haven't had the time to learn VHDL. But now with OpenCL, things may get more interesting; I just hope these devices get more affordable. It would be nice if you could keep up this kind of article. Reply
  • Jon Tseng - Wednesday, October 09, 2013 - link

    Tx for the piece. Interesting that Altera say much the same thing about high performance compute when I speak to them.

    Rahul, curious on your thoughts about whether CUDA is a barrier to adoption here. NVIDIA have done a lot driving adoption and supported users. Is this a barrier to switching code to OpenCL? Or are you thinking about FPGA for stuff currently running on x86 or greenfield work?
    Reply
  • Todd Thompson - Wednesday, October 09, 2013 - link

    Rahul, thanks for this article...you did a great job of messaging the value and use-case for using an FPGA for compute. Please keep up the good work and write more about FPGAs and OpenCL! Reply
  • Todd Thompson - Wednesday, October 09, 2013 - link

    As an aside, I'm working on the Zedboard/Zynq/ARM platform to experiment with using FPGA as a co-processor on an SOC. I will be doing some benchmarking by comparing results of b+ tree database indexing with and without Zynq as co-proc. I cannot wait for Xilinx to support OpenCL and overall OpenCL support for less expensive FPGA products. Reply
  • dneto - Wednesday, October 09, 2013 - link

    Hi, this is David from Altera. :-)

    Good article, and thanks for the shout-out.

    Regarding the development cycle. One of the great things about a standard like OpenCL is that you can prototype your code on a CPU or a GPU and then port it to the FPGA. You do have to watch that you use a common subset of the features available on all platforms, but this will get you a long way toward a more comfortable development flow. You focus on getting a *working* program on CPU/GPU, and then move to the Altera FPGA to run and optimize. Altera publishes a programming guide to help you optimize for our devices. For OpenCL in general, it is well known that optimizing a kernel for absolute best results often requires recoding or restructuring your device code or data.

    Legalese FYI: The official name of our SDK is the "Altera SDK for OpenCL". OpenCL is a trademark of Apple, on license to Khronos.
    Reply
  • Araemo - Wednesday, October 09, 2013 - link

    I am actually really surprised I see no mention of LLVM in this article. It seems like this is the kind of job that LLVM is well-suited for, based on how many other implementations I've seen of taking one programming language in, and outputting another, more specific language.

    I wonder if LLVM IS involved, and they just aren't talking about it, or if LLVM isn't actually well-suited to this work, but merely easy to extend to arbitrary languages.
    Reply
  • dneto - Wednesday, October 09, 2013 - link

    David from Altera here.
    Yes, LLVM is part of our compiler toolchain. It's one of many technologies, open source and proprietary, used in our SDK.
    LLVM is a compiler toolkit, with some finished backends. Using LLVM gets you a long way to supporting an OpenCL C compiler. But it doesn't get you the whole way.
    Reply
  • Araemo - Wednesday, October 09, 2013 - link

    Thanks for the response - I definitely understand that you still have to write significant portions of it to make it output sensible (and efficient) Verilog, but like you said, LLVM is designed with the kind of modularity that makes swapping output backends to add, say, VHDL support easier, and based on other projects I've seen that were made 'possible' by LLVM, I would have been surprised if you ignored it and rolled your own entirely. :) Reply
  • MrSpadge - Wednesday, October 09, 2013 - link

    It could give Altera a huge push if your FPGAs could provide break-through efficiency in any BOINC projects using OpenCL. There are a few, POEM@home, Einstein@home and Collatz@home come to mind, but there are probably more. OpenCL itself is supported by BOINC and currently detects AMD, nVidia and Intel GPUs. But having integrated support for this many coprocessors I'd expect further additions to be smooth.

    Currently spending a few thousand bucks on hardware just for number crunching would be asking for a lot. Current GPUs only cost hundreds of $/€.. but there are quite a few people out there buying significantly more than 1 of them. So the money is there. And electricity cost is a serious concern: e.g. in Germany you pay approximately as much as the GPU cost each year just to keep it crunching 24/7.
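The electricity claim can be checked with a few lines of Python. The function name is illustrative, and the power draw and price in the usage example are assumed values for the sake of the arithmetic, not measurements or figures from this thread.

```python
HOURS_PER_YEAR = 24 * 365  # running 24/7 for a non-leap year


def annual_energy_cost(power_watts, price_per_kwh):
    """Cost of running a device continuously for one year.

    power_watts:   average power draw of the device in watts
    price_per_kwh: electricity price per kilowatt-hour, in any currency
    """
    if power_watts < 0 or price_per_kwh < 0:
        raise ValueError("inputs must be non-negative")
    kwh_per_year = power_watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * price_per_kwh


if __name__ == "__main__":
    # Hypothetical inputs: a 250 W card and 0.28 EUR per kWh.
    print(round(annual_energy_cost(250, 0.28), 2))
```

With those assumed inputs the yearly bill lands in the same range as the price of a mid-to-high-end GPU, which is the point being made; halving the power draw halves the bill.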

    So if Altera can be more efficient than GPUs they could offer cheaper and smaller FPGAs, which might cost 100 - 500 $/€, perform as fast as a GPU (the chip could be smaller for a healthy profit margin, if the algorithm is suitable) and thereby consume significantly less energy.. they'd have a winner!
    Reply
  • MrSpadge - Wednesday, October 09, 2013 - link

    BTW: if the larger FPGAs could thereby be made cheaper there'd very probably also be a market for them. People are even buying Titans just for BOINC, despite them being significantly worse in cost per performance than smaller nVidias. Reply
  • MrSpadge - Wednesday, October 09, 2013 - link

    BTW 2: David, you might want to contact Slicker, the admin of Collatz@Home. His project is fairly simple (and not that useful.. but people like it nevertheless) and has regularly been at the forefront of new technology (CUDA, ATI Stream, OpenCL, Intel GPUs..). Usually he's also very responsive. I could imagine a deal like: you give him access to your hardware, and if he succeeds you could get loads of publicity (attracting buyers and further developers) and quite a few sales. Reply
  • viv32 - Wednesday, October 09, 2013 - link

    Application driven reconfigurable hardware is an exciting idea. I am not sure how dense the FPGA should be to support the complexity of today's GPUs (if they want FPGAs to replace GPU ASICs). We design network processors and our FPGA emulation boards need at least 4 Stratix devices for complete emulation. If the FPGA gate count can match the GPU then can they still be cost effective? My2c .. please correct me if I'm wrong (I'm no FPGA jockey). Reply
  • rahulgarg - Wednesday, October 09, 2013 - link

    Well it depends. The objective in this case isn't to emulate the GPU at all. If the GPU is actually already a very good fit for your application, then going to FPGAs won't gain you much. But let us say in an application that does not use the GPU's texture units, you don't really want to generate texture units on an FPGA. The idea isn't to emulate the GPU's units or its pipeline, rather it is to generate a *different* pipeline that is more suitable for your application. Reply
  • wyx087 - Wednesday, October 09, 2013 - link

    The benefit of using hardware description languages such as VHDL is just that: it describes the hardware, forcing you to think in terms of the cells that get placed down. OpenCL is a compute language; its programmers won't take into account something as simple as the fact that multipliers are very expensive in hardware unless done in powers of 2.
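A small Python sketch of the power-of-2 point. It models only multiplication by a compile-time constant, built from shifts and adds: a shift by a fixed amount is just wiring in hardware, so the adder count is a rough proxy for cost. The function name and the cost model are assumptions for illustration, not a description of what any synthesis tool actually emits.

```python
def multiply_by_constant(x, k):
    """Multiply x by a non-negative constant k using only shifts and adds.

    Returns (product, adders), where `adders` counts the additions needed
    to combine the shifted copies of x.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    terms = []
    shift = 0
    while k:
        if k & 1:
            terms.append(x << shift)  # a fixed shift costs no logic, only routing
        k >>= 1
        shift += 1
    adders = max(len(terms) - 1, 0)
    return sum(terms), adders
```

For example, `multiply_by_constant(13, 8)` needs no adders because 8 has a single set bit, while `multiply_by_constant(13, 10)` needs one adder to combine `13 << 1` and `13 << 3`.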

    Also, a vast number of university courses do VHDL/Verilog/SystemVerilog as standard. Electronics is the course title. I have no doubt the number of HDL experts on this planet is much greater than the number of OpenCL experts.

    The way around "slow compile" is simulation. I see no mention of simulation tools for designing OpenCL on FPGA. Without simulation tools, it is impossible for this to take off. Simulation is the way we verify our design on a functional level.

    The "compile" (it is known as synthesis and implementation or map, place and route) time is indeed in hours for large designs. Remember you are not just generating a binary for a processor, you are generating a binary file that describes the actual hardware. Put it simply, you are generating THE processor.

    - Professional VHDL programmer
    Reply
  • loki1725 - Sunday, October 13, 2013 - link

    This is actually what I was going to say. When I was an undergrad in EE (1997-2001) our embedded electronics course used VHDL. I taught in the EE department of a different university from 2009 to 2013 and we offered several courses that used VHDL. While there may not be more VHDL courses than OpenCL courses, the numbers are probably comparable.

    Still, really cool article, and anything that helps drive the adoption of FPGAs is a good step forward.
    Reply
  • toyotabedzrock - Wednesday, October 09, 2013 - link

    So the compile time happens beforehand, but how long does it take for the FPGA to configure itself when you run a program? Reply
  • rahulgarg - Wednesday, October 09, 2013 - link

    Well once you have done the compilation, my understanding is that flashing the binary is actually very fast so that is not an issue. Reply
  • John32 - Wednesday, October 09, 2013 - link

    Are you saying Altera doesn't provide a simulation stage for testing functionality? That's done before generating the binary file for all designs. Generating the binary file is the last thing you do after verifying everything works functionally.

    You say Altera generates Verilog code, and I assume that then goes through their standard synthesis, place and route tools. I don't see why you can't do a software and "hardware" (i.e. the Verilog code) co-simulation. That's what is normally done during verification. I have C/C++ code that talks to the Verilog code. The C/C++ code is compiled to a binary file and the Verilog code is compiled within an HDL simulator. Then the entire thing is simulated together. Once that checks out, I generate the binary file and load it into the FPGA. I use the same C/C++ code but now with the actual FPGA.
    Reply
  • John32 - Wednesday, October 09, 2013 - link

    Also, the whole "will it fit into the FPGA" issue is probably going to be a big problem for the likely target audience for this. You have no idea how the OpenCL code is being translated into hardware (ie. gates, LUTs, flip-flops, etc.). That all depends on your code and Altera's software to hardware algorithm.

    This reminds me of Xilinx's System Generator for MATLAB. It's a nice and easy way to get scientists to test their algorithms in hardware to see a ballpark figure of how fast it can be but it's definitely not the way to go for a final product.
    Reply
  • John32 - Wednesday, October 09, 2013 - link

    I guess there's also the "will it meet timing" problem. What clock speeds does Altera use? Do they just use whatever clock speed they can achieve (ie. one design clocks at 400 MHz while another can only go 100 MHz)? Reply
  • ShieTar - Thursday, October 10, 2013 - link

    Altera is offering FPGAs in 4 different general speed-grades, which define a maximum clock frequency (between 525 and ~800 MHz). The actual frequency of the completed design will depend on the complexity of the design, and can sometimes be restricted to little more than half the maximum clock frequency.

    Of course the end-user can always decide to run at a lower performance, specifically if he has an input or output with a fixed data rate. Have a look at the data sheets if you want more detailed information on this topic:

    http://www.altera.com/literature/lit-stratix-v.jsp...
    Reply
  • John32 - Thursday, October 10, 2013 - link

    Yes, I'm aware of speed grades and the "up to" frequencies of FPGAs. My point is that writing OpenCL code for the FPGA won't be as transparent as programmers would like for it to be worthwhile. They'd need to consider how their code will translate to hardware no matter how good Altera claims their compiler is. This would result in applications that would run well on FPGAs but not necessarily well on traditional devices.

    Also, I'm sure people won't like that one revision of their application runs at 300 MHz while another runs at 100 MHz or doesn't fit into the FPGA. Being non-digital designers, they won't know why.

    It seems the applications that will benefit from this will be fairly limited in scope.
    Reply
  • kirsch - Thursday, October 10, 2013 - link

    > However, programming FPGAs has traditionally been difficult and requires expertise
    > in specialized "hardware description languages" (HDLs) like VHDL or Verilog.

    Another notable option is National Instruments' LabVIEW FPGA product. It allows programming in G, LabVIEW's relatively easy-to-use graphical programming language, and deploying to an FPGA, where code runs extremely fast and can leverage the inherent parallelism of the hardware.
    Reply
  • rahulgarg - Thursday, October 10, 2013 - link

    Thanks! Noted! Not being from a traditional FPGA background, I missed that somehow. If we do followup posts, will investigate LabVIEW as well. Reply
  • alxx - Sunday, October 13, 2013 - link

    problem with labview is the second you go commercial the costs/royalties add up like nothing else Reply
  • Rob94hawk - Saturday, October 12, 2013 - link

    I have no idea what it does but it sure does look cool. Reply
  • ghulands - Saturday, October 12, 2013 - link

    There was a video I watched on YouTube from the x264 devs talking about which algorithms they ported to OpenCL - https://www.youtube.com/watch?v=uOOOTqqI18A Reply
  • alxx - Sunday, October 13, 2013 - link

    See if you can get your hands on either a Xilinx ZedBoard or a Parallella board. Both have a Zynq chip (dual hard-core ARM Cortex-A9 + FPGA), so they can run Android or Linux with the FPGA providing custom peripherals. The Parallella board also has Adapteva's custom multicore micro, which can be programmed with OpenCL.

    http://www.parallella.org/board/
    http://www.xilinx.com/products/silicon-devices/soc...

    OpenCL on Parallella (Epiphany processor, not FPGA)
    http://www.parallella.org/2013/09/10/explorations-...
    http://www.parallella.org/2013/03/08/introduction-...

    Waiting for my parallella board to turn up.

    Xilinx provides C-to-FPGA tools in their Vivado suite, and their boards are usually a lot cheaper than Altera's (digilentinc.com has some of the cheapest FPGA boards), though Terasic provides some nice Altera-based boards.
    http://www.xilinx.com/products/silicon-devices/soc...
    http://www.terasic.com.tw/
    Reply
  • alxx - Sunday, October 13, 2013 - link

    looks like xilinx joined the opencl effort but no timeline for when they'll provide support :-(
    http://www.edn.com/electronics-blogs/fpga-gurus/44...
    Reply
  • moozoo - Sunday, October 13, 2013 - link

    Anyone looking at this seriously should read through all of Table 6 that starts on page 17 of the Altera SDK for OpenCL Programming Guide. Reply
  • ET - Tuesday, October 15, 2013 - link

    Thanks. That's enlightening. Doesn't support a lot of stuff. (And the table is on page 22 in the document I found.) Reply
  • jarjarbink - Tuesday, February 25, 2014 - link

    Can you post a link to the document? Reply
  • mike8675309 - Monday, October 14, 2013 - link

    I've been following FPGAs for the past 9 months or so and have also been stymied by the costs. Most recently I have been following the development of the Parallella, which was a Kickstarter project for a "supercomputer on a chip" development board. While not exactly an FPGA, it does have a supporting Zynq-7000 series dual-core ARM Cortex-A9 chip that has FPGA logic. And the main chip contains 16 cores that can operate concurrently on shared or different workloads. Not very mature for development yet, but OpenCL is one of the languages being targeted for it. Reply
  • 0xc000005 - Saturday, October 19, 2013 - link

    You didn't mention (afaict) one of the big benefits of using FPGAs - they require much less power (~ 10x less) than GPUs for the same amount of computation. Some readers may find this useful if they want to do gigantic computations under some type of cost constraint. Bitcoin miners moved from GPUs to FPGAs long ago since the cost of electricity is an important factor in Bitcoin mining.

    OpenCL isn't the only game in town for programming FPGAs either. Xilinx (Altera's main competitor) has a nice product called Vivado High-Level Synthesis where you can write your algorithms in C++. Whether this is a benefit or not remains to be seen - it's harder to design parallel algorithms in C++ than in OpenCL. It's important to be aware that there are a lot of algorithms that are not massively parallel, for which GPUs and OpenCL offer no speedup. This is where C++ is useful, since Vivado can take your sequential, easy-to-simulate algorithm and still make it N times faster - the value of N depends on your algorithm of course.

    More about which algorithms can be sped up using OpenCL can be found at http://www.hpcwire.com/hpcwire/2013-10-14/reprisin...
    Reply
  • chaos215bar2 - Sunday, October 20, 2013 - link

    Use of local memory in OpenCL generally also requires some kind of synchronization between the threads in a workgroup, as the entire point is to share information between these threads. (A basic example is loading some data into local memory so that all threads can operate on it. A synchronization point is necessary to ensure that all threads have loaded the data they're responsible for before any others attempt to use it.)

    I'm curious how Altera handles this, since it doesn't map to the pipeline model you described in an obvious way. If, for instance, there's only room for n instances (via vectorization or replication) of the pipeline on the FPGA, does that mean the workgroup size is n? If not, how is the state of one thread saved while waiting for others to reach the synchronization point, and then subsequently restored?
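The load-then-synchronize pattern described above can be modeled on a CPU with Python threads. In this sketch each thread plays one work-item, a shared list stands in for OpenCL local memory, and `threading.Barrier` plays the role of OpenCL C's `barrier(CLK_LOCAL_MEM_FENCE)`. The function `neighbor_sum` is a hypothetical example, and the sketch says nothing about how Altera's compiler actually implements barriers in a pipeline.

```python
import threading


def neighbor_sum(global_data):
    """Each work-item loads one element, then reads its right-hand neighbor's."""
    n = len(global_data)
    if n == 0:
        return []
    local_mem = [None] * n            # stands in for OpenCL local memory
    result = [None] * n
    barrier = threading.Barrier(n)    # one party per work-item in the group

    def work_item(lid):
        # Phase 1: every work-item loads the element it is responsible for.
        local_mem[lid] = global_data[lid]
        # Synchronization point: nobody proceeds until all loads are done.
        barrier.wait()
        # Phase 2: it is now safe to read an element another work-item loaded.
        result[lid] = local_mem[lid] + local_mem[(lid + 1) % n]

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Without the `barrier.wait()` call, a work-item could read its neighbor's slot before it was written and pick up `None`. Here the group size equals the thread count, which sidesteps the question raised above of what happens when the group is larger than the number of pipeline instances.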
    Reply
