A More Efficient Architecture

GPUs, like CPUs, work on streams of instructions called threads. While high-end CPUs work on as many as 8 complicated threads at a time, GPUs handle many more threads in parallel.

The table below shows just how many threads each generation of NVIDIA GPU can have in flight at the same time:

                        Fermi   GT200   G80
Max Threads in Flight   24576   30720   12288

Fermi can't actually support as many threads in parallel as GT200. NVIDIA found that in GT200 the majority of compute cases were bound by shared memory size, not thread count. Thus thread count went down, and shared memory size went up in Fermi.

NVIDIA groups 32 threads into a unit called a warp (taken from the weaving term warp, the set of parallel threads stretched on a loom). In GT200 and G80, half of a warp was issued to an SM (streaming multiprocessor) every clock cycle. In other words, it took two clocks to issue a full 32 threads to a single SM.
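To put the table above in warp terms, here's a quick back-of-the-envelope sketch in Python. The thread counts come straight from the table, and the 32-thread warp and half-warp-per-clock issue rate come from the text. The function and variable names are my own, purely for illustration.

```python
WARP_SIZE = 32                   # threads per warp, as stated in the article
HALF_WARP = WARP_SIZE // 2       # threads issued to an SM per clock in GT200/G80

# Max threads in flight, copied from the table in the article.
MAX_THREADS_IN_FLIGHT = {"Fermi": 24576, "GT200": 30720, "G80": 12288}


def warps_in_flight(threads: int) -> int:
    """Convert a thread count into the number of full warps it represents."""
    return threads // WARP_SIZE


def clocks_to_issue_warp(threads_per_clock: int) -> int:
    """Clocks needed to issue one full warp at a given per-clock issue width."""
    return -(-WARP_SIZE // threads_per_clock)  # ceiling division


for gpu, threads in MAX_THREADS_IN_FLIGHT.items():
    print(f"{gpu}: {threads} threads = {warps_in_flight(threads)} warps in flight")

print(f"Clocks to issue one warp at half a warp per clock: "
      f"{clocks_to_issue_warp(HALF_WARP)}")
```

Fermi works out to 768 warps in flight against GT200's 960, which is the same 20% drop you see in the raw thread counts.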

In previous architectures, the SM dispatch logic was closely coupled to the execution hardware. If you sent threads to the SFU (special function unit), the entire SM couldn't issue new instructions until those instructions were done executing. If the only execution units in use were in your SFUs, the vast majority of your SM in GT200/G80 went unused. That's terrible for efficiency.

Fermi fixes this. There are two independent dispatch units at the front end of each SM in Fermi. These units are completely decoupled from the rest of the SM. Each dispatch unit can select and issue half of a warp every clock cycle. The threads can be from different warps in order to optimize the chance of finding independent operations.

There's a full crossbar between the dispatch units and the execution hardware in the SM. Each unit can dispatch threads to any group of units within the SM (with some limitations).

The inflexibility of NVIDIA's threading architecture is that every thread in the warp must be executing the same instruction at the same time. If they are, then you get full utilization of your resources. If they aren't, then some units go idle.
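Here's a toy model of what that costs, written in Python. It is a simplification of my own, not NVIDIA's scheduler: I'm assuming a warp hits a single if/else, that both sides cost the same, and that the hardware runs each side in turn with the non-participating threads masked off. The name `lane_utilization` is made up for this sketch.

```python
from typing import Sequence

WARP_SIZE = 32  # threads per warp, as stated in the article


def lane_utilization(branch_taken: Sequence[bool]) -> float:
    """Fraction of thread slots doing useful work across an if/else.

    Assumption (mine): each side of the branch that at least one thread
    takes costs one pass over the whole warp, with other threads idle.
    """
    if len(branch_taken) != WARP_SIZE:
        raise ValueError("expected one flag per thread in the warp")
    taken = sum(1 for flag in branch_taken if flag)
    not_taken = WARP_SIZE - taken
    # One pass per side of the branch that is actually used.
    passes = (1 if taken else 0) + (1 if not_taken else 0)
    # Every thread does useful work in exactly one of those passes.
    return WARP_SIZE / (passes * WARP_SIZE)


uniform = [True] * WARP_SIZE
one_stray = [True] + [False] * (WARP_SIZE - 1)

print("all threads agree:", lane_utilization(uniform))
print("one thread diverges:", lane_utilization(one_stray))
```

Under these assumptions it doesn't matter whether one thread or sixteen go the other way: as soon as the warp splits, half the thread slots sit idle for that stretch of code.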

A single SM can execute:

Fermi           FP32   FP64   INT   SFU   LD/ST
Ops per clock   32     16     32    4     16

If you're executing FP64 instructions the entire SM can only run at 16 ops per clock. You can't dual issue FP64 and SFU operations.
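One way to read the table: divide the 32 threads in a warp by the per-clock rate to see how many clocks an SM needs to push one warp-wide instruction through each type of unit at peak rate. The Python sketch below does just that. It's my own simplification that ignores latency, dispatch limits and everything else in a real pipeline, and the names are hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as stated in the article

# Peak ops per clock for a single Fermi SM, copied from the table.
OPS_PER_CLOCK = {"FP32": 32, "FP64": 16, "INT": 32, "SFU": 4, "LD/ST": 16}


def clocks_per_warp(unit: str) -> int:
    """Clocks for one SM to run a single instruction across a full warp,
    assuming the unit type runs at its peak rate from the table."""
    return WARP_SIZE // OPS_PER_CLOCK[unit]


for unit in OPS_PER_CLOCK:
    print(f"{unit}: {clocks_per_warp(unit)} clock(s) per warp instruction")
```

By this measure a warp's worth of SFU work occupies the four SFUs for 8 clocks, while FP32 and INT get through a warp every clock.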

The good news is that the SFU doesn't tie up the entire SM anymore. One dispatch unit can send 16 threads to the array of cores, while another can send 16 threads to the SFU. After two clocks, the dispatchers are free to send another pair of half-warps out again. As I mentioned before, in GT200/G80 the entire SM was tied up for a full 8 cycles after an SFU issue.

The flexibility is nice, or rather, the inflexibility of GT200/G80 was horrible for efficiency and Fermi fixes that.

415 Comments

  • SiliconDoc - Thursday, October 1, 2009

    Jeezus, you're just that bright, aren't you.
    The article is dated September 19th, and "they scored a picture" from another website, that "scored a picture".

    Our friendly reviewer here at AT had the cards in his hands, on the bench, IRL.
    --
    I mean you have like no clue at all, don't you.
  • palladium - Thursday, October 1, 2009

    I agree. GPGPU has come a long way, but it's still in its infancy, at least in the consumer space (Badaboom and AVIVO both had bugs).

    I just want a card that can play Crysis all very high 19x12 4xAA @60fps. Maybe a dual-GPU GT300 can deliver that.
  • wumpus - Wednesday, September 30, 2009

    My first reaction after reading that the cost of double multiply would be twice that of a single was "great. Half the transistors will be sitting there idle during games." Sure, this isn't meant to be a toy, but it looks like they have given up the desktop graphics to AMD (and whenever Intel gets something working). Maybe they will get volume up enough to lower the price, but there are only so many chips TSMC can make that size.

    On second thought, those little green squares can't take up half the chip. Any guess what part of the squares are multiplies? Is the cost of fast double point something like 10% of the transistors idle during single (games)? On the gripping hand, the article makes the claim that "All of the processing done at the core level is now to IEEE spec. That’s IEEE-754 2008 for floating point math (same as RV870/5870)". If they seriously mean that they are prepared to include all rounding, all exceptions, and all the ugly, hairy corner cases that inhabit IEEE-754, wait for Juniper. I really mean it. If you are doing real numerical computing you need IEEE-754. If you don't (like you just want a real framerate from Crysis for once) avoid it like the plague.

    Sorry about the rant. Came for the beef on doubles, but noticed that quote when checking the article. Looks like we'll need some real information about what "core level at IEEE-754" means on different processors. Who provides all the rounding modes, and what parts get emulated slowly? [side note. Is anybody with a 5870 able to test underflow in OpenCL? You might find out a huge amount about your chip with a single test].
  • SiliconDoc - Wednesday, September 30, 2009

    I think I'll stick with the giant profitable green's proven track record, not your e-weened redspliferous dissing.
    Did you watch the NV live webcast @ 1pm EST ?
    ---
    Nvidia is the only gpu company with ONE BILLION DOLLARS PER YEAR IN R&D.
    ---
    That's correct, nvidia put into research on the Geforce, the whole BILLION ati loses selling their crappy cheap hot cores on weaker thinner pcb with near zero extra features only good for very high rez, which DOESN'T MATCH the cheapo budget pinching purchasers who buy red to save 5-10 bang for bucks...--
    --
    Now about that marketing scheme ?
    LOL
    Ati plays to high rez wins, but has the cheapo card, expecting $2,000 monitor owners to pinch pennies.
    "great marketing" ati...
    LOL
  • PorscheRacer - Wednesday, September 30, 2009

    Just so you know, ATI is a separate division in AMD (the graphics side obviously) and did post earnings this year. ATI is keeping the CPU side of AMD afloat for all intents and purposes. Is there a way to ban or block you? I was excited to read about the GF300 and expecting some good comments and discussion about this, and then you wrecked the experience. Now I just don't care.
  • Adul - Thursday, October 1, 2009

    silicon idiot is doing more harm than good. please ban him
  • SiliconDoc - Thursday, October 1, 2009

    The truth is a good thing, even if you're so used to lies that you don't like it.
    I guess it's good too, that so many people have tried so hard to think of a rebuttal to any or all of my points, and they don't have one, yet.
    Isn't that wonderful ! You fit that category, too.
  • SiliconDoc - Wednesday, September 30, 2009

    Do you think your LIES will pass with no backup ?
    " A.M.D. has struggled for two years to return to profitability, losing billions of dollars in the process.

    A.M.D., the No. 2 maker of computer microprocessors after Intel, lost $330 million, or 49 cents a share, in the second quarter. In the same period last year, it lost $1.2 billion, or $1.97 a share.

    Excluding one-time gains, A.M.D. says its loss was 62 cents a share. On that basis, analysts had predicted a loss of 47 cents a share, according to Thomson Reuters. Sales fell to $1.18 billion, down 13 percent. Analysts were expecting $1.13 billion."
    ---
    http://www.nytimes.com/2009/07/22/technology/compa...

    ATI card sales did increase a bit, but LOST MONEY anyway. More than expected.
    --
    PS I'm not sorry I've ruined your fantasy and exposed your lie. If you keep lying, should you be banned for it ?
  • PorscheRacer - Thursday, October 1, 2009

    http://arstechnica.com/hardware/news/2009/07/intel...

    Again, the graphics group of AMD turned a profit (albeit a small one after R&D and costs) while the other divisions lost money.
  • SiliconDoc - Thursday, October 1, 2009

    LOL- YOU'VE SIMPLY LIED AGAIN, AND PROVIDED A LINK, THAT CONFIRMS YOU LIED.
    It must be tough being such a slumbag.
    --
    " After the channel stopped ordering GPUs and depleted inventory in anticipation of a long drawn out worldwide recession in Q3 and Q4 of 2008, expectations were hopeful, if not high that Q1’09 would change for the better. In fact, Q1 showed improvement but it was less than expected, or hoped. Instead, Q2 was a very good quarter for vendors – counter to normal seasonality – but then these are hardly normal times.
    Things probably aren't going to get back to the normal seasonality till Q3 or Q4 this year, and we won't hit the levels of 2008 until 2010."

    As you should have a clue, noting, 2008 was bad, and they can't even reach that pathetic crash until 2010.
    An increase in sales from a recent prior full on disaster decrease, is still less than the past, is low in the present, and is " A LOSS " PERIOD.
    You don't provide text because NOTHING at your link claims what you've said, you are simply a big fat LIAR.
    Thanks for the link anyway, that links my link:
    http://jonpeddie.com/press-releases/details/amd-so...

    This is a great quote: " We still believe there will be an impact from the stimulus programs worldwide "
    LOL
    hahahhha - just as I kept supposing.
    " -Jon Peddie Research (JPR), the industry's research and consulting firm for graphics and multimedia"
    ---
    NOTHING, AT either link, describes a profit for ati graphics, PERIOD.

    Try again mr liar.
