Derek's Conjecture Regarding SP Pipelining and TMT

Temporal Multithreading

We know that the execution of instructions in each SP is pipelined. We know that the throughput of SPs is one instruction per clock cycle and that rather than stalling the pipeline the Tesla architecture usually doesn't need to wait for computational results or memory operations because it is highly likely another thread will be ready to execute. Context switches happen every four clocks from the perspective of the SPs within an SM, and in these four clocks each of the 32 threads in the currently active and executing warp will be serviced.

Organization of the executing threads isn't something we know except SMs process warps in two groups of 16 threads. That this point was made is of interest, but there are too many possibilites for us to come up with any real guess as to how they are issuing instructions to 8 physical SPs in groups of 16. Off hand we asked them if the SPs supported hardware level SMT (simultaneous multithreading) like hyperthreading and our answer was no but with a curious twist: "... they are pipelined processors and support many threads in progress in the pipeline." This was the light switch that brought together the realization of many potential advantages of this architecture for us.

If throughput is really one instruction per clock per SP, and each SP handles 4 threads from a warp over 4 clock cycles, then the pipeline is actually working on executing one instruction from a different thread at every stage in the pipeline.

This is actually a multithreading technique known as fine-grained TMT or temporal multithreading and is different than SMT in that it doesn't expose virtual parallel processors to the software but processes multiple threads in different time slices. TMT isn't some hot new technology you missed when it hit the scene: TMT is what computers have made use of for decades to process the many threads running on a single CPU concurrently without starving them. On a desktop CPU we are very familiar with course-grained multithreading where a single thread is serviced for a while before a context switch happens and another thread starts running. This context switch will normally occur after a certain number of cycles or if a higher priority thread needs the processor or if the thread needs to wait on IO or memory for something.

The real interesting bit comes in the differences between fine and course-grained TMT. In course-grained implementations (what we are all used to) all the pipeline stages of a processor are servicing an instruction stream from a single context, whereas in fine-grained we can have multiple context switches happening within the pipeline down to a context switch every stage. Making such fine-grained implementations happen can be tough, but NVIDIA has used a couple tricks to make it easier to manage.

In G80 and GT200, because of the fact that context is stored per warp, even though the SPs are working on an instruction for a different thread in every pipeline stage, they are not working on a different context at every pipeline stage. Each SP processes four threads in a row from the same warp and thus from the same context. Because it is incredibly likely at 1.5GHz that the SPs have more than 4 pipeline stages, we will still see more than one context switch within the pipeline itself, but it still isn't down to a different context for every stage.

So what's the big deal? Latency insensitivity and a maximal avoidance of pipeline bubbles and stalls.

In a modern CPU architecture, we see many instructions from the same thread running one after the other. If everything is running as smoothly as possible we have as many instructions retiring per clock cycle as we are capable of issueing per clock cycle, but this isn't gauranteed. Data dependancies, memory operations, cache misses and the like cause instructions to wait in the pipeline which means clock cycles go by without as much work as possible being done. Techniques to reduce this sort of delay are many. Data forwarding between pipeline stages is necessary to accomodate cases where one instruction is dependant on the result of the previous. This works by forwarding the result from one stage of the pipeline back to a previous stage so that instructions needing that data won't have to wait for it. Hyperthreading is even a technology to help increase pipeline utilization in that it makes one pipeline look like two different processors in order to fill it with more independent instructions and increase utilization.

Fine-grained TMT eliminates the need for data forwarding because there are zero dependant instructions coming down the pipeline: warps are context switched out after issuing one instruction for each independant thread and if NVIDIA's scheduler does its job right then warps won't be rescheduled until their data is available. Techniques like Hyperthreading are unnecessary because the pipeline is already full of instructions from independant threads at every stage.

Managing a pool of warps that are from a mix of different shader programs and different types of shaders (vertex, geometry and pixel) means that the chance every warp being serviced by an SM is wating on the same data is minimized, but having multiple warps from the same shader program is also a good idea to make sure that once data arrives it enables the processing of more than one warp. Of course, since SMs within one TPC share texture address, filter and cache, it is also a good idea to load up similar warps across the SMs on a TPC so that texture look ups by one thread might also be useful to many others. The balance here would be interesting to know, but we'll probably have to wait for Intel to enter the graphics market before we start getting confirmation on the really cool architectural aspects of all this.

How Deep is an SP?

As for pipeline depth, NVIDIA isn't helping us out with this one either, but let's walk through a little reasoning and see what we can come up with. At the insane and stupid extreme, we know NVIDIA wouldn't build a machine with a pipeline longer than they have threads in flight to fill. We'll assume G80 and GT200 are equally pipelined as they are clocked very similarly and we'll use what we know about G80 to draw a baseline. With G80 having 24 warps in flight per SM and each warp taking up 4 pipeline stages per SP, SPs can't possibly have more than 96 stages. Sure, that's crazy anyway, but if we expect that any warp executing in the pipeline won't be rescheduled until completion, then we would expect a higher proportion of warps to be waiting than executing.

If we go on this assumption we've got less than 48 stages, and I'd think it'd be fair to guess that they'd want to have at least two thrids of their their in-flight thread not in the pipeline, so that brings us down to a potential 32 stage pipeline. On the minimal end, there are at least 4 stages because if there were any less, high priority warps wouldn't get context switched at every opportunity: the instructions form the first threads scheduled would be completed and ready to go. Having 8 stages would give maximum flexibility as warps could be scheduled every other opportunity if they were otherwise ready. This would also keep at most three contexts active at different points in the pipeline, and while this type of fine-grained TMT does offer advantages, it is not free to implement a pipeline with access to a high number of contexts. And it is possible to design a single precision FP unit that can do a MAD in 8 cycles at 1.5GHz, but using Itanium as an example is usually seen as extravagant.

It would be tough to put a finer point on it without some indication from NVIDIA, but at least 8 and at most 32 stages is as good as we can get looking at their architecture. But knowing that power and performance per watt are key concerns of NVIDIA we can be fairly certain of eliminating anything higher than say 16 pipeline stages. Everyone remembers the space heater that was the Pentium 4 in general (and Prescott in particular), and it just isn't power efficient to go too deep.

By now we are at a fairly reasonable minimum of 8 stages and taking both architecture and power into consideration 16 seems like the max we could believe. Of course, that's all the way from one end of the world to the next. Anand's original guess was 12-15, but Derek was able to sell him on 8 stages as the sweet spot because of the simplicity of the cores (there are no decode or scheduling stages in the SPs). So was all that guessing about pipeline stages useful? Not really. But it sure was fun!

Now let's blow your mind and suggest that all this combined with the other details of NVIDIA's architecture suggest that all SP operations have the same latency. This way the entire thing would just work like a clock: one in, one out, very little overhead, and as simple as possible. All the overhead is managed outside the SP and the compute core can just focus on what it does best (as long as the rest of the chip does its job and keeps it fed).

UPDATE: We got lots of response on this page, and many CUDA developers, graphics software designers and hardware enthusiasts emailed us links to many resources on these topics. We discovered some very useful info: instruction latency is actually about 22 cycles in G80, so Anand and I were both way off. This and a couple other things we learned are available in our quick update on the GT200 pipeline published a couple days after this article first went live.

Tweaks and Enahancements in GT200 NVIDIA's Dirty Dealing with DX10.1 and How GT200 Doesn't Support it
POST A COMMENT

108 Comments

View All Comments

  • skiboysteve - Tuesday, June 17, 2008 - link

    FANTASTIC write up on fine-grained TMT. I was unaware about this threading technique and was always thinking of this in class or whenever someone would talk about hyperthreading. this technique was literaly in my head for well over a year and I didn't know what it was called or that it even had a name. I always thought there had to be a more elegant way than hyperthreading to do multithreading down at the chip level without doing the OS style time slicing.

    i was sitting there wondering how the hell the schedule and run these SPs and then bam whole page about it

    really appreciate the effort that goes into researching the core of these chips. i know not everyone likes it but for guys that are educated and work in the field its really interesting
    Reply
  • DerekWilson - Tuesday, June 17, 2008 - link

    remember though that this type of fine-grained TMT only has payoffs in systems running millions of threads concurrently.

    on an OS you'll see hundreds or even thousands of threads on heavily used systems, but there still wouldn't be enough concurrent action to justify this type of architecture for general purpose computing.

    of course, as developers push towards an effort to thread their code as much as possible, who knows what architectures might be worth exploring on the desktop ...
    Reply
  • coder0000 - Tuesday, June 17, 2008 - link

    Very well written! A couple of points:

    1) Last week at WWDC Apple announced OpenCL as an alternative to CUDA. It's a C99 based HLL for creating compute kernels that can be deployed to GPU's and CPU's. Today Khronos officially announced a working group for this, and NV is a part of the committee. As such, your wish for an industry standardized compute language similar to CUDA that runs on all platforms and vendors HW may not be so far off.

    2) I believe your interpretation of how multiple threads simultaneously execute in an SM is incorrect. Per thread context switching is not free, and you would never be able to execute a different thread every cycle in the manner described. There is far too much context that needs to be swapped out, and there would be significant power implications for doing that, in addition to the latency. Instead, I believe what NV is claiming is that any given SP executes a single thread. All threads in the SM can all be a single warp, but you can also have multiple threads (one per SP) all executing simultaneously in an SM.
    Reply
  • DerekWilson - Tuesday, June 17, 2008 - link

    1) I haven't had a good chance to look at OpenCL, but I certainly hope that if it's everything everyone is saying it is in the comments here that it takes off in a bigger way than CUDA :-)

    2) it does not context switch per thread -- warps define a context, and you have 32 threads grouped together. these threads all share the same instruction stream, which is why if threads in a warp take different directions on a branch all 32 threds must follow both paths.

    NVIDIA has flat out stated that every schedule clock a new warp is scheduled and that it takes 4 clock cycles to process one warp on an SM. For both of these to be true, we conclude that the scheduler alternates scheduling SPs and SFUs on altenating clocks which means the SPs would be scheduled every 4 clocks relative to itself.

    On 8 SPs per SM, you some how need to execute 32 threads in 4 clock cycles. This makes sense if you execute 4 threads per SP in some way. The details at this point are fuzzy though.

    regardless, if an SP executes 4 different threads from the same warp, there is no need to context switch to execute any of these threads -- again, threads in the same warp share context.
    Reply
  • skiboysteve - Tuesday, June 17, 2008 - link

    could be a large explanation of the 2x register file size. and remember that the SP doesn't have to worry about the context switch, the SM handles having the data in the right place Reply
  • anandtech02148 - Monday, June 16, 2008 - link

    From this conclusion, Amd seems to be the shrewd player, let nvidia and intel duke it out in the high voltage, heat, meaningless speed gpu while Amd can pull something like its first dualcore or athlon64 for the win.
    this new beast from Nvidia will have how many developers making games for it right away? i'm guestimating maybe 2yrs-4yrs down the road we'll see a decent title that take full advantage of this hardware.
    by then Amd will have something of a midrange that can more than handle the games.
    2 things nvidia could work on that it already has, the ps3 market, and small graphic devices to improve profits. shrink the ps3 gpu further so Sony can shrink it's machinel and sell more.

    Reply
  • PrinceGaz - Monday, June 16, 2008 - link

    The GT200 core may be a technical masterpeice in terms of actually making something that big which is fully functional on GTX280 cards, but it seems to me the penalty of fabbing it at 65nm negates much of the benefits of such a wide GPU.

    They've had to drop the clock speeds throughout presumably because of the ridiculous amount of heat such a large core generates, which means the ~60% performance advantage in current games over the G80 core at similar clock-speeds is somewhat reduced.

    Given that ATI are not producing their 55nm cores in AMD's fabs but instead are getting them churned out reliably elsewhere, nVidia have made a mistake this time around in having their high-end product rely on previous-generation fabrication as it makes it run too hot to allow the clock-speeds needed for it to be the product it should be. There is always a risk in transitioning to a smaller fab technology, and nVidia suffered badly in the past by doing so too early, but with a chip the size of the GT200, they really should have gone to 55nm even if it meant a delay of a month or three, whilst the smaller cut-down derivatives were rolled out first.
    Reply
  • ekpyr - Monday, June 16, 2008 - link

    Great article, but what about the microstuttering issues present in Nvidia's 9800GX2 cards (both SLI and Quad-SLI)? There is very little discussion on this, but I've seen some benchmarks where the FPS floor is 4fps with the 9800GX2s. Can you add a subjective review of whether or not the actual gameplay is smoother with the GTX280s across these games? Aggregate numbers may say one thing, but I've returned a 9800 GX2 Quad-SLI setup because it was unable to handle the incredible amount of texture loading that was done in Age of Conan (2560x1600 4xAA 'High' settings = 4fps). The 8800 GTX Tri-SLI configuration I am currently using is more resilient to microstuttering with its increased bus and memory capacities, but I'm very curious about the GTX280s and their increased memory and bus on texture-heavy games like Age of Conan. Reply
  • DerekWilson - Monday, June 16, 2008 - link

    the only game that came close to having this issue with quad sli for us was oblivion.

    in that game at high res lag and stutter are unbearable and the game is unplayable.

    we didn't notice any stuttering issues with a single GX2.

    i'm working on some analysis tools to show details like this better in future articles.
    Reply
  • TheJian - Monday, June 16, 2008 - link

    I find it humorous that nobody discusses the fact that the shrink has already taped out and will likely be out in two months or just after. This humongous chip was only released so that when AMD releases in the next few weeks they will be behind still in single GPU cards. This is basically what Intel does to AMD every time AMD has a better chip. For all intents and purposes this is a PAPER release of what will come in 2-2.5 months (In Intel's case they just show you what will be out 6 months from now, and a large portion of people don't buy an AMD because Intel might be ahead by xmas...LOL - works like a charm every time AMD is ahead). THE DIE SHRUNK CHIP! Most likely with faster speeds. I suspect they'll come with "ULTRA" version first (and stick it on top of the price heap, so as to not kill all FAT cards in the channel already) and then filter down as these big suckers leave the channel. That's if they even plan to sell more than a few of these to begin withat 65nm. It's only out there so AMD won't look any good in two weeks.

    MIND SHARE is everything, which is why Intel's KING of the paper launch when behind strategy. They've even went to doing it for all chips no matter what now. Nehalem scores 6 months before availability. AMD's marketers have no clue an should be fired. You have to play the same DIRTY game as your enemy or you've already lost. If AMD had half a brain in their head they'd paper launch an ultra or 2x4870 version for the same reason...LOL. Then claim "our 4870x2 makes nvidia look like crap for $600"...ROFL. Who cares when it's available, just say it. Having said that, Nvidia will wipe the floor with them in 2 months anyway on a 2xGTX280 that's die shrunk. Which is all they are doing today...BUYING TIME!
    Reply

Log in

Don't have an account? Sign up now