NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10

Name: NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10
Item: NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10
Author: Anand Lal Shimpi & Derek Wilson

by Anand Lal Shimpi & Derek Wilson on November 8, 2006 6:01 PM EST

Posted in
GPUs

111 Comments | Add A Comment

111 Comments

Final Words

Back when Sony announced the specifications of the PlayStation 3, everyone asked if it meant the end of PC gaming. After all Cell looked very strong and NVIDIA's RSX GPU had tremendous power. We asked NVIDIA how long it would take until we saw a GPU faster than the RSX. Their answer: by the time the PS3 ships. So congratulations to NVIDIA for making the PS3 obsolete before it ever shipped, as G80 is truly a beast.

A single GeForce 8800 GTX is more powerful overall than a 7900 GTX SLI configuration and even NVIDIA's mammoth Quad SLI. Although it's no longer a surprise to see a new generation of GPU outperform the previous generation in SLI, the sheer performance we're able to attain because of G80 is still breathtaking. Being able to run modern day games at 2560x1600 at the highest in-game detail settings completely changes the PC gaming experience. It's an expensive proposition, sure, but it's like no other; games just look so much better on a 30" display at 2560x1600 that it makes playing titles at 1600x1200 seem just "ok". We were less impressed by the hardware itself than by gaming at 2560x1600 with all the quality settings cranked all the way up in every game we tried, and that is saying quite a lot. And in reality, that's what it's all about anyway: delivering quality and performance at levels never before thought possible.

Architecturally, G80 is a gigantic leap from the previous generation of GPUs. It's the type of leap in performance that's akin to what we saw with the Radeon 9700 Pro, and given the number of 9700 Pro-like launches we've seen, they are rare. Like 9700 Pro, we are able to enable features that improve image quality well beyond the previous generation, and we are able to run games smoothly at resolutions higher than we could hope for. And, like 9700 Pro, the best is yet to come.

With developers much more acclimated to programmable shader hardware, we expect to see a faster ramp in the availability of advanced features enabled by DirectX 10 class hardware. This is more because of the performance improvements of DX10 than anything else: game developers can create just about the same effects in SM3.0 that they can with SM4.0. The difference is that DX9 performance would be so low that features won't be worth implementing. This is different from the DX8 to DX9 transition where fully programmable shaders enabled a new class of effects. This time, DX10 simply removes the speed limit and straps on afterburners. The only fly in the ointment for DirectX 10 is the requirement that users run Windows Vista. Unfortunately, that means developers are going to be stuck with supporting both DX9 and DX10 hardware in their titles for some time, unless they simply want to eliminate Windows XP users as a potential market.

Much of the feature set for G80 can be taken advantage of through OpenGL on Windows XP today. Unfortunately, OpenGL has fallen out of use in games these days, but there are still a few who cling to its clean interface and extensibility. The ability to make use of DX10 class features is here today for those who wish to do so.

That's not to say that DX9 games won't see benefits from NVIDIA's new powerhouse. Everything we've tested here today shows incredible scaling on G80 and proves that a unified architecture is the way to go forward in graphics. More complex SM3.0 code will be capable of running on G80 faster than we've been able to see on G70 and R580, and we certainly hope developers will take advantage of that and start releasing games with the option to enable unheard of detail.

The bottom line is that we've got an excellent new GPU that enables incredible levels of performance and quality. And NVIDIA is able to do this while using a reasonable amount of power for the performance gained (despite requiring two PCIe power connectors per 8800 GTX). The chip is huge in terms of transistor count, and in terms of die area. Our estimates based on the wafer shots NVIDIA provided us with indicate that the 681 million transistor G80 die is somewhere between 480 and 530 mm^2 at 90nm. This leaves NVIDIA with the possibility of a spring refresh part based on TSMC's 80nm half-node process that could enable not only better prices, but higher performance and lower power as well.

While we weren't able to overclock the shader core of our G80 parts, NVIDIA has stated that shader core overclocking is coming. While playing around with the new nTune, overclocking the core clock does impact performance, but we'll talk more about this in our retail product review to be posted in the coming days.

With G80, NVIDIA is solidly in a leadership position and now we play the waiting game for ATI's R600 to arrive. One thing is for sure, if you were thinking about building a high end gaming system this holiday season, you only need to consider one card.

Performance with AA Disabled

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

111 Comments

View All Comments

Nightmare225 - Sunday, November 26, 2006 - link
Are the FPS posted in this article, Minimum FPS, Average FPS, or Maximum? Thanks!
multiblitz - Monday, November 20, 2006 - link
I enjoyed your reviews always a lot as they inclueded the video-capbilities for a HTPC on previous cards. Unfortunately this was this time not the case. Hopefully there will be a 2. Part covering this as well ? If so, it would be nice to make a compariosn on picture quality as well against the filters of ffdshow, as nvidia is now as well supporting postprocessing filters...
DerekWilson - Tuesday, November 21, 2006 - link
What we know right now is that 8800 gets a 128 out of 130 on HQV tests.

We haven't quite put together an HTPC look at 8800, but this is a possibility for the future.
epsil0n - Sunday, November 19, 2006 - link
I am not agree with this:

"It isn't surprising to see that NVIDIA's implementation of a unified shader is based on taking a pixel shader quad pipeline, and breaking up the vector units into 4 scalar units. Now, rather than 4 pixel quads, we see 16 SPs per "quad" or block of stream processors. Each block of 16 SPs shares 4 texture address units, 8 texture filter units, and an L1 cache."

If i understood well this sentence tells that given 4 pixels the numbers of SPs involved in the computation are 16. Then, this assumes that each component of the pixel shader is computed horizontally over 16 SP (4pixel x 4rgba = 16SP). But, are you sure??

I didn't found others articles over the web that speculate about this. Reading others articles the main idea that i realized is that a shader is computed by one and only one SP. Each vector instruction (inside the shader) is "mapped" as a sequence of scalar operations (a dot product beetwen two vectors is mapped as 4 MUD/ADD operations). As a consequence, in this scenario 4 pixels are computed only by 4 SPs.
DerekWilson - Tuesday, November 21, 2006 - link
Honestly, NVIDIA wouldn't give us this level of detail. We certainly pressed them about how vertices and pixels map to SPs, but the answer we got was always something about how dynamic the hardware is able to dynamically schedule the SPs optimally according to what needs to be done.

They can get away with being obscure about how they actually process the data because it could happen either way and provide the same effect to the developer and gamer alike.

Scheduling the simultaneous processing one vec4 MAD operation on 4 quads (16 pixels) over 4 groups of 4 SPs will take 4 clock cycles (in terms of throughput). Processing the same 16 pixels on 16 SPs will also take 4 clock cycles.

But there are reasons to believe that things happen the way we described. Loading components of 16 different "threads" (verts, pixels or whatever) would likely be harder on the cache than loading all 4 components of 4 different threads. We could see them schedule multiple ops from 4 threads to fill up each block of shaders -- like computing 4 consecutive scalar operations for 4 threads on 16 SPs.

At the same time, it might be easier to maximize SP utilization if 16 threads were processed on one block of SPs every clock.

I think the answer to this question is that NVIDIA knows, they didn't tell us, and all we can do is give it our best guess.
xtknight - Thursday, November 16, 2006 - link
This has been AT's best article in awhile. Tons of great, concise info.

I have a question about the gamma corrected AA. This would be detrimental if you've already calibrated your display, correct (assuming the game heeds to the calibration)? Do you know what gamma correction factor the cards use for 'gamma corrected AA'?
DerekWilson - Monday, November 20, 2006 - link
I don't know if they dynamically adjust gamma correction based on monitor (that would be nice though) ...

if they don't they likely adjusted for a gamma of either (or between) 2.2 or 2.5.

Also, thanks :-) There was a lot more we wanted to pack in, but I'm glad to see that we did a good job with what we were able to include.

Thanks,
Derek Wilson
bjacobson - Sunday, November 12, 2006 - link
This comment is unrelated, but could you implement some system where after rating a comment, on reload the page goes back to the comment I was just at? Otherwise I rate something halfway down and then have to spend several seconds finding where I just was. Just a little nuissance.

Thanks for the great article, fun read.
neo229 - Friday, November 10, 2006 - link

quote:
Both cards are extremely quiet during operation...

This is a very suspect quote. A card that requires two PCIe power connectors is going to dissipate a lot of heat. More heat means there must be a faster, louder fan or more substantial and costly heat sink. The extra costs associated with providing a truly quiet card mean that the bulk of manufacturers go with the loud fan option.
DerekWilson - Friday, November 10, 2006 - link
If manufacturers go with the NVIDIA reference design, then we will see a nice large heatsink with a huge quiet fan.

Really, it does move a lot of air without making a lot of noise ... Are there any devices we can get to measure the airflow of a cooling solution?

We are also seeing some designs using water cooling and theres even one with a thermo-electric (peltier) cooler on it. Manufacturers are going to great lengths to keep this thing running cool without generating much noise.

None of the 8 retail cards we are testing right now generate nearly the noise of the X1950 XTX ... We are working on a retail roundup right now, and we'll absolutely have noise numbers for all of these cards at load.

NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10

Final Words

Post Your Comment

111 Comments

View All Comments

Nightmare225 - Sunday, November 26, 2006 - link

multiblitz - Monday, November 20, 2006 - link

DerekWilson - Tuesday, November 21, 2006 - link

epsil0n - Sunday, November 19, 2006 - link

DerekWilson - Tuesday, November 21, 2006 - link

xtknight - Thursday, November 16, 2006 - link

DerekWilson - Monday, November 20, 2006 - link

bjacobson - Sunday, November 12, 2006 - link

neo229 - Friday, November 10, 2006 - link

DerekWilson - Friday, November 10, 2006 - link

Log in

Don't have an account? Sign up now