Original Link: http://www.anandtech.com/show/6877/the-great-equalizer-part-3

For the past several days I've been playing around with Futuremark's new 3DMark for Android, as well as Kishonti's GL and DXBenchmark 2.7. All of these tests are scheduled to be available on Android, iOS, Windows RT and Windows 8 - giving us the beginning of a very wonderful thing: a set of benchmarks that allow us to roughly compare mobile hardware across (virtually) all OSes. The computing world is headed for convergence in a major way, and with benchmarks like these we'll be able to better track everyone's progress as the high performance folks go low power, and the low power folks aim for higher performance.

The previous two articles I did on the topic were really focused on comparing smartphones to smartphones, and tablets to tablets. What we've been lacking however has been perspective. On the CPU side we've known how fast Atom was for quite a while. Back in 2008 I concluded that a 1.6GHz single core Atom processor delivered performance similar to that of a 1.2GHz Pentium M, or a mainstream Centrino notebook from 2003. Higher clock speeds and a second core would likely push that performance forward by another year or two at most. Given that most of the ARM based CPU competitors tend to be a bit slower than Atom, you could estimate that any of the current crop of smartphones delivers CPU performance somewhere in the range of a notebook from 2003 - 2005. Not bad. But what about graphics performance?

To find out, I went through my parts closet in search of GPUs from a similar time period. I needed hardware that supported PCIe (to make testbed construction easier), and I needed GPUs that supported DirectX 9, which had me starting at 2004. I don't always keep everything I've ever tested, but I try to keep parts of potential value to future comparisons. Rest assured that back in 2004 - 2007, I didn't think I'd be using these GPUs to put smartphone performance in perspective.

Here's what I dug up:

The Lineup (Configurations as Tested)
  Release Year Pixel Shaders Vertex Shaders Core Clock Memory Data Rate Memory Bus Width Memory Size
NVIDIA GeForce 8500 GT 2007 16 (unified) 520MHz (1040MHz shader clock) 1.4GHz 128-bit 256MB DDR3
NVIDIA GeForce 7900 GTX 2006 24 8 650MHz 1.6GHz 256-bit 512MB DDR3
NVIDIA GeForce 7900 GS 2006 20 7 480MHz 1.4GHz 256-bit 256MB DDR3
NVIDIA GeForce 7800 GT 2005 20 7 400MHz 1GHz 256-bit 256MB DDR3
NVIDIA GeForce 6600 2004 8 3 300MHz 500MHz 128-bit 256MB DDR

I wanted to toss in a GeForce 6600 GT, given just how awesome that card was back in 2004, but alas I had cleared out my old stock of PCIe 6600 GTs long ago. I had an AGP 6600 GT but that would ruin my ability to keep CPU performance in-line with Surface Pro, so I had to resort to a vanilla GeForce 6600. Both core clock and memory bandwidth suffered as a result, with the latter being cut in half from using slower DDR. The core clock on the base 6600 was only 300MHz compared to 500MHz for the GT. What does make the vanilla GeForce 6600 very interesting however is that it delivered similar performance to a very famous card: the Radeon 9700 Pro (chip codename: R300). The Radeon 9700 Pro also had 8 pixel pipes, but 4 vertex shader units, and ran at 325MHz. The 9700 Pro did have substantially higher memory bandwidth, but given the bandwidth-limited target market of our only cross-platform benchmarks we won't always see tons of memory bandwidth put to good use here.

The 7800 GT and 7900 GS/GTX were included to showcase the impacts of scaling up compute units and memory bandwidth, as the architectures aren't fundamentally all that different from the GeForce 6600 - they're just bigger and better. The 7800 GT in particular was exciting as it delivered performance competitive with the previous generation GeForce 6800 Ultra, but at a more attractive price point. Given that the 6800 Ultra was cream of the crop in 2004, the performance of the competitive 7800 GT will be important to look at.

Finally we have a mainstream part from NVIDIA's G8x family: the GeForce 8500 GT. Prior to G80 and its derivatives, NVIDIA used dedicated pixel and vertex shader hardware - similar to what it does today with its ultra mobile GPUs (Tegra 2 - 4). Starting with G80 (and eventually trickling down to G86, the basis of the 8500 GT), NVIDIA embraced a unified shader architecture with a single set of execution resources that could be used to run pixel or vertex shader programs. NVIDIA will make a similar transition in its Tegra lineup with Logan in 2014. The 8500 GT won't outperform the 7900 GTX in most gaming workloads, but it does give us a look at how NVIDIA's unified architecture deals with our two cross-platform benchmarks. Remember that both 3DMark and GL/DXBenchmark 2.7 were designed (mostly) to run on modern hardware. Although hardly modern, the 8500 GT does look a lot more like today's architectures than the G70 based cards.

You'll notice a distinct lack of ATI video cards here - that's not from a lack of trying. I dusted off an old X800 GT and an X1650 Pro, neither of which would complete the first graphics test in 3DMark or DXBenchmark's T-Rex HD test. Drivers seem to be at fault here. ATI dropped support for DX9-only GPUs long ago, the latest Catalyst available for these cards (10.2) was put out well before either benchmark was conceived. Unfortunately I don't have any AMD based ultraportables, but I did grab the old Brazos E-350. As a reminder, the E-350 was a 40nm APU that used two Bobcat cores and featured 80 GPU cores (Radeon HD 6310). While we won't see the E-350 in a tablet, a faster member of its lineage will find its way into tablets beginning this year.

Choosing a Testbed CPU

Although I was glad I could put some of these old GPUs to use (somewhat justifying them occupying space for years in my parts closet), there was the question of what CPU to pair them with. Go too insane on the CPU and I may unfairly tilt performance in favor of these cards. What I decided to do was to simulate the performance of the Core i5-3317U in Microsoft's Surface Pro. That part is a dual-core Ivy Bridge with Hyper Threading enabled (4 threads). Its max turbo is 2.6GHz for a single core, 2.4GHz for two cores. I grabbed a desktop Core i3 2100, disabled turbo, and forced its default clock speed to 2.4GHz. In many cases these mobile CPUs spend a lot of time at or near their max turbo until things get a little too toasty in the chassis. To verify that I had picked correctly I ran the 3DMark Physics test to see how close I came to the performance of the Surface Pro. As the Physics test is multithreaded and should be completely CPU bound, it shouldn't matter what GPU I paired with my testbed - they should all perform the same as the Surface Pro:

3DMark - Physics Test

3DMark - Physics

Great success! With the exception of the 8500 GT, which for some reason is a bit of an overachiever here (7% faster than Surface Pro), the rest of the NVIDIA cards all score within 3% of the performance of the Surface Pro - despite being run on an open-air desktop testbed.

With these results we also get a quick look at how AMD's Bobcat cores compare against the ARM competitors it may eventually do battle with. With only two Bobcat cores running at 1.6GHz in the E-350, AMD actually does really well here. The E-350's performance is 18% better than the dual-core Cortex A15 based Nexus 10, but it's still not quite good enough to top some of the quad-core competitors here. We could be seeing differences in drivers and/or thermal management with some of these devices since they are far more thermally constrained than the E-350. Bobcat won't surface as a competitor to anything you see here, but its faster derivative (Jaguar) will. If AMD can get Temash's power under control, it could have a very compelling tablet platform on its hands. The sad part in all of this is the fact that AMD seems to have the right CPU (and possibly GPU) architectures to be quite competitive in the ultra mobile space today. If AMD had the capital and relationships with smartphone/tablet vendors, it could be a force to be reckoned with in the ultra mobile space. As we've seen from watching Intel struggle however, it takes more than just good architecture to break into the new mobile world. You need a good baseband strategy and you need the ability to get key design wins.

Enough about what could be, let's look at how these mobile devices stack up to some of the best GPUs from 2004 - 2007.

We'll start with 3DMark. Here we're looking at performance at 720p, which immediately stops some of the cards with 256-bit memory interfaces from flexing their muscles. Never fear, we will have GL/DXBenchmark's 1080p offscreen mode for that in a moment.

Graphics Test 1

Ice Storm Graphics test 1 stresses the hardware’s ability to process lots of vertices while keeping the pixel load relatively light. Hardware on this level may have dedicated capacity for separate vertex and pixel processing. Stressing both capacities individually reveals the hardware’s limitations in both aspects.

In an average frame, 530,000 vertices are processed leading to 180,000 triangles rasterized either to the shadow map or to the screen. At the same time, 4.7 million pixels are processed per frame.

Pixel load is kept low by excluding expensive post processing steps, and by not rendering particle effects.

3DMark - Graphics Test 1

Right off the bat you should notice something wonky. All of NVIDIA's G70 and earlier architectures do very poorly here. This test is very heavy on the vertex shaders, but the 7900 GTX and friends should do a lot better than they are. These workloads however were designed for a very different set of architectures. Looking at the unified 8500 GT, we get some perspective. The fastest mobile platforms here (Adreno 320) deliver a little over half the vertex processing performance of the GeForce 8500 GT. The Radeon HD 6310 featured in AMD's E-350 is remarkably competitve as well.

The praise goes both ways of course. The fact that these mobile GPUs can do as well as they are right now is very impressive.

Graphics Test 2

Graphics test 2 stresses the hardware’s ability to process lots of pixels. It tests the ability to read textures, do per pixel computations and write to render targets.

On average, 12.6 million pixels are processed per frame. The additional pixel processing compared to Graphics test 1 comes from including particles and post processing effects such as bloom, streaks and motion blur.

In each frame, an average 75,000 vertices are processed. This number is considerably lower than in Graphics test 1 because shadows are not drawn and the processed geometry has a lower number of polygons.

3DMark - Graphics Test 2

The data starts making a lot more sense when we look at the pixel shader bound graphics test 2. In this benchmark, Adreno 320 appears to deliver better performance than the GeForce 6600 and once again roughly half the performance of the GeForce 8500 GT. Compared to the 7800 GT (or perhaps 6800 Ultra), we're looking at a bit under 33% of the performance of those cards. The Radeon HD 6310 in AMD's E-350 appears to deliver performance competitive with the Adreno 320.

3DMark - Graphics

The overall graphics score is a bit misleading given how poorly the G7x and NV4x architectures did on the first graphics test. We can conclude that the E-350 has roughly the same graphics performance as Qualcomm's Snapdragon 600, while the 8500 GT appears to have roughly 2x that. The overall Ice Storm scores pretty much repeat what we've already seen:

3DMark - Ice Storm

Again, the new 3DMark appears to unfairly penalize the older non-unified NVIDIA GPU architectures. Keep in mind that the last NVIDIA driver drop for DX9 hardware (G7x and NV4x) is about a month older than the latest driver available for the 8500 GT.

It's also worth pointing out that Ice Storm also makes Intel's HD 4000 look very good, when in reality we've seen varying degrees of competitiveness with discrete GPUs depending on the workload. If 3DMark's Ice Storm test could map to real world gaming performance, it would mean that devices like the Nexus 4 or HTC One would be able to run BioShock 2-like titles at 10x7 in the 20 fps range. As impressive as that would be, this is ultimately the downside of relying on these types of benchmarks to make comparisons - they fundamentally tell us how well these platforms would run the benchmark itself, not other games unfortunately.

At a high level, it looks like we're somewhat narrowing down the level of performance that today's high end ultra mobile GPUs deliver when put in discrete GPU terms. Let's see what GL/DXBenchmark 2.7 tell us.

GL/DXBenchmark 2.7 & Final Words

While the 3DMark tests were all run at 720p, the GL/DXBenchmark results run at roughly 2.25x the pixel count: 1080p. We get a mixture of low level and simulated game benchmarks with GL/DXBenchmark 2.7, the former isn't something 3DMark offers across all platforms today. The game simulation tests are far more strenuous here, which should do a better job of putting all of this in perspective. The other benefit we get from moving to Kishonti's test is the ability to compare to iOS and Windows RT as well. There will be a 3DMark release for both of those platforms this quarter, we just don't have final software yet.

We'll start with the low level tests, beginning with Kishonti's fill rate benchmark:

GL/DXBenchmark 2.7 - Fill Test (Offscreen)

Looking at raw pixel pushing power, everything post Apple's A5 seems to have displaced NVIDIA's GeForce 6600. NVIDIA's Tegra 3 doesn't appear to be quite up to snuff with the NV4x class of hardware here, despite similarities in the architectures. Both ARM's Mali-T604 (Nexus 10) and ImgTec's PowerVR SGX 554MP4 (iPad 4) do extremely well here. Both deliver higher fill rate than AMD's Radeon HD 6310, and in the case of the iPad 4 are capable to delivering midrange desktop GPU class performance from 2004 - 2005.

Next we'll look at raw triangle throughput. The vertex shader bound test from 3DMark did some funny stuff to the old G7x based architectures, but GL/DXBenchmark 2.7 seems to be a bit kinder:

GL/DXBenchmark 2.7 - Triangle Throughput, Fragment Lit (Offscreen)

Here the 8500 GT definitely benefits from its unified architecture as it is able to direct all of its compute resources towards the task at hand, giving it better performance than the 7900 GTX. The G7x and NV4x based architectures unfortunately have limited vertex shader hardware, and suffer as a result. That being said, most of the higher end G7x parts are a bit too much for the current crop of ultra mobile GPUs. The midrange NV4x hardware however isn't. The GeForce 6600 manages to deliver triangle throughput just south of the two Tegra 3 based devices (Surface RT, Nexus 7).

Apple's iPad 4 even delivers better performance here than the Radeon HD 6310 (E-350).

ARM's Mali-T604 doesn't do very well in this test, but none of ARM's Mali architectures have been particularly impressive in the triangle throughput tests.

With the low level tests out of the way, it's time to look at the two game scenes. We'll start with the less complex of the two, Egypt HD:

GL/DXBenchmark 2.5 - Egypt HD (Offscreen)

Now we have what we've been looking for. The iPad 4 is able to deliver similar performance to the GeForce 7900 GS, and 7800 GT, which by extension means it should be able to outperform a 6800 Ultra in this test. The vanilla GeForce 6600 remains faster than NVIDIA's Tegra 3, which is a bit disappointing for that part. The good news is Tegra 4 should be somewhere around high-end NV4x/upper-mid-range G7x performance in this sort of workload. Again we're seeing Intel's HD 4000 do remarkably well here. I do have to caution anyone looking to extrapolate game performance from these charts. At best we know how well these GPUs stack up in these benchmarks, until we get true cross-platform games we can't really be sure of anything.

For our last trick, we'll turn to the insanely heavy T-Rex HD benchmark. This test is supposed to tide the mobile market over until the next wave of OpenGL ES 3.0 based GPUs take over, at which point GL/DXBenchmark 3.0 will step in and keep everyone's ego in check.

GL/DXBenchmark 2.7 - T-Rex HD (Offscreen)

T-Rex HD puts the iPad 4 (PowerVR SGX 554MP4) squarely in the class of the 7800 GT and 7900 GS. Note the similarity in performance between the 7800 GT and 7900 GS indicates the relatively independent nature of T-Rex HD when it comes to absurd amounts of memory bandwidth (relatively speaking). Given that all of the ARM platforms south of the iPad 4 line have less than 12.8GB/s of memory bandwidth (and those are the platforms these benchmarks were designed for), a lack of appreciation for the 256-bit memory interfaces on some of the discrete cards is understandable. Here the 7900 GTX shows a 50% increase in performance over the 7900 GS. Given the 62.5% advantage the GTX holds in raw pixel shader performance, the advantage makes sense.

The 8500 GT's leading performance here is likely due to a combination of factors. Newer drivers, a unified shader architecture that lines up better with what the benchmark is optimized to run on, etc... It's still remarkable how well the iPad 4's A6X SoC does here as well as Qualcomm's Snapdragon 600/Adreno 320. The latter is even more impressive given that it's constrained to the power envelope of a large smartphone and not a tablet. The fact that we're this close with such portable hardware is seriously amazing.

At the end of the day I'd say it's safe to assume the current crop of high-end ultra mobile devices can deliver GPU performance similar to that of mid to high-end GPUs from 2006. The caveat there is that we have to be talking about performance in workloads that don't have the same memory bandwidth demands as the games from that same era. While compute power has definitely kept up (as has memory capacity), memory bandwidth is no where near as good as it was on even low end to mainstream cards from that time period. For these ultra mobile devices to really shine as gaming devices, it will take a combination of further increasing compute as well as significantly enhancing memory bandwidth. Apple (and now companies like Samsung as well) has been steadily increasing memory bandwidth on its mobile SoCs for the past few generations, but it will need to do more. I suspect the mobile SoC vendors will take a page from the console folks and/or Intel and begin looking at embedded/stacked DRAM options over the coming years to address this problem.


Log in

Don't have an account? Sign up now