What Makes Server Applications Different?

The large caches and high integer core (cluster) count in one Orochi die (a four-module CMT Bulldozer die) made quite a few people suspect that the Bulldozer design was created first and foremost to excel at server workloads. Reviews like our own AMD FX-8150 launch article have revealed that single-threaded performance has (slightly) regressed compared to the previous AMD CPUs (Istanbul core), while the chip performs better in heavily multi-threaded benchmarks. However, high performance in multi-threaded workstation and desktop applications does not automatically mean that the architecture is server centric.

A more in-depth analysis of the Bulldozer architecture and its performance will be presented in a later article, as it is outside the scope of this one. However, many of our readers are hardcore hardware enthusiasts or IT professionals who love to delve a bit deeper than benchmarks that merely show whether something is faster or slower than the competition, so it's good to start with an explanation of what makes an architecture better suited for server applications. Is the Bulldozer architecture a “server centric architecture”?

What makes a server application different anyway?

There have been extensive performance characterizations of the SPEC CPU benchmark, which contains real-world HPC (High Performance Computing), workstation, and desktop applications. Studies of commercial web and database workloads running on real CPUs are less abundant, but we dug up quite a bit of interesting info. In summary, server workloads distinguish themselves from workstation and desktop ones in the following ways.

They spend a lot more time in the kernel. Accessing the network stack and the disk subsystem, handling user connections, synchronizing large numbers of threads, demanding more memory pages for expanding caches--server workloads make the OS sweat. Server applications spend about 20 to 60% of their execution time in the kernel or hypervisor, while in contrast most desktop applications rarely exceed 5% kernel time. Kernel code tends to have very low IPC (Instructions Per Clock cycle), with lots of dependencies.
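
You can see this user/kernel split on a live Linux system with getrusage(). The sketch below is a minimal illustration, not one of our test tools; the /dev/zero read loop is a hypothetical stand-in for the syscall-heavy request handling a real server does.

```c
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/resource.h>

int main(void)
{
    /* Syscall-heavy loop: a crude stand-in for a server handling many
       small network/disk requests, each of which enters the kernel. */
    char buf[64];
    int fd = open("/dev/zero", O_RDONLY);
    for (int i = 0; i < 1000000; i++)
        read(fd, buf, sizeof(buf));
    close(fd);

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    printf("user: %.2fs  system: %.2fs  (%.0f%% kernel)\n",
           user, sys, 100.0 * sys / (user + sys));
    return 0;
}
```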

That is why, for example, SPECjbb, which does not perform any networking or disk access, is a decent CPU benchmark but a pretty bad server benchmark. An interesting fact is that SPECjbb, thanks to the lack of I/O subsystem interaction, typically has an IPC of 0.5-0.9, which is almost twice as high as other server workloads (0.3-0.6), even when those server workloads are not bottlenecked by the storage subsystem.
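
If you want to measure the IPC of your own workload on Linux, the hardware counters are exposed through the perf_event_open() syscall (or, more conveniently, the perf stat tool). Below is a minimal sketch; the busy loop is a placeholder for the workload being characterized, and note that we deliberately count kernel cycles too, since that is where server apps spend much of their time.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open one hardware counter; group_fd == -1 makes it a group leader. */
static int perf_open(unsigned int type, unsigned long long config, int group_fd)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = (group_fd == -1);  /* leader starts disabled */
    attr.exclude_kernel = 0;           /* count kernel time too */
    return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
}

int main(void)
{
    int cycles = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, -1);
    int instr  = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, cycles);
    if (cycles < 0 || instr < 0) { perror("perf_event_open"); return 1; }

    ioctl(cycles, PERF_EVENT_IOC_RESET,  PERF_IOC_FLAG_GROUP);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

    /* Placeholder: the workload you want to characterize goes here. */
    volatile long sink = 0;
    for (long i = 0; i < 100000000L; i++) sink += i;

    ioctl(cycles, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    long long c = 0, n = 0;
    read(cycles, &c, sizeof(c));
    read(instr,  &n, sizeof(n));
    printf("instructions: %lld  cycles: %lld  IPC: %.2f\n", n, c, (double)n / c);
    return 0;
}
```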

Another aspect of server applications is that they are prone to more instruction cache misses. Server workloads are more complex than most processing-intensive applications. Processing-intensive applications like encoders are written in C++ using a few libraries, whereas server workloads are developed on top of frameworks like .NET and make use of lots of DLLs--or in Linux terms, they have more dependencies. Not only is the "most used" instruction footprint a lot larger, but dynamically compiled software (such as .NET and Java) tends to produce code that is more scattered across the memory space. As a result, server apps suffer many more L1 instruction cache misses than desktop applications, where instruction cache misses are much lower than data cache misses.

Similar to the above, server apps also have more L2 cache misses. Modern desktop/workstation applications frequently miss the L1 data cache and need the L2 cache too, as their datasets are much larger than the L1 data cache. But once the working set fits in the L2, few of these applications suffer significant L2 cache misses. Most server applications have higher L2 cache miss rates, as they tend to come with even larger memory footprints and huge datasets.
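
The effect of a growing working set is easy to reproduce with a classic pointer-chasing microbenchmark. The sketch below (a simplified illustration, not one of our benchmark tools) walks a randomly shuffled chain of indices; as the working set outgrows the L1 and then the L2, the average latency per load jumps.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk a random cyclic chain of 'n' indices and return ns per load. */
static double chase(size_t n)
{
    size_t *next = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {   /* Sattolo shuffle: one big cycle */
        size_t j = rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    const long steps = 10 * 1000 * 1000;
    size_t p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long s = 0; s < steps; s++)
        p = next[p];                       /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(next);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / steps + (p >> 60) * 1e-9;  /* use 'p' so it isn't optimized out */
}

int main(void)
{
    for (size_t kb = 16; kb <= 16 * 1024; kb *= 4)
        printf("%6zu KB working set: %5.1f ns/load\n",
               kb, chase(kb * 1024 / sizeof(size_t)));
    return 0;
}
```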

The larger memory footprint and the shrinking and expanding caches can cause more TLB misses too. Virtualized workloads in particular need large and fast TLBs, as they switch between contexts much more often.
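
One common mitigation on Linux is backing large, hot allocations with 2MB huge pages, so that a single TLB entry covers 512 times as much memory as a 4KB page. A minimal sketch using transparent huge pages (assuming THP support is enabled in the kernel):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1UL << 30;   /* a 1 GB heap region */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ask the kernel to back the range with 2 MB pages: one TLB entry
       then covers 512 times as much memory as a 4 KB page would. */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");   /* needs THP support in the kernel */

    /* ... place the large, hot dataset in this region ... */

    munmap(p, len);
    return 0;
}
```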

As most server applications are easier to multi-thread (for example, a thread for each connection) but are likely to work on the same data (e.g. a relational database), keeping the caches coherent tends to produce much more coherency traffic, and locks are much more frequent.
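
A small pthreads sketch makes the cost of coherency traffic visible. Two threads each hammer their own counter; whether those counters share a cache line changes nothing about the logic, but everything about how often the line ping-pongs between cores. This is an illustrative toy, not one of our test workloads:

```c
#include <pthread.h>
#include <stdio.h>

/* Each counter gets its own 64-byte cache line. Remove the padding and
   both land in one line: the logic is identical, but every increment
   forces the line to bounce between the two cores (false sharing). */
struct padded { volatile long v; char pad[64 - sizeof(long)]; };
static struct padded counters[2];

static void *worker(void *arg)
{
    volatile long *c = &counters[(long)arg].v;
    for (long i = 0; i < 100000000L; i++)
        (*c)++;
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counters[0].v, counters[1].v);
    return 0;
}
```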

Some desktop workloads, such as compiling and games, have much higher branch misprediction ratios than server applications. Server applications tend to be no more branch-intensive than your average integer application.
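
For reference, the textbook way to see what mispredicted branches cost is to run the same data-dependent branch over random versus sorted data. A hypothetical sketch:

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

int main(void)
{
    enum { N = 1 << 20 };
    static unsigned char data[N];
    for (int i = 0; i < N; i++)
        data[i] = rand() & 0xFF;

    /* Uncomment to sort: same work, but the branch below becomes
       almost perfectly predictable and the loop runs much faster. */
    /* qsort(data, N, 1, cmp); */
    (void)cmp;  /* silence the unused warning while qsort is commented out */

    long sum = 0;
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)   /* ~50% mispredictions on random data */
                sum += data[i];
    printf("%ld\n", sum);
    return 0;
}
```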

Quick Summary

The end result is that most server applications have low IPC. Quite a few workstation applications achieve an IPC of 1.0-2.0, while many server applications execute, on average, 3 to 5 times fewer instructions per cycle. Performance is dominated by Memory Level Parallelism (MLP), coherency traffic, and branch prediction, in that order, and to a lesser degree by integer processing power.
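
Memory Level Parallelism is simply the number of cache misses a core can keep in flight at once. Its impact can be sketched with the pointer-chasing idea from earlier: one dependent chain serializes every miss, while several independent chains let the out-of-order core overlap them. Again, a hypothetical illustration rather than one of our benchmarks:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 1 << 24 };          /* 16M entries = 128 MB, far beyond any cache */

static double elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    size_t *next = malloc(N * sizeof(size_t));
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {          /* one big random cycle */
        size_t j = rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    enum { STEPS = 1 << 22 };
    struct timespec t0, t1;

    /* One dependent chain: each miss must complete before the next load. */
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int s = 0; s < STEPS; s++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double serial = elapsed_ns(t0, t1) / STEPS;

    /* Four independent chains: the core can overlap the misses (MLP). */
    size_t a = 0, b = N / 4, c = N / 2, d = 3 * (N / 4);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int s = 0; s < STEPS / 4; s++) {
        a = next[a]; b = next[b]; c = next[c]; d = next[d];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double overlap = elapsed_ns(t0, t1) / STEPS;

    printf("serial: %.1f ns/load  4-way: %.1f ns/load  (ignore: %zu)\n",
           serial, overlap, p + a + b + c + d);
    free(next);
    return 0;
}
```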

So is "Bulldozer" a server centric architecture? We'll need a more in-depth analysis to answer this question properly, but from a high level perspective, yes, it does appear that way. Getting 16 threads and 32MB of cache inside a 115W TDP power consumption envelope is no easy feat. But let the hardware and benchmarks now speak.

106 Comments

  • mino - Wednesday, November 16, 2011 - link

    More workload ... also you need at least 3 servers for any meaningful redundancy ... even when only needing the power of 1/4 of either of them.

    BTW, most CPUs sold in the SMB space are a far cry from the 16-core monsters reviewed here ...
  • JohanAnandtech - Thursday, November 17, 2011 - link

    Don't forget the big "Cloud" buyers. Facebook has increased its number of servers from 10,000 somewhere in 2008 to 10 times that in 2011. That is one of the reasons why the number of units is still growing.
  • roberto.tomas - Wednesday, November 16, 2011 - link

    seems like the front page write-up and this article are from different versions:

    from the write up: "Each of the 16 integer threads gets their own integer cluster, complete with integer executions units, a load/store unit, and an L1-data cache"

    from the article: "Cores (Modules)/Threads 8/16 [...] L1 Data 8x 64 KB 2-way"

    what is really surprising is calling them threads (I thought, like the write-up on the front page, that they each had their own independent integer "unit"). If they have their own L1 cache, they are cores as far as I'm concerned. Then again, the article itself seems to suggest just that: they are threads without an independent L1 cache.

    ps> I post comments only like once a year -- please don't delete my account. every time I do, I have to register anew :D
  • mino - Wednesday, November 16, 2011 - link

    It suits Intel better to call them threads ... so writers are ordered ... if only the pesky reality did not pop up here and there.

    BD 4200 series is a 1-chip, 4-module, 8(4*2)-core, 8(4*2)-thread processor
    BD 6200 series is a 2-chip, 8(2*4)-module, 16(2*4*2)-core, 16(2*4*2)-thread processor

    Xeon 5600 series is a 1-chip, (up to) 6-core, 12(6*2)-thread processor.

    Simple as cake. :D
  • rendroid1 - Wednesday, November 16, 2011 - link

    The L1 D-cache should be 1 per thread, 4-way, etc.

    The L1 I-cache is shared by 2 threads per "module", and is 2-way, etc.
  • JohanAnandtech - Thursday, November 17, 2011 - link

    Yep. fixed. :-)
  • Novality77 - Wednesday, November 16, 2011 - link

    One thing that I never see in any reviews is remarks about the fact that more cores with lower IPC have added costs when it comes to licensing. For instance, Oracle, IBM and most other suppliers charge per core. These costs can add up pretty fast. 10,000 per core is not uncommon...
  • fumigator - Wednesday, November 16, 2011 - link

    Great review as usual. I found all the new AMD opterons very interesting. Pairing two in a dual socket G34 would make a multitasking monster on the cheap, and quite future proof.

    About cores vs modules vs hyperthreading, people thinking AMD cores aren't true cores should consider the following:

    adding virtual cores via hyperthreading on Intel platforms doesn't make performance increase 100% per core, but less than 50%

    Also if you look at Intel processor photographs, you won't notice the virtual cores anywhere in the pictures.
    While in Interlagos/Bulldozer you can clearly spot each core by its shape inside each module. What surprises me is how small they are, but that's for an entirely different discussion.
  • MossySF - Wednesday, November 16, 2011 - link

    I'm waiting to see the follow-up Linux article. The hints in this one confirm my own experiences. At our company, we're 99% FOSS, and when using CentOS packages, AMD chips run just as fast as Intel chips since it's all compiled with GCC instead of Intel's "disable faster code when running on AMD processors" compiler. As an example, PostgreSQL on native CentOS is just as fast on Thuban compared to Sandy Bridge at the same GHz. And when you then virtualize CentOS under CentOS+KVM, Thuban is 35% faster. (Nehalem goes from 10% slower natively to 50% slower under KVM!)

    The compiler issue might be something to look at in virtualization tests. If you fake an Intel identifier in your VM, optimizations for new instruction sets might kick in.

    http://www.agner.org/optimize/blog/read.php?i=49#1...
  • UberApfel - Wednesday, November 16, 2011 - link

    Amazingly biased review from Anandtech.

    A fairer comparison would be between the Opteron 6272 ($539 / 8-module) and Xeon E5645 ($579 / 6-core); both common and recent processors.

    Yet handpicking the higher clocked Opteron 6276 (for what good reason?) seems to be nothing but an aim to make the new 6200 series seem unremarkable in both power consumption and performance. The 6272 is cheaper, more common, and would beat the Xeon X5670 in power consumption, which half this review is weighted on. Otherwise you should've used the 6282 SE, which would compete in performance as well as being the appropriate processor according to your own chart.

    Even the chart on Page 1 is designed to make Intel look superior all-around. For what reason would you exclude the Opteron 4274 HE (65W TDP) or the Opteron 4256 EE (35W TDP) from the 'Power Optimized' section?

    The ignorance on processor tiers is forgivable even if you're likely paid to write this... but the benchmarks themselves are completely irrelevant. Where's the IIS/Apache/Nginx benchmark? PostgreSQL/SQLite? Facebook's HipHop? Node.js? Java? Something relevant to servers and not something obscure enough to sound professional?
