Will Nehalem conquer the server world by storm?
by Johan De Gelas on February 12, 2009 12:00 AM EST - Posted in IT Computing general
A dramatic turn of events is the best way to describe what we'll witness in a few weeks. But let us first talk about the current situation. As we pointed out in our last server CPU comparison, AMD's latest quadcore Opteron was a very positive surprise. Sure, you can show a few server benchmarks where the Intel CPU wins, like Black-Scholes or some exotic HPC benchmark, but the server applications that really make the difference, such as webservers and database servers, run faster on the latest AMD "Shanghai" CPU. It all depends on what kind of application is important to you, of course. But let us look at the complete picture: performing more than 30% faster in virtualization benchmarks is the final proof that AMD's latest is overall the best server CPU at this point in time.
But a few weeks from now, that will all change. As always, we cannot disclose benchmark information before a certain date, but if you have looked around on this site, you will have been able to discern the omens. The K10 architecture of Shanghai is well rounded, but it lacks some really crucial weapons to keep up with Nehalem:
- Simultaneous multithreading (Hyper-Threading) offers a performance boost that IPC improvements are not capable of delivering (up to 45%!).
- Memory latency: Nehalem's memory latency is up to 40% lower.
- Memory bandwidth: 3 channels is complete overkill for desktop apps, but it does wonders for many HPC and, to a lesser degree, server applications.
- A really aggressive integer engine.
Nehalem will use somewhat more expensive DDR-3 DIMMs, which hardly offer any real performance boosts (as compared to DDR-2). So moving to DDR-3 will not help AMD much.
Istanbul?
The details on the six-core Istanbul are still sketchy. But the dual-socket Xeon "Westmere" will get six cores too and will appear in the same timeframe as AMD's hexacore. Only if AMD has very secretly added SMT to Istanbul will it be able to turn the tide. Considering that this would be a first for AMD, it is very unlikely that SMT made it into Istanbul.
A dent in Nehalem's armour?
Does AMD have a chance in the server market in 2009 (and possibly 2010)? I must say it was not easy to find a weakness in Nehalem's architecture. The challenge made it very attractive to search anyway :-). So what follows is a big "IF-IF" story and you should take it with a big grain of salt ... as you should always do with forward-looking articles.
There is one market where AMD has really been the leader and that is virtualization, thanks to the IMC and the support for segments (four privilege levels) in the AMD64 Instruction Set Architecture. AMD's performance running VMware ESX in the "good old" ESX binary translation mode (software virtualization) was better than that of an Intel CPU running the latest hardware virtualization hypervisor. VMware only uses hardware virtualization on an AMD server if NPT (also called RVI or HAP) is present. In contrast, hardware virtualization slowed the Xeons of 2005 and 2006 down a bit, but it was absolutely necessary to run 64-bit guests on a hypervisor on top of a Xeon server.
Nehalem is catching up with EPT and VPID (see here), and while these are well implemented, one thing is lacking: the TLB is rather small. I pointed this out about a year ago: while the TLB got AMD a lot of bad press, it will probably be the one thing that keeps AMD somewhat in Intel's slipstream. Let me make that more clear:
CPU | L1 TLB Data | L1 TLB Instr | L2 TLB |
AMD Shanghai / Opteron 238x or 838x | 48 (4 KB), 48 (large) | 48 (4 KB), 48 (large) | 512 (4 KB), 128 (large) |
Intel Penryn / Xeon 54xx | 16 (4 KB), 16 (large) | 128 (4 KB), 8 (large) | 256 (4 KB), 32 (large) |
Intel Nehalem / Xeon 55xx | 64 (4 KB), 32 (large) | 128 (4 KB), 14 (large) | 512 (4 KB), 0 (large) |
Notice that if you use large pages, the Nehalem TLB has few entries. So, let us now do a thought experiment. Currently, most of the virtualization benchmarks like VMmark (VMware) and vConsolidate (Intel) use relatively small VMs. These VMs are, for example, a small Apache webserver and a MySQL server which get between 512 MB and 2 GB of RAM. As a result, most of them run with large pages off (page size = 4 KB). These benchmarks are very similar to the daily practice of an enterprise which uses IT mostly for "infrastructure purposes" such as authenticating its employees and giving them access to mail, FTP, file serving, print serving and web browsing.
It becomes totally different when you are an IT firm that offers its services to a relatively large number of customers on the internet. You need a large database and many, probably pretty heavy, web portals which offer a good interactive experience.
So you are not going to consolidate something like 84 (14 tiles x 6 VMs) tiny VMs on one physical machine, but rather 5 to 10 "fat" VMs. By fat VMs I mean VMs that get 4 GB or more of RAM, 2 to 4 vCPUs, run a 64-bit guest OS and so on.
Those applications also open tons of connections, which they have to destroy and recreate after some time. In other words, there is lots of memory activity going on.
EPT and NPT can offer between 10 and 35% better performance when lots of memory management activity is going on. Unlike with the shadow page table technique, a change in the page tables does not cause a trap and the associated overhead (which can be thousands of cycles). So you could say that going to the TLB of your CPU is a lot smoother. But if the TLB fails to deliver, the hardware page walk is very costly.
In search of the real page table
A hardware page walk consists of searching several tables which allow the CPU to find the real physical address, as the running software always supplies a virtual address. With a normal OS, the OS has set the CR3 register to contain the physical address where the first table is located. The first table converts the first part of the virtual address into a physical one: a pointer to the physical address where the next table is located. With large pages, it takes about 3 steps to translate the virtual address to the physical one.
With EPT/NPT, the guest OS gives a (CR3) address which is in fact virtual and which must be converted into a real physical address. All the guest OS tables contain pointers to virtual addresses, so each table gives you a virtual address for the next table. But the next table is not located at this virtual address, so we need to go out and search for the real address. So instead of 3 accesses to memory, we need 3x3 accesses. If this happens too many times, EPT will actually reduce performance instead of improving it!
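To make the counting argument concrete, here is a minimal Python sketch of the simplified model used above: 3 table lookups for a native large-page walk, and one full host walk for each guest table lookup under EPT/NPT. The function names are hypothetical, and the model deliberately ignores page-walk caches and the exact number of paging levels, so real hardware will not match these counts exactly.

```python
def native_walk_accesses(levels: int = 3) -> int:
    """Memory accesses for one native page walk: one table lookup per level."""
    return levels


def nested_walk_accesses(guest_levels: int = 3, host_levels: int = 3) -> int:
    """Simplified EPT/NPT model from the text: each guest table pointer is a
    guest address that must itself be translated through the host tables."""
    return guest_levels * host_levels


def walk_slowdown(guest_levels: int = 3, host_levels: int = 3) -> float:
    """How many times more memory accesses a nested walk needs than a native one."""
    return nested_walk_accesses(guest_levels, host_levels) / native_walk_accesses(guest_levels)


if __name__ == "__main__":
    print("native walk accesses:", native_walk_accesses())
    print("nested walk accesses:", nested_walk_accesses())
    print("slowdown factor per TLB miss:", walk_slowdown())
```

In this model every TLB miss under nested paging costs three times as many memory accesses as a native miss, which is why the number of TLB entries matters so much more once EPT or NPT is switched on.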
It is good practice to use large pages with large databases. Now remember we are moving towards a datacenter where almost everything is virtualized, databases included. In that case, Nehalem's TLB can only cover about 32 x 2 MB, or just 64 MB, of data and 28 MB of code. As a result, lots of relatively heavy hardware page walks will happen. Luckily, Intel caches the real physical page tables in the L3 cache, so it should not be too painful.
The latest quadcore Opteron has a much more potent TLB. As instructions take a lot less space than data, it is safe to say that the data TLB can cover up to 176 (48 + 128) times 2 MB, or 352 MB, of data. Considering that virtualized servers easily have between 32 and 128 GB of RAM and are much better utilized (60-80% CPU load), it is clear that the AMD chip has an advantage there. How much difference can this make? We have to measure it, but based on our profiling and early benchmarking we believe that "an overflowing TLB" can decrease virtualized performance by as much as 15%. To be honest: it is too early to tell, but we are pretty sure it is not peanuts in some important applications.
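The coverage figures quoted above are simply the number of large-page TLB entries multiplied by the 2 MB page size. The short Python sketch below reproduces them from the large-page entry counts in the table; the dictionary layout and function names are assumptions made for illustration, and, like the text, it counts the whole L2 TLB towards data.

```python
MIB = 1024 * 1024
LARGE_PAGE_BYTES = 2 * MIB  # 2 MB large pages, as assumed in the text

# Large-page entry counts copied from the TLB table in this article.
LARGE_PAGE_ENTRIES = {
    "Nehalem": {"l1_data": 32, "l1_instr": 14, "l2": 0},
    "Shanghai": {"l1_data": 48, "l1_instr": 48, "l2": 128},
}


def tlb_reach_mb(entries: int, page_bytes: int = LARGE_PAGE_BYTES) -> int:
    """Memory (in MB) that a given number of TLB entries can map."""
    return entries * page_bytes // MIB


def data_reach_mb(cpu: str) -> int:
    """Data coverage: L1 data entries plus L2 entries, as counted in the text."""
    e = LARGE_PAGE_ENTRIES[cpu]
    return tlb_reach_mb(e["l1_data"] + e["l2"])


def code_reach_mb(cpu: str) -> int:
    """Code coverage: L1 instruction entries only."""
    return tlb_reach_mb(LARGE_PAGE_ENTRIES[cpu]["l1_instr"])


if __name__ == "__main__":
    for cpu in LARGE_PAGE_ENTRIES:
        print(cpu, "data:", data_reach_mb(cpu), "MB, code:", code_reach_mb(cpu), "MB")
```

Set against a "fat" VM with 4 GB or more of RAM, both numbers are small, but the Opteron maps more than five times as much data before a page walk is needed.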
So what are we saying? Well, it is possible that the Opteron might be able to do some "damage control" against Nehalem when we try out a benchmark with large and fat VMs (like we have done here). But there are a lot of "IF"s. Firstly, AMD must also keep the page tables in its caches. If for some reason they keep the page tables out of the caches, the advantage will probably be partly negated. Secondly, if the applications running on the physical machine demand a lot of bandwidth, the fact that the Nehalem platform has up to 70% more bandwidth might spoil the advantage too.
The last AMD Stronghold?
So should Intel worry about this? Most likely not. For simplicity's sake, let us assume that both cores, Shanghai and Nehalem, offer equal crunching power. They more or less do when it comes to pure raw FP power, but SPECint makes it clear that Nehalem is faster in integer loads.
But let us forget that, as most server applications are unable to use all that superscalar power anyway. The AMD chip is still disadvantaged by the fact that it does not have SMT. Considering that most server apps have ample threads and that virtualization makes it easier to load each logical CPU up to 80%, that remains a hard-to-close gap. Secondly, many of these applications do not fit entirely in the cache, so the fact that AMD's memory latency is up to 40% higher is not helping either. Thirdly, all top Xeons (2.66 GHz and higher) are capable of adding 2 extra speed bins even if all 4 cores are busy (as was the case in SAP). It will be interesting to see how much power this costs, and whether Turbo mode is possible with an 80% loaded virtualized machine.
In a nutshell: expect Nehalem with its ample bandwidth and EPT to do very well in VMmark. However, we think that AMD might stay in the slipstream of the Intel flagship in some virtualization setups. It is possible that AMD counters with an even better optimized memory controller in Istanbul, but it is going to be tough.
Return to Linpack
The benchmarks where AMD will be able to stay close should have no use for massive amounts of memory bandwidth, SMT or Turbo mode. Feel free to educate us, but so far we have only found one benchmark that fits this profile: Linpack. Linpack probably achieves one of the highest IPC rates of any software. That means the Nehalem Xeon will be consuming peak power and will not be able to use Turbo mode. Linpack (with MKL or ACML) is also so carefully optimized that it runs almost completely in the caches, and SMT or Hyper-Threading only disturbs the carefully placed code lines. Considering that a 2.7 GHz Shanghai CPU with registered RAM was only a tiny bit slower than a Nehalem CPU with non-registered RAM, you may expect to see both CPUs very close in this benchmark.
Outlook to 2009
The AMD quadcore is now the server CPU to get, but it is not going to stay that way for very long. Until AMD comes up with SMT or another form of multi-threading and a faster memory controller, Intel's newest platform and CPU will force AMD to make the quadcore Opteron very cheap. We expect that the AMD quadcore will only be competitive in Linpack and some virtualization scenarios.
And unless Istanbul has a very nice surprise for us, it is not going to change soon. Agreed, to our loyal readers, this does not come as a surprise...
17 Comments
Rackmountsales - Tuesday, February 24, 2009
Thanks for sharing information and i appreciate it. Looking for more discussion and waiting for new topics here. Rackmountsales
mikaels - Monday, February 23, 2009
I'm a bit late for commenting on these comments but here I go.
First of all, it is very sad to see these ongoing accusations towards Anandtech about being an Intel-lover etc.
Personally, for me, I see the article as a whole, not just one page or one chart with numbers.
I read up, think and decide if the article has a good value in terms of things mentioned, considered etc.
I personally am an Intel fan. Even during the days with the Pentium 4. Though, in my defense, I was a bit immature at the time. Didn't "want" AMD to win. But that's another story.
I can now clearly recommend AMD based products if they meet the criteria for the purchase to be done.
For the server market it's a bit more difficult.
You have to look at many more numbers, manufacturer, ownership cost, spare parts, support things etc...
AMDOpteronPhil - Friday, February 13, 2009
Johan - as always, I appreciate your expertise on these things, particularly in the area of virtualization and recognizing some of the subtle differences in the two architectures.
* Why is there no mention regarding the cost of DDR3 memory? It's important to call out that customers are going to have to pay a premium for it over DDR2 for at least the remainder of 2009.
* What about power consumption at the wall? That's quickly becoming the number one buying criteria these days.
* The 45 percent advantage with SMT? Where are you getting this number? Is it from a benchmark you ran or did Intel provide you with this?
JohanAnandtech - Friday, February 13, 2009
"Why is there no mention regarding the cost of DDR3 memory? It's important to call out that customers are going to have to pay a premium for it over DDR2 for at least the remainder of 2009."
I currently have no idea what buffered DDR-3 is going to cost, but looking at the unbuffered stuff, it is doubtful that DDR-3 is really going to have a large impact on your server price.
Power consumption: we'll measure this in detail, but why would you assume that there will be a large difference? Previous server comparisons have shown that most of the difference came from the large number of FB-DIMMs, and Intel won't use them on the Nehalem EP.
"* The 45 percent advantage with SMT? Where are you getting this number?"
I wrote "Up to 45 percent". But cases of 25-30% higher performance are pretty common. That is huge. Just imagine what kind of clockspeed or IPC advantage you must have to counter this.
balancedthinking - Monday, February 16, 2009
Not a large impact? Good unbuffered DDR3 costs over twice as much as good unbuffered DDR2.
Buffered DDR3 will be somewhat of a niche product, and while memory companies are starving to death, you expect it to be cheap? This is finally a product they can make money with and you bet they will.
Power Consumption no difference? Seen all these desktop reviews where Nehalem leads the pack from the negative standpoint?
Conclusion: everything you know, you know from Intel. Otherwise you would tell us basic stuff like power consumption at the wall.
Your blog posts have the value of Intel press releases and should be considered as those.
marc1000 - Tuesday, February 17, 2009
I believe everything in the server market has a premium price. Because of the lack of information, all we can do about the memory price is to wait.
About the power consumption, you are only partially correct. Yes, Nehalem is a power hog, let's assume it consumes 30% more power than Core2 and Shanghai, but it also does the work 50% faster (rough numbers). So make the simple calculation below.
Previous CPUs: 100W over 5 minutes; Nehalem: 130W over 2.5 minutes. Job done, CPU goes to idle. Idle power is the same for both platforms. Which one will cost more money by the end of the month???
Finally - Wednesday, February 18, 2009
50% faster is not 100% faster. You just halved the time.
Learn to do the math or cease bullshitting.
marc1000 - Wednesday, February 18, 2009
You are right, Mr Finally. I thank you for correcting me. I did a mental calculation and got trapped in the "2.5 is 50% of 5", but for sure the right math was hard enough that you couldn't do it yourself. It is easy to blame others and add nothing in the comments here. I wonder why we have such passionate posts anyway.
But let's focus on the technology instead of blaming each other. If anyone else wants to know the "math", here is the "bug-fixed" example:
Other CPUs: 100W over 5 minutes, Nehalem: 130W over 3.3334 minutes. So total energy is... 5 x 100 = 500 and 3.3334 x 130 = 433.3 watt-minutes.
Oops, I guess Nehalem did it again. 66.7 watt-minutes cheaper than any other CPU when doing the same work. Watts per work done should be the measure, not peak power.
Finally - Wednesday, February 18, 2009
Hehe. Guess that I actually DID the math to make my comment.
Once again you fail @ suggesting things.
In your faulty "bugfixed" calculation you seem to think that a Nehalem system in idle mode uses ZER0 power. Yeah, sure...
So, what's next?
Will Nehalem prove to be a perpetuum mobile?
Really. This whole what-if is written by an Infel fanatic for Intel fanatics... but what's the news?
gomakeit - Wednesday, February 18, 2009
lol I just have to say finally sounds like an amd fanatic. oh well what else is new :P
/joke
seriously, even if you count the idle power consumption it's hard to overcome the more than 10% difference in performance/watt estimated from the "bug-fix" calculation, since both procs idle at a fraction of peak power usage.
and on DDR3, I guess what the author means is that it will not have a big impact on "server" price. DDR3 is a lot more expensive than DDR2 when used in a desktop application. but servers, especially those for HPC application are a lot more expensive to begin with. so the marginal increase might not be too big. also given that DDR3 is really on free fall in terms of pricing right now it's likely they'll be a lot lower towards later part of the year
ultimately I think AMD has great value in certain segment of the market. but I think we all have to tip the hat to intel as it owns the performance crown at the moment. this is from a person still using amd on his own rig :P