How AMD's Istanbul might close the gap with Nehalem EP
by Johan De Gelas on February 25, 2009 12:00 AM EST - Posted in IT Computing general
The Istanbul cores are the same as those found in AMD's latest Shanghai CPU, but the "uncore" part of Istanbul is more interesting. By now, you have probably heard about AMD's "HT assist" technology, a probe or snoop filter. On the current Shanghai platform, every time a new cacheline is brought into the L3 cache of, for example, CPU 1, a broadcast message is sent to the L3 caches of all the other CPUs, and CPU 1 has to wait until those CPUs answer.
In the case of Istanbul, the CPU simply checks the snoop filter in its own L3 cache, and if none of the other CPUs have that cacheline, it can go ahead. This lowers the latency of bringing in a new cacheline and raises the effective bandwidth.
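The difference between the two schemes can be illustrated with a toy model. This is only a minimal sketch of the idea described above, not AMD's actual implementation: the `directory` dictionary, the function names, the four-socket layout and the cacheline addresses are all hypothetical, and the real coherency protocol has more states than this model tracks.

```python
from typing import Dict, Set


def probes_broadcast(requester: int, num_cpus: int) -> Set[int]:
    """No filter (Shanghai-style): every other CPU is probed and must answer."""
    return {cpu for cpu in range(num_cpus) if cpu != requester}


def probes_filtered(requester: int, line: int,
                    directory: Dict[int, Set[int]]) -> Set[int]:
    """With a snoop filter: probe only CPUs recorded as holding the line."""
    return directory.get(line, set()) - {requester}


# Hypothetical filter contents: cacheline address -> CPUs holding a copy.
directory = {0x1000: {2}}  # only CPU 2 holds line 0x1000

broadcast = probes_broadcast(requester=1, num_cpus=4)
held_elsewhere = probes_filtered(1, 0x1000, directory)
not_cached = probes_filtered(1, 0x2000, directory)

print(sorted(broadcast))       # CPUs probed without a filter
print(sorted(held_elsewhere))  # CPUs probed when one other CPU holds the line
print(sorted(not_cached))      # CPUs probed when nobody else holds the line
```

In this model, CPU 1 probes three CPUs without a filter, one CPU when the line is held elsewhere, and none at all when the filter shows that no other CPU has the line.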
To better understand this, we combined our own Stream benchmarking with the results that AMD presented. All AMD systems are using DDR2-800.
As each Stream thread works on its own data, there is no reason to send out coherency synchronization requests. These requests slow the process of getting new cachelines into the L3 and hence lower effective memory bandwidth. What is interesting is that the snoop filter will benefit not only the applications that use the HT interconnects a lot for coherency traffic, but also applications like Stream, which do not need the HT interconnects. Also notice that HT 3.0 does not improve memory bandwidth, as Stream will try to keep its thread data local. Our testing used SUSE SLES 10 SP2, and AMD used Windows Server 2008. Both OSs are well optimized and NUMA aware.
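To make "each thread works on its own data" concrete, here is a rough sketch of a Stream-style triad kernel and of how its bandwidth figure is derived. It is not the benchmark code used for the numbers above: the names `triad` and `triad_bytes` and the array size are assumptions for illustration, and an interpreted loop over Python lists will come nowhere near hardware memory bandwidth.

```python
import time


def triad(a, b, c, scalar):
    """Triad kernel: a[i] = b[i] + scalar * c[i], on arrays private to one thread."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bytes(n, bytes_per_element=8):
    """Bytes counted per triad pass: two reads (b, c) and one write (a) per element."""
    return 3 * n * bytes_per_element


n = 100_000  # hypothetical array length, far smaller than a real run would use
a = [0.0] * n
b = [1.0] * n
c = [2.0] * n

start = time.perf_counter()
triad(a, b, c, 3.0)
elapsed = time.perf_counter() - start

# Reported bandwidth is simply bytes counted divided by elapsed time.
print(f"{triad_bytes(n) / elapsed / 1e9:.3f} GB/s")
```

No other thread ever touches `a`, `b` or `c`, so in a multi-socket run of this pattern every probe sent to another CPU would be pure overhead.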
This means that HPC applications in particular, with many threads all working on their own data, will benefit from the higher effective bandwidth. Besides HT assist, AMD has now confirmed to us that the memory controller has been tuned quite a bit. This higher bandwidth will allow the quad Istanbul to stay out of the reach of the dual Nehalem EP Xeons in many HPC applications.
HT assist might also improve the SAP and OLTP scores quite a bit, but for a different reason. SAP and OLTP applications perform a lot of cache coherency synchronization requests, so the snoop filter will substantially lower the average latency of such requests, because in some cases:
- the CPU will only wait on one other CPU (instead of waiting for all responses to come back)
- the CPU won't have to wait at all, as the other CPUs don't have this line.
In addition, the snoop filter will lower memory latency, which is a bonus for almost every multi-threaded application.
Lower memory latency, higher bandwidth, lower "cache coherency" latency and more interconnect bandwidth: the improved "uncore" of Istanbul will be vital to close the gap with Nehalem. Much will depend on how quickly Intel introduces its own hexacore 32 nm Xeons, but that probably won't happen before 2010. Istanbul is shaping up to be a really good alternative for Intel's quadcore Nehalem. We might see a good fight after all...
Don't forget to check it.anandtech.com (IT portal) often, as many of our blogposts (for example the VMworld 2009 coverage) are not published on the frontpage of Anandtech.com.
40 Comments
zebrax2 - Thursday, February 26, 2009 - link
"I was sooooo frustrated with AMD when they were talking about how good the K8 was, because I thought it kind of sucked, when it was beating the Prescott."
they should be bragging about it because they managed to beat their competitors product specially considering that the other side is bigger than them and throw a lot more money on R&D than they could ever could.
as for the performance of istanbul, jarred might be wrong but the same is can be said on you. the truth is we just don't have enough information right now to say which one of you is right. outrightly saying that istanbul will have no chance of competing without any hard evidence is just plain wrong
zebrax2 - Thursday, February 26, 2009 - link
sorry for my really crappy English :(
icrf - Wednesday, February 25, 2009 - link
You're no longer allowed to compare CPUs. If you do, you're not allowed to talk specifics and the end result must always be ambiguous.
tshen83 - Wednesday, February 25, 2009 - link
This blog post is clearly targeted at my comment yesterday on how the 6 core Istanbul is not going to be competitive against Nehalem-EP. Yet, you fail at explaining how Istanbul can catch up. You showed that a DUAL SOCKET Nehalem-EP gets 34GB/sec on the benchmark while the Quad Shanghai gets 25GB/sec. Even assuming that the 2 extra cores will scale linearly, that means Quad Istanbul will be at around 37GB/sec, which is finally performance competitive. BUT....
Now let me get this straight. How is buying a quad socket system at 4* 1200+ dollar CPUs a better solution than a 2* 1000 dollar Nehalem-EP dual socket system while burning more than twice as much power? (Assuming that Istanbul will be 95W ACP and around 120W TDP x4 = 480W TDP vs 2x 95W TDP Nehalem-EPs).
On this metric: performance per watt per dollar, the quad socket Istanbul will be twice as expensive, and draw more than twice as much power, Nehalem-EP will be four times as good performance/watt/dollar.
rmlarsen - Friday, February 27, 2009 - link
Enough of your blathering, fanboy! It is obvious to everybody here that Intel has a killer server CPU on their hands with Nehalem. The AT/DT writers have repeatedly (and more or less openly) stated that Nehalem-based Xeons are poised to blow AMD out of the water (or out of much of the server market as it were). BUT that doesn't mean it is suddenly uninteresting to see what AMD is up to. You apparently just like reading your own posts, but I think you are alone in that regard, so shut up already!
Natfly - Wednesday, February 25, 2009 - link
"This blog post is clearly targeted at my comment yesterday on how the 6 core Istanbul is not going to be competitive against Nehalem-EP."
Dude, get over yourself. The post isn't targeted at you, it was stated in the other blog post they would write a post about this.
jap0nes - Wednesday, February 25, 2009 - link
I think the calculation should be total power usage over time, during a certain task. If a cpu consumes more power, but finishes the task first, then it consumes less power at the outlet, not on paper. And that's what matter.
Dont know if that applies to Istambul, though.
tshen83 - Wednesday, February 25, 2009 - link
Crap, never mind the 37GB/sec thing, I miss read the charts. So Anandtech apparently got Istanbul system benchmarked at 41GB/sec. Wow. The performance/watt/dollar equation is still missing.
How did Anandtech get the system or the benchmark numbers? From AMD? hehe....all sorts of questionable practices here.
SilentSin - Wednesday, February 25, 2009 - link
Probably from the AMD demo, kind of hard to fudge with straight bandwidth numbers and the Shanghai numbers AT is showing here match with the benchmark that AMD showed as well: http://www.theinquirer.net/inquirer/news/107/10511...
AMD still does have the DCM 2x6-core Magny-Cours to come out with. Hopefully that is going as smoothly as Istanbul seems to be and will be released around the time that the 8-core Boxboro parts are coming out. Could be an interesting matchup if they can keep the TDP of the 12 core part at or under 125W. That part will require a new socket, however, which could throw a wrench into AMD's easy upgrade marketing.
tshen83 - Wednesday, February 25, 2009 - link
BTW, wait for the Boxboro benchmarks. I am extrapolating a cool 100GB/sec from 16 channels of FB-DIMM2 DDR3-1333. :)