Istanbul versus Nehalem, some extra notes
by Johan De Gelas on February 27, 2009 11:00 AM EST
Posted in: IT Computing general
My last post generated quite a bit of discussion, some of it based on misunderstandings, so in this post I'll try to clarify a few things. In a previous post, I pointed out that there are good indications that a dual Nehalem EP has a 40 to 100% advantage over Shanghai (depending on the application, based on the SAP and Core i7 workstation benchmarks).
If Istanbul is introduced in the early part of H2 2009, AMD will have a small window of opportunity in which its hex-core competes against a quad-core (Intel's Nehalem EP). Time will tell, of course, how small, large or nonexistent this window will be.
In well-threaded applications, the best a "hex-core Shanghai" can do is give about a 30-40% boost to performance compared to the current Shanghai, which is most likely not enough to close the gap with the upcoming Nehalem CPU (let alone the 32 nm hex-core version). However, Istanbul is more than a hex-core Shanghai. The improved memory controller and HT-assist can lower the latency of inter-CPU syncing and increase the effective memory bandwidth. For that reason, Istanbul will do better than just "a Shanghai with 2 added cores" in many applications such as SAP, OLTP databases, virtualization scenarios and HPC. Depending on the application, Istanbul might prove to be competitive with the quad-core Nehalem. The hex-core "Westmere", which will have a slightly improved architecture, will clearly be a different matter.
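To make the size of that gap concrete, here is a minimal sketch that combines the two ranges quoted in this post: the 30-40% boost for a hypothetical "hex-core Shanghai" and the 40-100% Nehalem EP advantage. The function name and the way the ranges are paired are my own illustration, not measured data, and the sketch deliberately ignores the Istanbul "uncore" improvements.

```python
# Ranges quoted in the post; the pairing of best/worst cases is illustrative.
HEX_CORE_BOOST = (0.30, 0.40)      # hypothetical hex-core Shanghai vs quad-core Shanghai
NEHALEM_ADVANTAGE = (0.40, 1.00)   # dual Nehalem EP vs dual Shanghai


def remaining_gap(nehalem_advantage, hex_core_boost):
    """Nehalem EP's lead over a hypothetical hex-core Shanghai (0.0 means parity)."""
    return (1 + nehalem_advantage) / (1 + hex_core_boost) - 1


# Best case for AMD: smallest Nehalem lead, largest boost from the two extra cores.
best_case = remaining_gap(NEHALEM_ADVANTAGE[0], HEX_CORE_BOOST[1])
# Worst case for AMD: largest Nehalem lead, smallest boost.
worst_case = remaining_gap(NEHALEM_ADVANTAGE[1], HEX_CORE_BOOST[0])

print(f"best case for AMD:  Nehalem EP still ahead by {best_case:.1%}")
print(f"worst case for AMD: Nehalem EP still ahead by {worst_case:.1%}")
```

Under these assumptions, two extra cores alone reach parity only at the most favorable end of both ranges, which is why the memory controller and HT-assist improvements matter for the comparison.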
But back to the "this higher amount of bandwidth will allow the quad Istanbul to stay out of the reach of the dual Nehalem EP Xeons" comment. It is very embarrassing, and simply bad PR if a quad socket platform is beaten by a dual socket platform in any benchmark. This is something we have witnessed in the early SAP numbers. That is why I commented that the improved "uncore" will help the quad socket Istanbul to stay out of the reach of the dual Nehalem EP. I was and am not implying that people who would consider a dual Nehalem EP are suddenly going to consider a quad Istanbul.
It is clear that buyers of 4S and 2S servers are in slightly overlapping but mostly different markets. Quad socket is mostly chosen for large back-end applications such as OLTP databases or for virtualization consolidation. The number of DIMM slots in that case is a very important factor. However, even with the advantage of having more DIMM slots, better RAS etc., a quad socket platform that cannot outperform a dual socket platform will leave a bad taste in the mouth of potential buyers. It is important that there is at least a minimal performance advantage.
The fact that the performance/power ratio of such a quad server will be worse than that of a dual socket server is an entirely different discussion. IBM's market research (see the picture below) shows which form factor is bought mostly for consolidating VMs. As you can see, it comes down to some people being convinced that a number of 4-socket rack servers is the best way, while others are firm believers that about twice as many low-power 2-socket blades is the way to go. It is very hard to convince either group to switch sides, and that is why I feel that 2S and 4S servers are mostly in different markets.
In many cases, the number of virtual machines you can consolidate on one physical server is mostly a function of the amount of RAM. If the number of DIMM slots allows you to consolidate twice as many virtual machines on the quad socket machine, the energy consumed might be lower than that of two DP machines with the same total number of DIMMs.
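A rough sketch of that RAM-bound reasoning follows. All the numbers in it are hypothetical and chosen only for illustration: the slot counts (32 for the 4S box, 16 for the 2S box), the DIMM size, the memory per VM and the total VM count are assumptions, not vendor specifications or measurements.

```python
import math


def vms_per_host(dimm_slots, gb_per_dimm, gb_per_vm):
    """Number of VMs that fit on one host when RAM is the only limit."""
    return (dimm_slots * gb_per_dimm) // gb_per_vm


def hosts_needed(total_vms, dimm_slots, gb_per_dimm, gb_per_vm):
    """Hosts required to place all VMs, again limited by RAM only."""
    return math.ceil(total_vms / vms_per_host(dimm_slots, gb_per_dimm, gb_per_vm))


# Hypothetical inputs: 100 VMs of 4 GB each, 4 GB DIMMs everywhere.
TOTAL_VMS, GB_PER_DIMM, GB_PER_VM = 100, 4, 4

quad_socket_hosts = hosts_needed(TOTAL_VMS, 32, GB_PER_DIMM, GB_PER_VM)  # assumed 32 slots
dual_socket_hosts = hosts_needed(TOTAL_VMS, 16, GB_PER_DIMM, GB_PER_VM)  # assumed 16 slots

print(f"4S hosts needed: {quad_socket_hosts}")
print(f"2S hosts needed: {dual_socket_hosts}")
```

CPU power never appears in the calculation: when RAM is the limit, doubling the DIMM slots per box roughly halves the number of boxes, whatever the core count.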
So despite the fact that the two DP machines have a lot more CPU power, the "scale up" buyers still prefer to go for a large box with more memory; they are not limited by raw CPU power, but by the amount of RAM that they can put in this server. It is these people that AMD will target with their 4S platform, a platform which has - especially for virtualization - a number of advantages over the current Intel 4S "Dunnington" platform... at least until Intel's octal-core arrives. Whether you choose the 2S blades or 4S rack servers depends on whether you believe in the "scale up" or "scale out" philosophy.
The conclusion is that many 4S rack servers are not only bought for raw CPU performance, but for the amount of RAM, their RAS features, and so on. However, it is clear that a 4S server should still outperform 2S servers so that the group of buyers who are believers in the "scale up" philosophy feel good about their purchase.
18 Comments
fredsky - Tuesday, March 3, 2009 - link
1st to market is... Apple Mac Pro! http://www.apple.com/macpro/features/processor.htm...
about 1.9x faster than Harpertown.
lots of benchmarks included.
about the max memory advantage for 4s vs 2s.
we have some HP xw8600 workstation 2s, which can accommodate up to 128GB RAM...
duploxxx - Sunday, March 1, 2009 - link
oh common is that all you can post?
"My last post generated quite a bit of discussion, some of it based on misunderstandings. In this post I'll try to make a few things more clear. In a previous post, I pointed out that there are a good indications that a dual Nehalem EP has a 40 to 100% advantage over Shanghai (depending on the application, based on the SAP and Core i7 workstation benchmarks)."
because a sap bench gives 100% advantage of a hyperthreaded core you already think that it will scale in all applications, you should know better than that before posting such nonsense. Why don't you wait for the real performance charts before you post. Now you have 0 backup of your comments if it's true then let it be but if not you can't back this up then just shut up. You just sound like paid blue marketing.
lets see choosing between 12 real cores or 8 real + 8 virtual in virtualisation is not going to cut it for nehalem, although vmware changed the code to better work with HT they also believed this should not be seen as a real core !!!!!!!
tshen83 - Saturday, February 28, 2009 - link
BUT, I guess I will just point out the obvious.
"In well threaded applications, the best a "hex-core Shanghai" can do is give about a 30-40% boost to performance compared to the current Shanghai, which is most likely not enough to close the gap with the upcoming Nehalem CPU (let alone the 32 nm hex-core version). However, Istanbul is more than a hex-core Shanghai. The improved memory controller and HT-assist can lower the latency of inter-CPU syncing and increase the effective memory bandwidth. For that reason, Istanbul will do better than just "a shanghai with 2 added cores" in many applications such as SAP, OLTP databases, Virtualization scenario's and HPC. Depending on the application, Istanbul might prove to be competitive with the quad-core Nehalem. It is clear that the hex-core "Westmere" which will have a slightly improved architecture will be a different matter."
So you begin the argument by saying "which is most likely not enough to close the gap with the upcoming Nehalem CPU", and then you close the argument with "Depending on the application, Istanbul might prove to be competitive with the quad-core Nehalem"? Which one are you on? Oh I see the operative word here is "might". Yeah, it is hard to speculate performance numbers when you don't have Istanbul silicon in hand do you? I have been very generous when extrapolating Istanbul performance by saying that assuming "linear scalability", which is probably the best case situation for AMD. It still would not overcome the 100% advantage Nehalem-EP has over Shanghais.
Another funny thing is that the 41GB/sec Stream Bandwidth benchmark you posted, I am not sure that it is from the current Nvidia 3600 chipset. Theoretically, quad socket dual channel DDR2-800 cannot produce that amount of bandwidth in the first place.
The only thing that you presented that is true is the 32DIMM argument. But you failed at pointing out the 32 DIMM disadvantage on Opterons. When you use 8 DIMMs per socket, the memory bus downclocks to DDR2-533, which isn't much when you are talking about Barcelonas when the default is 667MHz, but it is a huge downclock when you talk about Shanghai and Istanbul's DDR2-800. All benchmarks published on websites are done with 4 DIMMs per socket, which operates at full speed. When you push 8 DIMMs per socket, you get dual channel DDR2-533, which is close to a 33% performance degradation, something you fail to mention.
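For reference, the "close to 33%" figure can be reproduced with a small sketch of peak per-channel bandwidth, assuming a 64-bit (8-byte) data bus per channel. Note that this measures the drop in theoretical peak bandwidth only; how much application performance actually degrades is a separate question that this arithmetic does not answer.

```python
BYTES_PER_TRANSFER = 8  # 64-bit data bus per DDR2 channel


def peak_channel_gb_s(megatransfers_per_sec):
    """Theoretical peak bandwidth of one memory channel in GB/s."""
    return megatransfers_per_sec * BYTES_PER_TRANSFER / 1000


full_speed = peak_channel_gb_s(800)    # DDR2-800, 4 DIMMs per socket per the comment
downclocked = peak_channel_gb_s(533)   # DDR2-533, 8 DIMMs per socket per the comment
reduction = 1 - downclocked / full_speed

print(f"DDR2-800: {full_speed:.2f} GB/s per channel")
print(f"DDR2-533: {downclocked:.2f} GB/s per channel")
print(f"peak bandwidth reduction: {reduction:.1%}")
```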
Now, I don't know many people here who are virtualization customers who push 8 DIMMs per socket on their 4S servers. Database people are different, because even with slower memory bus, it is still faster than disk seek.
Then your argument on "scaling up" vs "scaling out". It isn't a philosophical question. People without software expertise to "scaling out" would have to "scale up". It is particularly true for database because it is hard to "shard". Another reason is when you buy expensive commercial licenses like Windows Server Enterprise and Oracle licenses or SQL Server Enterprise or VMware ESX Server, the licenses themselves force you to purchase the largest box money can buy.
Given that 2S Nehalem is likely to cannibalize Intel's own 4S Dunningtons, I expect to see Becton released very soon. I don't see how Intel can allow Nehalem to eat away at its own fat cash cow.
BTW, your HP SSD denial article is still missing, and March 25, 2009 isn't much better than March 1, 2009.
JohanAnandtech - Sunday, March 1, 2009 - link
[quote]So you begin the argument by saying "which is most likely not enough to close the gap with the upcoming Nehalem CPU", and then you close the argument with "Depending on the application, Istanbul might prove to be competitive with the quad-core Nehalem"? Which one are you on?
[/quote]
How hard can it be to understand that a theoretical Sixcore version of Shanghai can not keep up with Nehalem, while the extra improvements (HT assist, mem controller) of Istanbul might bring the Istanbul CPU closer?
[quote]
It still would not overcome the 100% advantage Nehalem-EP has over Shanghais.
[/quote]
So by your reasoning Nehalem always has a 100% advantage? It is the best Server CPU Intel has brought out in years, but let us keep it sensible, shall we?
[quote]
Another funny thing is that the 41GB/sec Stream Bandwidth benchmark you posted, I am not sure that it is from the current Nvidia 3600 chipset. Theoretically, quad socket dual channel DDR2-800 cannot produce that amount of bandwidth in the first place.
[/quote]
It is funny because it shows you are extremely critical of someone else, but pretty sloppy in your own posting.
4 sockets x 2 channels x 800 MT/s (DDR2-800) x 8 bytes per transfer = 51.2 GB/s is the theoretical maximum.
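The arithmetic above can be checked with a few lines. This is only a sketch of the theoretical peak, assuming 8 bytes per transfer per channel; the 41 GB/s value is the Stream figure under discussion in this thread, and the constant names are mine.

```python
# Peak DRAM bandwidth = sockets x channels per socket x transfers/s x bytes per transfer.
SOCKETS = 4
CHANNELS_PER_SOCKET = 2
MEGATRANSFERS_PER_SEC = 800   # DDR2-800
BYTES_PER_TRANSFER = 8        # 64-bit data bus per channel

peak_gb_s = (SOCKETS * CHANNELS_PER_SOCKET
             * MEGATRANSFERS_PER_SEC * BYTES_PER_TRANSFER) / 1000

STREAM_ISTANBUL_GB_S = 41     # Stream figure discussed in this thread
stream_fraction = STREAM_ISTANBUL_GB_S / peak_gb_s

print(f"theoretical peak: {peak_gb_s:.1f} GB/s")
print(f"41 GB/s is {stream_fraction:.0%} of that peak")
```

So a 41 GB/s Stream result sits below the theoretical ceiling of quad socket dual channel DDR2-800 rather than above it.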
[quote]
Then your argument on "scaling up" vs "scaling out". It isn't a philosophical question. People without software expertise to "scaling out" would have to "scale up".
[/quote]
In case of virtualization (and that is what I was talking about), it is a lot easier to make software scale out or scale up. For example, a badly scaling PHP site (hard to scale up) could be divided into several VMs, and be one large NLB cluster. In this way you can both scale out (few VMs on many servers) and up (many VMs on few 4S-8S servers).
[quote]
"The only thing that you presented that is true is the 32DIMM argument."
[/quote]
And you are the ultimate judge on that right? And yet I have already demonstrated 4 serious errors in your reasoning. And I didn't even talk about your completely off 37 GB/s (6/4 * 25 GB/s) calculation in your first post. As if Stream scales with the number of cores.... (Remember Dunnington??)
I hope that you can keep the discussion more respectful instead of always jumping the gun. I have been in this business for 10 years, and I have always learned a lot from good debates. So I have no problem with people pointing out in a respectful manner that I made a technical or reasoning error.
But constantly shouting that "you are so wrong" while you build up posts full of factual errors is simply a waste of time.
tshen83 - Sunday, March 1, 2009 - link
Johan, you are really starting to piss me off.
[Quote]
How hard can it be to understand that a theoretical Sixcore version of Shanghai can not keep up with Nehalem, while the extra improvements (HT assist, mem controller) of Istanbul might bring the Istanbul CPU closer?
[/Quote]
When isn't Istanbul a sixcore version of Shanghai? The extra improvements such as HT assist is there because otherwise, the six cores won't experience linear scaling. The memory controller is another issue I will point out later. All available data shows that 2S Nehalem is equivalent to 4S Shanghai right now, the problem I have with you is your constant usage of the word "might", which indicates that you don't have data to back up your assumption that the extra improvements like HT assist can overcome the 100% performance per watt advantage of Nehalem-EP vs Shanghai.
[Quote]
So by your reasoning Nehalem always has a 100% advantage? It is the best Server CPU Intel has brought out in years, but let us keep it sensible, shall we?
[/Quote]
No, not always, but in anything memory related, yes, triple channel DDR3-1333 will have double the bandwidth of dual channel DDR2-800 per Socket.
[Quote]
It is that funny because it shows you are extremely critical for someone else, but for your own posting you are pretty sloppy.
4 Sockets x 2 Channels x DDR800 * 8 Byte/channel= 51.2 GB/s is the theoretical maximum.
[/Quote]
Divide by two please!
Anyone can go to wikipedia and find out the theoretical memory bandwidth, but only the hardware designer and software engineers know that for each memory operation, it needs both a 64bit address and 64bit data, so the maximum theoretical DATA throughput like Stream benchmark, it is half of the theoretical memory bandwidth because address lines goes through the same memory bus. Look at the Quad Shanghai's Stream benchmark 25GB/sec, which is precisely the maximum Quad Socket Dual Channel DDR2-800 can offer. What magic did AMD pull off to suddenly get a 17GB/sec extra memory bandwidth off of Dual channel DDR2-800? I have thought about this more, and think I found out what game AMD is playing. Just read the comments below after I finish defending myself here.
[Quote]
In case of virtualization (and that is what I was talking about), it is a lot more easier to make software scale out or scale up. For example, a badly scaling php site (hard to scale up) could be divided into several VMs, and be one large NLB cluster. In this way you can both Scale out (few VMs on many servers) and up (many VMs on few 4S-8S servers).
[/Quote]
Using VM to scale up because you can't write a PHP program to scale up is stupid and a temporary solution. I really don't want to hear the argument that AMD is better at Virtualization anymore. They WERE better, but not anymore since the release of Dunnington and Nehalem. AMD HAD a stronghold in VM simply because of the cost of the VMware ESX licensing, making 4S AMD the only viable hardware to deploy on. But Dunnington already changed that. So is Nehalem-EP and soon Nehalem-EX.
[Quote]And you are the ultimate judge on that right? And yet I have already demonstrated 4 serious errors in your reasoning. And I didn't even talk about your completely off 37 GB/s (6/4 * 25 GB/s) calculation in your first post. As if Stream scales with the number cores.... (Remember Dunnington??)
[/Quote]
What 4 errors? You mean 4 facts you thought were errors? How is my 37GB/sec calculation wrong when it is only 10% off the AMD's internal Stream Benchmark? That number assumes linear scaling, which is pretty close. Stream does not scale with cores but it does scale with memory controller.
[Quote]
I hope that you can keep the discussion more respectful instead of always jumping to the gun. I have been in this business for 10 years, and I have always learned a lot from good debates. So I have no problem with people point out in a respectful manner that I made a technical or reasoning error.
[/Quote]
I have no trouble being rude to people who compare 4S AMD to 2S Nehalem and claims that 4S Istanbul is OMG winning against 2S Nehalem. It is simply human stupidity. The fact that you are in this business for 10 years only amplify how true that statement is. You should know why you did something wrong.
Now that I have defended my position, I have something more to add, that Johan you aren't going to enjoy reading.
The 41GB/sec Stream benchmark isn't and can't be from Nvidia 3600 chipset. It can't be done on Quad Socket Dual Channel DDR2-800. The only possible explanation you can find is that this benchmark is done on Dual Channel DDR3-1333. That Istanbul is a Phenom II with dual channel DDR3-1333 controller with two extra cores. The fact that Phenom II design had both DDR2 and DDR3 controller makes this all possible.
The benchmark is done with Dual Channel DDR3-1333. So quad Istanbul gives you 8 channels of DDR3-1333 performance, compared to Dual Nehalem-EP, which has 6 total channels of DDR3-1333. That's why you get 41GB/sec vs Nehalem-EP's 34GB/sec. Math would then tell you that AMD's implementation of DDR3 controller is 10% worse in performance than Intel's. Anyways, that also means, that benchmark is done on AMD's own HT3.0 enabled chipset supporting DDR3. So Istanbul is likely going to be a complete platform change rather than a drop in replacement, which is a major risk for AMD.
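The "10% worse" figure follows from dividing each Stream number by the channel count. The sketch below only reproduces that arithmetic and takes the comment's premise (DDR3-1333 on both platforms) at face value; that premise is disputed elsewhere in this thread, where the AMD systems are stated to use DDR2-800.

```python
# Stream figures and channel counts as given in the comment above.
ISTANBUL_STREAM_GB_S = 41
ISTANBUL_CHANNELS = 4 * 2      # quad socket, dual channel
NEHALEM_STREAM_GB_S = 34
NEHALEM_CHANNELS = 2 * 3       # dual socket, triple channel

amd_per_channel = ISTANBUL_STREAM_GB_S / ISTANBUL_CHANNELS
intel_per_channel = NEHALEM_STREAM_GB_S / NEHALEM_CHANNELS
deficit = 1 - amd_per_channel / intel_per_channel

print(f"Istanbul:   {amd_per_channel:.2f} GB/s per channel")
print(f"Nehalem-EP: {intel_per_channel:.2f} GB/s per channel")
print(f"per-channel deficit: {deficit:.1%}")
```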
Now, that Istanbul is also a drop in replacement for Socket F on the DDR2-800 platform because it also has a ddr2 controller. Similar to how Phenom II can use both DDR2 and DDR3. But memory performance on 4 DIMM per socket DDR2-800 will be limited, let alone 8DIMM per Socket DDR2-533 downclock.
The game AMD is playing is that it wants the reviewer sites to publish benchmark using complete DDR3 platform, which nobody right now knows if it is mature enough yet. But AMD really wants to sell to the existing barcelona Socket F upgrade market. See? Advertise DDR3 based performance, but trying to fudge the performance down to DDR2 based platform where memory bandwidth will be cut by half. And that is the Dubai game.
Of course no one knows how it is going to play out yet, and my opinion is that Johan, you should only push benchmark when you have retail CPUs in hand.
Another final point I want to make is this: adding cores isn't a solution to performance per watt deficiency of the AMD CPU design. Even if the cores scale linearly, so does the power it takes. So performance per watt actually stays the same. In fact performance per watt per dollar actually goes down because 6 core will cost more than 4 core. Are you going to choose a 6 core Istanbul with 120W TDP over a 95W TDP Nehalem-EP knowing that it is also 50% slower? I wouldn't. Performance per watt tells you you shouldn't. Anyone can CTRL-C and CTRL-V a whole bunch of cores, it does not solve the finer point that AMD's cores are at half the performance/watt compared to Nehalem cores. Imagine that Intel had to compete on performance/watt against AMD when it had the FSB/FB-DIMM handicap, now that the handicaps are removed, that's how you get a 100% boost in performance/watt. This isn't something AMD can engineer over night.
Johan, in the bigger picture, you are just getting paid to write "notes" to the hardware community based on what AMD PR department wants the public to believe. Being in the hardware scene for that long, you should have the intelligence to tell which benchmarks are true, and which benchmarks are fluff.
If you want to continue discussing, go straight ahead. I would actually contact your AMD PR representative and ask how they got the 41GB/sec Stream benchmark. Or did you get that information from Youtube?
JPForums - Friday, March 20, 2009 - link
[Quote] Johan, you are really starting to piss me off. [/Quote]
Whether you are right or wrong, this isn't the way to convince someone. It tends to make people think that you aren't thinking clearly and they tend to just assume you are wrong.
[Quote] ... which indicates that you don't have data to back up your assumption that the extra improvements like HT assist can overcome the 100% performance per watt advantage of Nehalem-EP vs Shanghai. [/Quote]
I seem to remember reading that it might be able to compete with the more common 40% advantage. The reason he said "So by your reasoning Nehalem always has a 100% advantage?" is because that's the case you are constantly bringing up.
[Quote] ... only the hardware designer and software engineers know that for each memory operation, it needs both a 64bit address and 64bit data [/Quote]
First, there are separate address and data lines (generally of different bit widths) going to each DDR chip. If you'd ever designed a hardware memory interface you'd know this.
Second, the way memory works in practice (simplified) is it takes a single memory address and bursts out data sequentially starting from that address. The reason you can't hit theoretical rates is that there is access latency associated with each operation. (Do a search on Column Access Strobe and keep reading about associated memory timings ... read again ... come back in a month) Bursting data was an idea implemented long long ago to partially hide latencies like this. It should be noted, though, that the number of data bits you can burst is still limited (particularly if the data required isn't stored linearly).
Third, making absolute generalizations like "only the hardware designer and software engineers know ..." is presumptuous. In my experience, most software engineers I know don't really know much about the hardware outside of how it affects their code. I.E. they do an exceptional job of designing code to take advantage of n cores with a maximum memory bandwidth of X GB/s and an average latency of Z ns, but they couldn't look at a system and tell you its bandwidth and latency characteristics. There are obviously exceptions. Are you by chance a software engineer?
[Quote] How is my 37GB/sec calculation wrong when it is only 10% off the AMD's internal Stream Benchmark? [/Quote]
Your 37GB/s is a theoretical maximum. If they are indeed getting 41GB/s in a real world application with the system in the article, then being off by only 10% isn't any less wrong as it exceeds the maximum.
[Quote] The 41GB/sec Stream benchmark isn't and can't be from Nvidia 3600 chipset. It can't be done on Quad Socket Dual Channel DDR2-800. The only possible explanation you can find is that this benchmark is done on Dual Channel DDR3-1333. [/Quote]
This is contradicted by the following.
[Quote] To better understand this, we combined our own stream benchmarking with the one that AMD presented. All AMD systems are using DDR-2 800. [/Quote]
Am I to understand that your real issue is the honesty of not just AMD but the author? While I don't have the same misgivings about the author, I do agree with the idea that suppositions should be backed with evidence. What is presented here is hardly more than theory.
However, this is only a blog post. The only real purpose of this is to let people know that Istanbul is more significant than adding two cores to Shanghai. Detailed analysis should be saved for the detailed review. Whether AMD is trying to pull the wool over our eyes or not will be obvious then.
biodude10 - Saturday, March 14, 2009 - link
Quote: "Divide by two please!"
I'm not sure who you think you are, but you have no idea what you are talking about on this STREAM BW / DRAM topic. Johan's math is right. Yours is not. I won't be surprised if you respond to this post claiming (yet again) that you are right . . . but it just isn't so. Save yourself some embarrassment, learn something, and admit you made a mistake and learned something from the valuable feedback you've received from this discussion so that future dialogue can be enhanced.
A single error in math is understandable, everyone makes mistakes. But when you continue to make the same error over and over (even when it's been pointed out) you just look like a fool.
ssj4Gogeta - Friday, February 27, 2009 - link
ok, i don't know much about servers, but from what i understand, AMD's 24 cores are barely able to beat Intel's 8 cores. LOL
i can just imagine what will happen when Intel launches 8 core Beckton quad socket. Probably AMD will have to release a 24 core quad socket to beat it.
duploxxx - Sunday, March 1, 2009 - link
you said it correct, you barely know anything about servers, leave it that way, 24 cores will give a total kill to any 2s system as long as you do not consider power and price.
winterspan - Saturday, February 28, 2009 - link
Yeah, the 4-socket 32-core "Beckton" Xeon platform will be absolutely insane! It has quad channel memory and FB-DIMMs so it should support an enormous amount of memory.. Coming in H1 2010 I believe