LINPACK: Intel's Nehalem versus AMD Shanghai
by Johan De Gelas on November 28, 2008 12:00 AM EST- Posted in
- IT Computing general
A "beta BIOS update" broke compatibility with ESX, so we had to postpone our virtualization testing on our quad CPU AMD 8384 System.
So we started an in depth comparison of the 45 nm Opterons, Xeons and Core i7 CPUs. One of our benchmarks, the famous LINPACK (you can read all about it here) painted a pretty interesting performance picture.
We had to test with a matrix size of 18000 (2.5 GB of RAM necessary), as we only had 3 GB of DDR-3 on the Core i7 platform. That should not be a huge problem as we tested with only one CPU. We normally need about 4 GB for each quadcore CPU to reach the best performance.
We also used the 9.1 version of Intel's LINPACK, as we wanted the same binary on both platforms. As we have show before, this version of LINPACK performs best on both AMD and Intel platforms when the matrix size is low. The current 10.1 version does not work on AMD CPUs unfortunately.
We don't pretend that the comparison is completely fair: the Nehalem platform uses unbuffered RAM which has slightly lower latency and higher bandwidth than the Xeon "Nehalem" will get. But we had to satisfy our curiousity: how does the new "Shanghai" core compare to "Nehalem"?
Quite interesting, don't you think? Hyperthreading (SMT) gives the Nehalem core a significant advantage in most multi-threaded applications, but not in Linpack: it slows the CPU down by 10%. May we have found the first multi-threaded application that is slowed down by Hyperthreading on Nehalem? That should not spoil the fun for Intel though, as many other HPC benchmarks show a larger gap. AMD has the advantage of being first to the market, Nehalem based Xeons are still a few months away.
Also, the impact of the memory subsystem is limited, as a 50% increase in memory speed results in a meager 6% performance increase. The Math Kernel Libraries are so well optimized that the effect of memory speed is minimized. This in great contrast to other HPC applications where the tripple channel DDR-3 memory system of Nehalem really pays off. More later...
60 Comments
View All Comments
befair - Friday, November 28, 2008 - link
No no no .. they dont want to do anything that can show that AMD performs better than Intel. Hey .. where do you think you are? This in Intel land dude!BlueBlazer - Saturday, November 29, 2008 - link
Then go back to your fantasy land. Your irrational and useless comments are not required here, they are just more of less like one of those everyday spam.psychobriggsy - Friday, November 28, 2008 - link
Any timeframe for Nehalem 4 socket (16 core, 32 threads) comparisons against Shanghai 4S or 8S?A major advantage of Shanghai is that it is a drop-in replacement for a couple of years' worth of S1207 server infrastructure, which is quite useful in these credit restricted times in my opinion.
Clauzii - Sunday, November 30, 2008 - link
Also the power usage is a positive factor. I think AMD have played their cards right by maybe not making the fastest CPU out there overall, but hit a sweet spot with a moderate speedbump and less power needed at a rather good price.It seems like it will be really sweet for those already running Barcelonas.
TruePath - Friday, November 28, 2008 - link
Well yes, HT seems to slow LINPACK down but given your comments about memory it's not clear that it would still slow things down if more memory was present.I mean LINPACK is heavily optimized and no doubt behaves differently depending on the number of cores it sees. Given that HT appears to add a core it will likely behave as if it has access to two cores and thus run into the memory barrier you mentioned above. I'd be curious to see how HT performed with more memory.
Also, I suspect that *highly* optimized SMP code will fairly frequently show reduced performance with HT. In particular if the code reacts to the number of processors and chooses the implementation/algorithm accordingly it shouldn't be uncommon to select an implementation for 2 processors that's actually more wasteful of FLOPS/IOPS than the single processor code. After all if you use 1.8 times the operations required for 1 CPU to compute the result evenly divided between 2 processors you've still finished faster. However, HT doesn't really give you twice the computational power so it will very likely fool this kind of optimization.
Is there a way that heavily optimized code like LINPACK could recognize that it's not really two processors but actually HT and react accordingly?
BlueBlazer - Saturday, November 29, 2008 - link
I also would guess that since LINPACK was so optimized, it would keep the FPU/SSE unit busy full time. With HT that single FPU/SSE unit will be shared between 2 threads, thus queueing occurs. This would delay somewhat if one thread had to rely on the other thread to finish a calculation to continue (dependancy). And if both of those threads are on the same core, that would be some delay. In other applications like those 3D renderers, each thread is highly independant (no dependancy on other threads) and that would make a difference.My 2 cents anyway.
JohanAnandtech - Saturday, November 29, 2008 - link
Ah! It has been staring me right in the face, thanks for pointing that out. Linpack is known to have ultra high IPC (I believe it is probably between 2 and 3, will give it a shot). There is simply "no room" anymore for a second thread: the other thread is fully using the FP ports.Pelle1948 - Saturday, November 29, 2008 - link
They have several tests showing where HT off is better:http://techreport.com/articles.x/15818/12">http://techreport.com/articles.x/15818/12
Mathos - Friday, November 28, 2008 - link
On the other hand, once AMD releases the new platform for their server chip it should shore up the performance gap. Especially if it's the additional bandwidth from DDR3 that's causing it, or the difference between HT2.0 and QPI speed.duploxxx - Friday, November 28, 2008 - link
actually I was going to say the same, knowing that i7 performance is here a bit high due to mem config i think shanghai actually does well and will only close the gap more when ht3 + ddr3 is released after 2p nehalem launch.Won't be the only test that we will see negative HT influence, big questions will rise with virtulization, the old pentium xeon architecture was also way better off with HT off on that.