AMD's K10: a "dead" product or not?
by Johan De Gelas on May 12, 2008 12:00 AM EST - Posted in IT Computing general
A few years ago it was fashionable to bash Intel's Pentium 4 as a braindead architecture. The fact that the Pentium 4 Northwood (533 MHz FSB) was the best performing processor from mid 2002 until late 2003 in many applications, and that the Pentium 4 Northwood remained competitive until early 2004 was conveniently forgotten: nuances do not make good headlines.
It is now trendy to bash AMD. One "PC doctor" at ZDNet goes as far as to say that:
"When I look at AMD’s current product line, all I see is a forest of
deadness. Intel has products trump every category of products
going. Server, desktop, mobile, low-end, high-end, dual-core,
quad-core. Intel has all these markets stitched up."
Nuances, who needs them when you can make a sensational headline? And indeed, the latest desktop CPU articles here at AnandTech show that Intel's midrange CPUs have a significant lead over the fastest Phenom processors.
Like any design, the K10 is a trade-off. And most trade-offs were made in favor of the applications in the server and HPC market, at the expense of games and other desktop applications.
First take a look at this page, which compares a Core 2 Duo E4400 (2 GHz, 2 MB L2 and 800 MHz FSB) with a slower 1.86 GHz Core 2 Duo E6320 (4 MB of L2 and a 1066 MHz FSB). One thing is for sure: games prefer the larger L2 cache. Some of the games were up to 10% faster on the CPU which was clocked 7% lower but had twice the L2 cache. The fact that games prefer a 4 MB L2 is not going to change when you run them on an AMD CPU with an integrated memory controller (IMC). An L2 can deliver the necessary data in 12-20 cycles; an IMC needs about 100 cycles.
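The arithmetic behind that comparison is worth spelling out. The short Python sketch below uses only the clock speeds and the "up to 10%" best-case figure quoted above. The variable names are ours, and the per-clock number it derives is a back-of-the-envelope estimate that assumes performance would otherwise scale linearly with clock speed. It is not a benchmark result.

```python
# Figures quoted in the text; variable names are illustrative.
small_cache_ghz = 2.0     # 2 GHz chip with 2 MB L2
large_cache_ghz = 1.86    # 1.86 GHz chip with 4 MB L2
best_case_speedup = 1.10  # "up to 10% faster" in some games

# How much lower the large-cache chip is clocked.
clock_deficit = 1 - large_cache_ghz / small_cache_ghz

# Implied gain in work per clock cycle over the small-cache chip,
# ASSUMING performance would otherwise scale linearly with clock speed.
per_clock_gain = best_case_speedup * small_cache_ghz / large_cache_ghz - 1

print(f"clock deficit:  {clock_deficit:.1%}")
print(f"per-clock gain: {per_clock_gain:.1%}")
```

In the best case, then, the larger L2 (together with the faster FSB, since the two chips differ in both) is worth roughly 18% per clock in those games.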
Now, take a look at the cache architecture of AMD's K10/Barcelona. If you run a single-threaded game on it, it gets a fast 512 KB L2 cache and after that a relatively slow (44-48 cycles!) 2 MB L3. If you know that the same game can benefit from more than 2 MB of cache, it is pretty clear that the 512 KB L2 is not going to cope: you'll end up using the L3 a lot. A dual-threaded game might need a little less per thread, but the same problem will happen again: it needs to go to that slow L3 cache all too often. Run that same game on an Intel Core CPU and each thread of your dual-threaded game gets a low-latency 4 MB (or 6 MB) L2.
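To see why the slow L3 hurts, it helps to put rough numbers on the average cost of a memory access. The Python sketch below is a deliberately simplified model. The latencies are taken from the figures quoted in this article (15 cycles for an L2 hit, which sits inside the 12-20 cycle range, 46 for the L3 and 100 for a trip to memory). The hit rates are hypothetical values picked purely for illustration, the function name is ours, and using the same memory latency for both designs is a simplification.

```python
def average_access_cycles(cache_levels, memory_cycles):
    """Average cycles per access for a simple cache hierarchy.

    cache_levels is a list of (hit_rate, latency_cycles) pairs, ordered
    from the cache nearest the core outwards. Simplification: an access
    pays only the latency of the level that finally serves it.
    """
    remaining = 1.0  # fraction of accesses not yet served
    total = 0.0
    for hit_rate, latency in cache_levels:
        served = remaining * hit_rate
        total += served * latency
        remaining -= served
    return total + remaining * memory_cycles


MEMORY_CYCLES = 100  # "about 100 cycles" for the IMC; reused for both designs

# HYPOTHETICAL hit rates: a game whose working set overflows a 512 KB L2
# but mostly fits in a 4 MB L2.
small_l2_plus_l3 = average_access_cycles([(0.70, 15), (0.60, 46)], MEMORY_CYCLES)
large_l2_only = average_access_cycles([(0.90, 15)], MEMORY_CYCLES)

print(f"512 KB L2 + 2 MB L3: {small_l2_plus_l3:.1f} cycles per access")
print(f"4 MB L2:             {large_l2_only:.1f} cycles per access")
```

With these made-up hit rates the small-L2 design spends noticeably more cycles per access on average. Change the hit rates and the gap moves, which is exactly the point: the outcome depends on how much of the game's working set fits in the L2.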
Now let us imagine that we run 4 threads of an HPC workload on it. Each thread has a very limited number of instructions, which fit perfectly in each of the L2 caches. You get 4 threads which together get 4x the L2 bandwidth. In Intel's case, each pair of threads has to share the available bandwidth of one L2. The amount of data is huge, so caching the data is hardly possible, and the fast IMC does wonders for the K10 chip. Data that is shared between the 4 cores remains in the L3 cache and all L2 caches are kept coherent via an incredibly fast SRI (System Request Interface). So your cache coherency overhead does not increase with the number of caches; it increases per socket. Going from 2 to 4 sockets means that you double the amount of cache coherency traffic. Compare that to the Intel platform, where all L2 caches need to be kept coherent.
This is just one example of why we could never expect the K10 chip to be a super desktop chip. But how is Barcelona doing in the server world? Is it limited to an HPC niche market? Well, let us see what Intel thinks. First of all, where do most of the 45 nm chips go? Just a few weeks ago, Anand reported that Intel had no intention of flooding the desktop with 45 nm Core 2 chips quickly.
Those 45 nm chips are going to the server market. Why? Several reasons.
First of all, the server market might be only 20% of Intel's revenue. But look at this:
CPU                      | ASP       | Profit margin (estimate) | Percentage of revenue
Intel Server CPU         | >$400     | >$300                    | +/- 20%
AMD Server CPU           | $300-$400 | $220-$330                | +/- 16%
Intel Mobile/Desktop CPU | $100      | $40-$50                  | +/- 80%
AMD Mobile/Desktop CPU   | $50-$65   | $5-$30                   | >80%
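The table's point can be checked with a little arithmetic: a segment that brings in 20% of revenue can account for a much larger slice of profit if each chip sold earns more. The Python sketch below uses the Intel figures from the table. The function name is ours, and because the table only gives lower bounds and a range, plugging in $400 and $300 for servers and the $45 midpoint for mobile/desktop is an assumption.

```python
def profit_shares(segments):
    """Return each segment's share of total profit.

    segments maps a name to (revenue_share, asp, profit_per_cpu).
    Profit is approximated as revenue_share * profit_per_cpu / asp.
    """
    profit = {
        name: revenue_share * profit_per_cpu / asp
        for name, (revenue_share, asp, profit_per_cpu) in segments.items()
    }
    total = sum(profit.values())
    return {name: value / total for name, value in profit.items()}


# Intel rows from the table. ASSUMPTION: the ">" lower bounds are used as-is
# and the $40-$50 range is replaced by its $45 midpoint.
intel = {
    "server": (0.20, 400, 300),
    "mobile/desktop": (0.80, 100, 45),
}

intel_shares = profit_shares(intel)
for name, share in intel_shares.items():
    print(f"{name}: {share:.0%} of profit")
```

Under these assumptions the server segment, with only about 20% of revenue, delivers close to 30% of the profit.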
Secondly, Intel needs those 45 nm chips to be competitive in the HPC market. A 2 GHz Barcelona is capable of keeping up with the best 65 nm Xeons in those applications.
It is pretty clear why AMD focused on the server market. Without a complete redesign it is not possible to beat Intel's integer crunching power and its fast, big L2 cache, and that is exactly what a modern game needs. Barcelona builds on the K8 architecture and inherited its relatively inflexible integer pipeline. While Core 2 has sophisticated reordering of loads and stores, Barcelona does only limited reordering of loads. While Core 2 offers a 32-entry queue to the integer units, Barcelona has three rather inflexible, separate 8-entry queues.
So the right way forward for AMD was to focus on HPC and server applications where it could leverage its strong points. We can bash AMD for being so late, and for coming up with relatively low-clocked CPUs, but even a 2.8 GHz Phenom would not have raised AMD's ASP significantly in the desktop market.
We are almost done with our first round of quad socket benchmarking and we can tell you that we are having a lot more fun than Anand: it is a good old exciting fight between AMD and Intel. Don't believe us? Let Intel do the talking again:
Yes, projecting the bad performance of the desktop chip to say that "AMD's products are a dead forest" is ... just silly. If you have missed the previous entries of our IT blog, just go to it.anandtech.com
74 Comments
magagne - Friday, May 16, 2008 - link
This is all very interesting, a lot of insightful comments, as usual a great job by Johan, love it.
I am surprised nobody has mentioned the obvious (or may have, some comments are really long). I don't have time to read every site and every benchmark to identify which Intel/AMD is superior in every given situation. And I don't care. For day to day work/gaming, it does not matter if I get 5 FPS more or less. Nor if power is 5 Watts more or less. Nor if it costs 20 bucks more or less. I just need a reasonably performing CPU, like most people do.
Yes I know the Core Duos 2/4 cores are overall better products on the Desktop. But I will still buy AMD CPUs for 2 (maybe questionable) reasons
1) I have had them for 5 years, they still work fine, never had any problems (customer loyalty)
2) AMD is the underdog. If they go down, you can kiss competition good bye (customer morals)
Don't flame me as a fanboy, it's like buying bio-food guys, it comes down to choice. If they produced only crap, then I would be the fool sinking with the ship. But with a reasonably comparable performance/price ratio, it's AMD for me.
Cheers to all!
crashmanII - Tuesday, May 20, 2008 - link
Keep in mind that Intel has at least 10 fabs and AMD just two. Every cache bit needs 6 transistors, so 1 MB of cache is >50 M transistors. I think AMD hasn't the capacity to build a Wolfdale monster.
btw, excellent article, good comments!
0g1 - Thursday, May 15, 2008 - link
Interesting article. Made me realize that AMD's K8/10 cache design is the main reason for its poor performance in desktop apps vs Core2. AMD's design is for smaller programs, heavily threaded. Intel's Nehalem design is similar, except for even more heavily threaded programs. It can't compete with Penryn's performance in larger programs because of the amount and speed of the L2 cache. It's like 6MB vs 0.5MB programs. There aren't very many 0.5MB desktop programs.
L3 cache is almost irrelevant because of how slow it is. However it really helps with feeding a lot of processors with private L2's.
Nehalem's cache seems to be worse than Shanghai's. 0.25MB vs 0.5MB L2 cache means even worse performance in desktop apps. The L3 cache speed of both CPUs seems to be the same.
Nehalem has Hyper Threading, triple channel DDR3, and increased load store buffers (but these are only the result of HT) -- these things are more for greater parallelism (ie servers) but mean almost nothing for desktop applications. I think HT will even slow down most of today's games that use 2-4 cores because 4 processors are faster than 2+2 virtual processors.
Overall, I think Nehalem will be a lot slower compared to Penryn as a desktop processor and even a little slower per clock than Shanghai. It should be superior for massively parallel applications though.
As we move forward into the future, adding more cores, using a shared L2 like Penryn's will be too slow. However, I can't help but wonder if we have enough cores (ie 4) already to justify the move to this slower L3 cache hierarchy. Using a smaller amount of private L2 for each core will keep access latencies low. Especially with 4 cores. 4 cores accessing the same L2 could theoretically increase the latency 4 times. Considering L3 is 'only' 3 times slower, I guess it makes sense to change when we get to 4 cores. However, the lack of multithreaded software that uses smallish programs fitting into the L2 caches means buying an X4 or Nehalem is a very forward looking solution. And even with 4 cores (or 8 virtual) on an L3 cache design, the performance increase over a 4 core Penryn design would be minimal. I wouldn't invest in a Nehalem or Phenom unless there was enough software support and enough cores (12 cores sounds tempting).
EvilBlitz - Wednesday, May 14, 2008 - link
It's nice that they focused on the server segment for greater margins, but you can not ignore the rest that makes up 80% of your revenue.
I still think they should have released a dual core 65nm K8 with 2 MB of L2. It would have been much more competitive in single threaded apps, esp games, and they wouldn't have to sell their CPUs for such a crazy low price.
I think it would have been a good enough stop gap given their limited resources.
Sunrise089 - Wednesday, May 14, 2008 - link
Yes you can, if that 80% of your revenue only accounts for a small percentage of your profits.
I love correcting the same mistakes I did 40 posts up.
Kaleid - Tuesday, May 13, 2008 - link
the P4 was faster than what AMD had available they were still far too overpriced.
rogerpjr - Tuesday, May 13, 2008 - link
Year 2000. Winter. Dual cpu.com came out with an interesting observation. The then new CPU on the block from AMD was the Duron 1 GHz. NEW architecture allowed it to mimic the Athlon. ..... Remember? Here's how.
It was suggested that yes they could be OC'd. Sure. No biggie. BUT they could (assuming proper cooling) be used with the dual CPU mobos from Tyan. The 216?(4). YES. My hotrod with a twist. My son and I made a living out of NOT OC'ing them, BUT something even more twisted. Telling our customers what we had discovered, and then having at it. We did this with the 1.0 thru the 1.2 GHz Durons. Of course if said customer said they wanted the "REAL" dual core we'd supply that. But that Duron 1.0 (we tested them 1st, natch) thru the 1.2 was a real workhorse.
Just a blast from the past.
rogerpjr
EclipsedAurora - Tuesday, May 13, 2008 - link
Thanks to the bandwidth advantage of the HyperTransport interface used by AMD processors, Opteron still has an absolute dominance in the SAN/NAS storage/disk array processor market. Intel as well as other processor manufacturers can't stand in the water in this market!
jap0nes - Tuesday, May 13, 2008 - link
I have to admit, Intel was very humble admitting they've had their asses kicked by a 2GHz Barcelona!
3GHz, 12MB L2 being crushed by a 2GHz 2MB L3