Original Link: https://www.anandtech.com/show/2022
Intel Woodcrest, AMD's Opteron and Sun's UltraSparc T1: Server CPU Shoot-out
by Johan De Gelas on June 7, 2006 12:00 PM EST - Posted in IT Computing
Introduction
In Q1 of 2006, AMD-based systems accounted for over $1 billion, or one sixth of the x86 server space. The Opteron grew from a 6% market share to 15% market share in the astonishingly short period of only one year. In four socket servers, the Opteron grabbed 48% of the US market, up from 23% last year. What's more, this is not a "US only" phenomenon: the Opteron has a firm grip on 36% of the worldwide four socket market. Bear in mind that less than 4 years ago, AMD was nothing more than a blip on the server CPU radar.
Sun, which was getting strangled by the high volume Intel Xeon and the mighty Itanium, has also made a big comeback. An attractive UltraSparc IV+ with a fast, integrated L2 cache and massive L3 cache keeps the traditional Sparc buyers loyal, while the well-designed Galaxy Opteron based servers are pretty popular and the UltraSparc T1 "throughput CPU" attacks the midrange x86 market.
It's high time for Intel to find a proper response, as the competition is taking the wind out of Intel's server CPU sails. What's the answer? A Xeon based on the Core architecture: Woodcrest. We compared the Core and K8 architectures just a month ago. Memory disambiguation, large OOO buffers and a large but low latency shared L2 cache should make the Core architecture more efficient in server related tasks than any other x86 CPU.
This article compares a Woodcrest based Intel server with its closest rivals: AMD Opteron based servers such as the HP DL385 and MSI K2-102A2M and the UltraSparc T1 based Sun T2000 server.
The New Intel Platform
The biggest advantage of Intel's newest Bensley platform is longevity: the Dempsey, Woodcrest and quad-core Clovertown Xeon all use the same socket and platform.
Bensley also eliminates the shared Xeon bus by giving each CPU an independent bus running at 1333 MHz. This is somewhat similar to the old Athlon MP platform, and it should be noted that this makes the Blackford Northbridge or MCH a pretty complex chip. Blackford also offers up to 4 memory channels and 24 PCI Express lanes.
The Dual Independent Bus (DIB) will not make much difference for Woodcrest and Dempsey, as only some HPC applications are really limited by FSB bandwidth. Three years of benchmarking tell us that most server and workstation applications are not bottlenecked by modern FSB speeds. The Opteron platform does not scale that much better because of NUMA in dual and quad core configurations. No, in most applications, the low latency integrated memory controller makes the difference, not FSB/NUMA bandwidth. Of course, with Clovertown, or two Woodcrests on one chip, a shared FSB might become a bottleneck, and in that case a DIB is a good idea.
The biggest innovation of Blackford is the introduction of fully buffered DIMMs (FB-DIMMs). On the FB-DIMM PCB we still find parallel DDR-2, but the Advanced Memory Buffer (AMB) converts this parallel data stream into a serial one to the Blackford chip. The serial links between the memory subsystem and the chipset not only eliminate skew problems but they also greatly simplify the routing on the motherboard. Routing quad-channel DDR-2 would be a nightmare.
The AMB, which you see under the heatsink in the middle of the DIMM, solves the skew and routing problems, and it comes with a relatively small price premium. The AMB also allows full duplex operation from the chipset to the AMB, where other memory bus designs are half duplex and introduce extra latency when alternating between send and receive modes. However, the AMB dissipates about 5 Watt and increases latency. This means that with 8 DIMMs or more, the advantage of using 65 Watt Woodcrest CPUs over 89-92 W Opterons will be gone.
The Blackford chipset uses x8 PCI Express links to talk to various other chips such as the ESB-2 I/O bridge, or "Southbridge" to keep it simple. The other PCI Express links can be used for 10 Gbit Ethernet or a SATA or SAS controller. A workstation version of Blackford, Greencreek, will offer dual x16 PCI Express for running multiple workstation graphics cards.
Words of Thanks
A lot of people gave us assistance with this project, and we would of course like to thank them.
Waseem Ahmad, Intel US
Matty Bakkeren, Intel Netherlands
Trevor E. Lawless, Intel US
(www.intel.com)
Chhandomay Mandal, Sun US
Luojia Chen, Sun US
Peter A. Wilson, Sun US
(www.sun.com)
Peter Zaitsev, Elite MySQL Guru
(www.mysql.com)
Damon Muzny, AMD US
(www.amd.com)
Steve Olson, Sybase US
(www.sybase.com)
Erwin Vanluchene, HP Belgium
(www.hp.be)
Ilona van Poppel, MSI Netherlands
Ruudt Swanen, MSI Netherlands
(www.msi-computer.nl)
Alexander Goodrich, Assembler Guru
Bert Devriese, Developer of MySQL & PHP benchmark
Dieter Saeys, Gentoo/Linux support
Brecht Kets, Development of Improved Bench program
Tijl Deneut, Solaris, PostGreSQL and MySQL support
I would also like to thank Lode De Geyter, manager of the University College of West-Flanders. Further information about our server research is available on our website.
Benchmark Configuration
We used Solaris 10 for the Sun T2000, as the only supported OS for the T2000 right now is Solaris 10 3/05 HW2 (and upwards). The T1 is fully binary compatible with existing SPARC binaries but needs this version of Solaris.
Below is a picture of our Server lab at the University College of West-Flanders. You can see Bert and Dieter standing next to our brand new rack of the server research lab.
From top to bottom, we have the Supermicro SuperServer 6014P-32, the MSI K2-102A2M, the Sun T2000, our own PIII based Linux gateway and firewall, and at the bottom, the Promise JBOD300s and the HP DL585. Yes, we still have a lot of benchmarking to do. The other Intel based machines are in towers, so you won't find them in our rack.
All benchmarking is monitored: CPU load, network and disk I/O are watched using CPU graph, top, vmstat and prstat. This way we can determine whether or not the CPU or another component is the bottleneck.
Our web server tests are performed on Apache2 2.0.55, including the mod_deflate module for gzip compression, PHP 4.4.1 and MySQL 5.0.21.
Hardware Configurations
Here is the list of the different server configurations:
Sun T2000:
Sun UltraSparc T1 1 GHz, 8 cores, 32 threads
Sun Solaris 10
32 GB (16x2048 MB) Crucial DDR-2 533
NIC: 1 Gb Intel RC82540EM - Intel E1000 driver.
You can find much more information about the T1 CPU in our previous article.
Intel Server 1:
Dual Intel Xeon "Woodcrest" 3 GHz Shared 4 MB L2 cache, 1333 MHz FSB (4 cores total)
Blackford Chipset
64 bit Gentoo Kernel 2.6.15-gentoo-r7
Intel Server Board S5000
4 GB (4x1024 MB) Micron FB-DIMM Registered DDR2-533 CAS 4, ECC enabled
NIC: Dual Intel PRO/1000 Server NIC
2x Western Digital Raptor 36 GB SATA
Intel Server 2:
Dual Intel Xeon "Irwindale" 3.6 GHz 2 MB L2 cache, 800 MHz FSB - Lindenhurst
64 bit Gentoo Kernel 2.6.15-gentoo-r7
Intel Server Board SE7520AF2
8 GB (8x1024 MB) Micron Registered DDR2-400 CAS 3, ECC enabled
NIC: Dual Intel PRO/1000 Server NIC (Intel 82546GB controller)
2x Western Digital Raptor 36 GB SATA
Opteron Server 1: Dual Opteron 275 2.2 GHz 2x1MB L2 cache (4 cores total)
64 bit Gentoo Kernel 2.6.15-gentoo-r7
Solaris x86 10
MSI K8N Master2-FAR
4 GB: 4x1GB MB Crucial DDR-400 (3-3-3-6)
NIC: Broadcom BCM5721 (PCI-E)
2x Western Digital Raptor 36 GB SATA
Opteron Server 2: MSI K2-102A2M
ServerWorksHT2000 Chipset
64 bit Gentoo Kernel 2.6.15-gentoo-r7
4 GB: 4x1GB MB Crucial DDR-400 (3-3-3-6)
NIC: Broadcom BCM5721 (PCI-E)
2x Western Digital Raptor 36 GB SATA
Opteron Server 3: HP DL385
Solaris x86 10
AMD 81xx chipset
64 bit Gentoo Kernel 2.6.15-gentoo-r7
4 GB: 4x1GB MB Crucial DDR-400 (3-3-3-6)
NIC: Broadcom BCM5721 (PCI-E)
2x Seagate Cheetah 36 GB - 15000 rpm - SCSI 320 MB/s
Client Configuration: Dual Opteron 850
MSI K8T Master1-FAR
4x512 MB Infineon Registered DDR-333, ECC
NIC: Broadcom 5705
Common Software
64 bit Gentoo Kernel 2.6.15-gentoo-r7
Apache2 2.0.55 + mod_deflate module for gzip compression.
PHP 4.4.1
MySQL 5.0.21
The Official SPEC Numbers
SPEC FP and Int 2000 are the standard benchmarks to evaluate CPU performance. However, the benchmark numbers are highly dependent on the compiler. SPEC FP and Int show best case performance, as the CPU runs aggressively compiled and highly optimized code. In the real world, code is compiled in a more conservative/less optimized way.
In practice this means that Intel's SPEC numbers - thanks to its highly capable compiler team - are (slightly) higher than what you see in real applications. Nevertheless, SPEC CPU 2000 is a good starting point to understand what a CPU is capable of. As mentioned earlier, the Xeon 5100 is the Xeon Woodcrest, based on the new Core architecture.
SPECfp
CPU | Clockspeed (MHz) | SPEC fp 2000
POWER5+ | 2200 | 3271
Itanium 2 | 1666 | 2851
Xeon 5160 | 3000 | 2783
Opteron | 2800 | 2256
Pentium 4 E | 3733 | 2232
The new Woodcrest is about 20-25% faster than the fastest dual-core Opteron. The 7% clockspeed advantage is most likely a result of the fact that the Woodcrest was baked with a newer 65nm process. If AMD manages to keep up with Intel when it comes to clockspeed, the advantage of their newest CPU might shrink to 15% or less. However, Intel's Woodcrest will have a much bigger advantage in all applications that make heavy use of 64 and 128-bit SSE.
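The percentages in the paragraph above can be reproduced from the SPECfp table with a few lines of Python. This is only a back-of-the-envelope sketch: the "per-clock" figure assumes that the SPEC score scales linearly with clockspeed, which is our simplification and not something SPEC guarantees. The helper name `advantage` is ours.

```python
# SPECfp 2000 scores and clockspeeds (MHz), copied from the SPECfp table.
xeon_5160 = {"mhz": 3000, "specfp": 2783}
opteron = {"mhz": 2800, "specfp": 2256}


def advantage(a, b):
    """Return by how many percent a is larger than b."""
    return (a / b - 1.0) * 100.0


score_adv = advantage(xeon_5160["specfp"], opteron["specfp"])
clock_adv = advantage(xeon_5160["mhz"], opteron["mhz"])

# Score per MHz: a crude proxy that ASSUMES linear scaling with clockspeed.
per_clock_adv = advantage(
    xeon_5160["specfp"] / xeon_5160["mhz"],
    opteron["specfp"] / opteron["mhz"],
)

print(f"SPECfp advantage:    {score_adv:.1f}%")
print(f"Clock advantage:     {clock_adv:.1f}%")
print(f"Per-clock advantage: {per_clock_adv:.1f}%")
```

Under that linear-scaling assumption, the clock-for-clock gap comes out at roughly 15%, which is where the "15% or less" estimate comes from.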
SPECint
CPU | Clockspeed (MHz) | SPEC Int 2000
Xeon 5160 | 3000 | 3057
Pentium 4 E | 3733 | 1870
Opteron | 2800 | 1837
Pentium 4 Xeon | 3733 | 1813
POWER5+ | 2200 | 1705
Itanium 2 | 1666 | 1502
When it comes to integer performance, the Woodcrest numbers are simply stunning and vastly superior to any other architecture. Let us find out if this vastly superior integer performance in SPEC Int 2000 pays off in server applications.
Latencies...
LMBench is a set of micro-benchmarks which can be helpful for determining memory latency and instruction latencies. We tested with LMBench 3.0a-5. It must be said that LMBench is usually right, but not always. If the benchmark is not aware of some of the particularities of a certain architecture, it can measure wrong values. So we have to double check if the values measured make sense.
LMBench
CPU | Clockspeed (MHz) | L1 (ns) | L1 (cycles) | L2 (ns) | L2 (cycles) | RAM (ns) | RAM (cycles)
Xeon 5160 3 GHz | 3000 | 1.01 | 3 | 4.7 | 14 | 117.3 | 345
Pentium-M 1.6 GHz | 1593 | 2 | 3 | 6 | 10 | 92.1 | 147
Sun T1 1 GHz | 980 | 3 | 3 | 22.1 | 22 | 107.5 | 105
Opteron 275 | 2209 | 1 | 3 | 5.5 | 12 | 73 | 161
Xeon Irwindale 3.6 GHz | 3594 | 1 | 4 | 8 | 28 | 48.8 | 175
The massive 4 MB L2 cache has an amazingly low latency of 14 cycles. This seems to be the worst case, as we have measured 12 cycles with other benchmarking tools such as ScienceMark. Nevertheless, even 14 cycles at 3 GHz is pretty amazing. The Core Duo, a.k.a. Yonah, accesses a shared cache that's half as large in 14 cycles at a substantially lower 2.33 GHz.
On the other hand, the memory latency is very high; luckily the 4 MB L2 cache will minimize that effect. The problem seems to be the FB-DIMMs. The Advanced Memory Buffer introduces extra latency, and of course the registered DDR-2 533 chips with a CAS latency of 4 have a higher latency by themselves. This results in a memory subsystem with a pretty high latency of about 115 ns, while the Opteron can access its RAM in only 73 ns.
ScienceMark didn't agree completely and reported about 65-70 ns latency on the Opteron system and 70-76 ns (230 cycles) on the Woodcrest system. We have reason to believe that Woodcrest's latency is closer to what LMBench reports: the excellent prefetchers are hiding the true latency numbers from ScienceMark. It must also be said that the measurements on the Opteron are only for local memory, not remote memory.
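The nanosecond and cycle columns of the LMBench table are tied together by the measured clockspeed: cycles = latency in ns x clockspeed in GHz. The short Python sketch below redoes that conversion for the RAM latency column so you can sanity check the table yourself. The function name `ns_to_cycles` is ours, and the small deviation for the Xeon 5160 suggests its cycle count was derived from the roughly 115 ns figure mentioned in the text.

```python
def ns_to_cycles(latency_ns, clock_mhz):
    """Convert a latency in nanoseconds to CPU cycles at the given clockspeed."""
    return latency_ns * clock_mhz / 1000.0


# name: (measured clockspeed in MHz, RAM latency in ns, RAM cycles in the table)
lmbench_ram = {
    "Xeon 5160": (3000, 117.3, 345),
    "Pentium-M": (1593, 92.1, 147),
    "Sun T1": (980, 107.5, 105),
    "Opteron 275": (2209, 73.0, 161),
    "Xeon Irwindale": (3594, 48.8, 175),
}

for name, (mhz, ns, table_cycles) in lmbench_ram.items():
    computed = ns_to_cycles(ns, mhz)
    print(f"{name:15s} {ns:6.1f} ns = {computed:6.1f} cycles (table: {table_cycles})")
```

Seen this way, the Woodcrest system pays more than twice as many cycles per memory access as the Opteron, which is why the large L2 cache matters so much.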
Secure Sockets Layer RSA Performance
Secure Web communication is possible through the Secure Sockets Layer (SSL) protocol. Using the command "openssl speed rsa" we can measure the number of RSA signing operations (signs) that a system can perform per second.
While "openssl speed rsa" is sufficient to test the Xeons and Opterons, the Sun T1 can speed up the Rivest Shamir Adleman (RSA) and Digital Signature Algorithm (DSA) encryption and decryption operations needed for SSL processing, thanks to a modular arithmetic unit (MAU) that supports modular exponentiation and multiplication. Each T1 core has a MAU, thus one 8 core T1 has 8 MAUs. To make use of those 8 MAUs, you have to run the SSL calculations through the Solaris Cryptographic Framework (SCF). To test the T1 with the MAU crunching at full speed we used the command "openssl speed -engine pkcs11 rsa". The Solaris 10 OS also provides in-kernel SSL termination, offering greater security than SSL termination outside the kernel.
We included the HP DL585 to see whether 8 cores of complex general purpose CPUs (Opteron 880) can keep up with the 8 MAUs of the Sun T1. If you want to compare Woodcrest and the Opteron, you should check the 2 and 4 concurrency numbers. You can find our 1024-bit numbers in the graph below. One thread per core is optimal, so we tested the DL585 with a maximum of 16 threads, to show you that the peak is attained at 8 threads. The Xeon Irwindale was tested with 8 threads to show you that 4 threads (4 logical cores) is optimal and so on.
Notice that the 8 MAUs of the Sun T1 can only get in full action if we fire off 32 "SSL RSA signing" threads. Once that happens, the little 1 GHz T1 is able to keep up with the massive 2.4 GHz 8 core DL585. Without MAU, the T1 is as fast as a 1.8 GHz Xeon Irwindale. It is thus very important to check that your favorite web server works with SCF if you want to run your secure web services on the Sun T2000.
It looks like we've discovered the first - but rather insignificant to most people - "weakness" of the new Core architecture: decryption and encryption. The Opteron at 2.4 GHz has no trouble keeping up with the 3 GHz Woodcrest. This might be a result of the fact that the Woodcrest can only perform one rotate per cycle, while the Opteron can do 3. Although the RSA algorithm doesn't really use rotations, the hash algorithms needed to sign or encrypt a key make use of rotations. However, the most important reason is probably that the Opteron can sustain 2 ADC (Add with Carry) instructions per clock cycle, while Woodcrest can only do one. As ADC is good for about 17% of the instruction mix of the RSA algorithm, this might be enough to negate the extra integer power (Memory disambiguation, 4 wide decode ...) that the Woodcrest has.
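Why does add-with-carry show up so often in RSA code? RSA works on integers of 1024 bits or more, which have to be split into machine words ("limbs"). Adding two such numbers means adding them limb by limb and handing the carry to the next limb, which is what a chain of ADC instructions does in hardware. The Python sketch below only illustrates that carry chain; the 64-bit limb size and the function names are our assumptions, and this is not how OpenSSL implements it.

```python
LIMB_BITS = 64  # assumed word size for this illustration
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value, count):
    """Split a non-negative integer into `count` limbs, least significant first."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def from_limbs(limbs):
    """Rebuild the integer from limbs (least significant first)."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def add_with_carry(a_limbs, b_limbs):
    """Add two equal-length limb arrays; return (result limbs, final carry)."""
    result = []
    carry = 0
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry            # what a single ADC computes
        result.append(total & LIMB_MASK)
        carry = total >> LIMB_BITS       # 0 or 1, consumed by the next limb
    return result, carry


# A 1024-bit operand needs 16 limbs of 64 bits, so 16 dependent additions.
a = (1 << 1024) - 1
b = 1
limbs, carry_out = add_with_carry(to_limbs(a, 16), to_limbs(b, 16))
print(from_limbs(limbs), carry_out)
```

Each step depends on the carry of the previous one, so a CPU that can sustain more ADC instructions per clock gets through these chains faster.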
Also notice that the previous NetBurst architecture, represented by the Xeon Irwindale, does very badly. The reason is that the P4 doesn't have a barrel shifter, a circuit in the chip which can shift or rotate any number in one clock cycle. Without this shifter, rotates and shifts take much longer, resulting in high latency. Most x86 code couldn't care less, but most encryption code makes heavy use of rotates or shifts or both. We also did a quick test with Hyper-Threading on and off. In this case Hyper-Threading sped up the encryption (signs/s) by 20 to 28%.
To end the RSA sign/s benchmark, we'll make a quick comparison between quad core AMD Opteron 2.4 GHz, quad-core Intel Xeon Woodcrest and Sun's T1 with MAU enabled across different RSA bit lengths.
RSA Encryption (Signs/s)
Key length | Opteron 2.4 GHz, 4 threads | Xeon 5160 3 GHz, 4 threads | Sun T1 with MAU, 32 threads
512 bit | 19003 | 21194 | 35613
1024 bit | 6098 | 6240 | 10722
2048 bit | 1145 | 1087 | 1918
4096 bit | 185 | 164 | 1
Notice that the hardware acceleration of the T1 does not work beyond 2048-bit keys. Considering that most secure applications use 1024-bit and only a few "high security" ones use 2048-bit, this is not an issue.
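To make the table easier to read, the sketch below normalizes every result to the Opteron at the same key length. The numbers are copied straight from the RSA table; the dictionary keys and the helper name `relative_to_opteron` are ours.

```python
# Signs/s from the RSA table: key length in bits -> result per system.
signs_per_s = {
    512: {"opteron": 19003, "xeon": 21194, "t1_mau": 35613},
    1024: {"opteron": 6098, "xeon": 6240, "t1_mau": 10722},
    2048: {"opteron": 1145, "xeon": 1087, "t1_mau": 1918},
    4096: {"opteron": 185, "xeon": 164, "t1_mau": 1},
}


def relative_to_opteron(row):
    """Express each system's signs/s as a multiple of the Opteron result."""
    return {name: value / row["opteron"] for name, value in row.items()}


ratios = {bits: relative_to_opteron(row) for bits, row in signs_per_s.items()}

for bits, row in ratios.items():
    print(f"{bits:4d} bit: Xeon {row['xeon']:.2f}x, T1 with MAU {row['t1_mau']:.2f}x")
```

The Xeon's small lead at 512 and 1024 bits turns into a deficit at 2048 and 4096 bits, while the T1's lead of roughly 1.7x to 1.9x collapses completely at 4096 bits, where the MAU no longer helps.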
In case of doing verifies as opposed to signs, the server has to authenticate the identity of the client. This is a lot less intensive, and we'll show you the verifies per second numbers at 2048-bits. At 1024-bits length, both the Woodcrest and Opteron were able to verify more than 50000 keys per core, and that is a hard limit of the OpenSSL benchmark.
Again, the Opteron takes the lead. The Sun T1, even with its 8 MAUs, is only half as fast as four Opteron or Woodcrest cores, but this is hardly an issue. Encrypting or signing will slow down a server much more quickly than verifying keys.
Both the verifies/s and signs/s benchmarks are rather synthetic. It is much more realistic to test with a real web server running SSL, and that is what we are currently doing. We followed Sun's instructions to enable RSA hardware acceleration for Apache, but for some reason, the Apache web server is still not making use of the Solaris Cryptographic Framework. So our web server SSL test is a work in progress.
Apache/PHP/MySQL Performance
In our first review of the T2000, we took a look at the T2000 as a heavy Apache, MySQL and PHP web server (or SAMP web server) using a pretty complex weather report system. The PHP test script retrieves hourly-stored weather information out of a MySQL database, that can be overviewed by month. An 'opening page' displays all months that are stored in the database, and if you open a 'detail page', the month you have selected is submitted by query string parameters. Additional details about this test application are available if you would like to know more.
The problem with our first test was that with the caching file we were taking MySQL and PHP out of the equation most of the time, and emphasizing TCP/IP handling and Apache too much. As we also want to get an idea of the PHP/MySQL speed of the different CPUs, we decided to test with an uncached version, simulating the worst case of the application.
However, running the uncached version only means that we regenerate the PHP page with each request. We did enable the query cache in MySQL. A good webmaster knows that too many accesses to the database can completely wreck web server performance, thus, it is important to "shield" the database backend from too many concurrent accesses. The mod_deflate module was enabled to make gzip compression happen.
For benchmarking, httperf was used in conjunction with autobench, a Perl script written by Julian T. J. Midgley, designed to run httperf against a server several times, with the number of requests per second increasing with each iteration. The output from the program enables us to see exactly how well the system being tested performs as the workload is gradually increased until it becomes saturated. In each case, the server was benchmarked with 5 requests per connection. The client was connected via a gigabit connection to the server.
To interpret the graphs below precisely, you must know that the X-axis gives you the number of demanded requests and the Y-axis gives you the actual reply rate of the server. The first points all show the same performance for each server, as each server is capable of responding fast enough. Only one CPU with 2 (Opteron, Xeon) or 8 cores (Sun UltraSparc T1) was present in each server.
Intel's new Xeon wipes the floor with the competition here. Up to 75% faster than the 2.4 GHz Opteron, the new Xeon won't have any trouble with a 3 GHz Opteron. We have to investigate this further, but it seems that this is the result of the massive 4 MB L2 cache and the intrinsically better integer performance of Woodcrest. Additional tuning might push the T1 higher, but we are pretty sure it is not going to be a screamer in this benchmark.
Java Webserving
As promised, we are also introducing a real world web server based on Java Server Pages (JSP). The next benchmark is based on the production Ace's Hardware message board, written by Brian Neal and Chris Rijk. This highly optimized real world JSP application uses a 2 GB object cache to minimize database access. As optimized as it may be, building up the message tree or index of the message boards and compressing it with gzip requires quite a bit of CPU power.
The benchmarked software includes:
- Caucho Technology's Resin 2.1.17
- Java Virtual Machine: Java HotSpot(TM) Server VM (build 1.5.0_04-b05)
- Sybase ASE 15.0 for Solaris / Linux
Although this should be Sun's favored benchmark, the new Xeon Woodcrest is a real party pooper for Sun. A single 80 Watt Woodcrest 3 GHz delivers almost the performance of one T1 at 1 GHz. Luckily for Sun, it is only fair to compare the top model of Intel to Sun's own top model at 1.2 GHz, and Sun should still have a decent advantage when it comes to performance/Watt: the T1 1.2 GHz is about 20% faster than the fastest Woodcrest. However, the days where one 72 W T1 could outperform four Xeon cores while consuming about 4 times less power are over.
The new Xeon 5160, a.k.a. Woodcrest, is making it very hard for Sun to compete on price/performance: four Woodcrest cores are about twice as fast as the 8 core T1. It is interesting to note that the simple T1 core is doing almost as much work per cycle as the massive Opteron. The T1 has twice as many cores, but they are running at half the clockspeed of the Opteron and offering - on average - only 13% lower performance. If we compare the fastest Opteron (2.6 GHz dual core) with the fastest T1 (1.2 GHz), this proportion shouldn't change much. So a simple 1-way core with 4 threads can do as much work as a pretty complex 3-way core with one thread. However, the Woodcrest CPU not only performs better per clock, it also reaches a 3 GHz clock. Intel beats Sun here in their home territory.
AMD is in quite a bit of trouble too. If we extrapolate our 2.4 and 2.2 GHz numbers, an Opteron at 3 GHz would still be about 25% slower than our Woodcrest at 3 GHz. Impressive!
MySQL Configuration
We spent weeks on tweaking our MySQL database to the maximum. The results were encouraging: performance was up to 3 times higher on our Opteron machines than out of the box. On the Sun machine, the results were even more impressive, especially when we started using MySQL 5.0. MySQL 5 runs horribly slow on the T1 out of the box, but we got up to 5 times more performance out of our T2000 server after getting some excellent tweaking tips from Peter Zaitsev (MySQL) and Luojia Chen (Sun).
All testing was done with InnoDB as our storage engine in MySQL 5.0.21. We optimized for a server with 4 GB of RAM. Here is our MySQL configuration:
[mysqld]
port = 3306
socket = /tmp/mysql.sock
skip-locking
key_buffer = 1G
max_allowed_packet = 1M
table_cache = 1024
sort_buffer_size = 2M
read_buffer_size = 2M
read_rnd_buffer_size = 8M
thread_cache = 125
max_user_connections = 450
max_connections = 450
thread_concurrency = 16
The "query cache" was off, as we wanted to test worst case performance. Our test database is still the same as in previous articles, about 1GB in size. The workload consists of more than 90% selects, thus this is mostly a "read intensive" workload.
MySQL Results
All numbers are expressed in queries per second.
Notice that the T1 needs about 20-30 MySQL threads to run at full speed; this is clearly a result of its 8 core "4 thread Gatling gun core" architecture. It must also be noted that the out-of-the-box MySQL performance is simply horrible, about 4-5 times lower than the well optimized numbers you see above. There is no escaping the fact that you must take the time to read Sun's tuning tips well.
Once you do, the 1 GHz T1 is capable of performing like an Opteron 2.2 GHz, which is pretty amazing. Kudos to Luojia Chen and Peter Zaitsev for a job well done. While the old Xeon which consumed 4 times more power than the T1 to give the same performance looked pretty silly, the new Xeon 5160 easily outperforms the T1. The performance/Watt title will probably go to the low power Woodcrest versions, which we haven't tested yet.
Let us see what a single dual-core Woodcrest can do versus a dual-core Opteron and quad-core Sun T1.
As we were testing with only two cores, we brought back the Dual Xeon Irwindale for the test. We did a few extra tests on this platform as we also had the older Nocona platform in the labs. The additional 1MB cache of Irwindale improved our benchmark numbers by 7-8%, which is quite impressive. Our time investment in tweaking our MySQL database also made the caches and memory system more important. Finally, Hyper-Threading still doesn't pay off in MySQL: we noticed a small slowdown of about 7%.
MySQL Results: Scaling
Back to our main subject, our astute readers have probably already noticed a weird anomaly. Let us analyze this further. If you look closely at both our measurements, Quad-core and Dual-core x86, you'll notice that the scaling is negative. To make it more clear, we made an average of all concurrency numbers from 5 and higher.
MySQL Linux (Queries/s)
Configuration | Sun T1 4/8 cores 1 GHz | MSI K2-102A2M Opteron 275 | Xeon 5160 Woodcrest 3 GHz | MSI K2-102A2M Opteron 280
Average Dual-core (T1: quad-core) | 362 | 749 | 996 | 805
Average Quad-core (T1: octal-core) | 433 | 590 | 904 | 622
Speedup Dual to Quad | 20% | -21% | -9% | -23%
This is nothing short of amazing. It seems like an anomaly, but this is not the case. These benchmarks have been checked, verified and checked again. They are accurate. The x86 cores running on Linux perform better with two cores than with four cores, but the T1 running Solaris actually improves performance going from 4 to 8 cores.
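For those who want to double check the "Speedup Dual to Quad" row, it is simply the relative change between the two averages. A small Python sketch with the numbers from the MySQL Linux table (the helper name `speedup_percent` is ours):

```python
# (average queries/s with half the cores, average queries/s with all cores)
mysql_linux = {
    "Sun T1 1 GHz (4 -> 8 cores)": (362, 433),
    "Opteron 275 (2 -> 4 cores)": (749, 590),
    "Xeon 5160 3 GHz (2 -> 4 cores)": (996, 904),
    "Opteron 280 (2 -> 4 cores)": (805, 622),
}


def speedup_percent(before, after):
    """Relative change in throughput when doubling the core count."""
    return (after / before - 1.0) * 100.0


speedups = {
    name: round(speedup_percent(before, after))
    for name, (before, after) in mysql_linux.items()
}

for name, pct in speedups.items():
    print(f"{name:32s} {pct:+d}%")
```

A negative value means that adding cores actually lowered the number of queries per second.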
So who is guilty? Linux or the Opteron system? We had to test with Solaris on the Opteron to be sure. However, the Serverworks chipset of our MSI 1U server was not supported by x86 Solaris. So we went back to our homebuilt server, based on the MSI K8N Master2-FAR.
MySQL Solaris (Queries/s)
Configuration | Sun T1 4/8 cores 1 GHz | Opteron 280 Solaris | Opteron 280 Linux
Average Dual-core (T1: quad-core) | 362 | 456 | 799
Average Quad-core (T1: octal-core) | 433 | 605 | 625
Speedup Dual to Quad | 20% | 33% | -22%
And this puts the performance of our UltraSparc T1 in a whole different perspective. First of all, it is clear that while MySQL might not be the most scalable database, the current Linux kernel is not helping matters. We did tweak the Linux kernel in two ways: the 2.6.15 kernel was optimized for either Intel's or AMD's architecture, and the AMD build also got NUMA support.
So what is going on here? After talking to our MySQL guru (P. Zaitsev), it turns out that in some circumstances, MySQL might cause trouble for the Linux mutex (mutual exclusion) implementation: "mutex ping-pong". The mutex implementation makes sure that two threads cannot access data in the main memory that is locked by another thread.
It seems however more a MySQL problem than a Linux one, as other databases like DB2 scale very well in Linux. For DB2 under the same load we noticed a performance increase of no less than 80-85% when going from two to four cores. Also, with some loads, the bad scaling kicks in later than our "Select dominated" load. Intel's performance labs told us that they also ran into the same problem.
These issues are not as severe as the problems we encountered with MySQL in Mac OSX. Note that Apple seems to have recognized the problem and seems to offer a workaround. We'll report back with other MySQL workloads to investigate the MySQL scaling problem further.
PostgreSQL Results
PostgreSQL 8.0.7, another open source database, uses processes and not threads to deal with connections. The consequence is that the benchmark numbers are a lot more stable: once each core is busy with its process, you almost get maximum performance. In other words, the results didn't change much between 5, 10 or 25 concurrent users. To keep things simple, we only list the numbers with 20 users, which results in peak performance. The queries per second numbers at 5 and 25 were only a few percent lower. We did not include the Sun T2000 server as the optimal PostgreSQL configuration is still under investigation.
PostgreSQL 8.0.7 (Queries/s)
Configuration | Queries/s
DL385 1 x Opteron 280 | 517
Intel 2 x Xeon "Irwindale" 3.6 GHz | 448
MSI 1U 1 x Opteron 275 | 490
MSI 1U 1 x Opteron 280 | 524
Intel 1 x Xeon 5160 WC 3 GHz | 673
Another clear victory for Woodcrest. On the Opteron, every 10% in clockspeed increase seems to result in a 7% performance increase. So if we extrapolate, an Opteron 3 GHz would arrive at 616 queries per second.
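The extrapolation above is easy to redo. The sketch below applies the "7% more performance per 10% more clockspeed" rule of thumb linearly to the Opteron 280 result (2.4 GHz, 524 queries/s). The function name and parameters are ours, and the linear rule is of course an assumption: real scaling depends on the memory subsystem and the workload.

```python
def extrapolate(perf, clock_ghz, target_ghz, perf_gain_per_10pct_clock=7.0):
    """Linearly extrapolate performance to a higher clockspeed.

    ASSUMPTION: every 10% of extra clockspeed yields a fixed percentage
    of extra performance (7% by default, as observed between the
    Opteron 275 and 280 in the PostgreSQL test).
    """
    clock_gain_pct = (target_ghz / clock_ghz - 1.0) * 100.0
    perf_gain_pct = clock_gain_pct / 10.0 * perf_gain_per_10pct_clock
    return perf * (1.0 + perf_gain_pct / 100.0)


opteron_3ghz = extrapolate(perf=524, clock_ghz=2.4, target_ghz=3.0)
print(f"Extrapolated Opteron 3 GHz: {opteron_3ghz:.0f} queries/s")
```

Going from 2.4 to 3 GHz is a 25% clockspeed increase, so the rule gives 17.5% more performance, or about 616 queries per second.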
Database Performance Analysis
To make sense out of all these numbers, we summarized our findings below.
Database Performance (Linux)
Benchmark | MSI K2-102A2M Opteron 275 | MSI K2-102A2M Opteron 280 | Opteron 280 vs. Opteron 275 | Extrapolated Opteron 3 GHz | Xeon 5160 3 GHz
MySQL - Dual-core | 749 | 805 | 7% | 946 | 996
MySQL - Quad-core | 590 | 622 | 5% | 703 | 904
PostgreSQL | 490 | 524 | 7% | 616 | 673
As the Xeon 5160 is not yet released, and it is unclear what AMD will do in response, we were curious how a 3 GHz Opteron would compare to our 3 GHz Woodcrest. Both architectures have similar pipeline lengths and will probably attain more or less the same clockspeeds under the same process technology, though of course Intel is ahead when it comes to process technology. It is interesting to see how the Opteron compares clock for clock with the new Xeon.
Database Scaling (Extrapolated)
Benchmark | Xeon 5160 vs. Opteron 280 | Xeon 5160 vs. Extrapolated Opteron 3 GHz
MySQL - Dual-core | 24% | 5%
MySQL - Quad-core | 45% | 29%
PostgreSQL | 28% | 9%
The Xeon's advantage in open source databases is significant but not as spectacular as the SPEC Int 2000 numbers. The fact that Woodcrest scales better, or should we say "less badly", is most likely a result of the massive 4 MB L2 cache. As said before, increasing the cache of the previous Xeon generation from 1 to 2 MB results in about 7-8% higher performance. While we cannot be sure that those numbers are also applicable to Opteron or Woodcrest, it is pretty clear that the 4 MB cache does give the newest Xeon a performance boost.
Despite the fact that Woodcrest is a behemoth when it comes to integer performance, it does not outperform the Opteron by a large margin in MySQL on a clock-for-clock basis. The problem seems to be the FB-DIMM latency. A quick test with higher latency RAM on the Opteron showed that increasing the latency of the RAM subsystem by 20% resulted in a 20 to 25% decrease in MySQL performance. Although this doesn't allow us to get a precise idea of how memory latency influences Woodcrest's MySQL performance, it shows us clearly that memory latency has a big impact on MySQL's performance in our tests.
Web Server Performance Analysis
Below is our summary of web server performance. While we averaged the database numbers, we took the peak numbers of our web server tests. The reason is that at lower request rates, all systems perform the same. "Jsp" gives you the Java Server Page performance, AMP stands for Apache/MySQL/PHP.
Webserver Performance
Benchmark | MSI K2-102A2M Opteron 275 | MSI K2-102A2M Opteron 280 | Opteron 280 vs. Opteron 275 | Extrapolated Opteron 3 GHz | Xeon 5160 3 GHz
Jsp - Peak | 144 | 154 | 7% | 182 | 230
AMP - Peak | 984 | 1042 | 6% | 1178 | 1828
Extrapolating the performance of our 2.4 GHz Opteron 280 to 3 GHz again makes for an interesting comparison.
Webserver Scaling (Extrapolated)
Benchmark | Xeon 5160 vs. Opteron 280 | Xeon 5160 vs. Extrapolated Opteron 3 GHz
Jsp - Peak | 49% | 26%
AMP - Peak | 75% | 55%
When it comes to web server performance, the newest Xeon is unbeatable and crushes the competition. A 3 GHz Opteron is not going to help.
Power
As our Woodcrest test system did not have DBS enabled, we decided to test only under full load. Again, take the results with a grain of salt, as it is impossible to make everything equal. We tested all machines with only one power supply powered on, and we also tried to have a similar number and type of fans (excluding the CPU fan, where the T1 doesn't have one). There are still differences between the motherboards, and the Sun uses 2.5 inch disks.
Max Power Usage (100% CPU load, Watts)
System | Configuration | Power |
Sun T2000 | 1 CPU / 8 Cores - 8 GB RAM | 188 |
Dual Opteron 275 HE | 2 CPUs (275 HE) - 4 GB RAM | 192 |
Dual Opteron 275 | 2 CPUs - 4 GB RAM | 239 |
Dual Xeon 5160 3 GHz | 2 CPUs - 4 GB RAM | 245 |
Dual Xeon "Irwindale" 3.6 GHz | 2 CPUs - 8 GB RAM | 374 |
Simply looking at the power numbers, the T2000 server beats the rest. We were informed that current T2000 servers now ship with high efficiency 450 Watt power supplies (our T2000 uses a 550 Watt one), which will further reduce power consumption by 10 Watts or more. From a performance/Watt point of view, the new Woodcrest CPU is the winner in most workloads.
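As a rough illustration of the performance/Watt argument, one can divide a peak web server score by the full-load power draw. The sketch below is only a back-of-the-envelope estimate: it pairs numbers from two separate tables and assumes that the dual Opteron 275 system in the power table matches the one used for the web server tests, and that the peak web load is comparable to the 100% CPU load used for the power measurement. Neither assumption is confirmed by our measurements, and the helper name `perf_per_watt` is hypothetical.

```python
# Peak JSP scores (web server table) and full-load power in Watts (power table).
# ASSUMPTION: both tables describe the same system configurations.
systems = {
    "Dual Opteron 275": {"jsp_peak": 144, "watts": 239},
    "Dual Xeon 5160 3 GHz": {"jsp_peak": 230, "watts": 245},
}


def perf_per_watt(score, watts):
    """Return benchmark score per Watt of full-load power draw."""
    return score / watts


for name, data in systems.items():
    ratio = perf_per_watt(data["jsp_peak"], data["watts"])
    print(f"{name}: {ratio:.2f} JSP score per Watt")
```

Under these assumptions the Xeon 5160 system delivers a clearly higher JSP score per Watt than the Opteron 275 system, since it draws only 6 Watts more while posting a much higher peak score.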
Conclusion
Two months of testing and tweaking allowed us to gather a lot of information. Our Sybase and DB2 tests still need a bit of tweaking before we can publish results on them, but with tests on SSL, JSP, LAMP, MySQL and PostgreSQL, what can we conclude so far?
Sun's T2000 server and its 32-thread T1 CPU turned in highly variable results. It is not the best choice for open source databases. PostgreSQL and MySQL scale better on Solaris than they do on Linux, but both RDBMSs have trouble scaling over multiple cores. It is likely that the DB2 and Sybase results will be much better on the T2000. The SAMP web performance of the T2000 was good when we cached the PHP pages and we had few accesses to the MySQL database. When PHP pages had to be regenerated with every access and the query cache of MySQL was used, performance was pretty bad compared to the x86 competition. The best purpose for the T2000 is a JSP server with SSL authentication.
The Intel Xeon 5160, a.k.a. Woodcrest, will simply be the most powerful server CPU this year (though it's not yet available for purchase, of course). As our extrapolated calculations show, even a 2.6 GHz Woodcrest will outperform the current Opteron 285 by a 5 to 55% margin, which is nothing short of impressive. The new Xeon is, however, not invincible: the Opteron can still put up serious resistance when running instruction mixes with lots of rotates, add-carry or load effective address instructions. RSA, AES and other benchmarks clearly show this. Intel will still have to convince some software vendors to port to SSE if it wants Woodcrest to be the completely superior CPU. The advantage in MySQL is also rather small, a result of the relatively high latency of the FB-DIMMs. But we are nitpicking: Intel's newest Xeon has taken back the performance/Watt crown. In short: Woodcrest rocks!
And what about AMD? The Opteron remains a powerful architecture with a flexible platform. It is quickly becoming the most popular platform for 4 sockets, and the upcoming Tulsa CPU is most likely not going to change that. However, part of AMD's success has been Intel's Prescott/Nocona failure. In the K6 and Athlon (K7) years, AMD managed to improve the architecture every two years. In 1999 we had the original Athlon, in 2000 we got Thunderbird (integrated L2 cache) and in 2002 we got the Athlon XP. Over the past few years, the Opteron architecture has made the move to dual-core and received a better memory controller, but the necessary IPC improvements and cache enlargements have not materialized. "Only the Paranoid survive", remember?
The Intel P-M architecture went from 1.7 GHz Single Core (Banias) in 2003 to 3 GHz (Conroe, Woodcrest) in 2006, while it quadrupled the L2 cache and significantly improved the IPC. At the same time, AMD's K8 series went from 1.8 GHz to 2.8 GHz dual-core, with the same amount of cache, and almost equal IPC. The result is that AMD will not be able to regain the performance crown in the dual and quad-core market until the K8L arrives. The future looks bright in the quad socket market however.
In summary:
Intel Xeon 5160 (Woodcrest)
Advantages:
- Best server performance across all applications
- Best Performance/Watt in the high end
- Absolutely stunning web server performance
- FB-DIMM enables high RAM capacity and bandwidth (quad channel)
Disadvantages:
- Needs SSE optimized code for some special case code (RSA, AES)
- FB-DIMM adds extra latency, cost (small) and power
Sun UltraSparc T1 (T2000)
Advantages:
- Superb SSL performance
- Excellent Performance/Watt with SSL and Java code
- Solaris, a robust and well scaling OS
- Quad channel enables high RAM capacity
Disadvantages:
- Heavy optimizing is necessary; out of box software performance is low
- Low single threaded performance; also results in low performance in server software that scales badly
- Price/Performance compared to Woodcrest
AMD Opteron
Advantages:
- Well rounded CPU: performs well even with non optimized code; still excellent MySQL server results
- Excellent Quad socket platform
- Does not need FB-DIMM for high capacity thanks to NUMA
Disadvantages:
- Web server performance compared to Woodcrest
- Power at higher clockspeeds (110 W vs. 80 W)