Intel Woodcrest, AMD's Opteron and Sun's UltraSparc T1: Server CPU Shoot-out
by Johan De Gelas on June 7, 2006 12:00 PM EST - Posted in IT Computing
Secure Socket Layers RSA Performance
Secure Web communication is possible through the Secure Sockets Layer (SSL) protocol. Using the command "openssl speed rsa" we can measure the number of RSA public-key operations (signs) that a system can perform per second.
While "openssl speed rsa" is sufficient to test the Xeons and Opterons, the Sun T1 can speed up the Rivest Shamir Adleman (RSA) and Digital Signature Algorithm (DSA) encryption and decryption operations needed for SSL processing, thanks to a modular arithmetic unit (MAU) that supports modular exponentiation and multiplication. Each T1 core has a MAU, so one 8-core T1 has 8 MAUs. To make use of those 8 MAUs, you have to run the SSL calculations through the Solaris Cryptographic Framework (SCF). To test the T1 with the MAU crunching at full speed we used the command "openssl speed -engine pkcs11 rsa". The Solaris 10 OS also provides in-kernel SSL termination, offering greater security than SSL termination outside the kernel.
We included the HP DL585 to see whether 8 cores of complex general purpose CPUs (Opteron 880) can keep up with the 8 MAUs of the Sun T1. If you want to compare Woodcrest and the Opteron, you should check the 2 and 4 concurrency numbers. You can find our 1024-bit numbers in the graph below. One thread per core is optimal, so we tested the DL585 with a maximum of 16 threads, to show you that the peak is attained at 8 threads. The Xeon Irwindale was tested with 8 threads to show you that 4 threads (4 logical cores) is optimal and so on.
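The modular exponentiation that the MAU accelerates can be illustrated with a minimal Python sketch. It uses the textbook right-to-left square-and-multiply method and a toy key (n = 3233, e = 17, d = 2753) chosen only for illustration; this is not how OpenSSL or the T1 hardware implements the operation, and real keys are 1024 bits or longer.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Compute (base ** exponent) % modulus by right-to-left square-and-multiply."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1 % modulus
    base %= modulus
    while exponent > 0:
        if exponent & 1:
            # Current exponent bit is set: multiply the running result in.
            result = (result * base) % modulus
        # Square the base for the next (more significant) exponent bit.
        base = (base * base) % modulus
        exponent >>= 1
    return result


# Toy RSA key, for illustration only (n = 61 * 53).
n, e, d = 3233, 17, 2753
message = 65

signature = mod_exp(message, d, n)    # "sign": exponentiation with the private exponent
recovered = mod_exp(signature, e, n)  # "verify": exponentiation with the public exponent
```

A sign uses the large private exponent and a verify uses the small public one, which is why verifies are so much cheaper than signs in this kind of benchmark.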
Notice that the 8 MAUs of the Sun T1 only come into full action if we fire off 32 "SSL RSA signing" threads. Once that happens, the little 1 GHz T1 is able to keep up with the massive 2.4 GHz 8 core DL585. Without the MAU, the T1 is as fast as a 1.8 GHz Xeon Irwindale. It is thus very important to check that your favorite web server works with SCF if you want to run your secure web services on the Sun T2000.
It looks like we've discovered the first - but rather insignificant to most people - "weakness" of the new Core architecture: decryption and encryption. The Opteron at 2.4 GHz has no trouble keeping up with the 3 GHz Woodcrest. This might be a result of the fact that the Woodcrest can only perform one rotate per cycle, while the Opteron can do 3. Although the RSA algorithm doesn't really use rotations, the hash algorithms needed to sign or encrypt a key make use of rotations. However, the most important reason is probably that the Opteron can sustain 2 ADC (Add with Carry) instructions per clock cycle, while Woodcrest can only do one. As ADC accounts for about 17% of the instruction mix of the RSA algorithm, this might be enough to negate the extra integer power (memory disambiguation, 4-wide decode ...) that the Woodcrest has.
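To see why ADC shows up so often in RSA code, consider how a 1024-bit number is added on a CPU whose registers are far narrower: the number is split into machine-word "limbs", and each limb addition must absorb the carry from the previous one. The Python sketch below models that carry chain; the 32-bit limb width and the helper names are assumptions made for illustration, not details taken from OpenSSL.

```python
LIMB_BITS = 32                      # assumed machine word size for this sketch
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value: int, count: int) -> list[int]:
    """Split a non-negative integer into `count` limbs, least significant first."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def from_limbs(limbs: list[int]) -> int:
    """Reassemble an integer from least-significant-first limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def add_with_carry(a_limbs: list[int], b_limbs: list[int]) -> tuple[list[int], int]:
    """Add two equal-length limb arrays; each loop step models one ADC instruction."""
    if len(a_limbs) != len(b_limbs):
        raise ValueError("operands must have the same number of limbs")
    result = []
    carry = 0
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry            # ADC: add both limbs plus the incoming carry
        result.append(total & LIMB_MASK)
        carry = total >> LIMB_BITS       # carry out feeds the next limb
    return result, carry


# A 1024-bit operand needs 32 limbs of 32 bits, so one addition is a chain of 32 ADCs.
x = (1 << 1024) - 1
y = 1
sum_limbs, carry_out = add_with_carry(to_limbs(x, 32), to_limbs(y, 32))
```

Because every step depends on the carry from the previous one, the number of ADC instructions a core can sustain per clock directly limits how fast such chains run.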
Also notice that the previous NetBurst architecture, represented by the Xeon Irwindale, does very badly. The reason is that the P4 doesn't have a barrel shifter, a circuit in the chip which can shift or rotate any number in one clock cycle. Without this shifter, rotates and shifts take much longer, resulting in high latency. Most x86 code couldn't care less, but most encrypting code makes heavy use of rotates or shifts or both. We also did a quick test with Hyper-Threading on and off. In this case Hyper-Threading sped up the encryption (signs/s) by 20 to 28%.
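As a small illustration of the rotate operation discussed above, here is a sketch of a 32-bit rotate-left in Python, built from two shifts and an OR. A CPU with a barrel shifter does this in a single fast instruction; the function name is hypothetical. Hash functions such as SHA-1 apply rotations like this many times per block.

```python
def rotl32(value: int, amount: int) -> int:
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    # Bits shifted out on the left re-enter on the right.
    return ((value << amount) | (value >> (32 - amount))) & 0xFFFFFFFF


top_bit_wraps = rotl32(0x80000001, 1)    # the top bit wraps around to bit 0
byte_rotation = rotl32(0x12345678, 8)    # the top byte moves to the bottom
```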
To end the RSA signs/s benchmark, we'll make a quick comparison between the quad-core AMD Opteron 2.4 GHz, the quad-core Intel Xeon Woodcrest and Sun's T1 with MAU enabled across different RSA bit lengths.
RSA Encryption (Signs/s)
| Key length | Opteron 2.4 GHz, 4 threads | Xeon 5160 3 GHz, 4 threads | Sun T1 with MAU, 32 threads |
|---|---|---|---|
| 512 bit | 19003 | 21194 | 35613 |
| 1024 bit | 6098 | 6240 | 10722 |
| 2048 bit | 1145 | 1087 | 1918 |
| 4096 bit | 185 | 164 | 1 |
Notice that the hardware acceleration of the T1 does not work beyond 2048-bit keys. Considering that most secure applications use 1024-bit and only a few "high security" ones use 2048-bit, this is not an issue.
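The relative standings in the table are easier to read as ratios. The short Python sketch below takes the signs/s figures from the table above and normalizes them against the Opteron; the dictionary layout and function name are illustrative choices, not part of the original test setup.

```python
# Signs/s figures copied from the table above, keyed by RSA key length in bits.
SIGNS_PER_SECOND = {
    512:  {"Opteron 2.4 GHz": 19003, "Xeon 5160 3 GHz": 21194, "Sun T1 with MAU": 35613},
    1024: {"Opteron 2.4 GHz": 6098,  "Xeon 5160 3 GHz": 6240,  "Sun T1 with MAU": 10722},
    2048: {"Opteron 2.4 GHz": 1145,  "Xeon 5160 3 GHz": 1087,  "Sun T1 with MAU": 1918},
    4096: {"Opteron 2.4 GHz": 185,   "Xeon 5160 3 GHz": 164,   "Sun T1 with MAU": 1},
}


def relative_throughput(key_bits: int, system: str,
                        baseline: str = "Opteron 2.4 GHz") -> float:
    """Return a system's signs/s as a multiple of the baseline system's signs/s."""
    row = SIGNS_PER_SECOND[key_bits]
    return row[system] / row[baseline]


for bits in sorted(SIGNS_PER_SECOND):
    xeon = relative_throughput(bits, "Xeon 5160 3 GHz")
    t1 = relative_throughput(bits, "Sun T1 with MAU")
    print(f"{bits:>4} bit: Xeon = {xeon:.2f}x Opteron, T1 with MAU = {t1:.2f}x Opteron")
```

Read this way, the Xeon leads slightly at 512 and 1024 bits and trails at 2048 and 4096 bits, while the T1's advantage disappears entirely at 4096 bits, where the MAU no longer helps.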
When doing verifies as opposed to signs, the server has to authenticate the identity of the client. This is a lot less intensive, and we'll show you the verifies per second numbers at 2048 bits. At 1024-bit length, both the Woodcrest and Opteron were able to verify more than 50000 keys per core, and that is a hard limit of the OpenSSL benchmark.
Again, the Opteron takes the lead. The Sun T1, even with the 8 MAUs, is only half as fast as four Opterons or Woodcrests, but this is hardly an issue. Encrypting or signing will slow down a server much more quickly than verifying keys.
Both the verifies/s and signs/s benchmarks are rather synthetic. It is much more realistic to test with a real web server running SSL, and that is what we are currently doing. We followed Sun's instructions to enable RSA hardware acceleration for Apache, but for some reason, the Apache web server is still not making use of the Solaris Cryptographic Framework. So our Web server SSL test is a work in progress.
91 Comments
duploxxx - Monday, June 19, 2006 - link
2 weeks have passed, still no word from anand about the microsoft benches? (i received a comment that it should only take 1 week to finish)......reason????? don't make us guess why you don't post these benches
severian64 - Tuesday, June 13, 2006 - link
I've been reading Anandtech articles for a long time and I have to say that this article is so biased that I think it should be retracted. I can't wait for the next pro intel article.
MySQL Helps Set World Record in Java Application Server Benchmarks; High-Speed Open Source Software Blaze Past Proprietary Solutions
CUPERTINO, Calif.--(BUSINESS WIRE)--June 12, 2006--A popular application server benchmark, featuring a complete open source software stack with MySQL 5.0 database, the Solaris(TM) 10 Operating System, and Sun Java(TM) Systems Application Server 9.0 Platform Edition (Project GlassFish(SM)) has shattered the competition by offering up to 8.6 times lower cost of acquisition than the comparable solution (1,4), according to the benchmark test results published at http://www.spec.org/jAppServer2004/results/jAppSer...
Maintained by the Standard Performance Evaluation Corp. (SPEC(R)), the SPECjAppServer(R)2004 test is a recognized industry standard benchmark used to measure performance of Java EE application server platforms and each of the components that make up the application environment -- including hardware, database software, JDBC drivers, JVM software and the system network. It is designed to model a real-world automotive dealership application, including manufacturing, supply-chain management and an order/inventory system.
"Open Source software can provide dramatic benefits for enterprise IT applications - especially in terms of real performance and TCO," said Ethan O'Rafferty, director of Strategic Alliances for MySQL AB. "We are proud that Solaris 10 is an ideal deployment platform for MySQL 5.0."
The MySQL and Sun combination attained a result of 712.87 SPECjAppServer2004 JOPS@Standard running a 64-bit version of MySQL 5.0 and SJSAS 9.0 on Sun Microsystems' Sun Fire(TM) X4100 servers powered by Dual-Core AMD Opteron(TM) processors(1). The result demonstrates superb scalability of the whole solution, as compared to the previous result of 266 SPECjAppServer2004 JOPS@Standard that was achieved with Single-Core AMD Opteron processors (2). This solution also demonstrated the best database performance, measured in SPECjAppServer2004 JOPS@Standard per database core (SPECjAppServer2004 JOPS@Standard /DB core), of any competitive submission using less than 20 total cores in database and application tiers. MySQL's SPECjAppServer2004 JOPS@Standard /DB core metric surpassed an Oracle-powered result by over 30 percent (3).
ChuaChua - Saturday, June 10, 2006 - link
I'm confused about the charts. What are the numbers on the X and Y axes?
JohanAnandtech - Sunday, June 11, 2006 - link
"To interpret the graphs below precisely, you must know that the X-axis gives you the number of demanded requests and the Y-axis gives you the actual reply rate of the server. The first points all show the same performance for each server, as each server is capable of responding fast enough."
http://www.anandtech.com/IT/showdoc.aspx?i=2772&am...
JohanAnandtech - Friday, June 9, 2006 - link
1. "you use workstaion/budget motherboard against the intel server board. use a sun galaxy or hp proliant."
No, we do not. We used the MSI K2-102A2M for ALL Opteron testing except the one where we tested MySQL with Solaris, as the ServerWorks chipset was not supported by Solaris x86.
Note that this server performed better than the HP DL385, which uses slower memory timings. Using the HP ProLiant would have resulted in slightly LOWER Opteron numbers, not faster!
I don't get why a few people make a big fuss about the MSI K8N Master2-FAR, as we only used it once, out of necessity, because it worked under Solaris. Know that Solaris x86 supports only a limited amount of x86 hardware.
2. About our testing methods: yes, we use our own benchmarks. We'll add some industry standard benchmarks to the mix later. However, industry benchmarks are what manufacturers optimize for, while our benches come straight out of the real world, and are what real people are using. The same tests showed the Opteron beating the old Xeon by a pretty big margin, check our previous MySQL results. I don't see why now all of a sudden our tests should be changed.
If you feel there are other issues, feel free. I will definitely try to answer any concerns you have.
duploxxx - Friday, June 9, 2006 - link
hmm thx for the reply. thats clear now. seeing all the reactions here on your review.
it seems that the way its build up is far from structured and people do have problems reading it.
you didn't answer my question why some benches are single sock some dual sock... but i quess you are rather busy.
the way you talk as this core is the best thing (not released yet) against a new platform from competition that will be launched at same time... still does make me wonder but anyhow.
your own benchmarks and rather strange OS for benches (with personal tweaking) is still not relevant. Giving results on a far more used platform would be much nicer to compare... but i already have a good idea of the few benches you will be showing in this review on a wintel OS (let's hope i am wrong). youre benches on linux might be straight out of real world, but impossible to verify.
The only comparisson you make are 2 Spec benches and they were probably done also on that nice linux platform.... looking at the figures. but how comes that for one or another reason the opteron is clocked now at 2800 while you have 2400 and 2600 systems?
JohanAnandtech - Friday, June 9, 2006 - link
Ok, addressing the other issues:
1. Why no dual socket, quadcore in some benches:
the reason is that with the LAMP tests we ended up with a limit in our httperf benchmark: it couldn't measure anything above 3000 req/s for some reason. So there is another bottleneck kicking in. So we avoided the bottleneck for now by not testing with quad core. This was happening on both the Opteron and the Woodcrest system.
2. The Gentoo numbers of the previous review that gained 9-10% were a comparison between one dual-core and two single-core CPUs. Note that the same review shows only a 38% performance increase from one to two CPUs.
3. A few people try to discredit this review in every way possible, I am well aware of that. However, even though these benchmarks can't be repeated by other people for obvious reasons (the databases are not available to the public), the benches are in line with what other people have found.
http://www.mysqlperformanceblog.com/2006/06/08/int...
P. Zaitsev is one of the most respected people when it comes to MySQL performance and is head of performance tuning of MySQL.
ashyanbhog - Monday, June 12, 2006 - link
<Quote>The Gentoo numbers of the previous review that gained 9-10% was a comparison between 1 Dualcore and two dual single Core CPUS. Note that the same review shows only 38% performance increase from one to two CPUs.</Quote>
Correct me if I am wrong. Follow the link below to Anandtech's own earlier benchmarks. Goto the last table on the page and check results of "Dual Dual Core 875" and "Dual Opteron 248" from 5 Concurrencies onwards. The increase is slight, but there is definetly no performance degradation. The earlier review too uses Opterons+Linux+MySQL+InnoDB, the same as this setup used. Why do you get totally different results sets this time around?
http://www.anandtech.com/IT/showdoc.aspx?i=2447&am...
In the next page of the same review, DB2 shows fanastic gains from 5 Concurrencies onwards when going from Dual Cores to Four Cores. Check the first table under "Benchmarks IBM DB2: Single core versus Dual core". Note results of "Dual Opteron 2.2 hz" and "Dual Dual Core AMD 2.2 Ghz". This too is on linux. DB2 is definetly better suited than MySQL to reflect gains when moving from two core to four core setup.
Was publishing the obviously wrong MySQL results that you got this time necessary?
Despite choosing Gentoo, for the optmization capabaility, you have not chosen to publish the optimizations options used. Gentoo gurus would have verified that the optimizations for each of the processors where fairly chosen. This is not a allegation that you have taken sides, but why hold back some specific info when they are not secretive or proprietery in nature? And yes, I know you have used Gentoo in many previous benchmarks without specifiying the optimizations, but as this review seems to have become conterversial, it will help clear the air a bit. Specifying the options used when you have deviated from the default settings will surely increase credibility of review.articles.
Using a standard, preconfigured and widely available package like Debian, RHEL or SLES in their default settings was another option to ensure a neutral platform.
Cant comment about other parts of the review as I was only interested in the database performance
<Quote>A few people try to discredit this review in every way possible, I well aware of that.</Quote>
Why cant readers of Anandtech question the process used in a review? Afterall its their page views / site visit that brings the ad revenue. Readers tend to have limited time to go thru and comment if they want to. Exaggration may happen when we dont have time to express our doubts in detail. Your comment above was cheap shot.
Still, nice to see that Intel finally has something that can be compared to a Opteron. Good to have a choice, but the 3 year wait was too long.
And thanks for replying about the hardware used.
zsdersw - Monday, June 12, 2006 - link
Apparently you have no idea how long it takes to bring a totally new chip to market. This generally takes approximately 5 years.
BasMSI - Saturday, June 10, 2006 - link
Hahahaha, what balony.....
Aceshardware didn't have problem reaching above 8000 requests on the Dual 844 back in 2003.
The article can be found here: http://www.aceshardware.com/read.jsp?id=60000279
And you where unable?
Get real.
If you don't know how to setup a server, then stay away from trying to do such.