Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences
by Johan De Gelas on March 24, 2006 12:05 AM EST, posted in IT Computing
The T2000 as a heavy SAMP web server: the results
To interpret the graphs below, note that the x-axis shows the demanded request rate and the y-axis the actual reply rate of the server. The first data points therefore show the same performance for each server, as every server can still respond fast enough.
We tested the Opteron machines with both Linux and Solaris to get an idea of the impact of the OS.
The Sun T2000 isn't capable of beating our quad-core Opteron, but there are a few remarks that I should make.
First of all, we tested the 1 GHz T1, which is, of course, clocked 20% lower than the fastest T1 at 1.2 GHz. The T2000 peaked at 950 req/s, the quad-core Opteron at 1368 (Linux) and 1244 (Solaris) req/s. However, the T2000 was capable of delivering 935 req/s without any error (request timeout), while the quad Opteron delivered 1100 (Solaris) and 1250 (Linux) req/s without any errors. So, given the criterion that no request may time out, the difference gets a little smaller.
In defense of the Opteron and Xeon: the average response time for one particular request was (most of the time) between 5 and 15 ms. Once the server came close to its saturation point, we noted a maximum of 70 ms. With the T2000, the response time was at least 20 ms, typically 40 ms, with peaks of up to 220 ms when we came close to the peak throughput.
Of course, this is the disadvantage of the lower single-threaded performance of this server CPU: individual response times are higher. For OLTP and web serving, this is hardly a concern; for a decision support system, it might be.
There is a better way to do this test, of course: enable the mod_deflate module and get some decent gzip compression. Once we enabled compression, our network I/O, which had peaked at up to 12 MB/s, came down to a peak of 1.8 MB/s. Let us look at the next benchmark, measured with gzip compression on.
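That bandwidth reduction is easy to reproduce. A minimal sketch with Python's zlib, which implements the same DEFLATE algorithm that mod_deflate uses; the page content here is a made-up stand-in for our dynamically generated pages, not the actual test page:

```python
import zlib

# Repetitive HTML rows, a stand-in for the dynamically generated pages we served
page = ("<tr><td>forum post</td><td>2006-03-24</td></tr>\n" * 2000).encode()

compressed = zlib.compress(page, level=6)  # a mid-level compression setting
ratio = len(compressed) / len(page)

print(f"original: {len(page) / 1024:.0f} KiB, compressed: {len(compressed) / 1024:.1f} KiB")
print(f"compressed size: {ratio:.1%} of original")
# Highly repetitive markup compresses even further below the ~15%
# (12 MB/s -> 1.8 MB/s) we observed on real pages.
```

The ratio depends heavily on how repetitive the markup is; real pages with varying content compress less than this synthetic example.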
The Sun T1 starts to show what it can do: performance has hardly decreased. Gzip compression is almost free on the T1, lowering performance by only 2%. The Opteron sees its performance lowered by 21% (977 vs. 1244 req/s), and the Xeon by 19% (730 vs. 899 req/s).
On Solaris, the T1 performs like a quad Opteron. Linux, which probably has slightly better drivers for our Opteron server, gives the quad Opteron the edge.
Let's analyse this a little further.
| PHP/MySQL, no gzip | req/s | | req/s |
|---|---|---|---|
| Single Opteron 275 | 665 | 4-core T1 | 535 |
| Dual Opteron 275 | 1244 | 8-core T1 | 949 |
| Scaling 2 to 4 Opteron cores | 87% | Scaling 4 to 8 T1 cores | 77% |

| PHP/MySQL, gzip | req/s | | req/s |
|---|---|---|---|
| Single Opteron 275 | 538 | 4-core T1 | 477 |
| Dual Opteron 275 | 977 | 8-core T1 | 933 |
| Scaling 2 to 4 Opteron cores | 82% | Scaling 4 to 8 T1 cores | 96% |

| Gzip performance vs. no gzip | | | |
|---|---|---|---|
| Opteron 275 | 79% | Sparc T1 | 98% |
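The scaling and gzip-penalty percentages follow directly from the throughput numbers in the table; a quick check of the arithmetic:

```python
def scaling(small: float, big: float) -> float:
    """Extra throughput gained by doubling cores, where 1.0 would be perfect 2x scaling."""
    return big / small - 1

# Throughput numbers (req/s) from the table above
print(f"Opteron 2 -> 4 cores, no gzip: {scaling(665, 1244):.0%}")  # 87%
print(f"T1 4 -> 8 cores, no gzip:      {scaling(535, 949):.0%}")   # 77%
print(f"Opteron 2 -> 4 cores, gzip:    {scaling(538, 977):.0%}")   # 82%
print(f"T1 4 -> 8 cores, gzip:         {scaling(477, 933):.0%}")   # 96%

# Gzip penalty: throughput with compression as a fraction of throughput without
print(f"Opteron gzip vs. no gzip: {977 / 1244:.0%}")  # 79%
print(f"T1 gzip vs. no gzip:      {933 / 949:.0%}")   # 98%
```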
As you can see, our application should be a prime example of a workload where a multi-core server CPU feels at home. With gzip compression enabled, scaling from 4 to 8 T1 cores is still almost perfect at 96%.
So, why aren't we seeing the performance that Sun claims in, for example, SPECweb2005, where the T1 has no problem outperforming quad-core x86 CPUs? We are not sure. We measured that 97% of the processing was done in OS code ("system") and only 2-3% of the CPU load in the actual application ("user"). We also suspect that even the relatively light load of FP operations may have lowered the T1's performance, as all eight cores share a single FPU. Depending on the tool we used, we saw 0.66 to 1% FP operations in our instruction mix, with peaks of 2.4%. Most likely, those FP operations result from our script calculating averages.
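For reference, the user/system split we quote is what tools like mpstat report system-wide on Solaris; per process, the same split can be read with getrusage. A minimal sketch of the idea on a Unix system, with a made-up syscall-heavy workload standing in for our web server:

```python
import os
import resource  # Unix-only; provides per-process CPU accounting

def cpu_split():
    """Return (user, system) CPU seconds consumed by this process so far."""
    r = resource.getrusage(resource.RUSAGE_SELF)
    return r.ru_utime, r.ru_stime

u0, s0 = cpu_split()

# Syscall-heavy work: each os.stat() crosses into the kernel, so its cost
# accumulates as "system" time rather than "user" time.
for _ in range(200_000):
    os.stat(".")

u1, s1 = cpu_split()
total = (u1 - u0) + (s1 - s0)
print(f"user: {u1 - u0:.2f}s  system: {s1 - s0:.2f}s  "
      f"({(s1 - s0) / total:.0%} of CPU time spent in the kernel)")
```

A web server doing network I/O, file serving, and context switching between hundreds of connections behaves similarly, which is why "system" time dominated in our test.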
26 Comments
phantasm - Wednesday, April 5, 2006 - link
While I appreciate the review, especially the performance comparison between Solaris and Linux on like hardware, I can't help but feel this article falls short as an enterprise-class server review, which, undoubtedly, a lot of enterprise folks will be looking for.
* Given the enterprise characteristics of the T2000, I would have liked to see a comparison against an HP DL385 and IBM x366.
* The performance testing should have been done with the standard Opteron processors (versus the HE). The HP DL385 using non-HE processors has nearly the same power and thermal characteristics as the T2000: the DL385 is a 4 A, 1615 BTU system, whereas the T2000 is a 4 A, 1365 BTU system.
* The T2000 is deficient in several design areas. It has a tool-less case lid that is easily removable. However, our experience has been that it opens too easily, and given the embedded 'kill switch', the system immediately shuts off without warning. Closing the case requires slamming the lid shut several times.
* The T2000 only supports *half height* PCI-E/X cards. This is an issue when using third-party cards.
* The Solaris installation has a nifty power-saving feature enabled by default. However, rather than throttling CPU speed or fans, it simply shuts down to the OK prompt after 30 minutes of a 'threshold' not being met. Luckily, this 'feature' can be disabled through the OS.
* Power button -- I ask any T2000 owner to show me one that doesn't have a blue or black mark from a ballpoint pen on the power button. Sun really needs to make a more usable power button on these systems.
* Disk drives -- The disk drives are not labeled with FRU numbers or any indication of size and speed.
* Installing and configuring Solaris on a T2000 takes roughly 10x longer than Linux on an x86 system. Most commonly, the initial install is done via HyperTerminal access to the remote console (painful). Luckily, subsequent builds can be done through a JumpStart server.
* HW RAID configuration -- This can only be done through Solaris OS commands.
I hope AnandTech takes up the call above to begin enterprise-class server reviews.
JohanAnandtech - Thursday, April 6, 2006 - link
The DL385 will be in our next test. All other issues you addressed will definitely be checked and tested.
That this falls short of a full review is clearly indicated by "first impressions", and it has been made clear several times in the article. Just give us a bit more time to get the issues out of our benchmarks. We had to move all our typical Linux x86 benchmarks to Solaris and the T1 while keeping things fair to Sun. This meant investing massive amounts of time in migrating databases and applications and tuning them.
davem330 - Friday, March 24, 2006 - link
You aren't seeing the same kind of performance that Sun is claiming regarding SPECweb2005 because Sun specifically chose workloads that make heavy use of SSL.
Niagara has on-chip SSL acceleration, using a per-core modular arithmetic unit.
BTW, would be nice to get a Linux review on the T2000 :-)
blackbrrd - Saturday, March 25, 2006 - link
Good point about the SSL. I can see both SSL and gzip being used quite often, so please include SSL in the benchmarks.
As mentioned in the article, even 1-2% FP operations affects the server quite badly, so I would say that one FPU per core would make the CPU a lot better. Looking forward to seeing results from the next generation.
.. but then again, both Intel and AMD will probably have launched quad cores by then...
Anyway, it's interesting to see a third contender :)
yonzie - Friday, March 24, 2006 - link
Nice review, a few comments though: I think that should have been , although you might mean dual-channel ECC memory, but if that's the case, it's a strange way to write it, IMHO.
No mention of the Pentium M on page 4, but it shows up in the benchmarks on page 5 and then not further on... Would have been interesting :-(
And the second scenario is what exactly? ;-) (yeah, I know it's written a few paragraphs later, but...)
Oh, and more pretty pictures pls ^_^
sitheris - Friday, March 24, 2006 - link
Why not benchmark it on a more intensive application like Oracle 10g?
JohanAnandtech - Friday, March 24, 2006 - link
We are still tuning and making sure our results are 100% accurate. Sounds easy, but it is incredibly complex. But they are coming.
Anyway, no Oracle, we have no support from them so far.
JCheng - Friday, March 24, 2006 - link
By using a cache file, you are all but taking MySQL and PHP out of the equation. The vast majority of requests will be filled by simply including the cached content. Can we get another set of results with the caching turned off?
ormandj - Friday, March 24, 2006 - link
I would agree. Not only that, but I sure would like to know what the disk configuration was. Especially when reading from a static file, this makes a big difference. Turn off caching and see how it does; that should be interesting! Disk configurations, please! :)
kamper - Friday, March 31, 2006 - link
No kidding. I thought that PHP script was pretty dumb. Once a minute, you'll get a complete anomaly as a whole load of concurrent requests all detect an out-of-date file, recalculate it, and then try to dump their results at the same time. How much time was spent testing each request rate, and did you try to make sure each run hit the anomaly in the same way, the same number of times?
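The anomaly described here is the classic cache stampede: every request that finds the cache stale regenerates it in parallel. A hypothetical sketch of the pattern and a lock-based fix, in Python rather than the article's PHP; the file names, TTL, and regenerate() cost are all made up for illustration:

```python
import os
import time
import fcntl  # Unix-only advisory file locking

CACHE = "/tmp/averages.cache"  # hypothetical cache file
TTL = 60                       # regenerate at most once a minute

def regenerate() -> str:
    # Stand-in for the expensive MySQL query plus average calculation
    return f"averages computed at {time.time()}\n"

def get_page() -> str:
    # Fast path: serve the cache if it is still fresh
    try:
        if time.time() - os.path.getmtime(CACHE) < TTL:
            with open(CACHE) as f:
                return f.read()
    except FileNotFoundError:
        pass
    # Stale or missing: take an exclusive lock so only ONE request
    # recomputes; concurrent requests block here instead of piling on MySQL.
    with open(CACHE + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            # Re-check: another request may have refreshed it while we waited
            if os.path.exists(CACHE) and time.time() - os.path.getmtime(CACHE) < TTL:
                with open(CACHE) as f:
                    return f.read()
            content = regenerate()
            with open(CACHE, "w") as f:
                f.write(content)
            return content
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Without the lock, every concurrent request that sees a stale mtime runs regenerate() itself, which is exactly the once-a-minute spike the comment describes.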