Original Link: https://www.anandtech.com/show/1481
Linux Desktop CPU Roundup: Cutting Edge Penguin Performance
by Kristopher Kubicki on September 19, 2004 8:00 PM EST - Posted in Linux
Introduction
Although we have published a few interesting Linux processor benchmarks in the past, whenever I get cornered by a professor on campus or speak as a guest at a Linux Users Group, the first question anyone asks me is, "Which processor should I get for my new workstation?" The possibilities are nearly limitless, but the Linux users whom I have met generally have the mentality of either "build something out of completely new parts, so it lasts" or "build something out of stuff that I find for free." The latter doesn't present many options, so today, we will address the first scenario - which new components make the ultimate Linux workstation. We found a few high end AMD and Intel processors to pit against our comprehensive Linux benchmark suite. Of course, don't forget to check out some of our other benchmarks, including AMD Sempron, Opteron 150 and Nocona 3.6 from last month.
With so many socket, memory and processor options, recent computer configurations can be extremely confusing. DDR2 or DDR1? AMD or Intel? 1MB L2 cache or 512KB? HyperThreading on or off? None of these are easy questions, particularly if we throw an alternative operating system into the mix. We set up all of our benchmarks so that they can be replicated easily by anyone using a similar configuration. Below, you can see which configurations were used for the benchmark analysis.
Performance Test Configuration | |
Processor(s): | AMD Athlon FX-53 (130nm, 2.4GHz, 1MB L2 Cache, Socket 939) |
 | AMD Athlon 64 3800+ (130nm, 2.4GHz, 512KB L2 Cache) |
 | AMD Athlon 64 3500+ (130nm, 2.2GHz, 512KB L2 Cache) |
 | Intel Pentium 4 Extreme Edition 3.4GHz (130nm, 512KB L2 Cache, 2MB L3 Cache) |
 | Intel Pentium 4 560 3.6GHz (90nm, 1MB L2 Cache) |
 | Intel Pentium 4 530 3.0GHz (90nm, 1MB L2 Cache) |
RAM: | 2 x 512MB Mushkin PC-3200 CL2 (400MHz) |
 | 2 x 512MB Corsair PC2-5400 CL3 (675MHz) |
Motherboards: | DFI LanParty 915P-T12 (Socket 775) |
 | MSI K8T Neo2 (Socket 939) |
Memory Timings: | Default |
Operating System(s): | SuSE 9.1 Professional, Kernel 2.6.5-7.108 |
Compiler: | linux:~ # gcc -v |
 | Reading specs from /opt/gcc-mainline/lib/gcc/i586-suse-linux/3.4.1/specs |
 | Configured with: ../configure --enable-threads=posix --prefix=/opt/gcc-mainline --with-local-prefix=/usr/local --infodir=/opt/gcc-mainline/share/info --mandir=/opt/gcc-mainline/share/man --libdir=/opt/gcc-mainline/lib --libexecdir=/opt/gcc-mainline/lib --enable-languages=c,c++,f77,objc,java,ada --enable-checking --enable-libgcj --with-gxx-include-dir=/opt/gcc-mainline/include/g++ --with-slibdir=/lib --with-system-zlib --enable-shared --enable-__cxa_atexit i586-suse-linux |
 | Thread model: posix |
 | gcc version 3.4.1 20040508 (prerelease) (SuSE Linux) |
For the majority of the benchmark analysis, we leave the HyperThreading capabilities of the Intel processors off. Unfortunately, most workstation applications are not multi-threaded, and running HyperThreading penalizes the Intel processors when it isn't needed. We do run some benchmarks where multiple threads are utilized, and in those instances, we show the HyperThreading results separately. For most of our tests, you will see 32-bit binaries on 32-bit Linux kernels. Moving the mouse over the benchmark graphs reveals the 64-bit tests that we ran on our Athlon 64 processors. The Intel processors in this analysis do not have 64-bit capabilities.
We also have a small DDR2 versus DDR1 comparison near the end of this article. For the Intel processors, we use the DDR2 memory provided by Corsair exclusively, except in the DDR2 versus DDR1 comparison. We chose the MSI K8T board for our AMD tests, since it was one of the most stable and reasonably priced motherboards for Socket 939. DFI won our spot as the Socket 775 test bed for its DDR2/DDR1 support and solid stability. This made the DDR2 versus DDR1 test particularly practical, since we could just swap memory modules without changing motherboards. Let's jump right into benchmarking.
Unless stated otherwise, each of our benchmarks is run three times and the best result is recorded. Note that we have updated to the more current GCC 3.4.1.
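For readers replicating these numbers, the run-it-three-times procedure is easy to script. What follows is only a minimal sketch, not the harness used for this article: best_of and fastest are hypothetical helper names, and the script assumes GNU date (for the %N nanosecond format), sort and awk are installed. For timed tests the best mark is the lowest time; for scored tests, swap head for tail.

```shell
#!/bin/sh
# Hypothetical helpers (not from the article): run a command several
# times and keep the best wall-clock time, in seconds.

# Print the smallest number read from stdin, one value per line.
fastest() {
    sort -n | head -n 1
}

# Usage: best_of RUNS COMMAND [ARGS...]
# Assumes GNU date, whose %N format gives nanoseconds.
best_of() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s.%N)
        # Discard the command's own output so only timings are printed.
        "$@" >/dev/null 2>&1
        end=$(date +%s.%N)
        awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
        i=$((i + 1))
    done | fastest
}
```

For example, best_of 3 lame sample.wav -b 192 -m s -h - would time three encodes of a local sample.wav and print the quickest.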
Database Tests
MySQL, here version 4.0.20d, has been a staple of our Linux tests since their inception. Even though it does not carry high relevance for a workstation test, we still regard it as the de facto free, open source benchmark for Linux. Below, you can see our results for sql-bench on both the 64-bit and 32-bit kernels for SuSE 9.1.
We already see some exciting trends with this benchmark. For one, the 64-bit mysqld appears to be much faster than the 32-bit one (mouse over the graphs to see the difference). The above benchmarks were run without HyperThreading, so we enabled it for the graph below.
We expected to see a performance increase with HyperThreading, since SQL servers must thread well. Unfortunately, the sql-bench benchmark is more to blame than anything else; it does not thread realistically. Until we validate a new benchmark for this portion of our Linux benchmark suite, sql-bench will do, but keep in mind its extremely synthetic behavior.
Rendering Benchmarks
Below, we use Mental Ray 3.3.1 to render a particularly intensive benchmark scene (which you can download here). Maya exists in 64-bit binaries in various circles, but we have only been able to obtain a 32-bit license and thus a 32-bit version of Mental Ray. Below, you can see how the 32-bit binaries perform on both 64-bit and 32-bit versions of SuSE 9.1 Pro.
POV-Ray shows almost identical scaling to the Mental Ray benchmark. We are also noticing a trend between the Athlon 64 3800+ and the Athlon FX-53. Even though they have a 512KB difference in L2 cache, many of our benchmarks aren't showing that the FX-53 uses that additional cache to its advantage. There are enormous performance benefits under 64-bit operation with POV-Ray.
Chess Benchmarks
Although TSCP is neither a model of a practical application nor a synthetic benchmark, it does provide us with some valuable data for different breakdowns of compiler flags. As we have mentioned in past Linux analyses, compiler flags can show large differences between processors if they are used incorrectly. Below, you can see the 32-bit and 64-bit binaries as compiled with GCC 3.4.1.
The difference between optimization levels does not appear as dramatic with GCC 3.4.1. If you recall some of our previous benchmarks, we were getting differences of as much as 20% between -O2 and -O3. The Extreme Edition processor really pulls ahead in this benchmark, which surprised us at first. Again, we don't see the FX-53 outperforming the Athlon 64s despite its additional L2 cache.
Compiling Benchmarks
We get a lot of requests to show some compiling benchmarks. Those running Gentoo at home should pay particular attention to this portion of the analysis. We took the standard Linux 2.6.4 release from kernel.org and compiled it on our 32-bit test bed. For simplicity, we did not cross-compile, so we are only looking at the 32-bit vanilla kernel. We used the commands below.
# yes "" | make config
# time make
We are greeted with a nice, gradual performance curve. We have now seen the lower-clocked 3.4GHz P4EE overtake its higher-clocked 90nm counterpart on several occasions - a definite trend has set in. Below, you can see how the Intel processors fared when we enabled HyperThreading.
Keep in mind that make is not actually threading here; we are just determining what kind of performance hit occurs by enabling the two virtual processors rather than keeping just one active. Obviously, running two applications at once will see performance benefits. The unfortunate fact remains that workstation software continues to be largely single-threaded. We receive some benefit by running multiple applications at the same time, like rendering a file and playing an MP3, but there are very few Linux workstation programs that fully utilize multiple threads. Fortunately, we have an article in the works that deals with how to get the best performance out of multiple threads (and HyperThreading/SMP configurations).
We ran two additional tests with the Prescott processors, timing the kernel build while forcing make to run parallel jobs.
It is very easy to use make -j* incorrectly. There are small percentage gains from using make -j3 over make -j2 with HyperThreading.
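A common rule of thumb (an assumption on our part, not something these results prove) is to give make one more job than there are logical processors, which on a single HyperThreading CPU works out to the -j3 setting mentioned above. A minimal sketch follows; make_jobs is a hypothetical helper name, and getconf _NPROCESSORS_ONLN is assumed to be supported, as it is on Linux with glibc.

```shell
#!/bin/sh
# Hypothetical helper (not from the article): suggest a make -j value
# of "logical CPUs + 1".
# Usage: make_jobs [LOGICAL_CPUS]
# With no argument, the online CPU count is read via getconf.
make_jobs() {
    cpus=${1:-$(getconf _NPROCESSORS_ONLN)}
    echo $((cpus + 1))
}

# Example, not executed here:
#   time make -j"$(make_jobs)"
```

Treat the result as a starting point only; as the numbers show, the gain from the extra job is small, and the right value depends on the workload.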
Synthetic Benchmarks
Synthetic benchmarks can still give us a good idea of what our processors should be doing. However, since they are theoretical and not good real world demonstrations of the technology, we generally rely on them only to verify that our testbeds are operating correctly. Below is the Scalar Product Opstone 04q2 as described by the author:
"The 'SP' benchmark calculates the scalar product (dot product) of 2 vectors ranging in size from 16 elements to 1048576 elements for both single and double-precision floats. Although the Gflops/sec. for every vector length is recorded (in the resulting output log file), the average of all these values is reported. This benchmark is indicative of the performance of many raw floating-point data processing apps (movie format conversion, MP3 extraction, etc.)"
We used the Athlon 64 binaries under SuSE x86_64, and the Pentium 4 binaries under SuSE x86.
The Intel processors score very high marks on our Opstone benchmarks. Unfortunately, we find that this is not entirely indicative of good performance, and in fact, the Opstone benchmark does not scale well with the rest of our test suite.
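To make the metric concrete, here is a rough illustration of what a scalar product is and how a floating-point rate can be derived from one. This is not the Opstone code: dot_product and gflops are hypothetical helpers, the awk arithmetic is far slower than a compiled benchmark, and the rate formula simply assumes one multiply and one add per element.

```shell
#!/bin/sh
# Illustrative only; not the Opstone implementation.

# Read two vectors from stdin, one "x y" pair per line, and print
# their scalar (dot) product.
dot_product() {
    awk '{ sum += $1 * $2 } END { print sum + 0 }'
}

# Usage: gflops LENGTH REPETITIONS SECONDS
# Assumes about 2 * LENGTH floating-point operations per dot product
# (one multiply and one add per element).
gflops() {
    awk -v n="$1" -v reps="$2" -v secs="$3" \
        'BEGIN { printf "%.3f\n", 2 * n * reps / secs / 1e9 }'
}
```

For instance, printf '1 4\n2 5\n3 6\n' | dot_product prints 32, and a run that completed 1000 products of 1048576 elements in one second would be reported by gflops 1048576 1000 1.0 as 2.097.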
Content Creation
At the other end of the spectrum from synthetic benchmarks, we have content creation benchmarks, which are extremely difficult to replicate and convey little information if interpreted incorrectly. Below, we compiled lame 3.96.1 without any additional optimizations and then used the following command on an 800MB .wav file.
# lame sample.wav -b 192 -m s -h - >/dev/null
The encoded file is sent to stdout, which is then redirected to /dev/null. We do not want the hard drive to throttle our MP3 encoding if possible, even if it means immediately discarding the output.
Now on to our MEncoder test. We compiled 1.0pre5 from source without any optimizations. We had difficulty getting MPlayer to compile on x86_64, and thus, that portion was omitted. The benchmark command that we ran is below:
# time mencoder sample.mpg -nosound -ovc lavc -lavcopts vcodec=mpeg4:vpass=2 -o sample.avi
Again, we saw vast differences between the AMD and Intel processors on these content specific benchmarks.
Encryption Benchmarks
Finally, our favorite part of any Linux benchmark - hashing and encryption tests. John the Ripper has multiple optimizations for Intel and AMD hardware, including hand-coded ASM functions. For the 64-bit versions of John, we compiled using the make linux-x86-64-elf target. As you can see, this produces slower binaries than the 32-bit versions of John. Oddly, the 64-bit binaries were heavily dependent on compiler flags, while the 32-bit versions showed very little difference from one compile flag to another.
OpenSSL's crypto libraries are probably heavily optimized for 64-bit operation; we see the difference in the two architectures very clearly. RSA performance is severely crippled on the Pentium 4 platform. Although this is an extreme example of one hardware platform dominating another, we consider this to be a relevant real world example.
An unusual problem occurred while running the OpenSSL benchmark. Even though we are using Intel-validated heatsink/fan combos, we received a continuous stream of errors from the operating system regarding processor temperature. An example syslog message can be seen below:
Message from syslogd@linux at Mon Sep 18 01:57:27 2004 ...
linux kernel: CPU#0: Temperature above threshold
This does not bode well for the processor. Our processor test bed is completely caseless, and if we have issues with our 3.6GHz processor outside of a normal case, we can't imagine what issues might exist in a full enclosure.
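To gauge how often the processor trips its thermal threshold during a run, the warnings can be counted from the system log. A minimal sketch, assuming the kernel messages are also written to a log file such as /var/log/messages (that path is our assumption; only the console message is shown above), with count_thermal_events as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper (not from the article): count kernel thermal
# warnings in a log file.
# Usage: count_thermal_events LOGFILE
count_thermal_events() {
    # grep -c prints the number of matching lines (0 if none).
    grep -c 'Temperature above threshold' "$1"
}

# Example, not executed here (log path is an assumption):
#   count_thermal_events /var/log/messages
```

Comparing the count before and after a benchmark gives a rough idea of how much of the run was spent above the threshold.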
DDR2 versus DDR1
There are very few real-world benchmarks that completely demonstrate total memory bandwidth saturation. For real world applications, latency becomes critical - or at least so logic dictates. Below, you can see what happens to our Intel processors when we run them on DDR2 and DDR1. For our DDR1 kit, we used the same low latency Mushkin PC-3200 (2-2-2) that we used for the AMD testbed. Low latencies are something that DDR2 modules lack right now; even though our Corsair XMS2 memory operates at ~675MHz, its timings are a less than spectacular 4-4-4.
Let's take a look at a kernel compile - a typical memory and CPU hog.
Below are the two different configurations of Intel processors running DDR1 and DDR2 during our Mental Ray 3.3.1 benchmark.
We are not really given a clear indication of which configuration runs better. The P4EE with DDR2 edges past every other configuration, but we expected more differentiation between the two. Let us look at some more content creation tests.
Notice that the MySQL test-select benchmark improves with DDR1 over DDR2. When we are making thousands of little queries to the database, we are relying more heavily on low latencies than on the larger bandwidth headroom of DDR2. However, remember our warning to take the sql-bench tests with a grain of salt. They do not entirely reflect SQL performance or workstation performance. The inconsistency in which memory configuration wins from one benchmark to another definitely complicates matters. Price generally finishes the discussion for us.
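The latency argument can be put in rough numbers. CAS latency is counted in bus clock cycles, and DDR and DDR2 both transfer data twice per clock, so a CL2 module at 400MHz and a CL4 module at 675MHz can be compared in nanoseconds. The sketch below is a back-of-the-envelope illustration, not a measurement: cas_latency_ns is a hypothetical helper, it takes the CL4 figure from the 4-4-4 timings quoted earlier, and it ignores every timing other than CAS.

```shell
#!/bin/sh
# Hypothetical helper (not from the article): convert a CAS latency in
# cycles to nanoseconds.
# Usage: cas_latency_ns CAS_CYCLES DATA_RATE_MHZ
# The bus clock is half the quoted data rate, so one cycle lasts
# 2000 / DATA_RATE_MHZ nanoseconds.
cas_latency_ns() {
    awk -v cl="$1" -v rate="$2" \
        'BEGIN { printf "%.2f\n", cl * 2000 / rate }'
}
```

Here cas_latency_ns 2 400 prints 10.00 for the DDR1 kit, while cas_latency_ns 4 675 prints 11.85 for the DDR2 kit, so by this simple measure the faster-clocked memory still waits longer for the first word.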
Summing it all up
We were extremely pleased to see the 64-bit applications generally perform better than their 32-bit counterparts. Unfortunately, there were still several cases where 64-bit binaries performed slower, John the Ripper being one example. Some software, like MEncoder 1.0pre5, proved difficult to install on SuSE 9.1 x86_64 as well. We didn't even touch on the hundreds of software ports that do not have working 64-bit binaries yet, including Wine. Sometimes the advantage of speed does not outweigh the advantage of software compatibility.
Another interesting revelation in our analysis came when we swapped our DDR2 memory with DDR1. We observed cases favoring each memory configuration, with no clear winner. Although we ran several benchmarks and saw several trends, DDR2 versus DDR1 on Linux looks inconclusive from a performance standpoint. Even though DDR2 continues to fall in price, the remaining premium is difficult to justify. Our DDR2 memory configuration retails for $400 while our DDR1 configuration retails for $250. With many 915P and all 925X motherboards, you are not even given the choice of which memory type to use, so weigh your motherboard choice against the value that you put on your memory. The DFI LanParty board that we used for this analysis supports both.
A straight comparison of processor against processor is not as simple as it looks. Price invariably becomes the strongest argument in buying one CPU over another. The cheapest CPU in our shootout (the Athlon 64 3500+) costs $350 while the Pentium 4 560 - if you can find it - retails for $500. The Pentium 4 3.4GHz Extreme Edition and Athlon FX-53 both retail for over $800. If you are considering a processor merely on bang for your buck, the Athlon 64 3500+ does not disappoint. The only real reason why anyone should even consider buying an Extreme Edition or FX processor would be for overclocking (if that's your thing). As we saw in most of our benchmarks, the 3800+ and the FX-53 performed very similarly, usually within 3% of each other.
Realistically, the Pentium 4 560 and the Athlon 64 3500+ are the best contenders in this match up. In six months when we run this shootout again we will likely be saying the same things about the Athlon 64 3800+. For now, however, the Athlon 64 3500+ does an excellent job of balancing price with performance. Arguably the most compelling reason to suggest the Athlon 64 over a Pentium 4 would be for the extremely favorable 64-bit content creation binaries. Wary of x86_64 "gotchas", this editor just dual boots SuSE x86 and SuSE x86_64 for the best of both worlds.