Dynamic Power Management: A Quantitative Approach
by Johan De Gelas on January 18, 2010 2:00 AM EST - Posted in IT Computing
Limitations
First of all, let's discuss the limitations of this review. The benchmark we used allowed us to control the number of threads very accurately, but it is not a real-world benchmark for most IT professionals. The fact that it is an integer-dominated benchmark gives it some relevance, but it is still not ideal. In our next article we will be using MS SQL Server 2008 R2, which will allow us to measure power efficiency at a certain performance level, a metric that is much more relevant than pure performance/watt. Also, the low power six-core Opteron 2419 EE is missing: that CPU arrived in the labs just as we finished this article, so expect an update soon.
"Academic" Conclusion
The days when dynamic frequency scaling offered significant power savings are over. The reason is that you can only lower voltages if you scale the complete package towards a lower clock. In that case the power savings are considerable (P ~ V²), but we did not encounter that situation very often. No, both AMD and Intel favor the strategy of placing idle cores in higher C-states. The most important power savings come from fine-grained clock gating, from placing cores in a completely clock gated C-state (AMD's Smart Fetch + C1), or, even better, from placing them in a power gated state (Intel's power gating into the deep C6 sleep state).
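To make the P ~ V² point concrete, consider the classic approximation for switching power, P ≈ C·V²·f. The short sketch below compares the three strategies; the capacitance constant, voltage/frequency pairs, and leakage share are invented illustrative numbers, not measurements of any of the CPUs tested.

```python
# Back-of-the-envelope comparison of DVFS, clock gating, and power gating.
# All constants are illustrative assumptions, not measured values.

def dynamic_power(c_eff, voltage, freq_ghz):
    """Classic approximation of switching power: P ~ C * V^2 * f."""
    return c_eff * voltage ** 2 * freq_ghz

C_EFF, V_NOM, F_NOM = 10.0, 1.30, 2.6      # hypothetical core at full speed
p_full = dynamic_power(C_EFF, V_NOM, F_NOM)

# DVFS: the whole package drops to 0.8GHz at a lower voltage (1.05V assumed).
p_dvfs = dynamic_power(C_EFF, 1.05, 0.8)

# Clock gating (AMD Smart Fetch + C1): f -> 0 for the idle core, leakage remains.
leakage = 0.15 * p_full                     # assumed leakage share of full power
p_clock_gated = leakage

# Power gating (Intel C6): the idle core is electrically cut off entirely.
p_power_gated = 0.0

for name, p in [("full speed", p_full),
                ("DVFS, whole package", p_dvfs),
                ("clock gated idle core", p_clock_gated),
                ("power gated idle core", p_power_gated)]:
    print(f"{name:22s}: {p:5.1f} relative units")
```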
Practical Conclusions
Windows 2008 makes you choose between the Balanced and Performance power plans. If your application is idle most of the time and you are heavily power constrained, Balanced is always the right choice. In all other cases, we would advise using the Performance plan for the Opterons. For some reason, the CPU driver does not deliver the performance that is demanded: with Balanced, when you ask for 25% of the total CPU performance, you get something like 15% to 20%. The result is that you get up to 25% less performance than the CPU delivers in Performance mode, without significant power savings. That's not good. We can already give away that we saw response time increases in MS SQL Server 2008 due to this phenomenon. It is also worth noting that our new measurements confirm that the performance/watt ratio of the six-core Opterons is significantly better than that of the quad-core Opterons.
The Xeons are a different story. For the normal 95W Xeons it makes sense to run in Balanced mode: the "base" performance is excellent, and Turbo Boost adds a bit of performance but also quite a bit of power. Ideally, it should be possible to run in Balanced mode and still use Turbo Boost when your application is performing a single-threaded batch operation, but unfortunately this is not possible with the default Windows 2008 settings.
For the low power Xeons, it is different. Those CPUs run closer to their specified TDP limit and will rarely use Turbo Boost once they are loaded at 25% or more. If your application is limited by regular single-threaded batch operations, it makes a lot of sense to choose the Performance plan. Turbo Boost pays off in that case: the clock speed is raised from a meager 1.86GHz to an impressive 3.2GHz. As Xeons based on the "Nehalem" architecture place idle cores in C6 very quickly, the Performance plan hardly consumes more than the Balanced plan. As we have shown, frequency scaling does not save much power, as most of the idle cores are power gated automatically. This aggressive "go to C6 sleep" policy allows the architecture with the highest IPC in the industry to morph into a high performance server CPU with modest power consumption. There is a huge difference between this CPU in a machine where it is pushed to 100% load and in a server where it hovers between 20% and 70% load most of the time. The latter situation allows the CPU to keep cores in C6 a significant amount of time, and as a result the power savings in a server environment are nothing short of impressive.
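For readers who want to check how much time their own cores spend in deep sleep, the Linux cpuidle sysfs interface exposes per-core residency counters. The sketch below is only an illustration of that interface (state names depend on the driver, e.g. "C6" or "C6-NHM"); it is not how we measured power in this article.

```python
# Minimal sketch: read per-core C-state residency from the Linux cpuidle
# sysfs interface. Assumes a Linux host with cpuidle enabled.
import glob
import os

def cstate_residency():
    results = {}
    for state_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpuidle/state*")):
        cpu = state_dir.split("/")[-3]              # e.g. "cpu0"
        with open(os.path.join(state_dir, "name")) as f:
            name = f.read().strip()                 # e.g. "C1", "C3", "C6"
        with open(os.path.join(state_dir, "time")) as f:
            usec = int(f.read())                    # total microseconds spent in this state
        results.setdefault(cpu, {})[name] = usec
    return results

for cpu, states in sorted(cstate_residency().items()):
    print(cpu, {name: f"{t / 1e6:.1f}s" for name, t in states.items()})
```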
Now that we understand the nuts and bolts, we are able to move on to our next question: How can we get the best power efficiency at a certain performance point? We will follow up with a power efficiency case study based on SQL Server 2008.
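As a preview of that follow-up, the difference between raw performance/watt and "power efficiency at a certain performance point" boils down to holding throughput constant and comparing the power, or the energy per unit of work, needed to sustain it. The numbers below are invented purely to show the calculation:

```python
# Energy per unit of work at a fixed performance target (invented sample data).
target_rps = 400                             # hypothetical fixed load: 400 requests/s
power_samples_w = [210.0, 198.0, 205.0]      # wall power measured while holding that load

avg_power = sum(power_samples_w) / len(power_samples_w)
perf_per_watt = target_rps / avg_power       # requests per second per watt
energy_per_request = avg_power / target_rps  # W / (req/s) = joules per request

print(f"average power at {target_rps} req/s: {avg_power:.1f} W")
print(f"performance/watt: {perf_per_watt:.2f} (req/s)/W")
print(f"energy per request: {energy_per_request:.3f} J")
```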
Comments
JohanAnandtech - Tuesday, January 19, 2010
Well, Oracle has a few downsides when it comes to this kind of testing. It is not very popular in smaller and medium businesses AFAIK (our main target), and we still haven't worked out why it performs much worse on Linux than on Windows. So choosing Oracle is a sure way to make the project time explode... IMHO.
ChristopherRice - Thursday, January 21, 2010
Works worse on Linux than Windows? You likely have a setup issue with the kernel parameters or within Oracle itself. I actually don't know of any enterprise location that uses Oracle on Windows anymore. Generally it's all RHEL4/RHEL5/Sun.
TeXWiller - Monday, January 18, 2010
The 34xx series supports four quad-rank modules, giving today a maximum supported amount of 32GB per CPU (and board). The 24GB limit is that of the three-channel controller with unbuffered memory modules.
pablo906 - Monday, January 18, 2010
I love Johan's articles. I think this has some implications for how virtualization solutions may be the most cost effective. When you're running at 75% capacity on every server, I think the AMD solution could possibly become more attractive. I think I'm going to have to do some independent testing in my datacenter with this.
I'd like to mention that focusing on VMware is a disservice to VT technology as a whole. It would be like not having benchmarked the K6-3+ just because P2s and Celerons were the mainstream and SS7 boards weren't quite up to par. There are situations, primarily virtualizing Linux, where Citrix XenServer is a better solution. Also, many people who are buying Server '08 licenses are getting Hyper-V licenses bundled in for "free."
I've known several IT Directors in very large Health Care organizations who are deploying a mixed Hyper-V/XenServer environment because of the "integration" between the two. Many of the people I've talked to at events around the country are using this model for at least part of their virtualization deployments. I believe it would be important to publish to the industry what kind of performance you can expect from such deployments.
You can do some really interesting homebrew SAN deployments with OpenFiler or OpeniSCSI that can compete with the performance of EMC, Clariion, NetApp, LeftHand, etc. I've found NFS deployments can bring you better performance and manageability. I would love to see some articles about the strengths and weaknesses of the storage subsystem used and how it affects each type of deployment. I would absolutely be willing to devote some datacenter time and experience to helping put something like this together.
I think this article ties in really well with the virtualization talks, and I would love to see more comments on what you think this means for someone with a small, medium, or large datacenter.
maveric7911 - Tuesday, January 19, 2010
I'd personally prefer to see KVM over XenServer. Even Red Hat is ditching Xen for KVM. In the environments I work in, Xen is actually being decommissioned for VMware.
JohanAnandtech - Tuesday, January 19, 2010
I can see the theoretical reasons why some people are excited about KVM, but I still don't see the practical ones. Who is using this in production? Getting Xen, VMware, or Hyper-V to do their job is pretty easy; KVM does not even seem close to being beta. It is hard to get working, and it is nowhere near Xen when it comes to reliability. Admittedly, those are our first impressions, but we are no virtualization rookies. Why do you prefer KVM?
VJ - Wednesday, January 20, 2010
"It is hard to get working, and it nowhere near to Xen when it comes to reliabilty. "I found Xen (separate kernel boot at the time) more difficult to work with than KVM (kernel module) so I'm thinking that the particular (host) platform you're using (windows?) may be geared towards one platform.
If you had to set it up yourself then that may explain reliability issues you've had?
On Fedora linux, it shouldn't be more difficult than Xen.
Toadster - Monday, January 18, 2010
One of the new technologies released with the Xeon 5500 (Nehalem) is Intel Intelligent Power Node Manager, which controls P/T states within the server CPU. This is a good article on existing P/C states, but will you guys be doing a review of the newer control technologies as well? http://communities.intel.com/community/openportit/...
JohanAnandtech - Tuesday, January 19, 2010
I don't think it is "newer". Going to C6 for idle cores is less than a year old, remember :-). It seems to be a sort of manager which monitors the electrical input (PDU based?) and then lowers the P-states to keep the power at a certain level. Did I miss something? (quickly glanced)
Personally, I think HP is more onto something by capping the power inside their server management software. But I still have to evaluate both. We will look into that.
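For illustration only, the feedback loop such a capping manager implies could look like the sketch below. The read_power_watts and set_pstate callables, the cap, and the P-state table are placeholders we made up; this is not Node Manager's or HP's actual interface.

```python
# Hypothetical power-capping loop: sample measured power, step the P-state
# down when over the cap, and step back up when comfortably under it.
import time

POWER_CAP_W = 250                          # assumed per-server cap
P_STATES_MHZ = [2600, 2100, 1700, 1400]    # hypothetical frequency steps, fastest first

def capping_loop(read_power_watts, set_pstate, interval_s=1.0):
    level = 0
    while True:
        watts = read_power_watts()
        if watts > POWER_CAP_W and level < len(P_STATES_MHZ) - 1:
            level += 1                     # over the cap: clock down one step
        elif watts < 0.9 * POWER_CAP_W and level > 0:
            level -= 1                     # well under the cap: clock back up
        set_pstate(P_STATES_MHZ[level])
        time.sleep(interval_s)
```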
n0nsense - Monday, January 18, 2010
Maybe I missed something in the article, but from what I see at home, C2Q (and C2D) can manage frequencies per core. I'm not sure it is possible under Windows, but in Linux it just works this way. You can actually see each core at its own frequency.
Moreover, you can select for each core which frequency it should run at.
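For readers who want to try what the commenter describes, Linux exposes this through the per-core cpufreq sysfs directories: scaling_cur_freq reports each core's current frequency, and with the "userspace" governor you can pin a core to a specific frequency. A minimal sketch, assuming the standard cpufreq sysfs layout and root privileges for the writes:

```python
# Read each core's current frequency and pin one core to a fixed frequency
# via the Linux cpufreq sysfs interface (requires the "userspace" governor).
import glob

def current_frequencies():
    freqs = {}
    for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")):
        cpu = path.split("/")[-3]                # e.g. "cpu0"
        with open(path) as f:
            freqs[cpu] = int(f.read()) // 1000   # kHz -> MHz
    return freqs

def pin_core(cpu_index, freq_khz):
    base = f"/sys/devices/system/cpu/cpu{cpu_index}/cpufreq"
    with open(f"{base}/scaling_governor", "w") as f:
        f.write("userspace")                     # governor that accepts manual setpoints
    with open(f"{base}/scaling_setspeed", "w") as f:
        f.write(str(freq_khz))

print(current_frequencies())                     # e.g. {'cpu0': 2933, 'cpu1': 1600, ...}
```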