The proprietary x86 instruction set extensions: a waste of time, money and energy
by Johan De Gelas on December 6, 2009 12:00 AM EST
Agner Fog, a Danish expert in software optimization, is making a plea for an open and standardized procedure for x86 instruction set extensions. At first sight, this may seem like a discussion that does not concern most of us. After all, the poor souls who have to program compilers for the insanely complex x86 will take care of the complete chaos called "the x86 ISA", right? Why should the average developer, system administrator or hardware enthusiast care?
Agner goes into great detail about why the incompatible SSE-x.x additions and other ISA extensions were and are a pretty bad idea, but let me summarize it in a few quotes:
- "The total number of x86 instructions is well above one thousand" (!!)
- "CPU dispatching ... makes the code bigger, and it is so costly in terms of development time and maintenance costs that it is almost never done in a way that adequately optimizes for all brands of CPUs."
- "the decoding of instructions can be a serious bottleneck, and it becomes worse the more complicated the instruction codes are"
- The cost of supporting obsolete instructions is not negligible. You need large execution units to support a large number of instructions. This means more silicon space, longer data paths, more power consumption, and slower execution.
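To make the "CPU dispatching" point concrete, here is a minimal sketch of what such dispatching looks like in practice (the function names are purely illustrative, and __builtin_cpu_supports is a GCC/Clang convenience that postdates some of the hardware discussed here). Every hot routine ends up needing one variant per extension or vendor, which is exactly where the development and maintenance cost comes from:

```c
#include <stdio.h>

typedef int (*sum_fn)(const int *, int);

/* Two versions of the same routine: one you would hand-tune or let the
   compiler vectorize for SSE4.2-class hardware, and a generic fallback.
   (Both bodies are kept trivial here; only the dispatching matters.) */
static int sum_sse42(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) s += v[i];   /* imagine an SSE4.2-tuned loop */
    return s;
}

static int sum_generic(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) s += v[i];
    return s;
}

/* The dispatcher: one runtime check per routine (or per routine family),
   multiplied by every CPU brand and extension you want to optimize for. */
static sum_fn pick_sum(void)
{
    if (__builtin_cpu_supports("sse4.2"))
        return sum_sse42;
    return sum_generic;
}

int main(void)
{
    int data[4] = { 1, 2, 3, 4 };
    sum_fn sum = pick_sum();
    printf("sum = %d\n", sum(data, 4));
    return 0;
}
```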
Summarized: Intel and AMD's proprietary x86 additions cost us all money. How much is hard to calculate, but our CPUs consume extra energy and underperform because decoders and execution units are unnecessarily complicated. The software industry is wasting quite a bit of time and effort supporting the different extensions.
Not convinced, still thinking that this only concerns the HPC crowd? Virtualization platforms contain up to 8% more code just to support the incompatible virtualization instructions, which offer almost exactly the same features; each VMM is 4% bigger because of this. So whether you are running Hyper-V, VMware ESX or Xen, you are wasting valuable RAM space. It is not dramatic of course, but it is unnecessary waste. Much worse is that this unstandardized x86 extension mess has made it a lot harder for datacenters to take the step towards a really dynamic environment where you can load balance VMs and thus move applications from one server to another on the fly. It is impossible to move (vMotion, live migrate) a VM from Intel to AMD servers, or from newer to (some) older ones, and in some situations you need to fiddle with CPU masks (and read complex tech documents) just to make it work. Should 99% of the market lose money and flexibility because 1% of the market might get a performance boost?
The reason why Intel and AMD still continue with this is that some people inside those companies feel it can create a "competitive edge". I believe this "competitive edge" is negligible: how many people have bought an Intel "Nehalem" CPU because it has the new SSE 4.2 instructions? How much software supports yet another x86 instruction addition? So I fully support Agner Fog in his quest for a (slightly) less chaotic and more standardized x86 instruction set.
epobirs - Tuesday, December 8, 2009 - link
Not nothing. It made a significant difference for certain types of apps. MMX was what made software decoding of DVD possible before CPUs became otherwise fast enough. The pre-MMX 233 MHz Pentium couldn't do it, but the MMX-equipped 233 MHz Pentium could with updated software. This was a big win for selling new PCs offering DVD playback without the expense. It gave a nice boost to plenty of other apps like Photoshop. If you were a heavy-duty user making your living with that app, it was enough to make a new machine very attractive. Back when DSP boards for the Mac costing $5,000 gave a similar boost, I used to have clients who said they'd make up the cost on one major job by getting it done in four days instead of five.

Back then, speedier CPUs appeared only at long intervals. By the time MMX was introduced, new speed grades were getting pretty frequent, making it harder to appreciate what SIMD brought to the table. It didn't set the world on fire, but it was a worthwhile addition to the processor. As transistor real estate gets ever cheaper and more compact, it makes very good sense to create instructions that maximize throughput on frequent operations. Another good example is dedicated silicon in GPUs for offloading video playback operations. The cost for this little bit of chip space is so low it doesn't make sense to bother producing systems without it, even if they might never perform any video playback.
mamisano - Tuesday, December 8, 2009 - link
AMD created the SSEPlus project to help with some of the issues presented in this article: http://sseplus.sourceforge.net/
[quote]In March 2008, AMD initiated SSEPlus, an open-source project to help developers write high performing SSE code. The SSEPlus library simplifies SIMD development through optimized emulation of SSE instructions, CPUID wrappers, and fast versions of key SIMD algorithms. SSEPlus is available under the Apache v2.0 license.
Originally created as a core technology in the Framewave open-source library, SSEPlus greatly enhances developer productivity. It provides known-good versions of common SIMD operations with focused platform optimizations. By taking advantage of the optimized emulation, a developer can write algorithms once and compile for multiple target architectures. This feature also allows developers to use future SSE instructions before the actual target hardware is available.[/quote]
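The general idea is easy to sketch with plain Intel intrinsics (this is not SSEPlus's actual API, just an assumed illustration of what "optimized emulation" means): when a newer instruction such as SSSE3's packed absolute value is not available, an equivalent sequence of older SSE2 instructions is substituted, so the algorithm itself is written only once.

```c
#include <stdio.h>
#include <emmintrin.h>      /* SSE2 intrinsics */
#ifdef __SSSE3__
#include <tmmintrin.h>      /* SSSE3: _mm_abs_epi32 */
#endif

/* Absolute value of four packed 32-bit integers. On an SSSE3 build this is
   a single instruction; otherwise it is emulated with SSE2 using
   abs(x) = (x XOR sign) - sign, where sign = x >> 31 (arithmetic shift). */
static inline __m128i abs_epi32_portable(__m128i x)
{
#ifdef __SSSE3__
    return _mm_abs_epi32(x);
#else
    __m128i sign = _mm_srai_epi32(x, 31);               /* all ones if negative */
    return _mm_sub_epi32(_mm_xor_si128(x, sign), sign);
#endif
}

int main(void)
{
    __m128i v = _mm_set_epi32(-4, 3, -2, 1);
    int out[4];
    _mm_storeu_si128((__m128i *)out, abs_epi32_portable(v));
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```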
redpriest_ - Tuesday, December 8, 2009 - link
Johan, if you support both vendors in separate libraries, the one that isn't being used won't get loaded; it's not like you'll clutter up Icache space, since it's incompatible code and will never get executed anyway. The loss is more of a code management headache and a marginal amount of extra disk space.

redpriest_ - Tuesday, December 8, 2009 - link
Also, let me add that frequently unused instructions get implemented in microcode in successive revisions. While there is a nominal silicon penalty, it is very small. This is more of a verification nightmare for the companies involved in implementing them than anything else.

JohanAnandtech - Tuesday, December 8, 2009 - link
Are you sure that in the case of the hypervisor the extra code will not be loaded in RAM anyway? (It depends on how it is implemented, of course.) The fact remains that it is extra code that must be checked, extra lines that can contain bugs. There is really no reason why AMD's AMD-V and Intel's VT-x ISA extensions are different.
And again, let us not forget the whole vMotion/live migration mess. It is not normal that I cannot move VMs from an AMD to an Intel server. It is a new form of vendor lock-in. Who is responsible for this is another matter... but a decent agreement on a standardized procedure would do wonders, I think. And it would pave the way for fair competition.
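To illustrate just how gratuitously different the two extensions are: even detecting hardware virtualization support takes vendor-specific CPUID leaves and bits, before you ever get to the incompatible control structures (VMCS vs. VMCB) and entry/exit instructions. A small sketch using the cpuid.h helper shipped with GCC/Clang; the bit positions are the ones documented in the Intel and AMD manuals:

```c
#include <stdio.h>
#include <cpuid.h>   /* __get_cpuid(), wraps the CPUID instruction */

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Intel VT-x: CPUID leaf 1, ECX bit 5 ("VMX") */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5)))
        printf("Intel VT-x (VMX) reported\n");

    /* AMD-V: CPUID leaf 0x80000001, ECX bit 2 ("SVM") */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 2)))
        printf("AMD-V (SVM) reported\n");

    return 0;
}
```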
azmodean - Monday, December 7, 2009 - link
Some food for thought: if all your "killer apps" weren't closed source, you COULD migrate to a new, more efficient processor architecture. All the apps I use run on x86, Alpha, ARM, PowerPC, and anything else you care to throw them at; how about yours?
rs1 - Wednesday, December 9, 2009 - link
I doubt it's that simple. The fact that things like Windows and Office and other "killer apps" are closed-source does nothing to stop their owners from compiling the sources for a different CPU architecture, and I'm sure they would do so if it were feasible and if there were a reasonable market for that sort of thing. The problem (if we ignore the "reasonable market" requirement for now) is that it is probably not currently feasible.

Let's start with Windows, since that is a basic prerequisite for most "killer apps" to function without significant modifications to their code. The Windows source code almost certainly contains some sections that are written not in C, and not in C++, but in x86 assembly language. Simply cross-compiling those sections for a different target architecture is unlikely to work, and even if it does work, it is even more unlikely to give correct results. All such sections would need to be rewritten in assembly language specific to the desired target architecture (or replaced with an equivalent implementation in a higher-level language, if possible). That kind of work falls well outside the abilities of the average computer user. It probably falls outside the abilities of most programmers as well, considering the complexities of an operating system and the strictness of the requirements (all ported assembly code needs to remain fully compatible with the rest of the Windows codebase).
And without a full and correct port of Windows, it doesn't matter how many other popular Windows apps are made open-source, as there'd be no platform that could run them without substantial modifications to their source code. Granted, there are likely a handful of programmers skilled enough to port the Windows codebase to a different architecture, and they might even be willing to do so if given the chance. But I'd wager that their efforts would take well over a year to complete, and would likely be bug-ridden and not fully correct regardless. In any case, in no way would the average computer user be empowered to migrate to a different CPU architecture simply by having the code for everything be open-source. There are simply more barriers involved than just having access to the source code.
Zan Lynx - Monday, December 14, 2009 - link
Windows has already been ported to other CPU types. Microsoft has an Itanium version of Windows and of course amd64. Since they have ported to Itanium, I don't think it'll be too difficult for them to port to anything else.
Scali - Tuesday, December 8, 2009 - link
Well, just being open source isn't enough. The code also needs to be written in a portable fashion. A lot of open source software would initially only compile under 32-bit Linux on x86 processors. Porting to 64-bit required a lot of fixing of pointer-related operations... and porting to big-endian architectures often isn't as simple as just recompiling the source either.
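A classic instance of the endianness trap mentioned above (a sketch, not taken from any particular project): code that reinterprets a byte buffer as a multi-byte integer compiles fine everywhere but silently returns different values on little-endian x86 and on a big-endian machine, so "just recompile" is not enough.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    /* A 32-bit length field as it arrives from a file or the network,
       stored big-endian: the value is 256. */
    const uint8_t buf[4] = { 0x00, 0x00, 0x01, 0x00 };

    /* Non-portable: reinterprets the bytes in host byte order.
       Yields 256 on a big-endian host, 65536 on little-endian x86. */
    uint32_t naive;
    memcpy(&naive, buf, sizeof naive);

    /* Portable: assemble the value explicitly, independent of host order. */
    uint32_t portable = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16)
                      | ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];

    printf("naive = %u, portable = %u\n", naive, portable);
    return 0;
}
```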
Another solution to the problem is languages based on virtual machines, such as Java, .NET, or the various scripting languages. They only need a VM for the new architecture, and then your software will work.
I think this is a better solution than open source in most cases, since the source code doesn't have to deal with architecture-specific issues at all, and you won't have endianness or pointer-size related issues, among other things.
Cerb - Tuesday, December 8, 2009 - link
Personally, I like this method. Now that RAM is getting sufficiently cheap for even mobile devices to have 64MB or more, the overhead of such implementations is less of a concern (once 256MB becomes common, it will all but evaporate), and ARM is certainly ready for it (ThumbEE). Actual performance tends to be fast enough not to worry about for most applications (even considering a battery, which will only become a more common issue over time), and with performance statistics gathering and profile-guided optimization added in (as in, running on your programs based on how you are regularly using them), you could beat static programs in oft-used branchy code, and reduce various odd performance bottlenecks that static compilers either can't account for or are not allowed to do anything about.
...we just need an entire user space software system that does not allow pointers, yet can use languages that are not soul-sucking enterprise-friendly ones, and does not rely on platforms for such languages (such as Java). Far easier said than done.