99 Comments

  • tuxRoller - Tuesday, March 12, 2013 - link

    Why WOULD you expect DVFS to boost performance?
    You seem to find it something of a revelation that the scores are slightly lower (though the difference is perhaps statistically meaningless).
    Reply
  • dig23 - Tuesday, March 12, 2013 - link

    On-demand seems a fair choice to me; it's the best you can do on these OSes. But I will be very interested to see energy efficiency numbers when DVFS is working on a swarm of ARM nodes... :) Reply
  • tuxRoller - Tuesday, March 12, 2013 - link

    It's not the CPU governor I'm talking about but DVFS in particular.
    There's bound to be some small amount of latency involved with the process.
    Its point isn't best performance but energy efficiency, which is why I made the comment in the first place.
    Reply
  • JarredWalton - Tuesday, March 12, 2013 - link

    There's the potential for DVFS to optimize for better performance on a few cores while putting some of the other cores into a lower P-state, but I think that would be more for stuff like Turbo Boost/Turbo Core. It's also possible Johan is referring to the potential for the optimizations to simply improve performance in general. Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    Can you tell me where I got you confused? Because I write "This allowed us to make use of Dynamic Voltage and Frequency Scaling (DVFS, P-states) using the CPUfreq tool. First let's see if all these power saving tweaks have reduced the total throughput."

    So it should have been clear that we are looking for a better performance/watt ratio. The interesting thing to note is that ARM benefits from P-states, and that Intel's excellent implementation of C-states makes P-states almost useless.
    Reply
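For readers who want to check which governor and clock a Linux machine is actually running with, here is a minimal Python sketch. It assumes the standard cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/`), which only exists when a cpufreq driver is loaded. The function name `read_cpufreq` is made up for illustration and is not part of the CPUfreq tool mentioned in the article.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location. It is a parameter so the function
# can be pointed at another directory (for example in a test).
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def read_cpufreq(root=CPUFREQ_ROOT):
    """Return {cpu_name: {"governor": str, "cur_khz": int}} for every CPU
    that exposes a cpufreq directory. Returns {} if none are found."""
    result = {}
    for cpu_dir in sorted(Path(root).glob("cpu[0-9]*")):
        gov_file = cpu_dir / "cpufreq" / "scaling_governor"
        cur_file = cpu_dir / "cpufreq" / "scaling_cur_freq"
        if not (gov_file.is_file() and cur_file.is_file()):
            continue  # no cpufreq driver for this CPU
        result[cpu_dir.name] = {
            "governor": gov_file.read_text().strip(),
            "cur_khz": int(cur_file.read_text().strip()),
        }
    return result
```

Calling `read_cpufreq()` on a machine set to the on-demand governor should report `ondemand` for each CPU, with `cur_khz` moving between P-states as the load changes.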
  • Twonky - Wednesday, March 13, 2013 - link

    For information: about a year ago, the following post on the LinkedIn ARM Based Group gave a link to an M.Sc. thesis publishing figures on the performance/watt ratio for Cortex-A8 and Cortex-A9 based boards:
    www.linkedin.com/groups/Single-CortexA8-CortexA9-in-comparison-85447.S.84348310
    Reply
  • AncientWisdom - Tuesday, March 12, 2013 - link

    Very interesting read, thanks! Reply
  • staiaoman - Tuesday, March 12, 2013 - link

    Damn, Johan. As always, an incredible writeup. Interesting thought experiment to figure that an upper bound on damage to INTC server share might be found by simply looking at how much of the market is running applications like your web server here (where single-threaded performance isn't as important).

    Intel powering phones and ARM chips in servers...the end is nigh.
    Reply
  • JohanAnandtech - Thursday, March 14, 2013 - link

    Thanks Staiaoman :-). I'll leave the thought experiment to you :-) Reply
  • Gigaplex - Tuesday, March 12, 2013 - link

    I wouldn't call that a spectacular performance per watt ratio. It's a bit faster than the Xeon under a cherry picked benchmark (much slower under others), and is only marginally lower power. Best case it's an 80% improvement over Sandy Bridge with regards to performance per watt, and Atom wasn't represented. Considering all the hype, I was expecting something a little more... exciting. Ignoring Ivy Bridge improvements, Haswell isn't far off. Reply
  • spronkey - Tuesday, March 12, 2013 - link

    Yeah... I agree. It also only seems to really come into its own in high concurrency. The Xeons idle quite similarly in terms of power - what happens if you compare it to more Xeon cores? It seems like on a per core basis, Intel still has the advantage on both fronts? Reply
  • spronkey - Tuesday, March 12, 2013 - link

    I would also point out that the A15 has already been compared against Sandy and Ivy cores and come up short in performance per watt; so I'm very interested to see what the next step for these ARM node servers is. Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    I warned against the hype in the first sentences. :-) ARM CPUs are still rather weak and not a good match for most applications. However, the fact that we could actually find a case where they do a lot better than the current Xeon systems was surprising to me. Reply
  • wsw1982 - Wednesday, April 03, 2013 - link

    No, it should not surprise anyone, considering how picky the use case is. I mean, I do think you can find a use case where the ARM11 outperforms the Xeon. E.g. serving 1 web request per hour :) Reply
  • LogOver - Tuesday, March 12, 2013 - link

    24 servers ran inside 24 VMs on the Xeon server, while for the ARM server you used the 24 physical server nodes... Hmm... That does not seem like an apples-to-apples comparison to me. Why not compare, for example, 16 physical nodes on both the Xeon and ARM servers? Reply
  • haplo602 - Wednesday, March 13, 2013 - link

    And how do you slice the Xeon server into 16 physical nodes ? It does not support any kind of HW partitioning that I am aware of. On the other hand the Calxeda machine is a cluster by design. If you try 16 Xeon nodes you'll go through the roof with power. Reply
  • Colin1497 - Wednesday, March 13, 2013 - link

    I think the question is this:

    Was 24 VMs optimal for the Xeon? Since we're virtualizing the Xeon, why 24? Just because you had 24 ARM nodes? Would the Xeon have done better with 4 VMs? Or 16? Or 1000? 24 seems arbitrary.
    Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    We tested with 16 as I briefly mentioned in the conclusion. The 2650L did 170 responses/s per VM, or about 40% better. Total throughput = 2.7k/s, while with 24, 2.9k/s. The flexibility that the Xeon has to reduce the number of VMs if higher per-VM throughput is necessary is definitely an advantage, but the performance numbers are not that different with different VM configs. Reply
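The figures in the reply above can be cross-checked with a few lines of Python. This is only a sketch using the numbers as quoted (170 responses/s per VM with 16 VMs, 2.9k/s in total with 24 VMs); the variable names are illustrative and rounding in the original figures is ignored.

```python
# Figures quoted in the comment above.
VMS_SMALL, VMS_LARGE = 16, 24
per_vm_16 = 170      # responses/s per VM with 16 VMs
total_24 = 2900      # responses/s in total with 24 VMs ("2.9k/s")

total_16 = per_vm_16 * VMS_SMALL          # total with 16 VMs
per_vm_24 = total_24 / VMS_LARGE          # per-VM rate with 24 VMs
per_vm_gain = per_vm_16 / per_vm_24 - 1   # how much faster each of the 16 VMs is
total_gain = total_24 / total_16 - 1      # how much more 24 VMs deliver in total

print(f"16 VMs: {total_16} responses/s total, {per_vm_16} per VM")
print(f"24 VMs: {total_24} responses/s total, {per_vm_24:.0f} per VM")
print(f"per-VM gain with 16 VMs: {per_vm_gain:.0%}")
print(f"total gain with 24 VMs: {total_gain:.0%}")
```

The 16-VM total comes out at roughly 2.7k/s and the per-VM advantage at about 40%, matching the reply, while the 24-VM configuration delivers only a few percent more in total.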
  • Kurge - Wednesday, March 13, 2013 - link

    How about with 0 VM's? Just run it on the metal. Reply
  • Kurge - Wednesday, March 13, 2013 - link

    Yeah, should have had two teams - each with goal to optimize on each platform. The Xeon team would not (lol) load up 24 VM's to serve the same web app. It's silly. Go bare metal in that use case.

    There will be different needs for different cases. The "lets load up a bunch of VMs" is useful to cloud providers and in other cases, but not for "I want to feed this app to as many users as possible".
    Reply
  • dig23 - Tuesday, March 12, 2013 - link

    Interesting article and a great first effort, but it felt a bit outdated on both the Atom and the ARM front. I am not blaming you, just saying. Reply
  • JarredWalton - Tuesday, March 12, 2013 - link

    Outdated in what sense? No one else has really made a serious attempt to review the Calxeda stuff, and while there are better Atom options out there, as Johan notes we were unable to get any in-house in time for testing. Or do you mean Calxeda's use of Cortex-A9 is outdated? If so, that's more of a case of laying the groundwork I think. Assuming they have their A15 option be backwards compatible with the current system (e.g. just get a new set of cards with the updated SoCs), that would be very cool. Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    I can only agree with Jarred. There are no A15 server chips AFAIK, and unless I have missed a launch, I think the Atom N2800 is not outdated at all (Dec 2011). Reply
  • aryonoco - Wednesday, March 13, 2013 - link

    This was a fabulous and most informative write up. You answered so many of my questions with this article. Excellent job covering an area that no one else is, and also kudos for running such great benchmarks.

    This really is tech journalism at its best. Thank you Johan, and thank you Anand for employing such high-quality writers.

    We all know how memory constrained the ARM A9 is. Even something like Krait would solve a lot of A9's traditional weak areas. And yet, it looks like the Calxeda makes sense in enough niches to sustain their R&D and development efforts. Low-to-medium traffic web hosting, media streaming and storage. Each one of those areas is a sizeable market and the Calxeda solution offers enough to be seriously considered in these markets.

    And when one thinks about how many years of x86 optimisation has gone into the toolchain in things like the gcc, one realises the potential that lies ahead for ARM in this market. ARM's future roadmap is well known, next is Cortex A15 and then Cortex A57. Meanwhile there will be more software optimisation, and the management/deployment side will also improve. With all these in mind, I think it's more than conceivable that ARM will grab up to 20% marketshare in the server market by 2015.
    Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    Thanks! Good summary... and indeed 20% marketshare is not impossible. The real question is whether Intel will give the Atom its long overdue architecture update, or whether Haswell will put some pressure from above. Exciting times. Reply
  • beginner99 - Wednesday, March 13, 2013 - link

    Isn't it much easier to administer 24 virtual servers than 24 physical ones (cost of personnel)? When all servers have the same workload it looks good for ARM, but the virtualized Intel environment easily wins if some servers get a lot more requests than others, meaning too much for one ARM SoC to handle. The tested scenario is basically the best one could ever hope for the ARM server and pretty unrealistic (same load for all servers). That's fine, but then also post worst-case scenarios... The Intel server is a lot more flexible. Reply
  • hardwaremister - Wednesday, March 13, 2013 - link

    I completely agree with the other readers that this writing is just absolutely superb. Fantastic novel job Johan.
    However, I also agree with the above commenter: a big part of the point of virtualizing a "fat" core system is to be able to properly utilize the resources of the machine across VMs. By equally loading "tiny tiles", the obvious advantage of the inherent load balancing of a virtualized infrastructure completely disappears.
    Under the current "fat" VM infrastructure you can accommodate individual VMs with heterogeneous loading levels, with extra provisioning in the resource pool.
    That is simply not the case for these tests, which pit an army of individual machines against many VMs virtualized under a few "fat" CPUs.
    I don't mean to be overcritical, but this is a proper apples vs oranges comparison.
    Reply
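The load-balancing argument in the two comments above can be illustrated with a toy model. All numbers below (per-node capacity, per-site load) are hypothetical and chosen only to show the effect. The model gives the pooled server exactly the same total capacity as the 24 small nodes and ignores hypervisor overhead and scheduling details.

```python
def served_partitioned(loads, node_capacity):
    """Requests/s served when each site is pinned to its own small node."""
    return sum(min(load, node_capacity) for load in loads)


def served_pooled(loads, total_capacity):
    """Requests/s served when all sites share one pool of capacity."""
    return min(sum(loads), total_capacity)


NODES = 24
NODE_CAPACITY = 100                      # hypothetical requests/s per node
POOL_CAPACITY = NODES * NODE_CAPACITY    # same total capacity, but shared

even = [90] * NODES                        # every site equally busy
skewed = [300] * 2 + [60] * (NODES - 2)    # two hot sites, the rest quiet

for name, loads in (("even", even), ("skewed", skewed)):
    part = served_partitioned(loads, NODE_CAPACITY)
    pool = served_pooled(loads, POOL_CAPACITY)
    print(f"{name}: partitioned={part} pooled={pool} offered={sum(loads)}")
```

With an even load both setups serve everything that is offered. With the skewed load the two hot sites saturate their nodes while the other nodes sit partly idle, so the partitioned setup drops requests that the shared pool could still absorb.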
  • bobbozzo - Wednesday, March 13, 2013 - link

    A lot of shared hosting ISP's use lightweight virtualization with Linux or BSD "Containers". I would like to see you re-benchmark with those on both servers instead of using VMs.
    You should see higher performance vs full virtualization. I'm not sure how it would affect the ARM performance, but it shouldn't hurt much, and there is more potential for better load sharing if some sites are busier than others.
    Reply
  • Jambe - Wednesday, March 13, 2013 - link

    Surprising, indeed! Thoroughgoing as usual, and excellently written. Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    Thanks! Reply
  • SunLord - Wednesday, March 13, 2013 - link

    Hmm if these didn't cost $20,000 they would make a nice front end for larger websites and forums using less rack space and power. What setup using these would you use for anandtech? Would you guys keep the intel DB server? Reply
  • Gunbuster - Wednesday, March 13, 2013 - link

    I just got a Dell R720xd decked out with 384GB and 4.3TB of storage for a hair over that price. Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    Intel Xeons are still by far a better choice for relational databases that are very hard to split up (sharding is only a last resort) Reply
  • zachj - Wednesday, March 13, 2013 - link

    I'm not sure I agree with the absolutism that seems implicit in your comment that Xeons are better for relational databases...I think there are cases where that won't be true.

    Database scale-out doesn't always require sharding...using any of a number of different off-the-shelf capabilities built right into most SQL engines, you can create multiple active replicas of your database. This is generally better-suited to workloads that aren't write-intensive, but both clustering and replication allow for writes. While this may seem like a quick-and-dirty solution that is architecturally "less good" than sharding, hardware is a lot cheaper than paying people to design a sharding solution and the dollars very often drive the conversation. As long as the database size isn't terribly large this can be a very cost-effective way to scale out a database.

    I would wager that the Anandtech website database (not the forum database) would probably be well-suited to this type of scale-out. You do waste some money on redundant storage but you more than make up for that cost by not having to pay a development team to implement sharding. If the comments section of the Anandtech website gets stored in the same underlying database, the size constraints and the write activity may appear to be incompatible with this approach, but I would in fact argue that comments don't require relational capabilities of SQL and would be more rightly stored as blobs in Hadoop or Azure Storage Tables. Then the Anandtech database is strictly articles and is both much more compact and almost entirely read-only (except for a few new articles per day).
    Reply
  • rwei - Friday, March 15, 2013 - link

    To the best of my understanding, replication does well for scaling reads but doesn't do much for writes. I'd still imagine that this would work decently well with AnandTech, where I can't see the volume of writes being that large relative to the volume of reads. Reply
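The point made in the two comments above, that replication scales reads but not writes, can be sketched with a toy in-memory model. This is not how any particular SQL engine implements replication; the class `ReplicatedStore` and its counters are invented for illustration, and the replicas are updated synchronously to keep the example short.

```python
import itertools


class ReplicatedStore:
    """Toy primary/replica key-value store: writes go to the primary and are
    copied to every replica; reads are spread round-robin over the replicas."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))
        self.writes_applied = 0                       # physical writes, all copies
        self.reads_per_replica = [0] * replica_count  # read load per replica

    def write(self, key, value):
        self.primary[key] = value
        self.writes_applied += 1
        for replica in self.replicas:   # every copy must apply every write
            replica[key] = value
            self.writes_applied += 1

    def read(self, key):
        index = next(self._next_replica)
        self.reads_per_replica[index] += 1
        return self.replicas[index].get(key)


store = ReplicatedStore(replica_count=3)
store.write("article:1", "Calxeda review")
for _ in range(6):
    store.read("article:1")

print(store.reads_per_replica)  # read load is divided among the replicas
print(store.writes_applied)     # one logical write is applied on every copy
```

Adding replicas divides the read load, but each write still has to be applied on every copy, which is why this approach suits a mostly read-only article database better than a write-heavy one.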
  • Kurge - Wednesday, March 13, 2013 - link

    They would make a horrible front end for such websites. Just buy a single Xeon server and don't artificially limit it by using 24 VMs. Just run the app straight on the metal and it will perform massively better. Reply
  • Oldboy1948 - Wednesday, March 13, 2013 - link

    Very interesting Johan as your tests often are!
    Interesting that the memory bw is so much lower than anything from Intel. In fact iPhone 5 looks much better...why? Only Intel has about the same results in compress and decompress.
    Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    Where did you see the stream results on the A6? I might have missed it somewhere. The only ones I could find reported only 1 GB/s in Triad. http://www.anandtech.com/show/6298/analyzing-iphon... The Quad ECX-1000 got 1.8 GB/s Reply
  • PCTC2 - Wednesday, March 13, 2013 - link

    Do you know what would be an interesting concept for a future version of these cluster-in-a-box systems? A solution like ScaleMP. ScaleMP is basically a reverse VM. A hypervisor on each server clusters together to run a single OS with an aggregation of all resources (cores, RAM, network, and disk). ScaleMP running on 4x Dual-socket 8-core Xeon systems w/ 32GB RAM results in a usable system with 64-cores and 128GB RAM as if it was running natively on the hardware. This would be an interesting concept to transfer to the ARM space (if a form of hardware virtualization ever is designed). In a box like this, there would be 192 cores and 192GB of RAM available to a single Fedora instance. Cluster 2 of these together and suddenly there's a system with 384 cores and 384GB of RAM in 4U. Just some food for thought. Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    Hmmm ... There is almost no info on how that hypervisor works. It is hard to imagine that kind of system would scale very well. How does it keep the caches coherent? Do you have info on that? Reply
  • timbuktu - Wednesday, March 13, 2013 - link

    I can't speak directly to ScaleMP, but it looks similar to NUMALink.

    http://en.wikipedia.org/wiki/NUMAlink

    Reading through this article about Calxeda (great job, BTW), I couldn't help but think about the old SGI hardware that seemed pretty similar, with MIPS (and later Itanium) processors connected through a switch with NUMAlink. I haven't played with NUMAlink directly in almost a decade, but back then cheaper Altix slabs were ring topology while higher end hardware was switched. In the end though, you could put a bunch of 1U racks together and have a single system image. Like you mentioned though, cache coherency was exceptionally important. Since we have a UV here, I can point you to the documentation for that box.

    http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc...

    Everything old is new again, I suppose. Well, except NUMAlink never went away. =D
    Reply
  • Tunrip - Wednesday, March 13, 2013 - link

    I'd be interested in knowing how the Xeon compared if you did the same test without the virtual machines. Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    The website won't scale to 32 logical cores I am afraid... but we can try to see how far we can get Reply
  • Colin1497 - Wednesday, March 13, 2013 - link

    A better question might be "is 24 VM's a logical number to use?" Would more or fewer VM's work better? The appearance is that you have 24VM's because you have 24 ARM nodes? Reply
  • duploxxx - Wednesday, March 13, 2013 - link

    Very interesting, loved reading it. But although it is early in the ball game, I do think there are other, far better solutions in the pipeline from the big OEMs:

    HP Moonshot
    http://h17007.www1.hp.com/us/en/iss/110111.aspx
    Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    Isn't it remarkable how PR people manage to fill so many pages with "extreme" and "the future" without telling you anything? Frustration became even higher when I clicked the "get the facts" page. That is more like "You are not getting any facts at all". Reply
  • DuckieHo - Wednesday, March 13, 2013 - link

    Since these are set up as webservers, what's the power consumption at say 20-40% load? Usually there is some load instead of completely idle. Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    Good suggestion... you'd like to see a step-by-step power measurement like SPECpower, right? Let me try that. Reply
  • DanNeely - Wednesday, March 13, 2013 - link

    I'd be interested in seeing where, and what happens when you start pushing single chips to and slightly beyond their limits. Calxeda's hardware's proved competitive on a very friendly workload (which I didn't really expect would happen until their A15 product); but in the real world a set of small websites are unlikely to all have equal load levels. Virtual servers on larger CPUs should give more headroom for load spikes; so knowing what the limits on Calxeda's hardware are strikes me as fairly important. Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    Ok, good question. I'll look into it, as I am definitely considering a follow-up Reply
  • skyroski - Wednesday, March 13, 2013 - link

    I make performance oriented web apps for a living and I was looking forward to this performance test very much. However, I was quite disappointed at how you have done the "real world" test.

    If you're serving a single site you would never put a Xeon through the performance penalties of virtualisation, so I deem your real world results flawed/unusable.

    Basically, if I was to consider buying a Calxeda server tomorrow, I want to know if I can serve a site faster/better by using the "cluster in a box" solution which ARM's partners are going for or if a single Xeon server with standardised dedicated hardware will serve me and my businesses better.

    The other thing that I would have also tested is SSL request performance because Intel has AES-NI built in and I believe ARM has something similar? I would say the majority of request today for a serious web app/site will be traffic using the SSL protocol, so that would also be one of those deciding factors I would look at.

    If I was a cloud host provider your comparison may contain some truth, as their business model would presumably be to let each ARM node out as a VPS alternative, but that isn't what you were testing, was it?
    Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    1. The single site: it is not meant to be an environment of one single site. The reason why we use the same site over and over again, is that it makes it easier to interpret the results and more repeatable. Consider a hosting provider who host many similar - but not the same - LAMP sites.
    The repeatable part is the part that most people don't understand very well: we don't just hit the same URL over and over again. We perform real user interactions and randomize them in real-world patterns (like logging in first and then several real actions) and then getting a repeatable benchmark gets very complex.
    2. The SSL comment is definitely good feedback. We are currently writing the connection code for such SSL websites but also need to find one or more good examples. If your site is a good example, maybe we can use yours (even under NDA if necessary)?
    3. Lastly, the virtualization overhead of ESXi 5 is very small.
    Reply
  • Kurge - Wednesday, March 13, 2013 - link

    You know, you can host multiple different LAMP sites on bare metal ;) Reply
  • klmccaughey - Wednesday, March 13, 2013 - link

    It won't be LAMP sites any more though - take a trawl through something like the Linode forums to get an idea of what people are building. You are talking higher concurrency and more likely nginx.

    Someone made a valid comment about database sharding - for web apps this is much more likely as people try to make sure they have failover.

    Whilst I was initially very disappointed, if you imagine the refresh on the ARM cores over the next 2 years (and considering the rate of change due to the phone market) you might actually be looking at a beast of a machine in two or three iterations. Imagine if you could buy these off the shelf for under $10k: That feels to me like mission critical failover systems in a box. I can see this taking off in a couple of years.
    Reply
  • klmccaughey - Wednesday, March 13, 2013 - link

    And kudos for the review - I look forward to the follow-up. This is a space that needs watching! Reply
  • Silma - Thursday, March 14, 2013 - link

    True but do you think Intel will stop product development for the next 3 years? In addition who will have the best fabs then? My guess is Intel. Reply
  • Krysto - Monday, March 18, 2013 - link

    I don't know how fast it actually is, but relative to the ARMv7 architecture, AES should be up to 10x faster on ARMv8. Reply
  • kfreund - Wednesday, March 13, 2013 - link

    Nice job, Johan. Can't wait to see your next one; we will be sure to get you an A15 based system as soon as we get it out! Let the debates begin! Reply
  • kfreund - Wednesday, March 13, 2013 - link

    Regarding Stream performance, this is a known limitation of A9; it just can't handle a lot of concurrent memory requests. A15 will nearly triple the memory bandwidth at same DDR rate. Reply
  • Madpacket - Wednesday, March 13, 2013 - link

    And all of a sudden AMD's acquisition of SeaMicro is starting to make sense. Thanks Johan, great article! Reply
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    I really really hope they downscale the current SeaMicro's soon. Because with a starting price at $139000, they are not catering to the typical SME :-). Reply
  • joshv - Wednesday, March 13, 2013 - link

    It seems this has a very narrow application in VM hosting, but I am not sure it's applicable when you have the choice of just scaling up memory or process usage of the single instance Xeon server. For example, I could load 24 instances of my production middle tier on the ARM server - or I could run one instance on a Xeon server and give it all the memory and make sure it spawns enough threads to keep all the internal cores busy. Perhaps my middle tier software has issues with handling all that RAM, so maybe I run 4 instances of it as a process, not a biggy.

    I am going to bet that the Xeon server will win as it won't have the VM overhead.
    Reply
  • Kurge - Wednesday, March 13, 2013 - link

    I would be interested in a bare metal comparison. Since you're serving up the same app why would you split it between 24 VMs on the Xeon server? It's a bit contrived.

    Just load up Server 2012 and IIS or Linux + Apache straight up on the Xeon and see how it performs.
    Reply
  • MrSpadge - Wednesday, March 13, 2013 - link

    Very interesting!

    I'd prefer a fat machine with virtualized servers to get automatic load balancing, but it's not like one couldn't shuffle tasks around in the ARM farm. And there's room for improvement: be it the next Atom or the memory controller in the current ECX-1000 CPUs. And take a look at how badly they scale from 2 to 4 threads - surely, there's lots of room left!
    Reply
  • rubyl - Wednesday, March 13, 2013 - link

    What is the average CPU utilization for the Viridis nodes and for the Xeon system under the 5 different concurrency loads (for the 24 webserver workload)? Reply
  • gercho - Wednesday, March 13, 2013 - link

    When you said " The next generation ARM servers are already on the way and will probably hit the market in the third quarter of this year. The "Midway" SoC is based on a 28nm (TSMC) Cortex-A15 chip. A 28nm A15 offers 50% higher single-threaded integer performance at slightly higher power levels and can address up to 16GB of RAM." As far as I know the A15 cores have 50% more performance but consume 3X more power, that's not "slightly"......... Reply
  • nofumble62 - Wednesday, March 13, 2013 - link

    50% more performance at 3X more power... reminds me of the NetBurst architecture. Reply
  • thenewguy617 - Wednesday, March 13, 2013 - link

    Can you please point me to the source of your numbers?
    Thanks
    Reply
  • Wilco1 - Thursday, March 14, 2013 - link

    Where on earth do you get that 3x from? So far no 28nm Cortex-A15 chips have been released. The A15 in the Exynos Octo uses about 1.25W per core at 1.8GHz according to Samsung. That's slightly more power than a Calxeda A9 uses per core, but the A15 gives twice the performance per core. Reply
  • tech4real - Thursday, March 14, 2013 - link

    Calxeda quotes 6W for the whole SoC. We don't know how much is used for all the uncore stuff. It's possible the A9 core only burns around 800mW. Still quite a gap to 1.25W. Reply
  • Wilco1 - Thursday, March 14, 2013 - link

    Assuming the 800mW figure is accurate and the uncore power stays the same, then a node would go from 6W to 7.8W - ie. 30% more power for 100% more performance. Or they could voltage scale down to 1.5GHz and get 65% more performance for 5% more power. While a 28nm A15 uses more power in both scenarios, it is also much faster, so perf/Watt is significantly better. Reply
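The node-level arithmetic in the comment above can be written out as a short Python sketch. It uses only figures quoted in this thread: 6W for the whole quad-core SoC, an estimated 800mW per A9 core, 1.25W per A15 core, and the claimed 2x per-core performance. The 800mW and 2x values are the commenters' assumptions, not measurements, and the uncore power is assumed to stay unchanged.

```python
CORES = 4                 # cores per node (quad-core SoC)
NODE_POWER_A9 = 6.0       # W, Calxeda's quoted figure for the whole SoC
CORE_POWER_A9 = 0.8       # W per core, an estimate from the thread
CORE_POWER_A15 = 1.25     # W per core at 1.8GHz, the Samsung figure quoted above
PERF_RATIO = 2.0          # assumed A15/A9 performance per core

uncore = NODE_POWER_A9 - CORES * CORE_POWER_A9      # power not used by the cores
node_power_a15 = uncore + CORES * CORE_POWER_A15    # same uncore, A15 cores
power_ratio = node_power_a15 / NODE_POWER_A9        # node power, A15 vs A9
perf_per_watt_gain = PERF_RATIO / power_ratio - 1   # relative perf/W change

print(f"uncore: {uncore:.1f} W")
print(f"A15 node: {node_power_a15:.1f} W ({power_ratio - 1:.0%} more power)")
print(f"perf/W gain: {perf_per_watt_gain:.0%}")
```

Under these assumptions the node goes from 6W to 7.8W, 30% more power, and performance per watt improves by roughly half. If the uncore power rises with the extra memory bandwidth, the gain shrinks accordingly.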
  • tech4real - Thursday, March 14, 2013 - link

    1. I guess we have to wait to see if it's really 2X perf from A9 to A15 in real tests. I personally wouldn't bet on that just yet.
    2. Most likely the uncore power will increase too. I don't think the larger memory bandwidth will come free.
    Reply
  • Wilco1 - Thursday, March 14, 2013 - link

    1. We already know A15 is 50-60% faster than A9 per clock (and often more, particularly floating point), so that gives ~2x gain from 1.4GHz to 1.8GHz.
    2. The uncore power will be scaling down with process while the higher bandwidth demand from A15 will increase DRAM power. Without detailed figures it's reasonable to assume these balance each other out.
    Reply
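The "~2x" estimate in point 1 above is the product of the per-clock gain and the clock ratio. A minimal sketch follows, assuming performance scales linearly with clock speed (memory-bound workloads generally will not) and using only the figures quoted in the comment. The function name `speedup` is illustrative.

```python
A9_CLOCK_GHZ = 1.4
A15_CLOCK_GHZ = 1.8


def speedup(per_clock_gain, old_clock_ghz, new_clock_ghz):
    """Combined speedup from a per-clock gain and a clock increase,
    assuming performance scales linearly with clock speed."""
    return (1 + per_clock_gain) * (new_clock_ghz / old_clock_ghz)


low = speedup(0.50, A9_CLOCK_GHZ, A15_CLOCK_GHZ)    # 50% faster per clock
high = speedup(0.60, A9_CLOCK_GHZ, A15_CLOCK_GHZ)   # 60% faster per clock

print(f"estimated A15 vs A9 speedup: {low:.2f}x to {high:.2f}x")
```

The range works out to a little under and a little over 2x, which is where the figure in the comment comes from.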
  • tech4real - Thursday, March 14, 2013 - link

    Then let's wait to see Anand benchmark the future A15 system.
    Also, since the real microserver battle is between the future A15 system and the 22nm Atom system, I am eager to see how it plays out.
    Reply
  • Th-z - Wednesday, March 13, 2013 - link

    Very interesting article, thanks! This really piques another curiosity: how do the latest IBM Power based servers fare these days? Reply
  • Flunk - Wednesday, March 13, 2013 - link

    It really doesn't sound like the price/performance is there. Also, lack of Windows support makes it useless for those of us who run ASP.NET websites (like the company I work for).

    It's still nice to see companies trying something different from the standard strategy. Maybe this will be better in a few generations and take the web server market by storm. If we see a Windows Server for ARM, I could see considering it as an option.
    Reply
  • skyroski - Wednesday, March 13, 2013 - link

    I agree your testing suite's method is good and ok, so you were testing in consideration with hosting providers, fair enough.

    However, on the question of whether a standard Xeon or an ARM-based server would be better if you were serving a single site: that is the case that matters to FB/Twitter/Google/Baidu etc., who are, as I have been led to believe by the media this past year, the companies that ARM partners are trying to sell this piece of kit to. This test unfortunately cannot tell us.

    A quick search on Google for the performance impact of VMs yielded a thread in the VMware community forum by a vExpert/Moderator that mentioned an expectation of 90% performance, and frankly, no matter how small you think the performance impact of a VM may be, it is still using up CPU cycles to emulate hardware; that point will remain true no matter how efficient the hypervisor gets.

    Secondly, the overhead of running 24 separate copies of the OS + Apache + DB on a box that would otherwise be running a single copy of the OS + Apache + DB is total overkill (on that topic).

    It would be great if you could also test the Xeon's req/sec when it runs a single instance, so we can see it from a different perspective. As of now, as I said, your test is skewed towards hosting providers who might invest in Calxeda to provide VPS alternatives. But to them (and their client base), the benefit of a VPS is its portability, which 24 physical ARM nodes aren't going to provide, so I don't see them considering it as an alternative solution anyway.
    Reply
  • skyroski - Wednesday, March 13, 2013 - link

    I also want to ask if your Xeon test server's network adapter is capable of and was using Intel VT-c? Reply
  • JohanAnandtech - Thursday, March 14, 2013 - link

    It was using VMDq/Netqueue (via VMXnet) but not SR-IOV/VT-c Reply
  • thenewguy617 - Wednesday, March 13, 2013 - link

    I would like to see the results with the website running on bare metal. I would like to, but I don't believe you when you say the virtualization overhead is minimal.
    Also, did you include the power used by the switch? As we scale the Xeon cluster we will add a lot of cost and power in the network, whereas the Calxeda fabric should scale for free.
    Reply
  • thebeastie - Thursday, March 14, 2013 - link

    I think a lot of you are missing the main point or future potential of this server technology. And that is that Intel likes to make an absolute minimum of $50 per CPU it makes; in server CPUs it's more like $300.

    These ARM CPUs are being sold at around $10 a CPU.
    Sure, Calxeda have gone the hard yards making such a server and want a lot of money for it. BUT once these ARM servers are priced in relative context of their actual CPU costs, it's going to be the biggest bomb dropped on Intel's server profits in history.
    Reply
  • Silma - Thursday, March 14, 2013 - link

    Assuming you are right and ARM is becoming so important that it can't be ignored, what's to prevent Intel from producing and selling ARM itself? In fact, what's to prevent Intel from producing the best ARM SoCs, as it arguably has the best fabs?
    There are rumors that Apple is asking Intel to produce procs for them; this would certainly be very interesting if it proves to be true.
    Reply
  • thebeastie - Friday, March 15, 2013 - link

    The fact is that Intel would practically rather look at other businesses than produce SoCs/CPUs for $10 each. Whether they are x86 or ARM based doesn't matter in the face of such high portability of code. Reply
  • Metaluna - Friday, March 15, 2013 - link

    The problem is that ARM cores are pretty much a commodity, so ARM SoC pricing is inevitably going to end up as a race to the bottom. This could make it difficult for Intel to sustain the kind of margins it needs to keep its superior process R&D efforts going. Or at least, it would need to use its high-margin parts to subsidize R&D for the commodity stuff, which could get tricky given the overall slowing of the market for the higher-end processors. I think this is what's happening with the supposed Apple deal. There have been reports that they have excess capacity at 22nm right now, so it makes sense to use it. And, since Apple only sells its processors as part of its phones and tablets, it doesn't directly compete with x86 on the open market.

    Of course, all the other fabs are operating under the same cost constraints, so there would be an overall slower pace of process improvements (which is happening anyway as we get closer to the absolute limits at <10nm).
    Reply
  • wsw1982 - Wednesday, April 03, 2013 - link

    And so will those companies: they race to the bottom too. What can they do to recoup their R&D, put the server chip into a mobile phone? Reply
  • Krysto - Monday, March 18, 2013 - link

    Yup. This is actually Intel's biggest threat by far. It's not the technical competition (even though Intel's Atom servers don't seem nearly as competitive as these upcoming ARM servers), but the biggest problem by far for them will be that they will have to compete with the dozen or so ARM server companies on price, while having more or less the same performance.

    THAT is what will kill Intel in the long term. Intel is not a company built to last on Atom-like profits (which will get even lower once the ARM servers flood the market). And they can forget about their juicy Core profits in a couple of years.
    Reply
  • wsw1982 - Wednesday, April 03, 2013 - link

    So your argument is that because the ARM solution is more expensive than the Intel solution now, it must be cheaper than the Intel solution in the future? Mobile ARM chips are cheap, but so are Intel's mobile chips. Reply
  • Silma - Thursday, March 14, 2013 - link

    A $1300 difference per server: that's a lot of electricity you have to save to justify the cost, especially as it is better than the Xeon servers in only a few chosen benchmarks.

    I can't see how this is interesting in a production environment.
    It's more for testing / experimenting, I guess.
    Reply
  • Wilco1 - Thursday, March 14, 2013 - link

    The savings are more than just electricity cost, you also save on cooling costs and can pack your server room more densely. If you do a TCO calculation over several years it might well turn out to be cheaper overall.

    This is the first ARM server solution, so it's partly to get the software working and test the market. However I was surprised how competitive it is already, especially when you realize they use a relatively slow 40nm Cortex-A9. The 2nd generation using 28nm A15 will be out in about 6 months, if they manage to double performance per core at similar cost and power then it will look even better.
    Reply
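
    The break-even question raised in the two comments above can be put into numbers with a rough sketch. The $1300 price premium is the figure quoted in the thread; the watts saved, the electricity price and the cooling multiplier are purely hypothetical inputs, not measurements from the article, and the function name is made up for illustration.

```python
def breakeven_years(price_premium_usd, watts_saved, usd_per_kwh, cooling_overhead=1.0):
    """Years of 24/7 operation before power savings repay a price premium.

    cooling_overhead is an assumed PUE-like multiplier: 1.0 counts only the
    server's own power, 2.0 assumes every watt saved also saves a watt of cooling.
    """
    if watts_saved <= 0:
        return float("inf")  # no savings, so the premium is never repaid
    kwh_per_year = watts_saved / 1000 * 24 * 365
    savings_per_year = kwh_per_year * usd_per_kwh * cooling_overhead
    return price_premium_usd / savings_per_year


if __name__ == "__main__":
    premium = 1300.0  # per-server price difference quoted in the thread
    price = 0.10      # hypothetical electricity price in USD per kWh
    for watts in (50, 100, 200):       # hypothetical power savings per server
        for overhead in (1.0, 2.0):    # hypothetical cooling multipliers
            years = breakeven_years(premium, watts, price, overhead)
            print(f"{watts:>3} W saved, overhead {overhead}: {years:.1f} years")
```

    This deliberately ignores rack density, hardware refresh cycles and the performance differences discussed in the article, so it only shows how sensitive the answer is to the assumed inputs.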
  • kfreund - Friday, March 15, 2013 - link

    Keep in mind that this is VERY early in the life cycle, and therefore costs are artificially high due to low volumes. Ramp up the volumes, and the prices will come WAY down. Reply
  • wsw1982 - Wednesday, April 03, 2013 - link

    Yes, IF they have high volume. But even if there is high volume, it's shared between the different ARM suppliers and, needless to say, Atom. How much can there be for one company?

    But the question is where ARM gets the volume. Less performance, comparable power consumption, a worse performance/watt ratio (outside this kind of extremely biased case), less flexibility, less software support (stability), vendor-specific designs (you can build a normal server, but can you build a massively parallel cluster?) and, don't forget, more (much more) expensive. Which company will sacrifice itself to beef up the market volume of ARM servers?
    Reply
  • Sputnik_b - Thursday, March 14, 2013 - link

    Hi Johan,
    Nice job benchmarking and analyzing the results. Our group at EPFL has recently done some work aimed at understanding the demands that scale-out workloads, such as web serving, place on processor architectures. Our findings very much agree with your benchmark conclusions for the Xeon/Calxeda pair. However, a key result of our work was that many-core processors (with dozens of simple cores per chip) are the sweet spot with regard to performance per TCO dollar. I encourage you to take a look at our work -- http://parsa.epfl.ch/~grot/pubs/SOP-TCO_IEEEMicro....
    Please consider benchmarking a Tilera system to round out your evaluation.
    Best regards!
    Reply
  • Sputnik_b - Thursday, March 14, 2013 - link

    Sorry, bad URL in the post above. This should work: http://parsa.epfl.ch/~grot/pubs/SOP-TCO_IEEEMicro.... Reply
  • aryonoco - Friday, March 15, 2013 - link

    LWN.net has a very interesting write-up on a talk given by Facebook's Director of Capacity Engineering & Analysis on the future of ARM servers and how they see ARM servers fit in with their operation. I think it gives valuable insight on this topic.

    http://lwn.net/SubscriberLink/542518/bb5d5d3498359... (free link)
    Reply
  • phoenix_rizzen - Friday, March 15, 2013 - link

    ARM already has hardware virtualisation extensions. Linux-KVM has already been ported over to support it. Reply
  • Andys - Saturday, March 16, 2013 - link

    Great article, finally good to see some realistic benchmarks run on the new ARM platform.

    But I feel that you screwed up in one regard: you should also have tested the top Xeon CPU, the E5-2690.

    As you know from your own previous articles, Intel's top CPUs are also the most power efficient under full load, and the price would still be lower than that of the fully loaded Calxeda box anyway.
    Reply
  • an3000 - Monday, March 25, 2013 - link

    This test uses the wrong software stack. Yes, I am not afraid to say that! Apache will never be used on such ARM servers. They are an exact match for Memcached, Nginx or other set/get-type services, like static data serving. Using Apache or a LAMP stack is far too favorable to the Xeon.
    What I would like to see is a non-virtualized Xeon server with max RAM running 4-8 instances (similar to the core count) of Memcached/Nginx/lighttpd versus a cluster of ARM cores doing the same light task. Measure performance and power usage.
    Reply
  • wsw1982 - Wednesday, April 03, 2013 - link

    My suggestion would be to let them run a hard-disk-to-hard-disk copy and measure the power usage :) Reply
