One of the big selling points of the Xeon Phi is that you can simply run multi-threaded Xeon code on it. If you want to get decent performance out of the Xeon Phi, that code should be compiled with the Intel C or Fortran compiler and linked against the Intel MKL math libraries. In that case, Intel claims many "typical applications" see about 2 to 2.5 times higher performance with the Xeon Phi; a few exceptions get more.

That is an impressive performance boost, but not earth-shattering. These numbers are much more realistic than the typical 100x benchmarks thrown around by the GPU folks, which usually compare a single-threaded, non-SIMD binary running on a CPU to a fully threaded, carefully tuned application running on a GPU.

The question remains in which applications a cheaper quad-CPU solution is more effective. Before the Xeon E5 (Sandy Bridge EP) came out, AMD was quite successful with its less expensive quad-CPU platforms in the HPC world. It will be interesting to compare the performance per dollar and performance per watt of such quad-CPU platforms with a CPU + Phi solution. There are certainly applications where the CPU + Phi wins hands down, but we are willing to bet that there are lots of HPC applications where it is a close call (e.g. highly threaded, but harder to vectorize code).

The point is of course that the time investment to get there is a lot lower than is the case with CUDA on NVIDIA's Tesla K20. We have heard from several companies that debugging CUDA code is still a pretty daunting experience; one good example can be found here. The maturity of the Intel compilers and high performance software is a big plus for the Xeon Phi. The numerous papers and OpenMP-to-CUDA frameworks/translators clearly indicate that porting OpenMP applications to CUDA is not necessarily straightforward. That is in contrast to the Xeon Phi, where existing OpenMP applications run faster with little more than a recompile. OpenMP is simply the ecosystem where the Xeon Phi thrives, and Intel has an excellent track record when it comes to supporting OpenMP in its compilers.
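To make that concrete, here is a minimal sketch of the kind of code in question: an ordinary OpenMP worksharing loop in C. The array names and problem size are our own illustration, not taken from any benchmark mentioned in this article. Built with any OpenMP-capable compiler for the host, the same source can be rebuilt with Intel's toolchain to target the coprocessor, which is exactly the "recompile rather than rewrite" appeal.

```c
#include <stdio.h>
#include <omp.h>

#define N (1 << 20)

/* Illustrative data; in a real HPC code these would come from the application. */
static float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

    /* An ordinary OpenMP worksharing loop: the same source runs on a
       multi-core Xeon or, after a recompile with the Intel toolchain,
       across the Xeon Phi's many cores and hardware threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[123] = %.1f, max threads = %d\n", c[123], omp_get_max_threads());
    return 0;
}
```

A CUDA port of even this trivial loop would require restructuring it into a kernel, managing device memory, and choosing launch parameters, which is where the extra time investment the article refers to comes from.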

The Xeon Phi might also prove to be a bit more flexible and forgiving. At a high level, the Xeon Phi architecture still resembles a general purpose multi-core Xeon: roughly 60 in-order x86 cores with wider SIMD units and a 512KB L2 cache per core, each core running four threads.

GPUs, on the other hand, are built for more "extreme" parallelism: hundreds of stream processors with small shared L1 caches and one relatively small L2 cache.

We'll have to hold final judgement until we get a Xeon Phi equipped system in house, but our first impressions are that the Xeon Phi looks like a more cost effective, potentially easier to use alternative to high-end GPUs for HPC.

46 Comments

  • Kevin G - Saturday, November 17, 2012 - link

    It is dependent upon the bus encoding. PCI-E 1.0/2.0 use an 8b/10b encoding scheme to handle traffic, while PCI-E 3.0 uses 128b/130b encoding. PCI-E 3.0 only increases the clock speed of the bus by 60%, with the rest of the bandwidth increase stemming from the more efficient encoding scheme. Xeon Phi seems to have kept the PCI-E 1.0/2.0 encoding but supports the higher clock rate of PCI-E 3.0. This appears to be nonstandard, but the LGA 2011 Xeons appear to support this for additional bandwidth.

    Any overhead is likely from adding full PCI-E 3.0 support in addition to PCI-E 1.0/2.0.
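    [For readers who want the arithmetic behind that encoding difference spelled out, here is a rough, illustrative per-lane bandwidth calculation; the transfer rates are the standard PCI-E 2.0/3.0 figures, and the code is only a sketch.]

```c
#include <stdio.h>

/* Rough per-lane bandwidth math: raw transfer rate times encoding efficiency. */
int main(void)
{
    /* PCI-E 2.0: 5 GT/s with 8b/10b encoding */
    double gen2 = 5.0e9 * 8.0 / 10.0;     /* usable bits per second per lane */
    /* PCI-E 3.0: 8 GT/s with 128b/130b encoding */
    double gen3 = 8.0e9 * 128.0 / 130.0;

    printf("PCI-E 2.0: %.2f Gb/s (~%.0f MB/s) per lane\n", gen2 / 1e9, gen2 / 8e6);
    printf("PCI-E 3.0: %.2f Gb/s (~%.0f MB/s) per lane\n", gen3 / 1e9, gen3 / 8e6);
    return 0;
}
```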
  • mayankleoboy1 - Thursday, November 15, 2012 - link

    Assuming you can buy a single Xeon Phi card, can it work with desktop motherboards and processors?
    Can it work with AMD processors? Can it work in tandem with NVIDIA and ATI GPUs?
  • Joschka77 - Thursday, November 15, 2012 - link

    I think the answers would be: yes, yes, no! ;-)
  • Jaybus - Thursday, November 15, 2012 - link

    No, it is yes, yes, and yes. The Stampede also uses 128 NVIDIA Tesla K20 GPUs, as stated in the article.
  • Kevin G - Saturday, November 17, 2012 - link

    That's within the cluster, not necessarily in the same host system. I strongly suspect that the visualization nodes featuring K20 GPUs are isolated from the Xeon Phi nodes.
  • maximumGPU - Thursday, November 15, 2012 - link

    Can't deny that OpenMP code that automatically runs faster on the Phi would be a great solution for those looking for a speedup without the cost and time of modifying code for GPUs. There certainly is a market to cater to with these cards.
  • creed3020 - Thursday, November 15, 2012 - link

    Johan,

    The numbers you describe for the configuration of the units don't add up.

    Eight of the compute sleds plus two PSU sleds cannot fit into a 4U unit. Judging by the photo, it appears that in this vertical configuration of nodes it goes something like this: {Compute, Compute, PSU, PSU, Compute, Compute} in 4U. It appears this leads to a total of 5 chassis per cluster within the rack.

    This is then compounded with two distinct clusters per rack, each with their own InfiniBand switch + regular Ethernet switch. This makes for a total of 10 C8000 chassis per rack.

    This makes sense when considering a 48U rack. 22U per cluster x 2 clusters per rack = 44U, with a few spare slots at the top and middle. I could see two at the top and one in the middle.
  • llninja1 - Thursday, November 15, 2012 - link

    I think the author got some Dell facts mixed up. Looking at Dell.com, you can fit 8 compute sleds in, but those compute sleds are half-width and don't contain the necessary double-wide PCIe slots to accommodate a Xeon Phi card. So you are correct: a single 4U unit can only hold 4 compute sleds and two power units, as depicted in the Stampede picture.
  • creed3020 - Friday, November 16, 2012 - link

    Thanks for the explanation. I didn't go over to the Dell site, but that would explain that a slimmer sled is possible if you don't have to stick one of these huge 2-slot Xeon Phi cards in.

    It makes me wonder how big a difference there is in total PFLOPS/rack when configured with the half height sled vs. full height sled with Phi.
  • mfilipow - Thursday, November 15, 2012 - link

    "Eight of those server sleds find a home inside the C8000 4U Chassis, together with two power sleds." did you write - but on the photo I only see 4x computing sleds plus 2x power. Where are the other 4?!
