Intel's Larrabee Architecture Disclosure: A Calculated First Move

Name: Intel's Larrabee Architecture Disclosure: A Calculated First Move
Item: Intel's Larrabee Architecture Disclosure: A Calculated First Move
Author: Anand Lal Shimpi & Derek Wilson

by Anand Lal Shimpi & Derek Wilson on August 4, 2008 12:00 AM EST

Posted in
GPUs

101 Comments | Add A Comment

101 Comments

The Design Experiment: Could Intel Build a GPU?

Larrabee is fundamentally built out of existing Intel x86 core technology, which not only means that the chip design isn't foreign to Intel, but also has serious implications for the future of desktop microprocessors. Larrabee isn't however built on Intel's current bread and butter, the Core architecture, instead Intel turned to a much older architecture as the basis for Larrabee: the original Pentium.

The original Pentium was manufactured on a 0.80µm process, later shrinking to 0.60µm. The question Intel posed was this: could an updated version of the Pentium core, built on a modern day process and equipped with a very wide vector unit, make a solid foundation for a high-end GPU?

To first test the theory Intel took a standard Core 2 Duo, with a 4MB L2 cache at an undisclosed clock speed (somewhere in the 1.8 - 2.9GHz range I'd guess). Then, on the same manufacturing process, roughly the same die area and power consumption, Intel sought to find out how many of these modified Pentium cores it could fit. The number was 10.

So in the space of a dual-core Core 2 Duo, Intel could construct this hypothetical 10-core chip. Let's look at the stats:

	Intel Core 2 Duo	Hypothetical Larrabee
# of CPU Cores	2 out of order	10 in-order
Instructions per Issue	4 per clock	2 per clock
VPU Lanes per Core	4-wide SSE	16-wide
L2 Cache Size	4MB	4MB
Single-Stream Throughput	4 per clock	2 per clock
Vector Throughput	8 per clock	160 per clock

Note that what we're comparing here are operation throughputs, not how fast it can actually execute anything, just how many operations it can retire per clock.

Running a single instruction stream (e.g. single threaded application), the Core 2 can process as many as four operations per clock, since it can issue 4-instructions per clock and it isn't execution unit constrained. The 10-core design however can only issue two instructions per clock and thus the peak execution rate for a single instruction stream is two operations per clock, half the throughput of the Core 2. That's fine however since you'll actually want to be running vector operations on this core and leave your single threaded tasks to your Core 2 CPU anyways, and here's where the proposed architecture spreads its wings.

With two cores, each with their ability to execute 4 concurrent SSE operations per clock, you've got a throughput of 8 ops per clock on Core 2. On the 10-core design? 160 ops per clock, an increase of 20x in roughly the same die area and power budget.

On paper this could actually work. If you had enough of these cores, you could get the vector throughput necessary to actually build a reasonable GPU. Of course there are issues like adapting the x86 instruction set for use in a GPU, getting all of the cores to communicate with one another and actually keeping all of these execution resources busy - but this design experiment showed that it was possible.

Thus Larrabee was born.

Index Not Quite a Pentium, Not Quite an Atom: The Larrabee Core

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

101 Comments

View All Comments

NeonFlak - Thursday, November 15, 2012 - link
"Remember that Larrabee won't ship until sometime in 2009 or 2010, the first chips aren't even back from the fab yet, so not wanting to discuss how many cores Intel will be able to fit on a single Larrabee GPU makes sense. "

Intel's Larrabee Architecture Disclosure: A Calculated First Move

The Design Experiment: Could Intel Build a GPU?

Post Your Comment

101 Comments

View All Comments

NeonFlak - Thursday, November 15, 2012 - link

Log in

Don't have an account? Sign up now