In something of a surprise move, NVIDIA took to the stage today at GTC to announce a new roadmap for their GPU families. With today’s announcement comes news of a significant restructuring of the roadmap that will see GPUs and features moved around, and a new GPU architecture, Pascal, introduced in the middle.

We’ll get to Pascal in a second, but to put it into context let’s first discuss NVIDIA’s restructuring. At GTC 2013 NVIDIA announced their future Volta architecture. Volta, which had no scheduled date at the time, would be the GPU after Maxwell. Volta’s marquee feature would be on-package DRAM, utilizing Through Silicon Vias (TSVs) to die stack memory and place it on the same package as the GPU. Meanwhile in that roadmap NVIDIA also gave Maxwell a date and a marquee feature: 2014, and Unified Virtual Memory.


NVIDIA's Old Volta Roadmap


NVIDIA's New Pascal Roadmap

As of today that roadmap has more or less been thrown out. No products have been removed, but what Maxwell is and what Volta is have changed, as has the pacing. Maxwell for its part has “lost” its unified virtual memory feature. This feature is now slated for the chip after Maxwell, and in the meantime the closest Maxwell will get is the software based unified memory feature being rolled out in CUDA 6. Furthermore NVIDIA has not offered any further details on second generation Maxwell (the higher performing Maxwell chips) and how those might be integrated into professional products.

As far as NVIDIA is concerned, Maxwell’s marquee feature is now DirectX 12 support (though even the extent of this isn’t perfectly clear), and that with the shipment of the GeForce GTX 750 series, Maxwell is now shipping in 2014 as scheduled. We’re still expecting second generation Maxwell products, but at this juncture it does not look like we should be expecting any additional functionality beyond what Big Kepler + 1st Gen Maxwell can achieve.

Meanwhile Volta has been pushed back and stripped of its marquee feature. It’s on-package DRAM has been promoted to the GPU before Volta, and while Volta still exists, publicly it is a blank slate. We do not know anything else about Volta beyond the fact that it will come after the 2016 GPU.

Which brings us to Pascal, the 2016 GPU. Pascal is NVIDIA’s latest GPU architecture and is being introduced in between Maxwell and Volta. In the process it has absorbed old Maxwell’s unified virtual memory support and old Volta’s on-package DRAM, integrating those feature additions into a single new product.

With today’s announcement comes a small degree of additional detail on NVIDIA’s on-package memory plans. The bulk of what we wrote for Volta last year remains true: NVIDIA uses on-package stacked DRAM, allowed by the use of TSVs. What’s new is that NVIDIA has confirmed they will be using JEDEC’s High Bandwidth Memory (HBM) standard, and the test vehicle Pascal card we have seen uses entirely on-package memory, so there isn’t a split memory design. Though we’d also point out that unlike the old Volta announcement, NVIDIA isn’t listing any solid bandwidth goals like the 1TB/sec number we had last time. From what NVIDIA has said, this likely comes down to a cost issue: how much memory bandwidth are customers willing to pay for, given the cutting edge nature of this technology?

Meanwhile NVIDIA hasn’t said anything else directly about the unified memory plans that Pascal has inherited from old Maxwell. However after we get to the final pillar of Pascal, how that will fit in should make more sense.

Coming to the final pillar then, we have a brand new feature being introduced for Pascal: NVLink. NVLink, in a nutshell, is NVIDIA’s effort to supplant PCI-Express with a faster interconnect bus. From the perspective of NVIDIA, who is looking at what it would take to allow compute workloads to better scale across multiple GPUs, the 16GB/sec made available by PCI-Express 3.0 is hardly adequate. Especially when compared to the 250GB/sec+ of memory bandwidth available within a single card. PCIe 4.0 in turn will eventually bring higher bandwidth yet, but this still is not enough. As such NVIDIA is pursuing their own bus to achieve the kind of bandwidth they desire.

The end result is a bus that looks a whole heck of a lot like PCIe, and is even programmed like PCIe, but operates with tighter requirements and a true point-to-point design. NVLink uses differential signaling (like PCIe), with the smallest unit of connectivity being a “block.” A block contains 8 lanes, each rated for 20Gbps, for a combined bandwidth of 20GB/sec. In terms of transfers per second this puts NVLink at roughly 20 gigatransfers/second, as compared to an already staggering 8GT/sec for PCIe 3.0, indicating at just how high a frequency this bus is planned to run at.

Multiple blocks in turn can be teamed together to provide additional bandwidth between two devices, or those blocks can be used to connect to additional devices, with the number of bricks depending on the SKU. The actual bus is purely point-to-point – no root complex has been discussed – so we’d be looking at processors directly wired to each other instead of going through a discrete PCIe switch or the root complex built into a CPU. This makes NVLink very similar to AMD’s Hypertransport, or Intel’s Quick Path Interconnect (QPI). This includes the NUMA aspects of not necessarily having every processor connected to every other processor.

But the rabbit hole goes deeper. To pull off the kind of transfer rates NVIDIA wants to accomplish, the traditional PCI/PCIe style edge connector is no good; if nothing else the lengths that can be supported by such a fast bus are too short. So NVLink will be ditching the slot in favor of what NVIDIA is labeling a mezzanine connector, the type of connector typically used to sandwich multiple PCBs together (think GTX 295). We haven’t seen the connector yet, but it goes without saying that this requires a major change in motherboard designs for the boards that will support NVLink. The upside of this however is that with this change and the use of a true point-to-point bus, what NVIDIA is proposing is for all practical purposes a socketed GPU, just with the memory and power delivery circuitry on the GPU instead of on the motherboard.

NVIDIA’s Pascal test vehicle is one such example of what a card would look like. We cannot see the connector itself, but the basic idea is that it will lay down on a motherboard parallel to the board (instead of perpendicular like PCIe slots), with each Pascal card connected to the board through the NVLink mezzanine connector. Besides reducing trace lengths, this has the added benefit of allowing such GPUs to be cooled with CPU-style cooling methods (we’re talking about servers here, not desktops) in a space efficient manner. How many NVLink mezzanine connectors available would of course depend on how many the motherboard design calls for, which in turn will depend on how much space is available.


Molex's NeoScale: An example of a modern, high bandwidth mezzanine connector

One final benefit NVIDIA is touting is that the new connector and bus will improve both energy efficiency and energy delivery. When it comes to energy efficiency NVIDIA is telling us that per byte, NVLink will be more efficient than PCIe – this being a legitimate concern when scaling up to many GPUs. At the same time the connector will be designed to provide far more than the 75W PCIe is spec’d for today, allowing the GPU to be directly powered via the connector, as opposed to requiring external PCIe power cables that clutter up designs.

With all of that said, while NVIDIA has grand plans for NVLink, it’s also clear that PCIe isn’t going to be completely replaced anytime soon on a large scale. NVIDIA will still support PCIe – in fact the blocks can talk PCIe or NVLink – and even in NVLink setups there are certain command and control communiques that must be sent through PCIe rather than NVLink. In other words, PCIe will still be supported across NVIDIA's product lines, with NVLink existing as a high performance alternative for the appropriate product lines. The best case scenario for NVLink right now is that it takes hold in servers, while workstations and consumers would continue to use PCIe as they do today.

Meanwhile, though NVLink won’t even be shipping until Pascal in 2016, NVIDIA already has some future plans in store for the technology. Along with a GPU-to-GPU link, NVIDIA’s plans include a more ambitious CPU-to-GPU link, in large part to achieve the same data transfer and synchronization goals as with inter-GPU communication. As part of the OpenPOWER consortium, NVLink is being made available to POWER CPU designs, though no specific CPU has been announced. Meanwhile the door is also left open for NVIDIA to build an ARM CPU implementing NVLink (Denver perhaps?) but again, no such product is being announced today. If it did come to fruition though, then it would be similar in concept to AMD’s abandoned “Torrenza” plans to utilize HyperTransport to connect CPUs with other processors (e.g. GPUs).

Finally, NVIDIA has already worked out some feature goals for what they want to do with NVLink 2.0, which would come on the GPU after Pascal (which by NV’s other statements should be Volta). NVLink 2.0 would introduce cache coherency to the interface and processors on it, which would allow for further performance improvements and the ability to more readily execute programs in a heterogeneous manner, as cache coherency is a precursor to tightly shared memory.

Wrapping things up, with an attached date for Pascal and numerous features now billed for that product, NVIDIA looks to have to set the wheels in motion for developing the GPU they’d like to have in 2016. The roadmap alteration we’ve seen today is unexpected to say the least, but Pascal is on much more solid footing than old Volta was in 2013. In the meantime we’re still waiting to see what Maxwell will bring NVIDIA’s professional products, and it looks like we’ll be waiting a bit longer to get the answer to that question.

POST A COMMENT

68 Comments

View All Comments

  • Guspaz - Wednesday, March 26, 2014 - link

    This will probably never make it into consumer parts. Consumer use isn't currently maxing out PCIe 3.0, and PCIe 4.0 will have been available for some time by the time Pascal ships. As long as the current trends continue, PCIe will continue to supply more bandwidth than consumers need indefinitely.

    I'm not saying NVLink is a bad thing, just that people should understand that it's strictly for enterprise use, since it won't make sense for consumers any time soon.
    Reply
  • sascha - Thursday, March 27, 2014 - link

    Once Intel and AMD have heavier GPUs on their CPU die NVIDIA has no chance to survive in the consumer business even with PCIe 4.0. At some point this needs to be in desktops and notebooks otherwise they will be history. Reply
  • BMNify - Thursday, March 27, 2014 - link

    your assuming OC that the new Open PPC initiative and ARM NOC (Network On Chip) IP don't migrate to the desktop and laptop , it could be fun seeing ppc with updated altivec SIMD laptops again that aren't apple branded. Reply
  • BMNify - Thursday, March 27, 2014 - link

    "Consumer use isn't currently maxing out PCIe 3.0" dont keep repeating the MEME , its not true, try copying back your gfx data to the CPU for more processing (such as UHD editing etc) and see its data throughput drop through the floor.

    A feasability study purportedly showed that 16 GT/s can be achieved ... You'd be looking at about 31.5GB/s per direction for a PCIe 4.0 x16 slot.

    the pcisig and its generic interface relations Ethernet sig etc are no longer fit for purpose when all the low power ARM vendors are able to use up to 2Tb/s (256MegaBytes/s) per NOC for minimal cost today...
    Reply
  • BMNify - Thursday, March 27, 2014 - link

    oops that sound be 2Tb/s= 256GigaBytes/s Reply
  • jamescox - Sunday, March 30, 2014 - link

    This is more needed as a link between GPUs. Dual gpu cards at the moment are using completely separate memory which will mostly duplicate the contents. With a super high-speed interconnect, they could make the memory bus narrower, and allow access to memory attached to other GPUS via the high speed interconnect. This is exactly how AMD Opteron processors work; they use HyperTransport connections to share memory attached to all other processors. Intel essentially copied this with QPI later. GPUs need a significantly higher speed interconnect to share memory. For the new Titan Z card, they could have just used something like 4 GB on a 256-bit bus on each card and allowed both GPUs to access the memory via a high-speed link. For games, the 12 GB on this card is a waste since it will really just look like 6 GB since it it independent; there is not a fast enough link to allow the memory to be shared, so each GPU will need its own copy.

    This becomes even more evident when you have the memory stacked on the same chip. The whole graphics card shrinks to a single chip + power delivery system. The only off chip interconnect will be the high speed interface. The memory will be stacked with on-die interconnect. This will allow a large number of pins to be devoted to the communications links. AMD could implement something similar using next generation HyperTransport. It is currently at 25.6 GB/s (unidirectional) at 32-bit. Just 4 such links would allow over 100 GB/s read. For Intel and AMD, PCI-express will become somewhat irrelevant. Both companies will be pursuing similar technology, except, they will be integrating the CPU and the GPU on a single chip. The PCI-express link will only be needed for communication with the IO subsystem. Nvidia could integrate an ARM cpu, but they don't have a competitive AMD64 cpu, that I know of.

    I have been wondering for a long time when we would get a new system form factor. The current set-up doesn't make much sense once the cpu and gpu merge. System memory is not fast enough to support a gpu, so really, the cpu needs to move on-die with gpu. APU pci-express cards would not work since pci-express does not have sufficient bandwidth to allow gpus to share memory; graphics memory is in the 100s of GB/s rather than 10s of GB/s for current system memory. APU cards/modules will need much faster interconnect, and NVidia seems to want to invent their own rather than use some industry standard.
    Reply
  • TeXWiller - Wednesday, March 26, 2014 - link

    The Pascal prototype is a test vehicle, not a prototype of an actual implementation. The size of the card do remind of those ARM blade server nodes on a backend, however. Reply
  • haukionkannel - Wednesday, March 26, 2014 - link

    It is just logical. The NVLink is for servers, so it just make sense to release these at the size of blade server nodes! Reply
  • BMNify - Thursday, March 27, 2014 - link

    actually it makes more sense to make the modules as self contained and small as you can , so many System On SODIMM and System On DIMM to then fit a blade are more viable longer term , today theres far to much metal and not enough processor in these antiquated server farms Reply
  • iceman-sven - Wednesday, March 26, 2014 - link

    It seems Volta was not ready for 2016 and Nvidia decided to make another Fermi family GPU: Maxwell v2 aka Pascal. Maybe Pascal will see the light a little sooner. Begin/mid 2016 and Volta will fellow mid/late 2017.
    Dam, as dual Titan user I wanted to skip Maxwell and go straight to Volta. So last Fermi or waiting a little longer for Volta?
    Reply

Log in

Don't have an account? Sign up now