26 Comments

  • Terry Suave - Tuesday, July 7, 2015 - link

    "Though the company only has a single product so far that has a higher performance FP16 mode – Tegra X1 – it has more than subtly been hinted at that the next-generation Pascal GPUs will incorporate similar functionality"

    Now, I am no expert on compute (I barely know the difference between precision levels), but if Nvidia is concentrating on half-precision workload performance, does that mean 32 and 64 bit performance will be worse? I mean, Maxwell is already worse by some margin than Kepler was at double precision, so is it just something that isn't used very often? Or is this not the case at all and I'm drawing all the wrong conclusions?
  • MrSpadge - Tuesday, July 7, 2015 - link

    FP64 is being reduced for these reasons:
    - it's not needed for almost any client workloads
    - it's badly needed in many professional workloads

    It's not too difficult to build 2 FP32 units which can also work together as 1 FP64 unit. You need more transistors than for 2 regular FP32 units, though, which hurts consumer GPUs. For FP16 this tradeoff may well be worth it, because the regular shaders are already "big", i.e. support FP32, so they won't become much larger by allowing them to work as 2 FP16 units. If it's used, this ability saves bandwidth, memory space, and power (which can be used to reach higher clocks), and allows significantly higher performance with the new hardware. For some client workloads FP16 will be enough, so this sounds pretty good.
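    The bandwidth and memory-space savings follow directly from the storage formats. A quick Python sketch (my own illustration, using the standard `struct` module, not anything from the article) shows the same number of values taking half the bytes in FP16:

```python
import struct

# The same ten values stored as FP32 vs FP16:
# half the bytes, hence half the memory footprint and bandwidth.
fp32_bytes = struct.calcsize('<10f')  # ten single-precision floats
fp16_bytes = struct.calcsize('<10e')  # ten half-precision floats

print(fp32_bytes, fp16_bytes)  # 40 20
```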
  • p1esk - Tuesday, July 7, 2015 - link

    "It's not too difficult to build 2 FP32 units which can also work together as 1 FP64 unit"

    It's not clear to me how to combine 2 32 bit FP units into 1 64 bit unit. Care to explain? Or provide some reference?
  • MrSpadge - Wednesday, July 8, 2015 - link

    In CPUs this has been done for ages. They can't utilize many ALUs / FP units, but must be very flexible. Hence the tradeoff "some more transistors per unit" is well worth it for them. As far as I know AMD is also doing it on their GPUs with considerable FP64 capability.
  • p1esk - Wednesday, July 8, 2015 - link

    No, Intel CPUs for example, as far as I understood, did FP with 80-bit precision FP units, then rounded the results to produce either 32 or 64 bit values.
    Again, combining two FP32 units into one FP64 unit seems impossible to me. Just look at the number formats - FP64 has more than twice as many mantissa bits as FP32. It might be possible to perform an FP64 operation using a single FP32 unit, by doing multiple passes, but this too is not clear to me.
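    The mantissa-width point is easy to check against the IEEE 754 layouts (my own sketch, decoding the fields with Python's `struct` module): FP32 has a 23-bit mantissa and FP64 a 52-bit one, and 52 > 2 x 23.

```python
import struct

def fp32_fields(x):
    # Decode sign / biased exponent / mantissa from the binary32 pattern:
    # 1 sign bit, 8 exponent bits, 23 mantissa bits.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fp64_fields(x):
    # binary64: 1 sign bit, 11 exponent bits, 52 mantissa bits.
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & 0xFFFFFFFFFFFFF

# 1.5 = 1.1 in binary: mantissa has only its top bit set in both formats.
print(fp32_fields(1.5))  # (0, 127, 4194304)
```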
  • MrSpadge - Wednesday, July 8, 2015 - link

    No, 80 bit is the internal precision for FP64. For FP32 this would be total overkill. It's very normal for Intel CPUs (and AMD, btw) to have twice the maximum FP32 throughput of FP64:
    http://www.nas.nasa.gov/hecc/support/kb/Ivy-Bridge...
    https://software.intel.com/de-de/forums/topic/3942...

    "It might be possible to perform FP64 operation using a single FP32 unit, by doing multiple passes, but this too is not clear to me. "
    That's what they're doing if the FPU needs to stay very small (AMD's cat-cores, Intel Silvermont).

    How it's done EXACTLY is not clear to me either, as I'm not a circuit engineer. But I know it's pretty standard stuff, so look it up if you're really interested.
  • p1esk - Wednesday, July 8, 2015 - link

    Here's a diagram of the 8087 FP coprocessor: it's clear that both FP32 and FP64 use 80 bit precision:
    https://en.wikipedia.org/wiki/Intel_8087#/media/Fi...
    I did try to look it up, but haven't found any good info. I don't need to know exactly how it's done, a diagram showing the design of FP units in a modern CPU or GPU would be sufficient.
  • MrSpadge - Thursday, July 9, 2015 - link

    What I see there is simply all 8 stack registers using 80 bit and the busses being 16+64 bit wide. That does not automatically mean the execution unit(s?) would use the full 80 bit internally for FP32. That would be a waste of energy, though power consumption was no big factor for those old CPUs.

    Anyway, after some searching I found this regarding AMD's VLIW architecture:
    http://www.realworldtech.com/cayman/6/
    "AMD’s VLIW pipelines are a multi-precision, staggered design that can bypass results between the pipelines." - these may be the keywords to look for.
  • p1esk - Thursday, July 9, 2015 - link

    The x87 FP coprocessor did use an 80-bit FP ALU even for FP32, and in fact when it was present, it was used for all ALU operations, including 32 bit integer ops. It's long obsolete though, probably due to the energy/area efficiency issues you mentioned.
    From your AMD link, and also from some info I found on Maxwell, it turns out all modern GPUs include both FP32 and FP64 units. There are typically a lot fewer FP64 units than FP32 units; that's where the typical 64 vs 32 performance ratio comes from when describing GPU architectures, e.g. Maxwell GM200 is 32:1 because each SMM has 128 FP32 units and only 4 FP64 units.
    However, I still could not find any info about FP ALUs in modern CPUs. For example, Haswell can process 256 bit wide FP vectors. These can be a set of 32 bit FP values, or a set of 64 bit FP values. But it's not clear if it uses the same FP64 units for all FP ops, or if there's a mix of FP64 and FP32 units, like in GPUs.
  • saratoga4 - Thursday, July 9, 2015 - link

    It's the same multiplier used in all cases. There are control lines that set the precision and number of parallel operations used. Putting in multiple 256 bit wide FPUs that could not be used in parallel wouldn't make sense; it'd be a huge waste of power/area.

    ARM is even more interesting with NEON. On parts with small dies, the NEON floating point multipliers are only 32 bits wide, so a 64 bit vector needs the multiplier for 2 cycles, whereas a 128 bit vector needs it for 4. On the more advanced ARM devices, the multiplier unit is itself 128 bits wide, so the performance is the same.
  • saratoga4 - Wednesday, July 8, 2015 - link

    When you multiply a large number on pen and paper, you don't solve every digit at once. The way we learn in grade school is to do each decimal digit (~3 bits) as one multiply and then shift/sum the result. A computer can do the same thing, building a 64 bit multiply out of several 32 bit multiplies and some logic to combine the results.
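    The shift-and-sum idea can be sketched in a few lines of Python (my own illustration of the principle, not any particular chip's circuit); here a 32-bit multiply is built from 16x16-bit partial products, treating the 16-bit halves as base-2^16 "digits":

```python
def mul32_from_16(a, b):
    # Split each 32-bit operand into 16-bit halves.
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    # Four 16x16-bit partial products, shifted into place and summed,
    # exactly like the grade-school method.
    return ((a_hi * b_hi) << 32) \
         + ((a_hi * b_lo) << 16) \
         + ((a_lo * b_hi) << 16) \
         + (a_lo * b_lo)

print(hex(mul32_from_16(0x12345678, 0x9ABCDEF0)))
```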

    This was actually really common in the past even for 32 bit operations. The original iPhone for instance launched with a CPU that only had a 16 bit multiplier. 32 bit operations were synthesized (in hardware) by doing the first 16 bit multiply on the first cycle, and the second on the second cycle. For this reason multiplying 16 bit numbers was a lot faster than 32.
  • p1esk - Wednesday, July 8, 2015 - link

    I'm talking about floating point, not integers.
  • saratoga4 - Wednesday, July 8, 2015 - link

    The iPhone example was for the ALU, but aside from that, yes I was referring to floating point.
  • p1esk - Wednesday, July 8, 2015 - link

    Sure, you can execute FP32 instructions with FP16 units, but I'm asking how to execute two FP16 instructions on a single FP32 unit simultaneously. I don't see how to do that without completely redesigning the FP32 unit into some monstrosity that looks like two FP16 units interconnected in a complicated way. That would probably slow down the execution of native FP32 ops significantly.

    If you have any reference to such design I'd love to see it.
  • bortiz - Wednesday, July 8, 2015 - link

    I don't get the problem. Let's go to the core of it, which is two 16-bit adds becoming one 32-bit add. You have two 16-bit ripple adders. You feed the carry-out of the first adder into the carry-in of the next ripple adder (each full-adder cell sums three input bits). This portion is not complicated. Where it gets complicated is that the add now takes 2 clocks to execute as opposed to 1. The simple solution is to make all operations in 32-bit mode take 2 clocks; otherwise you have to check for boundary conditions and block the specific addresses and registers being referenced to allow for pipelining and/or mixed 16/32-bit operations.
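    That carry-chaining can be written down concretely (my own Python sketch of the principle, not real hardware): two 16-bit adders make one 32-bit adder when the low adder's carry-out feeds the high adder's carry-in.

```python
def add32_from_two_16(a, b):
    # Split each 32-bit operand into 16-bit halves.
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    lo = a_lo + b_lo              # low 16-bit adder
    carry = lo >> 16              # its carry-out...
    hi = a_hi + b_hi + carry      # ...feeds the high adder's carry-in
    # Reassemble, wrapping modulo 2^32 like the hardware would.
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

print(hex(add32_from_two_16(0xFFFF, 1)))  # 0x10000
```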
  • p1esk - Wednesday, July 8, 2015 - link

    Again, you are talking about integers. I'm asking about floating point.
  • saratoga4 - Wednesday, July 8, 2015 - link

    There really isn't much difference. A floating point number is just two paired integers, a mantissa and an exponent. When you do a multiply, the mantissas are just integers that are multiplied like any other integer. The exponents are then added. Finally the resulting mantissa is renormalized and the exponent updated.

    So basically, what bortiz said.
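    That mantissa-multiply / exponent-add / renormalize flow can be sketched with plain integer operations. This is my own bare-bones illustration for normal FP32 values only: it truncates instead of rounding and ignores zeros, subnormals, infinities and NaNs.

```python
import struct

def f2bits(x):
    # Reinterpret a Python float as its binary32 bit pattern.
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits2f(b):
    return struct.unpack('<f', struct.pack('<I', b))[0]

def fp32_mul_sketch(x, y):
    xb, yb = f2bits(x), f2bits(y)
    sign = (xb ^ yb) & 0x80000000
    # Biased exponents and mantissas with the implicit leading 1 restored.
    ex, ey = (xb >> 23) & 0xFF, (yb >> 23) & 0xFF
    mx, my = (xb & 0x7FFFFF) | 0x800000, (yb & 0x7FFFFF) | 0x800000
    prod = mx * my          # integer multiply of mantissas (up to 48 bits)
    exp = ex + ey - 127     # add exponents, remove one bias
    # Renormalize: the product of two values in [1,2) lies in [1,4).
    if prod & (1 << 47):
        prod >>= 1
        exp += 1
    mantissa = (prod >> 23) & 0x7FFFFF
    return bits2f(sign | (exp << 23) | mantissa)

print(fp32_mul_sketch(1.5, 2.0))  # 3.0
```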
  • p1esk - Wednesday, July 8, 2015 - link

    Bortiz was talking about using two 16 bit adders to do one 32 bit addition. I'm asking about the opposite - how to use one 32 bit adder to do two 16 bit additions at once.

    Have you seen an FP adder? I looked at several designs, and none of them can support two additions.
  • saratoga4 - Thursday, July 9, 2015 - link

    Logically if you can use 2 FP16 units as one FP32 unit, then you could make an FP32 unit that could do two FP16 operations at once (by just making it from two FP16 units).

    Did I just blow your mind ;)
  • p1esk - Friday, July 10, 2015 - link

    You would blow my mind if you supported your statements with references. I'm not asking about what is possible to do "logically". I want to know what is actually done in real chips, more specifically, in Tegra X1. So far your guesses are as good as mine.
  • p1esk - Tuesday, July 7, 2015 - link

    FP64 performance is being reduced because NVIDIA currently focuses on a single application: deep learning. For deep learning, 16 bits precision is enough.
  • Ryan Smith - Tuesday, July 7, 2015 - link

    "if Nvidia is concentrating on half-precision workload performance, does that mean 32 and 64 bit performance will be worse?"

    No. The way they're accomplishing it is by packing two FP16 instructions inside a single FP32 instruction as a Vec2. Currently FP32 ALUs can only run a single FP16 instruction, so this allows a great increase in FP16 performance without adding any more real hardware.
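    The data layout of that vec2 packing can be mimicked in Python with the `struct` module's half-precision format character ('e'). This is only my illustration of what "two FP16 values in one 32-bit slot, operated on lane-wise" means, not NVIDIA's actual datapath:

```python
import struct

def pack_half2(a, b):
    # Two FP16 values packed into one 32-bit word (a "half2" vec2).
    return struct.unpack('<I', struct.pack('<2e', a, b))[0]

def unpack_half2(w):
    return struct.unpack('<2e', struct.pack('<I', w))

def half2_add(p, q):
    # One register-wide operation performing two independent FP16 adds,
    # one per 16-bit lane.
    a0, a1 = unpack_half2(p)
    b0, b1 = unpack_half2(q)
    return pack_half2(a0 + b0, a1 + b1)

x = pack_half2(1.0, 2.0)
y = pack_half2(0.5, 0.25)
print(unpack_half2(half2_add(x, y)))  # (1.5, 2.25)
```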
  • p1esk - Tuesday, July 7, 2015 - link

    Ryan, are you implying that a single FP32 ALU will be capable of executing two 16 bit FP operations simultaneously? I don't see how this is possible. Care to explain?
  • MrSpadge - Wednesday, July 8, 2015 - link

    You have to wire things more flexibly, that's why such units need more space & transistors.
  • p1esk - Wednesday, July 8, 2015 - link

    LOL, I'd like more info than "wire things more flexibly".
  • KhanTengri - Wednesday, July 8, 2015 - link

    The NVIDIA DIGITS DevBox uses an ASUS X99-E WS motherboard and supports 4-way GPUs (x16/x16/x16/x16) via multiplexing - the CPU provides only 40 PCIe lanes. To achieve better scaling results another motherboard (dual Xeon) would be preferable ... ASUS Z10PE-D8 WS or Supermicro X10DRG-Q.
