Big Integer Subtraction
Subtracting B from A on the Power architecture is easy when both A and B fit in a single machine word: subf[c] rt, B, A. Much more interesting is the case where they do not fit, as is often the case in big number libraries and cryptographic code. In order to make this fast, this needs to be implemented directly in assembly for the target architecture. This exercise, and the code in it, arose from optimizing hot spots in the Go math/big library as it is used by the crypto packages.
Big integers are (in our case) stored as an array of words and a sign. The function being implemented, subVV (subtract-vector-vector) has several conditions to the API in order to call it: the arguments all have the same length, (This is a fine assumption - to do subtraction of numbers that don't have the same number of words, truncate them to length of the shorter array and handle the case of needing to borrow once more at the end; this is handled at the call site) and function must return 1 if there was a borrow and 0 otherwise. Thus, the subtraction algorithm can be outlined in psuedocode (or actual code):
var c = 0
for i from 0 to length:
result[i] = A[i] - B[i] - c
c = 1 if that borrowed, 0 if it didn't
return c
The ideal instruction to implement this will handle both of these steps. To understand how this ideal instruction should work, it pays off to examine what the subtraction instruction does.
According to the Power ISA manual (version 2.07, p. 68), subtraction performs "the sum ¬B + A + 1" and places it in rt. This works due to the nature of two's complement numbers: namely, that addition of numbers represented in two's complement form works identically to addition of unsigned numbers. This allows performing A + (-B). -B, in two's complement, is ¬B+1.
The desired instruction takes a possible borrow into account, and so would look like ¬B + A + 1 - c, where c is 1 if there was a borrow and 0 otherwise.
The difference between subf and subfc on POWER is that the subfc version sets the carry bit, CA, in the processor. When will subtraction carry? This is easiest to examine on a single bit:
| a | b | ¬b | a + ¬b + 1 | CA | Borrowed? (b > a) |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 10 | 1 | No |
| 0 | 1 | 0 | 01 | 0 | Yes |
| 1 | 0 | 1 | 11 | 1 | No |
| 1 | 1 | 0 | 10 | 1 | No |
The CA bit is set to 1 when a ≥ b, and 0 when a < b: CA serves as a not-borrow flag. This allows rewriting the expression ¬B + A + 1 - c in terms of CA by noting that CA = 1 - c *(Proof left as an exercise to the reader), giving ¬B + A + CA.
A quick browse of the ISA, and we find the instruction we're looking for: Subtract From Extended, subfe. Like subfc, subfe sets the CA bit. Exactly as required, "the sum ¬B + A + CA" is placed in rt.
This can be explored by considering an example of 9-bit subtraction on a 3-bit computer.
368 101 110 000
- 123 - 001 111 011
----- -------------
245 011 110 101
Beginning with the least significant words, 000-011:
000 - 011 = 000 + ¬011 + CA = 000 + 100 + 1 = 101
For the first subtraction, there was no previous borrow and so CA must be 1. This does not overflow, and so CA is 0 for the next word:
110 - 111 - c = 110 + ¬111 + 1 - c = 110 + 000 + CA = 110
This leads to our final word of the subtraction, again with CA = 0:
101 - 001 - c = 101 + ¬001 + 1 - c = 101 + 110 + CA = 011
This just leaves a couple loose ends. First, to use subfe for the loop, CA needs to have an initial value. As this is a normal subtraction without a borrow, that initial value must be 1. Additionally, this implementation is being written targeting the Go assembler which uses its own assembly syntax.
// func subVV(r, a, b []Word) c Word
TEXT ·subVV, NOSPLIT, $0
MOVD res_len+8(FP), R7 // length of result
MOVD a+24(FP), R8 // address of subtrahend
MOVD b+48(FP), R9 // address of minuend
MOVD r+0(FP), R10 // address of result
MOVD $0, R5 // i = 0
MOVD $8, R28 // unaware of mulli equivalent in go asm
SUBC R0, R0 // R0 always contains 0, so this sets CA
JMP END
LOOP:
MULLD R5, R28, R6 // offset in bytes by i
MOVD (R8)(R6), R11 // a[i]
MOVD (R9)(R6), R12 // b[b]
SUBE R12, R11, R15 // same as subfe r15, r12, r11
MOVD R15, (R10)(R6) // store result in r[i]
ADD $1, R5 // i++
END:
CMP R5, R7 // if i < len(r):
BLT LOOP // goto LOOP
MOVD $0, R4
ADDZE R4 // R4 += CA
XOR $1, R4 // return value is !CA
MOVD R4, r+72(FP) // Go passes return values on the stack
RET
How much faster is this assembly than the generic implementation? The benchmark for addVV is easily modified to benchmark subVV, giving the following results:
benchmark old ns/op new ns/op delta
BenchmarkSubVV/1 16.1 6.76 -58.01%
BenchmarkSubVV/2 18.2 9.59 -47.31%
BenchmarkSubVV/3 20.5 9.32 -54.54%
BenchmarkSubVV/4 22.9 11.3 -50.66%
BenchmarkSubVV/5 25.5 13.3 -47.84%
BenchmarkSubVV/10 42.6 27.1 -36.38%
BenchmarkSubVV/100 259 198 -23.55%
BenchmarkSubVV/1000 2408 1910 -20.68%
BenchmarkSubVV/10000 23853 19037 -20.19%
BenchmarkSubVV/100000 244542 190399 -22.14%
benchmark old MB/s new MB/s speedup
BenchmarkSubVV/1 3968.14 9460.99 2.38x
BenchmarkSubVV/2 7045.22 13347.39 1.89x
BenchmarkSubVV/3 9357.28 20609.20 2.20x
BenchmarkSubVV/4 11156.85 22688.51 2.03x
BenchmarkSubVV/5 12556.21 24002.40 1.91x
BenchmarkSubVV/10 15007.22 23583.03 1.57x
BenchmarkSubVV/100 24693.65 32217.26 1.30x
BenchmarkSubVV/1000 26575.40 33492.23 1.26x
BenchmarkSubVV/10000 26830.29 33617.69 1.25x
BenchmarkSubVV/100000 26171.33 33613.55 1.28x