Big Integer Subtraction

Subtracting B from A on the Power architecture is easy when both A and B fit in a single machine word: subf[c] rt, B, A. Much more interesting is the case where they do not fit, as is often the case in big number libraries and cryptographic code. In order to make this fast, this needs to be implemented directly in assembly for the target architecture. This exercise, and the code in it, arose from optimizing hot spots in the Go math/big library as it is used by the crypto packages.

Big integers are (in our case) stored as an array of words and a sign. The function being implemented, subVV (subtract-vector-vector) has several conditions to the API in order to call it: the arguments all have the same length, (This is a fine assumption - to do subtraction of numbers that don't have the same number of words, truncate them to length of the shorter array and handle the case of needing to borrow once more at the end; this is handled at the call site) and function must return 1 if there was a borrow and 0 otherwise. Thus, the subtraction algorithm can be outlined in psuedocode (or actual code):

var c = 0
for i from 0 to length:
result[i] = A[i] - B[i] - c
c = 1 if that borrowed, 0 if it didn't
return c


The ideal instruction to implement this will handle both of these steps. To understand how this ideal instruction should work, it pays off to examine what the subtraction instruction does.

According to the Power ISA manual (version 2.07, p. 68), subtraction performs "the sum ¬B + A + 1" and places it in rt. This works due to the nature of two's complement numbers: namely, that addition of numbers represented in two's complement form works identically to addition of unsigned numbers. This allows performing A + (-B). -B, in two's complement, is ¬B+1.

The desired instruction takes a possible borrow into account, and so would look like ¬B + A + 1 - c, where c is 1 if there was a borrow and 0 otherwise.

The difference between subf and subfc on POWER is that the subfc version sets the carry bit, CA, in the processor. When will subtraction carry? This is easiest to examine on a single bit:

a b ¬b a + ¬b + 1 CA Borrowed? (b > a)
0 0 1 10 1 No
0 1 0 01 0 Yes
1 0 1 11 1 No
1 1 0 10 1 No

The CA bit is set to 1 when a ≥ b, and 0 when a < b: CA serves as a not-borrow flag. This allows rewriting the expression ¬B + A + 1 - c in terms of CA by noting that CA = 1 - c *(Proof left as an exercise to the reader), giving ¬B + A + CA.

A quick browse of the ISA, and we find the instruction we're looking for: Subtract From Extended, subfe. Like subfc, subfe sets the CA bit. Exactly as required, "the sum ¬B + A + CA" is placed in rt.

This can be explored by considering an example of 9-bit subtraction on a 3-bit computer.

  368      101 110 000
- 123    - 001 111 011
-----    -------------
245      011 110 101


Beginning with the least significant words, 000-011:

000 - 011 = 000 + ¬011 + CA = 000 + 100 + 1 = 101


For the first subtraction, there was no previous borrow and so CA must be 1. This does not overflow, and so CA is 0 for the next word:

110 - 111 - c = 110 + ¬111 + 1 - c = 110 + 000 + CA = 110


This leads to our final word of the subtraction, again with CA = 0:

101 - 001 - c = 101 + ¬001 + 1 - c = 101 + 110 + CA = 011


This just leaves a couple loose ends. First, to use subfe for the loop, CA needs to have an initial value. As this is a normal subtraction without a borrow, that initial value must be 1. Additionally, this implementation is being written targeting the Go assembler which uses its own assembly syntax.

// func subVV(r, a, b []Word) c Word
TEXT ·subVV, NOSPLIT, $0 MOVD res_len+8(FP), R7 // length of result MOVD a+24(FP), R8 // address of subtrahend MOVD b+48(FP), R9 // address of minuend MOVD r+0(FP), R10 // address of result MOVD$0, R5  // i = 0
MOVD $8, R28 // unaware of mulli equivalent in go asm SUBC R0, R0 // R0 always contains 0, so this sets CA JMP END LOOP: MULLD R5, R28, R6 // offset in bytes by i MOVD (R8)(R6), R11 // a[i] MOVD (R9)(R6), R12 // b[b] SUBE R12, R11, R15 // same as subfe r15, r12, r11 MOVD R15, (R10)(R6) // store result in r[i] ADD$1, R5 // i++

END:
CMP R5, R7 // if i < len(r):
BLT LOOP   //   goto LOOP

MOVD  $0, R4 ADDZE R4 // R4 += CA XOR$1, R4 // return value is !CA
MOVD  R4, r+72(FP) // Go passes return values on the stack
RET


How much faster is this assembly than the generic implementation? The benchmark for addVV is easily modified to benchmark subVV, giving the following results:

benchmark                 old ns/op     new ns/op     delta
BenchmarkSubVV/1          16.1          6.76          -58.01%
BenchmarkSubVV/2          18.2          9.59          -47.31%
BenchmarkSubVV/3          20.5          9.32          -54.54%
BenchmarkSubVV/4          22.9          11.3          -50.66%
BenchmarkSubVV/5          25.5          13.3          -47.84%
BenchmarkSubVV/10         42.6          27.1          -36.38%
BenchmarkSubVV/100        259           198           -23.55%
BenchmarkSubVV/1000       2408          1910          -20.68%
BenchmarkSubVV/10000      23853         19037         -20.19%
BenchmarkSubVV/100000     244542        190399        -22.14%

benchmark                 old MB/s     new MB/s     speedup
BenchmarkSubVV/1          3968.14      9460.99      2.38x
BenchmarkSubVV/2          7045.22      13347.39     1.89x
BenchmarkSubVV/3          9357.28      20609.20     2.20x
BenchmarkSubVV/4          11156.85     22688.51     2.03x
BenchmarkSubVV/5          12556.21     24002.40     1.91x
BenchmarkSubVV/10         15007.22     23583.03     1.57x
BenchmarkSubVV/100        24693.65     32217.26     1.30x
BenchmarkSubVV/1000       26575.40     33492.23     1.26x
BenchmarkSubVV/10000      26830.29     33617.69     1.25x
BenchmarkSubVV/100000     26171.33     33613.55     1.28x