<h3>Binary multiplication methods</h3>

<b>Shift and ADD</i>

The standard approach to multiplication (on a CPU with no multiply hardware) is to use the well known binary 'shift and ADD' method (indeed, on some PIC CPU's this is actually FASTER than using the built-in multiply "hardware" :-) ).

Multiply is usually coded 'bit at a time' as a simple loop, so requires 8 steps for an 8 bit * 8 bit multiply.

<i>The PIC instructions set lends itself well to nulti-byte shifting and single byte ADDing, so I implement a 16x8 multiply using a 16 step loop (i.e. I shift the 16 bit value and ADD the 8 bit one, which requires fewer instructions than 8 bit shift with a 16bit ADD :-)). The actual instruction code for a 16x8 multiply differs little from the 8x8 code.</i>

For maximum speed, the loop is 'unwound', which avoids the need for a loop count "dec and skip, jump loop" for each 'step'. Unwinding almost doubles the speed of an 8x8 multiply (at the cost of double the instruction code size).

The multiply speed is always the same (8 or 16 loops), irrespective of the values to be multiplied. Modifying the code to detect and 'short circuit' the loop for 'special cases' (such as 'multiply by 0', 'multiply by 1', 'multiply by power of 2') imposes the extra cost of the 'special case test' as an overhead on all multiply operations - so 'on average' (assuming 'random' values are multiplied) it's faster just to let the loops run (even if '0' is being added).

<h4>Alternative methods</h4>
 
In the 'good old days', early computers had a separate "ALU" (Arithmetic Logic Unit). These were initially constructed using discrete logic gates and implemented only the most basic of arithmetic operations. Some functions were 'fast' (such as 'invert') whilst others (such as SHIFT, ADD and SUBtract) would typically require at least one CLK cycle per bit (so a 1bit Shift was fast but an 8 bit Add would require at least 8 CLK's). Any 'trick' you could pull in the ALU to speed up calculations was worth doing, especially if you could minimise the number of Add (or Subtract) operations when performing a 'shift and add' multiply sequence.

One 'trick' was to note that you don't have to start adding the multiplicand to the 'total' until the first '1' is found. What's more, if you start with the multiplicand in the result register, you can actually skip the first 'add' and (only) need to start shifting on the first '1'.

Further, when the control 'shift' register contains nothing but zero's you can skip all further ADD operations (although you will need to keep shifting the result).

Some ALU's were even clever enough to 'swap' multiplier and multiplicand to get the one with the most zero's as the 'shift' (and 'add') control = and some even 'inverted' the multiplier/multiplicand (and inverted the result) to get more "zero's" (and thus avoid 'add' operations) !

<b>Can the above tricks save cycles on a PIC 8x8 / 16x8 multiply ?</b>

The register (pair) contining the multiplier (ADD control) is 're-used' as (the low bytes of) the result. So we still have to shift it = however if the (to be) ADDed value starts in the MSB (it does) we can save a 'Clear' (and save shifting the MSB result) until the first '1' is seen.

Unfortunately, with the control reg being re-used as the LSB(s) of the result, we can't "detect no more 1's" (RRF shift doesn't set the Z flag anyway, so an explicit test would be needed at a cost of +2 Clk's (Test and skip if non-Z) per bit).

However, IF we know in advance that (one of) the nunbers will be 'small', so lots of ms bits are zero (and especially if the ms bits of the 16bt value are zero) we can sve time at the 'top end'. However this means the 'shift and add' has to be 'reversed' so the ms bits are 'processed' (skipped) first == see the 'halves and doubels' method below ...


<b>The Kenyan "halves and doubles" method</b>

<i>See <a href="http://www.piclist.com/techref/method/math/muldiv.htm">Novel Methods of Integer Multiplication and Division</a></i>

You start by halving the multiplier (shift right).
If that leaves a remainder (1 shifted out) you add the multiplicand to (LSB of) the total.
You then double the multiplicand.
You continue with halving the multiplier, adding 'if remainder' and doubling until (after no more than 7 shifts) the multiplier reaches '1', at which point the final multiplicand is added to the sum.

This is (of course) just the standard unsigned binary 'shift and add' multiply process by another name, made more complex by 'staring at the bottom' (and doubling up).

It does have the advantage of 'stopping short' when the 'halves' reach '1' (which could make a big difference when multiplying by 'small' values, ESPECIALLY when 'small' 16 bit numbers are involved)

<b>Implementing the Kenyan "halves and doubles" method</b>

For an 8x8 multiply, whilst the multiplier (shifted) value is never more than 1 byte, the multiplicand (doubled) value will need 2 bytes. This means the addition is a 16bit ADD into a 16 bit result (which is 6 instructions with Cy propergation (the PIC has no 'ADD with Cy' instruction)) instead of 1 (for a single byte ADD)

So, as described, "halves and doubles" method needs 5 registers and requires a 16 bit ADD = as opposed to the 'binary shift and add' (MUL macro) which uses only 3 registers and an 8 bit ADD.

However, unlike 'binary shift and add', the above process 'stops' when halving reaches the final '1' = so, not only is there no need to 'step count' but it can also 'end early' (when there are no more '1's) thus maximising speed when multiplying with small numbers (binary shift and add has to do all 8 or 16 shifts, so there is nothing to be saved in looking for "no more 1's").

<i><b>Can we save processing time by checking for the lowest and (if necessary) swapping the two values ?</b>

If the two values are totally random, there may be little to gain by 'finding the smaller' since, 'on average', half the values will be >127 i.e. 8 bits = and (on average) half the time the smaller (7 bits or less) will already be in the 'halving reg' (rather than the 'double reg'). So the 'cost' of checking (and swapping, on average half the time) may exceed the (average) number of CLK's saved.

If the values are 'non-random' then it's likely that the 'calling' code can arrange for the smaller value to be in the 'halving reg.'

However, it may well be necessary to check the values (in case of 0 or 1) anyway, so an extra 'compare' may be easy to arrange.</i>

<i><b>What's the cost/saved of terminating the process at count '1' ?</b>

Checking for '1 left' requires 1 instruction (DECFSZ = decrement reg and skip if zero). This can be added directly before a 'jump back to loop', so there may be no other additional cost (except when the value in Acc has to be preserved when DECFSZ has to update the reg. instead of Acc and that means an 'INC reg' will be needed to restore the value after the '1' test).

Finally, if we 'unwind' the start of the loop (and assuming we only multiply when the (smaller) value is 2 or more), we don't have to test until after the first shift & double.</i>

I include the code below, however unless many multiplies are by 'powers of  2' or by small values (so can take advantage of the first section of the code that will 'detect power of 2' or the 'early exit' when halving reaches 1), it actually costs significantly more CLK cycles than the simple 'in line' binary 'shift and add' (see my MUL macro)

;
; Kenyan "halves and doubles"
; Enter with rSource = multiplier (should be lower of the two values) - will be lost in the process
; rLSB, rMSB (LSB+1) = the multiplicand, and the Result.
; Uses Acc, rTemp, rTemp+1 during processing
;
MUL8x8 ;Subroutine
 CLR rMSB
; IF rLSB=0, exit
 MOVF rLSB,1 ;test LSB
 Skip nZ
 RETURN      
; IF rSource=0, CLR rLSB, exit
 MOVF rSource,1 ;test source
 Skip Z
 GOTO loop0     ;continue if non-Zero
 CLR rLSB
 RETURN 
loop0 ; power of 2 loop (x1,2,4,8,16,32,64,128)
; IF rSource=1, exit
 DECFSZ rSource,0  ;dec reg to Acc, skip if Z (i.e. was 1)
 RETURN
loop1 ;
 CLR Cy      ;must make sure Cy is clear before shiftung rSource (as we need actual value for 1 test later)
 RRF rSource
 Skip nCy    ;if Cy ..
 Jump loop2  ; .. go to main loop
; rSource didn't Cy yet, double up and test again  
 RLF rLSB
 RLF rMSB                
 Jump loop0
loop2     ; OK arrive here when we have the first Cy and we know rSource wasn't 1
; result contains correct start value, double to temp and shift/test again
 CLR Cy
  RLF rLSB,0      ;rotate lsb to acc
  COPY Acc,rTemp  ;save as temp
  RLF rMSB,0      ;rotate msb to acc
  COPY Acc,rTemp+1  ;save as temp+1
loop3 ;the msb double can't have resulted in a Cy
 RRF rSource   
  Skip Cy
  Jump loop4      ;no Cy, skip the add
  COPY rTemp,Acc
  ADD Acc,rLSB
  Skip nCy
  INC rMSB
  COPY rTemp+1,Acc
  ADD Acc,rMSB
loop4 ;nCy from last RRF or from MSB add
  RRL rTemp    ;double up again
  RRL rTemp+1  ;msb can't Cy
;  Test if rSource=1, if not jump to loop3 
 DECFSZ rSource,0  ;dec reg to Acc, skip if Z (i.e. was 1)
 Jump loop3        ;not 1, keep going
; rSource now 1, do the final add
  ADD Acc,rLSB
  Skip nCy
  INC rMSB
  COPY rTemp+1,Acc
  ADD Acc,rMSB
  RETURN
;

<b>Using squares</b>

If your PIC has lots of spare Flash space, then you may be tempted to use a Look-Up-Table method that uses 'squares' ..

We start by noting that "(a+b)^2 = a^2 + b^2 + 2ab", and "(a-b)^2 = a^2 + b^2 - 2ab".
Subtracting these gives: "4ab = (a+b)^2 - (a-b)^2", or ab = (sum a+b)^2/4 - (difference a-b)^2/4.

So, if we have a table of "n^2/4" (n squared over 4), we can do a multiply with one 8bit add (sum a+b), one 8bit subtract (difference a-b), two table lookups (n squared /4) and a 16bit subtract. In fact, if the difference is 0 (or 1), the second look-up and subtract can be skipped.

The table will have a total of no more than a+b entries (rather than a*b for a direct n*n multiply look-up), but must hold values as large as a*b if the full range of possible a & b inputs need to be supported.

If a and b are 8 bit integers, then table (a+b)^2/4 has 511 16bit entries (the same table will cover the "difference" (a-b)^2/4 subset which uses only 256 of the entries), for a total of (just under) 1kb which requires 4 x complete PIC 254 entry tables (so, only worth it if you have huge amounts of table space to spare).

To look up a 16 bit value, you :-

Copy sum/difference value (pointer) from reg to A (1 clk)
 [assume +2 CLK's to Bit set/clr the PA bits to choose the table ...]
Call one of the table base address with the pointer in A (2 clk) 
Jump into the Table using A as offset (2 clk)
Return with lsb in A (2 clk)
Save A to reg (1 clk)
... and then do it again to get the next byte

<i>The problem is that Return with value 'destroys' the pointer in Acc - so each byte looked up means re-fetching the pointer. To speed up the look-up we have to eliminate the need to re-fetch the pointer i.e. 'jump into table' only once per look-up. That means the table would have to contain 2 bytes per entry, which means each entry has to save 1 byte to a register before returning with the other in Acc. A 16bit table thus consists of 3 instructions per entry (LOAD byte,Acc COPY Acc,rTemp, Return other byte in Acc).

That means the LUT size goes up from 1k to 1.5k (for 511 16b values), plus, to avoid the need for lots of PA bit calculations, this will have to be spread over 8 pages.</i>

Plainly this isn't going to be any faster than the straight 'unwound' binary shift and add == in fact it's going to be slower :-)

If each sum/difference look-up only 'costs' 12 CLK's, that's 24 just to get the values .. then it's at least 3 clk per 8 bit add/sub to get the sum/difference (so +6) and another 5 for the 16 bit subtract for a total of (at least) 35 CLK's.

On the other hand, my 'raw unwound shift and add' Macro code takes 33 CLK's  !!

<i>Squaring will be faster since we can drop the second look-up and the 16b subtract = 1 calculate sum & 1 look-up, but still have calculate the difference to spot the square + 2 more to test 0/1 & skip) == 20 Clk's total</i>