<h3>16x8 Multiply (shift and ADD) Subroutine</h3>

You can find many 8x16 multiply methods (for example, see <a href="http://www.piclist.com/techref/microchip/math/basic.htm">piclist math methods</a>); however, implementation is typically 'left up to the user' :-). Generating 'optimum' code depends on your goal: either maximum speed or minimum code space.

My 'new PIC 33 instruction set (macros)' contains a MUL8x16 macro that implements a 'maximum speed' approach, so this subroutine is aimed at the 'minimum code space' approach (i.e. it takes longer because it loops).

<b>Method</b>

The basic binary 'shift and add' method uses one value to control the adding of the second value into the 'top' of the result, after which the result is shifted down 1 bit and the next control bit is checked. After n shifts (and n possible adds) the result is complete.
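
As a rough illustration of the method just described, here is a minimal C sketch for the simpler 8x8 case. The function name is made up for this example, and C is used only as a model; it is not the PIC code.

```c
#include <stdint.h>

/* Shift-and-add multiply, adding into the TOP of the result and
   shifting the result DOWN after every step (illustrative name). */
uint16_t shift_add_8x8(uint8_t control, uint8_t addend)
{
    uint32_t result = 0;                      /* wide enough to hold the carry out of the top add */
    for (int i = 0; i < 8; i++) {
        if (control & 1)                      /* check the next control bit */
            result += (uint32_t)addend << 8;  /* add into the top byte of the result */
        result >>= 1;                         /* shift the result down 1 bit */
        control >>= 1;                        /* line up the next control bit */
    }
    return (uint16_t)result;                  /* after 8 shifts the result is complete */
}
```

For example, `shift_add_8x8(3, 5)` does two adds and eight shifts and returns 15.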

When it comes to 8x16, the result will be 24 bits. The temptation is to shift the 8 bit value as the 'control' and ADD the 16 bit value into the result (since this limits the process to a maximum of 8 ADD operations). HOWEVER, the PIC instruction set (and ALU) only supports the ADD of the Accumulator to a Register.

This would mean keeping the ADD value in a temp register pair and copying it a byte at a time to the Acc in order to perform a 16 bit ADD, whereas an 8 bit value can be kept in the Acc all the time. Further, the 'skip on shift no Cy' can skip the single 8 bit ADD instruction, but a 16 bit ADD sequence would have to be 'jumped' on no Cy.

All this means it's actually faster to 'shift' the m16 register pair and ADD the m8 value, up to 16 times, than it is to shift the m8 byte and add the m16 (a 2 byte register pair ADD with Cy propagation 'costs' at least 6 instructions), up to 8 times.

Further, if the m8 value is copied to the Acc for the ADD, then its register can be re-used as part of the result, plus the m16 value register pair can be 'double used' as the Mid/Lo result, as m16 will be totally 'shifted out' of the bottom (as the result is shifted in from the top) during processing.

Hence we actually perform '16(shift)x8(ADD)' (rather than 8(shift)x16(ADD)).
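
The register re-use can be modelled in C as below. This is only a sketch under assumptions: the `Reg24` struct and the function name are invented for the example, and the three fields stand in for the Hi, Mid and Lo registers.

```c
#include <stdint.h>

/* Hypothetical stand-in for the three 8-bit registers. */
typedef struct { uint8_t hi, mid, lo; } Reg24;

/* On entry: r->hi = m8, r->mid:r->lo = m16.
   On exit:  r->hi:r->mid:r->lo = 24-bit product. */
void mul16x8_model(Reg24 *r)
{
    uint8_t acc = r->hi;          /* m8 stays in the 'Acc' for the whole multiply ... */
    r->hi = 0;                    /* ... so its register becomes the top of the result */
    for (int i = 0; i < 16; i++) {
        unsigned sum = r->hi;     /* 9 bits wide: bit 8 models the Cy from the ADD */
        if (r->lo & 1)            /* control bit is the current bit 0 of m16 */
            sum += acc;           /* one 8-bit ADD into the top byte */
        /* shift Cy:Hi:Mid:Lo down one bit; m16 leaves at the bottom
           while the result enters from the top */
        r->lo  = (uint8_t)((r->lo  >> 1) | (r->mid << 7));
        r->mid = (uint8_t)((r->mid >> 1) | (sum << 7));
        r->hi  = (uint8_t)(sum >> 1);
    }
}
```

Starting from `{0xFF, 0xFF, 0xFF}` (255 x 65535) the three bytes end up as `0xFE, 0xFF, 0x01`, i.e. 0xFEFF01, with no extra result registers needed.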

<i>Note that 'shifting down' means that any Cy from the top byte ADD is automatically preserved (i.e. it's shifted into the (new) top bit). When we 'know' that m16 is 'small' (many top bits 0), it would be 'nice' to accumulate the ADD starting from the bottom up (instead of from top down), since that means we can skip the '0 bits'.

The problem with this is the Cy: when an ADD to the LSB generates a Cy we can't just 'shift it up'. Instead we have to propagate it up (by INCrementing the Mid/Hi byte), and that can result in the top 2 bytes changing (so we can't 'double use' them as both m16 and part of the result). In short, the ADD has to be at the top end (with shift down), not the bottom end (with shift up).</i>
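
For comparison, the rejected 'bottom up' variant might look like the C sketch below. It is an assumption about how such a routine would be arranged, not code from this page, and the function name is invented. Note the separate result bytes and the explicit carry propagation.

```c
#include <stdint.h>

/* 'Shift up' variant: scan m16 from the top bit, add m8 at the BOTTOM
   of the result. Illustrative only. */
uint32_t mul16x8_shift_up(uint16_t m16, uint8_t m8)
{
    uint8_t hi = 0, mid = 0, lo = 0;   /* result needs its own 3 bytes: m16 cannot share them */
    for (int i = 15; i >= 0; i--) {
        /* shift the result up one bit */
        hi  = (uint8_t)((hi  << 1) | (mid >> 7));
        mid = (uint8_t)((mid << 1) | (lo  >> 7));
        lo  = (uint8_t)(lo << 1);
        if ((m16 >> i) & 1) {
            unsigned sum = lo + m8;    /* ADD at the bottom end */
            lo = (uint8_t)sum;
            if (sum > 0xFF) {          /* Cy out of Lo cannot simply be shifted ... */
                if (++mid == 0)        /* ... it has to be propagated by INCrementing Mid */
                    ++hi;              /* and, if Mid wraps, Hi as well */
            }
        }
    }
    return ((uint32_t)hi << 16) | ((uint32_t)mid << 8) | lo;
}
```

While the top bits of m16 are 0 the loop only shifts an all-zero result, which is the work a real routine could skip; the price is the carry propagation into Mid and Hi.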

The subroutine below will multiply a 16 bit (shift) value (Mid,Lo) by an 8 bit (ADD) value (Hi), passed in 3 registers (Hi,Mid,Lo). To save on register space, the same 3 registers will be used to hold the (24 bit) result.

rTemp is used as the 16 step loop counter (or we can 'unwind' the code into 16 steps) :-
m8 in Hi is copied to Acc, m16 in Mid,Lo (mult16) is shifted down (so b0 to Cy).
If Cy is set, then m8 (in Acc) is added to Hi and the loop continues with the next shift.

After 16 shifts, m16 will be completely shifted out of Mid,Lo and Hi,Mid,Lo will contain the full result.

<code>
;<b>
MULTIPLY8x16 ;Unsigned 8x16 multiply subroutine</b>
; Called with the 8bit multiplier in mRegHi, 16bit multiplicand in mRegMid+mRegLo
; Returns with 24bit result mRegHi,mRegMid,mRegLo.
; Acc and rTemp (count) are used
; Note, 'no shortcut' header (i.e. no 0 or 1 tests, code still works, but takes same time in all cases)
 LOAD 0x10        ;count 16 (0x10 = 16)
 COPY Acc,rTemp   ;set loop
 COPY mRegHi,Acc   ;copy m8 to Acc
 CLR mRegHi        ;clr m8 reg for use with result
 RRF mRegMid     ;shift m16 down (we don't care if Cy is shifted into b15, because it will be discarded by last shift at end)
 RRF mRegLo      ;m16,b0 to Cy
Loop              ;loop is 16*8 CLK's, irrespective of ADD/no ADD
; (back) here with Cy set if ADD needed
 Skip nCy        ;skip ADD if no cy
 ADD Acc,mRegHi  ;add m8 to msb, may set Cy (which will need to be shifted into Hi)
 RRF mRegHi      ;add was skipped no Cy, or add may have set Cy, either way shift Cy into Hi
 RRF mRegMid     ;shift m16 down
 RRF mRegLo      ;next bit m16,bX to Cy
 DECFSZ rTemp    ;done last add ? dec loop, skip if zero (no effect on flags)
 Jump Loop     ;nZ, not done, keep looping (no effect on flags)
 RETURN        ;all ADDs done, exit (with rTemp 0, Acc still holds m8) note, last shift b0 is ignored
</code>

Total of 14 instructions, cost 6+ 16*8+ 2(return) = 136 Clk's.
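
To check the loop count and the 'we don't care about the Cy shifted into b15' claim, the instruction sequence can be mirrored step by step in C. This is a model only: `cy`, `rrf` and `multiply8x16_model` are names invented here, with `rrf` assumed to behave as a rotate right through carry.

```c
#include <stdint.h>

static uint8_t cy;                     /* models the Cy flag (0 or 1) */

/* Rotate right through carry: old Cy enters bit 7, bit 0 leaves to Cy. */
static void rrf(uint8_t *reg)
{
    uint8_t out = *reg & 1;
    *reg = (uint8_t)((*reg >> 1) | (cy << 7));
    cy = out;
}

/* hi = m8, mid:lo = m16 on entry; hi:mid:lo = product on exit. */
void multiply8x16_model(uint8_t *hi, uint8_t *mid, uint8_t *lo)
{
    uint8_t count = 16;                /* rTemp loop counter */
    uint8_t acc = *hi;                 /* copy m8 to Acc */
    *hi = 0;                           /* clear Hi for use with the result */
    rrf(mid);                          /* whatever was in Cy lands in b15 ... */
    rrf(lo);                           /* ... and m16 b0 moves to Cy */
    do {
        if (cy) {                      /* skip the ADD if no Cy */
            unsigned sum = *hi + acc;  /* add m8 to the top byte */
            *hi = (uint8_t)sum;
            cy = (uint8_t)(sum >> 8);  /* the ADD may set Cy */
        }
        rrf(hi);                       /* shift Cy into Hi */
        rrf(mid);                      /* shift m16 down */
        rrf(lo);                       /* next m16 bit to Cy */
    } while (--count != 0);            /* dec loop counter, exit at zero */
}
```

With 16 passes the stray bit that entered b15 in the pre-shift reaches Cy on the very last shift and is never tested, so the result is the same whether `cy` starts as 0 or 1.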

<i>The multiply can be 'short cut' by checking for *0 (and/or *1) by adding extra instructions to the 6 'header' set; however, that adds extra overhead to all MULtiplies, so it is only 'worth it' if your application is going to result in lots of 0* or 1*</i>

<code>
; Header for 0 test shortcut
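 COPY mRegLo,Acc  ;++extra code for 0 test, check Lo, sets Z (cost is +1). To test for *1 instead, use DECFSZ (Dec reg to Acc, skip if Z)
 Skip Z           ;++extra for Z test (running total +2)
 Jmp M16NotZero   ;++cont. if Lo nonZ, cost is +2 clk for jump
 COPY mRegMid,Acc ;++Lo is Z, but m16 is only 0 if Mid is 0 as well, so check Mid too
 Skip Z           ;++extra for Z test
 Jmp M16NotZero   ;++Mid nonZ, so m16 is nonZ, continue
 CLR mRegHi       ;Mid and Lo both Z, clr Hi and exit (result 0)
 RETURN
M16NotZero       ;Z test cost is +4 total if Lo nonZ (more if Lo is Z but Mid is not)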
 LOAD 0x10        ;count 16 (0x10 = 16)
 COPY Acc,rTemp   ;set loop
 COPY mRegHi,Acc   ;copy m8 to Acc (sets Z)
 Skip Z          ;++extra code for Z test (+1)
 Jmp HiNotZero   ;++continue if Hi non-Z (+2)
 CLR mRegMid     ; Hi is Z, so the result is 0: clr Mid
 CLR mRegLo      ; and clr Lo, then exit
 RETURN
HiNotZero       ;Z test cost is +3 clks if Hi nonZ
 CLR mRegHi      ;clr m8 reg for use with result 
 RRF mRegMid     ;shift m16 down (we don't care if Cy is shifted into b15, because it will be discarded by last shift at end)
 RRF mRegLo      ;m16,b0 to Cy
; CONTINUE WITH 16*8 INSTRUCTION LOOP ABOVE
</code>
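
The effect of the zero tests can be sketched in C as below. This is a model with an invented function name, not the PIC header itself; it treats m16 as zero only when both Mid and Lo are zero, and returns a full 24 bit zero in either shortcut case.

```c
#include <stdint.h>

/* Zero shortcut in front of a plain shift-and-add multiply (illustrative name). */
uint32_t mul16x8_with_zero_test(uint16_t m16, uint8_t m8)
{
    if (m16 == 0)                          /* both Mid and Lo are zero */
        return 0;
    if (m8 == 0)                           /* Hi is zero */
        return 0;

    uint32_t result = 0;
    for (int i = 0; i < 16; i++) {
        if (m16 & 1)
            result += (uint32_t)m8 << 16;  /* add m8 into the top byte */
        result >>= 1;                      /* shift the result down */
        m16 >>= 1;                         /* next control bit */
    }
    return result;
}
```

A value such as m16 = 0x0100 has a zero Lo byte but is not zero, so it must take the full loop: `mul16x8_with_zero_test(0x0100, 7)` returns 0x0700.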