Binary multiplication methods

Shift and ADD The standard approach to multiplication (on a CPU with no multiply hardware) is to use the well known binary 'shift and ADD' method (indeed, on some PIC CPU's this is actually FASTER than using the built-in multiply "hardware" :-) ). Multiply is usually coded 'bit at a time' as a simple loop, so requires 8 steps for an 8 bit * 8 bit multiply. The PIC instructions set lends itself well to nulti-byte shifting and single byte ADDing, so I implement a 16x8 multiply using a 16 step loop (i.e. I shift the 16 bit value and ADD the 8 bit one, which requires fewer instructions than 8 bit shift with a 16bit ADD :-)). The actual instruction code for a 16x8 multiply differs little from the 8x8 code. For maximum speed, the loop is 'unwound', which avoids the need for a loop count "dec and skip, jump loop" for each 'step'. Unwinding almost doubles the speed of an 8x8 multiply (at the cost of double the instruction code size). The multiply speed is always the same (8 or 16 loops), irrespective of the values to be multiplied. Modifying the code to detect and 'short circuit' the loop for 'special cases' (such as 'multiply by 0', 'multiply by 1', 'multiply by power of 2') imposes the extra cost of the 'special case test' as an overhead on all multiply operations - so 'on average' (assuming 'random' values are multiplied) it's faster just to let the loops run (even if '0' is being added).