
16F5x tips and tricks

Baseline PIC (10F2x,12/16F5x etc) tips and tricks

The 'baseline' PIC range (10F2xx (200/2/4/6, 220/2), 12F5xx (508/9), 16F5x/5xx (54,57,59 and 505/6, 510/9, 526/7, 526/9(t39a), 570)) are the most basic 'entry level' devices you can get. Of these, the 16F5x (54,57 and 59) have a decent i/o pin count and are also about the cheapest. The small instruction set (only 33 !), few variable registers and low program space (the 16F54 has only nn registers and 512 instruction space, the 16F57/9 only xx registers and 2k space) mean you have to pull every trick you can find to squeeze your code into the limited Flash program space.

TIP. If you intend to stick with 'Assembler level code' (rather than program in C code) with more than one device from the PIC family, I highly recommend you adopt (i.e. learn) the 49 instructions of the 'enhanced mid-range' CPU and 'emulate' any instructions 'missing' in the 'baseline' devices by using Macros (see later).
If you learn the lower end set first, chances are you will restrict your 'enhanced' PIC to less than all it's capable of

Accumulator ('W reg') Bit set & Bit clr

You can directly set and clear bits in any of the 'f' Registers but not the Accumulator - or can you ?

To Bit Clear Accumulator, use the 'ANDLW' instruction (AND Accumulator with value). For example, to BClr Acc,b0 use 'AND Acc,0xFE'.
To Bit Set, use the 'IORLW' (OR Accumulator with value). For example, to BSet Acc,b0 use 'OR Acc,0x01'.
To Bit Flip, use the 'XORLW' (XOR Acc,value) instruction = so it's bit flipping a Register that's actually the 'missing' instruction :-)
All of the above can best be supported by writing your own 'bSet', 'bClr' and 'bFlip' macros (see my own macro set, later).
NB. For some inexplicable reason, there is a specific instruction to 'Clear the Accumulator' (CLRW), although 'AND Acc,0x00' will have exactly the same effect (even to the extent of only affecting the Z status bit), or you could use the 'MOVE 0x00 to Acc' (which has no effect on any status bits) which makes the CLRW instruction doubly redundant !
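
The mask values for the three Accumulator 'bit' operations are easy to get wrong by one bit, so here is a minimal Python sketch of how a macro would compute them. The function names (bclr_mask, bit_mask) are my own invention for illustration and the Accumulator is just modelled as an 8 bit integer = this is not PIC code.

```python
def bclr_mask(bit: int) -> int:
    """Literal to use with ANDLW to clear one bit of the Accumulator."""
    return 0xFF ^ (1 << bit)


def bit_mask(bit: int) -> int:
    """Literal to use with IORLW (bit set) or XORLW (bit flip)."""
    return 1 << bit


acc = 0b1010_0101                # example Accumulator contents (0xA5)
acc_clr = acc & bclr_mask(0)     # models 'AND Acc,0xFE'
acc_set = acc | bit_mask(1)      # models 'OR Acc,0x02'
acc_flip = acc ^ bit_mask(7)     # models 'XOR Acc,0x80'
```

Note that the 'clear' mask is the inverse of the 'set' mask, which is why clearing b0 needs 0xFE whilst clearing b1 needs 0xFD.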

The Return address Stack

The 'baseline' PIC's have a Return 'stack' that is only 2 levels deep - so we have a 'ToS' (Top of Stack) and a 'BoS' (Bottom of Stack) only. Each time you CALL a subroutine, the current ToS overwrites the current BoS and the new return address is placed in ToS. RETURN will jump to the current ToS, whilst the current BoS overwrites the ToS.

There is no 'overflow/underflow' protection and whilst at power-on both stack locations (ToS and BoS) are 'undefined' (i.e. not specified in the Spec. sheet :-) ) = so you can't count on them being 0 (or 0x3FF) you can still play tricks using 'Return' without a 'Call' (and even Call without a Return) !

The first CALL will 'push' (or, to be more exact, copy) the undefined TOS to BoS and replace ToS with the (first) Return address.
A second CALL (without a RETURN from the first) will copy ToS (the first Return address) to BoS and replace ToS with the 2nd Return address.
If you perform a 3rd Call, the first Return address will be lost (although you can exit from the 2nd subroutine by 'jumping' into a 3rd (rather than calling it) and it will Return to the first)
On a RETURN, the TOS is copied to the Program Counter and ToS is then replaced by (a copy of) BoS, so both ToS and BoS now contain the first CALLed return address.
Cunning use of this fact allows you to perform Returns (from a look-up-table) without Calls.
This is a key advantage of the baseline CPU stack (the 'extended' CPU, the 16F684 for example, has an 8 location 'circular buffer' style stack and whilst RETURN without CALL is also possible on the 16F684 (since, unlike the midrange PIC's, it has no overflow/underflow interrupt), you would have to perform 8 CALL's to 'fill up' the stack with known values first).
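
The ToS / BoS copying rules above can be checked with a small model. What follows is a minimal Python sketch, assuming only the behaviour described in the text (Call copies ToS to BoS, Return copies BoS to ToS). The class name and the addresses are hypothetical.

```python
class TwoLevelStack:
    """Model of the baseline 2 level Return stack (ToS and BoS only)."""

    def __init__(self):
        self.tos = None   # undefined at power-on
        self.bos = None   # undefined at power-on

    def call(self, return_addr):
        self.bos = self.tos       # old ToS is copied down to BoS
        self.tos = return_addr    # new Return address goes into ToS

    def ret(self):
        target = self.tos         # Return always goes to ToS
        self.tos = self.bos       # BoS is copied up, BoS itself is unchanged
        return target


stack = TwoLevelStack()
stack.call(0x011)      # first CALL (hypothetical return address)
stack.call(0x022)      # second CALL, no Return in between
r1 = stack.ret()       # back from the 2nd subroutine
r2 = stack.ret()       # back from the 1st subroutine
r3 = stack.ret()       # 'Return without Call' = same place again

lost = TwoLevelStack()
for addr in (0x011, 0x022, 0x033):   # three nested CALLs
    lost.call(addr)
returns = [lost.ret() for _ in range(3)]
```

In the second half the first Return address (0x011) never comes back, which is the '3rd Call loses the first Return address' case.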

The 2 level restriction means you have to carefully consider the need for every Call. By all means divide your code up into logical 'blocks' - but DON'T just Call a 'block' .. rather consider if a block REALLY needs to be a Subroutine - i.e. can you arrange the 'main line' code so a program block is only accessed from one place - so you can JUMP to it (and JUMP back) to avoid using up one of your precious stack Return addresses ?

Remember = the 'restrictive' nature of the 16F5x stack has one massive advantage - after two successive CALLs and Returns you know the stack is 'filled' with the first Call'd Return address - so you can 'jump' to that location (using a Return without a Call) as many times as you like without having to worry about Paging bits etc.
'Clever' PIC's (like the PIC 18Fx series which has a 32 level stack) have Stack Full and Stack Empty 'flags'. If the stack overflows, the PIC typically performs a Reset, HOWEVER (on the 18Fx) on a stack underflow, the PC is reset to address 0 but this is not a 'complete' Reset (all the register contents etc. remain intact) so a 'similar' trick could be pulled
NB. in addition to the Reset address (0), the PIC 18Fx has two Interrupt locations 0x0008 (high priority interrupt) and 0x0018 (low priority interrupt). The external Interrupt 'latency' is 3 or 4 CPU CLK's.

Using the PCL

The low byte of the Program Counter (PC) can be accessed via the PCL register, but when you read or write PCL things aren't quite as simple as you might think.

The PIC CPU 'prefetches' the next instruction whilst the first is being executed. This means that when you read the PCL, it is already pointing at the next instruction.
Second, when you write to the PCL, the next instruction (which was fetched and is now waiting to be executed), has to be dropped and its replacement fetched instead. So, as you might expect, there is a 1 CPU CLK delay every time PCL is written.
What you might not expect is that when PCL (low byte of the Program Counter) is written, the 9th bit is always cleared !

A 'Jump' instruction specifies (and writes) 9 bits direct to the PC. However a Call instruction specifies only 8 bits and writes the low byte (just like the PCL), so the 9th bit is reset !
This means a 'Call', along with all other PCL writes ('Copy Acc to PCL' (or 'Add Acc,PCL', 'Clr PCL', 'BSet'/'BClr PCL' etc.)) can only access an address in the first 256 locations !

The 16F54 has 512 program locations, so the PC is 9 bits wide. This means you can 'jump' anywhere but only 'Call' to an address in the first 256 locations.
Devices with more than 512 program space use a 'paging' system (so the 2k 16F57/9 have 4 pages). Two bits in the Status register (known as the PA bits) are used so the Jump and PCL writes can access locations outside the first 512.
Yep, that's right = on a baseline PIC with more than 512 program space, the Jump, Call and writes to PCL still only affect the first 9 bits of the Program Counter == the 'extra' bits come from the Status Register (PA bits) !
A Jump always writes 9 bits direct to the PC (and the PA bits are copied to the PC top bits). Write PCL writes the low byte of the PC, clears the 9th bit and copies the PA bits to the top bits.
So you can 'jump' anywhere IN THE 'CURRENT' PAGE and 'Call' to any of the low 256 locations of that page.
Only the return stack (ToS/BoS) has the complete PC (Return) address. However, on a Return, no matter what you do to the PA bits during the Subroutine, you will arrive back at the Calling location+1 (but note that Return does NOT change the Page bits !)
Note also that when the program counter 'steps over' from one Page to the next, the Page bits are NOT UPDATED !! This means ALL Jumps, Calls and PCL writes will continue to 'point' at the previous page !
Finally, remember that the value of PCL used in a calculation is the value AFTER the PC was incremented for that instruction. So, for example, ADD 0,PCL just gets you a 1 CLK delay as the next instruction is re-fetched and does not 'stop' execution on the current location = for that you need  "ADD -1,PCL" (always assuming you are running in the low 256 locations of the Page pointed to by the PA bits, of course).
MPASM uses the $ symbol to mean 'here' (location of the current instruction). Its value is PCL-1 = so 'Jump $' will freeze the CPU 'here' = if the PA bits are pointing at the current Page = if not, then 'Jump $' gets you to the 'equivalent' location to 'here' in another Page.
NB. It's all too easy to confuse the code PAGE system with the Register BANK system = just remember, code and register space are TOTALLY SEPARATE.

The Return stack holds the full (9bit or 11bit) CALLing address, so you can always Return to the correct place - but, remember, the PA bits are not 'updated' by the Return !
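
To make the Page rules concrete, here is a minimal Python sketch of where the Program Counter ends up after a Jump, a Call or a PCL write on a 2k device (11 bit PC, 2 PA bits). It assumes only the rules described above. The function names and example addresses are hypothetical.

```python
def jump_target(addr9: int, pa: int) -> int:
    """Jump (GOTO): 9 bits from the instruction, top 2 bits from the PA bits."""
    return ((pa & 0b11) << 9) | (addr9 & 0x1FF)


def call_target(addr8: int, pa: int) -> int:
    """Call: 8 bits from the instruction, 9th bit cleared, top bits from PA."""
    return ((pa & 0b11) << 9) | (addr8 & 0xFF)


def pcl_write_target(value: int, pa: int) -> int:
    """Any write to PCL: low byte written, 9th bit cleared, top bits from PA."""
    return ((pa & 0b11) << 9) | (value & 0xFF)


# Page 0 selected: a Jump reaches the upper 256, a Call / PCL write does not
in_upper_half = jump_target(0x1A0, pa=0)
call_lands = call_target(0xA0, pa=0)
pcl_lands = pcl_write_target(0xA0, pa=0)

# PA bits set to Page 2: everything now lands in 0x400..0x5FF
jump_page2 = jump_target(0x010, pa=2)

# The PC has 'stepped over' from 0x1FF into 0x200 but PA is still 0,
# so a Jump to 0x010 goes back to Page 0, not to 0x210
stale_pa_jump = jump_target(0x010, pa=0)
```

The last case is the one that catches people out = the code is executing in Page 1 but every Jump, Call and PCL write still goes to Page 0 until the PA bits are changed by hand.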

RETURN with Acc as status

On the Return from a subroutine, the Acc is loaded with the value you specify. Wouldn't it be nice if we could Return an 'error code' ? (eg 0 = no error) - and then do an immediate 'skip on Z' / 'skip nZ' ? Well, we can't - "RETURN (ACC=) 0x00" does not set the Z flag ! Plus, all the 'skip' instructions (inc/dec and skip, bit test and skip) only 'work' on Registers anyway. Of course we can always 'TEST Acc' before doing the 'skip on Z/nZ', but that 'wastes' an instruction ..

Is there a one instruction 'skip on (Return Acc set) status' ?
Well, yes, but it's rather dangerous - and it only 'works' when the Return code (or at least the 'error handler' code) is in the low 256 address locations.
When Returning, set the Acc to the 'error code', with 0 for 'no error' or 0xNN for error (where NN = number of instructions to skip i.e. where the error handler code is). On the Return, your code executes an "ADD ACC,PCL" instruction to 'jump on error'.
If Acc was returned as '0', this just means execution continues 'in line'. Anything other than 0 will 'jump' to some further address location (i.e. the error routine).
Of course this only 'works' if the Error routine address destination (result of the ADD) is in the low 256 locations of the current page, since ADD ACC,PCL always clears the 9th bit of the program counter, EVEN IF ADDING 0.
See also "mini-subroutine using 'Reg return'" below
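
The destination of an 'ADD ACC,PCL' is worth working through by hand at least once. Below is a minimal Python sketch, assuming the rules given earlier (PCL already points at the next instruction when it is read, and the 9th bit is cleared when it is written). The function name and the addresses are hypothetical.

```python
def add_acc_to_pcl(addr_of_add: int, acc: int, pa: int = 0) -> int:
    """Where execution continues after 'ADD ACC,PCL' placed at addr_of_add."""
    pcl = (addr_of_add + 1) & 0xFF          # PCL has already been incremented
    low_byte = (pcl + acc) & 0xFF           # 8 bit add, any carry is lost
    return ((pa & 0b11) << 9) | low_byte    # 9th bit is always cleared


no_error = add_acc_to_pcl(0x020, 0x00)      # continues 'in line'
error_5 = add_acc_to_pcl(0x020, 0x05)       # skips 5 instructions
freeze = add_acc_to_pcl(0x020, 0xFF)        # adding -1 stays on the ADD itself
upper_half = add_acc_to_pcl(0x120, 0x00)    # ADD located above the 256 boundary
```

The last line shows why the trick is 'dangerous' = an ADD of 0 executed at 0x120 does not fall through to 0x121, it lands at 0x021.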

Mini-subroutines using 'Acc return'

Often a 'main' Subroutine will need to 'Call' some common code as a 'mini-subroutine' - however chances are you have already 'used up' the 16F5x two level Subroutine Return stack (especially likely on the 16F57/9 which has 2k instructions but still only a 2 level Return stack). If the 'mini-subroutine' code doesn't modify the Acc, you can 'Return via Acc' (COPY Acc,PCL) so long as your return destination is in the low 256 locations

To 'call' the mini-subroutine, you load the Acc with PCL, ADD 2 and 'Jump Sub' (rather than 'Call Sub').
Of course you can only Jump (and 'Return via Acc') to (and from) the first 256 locations of the current page (i.e. same area where the 'main' subroutine resides), plus the 'Call' now costs 4 CLK's (although the 'Return' is still only 2 CLK) and, since 'Return' is 'COPY Acc,PCL' you don't get the chance to 'return' with a different value in the Acc (so can't use it for simple Data Tables, although a 'word' based Table (which loads data into 2 or more Registers with each 'call') would be a possible use)

Mini-subroutines using 'Reg return'

What to do if the mini-subroutine modifies the Acc ? Well you can place the 'Return' address into a register, at the cost of an extra CLK = 'Call' is LOAD Acc, ADD 3; COPY Acc,rtnReg; JMP sub; (4 instructions, 5 CLK) ..... 'Return' is COPY rtnReg,Acc; COPY Acc,PCL; (3 CLK).

The 'power' of the mini-subroutine is that it can modify its own 'return' address allowing a direct 'jump on (error) status' - unlike the 'normal' RETURN where one or two extra instructions have to be used after the Return to do a 'jump on status'.

Subroutines above the 256 boundary

A JMP (GOTO) specifies 9 bits of address, so you can always 'jump' to a location above the 255 'boundary'. The 'problem' is returning = if you don't Call you can't Return (via the stack). The 'trick' is thus to 'Call' to a location in the first 256 which then immediately 'jumps' (at the cost of +2 CLK's) to the 'real' subroutine code 'beyond the 255 boundary', from which you can always Return just fine.

This trick 'works' for normal subroutines but not for Data Tables == there is no problem with the 'return with value' table entries being in the upper 256 address space - it's the calculated 'jump' into the table that's the issue (when the calculation writes PCL you always end up with bit9 cleared so can only get to the low 256 locations == see Data Tables, below)

Using the PCL reg as a data source

Since the low 8 bits of the Program Counter are always 'readable' (via the PCL register) you can use the PCL as a 'source' in any 'normal' register instruction (as touched on above, you can 'INC Acc' by subtracting PCL then adding PCL (to increment by 2, use 'SUB PCL, NOP, ADD PCL')).

If your Add / SUB instruction is in a program location where the PCL == '0x01', then to INC Acc you only need to 'ADD PCL' (and the same for DEC Acc = SUB PCL).
However the overhead of CALLing / RETurning to/from the 'right' PCL location (not to mention saving the 'result' as the Acc is reloaded on a Return) gains us nothing for INC and DEC, although for some rather more complex arithmetic operations (eg calculating n! (N factorial) for probability functions) it may be worth while

In each 'bank' of 256 addresses, there are some 'golden' bit pattern locations (for example, to use the PCL as a 'bit' selector, you could call / goto location 0000 0001, 0000 0010, 0000 0100, 0000 1000 and so on (or 1111 1110, 1111 1101, 1111 1011 ...)). Of course using these locations does rather 'fragment' the memory map, however the PIC16F54 has 512 locations, so there are 2 complete sets of 256 PCL values - and GOTO specifies a 9 bit address, so (unlike for the 57/59), for the '54 no actual 'bank addressing' is needed to get to any of these (remember = a CALL instruction can only address the first 256 locations, RETURN is the full 9 bits (or 11bits for 57/59)).

Note also that the very top location of Flash memory (0x1FF for the 16F54) is the power-on start address (and if 0x1FF contains a NOP, the next instruction to be executed will be at location 0x000, i.e. the program counter will 'overflow' and loop back to zero) = so we might want to avoid playing clever tricks with these two locations (you can still find PCL=0xFF at 0x0FF and PCL=0x00 at 0x100).

You can, of course, write to the PCL. Unlike 'goto' (where the destination is fixed at compile time) writing the PCL allows a 'computed GoTo', although you are limited to the first 256 address locations (writing the PCL always clears the 9th bit of the program counter to 0) of (the currently selected) address bank

The most obvious reason for using a 'computed goto' is to 'fetch data' from that location. Whilst this is facilitated by the 'Return with data in Acc' instruction, it's at this point we discover that there is no 'computed Call' instruction - however there is a way it can be done (see below)

Data Tables

To conserve Register space when outputting ascii character shapes from a data table to a display (7 segment, dot matrix, video), the register containing the 'character code' can also be used as the 'jump offset pointer' (assuming the code that calculates the table offset from the character code is 'reversible')

Computed Jump into a DataTable

A data table is built using a stack of 'RETURN with value (in Acc)' instructions (where 'value' is the data). To 'read' the data you could CALL the data location 'direct' = which is fine if you know what location to CALL at compile time, however at compile time you also know the data value = so why would you not just 'load' it directly ??

What we need, of course, is a 'computed CALL'. Needless to say, no such instruction exists

Instead we start by performing a 'CALL' to the start of the table subroutine with the 'offset' into the data table in the Accumulator. The subroutine then performs a 'computed Jump' using the 'ADD Acc to PCL' instruction. This takes us to the actual data location where a 'RETURN with value (in Acc)' instruction returns us directly to the 'main line' code.

Since writing the PCL always clears the 9th bit of the program counter, the Data Table values must always be within the low 256 locations - on the 16F54 that means within the bottom half of the 512 program memory (on the F57/9, the low 256 of each bank).

The maximum size of any DataTables is thus 255 (256 minus the computed Jump instruction at the start of the table) per 'bank'. The 16F54 has 1 'bank', so supports a max data table of 255 values (less any used for Subroutines), the 57/59 both have 4 banks (i.e. 2k of program space) so are limited to 4x255 = 1020 bytes of Data Table.

For the 16F57/9 with multi-bank Data Tables, the 'calling' code would need to 'pre-set' the PA bits (and thus the Data Table bank) and then Call the 'jump to table' subroutine code which would be 'duplicated' in each of the 'banks'. The PA bits would thus select the appropriate table code in the appropriate bank.

NOTE that the PA bits are NOT 'updated' automatically by 'where you are' = they always contain the values you 'set', even after you Return to a different bank (or the Program Counter 'steps' into a new bank)

Conserving low address space

To conserve low address space, the bulk of most Subroutine code can be placed into the high address space - the initial subroutine Call must be to low space but you can then 'divert' control by using an immediate Jump into the actual code in high space (of the same bank)

This is possible because a Jump (and Return) supports 9 bits, whilst a Call supports only 8 bits (as do computed jumps, using 'Add Acc,PCL' etc).

Speeding up the Call sequence

The 'classical' method of speeding up a data table read sequence is to 'unwind the Calls' i.e. write each Call 'longhand' as an individual snippet of code. This eliminates loops / pointer increments, at a 'cost' of extra program space.

The main requirement for faster look-up is typically a need to 'output' the values looked up 'as fast as possible'. So my example below puts the data onto a set of PORT pins (for example, lookup a series of character 'shapes' for a 7-segment display - or even for a video display (perhaps with the support of an external 'shift register')).

Take, for example, the following 'unwound' code snippet :-

; Code snippet (duplicate this for however many Calls you need). Assumes the Table pointers have been pre-calculated and placed into INDF reg set.
; Each snippet loads the (next) INDF Register and then Call's the table to do the look-up and finally output a byte
; (note this is still a 'computed Call', so the sequence can always be ended early by setting the RegN contents to cause a jump outside the table)
; Numbers in brackets () below = the number of instruction cycles used
    Copy RegNx to Acc   ;get the next reg offset (1)
    Call table          ;Call the table (2)
; table performs a computed jump (2) to the required entry which then Returns with data (2)
    Copy Acc, PORT      ;finally, o/p the data (1)
;
;... (next 3 instruction snippet, RegN01..Nnn )
;
;table
    Add Acc, PCL        ; jump to the requested entry (2) [or past the end of table]
;
    RETURN with data    ;each entry (2)
    ...

The above code will output 1 data byte every 8 CPU clk cycles at a 'cost' of 3 extra instructions for each output after the first. If we want to output (say) a date and time (to 1/100ths of a second), then we have 7 date + 6 time (dd mmm yy hh:ss.ss) with spaces and delimiters = 16 'shapes'. This 'costs' 3x16 = 48 instructions and takes 128 CLK cycles
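
The cycle and instruction counts quoted above can be re-derived from the numbers in brackets in the listing. A minimal Python sketch (the dictionary keys and function name are just my own labels for the listing steps):

```python
# Instruction cycles for one 'unwound' output, taken from the (n) notes in the listing
SNIPPET_CYCLES = {
    "copy_reg_to_acc": 1,
    "call_table": 2,
    "add_acc_to_pcl": 2,     # the computed jump inside the table
    "return_with_data": 2,
    "copy_acc_to_port": 1,
}
INSTRUCTIONS_PER_SNIPPET = 3  # Copy, Call, Copy (the table code is shared)


def unwound_cost(n_outputs: int) -> tuple[int, int]:
    """Return (CLK cycles, program instructions) for n unwound outputs."""
    cycles = n_outputs * sum(SNIPPET_CYCLES.values())
    instructions = n_outputs * INSTRUCTIONS_PER_SNIPPET
    return cycles, instructions


cycles_16, instructions_16 = unwound_cost(16)   # the 16 'shape' date/time example
```

For the 16 'shape' example this gives 128 CLK cycles and 48 instructions, matching the figures above.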

Note that the output sequence can be terminated 'early' (i.e. before the entire set of 'snippets' completes) with a jump (past) the end of the table.
However this will leave an 'orphan' Return address on the stack, not a problem for any baseline CPU BUT the midrange Stack will quickly 'fill up' and when it overflows execution will be aborted (with a jump to 0 ?)

Speeding up the snippet

What slows the reading of the data table (above) is the need for a 'multi-step' "Call, computed Jump, Return" sequence. The computed Jump is 'forced' (there is no way to compute a Call destination), and the "Return with value" is the only way to load the ACC and execute a program branch in a single instruction. So any 'saving' has to be in the 'Call'. So, what happens if we skip the 'Call' and just let the table 'Return with value' ?

Well, on all baseline PICs, Return is ALWAYS to the current ToS. On a Return, ToS is over-written by BoS (so then ToS = BoS). If we Return again without a Call, this (and all subsequent) Returns will go to the exact same place !

So if we can set-up BoS to some specific Return address BEFORE the first computed Jump into the table (i.e. before the first Return), then we can skip the Call !

To setup the stack so all Returns arrive back at 'here', all we have to do is execute a CALL from 'here' twice without performing a Return. Once this is 'set up', you can do a 'computed' Jump straight into the table from 'anywhere' knowing the table Return will always arrive back 'here'.
This trick works fine with the baseline PICs. You can't do it with the 'midrange' PIC's (the midrange CPU will 'abort' a Return from an 'empty' stack by performing a 'reset' i.e. execute a jump to address 0x000).
However you can 'sort of' do it with other 'more clever' PIC's that deal with an empty stack return by jumping to a hard-wired 'exception' address (where you could have code waiting).

Terminating the data read loop

Before going further, we have to note that the 'drawback' of a 'Return without Call' is that the Return will keep going back to the exact same address. This means you can't write an 'in-line' data read sequence, so it looks like what we gain by skipping the Call will be lost by 'counting' the Table loop ... or is it ?

What happens if we write :-

; Assume we have setup ToS=BoS so that every Table "RETURN with value" will arrive back 'here', with data in Acc
    Copy Acc PORT       ; o/p the data (1)
    Copy INDF to Acc    ; get the offset to the next data byte (1)
    INC FSR             ; inc. the INDF pointer (1)
    Add Acc to PCL      ; jump straight into the 'Return with data' table (2)
; end of loop (when Acc was zero)
    JUMP backToMainCode
; start of table
    RETURN with value 1 ; (2)
    RETURN with value 2 ; (2)
    ...
    RETURN with value N ; (2)
;

Well, the above code will 'run for ever' outputting 1 data byte every 7 CPU clk cycles, but only so long as the jump destination (PCL+Acc) is within the 'Return with value' data table. So, to terminate the sequence, we just set the last offset value in the Index register stack to 'point' somewhere other than within the data table.

Note that the above code must all be in the low 256 locations - when the PCL is written, the 9th bit of the Program Counter is reset to zero = that means 'add offset to PCL' will ALWAYS take you to somewhere in the first 256 address locations.
It's to be noted that Zero (0x00) is often used to 'terminate' character strings (0 is not a 'normal' ascii value) and adding 0 to the PCL 'jumps' to the next instruction i.e. 'falls through'.

An alternative method of 'terminating' the loop is to setup the Timer/Counter to 'Interrupt' after however many bytes you want to output (i.e. set Timer Count = 7 * the output count)

NB. 'Return without Call' depends on setting up an 'empty' stack so you know where the Return is going to be

The ToS=BoS 'Return here' loop code

The stack consists of 2 registers, ToS (Top of Stack) and BoS (Bottom of Stack). On a Call, ToS is copied to BoS and CALL+1 is copied to ToS. On a RETurn, ToS is copied to PC and BoS copied to ToS.
So, to setup ToS=BoS 'all' we have to do is execute a 'double stack push' (or 'double Call') at the very 'top' of the data look-up loop.
; Need to ensure that 'Return without Call' comes back to here, so first Call has to 'double push'
    CALL dblpush
;
; Table Returns to 'here' always, with data in Acc (2)
;
    Copy Acc PORT       ; o/p the data (1)
    Copy INDF to Acc    ; get the offset to the next data byte (1)
    INC FSR             ; inc. the INDF pointer (1)
    Copy Acc to PCL     ; jump straight into the 'Return with data' table (2)
; that's it - a Return takes us back to the Copy Acc,PORT instruction
; Data Table
; first the 'end' destination
    JUMP somewhere      ; to terminate the loop, INDF 'points' at this instruction (which jumps out of low 256 address bytes)
; Now the double push code
dblpush:
    CALL $+1            ; Call next location and push the first (dblpush Call) Return address down to stack level 2
; the first time we arrive here, Call+1 will be in ToS and dblpush BoS
    RETURN 0xnn         ; When this is executed the first time, Return is back to here, with TOS = BOS (the CALL dblpush Return address)
; The second time the Return is executed, return goes to the 'Call dblpush' Return address (which is now left in both TOS and BOS)
; (0xnn is the value returned in Accumulator (and 'dummy' output to PORT))
; now the actual Data table values, RET with Data entry (1 instruction, 2 Clks)
    RETURN with value
    ...
; end of data table
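
The 'double push' is easier to believe once you have stepped through it. Here is a minimal Python sketch that replays the stack operations of the listing above, assuming the ToS / BoS rules already described. The class name and the two addresses (HERE, RETURN_INSTR) are hypothetical = only their relationship matters.

```python
class TwoLevelStack:
    """Model of the baseline ToS / BoS Return stack."""

    def __init__(self):
        self.tos = None   # undefined at power-on
        self.bos = None

    def call(self, return_addr):
        self.bos = self.tos
        self.tos = return_addr

    def ret(self):
        target = self.tos
        self.tos = self.bos
        return target


HERE = 0x001           # hypothetical: the location after 'CALL dblpush'
RETURN_INSTR = 0x008   # hypothetical: location of 'RETURN 0xnn' (dblpush + 1)

stack = TwoLevelStack()
stack.call(HERE)            # CALL dblpush  -> ToS = HERE
stack.call(RETURN_INSTR)    # CALL $+1      -> ToS = RETURN_INSTR, BoS = HERE
first = stack.ret()         # 1st RETURN lands on the RETURN instruction itself
second = stack.ret()        # 2nd RETURN arrives back at HERE

# From now on every table 'RETURN with value' arrives at HERE, no Call needed
table_returns = [stack.ret() for _ in range(5)]
```

After the second Return both ToS and BoS hold HERE, so the loop can run for as many table entries as you like.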

If this all has to fit within 256 bytes, then the max. data table is now 256 - 6 - 2 = 248 entries (so if the table is storing ascii character shapes in a 8x8 font, we have space for 248/8 = 31 characters, just enough for my PIC Time and Date on screen display (OSD) )

To avoid the need to use an Index register for the final 'goto Jump' and the actual 'JUMP somewhere', the Table can be modified so that the last entry outputs the 'end of line' (space) and doesn't Return, as below :-

0x0FF:                  ;last location in the table == this is where the 'end of line' jump arrives (which means Acc = 0xFF)
    Copy Acc, PORT      ; output FF to PORT (or use CLR PORT if we want 00 output instead) = note this is 2 clk's 'early', so may want to add a couple of NOP's first
; drop through to the rest of the code ....

Maximum DataTable read rates (no Return, no Call)

There is just one further way to maximise the data table read rate = 'unwind' the Return loop by having each Table entry output its own data and then jump (in)directly to the next entry to be output. Unfortunately, this 'multiplies up' each data table entry by N extra instructions, so is of limited use (i.e. only for small tables).

; each entry in the table
; start by outputting the byte
    Load data to Acc    ; get data (1)
    Copy Acc PORT       ; o/p the data (1)
; now jump to the next table entry
    Copy INDF to Acc    ; get the offset to the next data byte (1)
    INC FSR             ; inc. the INDF pointer (1)
    Copy Acc to PCL     ; jump straight to the next table entry (2)

Above outputs 1 byte per 6 CLK's, however each table entry is 5 instructions, so the max. table size is now 256/5 = 51 entries

The 16F54 has space for one table, however the 57/9, with 2k instruction words, has space for 4 such tables. Dedicating all 4 tables to ascii Font store means we have 204 entries which is enough room for only 25 characters (in an 8x8 font). This allows a numeric date/time display (0-9, -, :, am, pm, space etc).
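
The table capacity figures follow from simple integer division. A minimal Python sketch of the arithmetic (the constant and function names are my own):

```python
LOW_LOCATIONS_PER_PAGE = 256   # PCL writes can only reach the low 256 of a page


def table_entries(instructions_per_entry: int, pages: int = 1) -> int:
    """Whole table entries that fit in the low 256 locations of each page."""
    return pages * (LOW_LOCATIONS_PER_PAGE // instructions_per_entry)


def font_characters(entries: int, rows_per_char: int = 8) -> int:
    """Whole characters that fit when each one needs rows_per_char entries."""
    return entries // rows_per_char


one_page = table_entries(5)             # 16F54, 5 instructions per entry
four_pages = table_entries(5, pages=4)  # 16F57/9
chars = font_characters(four_pages)     # 8x8 font
```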

To squeeze additional characters into the font table, see 'Character FONT (shape) packing' below and my PIC18F OSD display project

Max data output, pre-loading registers from the data table

Finally we arrive at the maximum possible output speed. By pre-loading the Registers with the data to be output we can achieve 2 clk's per byte

    Copy regNx Acc
    Copy Acc PORT
; 2 clks, repeat above for as many byte outputs as necessary

The maximum possible speed using 'indexed' output is 5 clks, however this is done at the expense of not testing for an 'end of line', so some 'external' means of stopping the loop will be needed :-)

    Copy INDX Acc       ;1
    Copy Acc PORT       ;1
    INC FSR             ;1
    JMP $-3             ; 2 clks
; 5 clks, but no end of line ...

If we use register contents = 0 as an 'end of line flag' (or the FSR count 'overflow' into some significant bit is used) we get a 'controlled' loop of 6 clks (although we might as well be using the table)

Using Register contents 0 as end of line flag (0 is output on last cycle)

; top of loop (6 clks per loop, first byte output in 2)
    INC FSR             ;1, Inc the FSR (affects Z)
; main code enters here with start FSR
    Copy INDX Acc       ;1, (sets Z)
    Copy Acc PORT       ;1, (no effect on Z)
    Skip if zero        ;test for Acc=0 = end of line (1, if not end)
    Jump Loop           ;not exit, keep on looping (2)
; fall through to exit

Using the FSR reaching some significant bit = end of line (last FSR is not used, not output)

; top of loop (6 clocks per output, first byte output in 2)
; main code enters here with start FSR
    Copy INDX Acc       ;1
    Copy Acc PORT       ;1
    INC FSR             ;1 aim at next Reg
    BIT n, skip clear   ; has the count reached the significant end bit ? (1 if not)
    Jump Loop           ;not exit, keep on looping (2)
; fall through to exit

Direct shape output

For an OSD / VTI (On Screen Display / Video Time Inserter) the number of characters 'per line' is typically low and the display position is usually 'fixed' on the far right of the screen (i.e. at the 'end of the scan line'). In this case, rather than use the character code registers as 'pointers' to the shape, the shapes can be 'looked up slowly' before the output point is reached and stored in a second set of registers. Output can then be achieved at CPU clk rate (using ROTL PORT to shift bits) with only a small inter-character gap delay

; o/p from RegN
    Copy RegN, Acc      ;interchar gap (1 pixel time, however this will be interchar gap #2 )
    Copy Acc, PORT      ;o/p first bit of a 7 bit font
; now o/p 6 additional bits
    ROTL PORT
    ROTL PORT
    ROTL PORT
    ROTL PORT
    ROTL PORT
    ROTL PORT
; the last bit in the RegN may be a flag bit, so we have to zero the PORT
    CLR PORT            ;set space = interchar gap #1

The above is 9 CPU Clk's per byte output but avoids the need for an external Shift Register

Remember, from above, bytes can be output at 2 clk rate. An external shift register would thus be running at 4 x the CPU clk rate i.e OSC rate. See my On Screen Display project pages

Packing text (ASCII character codes)

There is a whole field of study dedicated to compression, compaction and encoding of data. Whilst I consider here only the 'simple' methods applicable to text, even this can become complex (for example, 'word substitution' (using short-cut codes) demands a knowledge of the likely text (eg narrative, program code) that will need to be encoded). Hopefully I provide enough 'pointers' to get you started

All PIC's suffer from the massive drawback of having very few registers (RAM). If we want to store multi-character ascii strings as 'variables' it makes sense to 'pack' the data.

In the 'old days' of early computers, often 3 characters were packed into 2 bytes (so 33% space saving), however this was achieved by reducing the character set to 40 ! (typically 26 uppercase only characters, 10 numbers and 3 punctuation (typically space, dot, carriage return/line feed) plus an 'escape' code (which allowed other characters to be defined 'long hand')).
The '3 into 2' trick works because although 40 characters requires 6 bits, 40 x 40 x 40 is only 64,000 (i.e. less than 64k) and only needs 16bits, not 6+6+6 = 18bits. To store the '6 bit' ascii, you place the first char into a 16bit word, multiply by 40, add the next char, multiply by 40 and then add the final character.
To 'unpack' the ascii, the 16 bit value has to be successively divided (1st char = quotient of full value divided by 1600, 2nd char = quotient of first remainder divided by 40, 3rd char = remainder). All of which is fine when you have a computer with a 1 cycle 16bit hardware multiply/divide ALU (but limited storage).
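
The multiply / divide steps can be written out in a few lines. This is a minimal Python sketch of the old '3 into 2' scheme as described above. The ordering of the 40 symbols in ALPHABET (and the use of ESC as the 'escape' code) is my own assumption = real systems each had their own code table.

```python
# Hypothetical 40 symbol set: 26 upper case, 10 digits, space, dot, newline, escape
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .\n\x1b"


def pack3(chars: str) -> int:
    """Pack exactly 3 characters from ALPHABET into one 16 bit value."""
    if len(chars) != 3:
        raise ValueError("need exactly 3 characters")
    value = 0
    for ch in chars:
        value = value * 40 + ALPHABET.index(ch)   # multiply by 40, add next char
    return value


def unpack3(value: int) -> str:
    """Recover the 3 characters by successive division."""
    first, remainder = divmod(value, 1600)    # 1st char = quotient of /1600
    second, third = divmod(remainder, 40)     # 2nd = quotient of /40, 3rd = remainder
    return ALPHABET[first] + ALPHABET[second] + ALPHABET[third]


word = pack3("PIC")
largest = pack3(ALPHABET[-1] * 3)   # worst case = three of the highest code
```

The worst case value is 63,999, which confirms that every packed triple fits into 16 bits.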

These days we typically require lower-case characters as well as extensive punctuation (and the low end PIC devices lack multiply/divide circuits :-) ). Most ascii sets contain at least 96 characters (7 bits), however there are still a few techniques that can save space when large blocks of text (such as a screen display) have to be stored.

8 into 7

The first is the simplest - if only 7 bits are used, each byte has one 'spare' bit, so 7 bytes have 7 spare bits and these can be used to store an 8th character (saving 1 byte in 8).
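One possible layout for '8 into 7' is sketched below in Python. The choice of which spare bit holds which bit of the 8th character is my assumption (bit i of the 8th character goes into bit 7 of byte i); any fixed mapping would do.

```python
def pack8(codes):
    """Pack eight 7 bit character codes into seven bytes."""
    assert len(codes) == 8 and all(c < 0x80 for c in codes)
    extra = codes[7]                      # the 8th character rides in the spare bits
    return bytes(codes[i] | (((extra >> i) & 1) << 7) for i in range(7))

def unpack8(packed):
    """Recover the eight 7 bit codes from seven packed bytes."""
    assert len(packed) == 7
    extra = 0
    for i, b in enumerate(packed):
        extra |= (b >> 7) << i            # collect the spare bits back together
    return bytes([b & 0x7F for b in packed] + [extra])
```

On the PIC itself the 8th character can be rebuilt with one Rotate per byte as the other seven are read out.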

6 into 5

The next trick also makes use of the 8th bit - however this relies on the fact that 'on average' English words are a bit less than 5 characters long. Instead of storing the 'space' (at the end of the word) as a character, we store it in the 8th bit, thus saving one byte in 6 (on average)
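A minimal Python model of the 'space in the 8th bit' idea follows. It assumes plain 7 bit ASCII input, and (my choice) a second space in a row is stored as an ordinary character.

```python
def pack_spaces(text):
    """Drop each space that follows a character and flag it in bit 7 instead."""
    out = bytearray()
    for code in text.encode("ascii"):
        if code == 0x20 and out and not out[-1] & 0x80:
            out[-1] |= 0x80               # previous character now 'carries' the space
        else:
            out.append(code)
    return bytes(out)

def unpack_spaces(data):
    """Re-insert a space after every byte that has bit 7 set."""
    chars = []
    for b in data:
        chars.append(chr(b & 0x7F))
        if b & 0x80:
            chars.append(" ")
    return "".join(chars)
```

For "the quick brown fox" (19 characters, 3 spaces) this stores 16 bytes.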

3 into 2

The final trick is to note that in 'normal' text very few characters are 'upper case', numbers or 'punctuation marks'. Reducing the 'common' character set to the 26 lower-case letters plus 6 additional characters (obviously '.', 'space' and EOL, plus whatever supports the text that's going to be shown (';:-' or '')) means we only have to store 5 bits for each 'common character'.

We can thus store 3 characters in 2 bytes (15 bits) with one bit 'spare' as a 'flag' (to distinguish the first byte of a packed 3 character group from a plain 7 bit character). This is similar to the '3 into 2' packing used in the 'old days', but without the multiply/divide overhead.

Packing is on 'complete sets of 3' successive 'common characters' ('space' is a 'common character', so strings consisting of lower case with spaces will all get packed). Occasionally, individual (or sets of 2) common characters will have to be left 'un-encoded' (7 bits spans the whole character set) so compression will be 'less than' 1 byte in 3 (especially for text without long strings of 'common characters'). Unpacking is as follows :-
Get a byte to the Accumulator
If bit 7 is set, it's the first byte of a word, so get the second byte and unpack 3 characters, leaving last character in Accumulator
Save the byte in the Accumulator
Loop to get the next byte
Note the flag bit is '1 means packed byte', so we don't have to do any unpacking when the end of line '0x00' code is seen
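The packing and unpacking steps can be modelled in Python as below. The 32 entry COMMON table (and its order) is purely my assumption for illustration, as is the bit layout: flag in bit 15, then three 5 bit codes.

```python
# Hypothetical 'common character' set: 26 lower case + space . EOL ; : -
COMMON = "abcdefghijklmnopqrstuvwxyz .\n;:-"
assert len(COMMON) == 32

def pack(text):
    """Pack runs of 3 common characters into 2 bytes, others as plain 7 bit bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        group = text[i:i + 3]
        if len(group) == 3 and all(ch in COMMON for ch in group):
            a, b, c = (COMMON.index(ch) for ch in group)
            word = 0x8000 | (a << 10) | (b << 5) | c   # bit 15 = 'packed' flag
            out += bytes([word >> 8, word & 0xFF])
            i += 3
        else:
            out.append(ord(text[i]))                   # plain character, bit 7 clear
            i += 1
    return bytes(out)

def unpack(data):
    """Follow the unpack steps: test bit 7, fetch a second byte if it is set."""
    chars = []
    it = iter(data)
    for byte in it:
        if byte & 0x80:
            word = ((byte & 0x7F) << 8) | next(it)
            chars += [COMMON[(word >> 10) & 31], COMMON[(word >> 5) & 31], COMMON[word & 31]]
        else:
            chars.append(chr(byte))
    return "".join(chars)
```

Note that this sketch assumes the input is 7 bit ASCII and does not handle the 0x00 end of line code specially.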

Actual compression (encoding) of text

So far (above) I have only looked at simple 'save 1 byte in N' packing approaches, however the last of these (3 into 2) 'points the way' to the use of 'compression coding'. Of course a real compression scheme (eg .zip) is likely to far exceed any benefit gained here, however if we are only using 96 ascii values in a 7 bit (128 value) code range we have 32 'unused codes' (0x00-0x1F) - and these can be used as 'key codes' to replace 'common words' in the text you are (currently) storing. The good bit about this trick is that you can use it IN ADDITION to the '8 into 7' / '6 into 5' or '3 into 2' method above !

Of course it's going to make '3 into 2' rather less efficient, since the 'common character' strings will be 'split up' every time a 'common word' is removed and a 'key code' inserted (except when the key code turns out to be one of the more 'common characters' !)

Holding 96 ASCII characters in a 7 bit value means we have 128-96 = 32 'spare' codes, although, in practice only 31 are 'available' (since 0x00 is reserved for the 'end of line code'). To avoid the need for complex programming, the 31 codes could be simply assigned to the first 31 'duplicate' words found in the 'text store'

To maximise the space 'saved', the first occasion of a duplicate word has to be replaced as well. This means replacing the word with the code and then 'shuffling up' the rest of the text in the store.

The drawback of this scheme is that only the first 31 duplicates are 'compressed' - so more frequent duplicates that occur later in the text will be left 'uncompressed' - and when the text being stored is for a screen display the 'duplicate store' can quickly become 'stale' or out of date (as duplicated words are 'scrolled off the top' of the display :-) )

One alternative to the 'just use the first 31' is to pre-program 'common' English words (such as 'the'), however (and especially if it's a terminal display screen) the text that needs to be stored may not be 'common English'.

If you want to define your own 'common word' set, I suggest you analyse the contents of some text-like files (for example your web page 'source' files) which you expect your PIC to store (eg for display).
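To get a feel for how much 'key code' substitution saves on your own text, something like the following Python sketch can be run on a PC first. The function names are mine, words are assumed to be separated by single spaces, and key codes are handed out from 0x01 upward (0x00 stays reserved for end of line).

```python
def build_keys(text, max_keys):
    """Assign a key code to each of the first max_keys words seen more than once."""
    seen, keys = set(), {}
    for word in text.split(" "):
        if len(word) > 1 and word in seen and word not in keys and len(keys) < max_keys:
            keys[word] = chr(len(keys) + 1)    # 0x01, 0x02, ...
        seen.add(word)
    return keys

def compress(text, keys):
    """Replace every occurrence (including the first) of a keyed word by its code."""
    return " ".join(keys.get(word, word) for word in text.split(" "))

def expand(packed, keys):
    """Swap each key code back to its word."""
    words = {code: word for word, code in keys.items()}
    return "".join(words.get(ch, ch) for ch in packed)
```

Usage: keys = build_keys(text, 4), then compare len(compress(text, keys)) with len(text). Remember the word table itself also costs storage.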

Finally there is the 'variable word Look-Up-Table' approach. Only really possible with the 'high end' PICs with lots of program space (and some EEPROM storage), you program the PIC to perform its own 'common word analysis', choose its own 'key words' and 'update' its word substitution LUT (in EEPROM) 'on the fly' :-)

Of course, by the time we start building solutions with high end PICs we will, no doubt, be using external serial RAM storage - so the 'lack of space' problem goes away (and is replaced with output speed issues :-) )

Character FONT (shape) packing

Another typical requirement is to store 'bit maps', for example to define character shapes. A 'basic' character font is 8 lines of 5 bits (8x5), which can be simply stored 'packed' into 3 16bit words (so 6 bytes, rather than 8). Flipping the bit map by 90 degrees (5x8) reduces the count to 5 bytes, however 'unpacking' becomes 'a bit more complex' :-)
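The '90 degree flip' is easy to do off-line when preparing the font table. A small Python sketch follows; the bit ordering (bit 4 of a row = leftmost pixel, bit 0 of a column byte = top scan line) is my assumption.

```python
def rows_to_columns(rows):
    """Turn 8 scan lines of 5 bits into 5 column bytes of 8 bits."""
    cols = []
    for x in range(5):
        col = 0
        for y, row in enumerate(rows):
            if row & (0x10 >> x):          # pixel x of scan line y is set
                col |= 1 << y
        cols.append(col)
    return cols

def columns_to_rows(cols):
    """The 'unpack' direction: rebuild the 8 scan lines from 5 column bytes."""
    return [sum(((cols[x] >> y) & 1) << (4 - x) for x in range(5)) for y in range(8)]
```

The nested loop in the 'unpack' direction shows why read-out from the flipped form costs more than a straight table look-up.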

Since the shape look-up table will be in program (Flash) space, the constraint is typically read-out speed, not space. So splitting the table into 'scan lines' is likely to be the better approach, even if it results in poor space utilisation

When space is tight, it's worth noting a few bytes can still be saved (without adding overhead to the 'read out') by 'pointer manipulation' (assuming there's enough time between read-outs to do so)

The most obvious is 'space' - we only need to store 1 byte (because every scan line through a space is the same) = so the 'space' pointer need not be changed between scan lines.
Less obvious is the 'substitution' of letters/numbers (for example, numeral 1 for lower case l, numeral 0 for uppercase O)

It's also to be noted that an 8x8 font TYPICALLY includes the 'inter-character gap' i.e. the 'last bit' in the byte is always 'blank' for text (but not graphics characters). Whilst there is no time to set/reset bits during output, there's nothing to stop you 'dropping' the last bit from the external circuits (shift register) by 'not turning on' that i/o line (in text mode) - which opens up the possibility of using that bit as a 'flag' ...

Next up, it's worth noting that the 'top' few scan lines through many lower-case letters are 'blank' (in an 8x font, 14 characters (acemnopqruvwuz) all start with 3 blank lines). Further, very very few characters have a 'descender' (gpqy_), all of which points to possible savings

The problem is always with read-out speed. We will always want to 'clock out' the font shapes at the maximum possible speed, which means we have no time during actual output to 'check bits' or 'decide to choose another byte'.
If, however, output is performed using 'pointer' look-ups (variable jumps into the font table), there may be time to 'adjust' the pointers in some simple ways before starting the output. For example, in text mode, the 8th bit of the font could be used as a 'double me' flag (many characters have 'duplicate' scan lines in their shapes)

For more on text and font packing, see my PIC OSD and PIC VGA display projects

Timing delays using Return

A CALL to any existing RETURN 'costs' a single new instruction but gets you a 4 CLK delay == but don't forget - the Acc is overwritten (although Status is unaffected). Since the Acc is 'lost' anyway, to get longer delays, you can CALL into any code prior to a RETURN that only affects the Acc (but you might need to watch out for Status bits being changed)

A 2 clock single instruction inline delay (that does not affect the stack) is achieved with a "GOTO $+1"

Using the 16F5x STATUS register (03)

The Status register is actually a real register (unlike the INDF pointer 'register', the PORT control (TRIS) 'register', the OPTION 'register' and the W 'register' (Accumulator), none of which are real registers at all). Further, all 8 bits are implemented (see below re: PA bits) - unlike the register 'bank' bits (FSR bits 5-7)

What happens to the Status bits when the Status Register itself is specified as the 'destination' of some instruction ?
Well, first the TO/PD bits are not 'writable' at all, and second, if the instruction itself is one that affects ANY 'flag bit' (i.e. any of Z,DC,C) then the 'write' to all 3 bits is ignored. Bit Set/Clr doesn't affect the 'flags', so will work on Z,DC,C just fine, as will Nibble swap, COPY Acc to Status and (oddly) INC / DEC Skip if Z (since PD is only 0 after a SLEEP and TO is only 0 after a WDT Time-out, you will be hard pressed to get the skip to take place)

The 'Z' flag is auto-updated by most instructions, whilst b1 (DC = Digit Carry) flag is updated only on Add or Subtract and Bit 0 (Cy), is only updated on Add/Subtract and Rotate (which is via Carry). Since Bit Set and Bit Clear allows the flags to be accessed, we can 'play' with the values.

Since Rotate is via Cy, manipulating the Cy flag can be quite useful = see later for "how to 'bit shift' 5 bits using the 4 bit PORT A"

It's worth noting that the state of the Cy / DC flags is inverted after a Subtract. Specifically, on Add, Cy and DC are 'set' on carry; on Subtract, they are set on 'no borrow' (i.e. 0 = borrow, 1 = none). This is a real pain to remember (and prompted me to 'invent' two extra 'flag macros', specifically "BRA_Bw dest" = Branch (jump to dest) if Borrow, and "BRA_nB dest" = Branch (jump to dest) if no Borrow)
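The 'inverted' flags make more sense once you treat the subtract as an add of the two's complement. The Python model below is only a sketch under that assumption (f - W computed as f + NOT(W) + 1); the function names borrow the standard Microchip mnemonics but this is not PIC code.

```python
def addwf(f, w):
    """Return (result, Cy, DC) for f + W."""
    total = f + w
    carry = int(total > 0xFF)
    digit_carry = int((f & 0x0F) + (w & 0x0F) > 0x0F)
    return total & 0xFF, carry, digit_carry

def subwf(f, w):
    """Return (result, Cy, DC) for f - W, done as f + (~W) + 1."""
    total = f + (~w & 0xFF) + 1
    carry = int(total > 0xFF)              # 1 = NO borrow, 0 = borrow
    digit_carry = int((f & 0x0F) + (~w & 0x0F) + 1 > 0x0F)
    return total & 0xFF, carry, digit_carry
```

So 5 - 3 leaves Cy = 1 (no borrow) whilst 3 - 5 leaves Cy = 0 (borrow), and 5 - 5 gives zero with Cy = 1.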

Using the PA bits

The top 3 bits of the Status register are 3 PA bits, which exist on all 16F5x devices, despite the fact that PA2 (= b7) is not used on any of them, and PA1 (b6) & PA0 (b5) are only used by the 16F57/9.

The PIC manual says "Using PA2 (or PA1/0 on the 16F54) is 'not recommended'", however PA2=b7 in particular makes an excellent 'custom Status flag', especially as you can 'ROTL Status to the Acc' to copy it (b7) into Cy (b0) and then ROTR Cy into somewhere else

Using the 16F54 FSR (reg 04)

The FSR (reg 4) is the 'pointer' register for use with INDF (reg 0) indirect addressing into the register stack. When using indirect addressing, you can 'flip' the FSR pointer between 2 registers by using the INVert instruction ("COMF 4,1" or, as I like to code it, 'INV FSR').

You can only 'pair up' outside the 'special reg' set. For the 16F54, the first general reg is at address 0x07 (0,0111) - its "complement" is at 0x18 (1,1000) - the next pair is 0x08 (0,1000) with 0x17 (1,0111) .. and so on.
The 16F54 has 25 registers (address 0x07-0x1F i.e. 0,0111 to 1,1111, inclusive). Of the 25, 18 can be 'paired' together using 'INV FSR' to flip between them (the 7 registers at the top of the address range, 0x19-0x1F, would 'pair' with one of the special registers (0-6)). The full list is :-
0,0111 1,1000
0,1000 1,0111
0,1001 1,0110
0,1010 1,0101
0,1011 1,0100
0,1100 1,0011
0,1101 1,0010
0,1110 1,0001
0,1111 1,0000
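If you want to check the pairing for yourself, the complement arithmetic is easy to model on a PC. This Python sketch assumes only the low 5 bits of the FSR matter on the 16F54 and that general registers run from 0x07 to 0x1F.

```python
def fsr_pairs(first_gp=0x07, top=0x1F):
    """List (low, high) register pairs that 'INV FSR' flips between."""
    pairs = []
    for addr in range(first_gp, top + 1):
        partner = ~addr & 0x1F             # complement of the 5 address bits
        if partner >= first_gp and addr < partner:
            pairs.append((addr, partner))
    return pairs

for low, high in fsr_pairs():
    print("0x%02X <-> 0x%02X" % (low, high))
```

Any address whose complement lands in 0x00-0x06 is left out, since that would point at a special register.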

FSR bits 5-7 are the 'bank select' bits. On the 16F54, all 3 bits are 'unimplemented' (and will read as 1), on the 16F57 only bits 5 & 6 are implemented (7 is '1') = only on the 16F59 are all 3 bits implemented

The unimplemented 3 bits always read as '1', no matter what instruction is operated on them - so, for example, on a 16F54, "CLR reg4" will set Reg4=0xE0

For how the 'register bank' system works on the 16F57/9, see below

Adding 8 (0 1000) to the FSR thus allows you to 'cycle' around 4 specific registers, starting at 0x07, 0x0F, 0x17 and finally 0x1F. Note that this is only possible with the one specific 'set' on the 16F54.

If FSR is set to 0 (i.e. pointing at the 'indirect' INDF 'dummy' reg location), any read operation using INDF will return 0 - and (by implication) any write will be ignored

16F57/59 register addressing

Understanding the 16F57/59 register bank addressing system is not easy.

The 8 bit register 'address' should be regarded as a 4 bit address (0-3), a 1 bit 'bank enable' (4) and a 3 bit 'bank select' (5,6,7).
When bit4 = 0, one of the 16 'common set' registers will be accessed, irrespective of the 'bank select' bits. The 'common set' includes all the Special (control and status) registers (INDF, PORT, actual Status etc) so these are always accessible, irrespective of the 'bank' bits.
When bit4 = 1, access is to one of the 16 'upper' registers, as determined by the bank select bits.
The 16F54 has one 'bank' (register address 0x10-0x1F). Of the 16 'common' registers (0x00-0x0F), the first 7 are Special, so it has 16 + 9 = 25 GP registers.
The 16F57 has 4 register 'banks' (10-1F, 30-3F, 50-5F, 70-7F)= 64 GP registers, which, when added to the 8 GP registers in the 'common set' (the low 16) gives us 72 registers in total. Note that the FSR contains 7 bits (the 8th == 1) so can address any of the 'banked' registers (but only in 'sets' of 16 - incrementing 'off the top' of a GP set will 'drop back' to the 'common set' !)
The 16F59 has 8 banks of 32 (00-1F .. E0-FF). This gives it 8x16 = 128 GP registers. Since the first 10 of the common set are 'Special' (control) registers, there are only 6 'fixed' GP registers (0A-0F) for a total of 128+6 = 134. The FSR contains 8 bits

Note that 'incrementing' the FSR is very dangerous, as it will happily 'roll over' from the top of one bank (eg address 1F) to the bottom of the next (20), which then maps to a Special register.
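The address split described above can be modelled in a few lines of Python. This is only my reading of the scheme (bit 4 = bank enable, bits 5-7 = bank select, bits 0-3 = offset); the function name is made up.

```python
def decode(addr):
    """Split an 8 bit register address into (region, offset)."""
    offset = addr & 0x0F
    if (addr & 0x10) == 0:
        return ("common", offset)          # bank select bits are ignored
    return ("bank %d" % (addr >> 5), offset)
```

decode(0x1F) gives the last register of bank 0, but decode(0x20) is back in the common set at offset 0 (INDF), which is exactly the 'roll over' trap when incrementing the FSR.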

Using Rotate

You can both 'unpack' a register (extract its bits) and control the 'unpack loop' using the same register with a Rotate instruction.

The trick is to note that Rotate is via the Carry bit, so on each rotate the current state of Cy is 'shifted in' at one end of the register (and the bit at the other end is 'shifted out' to Cy). To 'control' the bit loop, you start by setting Cy, so on the first Rotate the byte gains a '1' at the 'shift in' end (so it is now non-zero).
On all subsequent Rotates you clear the Cy bit first, so 0's are 'shifted in'. You can then 'exit the loop' when the byte is Zero (which you have to Test for, since Rotate has no effect on the Z flag)

Packing a register (shifting in) is similar, however in this case you start with the register set to 1000 0000 or 0000 0001 (i.e. all 0 except the 'end bit')

On each rotate instruction, the Cy bit is rotated into one end of the register and the other end bit shifted out to Cy.
If you branch on the state of Cy, on the last bit in, the '1' end bit will be shifted out
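Both loops can be tried out in Python before committing them to PIC code. The sketch below models a Rotate Left through Carry only (the mirror image works for Rotate Right); the helper names are mine.

```python
def rlf(reg, cy):
    """Rotate left through carry: returns (new_reg, new_cy)."""
    return ((reg << 1) | cy) & 0xFF, reg >> 7

def unpack_bits(reg):
    """Shift all 8 bits out (MSB first) using the register as its own counter."""
    bits = []
    cy = 1                                 # set Cy first: this is the 'marker'
    while True:
        reg, cy = rlf(reg, cy)
        if reg == 0:                       # marker has just left: all 8 bits done
            return bits
        bits.append(cy)                    # a real data bit
        cy = 0                             # clear Cy so 0's follow the marker in

def pack_bits(bits):
    """Shift bits in until the marker '1' falls out into Cy."""
    reg = 0x01                             # all 0 except the 'end bit'
    for bit in bits:
        reg, cy = rlf(reg, bit)
        if cy:                             # marker shifted out: register is full
            break
    return reg
```

Note the unpack loop takes 9 rotates in total, the last one only serving to push the marker out.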

PORT registers

Reading the data bits from the PORT 'register' ALWAYS gets the actual value on the actual pins, irrespective of the mode (and not the value in the (output) register) EVEN IF the PORT pins are set to 'output'

It's quite possible for the pin to be 'pulled' in the opposite direction of its 'data out' value by some external circuit (for example, it could be shorted to Gnd)

This means you have to be very careful when using any command that writes the 'result' back to the 'source' - for example, ADD Acc,PORT will read the PORT pins, add the value in the Acc and write the result to the PORT o/p latch.

Only if the entire PORT is set to 'output' mode (rather than input = tri-state) will Acc be added to the latch contents and even then, only if the o/p pin state is not being 'overridden' by some external circuit 'back drive'
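The read-modify-write trap is easier to see in a toy model. The Python class below is a deliberately simplified assumption: output pins always show their latch value (no external 'back drive'), input pins show whatever the outside world applies.

```python
class Port:
    def __init__(self):
        self.latch = 0x00                  # the o/p data latch
        self.tris = 0xFF                   # 1 = input (tri-state), 0 = output
        self.external = 0x00               # levels applied to the pins from outside

    def read(self):
        """A read always returns the pin states, never the latch."""
        return (self.latch & ~self.tris | self.external & self.tris) & 0xFF

    def write(self, value):
        self.latch = value & 0xFF

    def bsf(self, bit):
        """Bit Set: read the pins, set one bit, write the whole latch."""
        self.write(self.read() | (1 << bit))
```

With only b0 as an output and b7 held Hi externally, bsf(0) leaves the latch at 0x81: latch b7 has been changed even though the code never touched it.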

PORT A bits 4-7 (and PORT E bits 0-3) are 'unimplemented' and always read as 0, which could be of some use. For example, ROTL PORTA,Acc gets the pin state into Acc b1-4 whilst clearing the Cy; ROTR PORT E,Acc is similar (PORT E pins 7-4 end up in Acc 6-3 with Cy clear)

Note that the PORT pin changes state at the end of the instruction execution (i.e. on the trailing edge of OSC 4), whilst the PORT pin is 'read' during the first 1/4 CLK (i.e. end of first OSC cycle) of the instruction

If you Write to the PORT (latch) and then immediately Read back from the same PORT (pins), there is a single OSC between the two operations. This means that the new pin state has to 'settle' within 1 OSC if it is to be read back correctly

The TRIS command sets the pin 'mode' (i.e input or output) latch (it's not a register, it can't be 'read'). If you TRIS the pin to output mode, the pin is driven Lo or Hi depending on the PORT register data value. To avoid 'glitches', set the PORT data value before switching the pin to o/p mode

High speed serial output ('bit banging')

The maximum possible speed is 'one (OSC/4) instruction per bit' - which means using the Rotate through Carry instruction - and using a whole 8 bit PORT for full byte serial comms

Both the 16F57 and 16F59 have sufficient i/o pins to allow a set of 8 i/o's to be 'dedicated' to serial output.
The 16F54, however, has only one 8 bit PORT (plus one 4 bit), so you might well have to 'double use' some of the PORT B (8 bit 'serial out') pins.
Note, by the way, that because the 'shift' is actually a 'Rotate via Carry' you can gain an extra bit 'for free' by pre-setting the Cy bit (i.e. you can shift out 5 bit data using the 4 bit port or 9 bits via the byte port)

The byte to be transmitted is copied into a PORT register and then a sequence of 7 'Rotate' instructions is used to shift all the bits out

From RS232 serial comms spec. we note each byte requires a 'start bit' (Lo) followed by the 8 data bits (bit 0 first) then one or more 'stop' bits (Hi).

Bit 0 of the PORT is thus defined as the 'serial Tx' line, and is pulled Hi with a 10k resistor (so when the PORT is set to 'input' mode, the serial Tx = 'idle' = Hi). The remaining 7 bits of the PORT can be left 'n/c', however all 8 pins must be 'enabled' as outputs before serial Tx starts (remember - reading the PORT gets the pin state, not the o/p latch values).

The 7 'n/c' pins could also be used as inputs (when the serial comms is not in use), however careful circuit design is required to ensure that, when set to o/p mode (during serial transmission), the PORT o/p latch value 'overrides' any input signal level
To send a byte :-
'Step 0' is the 'start bit'. To achieve this, we 'BCLR PORT,bit0'
'Step 1' is to load the PORT latch with the byte we want to transmit, and thus bit0 of the byte is output (Copy Acc to PORT).
'Step 2-8' consists of 7 'Rotate right' instructions ('ROTR PORT'), which 'reads' the PORT pins, shifts right (down) by one bit and writes the PORT o/p latch thus sending the rest of the byte, one bit after the other
'Step 9' is to end the byte with a stop bit, 'BSET PORT,bit0'.
From here we can let the serial line 'idle' with 'stop' bits whilst fetching the next byte (see previous re: Data Tables)
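The step sequence can be checked with a quick Python model of what appears on the Tx pin. It assumes all 8 PORT pins are outputs (so the pins always equal the latch) and ignores timing; the function names are mine.

```python
def rrf(reg, cy):
    """Rotate right through carry: returns (new_reg, new_cy)."""
    return (reg >> 1) | (cy << 7), reg & 1

def send_byte(byte):
    """Return the sequence of levels seen on PORT bit 0 for one byte."""
    line = []
    port, cy = 0xFF, 0                     # line idling Hi
    port &= 0xFE                           # step 0: start bit (BCLR PORT,bit0)
    line.append(port & 1)
    port = byte                            # step 1: copy the byte to the PORT
    line.append(port & 1)
    for _ in range(7):                     # steps 2-8: seven rotate rights
        port, cy = rrf(port, cy)
        line.append(port & 1)
    port |= 0x01                           # step 9: stop bit (BSET PORT,bit0)
    line.append(port & 1)
    return line
```

For 0x41 this yields a Lo start bit, then 1,0,0,0,0,0,1,0 (bit 0 first), then the Hi stop bit.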

To read a byte, one i/o pin (b0 if MSB is sent first, or b7 if LSB) is used as the Serial in and the other 7 bits set to output. As the byte is shifted in it is built up by storing the first 7 bits on the data output latches. The last 'read' command gets the 7 bits of saved data plus the last input bit

Outputting 5x8 character bit maps

A PIC based OSD / VTI (On Screen Display / Video Time Inserter) will need to output bit-mapped characters at video rates. The 16F5x series is spec. limited to 20MHz OSC, so the CPU CLK is only 5MHz and thus even at 1 CLK per bit the 'bit map' characters will be about 3x slower than normal video bit rates (so each bit is at least 3 pixels wide)

Overclocking the 16F5x to 24MHz gets us a 6MHz video rate (assuming we can do 1 bit per CPU clock) which is 'acceptable' for interlaced TV video (but not for VGA, where we need a 24MHz data rate for the most basic 640x480 display). For VGA, we thus need a PIC supporting at least a 48MHz OSC (12 MHz CPU clk) OR use an external shift register (with an 8x8 bit map and load it at 2 CPU clk per byte whilst clocking at OSC rate)

The 16F54 PORTA consists of 4 bits (0-3) only. HOWEVER it can be used to support a 5 bit wide character map if the bit3 pin is used as the actual output !

Four bits of the map are loaded into b3-b0 and the 5th bit loaded into Cy, so the first 'Rotate Left PORTA' shifts Cy into b0 (to set this up, the 5 bit map would be Rotated right into Cy before starting the output). NB. To support this approach, the bit maps will need to be held 'mirror imaged'
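Here is a Python sketch of the '5 bits through a 4 bit PORT' idea, showing what the b3 pin outputs. It assumes the four PORTA pins are outputs and that the unimplemented bits 4-7 read as 0 (so each rotate shifts a 0 into Cy); the function name is mine.

```python
def shift_out_5(bits5):
    """bits5 = five pixel values in output order; returns the levels seen on b3."""
    port = bits5[0] << 3 | bits5[1] << 2 | bits5[2] << 1 | bits5[3]
    cy = bits5[4]                          # the 5th bit waits in Cy
    out = [port >> 3]                      # first pixel appears as soon as the PORT is loaded
    for _ in range(4):                     # four 'Rotate Left PORTA' instructions
        port, cy = ((port << 1) | cy) & 0x0F, 0
        out.append(port >> 3)
    return out
```

After the four rotates the PORT holds all 0's, so the output pin returns to 'blank' on its own.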

Note also that the pin is either Hi or Lo (i.e. not 'tri-state'), so if we want to 'superimpose' text on the existing video stream we will need to 'buffer' the pin with an external transistor

Max speed output using a single i/o pin (bit banging)

If one of the 'end' bits is used (b0 or b7), it is possible to output in 3 instructions per bit whilst preserving the 'contents' of the rest of the output register

; The Tx pin is b0 of PORTB. The state of the other pins must not be changed during transmission
; The byte to be output is in register txData, bit0 is Tx first
ROTR PORTB,Acc ; Rotate PORTB to the Acc (this captures the state of the bits 'shifted down by 1')
COPY Acc,temp ; save the rotated PORT data
; start the output
ROTR txData ; Rotate the TxD register so the Lo bit (to be output next) is in Cy
ROTL temp,Acc ; Rotate PORTB backup from temp (so PORT b0 now = Cy = bit to output)
COPY Acc,PORTB ; Update PORTB bit0 (TxD i/o pin) with the Cy value
; repeat ...

If we don't care about preserving the other bits, the whole PORT is defined as an output and a single instruction ROTATE shifts the bits along

The maximum continuous o/p is 9 bits (the 8 bits of the byte + Cy), after which 1+ clk time has to be spent loading the next byte (and Cy).

Next page :- Better PIC instruction set - (mnemonics)