Sunday 30 March 2014

what happens after the cpu starts

So I started thinking about the startup code that you might want if you were executing code on the epiphany as a kernel processor as opposed to running a "standard" C interface. One problem is you can't be loading arbitrary code while the cpu is not in a halted state so there has to be some fixed block of code which is always there which can contain that state.

data

To get my bearings I created a full memory map of the entire epiphany 16-core device. Below is an excerpt from the lowest addresses on each core as I had it initially. Yeah ... didn't really feel like watching tv.

 +----------+---- local ram start for each core
 |     0000 | sync
 |     0004 | swe
 |     0008 | memfault
 |     000c | timer0
 |     0010 | timer1
 |     0014 | message
 |     0018 | dma0
 |     001c | dma1
 |     0020 | wand
 |     0024 | swi
 +----------+------
 |     0028 | extmem
 |     002c | group_id
 |     0030 | group_rows
 |     0034 | group_cols
 |     0038 | core_row
 |     003c | core_col
 |     0040 | group_size
 |     0044 | core_index
 +----------+------
 |     0048 | 
 |     004c | 
 |     0050 | 
 |     0054 | 
 +----------+------
 |     0058 | .text initial code
 |          |

The base load address matched the current sdk because I was originally compatible with e-lib.

I tried a couple of ideas but settled on leaving room for 8 words (easy indexing) and found a good use for every slot.

 |          |
 +----------+------
 |     0048 | argv0
 |     004c | argv1
 |     0050 | argv2
 |     0054 | argv3
 |     0058 | entry
 |     005c | sp
 |     0060 | imask
 |     0064 | exit code
 +----------+------
 |     0068 | .text.init
 |          |

Being able to load r0-r3 and set the stack pointer from the host allows any argument list to be created - i.e. the 'main' routine takes native types as arguments rather than nothing at all. This can help simplify some startup issues because you don't need to synchronise with anything to get the data on startup and for simple cases might be all you need.

Being able to set the entry point allows multiple 'kernels' to be stored in the same binary and be invoked without having to reload the code. It also 'fixes' the problem of having to trampoline to a 32-bit address via the startup routine if you want an off-core main (which may have limited uses).

And the imask just allows some initial system state to be configured. There may need to be more.

And finally the exit code allows the kernel to return some value. Actually i'm not sure how useful it is but there was an empty slot and it was easy to add.

code

So to the reset interrupt handler. This is just a rough sketch of where my thoughts are at the moment and i haven't tried it on the machine so it may contain some silly mistakes or other thinkos.

        .section .text.ivt0
_ivt_sync:      
        b.l     _isr_sync       ;sync

Currently in ez-loader the section name .text.ivt0 sets the interrupt vector although because it can create the ivt entries on the fly it may be something that can be removed. Or have it work as an override in which case all the following code is never used. The compiler outputs various sections for interrupt handlers too and I intend to remain compatible with those even though interrupt handlers in c can be a bit nasty due to the size of the register file and lack of multi push/pull instructions.

        .section .text.init
_isr_sync:
        ;; load initial state
        mov     r7,#tcb_off
        ldrd    r0,[r7,#tcb_argv0_1]      ; argv0, argv1
        ldrd    r2,[r7,#tcb_argv2_3]      ; argv2, argv3
        ldrd    r12,[r7,#tcb_entry_sp]    ; entry, sp
        ldr     r4,[r7,#tcb_imask]        ; imask

Then we come to the meat and potatoes. First the state is loaded from the 'task control block'. Double-word loads are used where possible and most values are loaded directly into their final desired location.

        ;; set frame pointer
        mov     fp,sp

        ;; set imask
        movts   imask,r4

A couple of values then need to be set by hand. There may need to be other init work done here such as clearing ipend.

        ;; set entry point
        movts   iret,r12

Rather than drop down to user-mode to run setup routines as required by a full c runtime this just jumps straight to the kernel entry point via iret. This saves the need to resolve the pre-main 'start' trampoline routine too.

        ;; link to exit routine
        mov     lr,%low(_exit)

Similarly because main isn't being called via a user-mode stub using bl or jalr the link register needs to be set manually. This can go also directly to _exit without passing go. Because this is intended to be on-core it only needs to load the bottom 16 bits of the address.

        ;; launch main directly
        rti

And after that ... there's nothing more to do. rti will clear the interrupt state, enable interrupts and jump to the chosen kernel entry point. r0-r3 and the stack will contain any args required, and when it finishes it will return from the function ...

_exit:  mov     r7,#tcb_off
        str     r0,[r7,#tcb_exit_code]
1:      idle
        b.s     1b
        .size   _exit, .-_exit

... and end up directly at exit. This can save out the return code from the function and then 'shuts down' via an idle and repeat loop. And at this point new code can (relatively) safely be loaded, or just the entry and arguments changed to relaunch the code with a sync signal. It may need another field for to indicate the core state such as running or exited and that might be more useful than having an exit code.

The nice thing about the dynamic loader I have is that I can separate this startup mechanism from the code completely - with a bit more work it doesn't even need to be linked to it. Because code is relocated at runtime I can change the startup code or tcb structure without forcing a recompile and only the workgroup config stuff needs to remain fixed.

For example another option may be to have a job dispatch loop replace most of the above and avoid the need to have to add it externally. It could even potentially be loading in the code itself from asynchronously en-queued jobs like hsa does. Possibly even using the hsa dispatch packet or some subset of it. Hmm, thoughts.

No comments: