Sunday 7 March 2010

FOPS in context

Another late-night bit of hacking last night, adding FPU support to the context switch code (which, incidentally, is now committed).

It works by disabling the FPU whenever a task which isn't the current FPU owner is scheduled in. Any FPU instruction that task then executes raises an illegal instruction exception, and if the faulting instruction turns out to be an FPU one, the exception handler switches the FPU context over.
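In C-ish terms the idea amounts to something like the following sketch - the names and structure are invented purely for illustration, the real code is in the files mentioned at the end:

/* rough sketch of the lazy FPU switch - invented names, see context.S
 * and irq-context-fp.c for the real thing */
struct task {
        unsigned int fp_regs[66];       /* d0-d31 plus status, filled lazily */
        /* ... the rest of the task state ... */
};

static struct task *fpu_owner;          /* whose state currently lives in the FPU */

extern void fpu_enable(void);           /* assumed low-level helpers */
extern void fpu_disable(void);
extern void fpu_save(unsigned int *regs);
extern void fpu_load(unsigned int *regs);

/* on schedule-in: only the FPU owner may use the FPU without trapping */
void fpu_new_task(struct task *next)
{
        if (next == fpu_owner)
                fpu_enable();
        else
                fpu_disable();
}

/* from the illegal instruction handler, once the faulting instruction
 * has been identified as an FPU one */
void fpu_switch(struct task *current)
{
        fpu_enable();
        if (fpu_owner)
                fpu_save(fpu_owner->fp_regs);
        fpu_load(current->fp_regs);
        fpu_owner = current;
}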

I think I spent more time working out how to check for floating-point instructions in the illegal instruction exception handler than anything else. Partly it was finding out what the instruction bit patterns were, but also how to implement the check without a big mess.

Basically, AFAICT the following bit patterns cover all of the FPU-related instructions.
  1111001x        -        -        -   advanced simd data processing
  xxxx1110        - xxxx101x xxx0xxxx   vfp data processing
  xxxx110x        - xxxx101x xxxxxxxx   ext reg l/s
  11110100 xxx0xxxx        -        -   asimd struct l/s
  xxxx1110        - xxxx101x xxx1xxxx   vfp<>arm transfer
  xxxx1100 010xxxxx xxxx101x        -   64-bit vfp<>arm transfer
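As a concrete example (working the encoding out by hand, so treat the exact value as illustrative rather than gospel), vadd.f64 d0,d1,d2 comes out as 0xee310b02, which lines up with the vfp data processing row:

  11101110 00110001 00001011 00000010   0xee310b02
  xxxx1110        - xxxx101x xxx0xxxx   vfp data processing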

Well, the problem is that ARM data-processing instructions can only take an 8-bit immediate (rotated to an even bit position within the 32 bits). So it becomes a bit of a mess of mucking about with those, or loading the constants from memory. In the end I opted for the constant load - I figure a constant load from memory is not really any different from a constant load from the instruction stream once it's in the L1 cache, so assuming you save a lot of instructions it will probably be faster. Well, it's smaller anyway ...
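(For reference, a quick throwaway test - nothing to do with the kernel code itself - shows which of the mask/value constants implied by the table above are directly encodable as immediates; only a few of them fit, which is why gcc ends up juggling the rest below.)

#include <stdio.h>

/* Is x encodable as an ARM data-processing immediate, i.e. an 8-bit
 * value rotated right by an even number of bits? */
static int arm_imm_ok(unsigned int x)
{
        int r;

        for (r = 0; r < 32; r += 2) {
                /* rotate left by r to undo a rotate-right of r */
                unsigned int v = (x << r) | (r ? x >> (32 - r) : 0);
                if (v <= 0xff)
                        return 1;
        }
        return 0;
}

int main(void)
{
        /* mask/value pairs corresponding to the table above */
        unsigned int c[] = {
                0xfe000000, 0xf2000000,
                0x0f000e00, 0x0e000a00,
                0x0e000e00, 0x0c000a00,
                0xff100000, 0xf4000000,
                0x0fe00e00, 0x0c400a00,
        };
        unsigned int i;

        for (i = 0; i < sizeof(c) / sizeof(c[0]); i++)
                printf("%08x %s\n", c[i],
                       arm_imm_ok(c[i]) ? "immediate" : "needs building/loading");
        return 0;
}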

As a comparison I wondered what gcc would come up with, e.g. with:
#include <stdio.h>

void modecheck(unsigned int *a)
{
        unsigned int v = *a;

        if ((v & 0xfe000000) == 0xf2000000
            || (v & 0x0f000e00) == 0x0e000a00
            || (v & 0x0e000e00) == 0x0c000a00
            || (v & 0xff100000) == 0xf4000000
            || (v & 0x0fe00e00) == 0x0c400a00)
                printf("do stuff\n");
        else
                printf("do other stuff\n");
}

And it's pretty ugly:
        ldr     r0, [r0, #0]
        and     r3, r0, #-33554432
        cmp     r3, #-234881024
        beq     .L2
        bic     r3, r0, #-268435456
        bic     r3, r3, #16711680
        bic     r3, r3, #61696
        mov     r2, #234881024
        bic     r3, r3, #255
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        bic     r3, r0, #-251658240
        bic     r3, r3, #16711680
        bic     r3, r3, #61696
        mov     r2, #201326592
        bic     r3, r3, #255
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        bic     r3, r0, #14680064
        mov     r3, r3, lsr #20
        mov     r3, r3, asl #20
        cmp     r3, #-201326592
        beq     .L2
        bic     r3, r0, #-268435456
        bic     r3, r3, #2080768
        bic     r3, r3, #12736
        mov     r2, #205520896
        bic     r3, r3, #63
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        ... else
        b       done
.L2:    .. if
done:

So basically it's converting all the constants to 8-bit operations. AFAICT the `branch predictor' basically just has a cache of recently-taken branches, and the `default prediction' is that branches are not taken. If that's the case, the above code has a 13-cycle penalty in the best case ... plus all the extra instructions to execute in the worst case. The first two cases are the most likely with FPU instructions though.

Not that this is in any way time-critical code (at most it will execute once per context switch); it just seems a bit of a mess.

I came up with something somewhat simpler, although to be honest I have no idea which would run faster.

        .balign 64
vfp_checks:     @ mask, value
        .word   0x0f000e00,0x0e000a00   @ xxxx1110 - xxxx101x - (vfp data proc / transfer)
        .word   0x0e000e00,0x0c000a00   @ xxxx110x - xxxx101x - (ext reg l/s)
        .word   0xff100000,0xf4000000   @ 11110100 xxx0xxxx - - (asimd struct l/s)
        .word   0x0fe00e00,0x0c400a00   @ xxxx1100 010xxxxx xxxx101x - (64-bit vfp<>arm transfer)

...

        ldr     r2,[r3,#-8]             @ fetch the address of the offending instruction
        ldr     r2,[r2]                 @ ... and load the instruction itself

        and     r0,r2,#0xfe000000       @ check 1111001x xxxxxxxx xxxxxxxx xxxxxxxx (asimd data proc)
        cmp     r0,#0xf2000000
        ldrned  r0,r1,vfp_checks        @ remaining checks run conditionally until one matches
        andne   r0,r0,r2
        cmpne   r0,r1
        ldrned  r0,r1,vfp_checks+8
        andne   r0,r0,r2
        cmpne   r0,r1
        ldrned  r0,r1,vfp_checks+16
        andne   r0,r0,r2
        cmpne   r0,r1
        ldrned  r0,r1,vfp_checks+24
        andne   r0,r0,r2
        cmpne   r0,r1
        bne     1f

        ... if

1:      ... else

Given the constants are on a cache-line boundary, presumably their loads should all be single-cycle (for all 64 bits) once the cache line is loaded, so they shouldn't be any worse than in-line constants that take two or more instructions to create (let alone the 7 required for both constants in the second C case).

Interestingly, though not surprisingly, the -Os code is closer to the hand-rolled assembly, although it uses explicit branches rather than conditional execution, and single-word loads instead of double-word ones.

Apart from that, the context switching itself is pretty simple - the FPU registers are just stored above the others. irq_new_task is where the FPU is turned on or off depending on the current FPU owner, so nothing needs to happen at all if only one task is actively using the FPU. The existing context switch code required no changes - it is all handled 'out of band'.

Given that I now have some tasks to manage, I also added a general illegal instruction handler for genuinely `illegal' instructions. All it does is suspend the current task and remove it from the run queue; the other tasks keep running just fine. For my uKernel it will then have to signal a process server to clean up, or whatever.
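In rough, made-up C again (names and types invented for illustration - the real handler lives in irq-context-fp.c), that amounts to:

/* sketch only - invented names and types */
enum task_state { TASK_RUNNING, TASK_SUSPENDED };

struct task {
        enum task_state state;
        /* ... registers, run-queue links, etc. ... */
};

extern struct task *current_task;               /* assumed scheduler state/helpers */
extern void runqueue_remove(struct task *t);
extern void reschedule(void);

/* called for undefined instructions that aren't FPU ones */
void illegal_instruction(void)
{
        current_task->state = TASK_SUSPENDED;   /* park it until something cleans it up */
        runqueue_remove(current_task);
        reschedule();                           /* the remaining tasks keep running */
}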

Code is in context.S and irq-context-fp.c.

OCaml

Well, yesterday I should have been out working on the retaining wall, but in the morning I got stuck thinking about optimising implementations of mathematical expressions (too much coffee late the night before, methinks).

The primary reason is that I think I found a pretty extreme case of wasteful calculation in an algorithm I'm working on. As far as I can tell from manual inspection, in one place it calculates a `square' 4D intermediate result - actually, of the 8M array entries (64^4), most of them (about 7M) remain unset at (0+0i). But when it uses this array for further processing it only uses 2047 of the results! Considering the intermediate array is 256MB in size(!!!), this is an obvious target for optimisation - it spends far more time clearing the memory than calculating the results it actually uses!

(I had the mistaken impression that matlab did smart things with such expressions, but I guess it doesn't - it's really just a scripting language front-end to a bunch of C functions.)

I have a feeling this sort of thing is happening throughout the algorithm (though this is probably the extreme end of it), but analysing it manually seems error-prone and time-consuming.

Surely there's a better way ... functional languages?

I'm not sure if I'll be able to take advantage of any of them, at least in the short term. I was reading about how FFTW uses OCaml to implement its optimiser - this is the sort of thing I might have to look at, although I never could wrap my head around functional programming languages. But anyway - it's something I'll keep investigating, although my focus to start with will just be code translation. Perhaps some sort of symbolic expression analyser might help me reduce the equations - once I work out what they are, at least.

Another brick in the wall

I did get a couple of hours in on the retaining wall, laying the foundation layer of bricks, which is the tricky bit - but then just as I was starting to make good progress it started to rain. I got nearly half-way along the main wall at least. I assessed that the stair-well I'd created was too steep and started too high (a slack string led me to believe the first run of bricks would be lower), so I'll have to dig that out more too, but that won't take too long. I need to get some ag-pipe before I can back-fill it though. Hmm, I should really get out there and do a bit more for a couple of hours - at least do the first run - but alas I think the day has gotten away from me again, and I have some other things to do.

It's a long weekend anyway, so maybe tomorrow is a better day to get dirty, and the clouds are looking a bit dark at the moment.
