Tuesday 12 August 2014

asm vs c II

I dunno, I'm almost lost for words on this one.

typedef float float4 __attribute__((vector_size(16))) __attribute__((aligned(16)));

void mult4(float *mat, float4 * src, float4 * dst) {
 dst[0] = src[0] + mat[0];
}
notzed@minized:src$ make simd.o
arm-linux-gnueabihf-gcc -c -o simd.o simd.c -O3 -mcpu=cortex-a9 -marm -mfpu=neon
notzed@minized:$ arm-linux-gnueabihf-objdump -dr simd.o

simd.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <mult4>:
   0:   f4610aef        vld1.64  {d16-d17}, [r1 :128]
   4:   ee103b90        vmov.32  r3, d16[0]
   8:   edd07a00        vldr     s15, [r0]
   c:   e24dd010        sub      sp, sp, #16
  10:   ee063a10        vmov     s12, r3
  14:   ee303b90        vmov.32  r3, d16[1]
  18:   ee063a90        vmov     s13, r3
  1c:   ee113b90        vmov.32  r3, d17[0]
  20:   ee366a27        vadd.f32 s12, s12, s15
  24:   ee073a10        vmov     s14, r3
  28:   ee313b90        vmov.32  r3, d17[1]
  2c:   ee766aa7        vadd.f32 s13, s13, s15
  30:   ee053a90        vmov     s11, r3
  34:   ee377a27        vadd.f32 s14, s14, s15
  38:   ee757aa7        vadd.f32 s15, s11, s15
  3c:   ed8d6a00        vstr     s12, [sp]
  40:   edcd6a01        vstr     s13, [sp, #4]
  44:   ed8d7a02        vstr     s14, [sp, #8]
  48:   edcd7a03        vstr     s15, [sp, #12]
  4c:   f46d0adf        vld1.64  {d16-d17}, [sp :64]
  50:   f4420aef        vst1.64  {d16-d17}, [r2 :128]
  54:   e28dd010        add      sp, sp, #16
  58:   e12fff1e        bx       lr
notzed@minized:/export/notzed/src/raster/gl/src$ 
I thought that the store/load/store via the stack was a particularly cute bit of work, especially given the results were already in the right order and in adequately aligned registers. r3 also seems a little too popular.

I guess the vector extensions to gcc just aren't finished - or just don't work. Maybe I used the wrong flags or my build is broken. It produces similar junk code for the epiphany mind you. I've never really tried using them but after a bunch of OpenCL in the past I thought it might be worth a shot to access SIMD without machine code.
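For reference, one variant that could be tried with the same vector extensions is to build the splat vector by hand, so the add is plainly vector + vector rather than vector + scalar. This is only a sketch: the name mult4_splat is made up for illustration, and whether it changes what this particular gcc build emits is untested here.

```c
#include <assert.h>
#include <stdint.h>

typedef float float4 __attribute__((vector_size(16))) __attribute__((aligned(16)));

/* Hypothetical variant of mult4: the scalar is splatted explicitly
   instead of relying on the implicit scalar-to-vector conversion. */
void mult4_splat(const float *mat, const float4 *src, float4 *dst) {
    /* float4 is declared 16-byte aligned; check the caller kept that promise. */
    assert(((uintptr_t)src & 15) == 0 && ((uintptr_t)dst & 15) == 0);

    /* Copy mat[0] into all four lanes, then do one vector add. */
    float4 m = { mat[0], mat[0], mat[0], mat[0] };
    dst[0] = src[0] + m;
}
```

It computes the same thing as mult4 above; only the way the scalar gets widened is spelled out.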

My NEON is very rusty but I think it could be something like this:

notzed@minized:src$ arm-linux-gnueabihf-objdump -dr neon-mat4.o

neon-mat4.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 :
   0:   f4a02caf        vld1.32  {d2[]-d3[]}, [r0]
   4:   f4210a8f        vld1.32  {d0-d1}, [r1]
   8:   f2000d42        vadd.f32 q0, q0, q1
   c:   f4020a8f        vst1.32  {d0-d1}, [r2]
  10:   e12fff1e        bx       lr
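
The listing above can also be written with the NEON intrinsics from arm_neon.h, which sit somewhere between the vector extensions and hand-written assembly. A rough sketch follows, assuming a toolchain that defines __ARM_NEON (or the older __ARM_NEON__); the function name add_splat4 is made up for illustration, and the plain C fallback is only there so the snippet builds on other machines. The comments show which instruction each intrinsic is meant to correspond to, not what any given compiler will actually emit.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical name: adds mat[0] to each of the four floats at src
   and writes the result to dst. */
void add_splat4(const float *mat, const float *src, float *dst) {
    assert(mat != NULL && src != NULL && dst != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t m = vld1q_dup_f32(mat);  /* cf. vld1.32 {d2[]-d3[]}, [r0] */
    float32x4_t s = vld1q_f32(src);      /* cf. vld1.32 {d0-d1}, [r1]     */
    vst1q_f32(dst, vaddq_f32(s, m));     /* cf. vadd.f32, then vst1.32    */
#else
    /* Plain C fallback so the sketch still builds off ARM. */
    for (int i = 0; i < 4; i++)
        dst[i] = src[i] + mat[0];
#endif
}
```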

As can be seen from the names I started with a "simple" matrix multiply but whittled it down to something I thought the compiler could manage after seeing what it did to it - this is just a meaningless snippet.

After a pretty long day at work I was just half-heartedly poking at filling out the frontend to the epiphany gpu, but got distracted by whining at the compiler again. I should've just started with NEON; after a little poking I remembered how nice it was.
