Tuesday 25 February 2014

more on JNI overheads

I wrote most of the OpenCL binding yesterday but now I'm mucking about with simplifying it.

I've experimented with a couple of binding mechanisms but they have various drawbacks. They all work in basically the same way in that there is an abstract base class of each type then a concrete platform-specific implementation that defines the pointer holder.

The difference is how the jni C code gets hold of that pointer:

Passed directly
The abstract base class defines all the methods, which are implemented in the concrete class, which just invokes the native methods. The native methods may be static or non-static.

This requires a lot of boilerplate in the java code, but the C code can just use a simple cast to access the CL resources.

C code performs a field lookup
The base class can define the methods directly as native. The concrete class primarily is just a holder for the pointer value.

This requires only minimal boiler-plate but the resources must be looked up via a field reference. The field reference is dependent on the type though.

C code performs a virtual method invocation
The base class can define the methods directly as native. The concrete class primarily is just a holder for the pointer value.

This requires only minimal boiler-plate but the resources must be looked up via a method invocation. Here, though, the method reference is independent of the type.

The last is kind of the nicest - in the C code it's the same amount of effort (coding wise) as the second but allows for some polymorphism. The first is the least attractive as it requires a lot of boilerplate - 3 simple functions rather than just one empty one.

But a big chunk of the OpenCL API is dealing with mundane things like 'get*Info()' lookups, and to simplify its use I came up with a number of type-specific calls. However, rather than write these for every possible type I pass a type-id to the JNI code so a single function works. This works fine except that I would like to have a separate CLBuffer and CLImage object - and in this case the second implementation falls down.
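
To make the type-id idea a little more concrete, here is a minimal Java-side sketch. The names InfoSketch, getInfoAny and the TYPE_* constants are hypothetical and not taken from the binding, and the native call is replaced by a plain Java stand-in so that the example runs on its own. In the real binding the single generic function would be a native method.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one generic lookup keyed by a type-id,
// wrapped by thin type-specific accessors.
class InfoSketch {
    // Type-ids handed down to the (here simulated) native side.
    static final int TYPE_INT = 0;
    static final int TYPE_LONG = 1;
    static final int TYPE_STRING = 2;

    // Stand-in for the values the OpenCL runtime would report.
    private final Map<Integer, Object> values = new HashMap<>();

    InfoSketch put(int param, Object value) {
        values.put(param, value);
        return this;
    }

    // In a real binding this would be declared 'native'; the C code would
    // switch on typeId to decide how to box the result.
    Object getInfoAny(int param, int typeId) {
        Object v = values.get(param);
        switch (typeId) {
            case TYPE_INT:
                return ((Number) v).intValue();
            case TYPE_LONG:
                return ((Number) v).longValue();
            case TYPE_STRING:
                return String.valueOf(v);
            default:
                throw new IllegalArgumentException("unknown type id " + typeId);
        }
    }

    // The type-specific calls are one-liners over the generic function.
    int getInfoInt(int param) {
        return (Integer) getInfoAny(param, TYPE_INT);
    }

    long getInfoLong(int param) {
        return (Long) getInfoAny(param, TYPE_LONG);
    }

    String getInfoString(int param) {
        return (String) getInfoAny(param, TYPE_STRING);
    }
}
```

The parameter numbers used with put() are arbitrary placeholders, not real OpenCL constants.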

To gain more information on the trade-off involved I did some timing on a basic function:

  public CLDevice[] getDevices(long type) throws CLException;

This invokes clGetDeviceIDs twice (first to get the list size) and then returns an array of instantiated wrappers for the pointers. I invoked this 10M times for various binding mechanisms.

Method                   Time
 pass long                13.777s
 pass long static         14.212s
 field lookup             14.060s
 method lookup            16.252s

So interesting points here. First is that static method invocations appear to be slower than non-static even when the pointer isn't being used. This is somewhat surprising as 'static' methods seem to be quite popular as a mechanism for JNI binding.

Second is that a field lookup from C doesn't cost much more than a field lookup in Java.

Lastly, as expected the method lookup is more expensive and if one considers that the task does somewhat more than the pointer resolution then it is quite significantly more expensive. So much so that it probably isn't the ideal solution.
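
The harness behind these numbers isn't shown here. Purely as an illustration of the shape of such a test, a timing loop comparing a field read with a virtual getter might look like the following. All names are made up, and because this is pure Java the JIT is free to inline both paths, so it will not reproduce the JNI costs in the table.

```java
// Hypothetical micro-benchmark sketch (pure Java, no JNI involved).
class TimingSketch {
    interface HasPointer {
        long getPointer();
    }

    static final class Holder implements HasPointer {
        final long p;

        Holder(long p) {
            this.p = p;
        }

        @Override
        public long getPointer() {
            return p;
        }
    }

    // Reads the pointer through a direct field access on each iteration.
    static long viaField(Holder h, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += h.p;
        }
        return sum;
    }

    // Reads the pointer through an interface method on each iteration.
    static long viaMethod(HasPointer h, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += h.getPointer();
        }
        return sum;
    }

    public static void main(String[] args) {
        final int n = 10_000_000; // same iteration count as in the text
        Holder h = new Holder(3);

        long t0 = System.nanoTime();
        long a = viaField(h, n);
        long t1 = System.nanoTime();
        long b = viaMethod(h, n);
        long t2 = System.nanoTime();

        // Print the sums too, so the loops cannot be discarded as dead code.
        System.out.printf("field:  %.3f ms (sum %d)%n", (t1 - t0) / 1e6, a);
        System.out.printf("method: %.3f ms (sum %d)%n", (t2 - t1) / 1e6, b);
    }
}
```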

So ... it looks like I may end up going with the same solution I've used before. That is, just use the simple field lookup from C. Although it's slightly slower than the first mechanism it is just a lot less work for me without a code generator and produces much smaller classes either way. I'll just have to work out a way to implement the polymorphic getInfo methods some other way: using IsInstanceOf() or just using CLMemory for all memory types. In general performance is not an issue here anyway.

I suppose to do it properly I would need to profile the same stuff on 32-bit platforms and/or android as well. But right now I don't particularly care and don't have any capable hardware anyway (apart from the parallella). I wasn't even bothering to implement the 32-bit backend so far anyway.

Examples

This is just more detail on how the bindings work. In each case objects are instantiated from the C code - so the java doesn't need to know anything about the platform (and is thus, automatically platform agnostic).

First is passing the pointer directly. Drawback is all the bulky boilerplate - it looks less severe here as there is only a single method.

public abstract class CLPlatform extends CLObject {

    abstract public CLDevice[] getDevices(long type) throws CLException;

    class CLPlatform64 extends CLPlatform {
        final long p;

        CLPlatform64(long p) {
            this.p = p;
        }

        public CLDevice[] getDevices(long type) throws CLException {
            return getDevices(p, type);
        }

        native CLDevice[] getDevices(long p, long type) throws CLException;
    }

    class CLPlatform32 extends CLPlatform {
        final int p;

        CLPlatform32(int p) {
            this.p = p;
        }

        public CLDevice[] getDevices(long type) throws CLException {
            return getDevices(p, type);
        }

        native CLDevice[] getDevices(int p, long type) throws CLException;
    }
}

Then having the C code look up the field. Drawback is each concrete class must be handled separately.

public abstract class CLPlatform extends CLObject {
    native public CLDevice[] getDevices(long type) throws CLException;

    class CLPlatform64 extends CLPlatform {
        final long p;

        CLPlatform64(long p) {
            this.p = p;
        }
    }

    class CLPlatform32 extends CLPlatform {
        final int p;

        CLPlatform32(int p) {
            this.p = p;
        }
    }
}

And lastly having a pointer retrieval method. This has lots of nice coding benefits ... but too much in the way of overheads.

public abstract class CLPlatform extends CLObject {
    native public CLDevice[] getDevices(long type) throws CLException;

    class CLPlatform64 extends CLPlatform implements CLNative64 {
        final long p;

        CLPlatform64(long p) {
            this.p = p;
        }

        public long getPointer() {
            return p;
        }
    }

    class CLPlatform32 extends CLPlatform implements CLNative32 {
        final int p;

        CLPlatform32(int p) {
            this.p = p;
        }

        public int getPointer() {
            return p;
        }
    }
}

Or ... I could of course just use a long for storage on 32-bit platforms and be done with it - the extra memory overhead is pretty much insignificant in the grand scheme of things. It might require some extra work on the C side when dealing with a couple of the interfaces but it is pretty minor.

With that mechanism the worst-case becomes:

public abstract class CLPlatform extends CLObject {
    final long p;

    CLPlatform(long p) {
        this.p = p;
    }

    public CLDevice[] getDevices(long type) throws CLException {
        return getDevices(p, type);
    }

    native CLDevice[] getDevices(long p, long type) throws CLException;
}

Actually I can move 'p' to the base class then which simplifies any polymorphism too.

I still like the second approach somewhat for a hand-coded binding since it keeps the type information and allows all the details to be hidden in the C code where it is easier to hide using macros and so on. And the java becomes very simple:

public abstract class CLPlatform extends CLObject {
    CLPlatform(long p) {
        super(p);
    }

    public native CLDevice[] getDevices(long type) throws CLException;
}
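
With 'p' in the base class, the pointer lookup no longer depends on the concrete type. The sketch below is a rough Java-side analogy of what the C code would do with GetFieldID and GetLongField: the field is resolved once against the base class and then read from any subclass instance. The class names here (CLObjectSketch, CLBufferSketch, CLImageSketch, PointerLookup) are hypothetical, and reflection only stands in for the JNI calls.

```java
import java.lang.reflect.Field;

// Hypothetical base class holding the native pointer for every wrapper type.
abstract class CLObjectSketch {
    final long p;

    CLObjectSketch(long p) {
        this.p = p;
    }
}

class CLBufferSketch extends CLObjectSketch {
    CLBufferSketch(long p) {
        super(p);
    }
}

class CLImageSketch extends CLObjectSketch {
    CLImageSketch(long p) {
        super(p);
    }
}

class PointerLookup {
    // Resolved once against the base class, much as C code might cache a jfieldID.
    private static final Field P_FIELD;

    static {
        try {
            Field f = CLObjectSketch.class.getDeclaredField("p");
            f.setAccessible(true);
            P_FIELD = f;
        } catch (NoSuchFieldException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Works for any subclass because the field is declared in the base class.
    static long pointerOf(CLObjectSketch o) {
        try {
            return P_FIELD.getLong(o);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The same lookup serves both a buffer and an image wrapper, which is the property the polymorphic getInfo methods need.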

CLEventList

Another problematic part of the OpenCL API is cl_event. It's actually a bit of a pain to work with even in C but the idea doesn't really map well to Java at all.

I think I came up with a workable solution that hides all the details without too much overhead. My initial solution was to have a growable list of items (the same as JOCL) that was managed on the Java side. It's a bit messy on the C side but really messy on the Java side:

public class CLEventList {
   static class CLEventList64 extends CLEventList {
      int index;
      long[] events;
   }
}

...
   enqueueSomething(..., CLEventList wait, CLEventList event) {
       CLEventList64 wait64 = (CLEventList64)wait;
       CLEventList64 event64 = (CLEventList64)event;

       enqueueSomething(...,
           wait64 == null ? 0 : wait64.index, wait64 == null ? null : wait64.events,
           event64 == null ? 0 : event64.index, event64 == null ? null : event64.events);

       if (event64 != null) {
           event64.index+=1;
       }
   }

Yeah, maybe not - for the 20 odd enqueue functions in the API.

So I moved most of the logic to the C code - actually the logic isn't really any different on the C side it just has to do a couple of field lookups rather than take arguments, and I added a method to record the output event.

public class CLEventList {
   static class CLEventList64 {
      int index;
      long[] events;
      void addEvent(long e) {
        events[index++] = e;
      }
   }
}

...
   enqueueSomething(..., CLEventList wait, CLEventList event) {
       enqueueSomething(...,
           wait,
           event);
   }
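
For reference, a self-contained sketch of the list itself, including the growing. The doubling policy, the constructor and the class name EventListSketch are my assumptions for illustration; only the index/events pair and addEvent come from the code above. In the binding, addEvent would be invoked from the C code once an enqueue call has produced its output event.

```java
import java.util.Arrays;

// Hypothetical sketch of a growable event list.
class EventListSketch {
    int index;      // number of events currently stored
    long[] events;  // raw cl_event handles kept as longs

    EventListSketch(int capacity) {
        events = new long[Math.max(1, capacity)];
    }

    // Stand-in for the callback the C code makes to record an output event.
    void addEvent(long e) {
        if (index == events.length) {
            // Assumed growth policy: double the backing array when full.
            events = Arrays.copyOf(events, events.length * 2);
        }
        events[index++] = e;
    }
}
```

This leaves out releasing the events, which is where the reference counting has to be handled.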

UserEvents are still a bit of a pain to fit in with this but I think I can work those out. The difficulty is with the reference counting.
