Wednesday 15 January 2014

resampling, again

Been having a little re-visit of resampling ... again. By tweaking the parameters of the data extraction code for the eye and face detectors I've come up with a ratio which lets me use multiple classifiers at different scales to increase performance and accuracy. I can run a face detection at one scale then check for eyes at 2x the scale, or simply check / improve accuracy with a 2x face classifier. This is a little trickier than it sounds because you can't just take 1/4 of the face and treat it as an eye - the choice of image normalisation for training (i.e. how big the feature is and where it sits relative to the bounding box) has a big impact on performance. I have some numbers which look like they should work but I haven't tried them yet.

With all these powers of two I think I've come up with a simple way to create all the scales necessary for multi-scale detection: scale the input image a small number of times to scales between 0.5 and 1.0, so that the scale adjustment is linear. Then create all other scales using simple 2x2 averaging.

This produces good results quickly, and gives me all the octave-pairs I want to any scale; so I don't need to create any special scales for different classifiers.
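A minimal sketch of the 2x2 averaging step in C. The function name, the row-major float greyscale layout, and the choice to drop an odd trailing row or column are all assumptions made for illustration; none of them come from the detector code.

```c
#include <assert.h>
#include <stddef.h>

// Halve an image in each dimension by averaging 2x2 blocks.
// src is sw x sh samples, row-major; dst must hold (sw/2) x (sh/2) samples.
// Assumption of this sketch: an odd trailing row or column is dropped.
void halve_2x2(const float *src, int sw, int sh, float *dst)
{
    assert(sw >= 2 && sh >= 2);

    int dw = sw / 2;
    int dh = sh / 2;

    for (int y = 0; y < dh; y++) {
        const float *r0 = src + (size_t)(2 * y) * sw;  // upper source row
        const float *r1 = r0 + sw;                     // lower source row

        for (int x = 0; x < dw; x++) {
            dst[(size_t)y * dw + x] =
                0.25f * (r0[2 * x] + r0[2 * x + 1] + r1[2 * x] + r1[2 * x + 1]);
        }
    }
}
```

Applying this repeatedly to each of the base scales gives every smaller octave of that scale without any further filter design.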

The tricky bit is coming up with the initial scalers. Cubic resampling would probably work ok because of the limited range of the scale but I wanted to try to do a bit better. I came up with 3 intermediate scales above 0.5 and below 1.0, spaced evenly on a logarithmic scale, and then approximated them with single-digit ratios which can be implemented directly as upsample/filter/downsample steps. Even with very simple ratios they are quite close to the targets - within about 1%. I then used Octave to create a 5-tap filter for each phase of the upscaling and worked out (again) how to write a polyphase filter using it all.

(too lazy for images today)

This gives 4 scales including the original, and from there all the smaller scales are created by scaling each corresponding image by 1/2 in each dimension.

scale           approx ratio   approx value
  1.0             -              -
  0.840896        5/6            0.83333
  0.707107        5/7            0.71429
  0.594604        3/5            0.60000

Actually because of the way the algorithm works having single digit ratios isn't critical - it just reduces the size of the filter banks needed. But even as a lower limit to size and upper limit to error, these ratios should be good enough for a practical implementation.
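One way to arrive at ratios like these is a brute-force search over small denominators. This is only a sketch: the post does not say how the ratios were picked, and the function name and the `max_d` limit are hypothetical.

```c
#include <assert.h>

// Find the ratio u/d, with 1 <= u < d <= max_d, closest to target.
// Ties keep the first ratio found, i.e. the one with the smallest d.
void best_ratio(double target, int max_d, int *u_out, int *d_out)
{
    assert(target > 0.0 && target < 1.0 && max_d >= 2);

    double best_err = 1.0;  // larger than any possible error in (0, 1)

    for (int d = 2; d <= max_d; d++) {
        for (int u = 1; u < d; u++) {
            double err = (double)u / d - target;
            if (err < 0.0)
                err = -err;
            if (err < best_err) {
                best_err = err;
                *u_out = u;
                *d_out = d;
            }
        }
    }
}
```

With `max_d = 9` and the three target scales from the table, this search lands on 5/6, 5/7 and 3/5.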

A full upfirdn implementation uses division, but because of the limited range of scales that can be replaced with a single branch or conditional, i.e. it's simple to put on epiphany. In a more general case it could just use fixed-point arithmetic and one multiplication (for the modulo), which would have enough accuracy for video image scaling.
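The fixed-point remark can be read in more than one way; the sketch below is one possible interpretation, not necessarily the scheme intended. It assumes a 16.16 accumulator, takes the integer part as the source index, and recovers the phase from the fractional part with a single multiply. The function name and the half-phase starting bias are assumptions. The bias stops the rounded-down step from tipping a sample into the previous phase, and holds for a few thousand outputs at these ratios.

```c
#include <assert.h>
#include <stdint.h>

// Source index and filter phase for `count` outputs at ratio u/d,
// using a 16.16 fixed-point accumulator instead of integer div/mod.
// Assumption: source positions stay below 65536.
void fixed_point_steps(int u, int d, int count, int *sx_out, int *phase_out)
{
    assert(u > 0 && d > 0 && count >= 0);

    const uint32_t one = 1u << 16;
    // source step per output sample (d/u), rounded down
    const uint32_t step = (uint32_t)(((uint64_t)d << 16) / (uint32_t)u);
    // start half a phase in, to absorb the rounding error in step
    uint32_t acc = one / (2 * (uint32_t)u);

    for (int n = 0; n < count; n++) {
        sx_out[n] = (int)(acc >> 16);
        // fractional part times u gives the phase: the one multiplication
        phase_out[n] = (int)(((acc & (one - 1)) * (uint32_t)u) >> 16);
        acc += step;
    }
}
```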

This is a simple upfirdn filter implementation for this problem. Basically just for my own reference again.

  // scale ratio is u / d  (up / down); the incremental update below
  // assumes 0.5 < u/d <= 1.0
  int u = 5;
  int d = 6;

  // filter parameters: taps per phase
  int kn = 5;
  // filter coefficients: arranged per-phase, pre-reversed, pre-normalised
  float kern[u * kn];

  // source x location
  int sx = 0;
  // filter phase
  int p = 0;

  // resample one row: src has swidth samples, dst has dwidth samples
  for (int dx=0; dx<dwidth; dx++) {

    // convolve with filter for this phase, clamping reads to the source row
    float v = 0;
    for (int i=0; i<kn; i++)
       v += src[clamp(i + sx - kn/2, 0, swidth-1)] * kern[i + p * kn];
    dst[dx] = v;

    // increment src location by scale ratio using incremental div/mod:
    // p < u on entry and u <= d < 2u, so sx advances by 1 or 2
    p += d;

    sx += (p >= u * 2) ? 2 : 1;
    p -= (p >= u * 2) ? u * 2 : u;

    // or general case using integer division:
    // sx += p / u;
    // p %= u;
  }

See also: https://code.google.com/p/upfirdn/.
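For a ratio outside the (0.5, 1.0] range, the commented-out general case in the fragment above applies. Here is a self-contained sketch of that variant. The function signature, the `swidth` parameter and the `clamp_index` helper are assumptions added so that it compiles on its own. The kernel layout is the one described in the fragment: `u` phases of `kn` taps each.

```c
#include <assert.h>

// Clamp i to the range [lo, hi].
int clamp_index(int i, int lo, int hi)
{
    return i < lo ? lo : (i > hi ? hi : i);
}

// Resample one row by the ratio u/d using a polyphase filter bank.
// kern holds u phases of kn taps each: kern[p * kn + i].
void resample_row(const float *src, int swidth, float *dst, int dwidth,
                  const float *kern, int u, int d, int kn)
{
    assert(swidth > 0 && u > 0 && d > 0 && kn > 0);

    int sx = 0;  // source x location
    int p = 0;   // filter phase

    for (int dx = 0; dx < dwidth; dx++) {
        const float *k = kern + p * kn;
        float v = 0.0f;

        // apply this phase's taps around sx, clamping at the row edges
        for (int i = 0; i < kn; i++)
            v += src[clamp_index(sx + i - kn / 2, 0, swidth - 1)] * k[i];
        dst[dx] = v;

        // advance by d/u source samples using integer division
        p += d;
        sx += p / u;
        p %= u;
    }
}
```

A quick sanity check is to fill every phase with a single 1.0 at the centre tap: the output then simply picks the source sample at floor(dx * d / u).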

Filters can be created using Octave via fir1(), e.g. I used fir1(u * kn - 1, 1/d). This creates a FIR filter of u * kn taps (fir1(n, w) returns n + 1 coefficients) which can be broken into 'u' phases, each 'kn' taps long.
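Breaking the prototype filter into phases could look roughly like the sketch below. The interleaving (phase p takes coefficients p, p + u, p + 2u, ...) is the usual polyphase decomposition. Reversing the tap order and scaling each phase to sum to 1 are assumptions about what "pre-reversed, pre-normalised" means, and the sketch makes no attempt at exact sub-pixel centring. The function name is hypothetical.

```c
#include <assert.h>

// Split a prototype filter h of u * kn taps into u phases of kn taps.
// Output layout: kern[p * kn + i], taps reversed within each phase and
// each phase scaled to unit sum (both are assumptions of this sketch).
void split_phases(const float *h, int u, int kn, float *kern)
{
    assert(u > 0 && kn > 0);

    for (int p = 0; p < u; p++) {
        float sum = 0.0f;

        for (int i = 0; i < kn; i++) {
            // phase p uses every u-th coefficient starting at p,
            // stored last-to-first
            float c = h[(kn - 1 - i) * u + p];
            kern[p * kn + i] = c;
            sum += c;
        }

        assert(sum != 0.0f);
        for (int i = 0; i < kn; i++)
            kern[p * kn + i] /= sum;
    }
}
```

Normalising per phase keeps a flat input flat regardless of which phase produces each output sample.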

This stuff would fit with the scaling thing I was working on for epiphany and allow for high quality one-pass scaling, although I haven't tried putting it in yet. I've been a bit distracted with other stuff lately. It would also work with the NEON code I described in an earlier post for horizontal SIMD resampling and of course for vertical it just happens.

AFAIK this is pretty much the type of algorithm that is in all hardware video scalers e.g. xbone/ps4/mobile phones/tablets etc. They might have some limitations on the ratios or the number of taps but the filter coefficients will be fully programmable. So basically all that talk of the xbone having some magic 'advanced' scaler was simply utter bullsnot. It also makes m$'s choice of some scaling parameters that cause severe over-sharpening all the more baffling. The above filter can also be broken in the same way, but over-sharpening is something you always try to minimise, not enhance.

The algorithm above can create scalers of arbitrarily good quality, although the scaler can never add any more signal than is originally present so a large zoom will become blurry. But a good quality scaler shouldn't add signal that wasn't there to start with or lose any signal that was. The xbone seems to be doing both but that's simply a poor choice of numbers and not due to the hardware.

Having said that, there are other more advanced techniques for resampling to higher resolutions that can achieve super-resolution such as those based on statistical inference, but they are not practical to fit on the small bit of silicon available on a current gpu even if they ran fast enough.

Looks like it won't reach 45 today after some clouds rolled in, but it's still a bit hot to be doing much of anything ... like hacking code or writing blogs.

1 comment:

Frist said...

Another method is to use a fixed filter to do a high quality x2 interpolation, followed by a weighted linear interpolation between adjacent interpolated points. This separates the sampling from the interpolation function, allowing you to have an arbitrary sampling ratio, while keeping the FIR fixed, small and symmetric.

Purists use a cubic interpolation as the second stage instead of linear interpolation, and/or an initial x4 interpolation (requiring 3 fixed FIRs instead of just 1).

* H.264/AVC uses (2,-10,40,40,-10,2), resulting in 3 multiplies per input pixel (the symmetric FIR) + some shifts and additions to get quarter-pel and/or 1/8-pel values. Arbitrary intermediate values can be computed with 2 additional multiplications if needed.
* HEVC DCTIF8 uses (-1,4,-11,40,40,-11,4,-1)
* HEVC DCTIF6 uses (2,-9,39,39,-9,2)
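A rough sketch of the two-stage scheme in C, using the 6-tap (2,-10,40,40,-10,2) filter quoted above with a divisor of 64 (the sum of its taps) for the half-sample points and plain linear interpolation between neighbouring points on the 2x grid. The function names, the float sample type and the edge clamping are assumptions for illustration. A codec would use integer arithmetic with defined rounding.

```c
#include <assert.h>

// Clamp i to the range [lo, hi].
int clampi(int i, int lo, int hi)
{
    return i < lo ? lo : (i > hi ? hi : i);
}

// Half-sample value between src[i] and src[i + 1], fixed 6-tap filter.
float half_pel(const float *src, int n, int i)
{
    static const float taps[6] = {2, -10, 40, 40, -10, 2};
    float v = 0.0f;

    for (int t = 0; t < 6; t++)
        v += taps[t] * src[clampi(i - 2 + t, 0, n - 1)];
    return v / 64.0f;
}

// Value on the 2x grid: even j is an original sample, odd j a half-pel.
float grid2x(const float *src, int n, int j)
{
    return (j & 1) ? half_pel(src, n, j / 2) : src[j / 2];
}

// Sample the row at fractional position x, 0 <= x <= n - 1.
float sample_at(const float *src, int n, float x)
{
    assert(n > 0 && x >= 0.0f && x <= (float)(n - 1));

    int j = (int)(2.0f * x);        // left neighbour on the 2x grid
    float f = 2.0f * x - (float)j;  // weight of the right neighbour
    float a = grid2x(src, n, j);
    float b = (j + 1 > 2 * (n - 1)) ? a : grid2x(src, n, j + 1);

    return a + f * (b - a);
}
```

Because the FIR is fixed, only the weight `f` changes with the sampling ratio, which is the separation of sampling from interpolation described above.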

relevant papers:
* A COMPARISON OF FRACTIONAL-PEL INTERPOLATION FILTERS IN HEVC AND H.264/AVC
* An Efficient NEON-based Quarter-pel Interpolation Method for HEVC
* HIGH-THROUGHPUT INTERPOLATION HARDWARE ARCHITECTURE WITH COARSE-GRAINED RECONFIGURABLE DATAPATHS FOR HEVC
etc