Question

I'm trying to convolve an image with a generated 16x16 kernel. I used OpenCV's FilterEngine class, but it only runs on the CPU and I'm trying to accelerate the app. I know OpenCV also has a GPU version of FilterEngine, but to my understanding it isn't supported on iOS. GPUImage lets you perform convolution with a generated 3x3 kernel. Is there any other way to accelerate the convolution? A different library that runs on the GPU?


Solution 2

You can do a 16x16 convolution using GPUImage, but you'll need to write your own filter to do so. The 3x3 convolution in the framework samples from pixels in a 3x3 area around each pixel in the input image and applies the matrix of weights you feed in. The GPUImage3x3ConvolutionFilter.m source file within the framework should be reasonably easy to read, but I can provide a little context if you wish to step beyond what I have there.

The first thing I do is use the following vertex shader:

 attribute vec4 position;
 attribute vec4 inputTextureCoordinate;

 uniform float texelWidth;
 uniform float texelHeight; 

 varying vec2 textureCoordinate;
 varying vec2 leftTextureCoordinate;
 varying vec2 rightTextureCoordinate;

 varying vec2 topTextureCoordinate;
 varying vec2 topLeftTextureCoordinate;
 varying vec2 topRightTextureCoordinate;

 varying vec2 bottomTextureCoordinate;
 varying vec2 bottomLeftTextureCoordinate;
 varying vec2 bottomRightTextureCoordinate;

 void main()
 {
     gl_Position = position;

     vec2 widthStep = vec2(texelWidth, 0.0);
     vec2 heightStep = vec2(0.0, texelHeight);
     vec2 widthHeightStep = vec2(texelWidth, texelHeight);
     vec2 widthNegativeHeightStep = vec2(texelWidth, -texelHeight);

     textureCoordinate = inputTextureCoordinate.xy;
     leftTextureCoordinate = inputTextureCoordinate.xy - widthStep;
     rightTextureCoordinate = inputTextureCoordinate.xy + widthStep;

     topTextureCoordinate = inputTextureCoordinate.xy - heightStep;
     topLeftTextureCoordinate = inputTextureCoordinate.xy - widthHeightStep;
     topRightTextureCoordinate = inputTextureCoordinate.xy + widthNegativeHeightStep;

     bottomTextureCoordinate = inputTextureCoordinate.xy + heightStep;
     bottomLeftTextureCoordinate = inputTextureCoordinate.xy - widthNegativeHeightStep;
     bottomRightTextureCoordinate = inputTextureCoordinate.xy + widthHeightStep;
 }

to calculate the positions from which to sample the pixel colors used in the convolution. Because normalized coordinates are used, the X and Y spacings between pixels are 1.0/[image width] and 1.0/[image height], respectively. For a 1920x1080 image, for example, texelWidth would be 1.0/1920.0 and texelHeight would be 1.0/1080.0.

The texture coordinates for the pixels to be sampled are calculated in the vertex shader for two reasons: it's more efficient to do this calculation once per vertex (of which there are six in the two triangles that make up the rectangle of the image) than once for each fragment (pixel), and it avoids dependent texture reads where possible. A dependent texture read is one where the texture coordinate is calculated in the fragment shader rather than simply passed in from the vertex shader, and such reads are much slower on iOS GPUs.
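To make that distinction concrete, here is a small illustrative fragment shader (not from the framework; it reuses the inputImageTexture, texelWidth, and varying names from above) that performs the same left-neighbor lookup both ways:

 precision highp float;

 uniform sampler2D inputImageTexture;
 uniform float texelWidth;

 varying vec2 textureCoordinate;
 varying vec2 leftTextureCoordinate;

 void main()
 {
     // Non-dependent read: the coordinate arrives ready-made from the vertex
     // shader, so the hardware can prefetch the texel before this shader runs.
     vec3 fastRead = texture2D(inputImageTexture, leftTextureCoordinate).rgb;

     // Dependent read: the coordinate is computed here in the fragment shader,
     // which defeats that prefetch and is much slower on iOS GPUs.
     vec3 slowRead = texture2D(inputImageTexture, textureCoordinate - vec2(texelWidth, 0.0)).rgb;

     // Both reads return the same color; only the performance differs.
     gl_FragColor = vec4(0.5 * (fastRead + slowRead), 1.0);
 }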

Once I have the texture locations calculated in the vertex shader, I pass them into the fragment shader as varyings and use the following code there:

 precision highp float;

 uniform sampler2D inputImageTexture;

 uniform mat3 convolutionMatrix;

 varying vec2 textureCoordinate;
 varying vec2 leftTextureCoordinate;
 varying vec2 rightTextureCoordinate;

 varying vec2 topTextureCoordinate;
 varying vec2 topLeftTextureCoordinate;
 varying vec2 topRightTextureCoordinate;

 varying vec2 bottomTextureCoordinate;
 varying vec2 bottomLeftTextureCoordinate;
 varying vec2 bottomRightTextureCoordinate;

 void main()
 {
     vec3 bottomColor = texture2D(inputImageTexture, bottomTextureCoordinate).rgb;
     vec3 bottomLeftColor = texture2D(inputImageTexture, bottomLeftTextureCoordinate).rgb;
     vec3 bottomRightColor = texture2D(inputImageTexture, bottomRightTextureCoordinate).rgb;
     vec4 centerColor = texture2D(inputImageTexture, textureCoordinate);
     vec3 leftColor = texture2D(inputImageTexture, leftTextureCoordinate).rgb;
     vec3 rightColor = texture2D(inputImageTexture, rightTextureCoordinate).rgb;
     vec3 topColor = texture2D(inputImageTexture, topTextureCoordinate).rgb;
     vec3 topRightColor = texture2D(inputImageTexture, topRightTextureCoordinate).rgb;
     vec3 topLeftColor = texture2D(inputImageTexture, topLeftTextureCoordinate).rgb;

     vec3 resultColor = topLeftColor * convolutionMatrix[0][0] + topColor * convolutionMatrix[0][1] + topRightColor * convolutionMatrix[0][2];
     resultColor += leftColor * convolutionMatrix[1][0] + centerColor.rgb * convolutionMatrix[1][1] + rightColor * convolutionMatrix[1][2];
     resultColor += bottomLeftColor * convolutionMatrix[2][0] + bottomColor * convolutionMatrix[2][1] + bottomRightColor * convolutionMatrix[2][2];

     gl_FragColor = vec4(resultColor, centerColor.a);
 }

This reads each of the 9 colors and applies the weights from the 3x3 matrix that was supplied for convolution.

That said, a 16x16 convolution is a fairly expensive operation. You're looking at 256 texture reads per pixel. On older devices (iPhone 4 or so), you got around 8 texture reads per pixel for free if they were non-dependent reads. Once you went over that, performance started to drop dramatically. Later GPUs sped this up significantly, though. The iPhone 5S, for example, does well over 40 dependent texture reads per pixel pretty much for free. Even the heaviest shaders on 1080p video barely slow it down.

As sansuiso suggests, if you have a way of separating your kernel into horizontal and vertical passes (as can be done for a Gaussian blur kernel), you can get much better performance thanks to a dramatic reduction in texture reads. For your 16x16 kernel, you could drop from 256 reads per pixel to 32 (16 horizontal plus 16 vertical), and even those 32 would be much faster because they would come from passes that each sample only 16 texels at a time.
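As a sketch of what one of those passes could look like (this is not a GPUImage shader; the weights array and texelWidth uniform are assumptions, to be supplied by the host code), a horizontal 16-tap fragment shader might read:

 precision highp float;

 uniform sampler2D inputImageTexture;
 uniform float texelWidth;      // 1.0 / image width
 uniform float weights[16];     // horizontal half of the separated kernel

 varying vec2 textureCoordinate;

 void main()
 {
     vec4 sum = vec4(0.0);
     for (int i = 0; i < 16; i++)
     {
         // Offsets span the 16 taps centered on the current pixel.
         float offset = (float(i) - 7.5) * texelWidth;

         // Note: computing the coordinate here makes this a dependent read;
         // for top speed you would precompute these offsets in the vertex
         // shader, as described above, within the hardware's varying limits.
         sum += texture2D(inputImageTexture, textureCoordinate + vec2(offset, 0.0)) * weights[i];
     }
     gl_FragColor = sum;
 }

A matching vertical pass, stepping by texelHeight instead, would then be run on the output of this one.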

The crossover point at which an operation like this becomes faster in Accelerate on the CPU than in OpenGL ES varies with the device you're running on. In general, GPUs on iOS devices have outpaced CPUs in performance growth with each recent generation, so that bar has shifted toward the GPU side over the last several iOS models.

OTHER TIPS

You can use Apple's Accelerate framework for this. It's available on both iOS and macOS, by the way, so you may be able to reuse your code later.

In order to achieve the best performance, you may need to consider the following options (both tricks are summarized in a short block after the list):

  • if your convolution kernel is separable, use a separable implementation. This is the case for kernels that factor into an outer product of a row and a column vector, such as the Gaussian kernel. It will save you an order of magnitude in computation time;
  • if your images have power-of-two sizes, consider using the FFT trick. Convolution in the spatial domain costs O(k^2) operations per pixel for a k x k kernel, while multiplication in the Fourier domain costs O(1) per pixel (with the transforms themselves costing O(log N) per pixel). Thus, you can 1) FFT your image and kernel, 2) multiply the results term by term, and 3) inverse-FFT the product. Since FFT implementations are fast (e.g., Apple's FFT in the Accelerate framework), this series of operations can result in a performance boost.
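For reference, here is a compact statement of both tricks (notation only; I is an N x N image, K a k x k kernel, v and h the factor vectors, and F the Fourier transform):

 % Separability: if K factors as an outer product, two 1-D passes suffice,
 % cutting the per-pixel cost from k^2 taps to 2k taps.
 K = v \, h^{\top} \quad\Longrightarrow\quad I * K = (I * v) * h^{\top}

 % Convolution theorem: convolution becomes a pointwise product in the
 % Fourier domain, for O(N^2 \log N) total work instead of O(N^2 k^2).
 I * K = \mathcal{F}^{-1}\!\left( \mathcal{F}(I) \cdot \mathcal{F}(K) \right)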

You can find more insight into iOS image processing optimization in this book, which I also reviewed here.

Licensed under: CC-BY-SA with attribution