Basic Knowledge
The Box Blur algorithm also uses the convolution operation, and each position of the convolution kernel it uses has the same weight. For example, a 3*3 size convolution kernel can be expressed as follows:
And the Box Blur convolution kernel is also linearly separable, and the twodimensional Box Blur convolution operation can be split into two onedimensional Box Blur convolution operations in the same way as in the separable Gaussian Blur. Take a 3×3 Box Blur convolution kernel as an example:
Each position of the convolution kernel has the same weight, which means that the value of each pixel obtained by processing is the average value of itself and other pixels in the field. The advantage of this is that it can be implemented using a faster accumulation algorithm instead of the sliding window algorithm [1]. When processing a matrix, the average of the cumulative sum is calculated and written to the output for the elements of the size of the first convolution kernel. For the new elements after that, the cumulative sum is subtracted from the elements that are no longer covered by the convolution kernel, plus the newly added elements, and then the average value is calculated and written to the output. It can be seen that for the onedimensional Box Blur operation, the time complexity of processing one element is: O (1), which has nothing to do with the length of the convolution kernel and is a constant:
（Source of Image: https://software.intel.com/content/www/us/en/develop/blogs/aninvestigationoffastrealtimegpubasedimagebluralgorithms.html）
Taking a 9×9 Box Blur convolution kernel as an example, the horizontal operation process can be expressed as:
（Source of Image: https://software.intel.com/content/www/us/en/develop/blogs/aninvestigationoffastrealtimegpubasedimagebluralgorithms.html）
But modern GPU execution is highly parallelized and outoforder, and optimizations that use accumulators to reduce time complexity are efficient for CPU computations, but not for GPUs. In order to achieve this effect, consider using Computer Shader to complete such operations. In this way, although the time complexity is reduced to a constant, the fixed cost of this algorithm is increased.
According to the subsequent implementation, the results we get are as follows:
It can be clearly seen that the quality of the results obtained by performing Box Blur once is far worse than the quality of the results obtained after Gaussian Blur processing. The central limit theorem points out that under appropriate conditions, the mean of a large number of independent random variables will converge to a normal distribution according to the distribution after proper standardization [2]. This means that if we do multiple Box Blur processing on an image, the final result will be close to Gaussian Blur quality. But this also means that if three Box Blur operations are performed, each time the horizontal and vertical Box Blur operations are performed twice, 6 Draw Calls will be generated. As a result, the fixed cost of this algorithm is further increased.
Unity Implementation
According to the above algorithm, ComputerShader is used to implement the Box Blur algorithm using an accumulator on Unity:
Create a new ComputerShader to postprocess a screen with a screen size of 1024×768.
First, horizontal processing is performed, and the method of selecting the pixel value of the sampling boundary for the sampling beyond the boundary of the picture is supplemented. In order to ensure the data sharing of each row, choose to let each row be a thread, a total of 768 threads:
[numthreads(1,768,1)]
void BoxBlurX(uint3 id : SV_DispatchThreadID)
{
float4 colourSum = InputRT[int2(0, id.y)] * BlurDiameter / 2;
for (int index = 0; index <= BlurDiameter / 2; index++)
colourSum += InputRT[int2(index, id.y)];
for (int index = 0; index < InputRTWidth; index++)
{
Result[int2(index,id.y)] = colourSum / BlurDiameter;
// move window to the next
float4 leftBorder = InputRT[int2(max(index – BlurDiameter / 2, 0), id.y)];
float4 rightBorder = InputRT[int2(min(index + BlurDiameter / 2 + 1, InputRTWidth – 1), id.y)];
colourSum = leftBorder;
colourSum += rightBorder;
}
}
Vertical processing is performed in the same way, with a total of 1024 threads:
[numthreads(1024, 1, 1)]
void BoxBlurY(uint3 id : SV_DispatchThreadID)
{
float4 colourSum = InputRT[int2(id.x, 0)] * BlurDiameter / 2;
for (int index = 0; index <= BlurDiameter / 2; index++)
colourSum += InputRT[int2(id.x, index)];
for (int index = 0; index < InputRTHeight; index++)
{
Result[int2(id.x,index)] = colourSum / BlurDiameter;
// move window to the next
float4 leftBorder = InputRT[int2(id.x, max(index – BlurDiameter / 2, 0))];
float4 rightBorder = InputRT[int2(id.x, min(index + BlurDiameter / 2 + 1, InputRTHeight – 1))];
colourSum = leftBorder;
colourSum += rightBorder;
}
}
Call ComputerShader on the Csharp side, and process BoxBlur one line at a time:
private void OnRenderImage(RenderTexture input, RenderTexture output)
{
BoxBlurShader.SetInt(“BlurDiameter”, BlurDiameter);
BoxBlurShader.SetInt(“InputRTWidth”, input.width);
BoxBlurShader.SetInt(“InputRTHeight”, input.height);
BoxBlurShader.SetTexture(kernelHandleX, “InputRT”, input);
BoxBlurShader.SetTexture(kernelHandleX, “Result”, tmp0);
BoxBlurShader.Dispatch(kernelHandleX, 1, 1, 1);
BoxBlurShader.SetTexture(kernelHandleY, “InputRT”, tmp0);
BoxBlurShader.SetTexture(kernelHandleY, “Result”, tmp1);
BoxBlurShader.Dispatch(kernelHandleY, 1, 1, 1);
Graphics.Blit(tmp1, output);
}
We choose a convolution kernel of size 36×36 to get the result:
It can be seen that there is a clear sense of mosaic, and the Box Blur processing is performed again:
It can be seen that the effect has been very good.
Summary
The biggest advantage of the Box Blur algorithm is that it reduces the time complexity of processing one element to a constant, making it no longer related to the size of the convolution kernel. However, its fixed cost is high. And because of the use of Computer Shader, some devices will not support this algorithm. The advantage of this algorithm only shows when the kernel size used exceeds a certain threshold. So, it may only work in some specific cases.
Reference:
[1]https://en.wikipedia.org/wiki/Box_blur
[2]https://zh.wikipedia.org/wiki/%E4%B8%AD%E5%BF%83%E6%9E%81%E9%99%90%E5%AE%9A%E7%90%86
https://ams.xyz/dev/FinalDegreeProjectReportAndr%C3%A9sValencia.pdf
https://cloud.tencent.com/developer/article/1035559
https://en.wikipedia.org/wiki/Box_blur
http://blog.ivank.net/fastestgaussianblur.html
http://dev.theomader.com/gaussiankernelcalculator/
http://genderi.org/framebufferpostprocessingeffectsindoublestealwreckl.html
Screen Post Processing Effects Series
Screen Post Processing Effects Chapter 1 – Basic Algorithm of Gaussian Blur and Its Implementation
That’s all for today’s sharing. Of course, life is boundless but knowing is boundless. In the long development cycle, these problems you see maybe just the tip of the iceberg. We have already prepared more technical topics on the UWA Q&A website, waiting for you to explore and share them together. You are welcome to join us, who love progress. Maybe your method can solve the urgent needs of others, and the “stone” of other mountains can also attack your “jade”.
YOU MAY ALSO LIKE!!!
UWA Website: https://en.uwa4d.com
UWA Blogs: https://blog.en.uwa4d.com
UWA Product: https://en.uwa4d.com/feature/got
You may also like

Dynamic Resolution in Games from Principle to Application
January 4, 2023 
Introduction of the Raytracing Technology Part 2
December 21, 2022