“Unity Mobile Game Performance Optimization Series” consists of four parts. Today we introduce the last part: the trade-off between graphics quality and GPU pressure. It has 8 sections covering bandwidth, Overdraw, rendering effects, post-processing, rendering strategy, Shader complexity, and other common graphics performance topics.
More and more mobile game teams are pursuing higher and higher graphics quality, and in many projects the GPU has overtaken the CPU as the main source of performance pressure. GPU problems, however, tend to sit at the hardware level, and chips from different manufacturers behave differently. In addition, GPU performance data is often not available from the tools that ship with the Unity engine, and the tools provided by hardware manufacturers are difficult to use. Together, these make GPU performance problems hard to troubleshoot and optimize.
2.1 GPU Stress and Heat/Battery Consumption
Although UWA has run some internal quantitative experiments on the relationship between bandwidth and heat generation/battery consumption, what actually happens inside the chip is far more complicated than we imagined. Our conclusion, based on our own experience and on discussions with hardware engineers at chip manufacturers, is that GPU bandwidth pressure on mobile devices does affect energy consumption, especially heat generation. This is a qualitative statement, though. At present neither we nor the chip manufacturers have a quantitative formula for the size of the impact. So when a project runs hot or drains the battery quickly, bandwidth is one of the first places developers should look.
UWA tools can monitor changes in hardware temperature during game testing. Generally speaking, a temperature that stays above 55°C for a long time needs attention. In UWA’s further GPU-specific services, GPU and CPU temperatures are collected and displayed in more detail.
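The 55°C rule of thumb can be sketched as a small check over logged temperature samples. This is only an illustration: the function names, the sampling interval, and the 60-second meaning of "a long time" are assumptions, not values from UWA's tools.

```python
from typing import Sequence


def longest_hot_streak_s(temps_c: Sequence[float],
                         sample_interval_s: float,
                         threshold_c: float = 55.0) -> float:
    """Longest continuous time (in seconds) spent above threshold_c."""
    longest = current = 0
    for temp in temps_c:
        # Extend the streak while we stay above the threshold, else reset it.
        current = current + 1 if temp > threshold_c else 0
        longest = max(longest, current)
    return longest * sample_interval_s


def needs_attention(temps_c: Sequence[float],
                    sample_interval_s: float,
                    threshold_c: float = 55.0,
                    min_duration_s: float = 60.0) -> bool:
    """True if the device stayed hot for at least min_duration_s (assumed value)."""
    return longest_hot_streak_s(temps_c, sample_interval_s, threshold_c) >= min_duration_s
```

With one sample every 30 seconds, three consecutive readings above 55°C would be flagged, while isolated spikes would not.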
2.2 GPU Pressure and Frame Rate
As mentioned above, GPU pressure increases the time the CPU spends waiting for the GPU to finish its work. It also increases the time consumed by the main functions of the rendering module on the CPU side, which lowers the frame rate and causes freezes.
In addition, because mobile hardware is small and dissipates heat poorly, a hot GPU also raises the temperature of the CPU, and in serious cases this leads to frequency throttling.
Besides the time consumed by Gfx.WaitForPresent and the main functions of the rendering module, which can be monitored with the UWA tool, the GOT Online Overview mode also reports GPU time consumption during the test, so GPU bottlenecks can be monitored visually. GOT Online currently also integrates the Mali API for GPU bandwidth statistics: submit an Overview report from a test device with a Mali chip and you get GPU shading and bandwidth data. Support for Qualcomm and other hardware will be added gradually.
In UWA’s further GPU-specific services, bandwidth data and Clocks data are also combined to analyze, DrawCall by DrawCall, why bandwidth usage and GPU time are high in high-pressure scenarios. Some common reasons will be discussed at the beginning of 4.3.
2.3 Optimizing Bandwidth
Bandwidth data is an important reference for measuring GPU pressure. On the relatively high-end Mi 10, running at full resolution, holding 30 FPS with stable heat requires keeping total bandwidth below 3000MB/s. Common optimization methods are:
(1) Compression Format
As discussed in the memory chapter, a reasonable compression format effectively reduces texture bandwidth.
(2) Texture Mipmap
For 3D scenes, enabling Mipmap on the textures used by objects effectively reduces bandwidth at the cost of a small memory increase. The farther an object is from the camera, the lower-resolution mipmap level the engine samples. Mipmap only pays off, however, when the texture resolution is reasonable. A common problem is that, during actual rendering, only a single level of the mipmap chain (such as level 0) is used for all or most of the sampling, so the memory spent on the other levels is wasted. For such textures, consider reducing the resolution.
The UWA tool uses different colors to mark pixels sampled at different mipmap levels, which helps locate the problem. In further services, the textures that show this kind of waste are listed for each scene, together with the usage rate of each mipmap level.
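To make the "small memory increase" concrete, here is a minimal sketch that builds a mip chain and measures its overhead, plus a rough estimate of which level gets sampled. The helper names are hypothetical, and the level estimate is a simplification: real GPUs derive the level from per-pixel UV derivatives, not from a single on-screen size.

```python
import math
from typing import List, Tuple


def mip_chain_sizes(width: int, height: int) -> List[Tuple[int, int]]:
    """Sizes of every mip level, halving each dimension down to 1x1."""
    sizes = [(width, height)]
    while width > 1 or height > 1:
        width = max(1, width // 2)
        height = max(1, height // 2)
        sizes.append((width, height))
    return sizes


def mip_memory_overhead(width: int, height: int) -> float:
    """Extra texels needed for levels 1..n, relative to level 0."""
    sizes = mip_chain_sizes(width, height)
    base = width * height
    total = sum(w * h for w, h in sizes)
    return (total - base) / base


def approx_mip_level(texture_size: int, on_screen_size: int) -> int:
    """Rough level estimate: one level per halving of the on-screen size."""
    return max(0, math.floor(math.log2(texture_size / on_screen_size)))
```

For a 1024x1024 texture the chain has 11 levels and costs about one third more memory. If that texture is never drawn smaller than roughly 1024 pixels across, the estimate stays at level 0 and the extra levels go unused.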
(3) Reasonable Texture Sampling Method
Besides making good use of non-zero mipmap levels, pay attention to anisotropic filtering and trilinear filtering in the project. In general, when a compressed texture is sampled, the GPU first reads from its cache. On a cache miss it has to read from System Memory, which is farther from the GPU and costs more cycles. As the number of sampling points grows, the probability of a cache miss grows, and bandwidth rises with it. The anisotropic level in Unity ranges from 1 to 16 and should be kept at 1 wherever possible; trilinear filtering reads 8 texels, twice as many as bilinear filtering.
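The texel counts above can be tallied in a few lines. This is a worst-case sketch under the assumption that every anisotropic tap performs a full bilinear or trilinear fetch; real hardware chooses the number of taps adaptively, so actual counts are usually lower. The function name is hypothetical.

```python
def max_texels_per_sample(trilinear: bool, aniso_level: int = 1) -> int:
    """Upper bound on texels read for one texture sample.

    Bilinear reads 4 texels from one mip level; trilinear reads 4 from each
    of two adjacent levels. Anisotropic filtering is modeled as up to
    aniso_level such taps (an assumed worst case).
    """
    if aniso_level < 1:
        raise ValueError("aniso_level must be at least 1")
    texels_per_tap = 8 if trilinear else 4
    return texels_per_tap * aniso_level
```

Under this model, bilinear at level 1 reads 4 texels, trilinear reads 8, and trilinear with an anisotropic level of 16 may read up to 128, which is why the setting is worth keeping low.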
(4) Modify the Rendering Resolution
Lowering the rendering resolution directly, to 0.9 times or even less, reduces the number of pixels involved in texture sampling and is an even more effective way to cut bandwidth.
Also pay attention to the bandwidth used for reading vertices. Its share is generally smaller than that of textures. Unlike texture bandwidth, however, it is not reduced by lowering the rendering resolution. Therefore, once mesh resources and the number of faces rendered on screen are under control, the bandwidth for reading vertices should reasonably account for 10%-20% of the total bandwidth.
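A quick calculation shows why a modest scale factor helps: the pixel count falls with the square of the scale. The sketch below assumes a 1920x1080 target purely for illustration; the function names are hypothetical.

```python
def scaled_pixel_count(width: int, height: int, scale: float) -> int:
    """Pixels rendered when both dimensions are multiplied by scale."""
    return round(width * scale) * round(height * scale)


def pixel_saving(width: int, height: int, scale: float) -> float:
    """Fraction of pixels saved compared with full resolution."""
    return 1 - scaled_pixel_count(width, height, scale) / (width * height)
```

At a scale of 0.9, a 1920x1080 target shrinks to 1728x972, which is about 19% fewer pixels to shade and sample.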
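The bandwidth figures in this section can be turned into a per-frame budget. The sketch below uses the 3000MB/s and 30 FPS figures quoted for the Mi 10 and the 10%-20% vertex share; the function names and the measured values passed in are hypothetical.

```python
def per_frame_budget_mb(bandwidth_mb_per_s: float, fps: float) -> float:
    """Bandwidth available to a single frame, in MB."""
    return bandwidth_mb_per_s / fps


def vertex_share(vertex_mb: float, total_mb: float) -> float:
    """Share of total bandwidth spent on reading vertices."""
    return vertex_mb / total_mb


def vertex_share_is_reasonable(vertex_mb: float, total_mb: float) -> bool:
    """True if the vertex share falls in the 10%-20% band from the text."""
    return 0.10 <= vertex_share(vertex_mb, total_mb) <= 0.20
```

A 3000MB/s ceiling at 30 FPS leaves 100MB per frame, so a frame that reads 30MB of vertex data is well outside the suggested band and points at mesh complexity rather than textures.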
Overdraw is the GPU overhead caused by drawing the same pixel multiple times. Ideally, when the rendering order in the scene is well controlled, the Overdraw of opaque objects stays at 1 layer. The main sources of Overdraw are therefore translucent objects, that is, particle systems and UI.
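As a minimal sketch of how Overdraw is counted, the following treats each translucent draw as an axis-aligned rectangle on a small pixel grid and counts how many times each pixel is written. The rectangle model and function name are assumptions made for illustration; they are not how any profiler measures it.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def overdraw_stats(width: int, height: int, rects: List[Rect]) -> Tuple[float, int]:
    """Return (average layers per pixel, maximum layers on any pixel)."""
    counts = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in rects:
        # Clip each rectangle to the screen before counting its pixels.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                counts[y][x] += 1
    total = sum(sum(row) for row in counts)
    peak = max(max(row) for row in counts)
    return total / (width * height), peak
```

On a 4x4 screen, a full-screen panel plus a 2x2 element on top gives an average of 1.25 layers and a peak of 2.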
3.1 Particle System
Flexible use of UWA’s performance analysis tools can effectively locate the particle systems that contribute most to GPU pressure.
One way is to create a dedicated empty test scene, play the particle systems used in the project one after another, then package the build with the UWA SDK and submit a GOT Online Overview report. By matching the test screenshots against the GPU time curve, you can find the particle systems that cost the most GPU time while playing.
Another way is to use the UWA local resource detection report directly, which lists the particle systems that cause high Overdraw. After filtering out the particle systems that need optimization, reduce their complexity and screen coverage as much as possible on low-end devices, which lowers their rendering overhead and improves smoothness on those devices. The specific methods are as follows:
(1) Limit the Max Particles value of the particle system on low-end and mid-range models;
(Figure: result after limiting Max Particles to 10.)
(2) Only “important” particle systems are retained on low-end and mid-range models. For example, for a special effect of flame burning, only the flame itself is retained, and the surrounding smoke and dust effects are turned off;
(3) Reduce the coverage area of particle effects on the screen as much as possible. The larger the coverage area, the easier it is to generate overlapping coverage, resulting in higher Overdraw.
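The three measures above can be sketched as a small tiering step plus a worst-case fill estimate. Everything here is hypothetical: the `Emitter` fields, the example numbers, and the estimate itself (Max Particles times the screen fraction one particle covers) are illustrative assumptions, not UWA's metric.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Emitter:
    name: str
    max_particles: int
    important: bool             # keep on low-end devices?
    avg_screen_fraction: float  # share of the screen one particle covers


def simplify_for_low_end(emitters: List[Emitter], particle_cap: int) -> List[Emitter]:
    """Drop non-important emitters and cap Max Particles on the rest."""
    kept = [e for e in emitters if e.important]
    return [replace(e, max_particles=min(e.max_particles, particle_cap)) for e in kept]


def worst_case_fill(emitters: List[Emitter]) -> float:
    """Screen-equivalents of pixels drawn if every particle is alive at once."""
    return sum(e.max_particles * e.avg_screen_fraction for e in emitters)


fire_effect = [
    Emitter("flame", 50, True, 0.02),
    Emitter("smoke", 100, False, 0.05),
    Emitter("dust", 30, False, 0.01),
]
low_end_effect = simplify_for_low_end(fire_effect, particle_cap=10)
```

In this made-up example the full effect can fill about 6.3 screens' worth of pixels in the worst case, while the low-end version (flame only, capped at 10 particles) fills about 0.2.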
(1) When a full-screen UI is opened, other UIs blocked by the background can be closed.
(2) For UI with an Alpha of 0, you can check Cull Transparent Mesh on its Canvas Renderer component, which keeps the UI responding to events without rendering it.
(3) Use the Mask component as little as possible: it not only increases the drawing overhead but also raises the DrawCall count. When Overdraw is high, consider using RectMask2D instead.
(4) Under URP, pay extra attention to unnecessary Copy Color or Copy Depth passes. When the UI camera and the battle scene camera share the same RenderPipelineAsset in particular, rendering time and bandwidth are easily wasted, which adds unnecessary GPU overhead. It is generally recommended to use different RendererData for the UI camera and the scene camera.
In addition to particle effects, we often like to use eye-catching rendering effects to enrich the look of the game, such as volumetric fog, volumetric light, water, subsurface reflection, etc. However, the more such effects a scene uses, the more complex the Shaders become and the more pressure they put on the GPU, often far beyond what is acceptable. Optimization and trade-offs are the main means of deciding which effects to keep in the end.
Compare several schemes and choose the ones with the better balance of effect and performance, then streamline and optimize the open source scheme for the needs of your own project. Some well-optimized, practice-tested solutions can be found on UWA community blogs, academy, and open source repositories.
Bloom is probably the most popular and common post-processing effect among developers. A common problem is that Bloom by default starts down-sampling from 1/2 of the rendering resolution. On low-end models, consider starting from 1/4 resolution instead, or reducing the number of down-sampling passes.
The performance overhead and suitable usage scenarios differ from one post-processing effect to another, and the problems encountered in actual projects differ as well.
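To compare the options, the sketch below counts the pixels written across a down-sampling chain in which each pass halves the previous target. It is a simplified model with hypothetical names: it ignores the up-sampling passes and any per-pixel filter cost, and the 1920x1080 target with 5 passes is only an example.

```python
def bloom_downsample_pixels(width: int, height: int,
                            first_divisor: int, passes: int) -> int:
    """Total pixels written by the down-sampling chain.

    The first target is (width / first_divisor) x (height / first_divisor);
    every following pass halves both dimensions.
    """
    w = max(1, width // first_divisor)
    h = max(1, height // first_divisor)
    total = 0
    for _ in range(passes):
        total += w * h
        w = max(1, w // 2)
        h = max(1, h // 2)
    return total
```

For a 1920x1080 target with 5 passes, starting at 1/2 resolution writes 690,420 pixels and starting at 1/4 writes 172,500, roughly a quarter of the work. Cutting the chain from 5 passes to 3 saves far less (680,400 pixels remain), because the first pass dominates the cost.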
6.1 Drawing Order
When an opaque object farther from the camera is drawn first and an object closer to the camera is drawn after it, and the two overlap on screen, the pixels of the far object that end up hidden behind the near object are drawn twice, which results in Overdraw.
This often happens with terrain. Normally, when opaque objects share the same Render Queue, the engine automatically draws the objects closer to the camera first. But some parts of a terrain are often closer to the camera than other objects while other parts are farther away, so the terrain may end up being drawn first.
It is therefore necessary to set the Render Queue (or use similar means) so that objects closer to the camera, such as characters and props, are drawn first and farther objects, such as terrain, are drawn last. On mobile platforms, the Early-Z mechanism lets the hardware run the depth test before the fragment shader. Pixels of the farther object that are hidden behind nearer objects fail the depth test, which saves unnecessary fragment calculations.
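The effect of drawing order under an early depth test can be simulated on a single row of pixels. This is a toy model with hypothetical names and values: each draw is a horizontal span at a constant depth, and a fragment is shaded only if it is closer than what the depth buffer already holds.

```python
from typing import List, Tuple

Span = Tuple[int, int, float]  # (x0, x1 end-exclusive, depth; smaller is closer)


def shaded_fragments(draws: List[Span], width: int) -> int:
    """Count fragments that pass the depth test, in submission order."""
    depth_buffer = [float("inf")] * width
    shaded = 0
    for x0, x1, depth in draws:
        for x in range(max(0, x0), min(width, x1)):
            if depth < depth_buffer[x]:
                depth_buffer[x] = depth
                shaded += 1  # the fragment shader runs for this pixel
    return shaded


terrain = (0, 8, 10.0)   # far, covers the whole row
character = (2, 6, 3.0)  # near, covers the middle

far_first = shaded_fragments([terrain, character], width=8)   # 12 fragments
near_first = shaded_fragments([character, terrain], width=8)  # 8 fragments
```

Drawing the terrain first shades 12 fragments for an 8-pixel row, while drawing the character first shades only 8, because the terrain pixels behind the character are rejected before shading.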
6.2 Invalid Drawing
In some cases a piece of rendering contributes little visually and can be turned off, or replaced with a less expensive drawing scheme.
A fairly common example is a background DrawCall that covers a large part of the screen and costs a lot, yet toggling it in the engine makes no obvious visual difference. It may be an effect that was abandoned during production, or one that is completely covered by other DrawCalls. Such DrawCalls can be turned off.
Another case is a background drawn with models that carry additional rendering effects such as blur or fog, in a scene where the viewing angle is fixed and the background hardly changes. Consider replacing this complex rendering with a static image as the background, which leaves more performance for the main game logic and visual effects on low-end devices.
6.3 Rendering Area
The performance issues caused by a large rendering area have already been discussed for particle effects, but the same applies to opaque objects. When a DrawCall covers a large area and also uses many complex rendering resources, the two multiply: more pixels take part in texture sampling and Shader calculation, which puts higher pressure on the GPU.
In addition to textures, meshes, and Render Textures, there is another rendering resource that contributes greatly to GPU pressure: the Shader. UWA pays special attention to the screen ratio, instruction count, and clock cycle count of the Fragment Shader. The more pixels a Shader renders and the higher its complexity, the more it needs to be optimized.
The instruction count and clock cycle count of a Shader can be obtained with the Mali Offline Compiler tool. In UWA’s further services, the complexity of the variants of all frequently used Shaders in the project is tested in detail under different keyword combinations, in order to locate the Shader resources and variants that need to be optimized.
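The multiplication of screen coverage and Shader complexity can be used to rank candidates for optimization. The sketch below is an assumption about how one might combine the two numbers, not UWA's scoring; the `ShaderStat` fields and all values are made up, and the cycle counts would in practice come from a tool such as the Mali Offline Compiler.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ShaderStat:
    name: str
    screen_ratio: float      # fraction of screen pixels shaded (can exceed 1 with overdraw)
    cycles_per_pixel: float  # fragment shader cost per pixel


def rank_by_estimated_cost(stats: List[ShaderStat],
                           screen_pixels: int) -> List[Tuple[str, float]]:
    """Sort shaders by pixels shaded x cycles per pixel, most expensive first."""
    costs = [(s.name, s.screen_ratio * screen_pixels * s.cycles_per_pixel)
             for s in stats]
    return sorted(costs, key=lambda item: item[1], reverse=True)


example = [
    ShaderStat("terrain", 0.8, 10),
    ShaderStat("water", 0.3, 40),
    ShaderStat("ui_icon", 0.1, 5),
]
ranking = rank_by_estimated_cost(example, screen_pixels=1_000_000)
```

In this example the water Shader ranks first even though the terrain covers far more of the screen, because its per-pixel cost is four times higher.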
If you need any Unity engine-related game performance optimization services, please feel free to leave a comment or contact us at firstname.lastname@example.org.
YOU MAY ALSO LIKE UWA HERE!!!
UWA Website: https://en.uwa4d.com
UWA Blogs: https://blog.en.uwa4d.com
UWA Product: https://en.uwa4d.com/feature/got