
Four years ago, UWA released a series of special topics that explained the performance optimization knowledge points of Unity's mainstream modules one by one. As engines, hardware, and production standards have advanced in recent years, UWA has continued to update its optimization rules and methods and to share them with game developers. As an "upgraded version" of the performance optimization manual, the Unity Performance Optimization Series tries to present this material in a simple, easy-to-understand way so that more developers can learn from it. This issue covers the knowledge points related to the rendering module.

Rendering is an unavoidable topic when optimizing for mobile devices. It accounts for the bulk of the performance cost, because almost every game depends on rendering scenes, objects, and special effects. Striking the best balance between excellent visuals and smooth running has always been a headache for game designers, artists, and programmers alike.

Two Basic Parameters that Affect Rendering Efficiency: DrawCall and Triangle


  1. DrawCall

In the Overview mode of GOT Online, the DrawCall curve in the rendering module shows the specific DrawCall and Batch counts, as shown below:

At present, we recommend keeping the main body range (5%~95%) of Batch within [0, 250] on low-end and mid-range mobile devices.

In Unity, DrawCall and Batch need to be distinguished: one batch can contain multiple DrawCalls. In the FrameDebugger in the following figure, you can see that the two default ParticleSystems are combined into one batch, so this Dynamic Batch contains two DrawCalls.

There are usually four ways to reduce the Batch count: dynamic batching, static batching, GPU Instancing, and SRP Batcher.
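As a rough illustration of the recommendation above, the following Python sketch checks a list of per-frame Batch counts against the budget. It assumes that the "main body range (5%~95%)" means the 5th to 95th percentile of the per-frame values; the function names and the sample data are hypothetical and are not part of any UWA tool.

```python
import statistics

def main_body_range(batch_counts, low_pct=5, high_pct=95):
    """Return the (low, high) percentile values of per-frame batch counts.

    Assumption: "main body range" is read as the 5th..95th percentile.
    """
    # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cuts = statistics.quantiles(batch_counts, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]

def within_budget(batch_counts, budget=(0, 250)):
    """True if the main body range lies inside the recommended interval."""
    low, high = main_body_range(batch_counts)
    return budget[0] <= low and high <= budget[1]

# Hypothetical captures: one steady run, one with frequent spikes.
steady = list(range(100, 200))
spiky = [100] * 90 + [400] * 10
```

Using percentiles rather than the maximum means that a handful of outlier frames does not decide the result, while a sustained run of heavy frames does.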


  2. Triangle

Normally, the more triangles there are, the longer rendering takes. The UWA GOT report therefore shows triangle usage, split into translucent and opaque. It is generally recommended to reduce the number of triangles in the scene with the LOD tool, thereby reducing the rendering overhead.

Note that the triangle count here is not the number of triangles in the scene models of the current frame, but the number of triangles rendered in the current frame. The value depends not only on the number of triangles in the models, but also on how many times they are rendered. For example, if a mesh in the scene has 10,000 triangles and its Shader has 2 rendering passes, or 2 cameras render it at the same time, the Triangle value displayed here will be 20,000.
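The arithmetic in this example can be written out as a small Python sketch. The function name is hypothetical, and multiplying passes by cameras when both apply is an assumption that extends the two separate cases given in the text.

```python
def rendered_triangles(mesh_triangles, shader_passes=1, cameras=1):
    """Estimate the triangles rendered in one frame for a single mesh.

    Each shader pass and each camera that renders the mesh submits the
    full triangle list again (assumption: the two factors multiply).
    """
    return mesh_triangles * shader_passes * cameras

two_passes = rendered_triangles(10_000, shader_passes=2)   # 20000
two_cameras = rendered_triangles(10_000, cameras=2)        # 20000
```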

Camera.Render Function Stack Analysis

A very effective way to optimize the rendering module is to locate the specific performance bottleneck through the call stack of the Camera.Render function. These functions can be viewed in the [Code Efficiency] section of the GOT Online report. Here are a few functions we commonly encounter when optimizing:


  • RenderForward.RenderLoopJob

In the expanded stack of Camera.Render, a relatively high cost for RenderForward.RenderLoopJob is usually caused by a high number of batches.


  • High Culling Time

Generally speaking, a Culling share of 10% to 20% is reasonable. If Culling takes longer, you can start by checking the following aspects:

1) Culling time correlates strongly with the number of GameObjects in the scene. In this case, it is recommended that the R&D team optimize how the scene is built and check whether it contains too many small objects, which increase the Culling time. Consider methods such as dynamic loading, displaying the scene in blocks, Culling Group, or Culling Distance to reduce the Culling time.

2) If the project uses multi-threaded rendering and Occlusion Culling is turned on, this usually puts too much pressure on the child threads and results in a high overall Culling time.

Although Occlusion Culling reduces the rendering cost, its own overhead is also worth noting, because it has to calculate the occlusion relationships between the objects in the scene, and it is not necessarily suitable for every scene. In this case, it is recommended that the R&D team selectively disable part of the Occlusion Culling, compare the overall rendering cost, and then decide whether to enable this feature.


  • Render.Mesh

Render.Mesh corresponds to the rendering time of objects that cannot be batched, and its number of calls corresponds to the number of batches. In the figure below, Render.Mesh is called 269 times, indicating that 269 opaque objects in the scene are not batched, which is a large number.

A high Render.Mesh overhead is usually due to a large number of objects that cannot be batched. It can be optimized from the following points:

1) For opaque rendering queues, it is recommended to check for redundant Materials. Objects that use the same Shader still cannot be batched if they use different Material instances.

2) For semi-transparent rendering queues, it is necessary to distinguish between NGUI and non-NGUI cases. When NGUI is used, the Render.Mesh calls are likely caused by the UI's DrawCalls, so a high number of Render.Mesh calls indicates that the UI DrawCall count is likely high. Check whether the atlases are packed properly.

In the non-NGUI case, consider whether translucent objects with different Materials are interleaved in the draw order. You can adjust the RenderQueue so that objects with the same Material are drawn together and more of them are batched.
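The effect of interleaving on batching can be illustrated with a deliberately simplified Python model. It assumes that a batch is a run of consecutive draws sharing one Material; real Unity batching has further conditions, and the material names and queue values below are hypothetical.

```python
from itertools import groupby

def count_batches(draw_order):
    """Count runs of consecutive draws that use the same material.

    Simplified model: a change of material always starts a new batch.
    """
    return sum(1 for _ in groupby(draw_order))

# Translucent objects drawn in an interleaved order (hypothetical names).
interleaved = ["smoke", "glass", "smoke", "glass"]

# Hypothetical RenderQueue values assigned per material to group the draws.
render_queue = {"smoke": 3000, "glass": 3001}
reordered = sorted(interleaved, key=lambda material: render_queue[material])

before = count_batches(interleaved)  # every draw starts a new batch
after = count_batches(reordered)     # same-material draws are adjacent
```

In this model the interleaved order yields four batches and the reordered one yields two. Whether reordering is visually acceptable depends on how the translucent objects overlap, so any RenderQueue change should be checked in the actual scene.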


  • ParticleSystem.ScheduleGeometryJobs and ParticleSystem.Draw

1) ParticleSystem.ScheduleGeometryJobs means that, before culling, the main thread has to wait for the child threads to finish calculating the particle positions. The cost of this function is relatively high in the battle interface.

To optimize this function, it is recommended that the R&D team reduce the complexity of particle systems as much as possible on low-end devices, pre-cull them against the view volume, and deactivate particle systems outside the view volume, thereby reducing unnecessary Schedule overhead.

2) The number of calls of ParticleSystem.Draw corresponds to the number of DrawCalls of the particle system.

If the number of calls to this function is too high, it is recommended that the R&D team reduce the number of particle systems. You can refer to the list in the UWA Real Device Testing report [Memory Management-Specific Resource Information-Particle System] for further analysis and optimization.

In addition, you can reduce DrawCalls by using TextureSheetAnimation, or by modifying the Order in Layer to reduce the interleaving of particle rendering and increase the probability of batching.


  • Shader.CreateGPUProgram

The CPU time of this API is the time consumed when a Shader is rendered for the first time, and it is related to the complexity of the Shader.

From the figure below, we can see that in a certain frame, the time consumed by Shader.CreateGPUProgram reached 203.87ms, which caused the game to freeze.

To address this, you can preload the Shaders through a ShaderVariantCollection, trigger Shader.CreateGPUProgram after loading by calling ShaderVariantCollection.WarmUp, and cache this SVC. This avoids triggering the API call while the game is running and thus avoids local CPU spikes.

Turning On Multi-Threaded Rendering

After enabling multi-threaded rendering, the rendering time of the main thread will be significantly reduced. It is recommended that the R&D team enable it.

However, note that the CPU time in our online report only counts the main thread. If the submitted build has multi-threaded rendering enabled, the report shows only the main thread's time, which makes it hard to analyze rendering bottlenecks. We therefore usually recommend submitting two builds during internal testing: one with multi-threaded rendering enabled, as a reference for the rendering time of the release version, and one with multi-threaded rendering disabled, for detailed analysis of rendering bottlenecks.

GPU Instancing

With GPU Instancing, multiple copies of the same Mesh can be rendered at once, and each instance can have different parameters (for example, Color or Scale) to add variation. When rendering things that repeat in the scene, such as buildings, trees, and grass, GPU Instancing can effectively reduce the number of DrawCalls in each scene and significantly improve rendering performance.

However, the following points should be noted when using GPU Instancing:

  • Compatible platform and API
  • The rendered instances use the same Mesh and the same Material
  • Shader supports GPU Instancing
  • Does not support SkinnedMeshRenderer

In some special cases, GPU Instancing rendering of a large number of semi-transparent objects may also be time-consuming.
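The Mesh, Material, and SkinnedMeshRenderer conditions listed above can be sketched as a simplified Python model. It assumes the platform and Shader already support instancing and ignores any per-batch instance limit; the renderer tuples and names are hypothetical.

```python
def instanced_draw_calls(renderers):
    """Estimate draw calls when instancing merges identical mesh/material pairs.

    renderers: iterable of (mesh, material, is_skinned) tuples.
    Simplified model: skinned renderers are never instanced, and every
    other (mesh, material) group collapses into a single draw call.
    """
    groups = set()
    skinned_calls = 0
    for mesh, material, is_skinned in renderers:
        if is_skinned:
            skinned_calls += 1          # drawn individually
        else:
            groups.add((mesh, material))
    return skinned_calls + len(groups)

# Hypothetical scene: 500 trees, 200 rocks, 3 skinned characters.
scene = (
    [("tree", "bark", False)] * 500
    + [("rock", "stone", False)] * 200
    + [("hero", "skin", True)] * 3
)
calls = instanced_draw_calls(scene)
```

In this model the 700 static objects collapse into two draw calls, while the three skinned characters are still drawn one by one, giving five in total.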

SRP Batcher

More and more teams are adopting URP as their rendering pipeline, where SRP Batcher greatly widens the range of objects that can be batched and improves rendering efficiency. When using URP, the rendering function stack looks as follows:

When using SRP Batcher, you will need to pay attention to the following points:

  • Shader needs to be compatible with SRP
  • SRP Batcher does not currently support particle systems
  • Shader variants will interrupt the batching of DrawCall

The above are some of the issues that deserve attention when optimizing the rendering module. How to apply them depends on the actual situation of the project. Using the UWA service can also help you quickly locate performance bottlenecks.

