The “Unity Mobile Game Performance Optimization Series” starts from basic discussions of Unity mobile game optimization, analyzes some of the most common performance problems seen in recent Unity-based mobile game projects, and shows how UWA’s performance detection tools can identify and address these issues. The full series covers the basic logic of performance optimization, UWA’s performance detection tools, and common performance problems, hoping to provide Unity developers with more efficient R&D methods and practical experience.
Today, we introduce the third chapter: CPU optimization broken down by engine module, with nine sections covering rendering, UI, physics, animation, particle systems, loading, CPU stack, Lua, and more.
The first part of this series (an introduction to the optimization guide) and the second part (on controlling common game memory) can be reviewed in the earlier articles, and the full content can be viewed at UWA Academy.
1.1 Module Division
UWA divides the CPU work with clearly defined content and a generally high share of frame time into rendering, UI, physics, animation, particle, loading, logic, and other modules. This does not mean these modules work independently of each other: for example, the performance pressure of the rendering module is inevitably affected by complex UI and particles, and many operations of the loading module are actually invoked and completed on the logic side.
Dividing by module helps us locate problems and identify key issues. At the same time, it is still necessary to build connections between modules, which helps solve problems more efficiently.
1.2 CPU Time Bottleneck
When a project has a low frame rate and obvious stutters due to a CPU performance bottleneck, finding the main source of that bottleneck becomes the key task. Although we have sorted the engine into its main modules, the problems arising in each module are still hard to tell apart, and their contributions to CPU pressure are not equal. We therefore need an accurate notion of what kind of cost counts as a potential performance bottleneck.
In mobile projects, our CPU optimization goal is a smooth 30 fps most of the time on low-end and mid-range devices. A simple division (1000 ms / 30) tells us the average CPU time per frame should be kept below 33 ms. Of course, this does not mean a project whose CPU average is already below 33 ms has its CPU time well under control. Performance pressure varies while the game runs: there may be very little pressure in a series of UI screens while the frame rate in the game's most important battle scenes is very low, or there may be many stutters of hundreds of milliseconds or even several seconds, and the final average can still come in under 33 ms.
For this reason, UWA's criterion is: in a test, when frames taking 33 ms or more account for less than 10% of the total frame count, the project's overall CPU performance can be considered within the normal range. The higher the proportion, the more serious the current project's CPU performance bottleneck.
The discussion above mainly revolves around the macro goal of CPU performance optimization. As with memory, we still need to combine the concrete data of specific modules to troubleshoot and solve the actual problems in a project.
2. Rendering Module
For more comprehensive content related to the optimization of the rendering module, please refer to “Unity Performance Optimization Series – Rendering Module“.
2.1 Multi-threaded Rendering
In the single-threaded rendering flow, during each frame the main thread (CPU1) first executes Update, where it performs a large amount of logic: game AI, collision detection, animation updates, and so on. It then executes Render, where rendering-related calls are made. While rendering, the main thread must call the graphics API to update render state (setting shaders, textures, matrices, alpha blending, etc.) and then issue DrawCalls. All of these graphics API calls interact with the driver layer, which maintains all render state; a call may trigger a state change in the driver and cause a stall. Because the driver layer's state is opaque to the upper-level caller, whether a stall will occur, and for how long, is unknown to the API caller (CPU1). Meanwhile, other CPU cores may sit idle, which is wasteful. The rendering work can therefore be extracted onto another core as a dedicated rendering thread that runs in parallel with the logic thread, reducing main-thread stalls.
The general implementation is that graphics API calls made on the main thread are encapsulated as commands and submitted to a rendering queue, which saves the main thread the cost of the actual API calls and thereby improves the frame rate. The rendering thread then pulls rendering commands from the queue and executes the graphics API calls that interact with the driver layer, moving that cost from the main thread to the rendering thread.
Unity exposes Multithreaded Rendering in Project Settings and enables it by default, and it is generally recommended to keep it enabled. Even so, UWA's large body of test data still shows some projects with multithreaded rendering turned off. When multithreaded rendering is on, the time the CPU spends waiting for the GPU to finish its work is counted in Gfx.WaitForPresent; when it is off, that time is counted in Graphics.PresentAndSync. Whether Gfx.WaitForPresent shows up in a project's profile can therefore be used to judge whether multithreaded rendering is enabled. In particular, during development and testing you can consider temporarily turning multithreaded rendering off in test builds, which reflects the rendering module's performance bottlenecks more directly.
For projects with multithreaded rendering enabled normally, the time-consumption trend of Gfx.WaitForPresent is still quite informative: the greater the local GPU pressure during a test, the longer the CPU waits for the GPU to finish, and the longer Gfx.WaitForPresent takes. So when Gfx.WaitForPresent lasts tens or even hundreds of milliseconds, the GPU pressure of the corresponding scene is relatively high.
In addition, in UWA's experience across many projects and tests, excessive GPU pressure also increases the overall CPU-side cost of the rendering module's main functions (Camera.Render and RenderPipelineManager.DoRenderLoop_Internal). We will discuss optimizations specific to the GPU at the end.
2.2 The number of Triangles rendered on the same screen
The two most basic parameters that affect rendering efficiency are undoubtedly Triangle and DrawCall.
Usually the number of Triangles is proportional to GPU rendering time, and in most projects the number of opaque Triangles is far larger than the number of translucent ones, so it deserves special attention. UWA generally recommends keeping the number of Triangles rendered on screen within 250,000 on low-end devices, and not exceeding 600,000 even on high-end devices. When the tool shows that the number of triangles rendered on screen is locally too high, you can use the Frame Debugger to inspect what is being rendered in the key frames.
A common optimization is to strictly control mesh face counts during production, especially for character and terrain models whose meshes reach tens of thousands of faces or more. Another good approach is to reduce scene face counts with LOD tools, such as using low-poly models on low-end devices and hiding relatively unimportant small objects in the scene, thereby reducing rendering overhead.
It should be pointed out that the Triangle count tracked by the UWA tool is not the triangle count of the scene's models in the current frame, but the number of Triangles actually rendered in that frame. Its value depends not only on the models' triangle counts but also on how many times they are rendered, which more directly reflects the rendering pressure caused by the on-screen triangle count. For example, if a mesh in the scene has 10,000 Triangles and its Shader has 2 rendering passes, or 2 cameras render it at the same time, or a post-processing effect such as SSAO or Reflection is applied, the Triangle value shown here will be 20,000. On low-end devices you should therefore be strictly wary of these operations, since they multiply the number of triangles rendered on screen.
2.3 Batch (DrawCall)
In Unity we need to distinguish DrawCalls from Batches: one Batch may contain multiple DrawCalls. When that happens, we tend to care more about the number of Batches, because a Batch is the unit in which rendering data is submitted to the GPU, and it is the real quantity we need to optimize and control.
There are usually four ways to reduce batches: dynamic batching, static batching, SRP Batcher, and GPU Instancing. The following briefly summarizes the batching conditions, advantages, and disadvantages of static batching, SRP Batcher, and GPU Instancing.
(1) Static Batching
Condition: Meshes may differ, as long as they use the same material.
Advantages: vertex information is merged at bind time; geometry transfer to the GPU is saved; when adjacent objects share the same material, material switches are saved as well.
Disadvantages: when merging offline, duplicated resources in the combined Mesh easily enlarge the package; when merging at runtime, generating the combined Mesh causes a short CPU spike; duplicated resources also increase the memory footprint after merging.
(2) SRP Batcher
Condition: Meshes may differ, as long as they use the same Shader and the same variant.
Advantages: saves Uniform Buffer write operations; batches are divided by Shader and the Uniform Buffer is generated in advance, so there are no CPU writes inside a Batch.
Disadvantages: Constant Buffer (CBuffer) memory fixed overhead; MaterialPropertyBlock is not supported.
(3) GPU Instancing
Condition: the same Mesh and the same material are used.
Advantages: well suited to rendering large numbers of identical monsters, and while batching it can also reduce the animation module's cost.
Disadvantages: there can be negative optimization that instead increases DrawCalls; instancing is sometimes interrupted by scene ordering, in which case you can render in explicit groups through the scripting API.
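As a minimal sketch of the "render in groups through the API" idea, `Graphics.DrawMeshInstanced` can draw one group of identical objects in a single instanced call regardless of how they are interleaved in the scene. The class, field names, and the assumption that all instances share one mesh and one instancing-enabled material are illustrative, not from the original article:

```csharp
using UnityEngine;

// Hypothetical sketch: manually grouping instanced draws so that interleaved
// objects cannot break instancing. Mesh/material/field names are assumptions.
public class InstancedCrowd : MonoBehaviour
{
    public Mesh monsterMesh;          // instancing requires identical meshes
    public Material monsterMaterial;  // material with "Enable GPU Instancing" checked
    public Transform[] monsters;      // the instances to draw

    private readonly Matrix4x4[] _matrices = new Matrix4x4[1023]; // API limit per call

    void Update()
    {
        int count = Mathf.Min(monsters.Length, _matrices.Length);
        for (int i = 0; i < count; i++)
            _matrices[i] = monsters[i].localToWorldMatrix;

        // One explicit instanced draw for the whole group,
        // independent of the objects' ordering in the scene.
        Graphics.DrawMeshInstanced(monsterMesh, 0, monsterMaterial, _matrices, count);
    }
}
```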
2.4 Shader.CreateGPUProgram

This API often appears in the stack of the rendering module's main function and causes most of the module's cost spikes. It is the cost incurred when a Shader is rendered for the first time, and its duration is related to the Shader's rendering complexity. We should pay attention when it is called during gameplay and causes high time-consumption spikes.
To address this, we can collect the variants a Shader will use into a ShaderVariantCollection and package it into an AssetBundle. After loading the ShaderVariantCollection resource into memory, trigger Shader.CreateGPUProgram by calling ShaderVariantCollection.WarmUp in an early game scene, and cache the SVC, so that this API is not triggered again while the game is running and localized high CPU cost is avoided.
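The warm-up flow just described could be sketched as follows. The AssetBundle file name (`"shaders"`) and the asset name (`"shaders.svc"`) are placeholder assumptions; your project's paths will differ:

```csharp
using System.Collections;
using UnityEngine;

// Sketch of the SVC warm-up flow: load the collection from an AssetBundle,
// warm it up during loading, and keep a reference so it stays cached.
public class ShaderWarmup : MonoBehaviour
{
    private ShaderVariantCollection _svc; // cached so it is not unloaded and re-warmed

    IEnumerator Start()
    {
        var request = AssetBundle.LoadFromFileAsync(
            System.IO.Path.Combine(Application.streamingAssetsPath, "shaders"));
        yield return request;

        _svc = request.assetBundle.LoadAsset<ShaderVariantCollection>("shaders.svc");
        if (_svc != null && !_svc.isWarmedUp)
            _svc.WarmUp(); // triggers Shader.CreateGPUProgram here, during loading,
                           // instead of mid-gameplay
    }
}
```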
However, even projects that have done all of the above often show occasional runtime spikes from this API, indicating some variants have slipped through the net. Developers can use the Profiler's Timeline mode, select the frame that triggers Shader.CreateGPUProgram, and check which Shader triggered it. You can refer to “An idea of Shader variant collection and packaging compilation optimization“.
2.5 Culling

In most cases, Culling's own cost is modest and inconspicuous; its significance lies in reflecting other rendering-related problems.
(1) The number of cameras is large
When Culling accounts for a relatively high proportion of the rendering module's main-function time (around 10%-20% in typical projects), check whether too many cameras are enabled, since each active camera runs its own culling pass.
(2) There are many small objects in the scene
Culling time is closely related to the number of GameObjects in the scene. In this case, the R&D team is advised to optimize how the scene is built and check whether there are too many small objects, which increases Culling cost. You can consider dynamic loading, chunked display, or mechanisms such as CullingGroup and Culling Distances to reduce Culling time.
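A minimal CullingGroup sketch for the small-object case: props stop rendering once they are farther than a chosen distance from the camera. The 20 m threshold, the fixed 1 m sphere radius, and the renderer list are all illustrative assumptions:

```csharp
using UnityEngine;

// Sketch: use a CullingGroup to disable small scene props by distance,
// so they neither render nor contribute to per-camera culling work.
public class SmallObjectCulling : MonoBehaviour
{
    public Renderer[] smallObjects;
    private CullingGroup _group;
    private BoundingSphere[] _spheres;

    void Start()
    {
        _spheres = new BoundingSphere[smallObjects.Length];
        for (int i = 0; i < smallObjects.Length; i++)
            _spheres[i] = new BoundingSphere(smallObjects[i].bounds.center, 1f);

        _group = new CullingGroup { targetCamera = Camera.main };
        _group.SetBoundingSpheres(_spheres);
        _group.SetBoundingSphereCount(_spheres.Length);
        _group.SetBoundingDistances(new float[] { 20f }); // one distance band
        _group.SetDistanceReferencePoint(Camera.main.transform);
        _group.onStateChanged = e =>
            smallObjects[e.index].enabled = (e.currentDistance == 0); // band 0 = near
    }

    void OnDestroy() => _group.Dispose();
}
```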
(3) Occlusion Culling
If the project uses multithreaded rendering and Occlusion Culling is enabled, it usually puts too much pressure on the sub-thread and makes the overall Culling cost too high.
Since Occlusion Culling must compute occlusion relationships from the objects in the scene, its own overhead is worth noting even though it reduces rendering cost, and it is not necessarily suitable for every scene. Developers are advised to selectively disable Occlusion Culling in parts of the project, compare the overall rendering cost in tests, and then decide whether to enable the feature.
(4) Bounding box update
The FinalizeUpdateRendererBoundingVolumes that sometimes appears in Culling's stack is the cost of bounding-box updates, commonly seen with Skinned Meshes and particle systems. If this API appears frequently, use the report's screenshots to check whether large numbers of Skinned Mesh updates or fairly complex particle-system updates are happening at that time.
2.6 PostProcessLayer.OnPreCull

The method PostProcessLayer.OnPreCull is related to the PostProcessing Stack used in the project. You can add a static variable GlobalNeedUpdateSettings to PostProcessManager.cs and run UpdateSettings only when PostProcessManager.GlobalNeedUpdateSettings has been set to true during a scene switch. This avoids doing the UpdateSettings work every frame and saves part of the cost.
2.7 WaterReflection.OnWillRenderObject

WaterReflection.OnWillRenderObject is related to the project's water-surface reflection effect. If its CPU time is high, check whether the implementation has room for optimization, such as removing reflection rendering for unnecessary particles, small objects, and the like.
3. UI Module
In the Unity engine, the mainstream UI frameworks are UGUI, NGUI, and the increasingly popular FairyGUI. This article mainly uses UGUI as its example. For more comprehensive content on UGUI-related optimization, please refer to “Unity Performance Optimization – UI Module“.
3.1 UGUI EventSystem.Update
The EventSystem.Update function is the cost of UGUI's event system. When it is high, two factors are the main concern:
(1) It takes a long time to trigger calls
As the main function of the UGUI event system, this function is chiefly triggered when a touch is released. When it shows high CPU overhead, the cause is usually other, more expensive functions called from it. You then need to narrow down the triggered logic, by adding Profiler.BeginSample/EndSample instrumentation or by using the GOT Online service and the UWA API, to find which sub-function or code segment caused the high cost.
(2) Polling costs a long time
All UGUI components are created with the Raycast Target option enabled by default, which means they stand ready to receive event responses. In practice, most UI components such as Image and Text never participate in event handling, yet they still take part in polling whenever the mouse or finger moves or hovers: simulated raycasts decide whether each component was touched, causing unnecessary cost. Especially in projects with many UI components, turning off Raycast Target on components that do not respond to events can effectively reduce EventSystem.Update() time.
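A sweep like the following could automate that cleanup. The "needs events" test used here (a `Selectable` on the same object) is a simplifying assumption; adapt it to whatever event components your project actually uses:

```csharp
using UnityEngine;
using UnityEngine.UI;

// Sketch of a helper that walks a UI hierarchy and turns off Raycast Target
// on every Graphic that does not appear to need event handling.
public static class RaycastTargetCleaner
{
    public static int DisableUnneeded(GameObject root)
    {
        int disabled = 0;
        foreach (var graphic in root.GetComponentsInChildren<Graphic>(true))
        {
            // Assumption: only objects with a Selectable (Button, Toggle, ...)
            // respond to events. Project-specific handlers need their own check.
            bool handlesEvents = graphic.GetComponent<Selectable>() != null;
            if (!handlesEvents && graphic.raycastTarget)
            {
                graphic.raycastTarget = false; // skipped by EventSystem polling
                disabled++;
            }
        }
        return disabled;
    }
}
```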
3.2 UGUI Canvas.SendWillRenderCanvases
The cost of Canvas.SendWillRenderCanvases represents the update cost caused by changes to UI elements themselves, which must be distinguished from the mesh-rebuild cost of Canvas.BuildBatch (as shown below).
Sustained high cost here is usually caused by UI elements that are too complex and updated too frequently. Self-updates of UI elements include replacing images, changing text or colors, and so on; displacement, rotation, or scaling of UI elements incurs no cost in this function. Its cost depends on the number and complexity of the UI elements being updated, so optimization usually starts from the following points:
(1) Reduce the frequency of frequently updated UI elements
For example, for monster markers on the minimap or the health bars of characters and monsters, we can add logic so the UI display only updates when the change exceeds a certain threshold. Similarly, skill cooldown effects, floating damage numbers, and the like should be kept from updating every single frame.
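The threshold idea could be sketched like this for a health bar; the 1% threshold and field names are assumptions to be tuned per project:

```csharp
using UnityEngine;
using UnityEngine.UI;

// Sketch of threshold-based UI updates: the fill image only changes when the
// health ratio has moved by more than 1% since the last refresh, so small
// ticks of damage do not dirty the Canvas every frame.
public class HealthBar : MonoBehaviour
{
    public Image fillImage;
    private float _displayedRatio = -1f;
    private const float Threshold = 0.01f; // 1%, an assumed tuning value

    public void SetHealth(float current, float max)
    {
        float ratio = Mathf.Clamp01(current / max);
        if (Mathf.Abs(ratio - _displayedRatio) < Threshold)
            return; // skip the rebuild this change would otherwise trigger

        _displayedRatio = ratio;
        fillImage.fillAmount = ratio; // dirties the element only when needed
    }
}
```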
(2) Try to keep the complex UI from changing
For example, Text elements with very long strings that use Rich Text, Outline, or Shadow effects, and Images whose Image Type is Tiled: because their vertex counts are large, these UI elements are expensive to update whenever they change. If an effect needs Outline or Shadow but changes frequently, such as floating damage numbers, consider making it a fixed art font so that the vertex count is not multiplied several times over.
(3) Pay attention to Font.CacheFontForText
This function tends to cause cost spikes. It is mainly the overhead of generating the dynamic font's Font Texture, which is very expensive at runtime; spikes usually mean many new characters were written at once, forcing the Font Texture to expand. You can reduce this cost by using fewer font types, reducing font sizes, and displaying common characters in advance so the dynamic Font Texture is expanded early.
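The "display common characters in advance" step can also be done without showing anything, via `Font.RequestCharactersInTexture`. The character list and size below are assumptions; use whatever glyphs your combat UI actually shows:

```csharp
using UnityEngine;

// Sketch: pre-request the characters a scene commonly shows so the dynamic
// Font Texture grows during loading rather than mid-gameplay.
public class FontPrewarm : MonoBehaviour
{
    public Font dynamicFont;

    void Start()
    {
        const string commonChars = "0123456789+-x!HPMP"; // assumed glyph set
        // Bakes these glyphs into the Font Texture now, at the given size/style,
        // making later Font.CacheFontForText spikes less likely.
        dynamicFont.RequestCharactersInTexture(commonChars, 32, FontStyle.Normal);
    }
}
```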
3.3 UGUI Canvas.BuildBatch
Canvas.BuildBatch initiates UI mesh merging on the main thread; the actual merging is handled on a child thread. When the child thread is under too much pressure, or the merged UI mesh is too complex, the main thread has to wait, and the waiting time is counted in EmitWorldScreenspaceCameraGeometry.
High cost in these two functions indicates that the Canvas being rebuilt is very complex. The Canvas then needs to be subdivided: put static elements on one Canvas and frequently updated elements on another, so that the static Canvas's cached mesh is never rebuilt. This reduces mesh-update complexity and mesh-rebuild cost.
3.4 UGUI CanvasRenderer.SyncTransform
We often find CanvasRenderer.SyncTransform called very frequently in some frames of some projects; in the figure below it is called as many as 1017 times. When CanvasRenderer.SyncTransform is triggered this often, it drives up the cost of its parent node UGUI.Rendering.UpdateBatches.
In Unity 2018 and later, calling SetActive (false to true) on one UI element under a Canvas causes the other UI elements under that Canvas to trigger SyncTransform as well, increasing the overall cost of UI updates. In Unity 2017, only the UI element itself triggers a SyncTransform.
Therefore, for a Canvas with many UI elements (such as Image and Text), watch for elements that are frequently SetActive'd. In such cases it is recommended to set the scale to 0 or 1 instead of calling SetActive(false/true). Alternatively, split the Canvas appropriately so that the elements needing SetActive(true) live under a different Canvas from the others; then SyncTransform will not be called so frequently.
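The scale trick could be wrapped in a tiny helper like this (a sketch; note that a zero-scaled element still exists in the hierarchy, it just contributes no visible geometry):

```csharp
using UnityEngine;

// Sketch of "scale instead of SetActive": hiding by zero scale avoids the
// SyncTransform storm that SetActive(true) causes on a crowded Canvas.
public static class UIVisibility
{
    public static void SetVisible(RectTransform element, bool visible)
    {
        // Vector3.zero hides the element; Vector3.one restores it.
        element.localScale = visible ? Vector3.one : Vector3.zero;
    }
}
```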
3.5 UGUI UI DrawCall
Usually, the modules other than UI are already expensive and under pressure in battle scenes, so the UI module's cost needs particularly careful control there. Generally speaking, it is best to keep UI DrawCalls in battle scenes to around 40-50.
Without reducing UI elements, controlling DrawCalls is really the problem of batching UI elements as much as possible. Batching normally requires identical materials, but in UI it often happens that elements using the same material and the same atlas still cannot be batched together. This comes down to how UGUI calculates DrawCalls; for a detailed introduction to the principle, see the UWA Academy course “Detailed UGUI DrawCall Calculation and Rebuild Operation Optimization”.
In the production process of UGUI, it is recommended to pay attention to the following points:
(1) Only UI elements under the same Canvas can be batched together; even different Canvases with the same Order in Layer cannot be batched with each other, so planning and producing the UI sensibly is very important;
(2) Try to consolidate atlases so that different UI elements share the same material and atlas. Small UI elements such as buttons and icons that use images can be merged into one atlas; when they appear densely at the same time, DrawCalls are effectively reduced;
(3) Given the same Canvas, material, and atlas, avoid interleaved hierarchy levels. Briefly: UI elements that satisfy the batching conditions should sit at the same “hierarchy depth”;
(4) Set the Pos Z of UI elements to 0 wherever possible. Elements whose Z value is not 0 can only attempt to batch with adjacent elements in the Hierarchy, so they easily interrupt batching;
(5) For an Image with an Alpha of 0, check the Cull Transparent Mesh option on its CanvasRenderer component; otherwise it still generates a DrawCall and easily interrupts batching.
4. Physical Module
For a more comprehensive analysis related to the optimization of the physics module, please refer to “Unity Performance Optimization – Physics Module“.
4.1 Auto Simulation
Since Unity 2017.4, the physics setting Auto Simulation has been enabled by default, meaning physics simulation always runs while the project is running. In some cases this time is simply wasted.
One criterion for judging whether physics-simulation time is wasted is the number of Contacts, i.e. the number of collision pairs while the game runs. In general, the more collision pairs, the more CPU the physics system consumes. Yet in many mobile projects we have detected that the Contacts count stays at 0 for the entire game.
In such cases, the developer can turn off automatic physics simulation and test. If turning off Auto Simulation has no effect on game logic, and dialogue, battles, and so on still behave correctly while running the game, the time can be saved. Note also that if the project needs raycasts, Auto Sync Transforms must be turned on after Auto Simulation is turned off, so that raycasts continue to work correctly.
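Both switches are also available from script, so the test described above can be toggled without editing Project Settings. A minimal sketch (note that newer Unity versions expose `Physics.simulationMode` as the non-deprecated equivalent of `autoSimulation`):

```csharp
using UnityEngine;

// Sketch of the experiment described above: skip the per-frame physics step,
// but keep transforms synced so Physics.Raycast still sees correct colliders.
public class PhysicsSettingsProbe : MonoBehaviour
{
    void Awake()
    {
        Physics.autoSimulation = false;    // equivalent of unticking Auto Simulation
        Physics.autoSyncTransforms = true; // required for raycasts to stay correct
    }
}
```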
4.2 Number of Physical Updates
The main cost of Unity's physics simulation sits in FixedUpdate: the more times this function is called per frame, the more physics updates occur and the higher the per-frame cost.
The number of physics updates, i.e. the number of FixedUpdate calls per frame, is governed by the minimum update interval (Fixed Timestep) and the maximum allowed time (Maximum Allowed Timestep) in Project Settings > Time. A characteristic of the physics system is that when the previous frame ran long, Unity calls FixedUpdate.PhysicsFixedUpdate N times in a row at the very start of the current frame. Maximum Allowed Timestep exists to limit this: it determines the maximum number of physics calls in a single frame, and the smaller its value, the lower that maximum. Suppose the two values are set to 20 ms and 100 ms respectively: when a frame takes 30 ms, the physics update runs only once; when it takes 200 ms, it runs only 5 times (100 / 20).
Therefore, one effective method is to adjust these two parameters, especially to cap the number of updates (the default cap is about 17; keeping it under 5 is best), so the physics module's cost cannot balloon. Another is to optimize the CPU cost of the other modules first: when very few frames run long during play, FixedUpdate rarely hits its per-frame cap. The same applies to any other logic in FixedUpdate, which is why we generally recommend against writing much game logic there.
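The two Time settings map directly onto `Time.fixedDeltaTime` and `Time.maximumDeltaTime`, so the 20 ms / 100 ms example from the text (at most 100 / 20 = 5 FixedUpdate calls per frame) can be set from code as a sketch:

```csharp
using UnityEngine;

// Sketch: cap physics updates per frame via the two Time settings.
public class PhysicsStepConfig : MonoBehaviour
{
    void Awake()
    {
        Time.fixedDeltaTime = 0.02f;  // Fixed Timestep: one physics step per 20 ms
        Time.maximumDeltaTime = 0.1f; // Maximum Allowed Timestep: 100 ms cap,
                                      // i.e. at most 5 physics steps in one frame
    }
}
```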
4.3 Number of Contacts

As mentioned above, if we do use physics simulation, then in general the more collision pairs there are, the more CPU time the physics system takes. Strictly controlling the number of collision pairs is therefore very important for reducing the physics module's cost.
First, many projects contain unnecessary Rigidbody components, causing collisions the developer is unaware of and wasting time. In addition, you can review and modify the Layer Collision Matrix in Project Settings > Physics, cancel unnecessary collision detection between layers, and reduce the Contacts count as much as possible.
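The Layer Collision Matrix can also be adjusted at runtime through `Physics.IgnoreLayerCollision`. The layer names below are assumptions for illustration:

```csharp
using UnityEngine;

// Sketch: disable collision checks between two layers to cut the Contacts count.
public class CollisionMatrixSetup : MonoBehaviour
{
    void Awake()
    {
        int projectiles = LayerMask.NameToLayer("Projectiles"); // assumed layer name
        int decoration  = LayerMask.NameToLayer("Decoration");  // assumed layer name
        if (projectiles >= 0 && decoration >= 0)
            Physics.IgnoreLayerCollision(projectiles, decoration, true);
    }
}
```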
5. Animation Module
For more comprehensive content related to the optimization of the animation module, please refer to “Unity Performance Optimization – Animation Module“.
5.1 Mecanim animation system
The Mecanim animation system is the newer animation system introduced in Unity 4.0 (controlling animation with an Animator). Compared with the Legacy Animation system, Mecanim has the following functional advantages:
(1) A special workflow is provided for humanoid characters, including the creation of Avatar and the adjustment of Muscles muscles;
(2) Animation retargeting: an animation made for one character model can easily be applied to other character models;
(3) A visual Animator editor is provided, which can quickly preview and create animation clips;
(4) It is more convenient to create state machines and Transition transitions between states;
(5) Mixed tree function for easy operation.
In terms of performance, for skeletal animations with many curves, Animator performs better than Animation because Animator supports multithreaded evaluation and can be further optimized by enabling Optimize GameObjects; for details, see the UWA Academy course “Performance Optimization of Animation System in Unity Mobile Games”. Conversely, for relatively simple animations such as movement and rotation, Animation is more efficient than Animator.
5.2 SkinnedMeshRenderer.BakeMesh

For objects with low face counts and short animations, such as one- or two-thousand-face minions in MOBA or SLG games, consider the SkinnedMeshRenderer.BakeMesh approach, which trades memory for CPU time. The principle is to bake the pose of a skinned animation at a given time into a non-skinned Mesh, so that with a chosen sampling interval the animation becomes a sequence of Mesh frames. During playback, you simply assign the nearest sampled Mesh, saving the bone update and skinning calculation (almost no animation cost, just an assignment). The approach suits characters with small face counts precisely because it saves the skinning work, and the benefit is very obvious when many copies of the same animated model appear in a scene. Its drawback is that memory use is heavily constrained by the model's vertex count, the animation's total duration, and the sampling interval, so it only suits models with few vertices and short animations. Baking also takes a long time and should be completed while loading the scene.
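The bake step described above might be sketched as follows, using a Legacy Animation component to pose the skeleton at each sample time. The 0.1 s sampling interval and field names are assumptions; real projects would also pool or serialize the baked meshes:

```csharp
using System.Collections.Generic;
using UnityEngine;

// Sketch: sample a skinned animation into a list of static Meshes at a fixed
// interval. Playback then just assigns the nearest baked Mesh to a MeshFilter.
public class AnimationBaker : MonoBehaviour
{
    public SkinnedMeshRenderer skin;
    public Animation legacyAnim;        // component holding the clip to bake
    public float sampleInterval = 0.1f; // memory grows as this shrinks

    public List<Mesh> Bake(string clipName)
    {
        var frames = new List<Mesh>();
        AnimationState state = legacyAnim[clipName];
        state.enabled = true;
        state.weight = 1f;
        for (float t = 0f; t < state.length; t += sampleInterval)
        {
            state.time = t;
            legacyAnim.Sample();        // pose the skeleton at time t
            var mesh = new Mesh();
            skin.BakeMesh(mesh);        // snapshot the skinned pose, no skinning at runtime
            frames.Add(mesh);
        }
        return frames;                  // assign frames[i] to a MeshFilter during playback
    }
}
```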
5.3 Number of Active Animators
The number of Animators in the Active state greatly affects the animation module's cost and is an important quantifiable criterion; keeping it at a reasonable value is a key way to optimize the animation module. Developers should check against what is on screen whether the current count is reasonable.
(1) Animator Culling Mode
One way to control the number of Active Animators is to set a sensible Animator.cullingMode for each animated object. The setting has three options: AlwaysAnimate, CullUpdateTransforms, and CullComplete.
The default, AlwaysAnimate, keeps everything in the Animator updating whether or not the object is in the viewport, and even when it is culled by LOD within the viewport. Note that UI animations must use AlwaysAnimate, or they will behave abnormally.
With CullUpdateTransforms, when the object is outside the viewport or culled by LOD, the logic still updates: the state machine advances and the transition conditions in the animation asset are still evaluated. But display-layer work, such as retargeting, IK, and the Transforms written back from C++, is skipped. So setting animation components to CullUpdateTransforms wherever the effect is acceptable saves the animation module's display-layer cost while objects are invisible.
Finally, CullComplete performs no updates at all. It suits relatively unimportant animated effects in the scene; on low-end devices, objects that must be shown but can be treated as static can be switched to it in stages.
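Choosing per-Animator modes is a one-line assignment each; which objects count as "decorative" is, of course, a project-specific judgment. A sketch:

```csharp
using UnityEngine;

// Sketch: assign culling modes per Animator as described above.
public class AnimatorCullingSetup : MonoBehaviour
{
    public Animator uiAnimator;          // UI animation: must always animate
    public Animator backgroundAnimator;  // assumed decorative scene object

    void Start()
    {
        uiAnimator.cullingMode = AnimatorCullingMode.AlwaysAnimate;
        // Off-screen: state machine keeps ticking, but Retarget/IK/Transform
        // write-back is skipped.
        backgroundAnimator.cullingMode = AnimatorCullingMode.CullUpdateTransforms;
    }
}
```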
(2) DOTween plugin
UI animation also sometimes contributes a lot to the Active Animator count. For simple UI animations, such as color changes, zooming, and movement, UWA recommends using DOTween instead; in testing, its performance is much better than native UI animation.
5.4 Number of Animators with Apply Root Motion enabled
In the stack of Animators.Update, sometimes you can see that Animator.ApplyBuiltinRootMotion accounts for too much. This item is usually related to the model animation with Apply Root Motion turned on in the project. If its animation does not require displacement, this option does not have to be turned on.
5.5 Animator.Initialize
Animator.Initialize is triggered whenever a GameObject containing an Animator component is activated or instantiated, and it is expensive. It is therefore not recommended to Deactivate/Activate GameObjects containing Animators too frequently, especially in battle scenes. Frequently instantiated characters and UI can be handled with a buffer pool. To hide a character, do not deactivate its GameObject directly; instead, disable the Animator component and move the GameObject off-screen. To hide a UI object, likewise do not deactivate it; instead set its scale to zero and move it off-screen. Neither approach triggers Animator.Initialize.
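The "hide without deactivating" approach above can be sketched as follows. The off-screen position and field names are illustrative:

```csharp
using UnityEngine;

// Sketch: hide a character by disabling its Animator and moving it away,
// instead of SetActive(false), so re-showing it does not trigger
// the expensive Animator.Initialize.
public class CharacterVisibility : MonoBehaviour
{
    [SerializeField] private Animator animator;
    private static readonly Vector3 OffscreenPos = new Vector3(0f, -10000f, 0f);
    private Vector3 savedPos;

    public void Hide()
    {
        animator.enabled = false;          // stop animation updates
        savedPos = transform.position;
        transform.position = OffscreenPos; // move out of view, GameObject stays active
    }

    public void Show()
    {
        transform.position = savedPos;
        animator.enabled = true;           // no Animator.Initialize cost here
    }
}
```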
5.6 Meshskinning.Update and Animators.WriteJob
Mesh resources have a significant impact on the animation module.
On one hand, high Meshskinning.Update cost is mainly driven by the bone and polygon counts of skinned meshes, so those meshes can be simplified and given LODs. On the other hand, under default settings the Transforms of a character's skeleton nodes always exist in the scene, so after the native layer computes them they are passed back to the C# layer, incurring a certain cost.
In scenes with many characters, this write-back of skeleton nodes generates noticeable overhead, reflected in Animators.WriteJob, a sub-function of PreLateUpdate.DirectorUpdateAnimationEnd and one of the animation module's main costs. Developers can consider checking the Optimize Game Objects option under the Rig tab of the FBX import settings to "hide" the skeleton nodes and reduce this cost.
5.7 GPU Skinning/Compute Skinning
As for the Unity engine's native GPU Skinning (Compute Skinning in newer Unity versions): in theory it changes how meshes and animations are updated in order to accelerate skeletal animation. However, across multiple tests on mobile, on both iOS and Android and across several Unity versions, GPU Skinning showed no clear performance benefit and sometimes even a regression. Unity has gradually improved it over successive releases, moving the related work onto the rendering thread, but its practicality still needs further investigation.
For large numbers of similar monsters, consider using the "GPU Skinning Accelerated Skeletal Animation" implementation in the UWA open-source library together with GPU Instancing for rendering: this both removes the Animator.Update cost and enables batched rendering.
6. Particle System
For more comprehensive content related to particle system optimization, please refer to “Unity Particle System Optimization – How to Optimize Your Skill Effects“.
6.1 Number of Playing particle systems
UWA tracks both the total number of particle systems and the number in the Playing state. The former is the count of all ParticleSystems in memory, including those currently playing and those sitting in a buffer pool; the latter is the count of ParticleSystem components currently playing, whether on-screen or off-screen. We recommend keeping the per-frame peak of the latter below 50 on a 1GB device.
For these two values, check on the one hand whether the total particle-system peak is too high: select a peak frame and examine which particle systems are cached, whether caching them is reasonable, and whether the cache is excessive. On the other hand, check whether the Playing peak is too high: select a peak frame and examine which particle systems are playing, whether each is justified, and whether production-side optimizations are possible (for details, see the discussion in the GPU section below).
The cost of ParticleSystem.Prewarm is also sometimes a concern. When a particle system with the Prewarm option enabled is instantiated in the scene or switched from Deactive to Active, a complete simulation is performed immediately.
A Prewarm pass usually takes a noticeable amount of time; in our tests, activating many Prewarm-enabled particle systems at once produces a time-consuming spike. We recommend turning the option off when it is not strictly needed.
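A small editor utility can help audit this. The following is a sketch of one possible approach (the menu path is illustrative, and this is not part of the UWA toolset): it scans all prefabs and logs every particle system with Prewarm enabled for review.

```csharp
#if UNITY_EDITOR
using UnityEditor;
using UnityEngine;

// Editor-only sketch: list prefabs whose particle systems have Prewarm
// enabled, so the team can decide which ones really need it.
public static class PrewarmScanner
{
    [MenuItem("Tools/Scan Prewarm Particle Systems")]
    public static void Scan()
    {
        foreach (string guid in AssetDatabase.FindAssets("t:Prefab"))
        {
            string path = AssetDatabase.GUIDToAssetPath(guid);
            var prefab = AssetDatabase.LoadAssetAtPath<GameObject>(path);
            if (prefab == null) continue;

            foreach (var ps in prefab.GetComponentsInChildren<ParticleSystem>(true))
            {
                if (ps.main.prewarm)
                    Debug.LogWarning($"Prewarm enabled: {path} ({ps.name})");
            }
        }
    }
}
#endif
```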
7. Loading Module
For more comprehensive content about loading module optimization, please refer to “Unity Performance Optimization – Loading and Resource Management“.
7.1 Loading of Shader
Shader.Parse is the loading and parsing of a Shader. If it occurs frequently, the cause is usually repeated Shader loading, and "repeated" here has two meanings.
The first is Shader redundancy: when AssetBundles are built without dependency packaging, the same Shader is passively pulled into multiple different AssetBundles. Loading resources from those bundles then passively loads those Shaders, triggering multiple "repeated" Shader.Parse calls and leaving several copies of the same Shader in memory, which is the redundancy.
Removing this redundancy is simple: actively package the redundant Shaders into a common AssetBundle on which the others depend, instead of letting the Shader passively slip into every bundle that uses it. Other AssetBundles that use the Shader then merely reference the shared bundle, so only one copy exists in memory and there is no need to run Shader.Parse multiple times.
The second meaning is that the same Shader is loaded and unloaded repeatedly without being cached. Suppose the Shader is actively packaged into a common AssetBundle, so only one copy exists in memory, but because nothing caches it after loading (that is, after Shader.Parse), it is unloaded immediately after use. The next time the Shader is needed, it is no longer in memory and must be reloaded, so the same Shader is loaded and parsed again and again, producing multiple Shader.Parse calls. Generally, a Shader that has been through variant optimization does not take much memory, so it can be loaded and cached once at game start.
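The load-once-and-cache idea above can be sketched like this. The class, bundle path, and fallback behavior are illustrative assumptions, not a prescribed design:

```csharp
using System.Collections.Generic;
using UnityEngine;

// Sketch: load the shared shader AssetBundle once at startup and keep the
// Shader references alive, so the same Shader is never parsed twice.
public static class ShaderCache
{
    private static readonly Dictionary<string, Shader> cache = new Dictionary<string, Shader>();
    private static AssetBundle shaderBundle;

    public static void Preload(string bundlePath)
    {
        shaderBundle = AssetBundle.LoadFromFile(bundlePath);
        foreach (var shader in shaderBundle.LoadAllAssets<Shader>())
            cache[shader.name] = shader;
        // Do NOT call shaderBundle.Unload(true) here, or the cached Shaders
        // would be destroyed and parsed again on the next load.
    }

    public static Shader Get(string shaderName)
    {
        // Fall back to Shader.Find only for shaders outside the bundle.
        return cache.TryGetValue(shaderName, out var s) ? s : Shader.Find(shaderName);
    }
}
```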
In particular, Unity's built-in Shaders, as long as their variant counts are small, can be added to the Always Included Shaders list in Project Settings to avoid redundant, repeated parsing.
Shader.Parse also appears in the stacks of the loading module's main functions, and even in UI-module and logic code. The related discussion has been covered above; the optimization approach is the same and is not repeated here.
7.2 Resources.UnloadUnusedAssets
Unity calls this API automatically when switching scenes, and a single call generally takes a long time, so manual calls are usually not recommended. However, in projects that never switch scenes, or that load scenes additively, the API is never called, so the project's overall resource count and memory trend steadily upward. In that case, consider calling it manually every 5-10 minutes.
The underlying mechanism of Resources.UnloadUnusedAssets is: for each asset, traverse all GameObject nodes in the Hierarchy tree and all objects in the managed heap to detect whether the asset is used by any GameObject or object (component); if not, the engine marks it as Unused and unloads it. In simple terms, the one-time cost of Resources.UnloadUnusedAssets grows roughly as (number of GameObjects + number of Mono objects) * number of Assets.
This process is therefore extremely expensive: the more GameObjects and Assets in the scene and the more objects in the heap, the greater the overhead. Our recommendations are as follows:
(1) Unload assets known to be unused at runtime
The team can release an asset that is certain to no longer be needed via Resources.UnloadAsset / AssetBundle.Unload(true) while the game is running. Both APIs are very efficient; doing so also reduces the work left for the unified Resources.UnloadUnusedAssets pass, cutting its cost at scene switches.
(2) Strictly control the number of material resources and particle systems used in the scene
These two resource types deserve special mention because in most projects, although their memory footprint is generally not large, their counts are often far higher than those of other resource types, easily reaching the thousands, so they contribute heavily to the cost of a single Resources.UnloadUnusedAssets call.
(3) Reduce the resident heap memory
The number of objects in the heap also significantly affects the cost of Resources.UnloadUnusedAssets, as discussed above.
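Recommendation (1) above can be sketched as a pair of helpers. The wrapper names are illustrative; the underlying calls are the standard Unity APIs:

```csharp
using UnityEngine;

// Sketch: release assets you know are no longer used as soon as possible,
// instead of leaving everything to one expensive Resources.UnloadUnusedAssets
// call at scene switch.
public static class AssetRelease
{
    // Unload a single asset, e.g. a Texture obtained via Resources.Load.
    public static void ReleaseAsset(Object asset)
    {
        if (asset != null)
            Resources.UnloadAsset(asset);
    }

    // Unload an AssetBundle together with every asset loaded from it.
    public static void ReleaseBundle(AssetBundle bundle)
    {
        if (bundle != null)
            bundle.Unload(true);
    }
}
```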
7.3 Load AssetBundle
Using AssetBundle to load resources is a common practice in mobile game projects.
AssetBundles should be packaged with LZ4 compression wherever possible and loaded via LoadFromFile. In our tests, even a large AssetBundle (containing ten 1024*1024 textures) takes only a few tenths of a millisecond to load with this combination. Other loading methods fare much worse: LoadFromMemory pushes the load time to tens of milliseconds, while WebRequest loading significantly increases the AssetBundle's resident memory. The reason is that LoadFromFile is an efficient API for loading uncompressed or LZ4-compressed AssetBundles from local storage such as a hard disk or SD card.
On desktop standalone platforms, consoles, and mobile platforms, the API loads only the AssetBundle's header and leaves the rest of the data on disk. The bundle's objects are loaded on demand, for example when a load method (such as AssetBundle.LoadAsset) is called or when an InstanceID is dereferenced, so it does not consume much memory.
In the Editor, however, the API still loads the entire AssetBundle into memory, as if reading the bytes from disk and calling AssetBundle.LoadFromMemoryAsync. This can cause memory spikes during AssetBundle loading when the project is profiled in the Editor. It should not affect performance on the device, so retest such spikes on the device before optimizing.
Note that this efficiency only applies to uncompressed or LZ4-compressed bundles: LZMA compresses the entire generated data package, so the AssetBundle's header cannot be read before decompression.
Since LoadFromMemory is significantly slower than the other interfaces and also inflates heap memory, we do not recommend using it at scale. If AssetBundle files really must be encrypted, consider encrypting only important configuration files, code, and so on, and leaving resource files such as textures and meshes unencrypted. Tools already exist that can extract rendering-related resources such as textures and meshes at a lower level, so encrypting that part buys little.
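The recommended LZ4 + LoadFromFile combination can be sketched as follows. The bundle file and asset names are illustrative:

```csharp
using UnityEngine;

// Sketch: load an LZ4-compressed AssetBundle synchronously with LoadFromFile.
// Only the header is read up front; object data stays on disk until needed.
public static class BundleLoader
{
    public static GameObject LoadPrefab(string bundleFile, string assetName)
    {
        string path = System.IO.Path.Combine(Application.streamingAssetsPath, bundleFile);
        AssetBundle bundle = AssetBundle.LoadFromFile(path);
        if (bundle == null)
        {
            Debug.LogError($"Failed to load AssetBundle: {path}");
            return null;
        }
        // Objects are loaded on demand here, not when the bundle was opened.
        return bundle.LoadAsset<GameObject>(assetName);
    }
}
```

When building, LZ4 corresponds to BuildAssetBundleOptions.ChunkBasedCompression in the BuildPipeline settings.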
In the resource management page of UWA GOT Online Resource mode, you can troubleshoot AssetBundles that take a long time to load, so as to troubleshoot and optimize the loading method, compression format, package size, and other problems, or consider caching the AssetBundles that are loaded repeatedly.
7.4 Loading Resources
As for the cost of loading resources: with a reasonable loading strategy it occurs mostly at game start and at scene switches, and rarely causes a serious performance bottleneck. Still, some cases deserve attention, so sorting resources by load time is a useful starting point for troubleshooting.
For resources whose single load takes too long, say hundreds of milliseconds or even seconds, examine whether the resource is overly complex and consider simplifying it at the production stage. Resources that are loaded frequently and expensively should be cached after the first load to avoid the overhead of repeated loading.
It is worth mentioning that during Unity's asynchronous loading, the time loading may occupy in each frame is capped, which can leave the main thread idling. This is most visible at scene switches, where concentrated async loading can take ten or even tens of seconds, most of it wasted idling. The cause is Application.backgroundLoadingPriority, which controls the per-frame async loading budget: it defaults to BelowNormal, allowing at most 4ms of loading per frame. We generally recommend setting it to High, which allows up to 50ms per frame.
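A minimal sketch of this tuning, raising the budget during a loading screen and restoring the default afterwards (the component and method names are illustrative):

```csharp
using UnityEngine;

// Sketch: widen the async loading budget while a loading screen is up,
// then restore the default so gameplay frames stay smooth.
public class LoadingBudget : MonoBehaviour
{
    public void BeginSceneLoad()
    {
        // High allows up to ~50 ms of loading integration per frame
        // (the default BelowNormal allows only ~4 ms).
        Application.backgroundLoadingPriority = ThreadPriority.High;
    }

    public void EndSceneLoad()
    {
        Application.backgroundLoadingPriority = ThreadPriority.BelowNormal;
    }
}
```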
In the resource management page of UWA GOT Online Resource mode, you can check the resources that take a long time to load, optimize their loading method, simplify overly complex resources, and consider caching repeatedly loaded resources.
7.5 Instantiation and Destruction
Instantiation problems mainly take two forms: a single resource whose instantiation takes too long, or a resource that is instantiated repeatedly and frequently. Sort by cost to find the suspects. For the former, consider simplifying the resource or spreading the instantiation over several frames; for a complex UI Prefab, for example, instantiate the prominent, important panels and buttons first, and instantiate content behind page turns and decorative icons later. For the latter, build a cache pool and replace frequent instantiation with show/hide operations.
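The cache pool mentioned above can be sketched in a few lines. This is a minimal illustration; a production pool would also handle warm-up, capacity limits, and parenting:

```csharp
using System.Collections.Generic;
using UnityEngine;

// Sketch of a cache pool that replaces frequent Instantiate/Destroy
// with show/hide of reused instances.
public class PrefabPool
{
    private readonly GameObject prefab;
    private readonly Stack<GameObject> idle = new Stack<GameObject>();

    public PrefabPool(GameObject prefab)
    {
        this.prefab = prefab;
    }

    public GameObject Get()
    {
        // Reuse an idle instance if available; instantiate only when empty.
        var go = idle.Count > 0 ? idle.Pop() : Object.Instantiate(prefab);
        go.SetActive(true);
        return go;
    }

    public void Release(GameObject go)
    {
        go.SetActive(false); // hide instead of Destroy
        idle.Push(go);
    }
}
```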
In the resource management page of UWA GOT Online Resource mode, you can troubleshoot resources that take a long time to instantiate, so as to troubleshoot and optimize the problem of overly complex resources, or consider caching for repeatedly instantiated resources.
7.6 Activate and Deactivate
Activation and deactivation are not expensive in themselves, but a large number of operations in a single frame deserves attention; some judgments and conditions in the game logic may not be reasonable. Many projects show and hide a given resource far too often, with many more SetActive(true) calls than SetActive(false), or the opposite, meaning plenty of unnecessary SetActive calls. Since SetActive crosses from C# into native code, the cost becomes considerable once the call count grows. Besides checking whether the logic can be optimized, consider keeping a state cache in the logic and checking the resource's current activation state before calling the API: this replaces the API's overhead with a much cheaper logic check.
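The state-cache suggestion can be sketched as a thin wrapper (the class name is illustrative):

```csharp
using UnityEngine;

// Sketch: check a cached flag in C# before crossing into native code with
// SetActive, trading a cheap logic check for the more expensive engine call.
public class CachedActive
{
    private readonly GameObject go;
    private bool isActive;

    public CachedActive(GameObject go)
    {
        this.go = go;
        isActive = go.activeSelf;
    }

    public void SetActive(bool value)
    {
        if (isActive == value)
            return;            // skip the redundant native call
        isActive = value;
        go.SetActive(value);
    }
}
```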
On the resource management page in UWA GOT Online Resource mode, you can check the resources that are activated and hidden frequently, so as to check and optimize the related logic and calls.
8. Logic Script
Optimizing the CPU cost of logic code is largely a test of the programmer against the project's actual needs, and it is hard to discuss quantitatively or qualitatively in general. However, the UWA SDK provides APIs, and UWA GOT Online provides tooling, for instrumenting logic code, breaking down complex functions, inspecting their stacks and costs in the report, and verifying optimizations quickly.
We have found more and more teams using the JobSystem to move part of the main-thread logic into worker threads. For logic that can run in parallel, we strongly recommend doing so: it effectively reduces the main thread's CPU load.
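A minimal JobSystem sketch follows; the workload (per-element distance computation) is illustrative only, and in practice the handle would be completed as late in the frame as possible rather than immediately:

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Sketch: a parallel job that does per-element math off the main thread.
public class DistanceJobRunner : MonoBehaviour
{
    struct DistanceJob : IJobParallelFor
    {
        [ReadOnly] public NativeArray<Vector3> positions;
        public NativeArray<float> distances;
        public Vector3 origin;

        public void Execute(int i)
        {
            distances[i] = Vector3.Distance(positions[i], origin);
        }
    }

    void Update()
    {
        var positions = new NativeArray<Vector3>(1024, Allocator.TempJob);
        var distances = new NativeArray<float>(1024, Allocator.TempJob);

        var job = new DistanceJob
        {
            positions = positions,
            distances = distances,
            origin = transform.position
        };
        // 64 elements per batch; worker threads process batches in parallel.
        JobHandle handle = job.Schedule(positions.Length, 64);
        handle.Complete();

        positions.Dispose();
        distances.Dispose();
    }
}
```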
9. Lua
The GOT Online Lua mode provides tools for analyzing the CPU cost caused by Lua, with strong visualization, clear stacks, and a practical, distinctive reverse-order call analysis feature. Below, a Lua report demo is used to briefly introduce how to analyze Lua CPU cost with this tool.
To reiterate: function names appearing in Lua reports have the format: function name@filename:linenumber.
The Lua file name, line number, and function name provided by the report let you locate the CPU bottleneck function and the specific cause of a CPU spike. Lua functions are named in the format X@Y:Z, where X is the function name (if it cannot be obtained, X defaults to unknown), Y is the file where the function is defined, and Z is the line at which it is defined. Note that when a Lua script runs as bytecode, Z is always 0, so testing with Lua source code is recommended where possible.
(1) Forward-order call analysis – summary page (curve + list)
The curve graph plots the overall time cost of the Lua code plus the five most expensive functions by average cost; each data point is the function's cost (vertical axis) in the given frame (horizontal axis). This helps locate bottleneck functions.
By default the list sorts Lua functions by average cost and shows the function name, total CPU time, per-scene CPU time, average cost, and similar data. Clicking a function opens its single-function analysis page.
(2) Forward-order call analysis – single-function page (screenshot + graph + stack information)
The screenshot roughly corresponds to the frame selected by the user while the project runs, which helps locate the problem.
The curve graph includes the CPU time consumption curve and the number of calls; you can also use the zoom curve below to observe the local time consumption.
From the graph you can observe whether the function is continuously expensive; whether it has brief but heavy spikes that cause stutter; and whether a function that is cheap per call is called so often that its total cost is high.
Function XXXX stack information (list as shown above)
The time range of the list data can be selected in the upper-right corner: with overall stack information, the range is the entire test; with scene stack information, it is the duration of the specified scene; with frame stack information, it is the frame currently selected in the graph.
The meaning of each indicator in the list:
- Overall proportion: taking the root-node function's total cost as 100%, the current node function's total cost relative to it.
- Self proportion: taking the root-node function's total cost as 100%, the current node function's self cost relative to it.
- Self cost and remaining cost: the function's own cost versus the cost of the functions it calls.
- Call count: the number of times the function is called within the time range.
- Single cost: total cost divided by call count, i.e. the average cost of each execution.
- Significant frame count: the number of frames in which the function's own cost exceeds 3ms.
(3) Reverse-order call analysis – summary page (curve + list)
Curve graph: unlike the forward-order analysis, the five functions with the highest self cost are selected; each data point is the function's cost (vertical axis) in the given frame (horizontal axis).
List: Same as above.
(4) Reverse-order call analysis – single-function page (screenshot + graph + stack information)
Function XXXX stack information (list):
The indicators differ from the forward-order analysis:
- Self proportion: taking the sum of the selected function's self cost as 100%, the self cost of the selected function under this calling path relative to that total.
- Self cost: the sum of the selected function's self cost under this calling path, within the time range.
- Call count: the number of calls along this calling path.
- Single cost: the average self cost of the selected function per call along this path.
After locating expensive functions through the above pages, common optimizations include optimizing the function body to reduce its own cost, and locating heavily used calling paths to reduce the call count.
Note that GC time is not included in the Lua CPU time. Lua function timing works by checkpointing on function entry and exit, so if a Lua script calls a C# function, the C# function's time is counted too; pay attention to interleaved Lua/C# calls and try to keep them within 50 per frame.
That is all for this article. For more content, you can visit the UWA Blog. The course discusses performance problems that often occur in current game projects from three dimensions: memory, CPU, and GPU.
UWA Website: https://en.uwa4d.com
UWA Blogs: https://blog.en.uwa4d.com
UWA Product: https://en.uwa4d.com/feature/got