Since Epic released the UE5 technology demo in early 2021, discussion of UE5 has never stopped. The technical discussion has centered on two new features: the global illumination technology Lumen and the high-detail geometry technology Nanite. Some articles [1] have already analyzed Nanite in considerable depth. Starting from a RenderDoc capture and the UE5 source code, combined with existing technical material, this article aims to give an intuitive, overview-level understanding of Nanite and to clarify its algorithmic principles and design ideas, without going into too many source-level implementation details.
2. What do we need for next-generation model rendering?
To analyze Nanite's technical points, we must first start from the requirements. Over the past decade, AAA game development has gradually converged on two main directions: cinematic interactive narrative and open worlds. Realistic cutscenes require exquisite character models; a sufficiently flexible and rich open world makes map size and object counts grow exponentially. Both greatly raise the demands on scene fidelity and complexity: there must be many objects, and each model must be sufficiently detailed.
There are usually two bottlenecks in the rendering of complex scenes:
- the CPU-side validation and CPU-GPU communication overhead incurred by each Draw Call;
- the overdraw caused by inaccurate culling and the resulting waste of GPU computing resources.
In recent years, rendering optimizations have largely revolved around these two problems, and some industry consensus has formed.
To address the overhead of CPU-side validation and state switching, we now have a new generation of graphics APIs (Vulkan, DX12, and Metal). They are designed so that drivers do less validation work on the CPU side; different tasks are dispatched to the GPU through different queues (Compute/Graphics/DMA Queue); developers are required to handle CPU-GPU synchronization themselves; and multi-core, multi-threaded CPUs are fully exploited to submit commands to the GPU. Thanks to these optimizations, the number of Draw Calls the new generation of graphics APIs can handle has increased by an order of magnitude over the previous generation (DX11, OpenGL).
The other optimization direction is to reduce data communication between the CPU and GPU, and to cull more accurately the triangles that contribute nothing to the final image. From this idea the GPU Driven Pipeline was born. For more information about the GPU Driven Pipeline and culling, see this earlier article by the author.
Thanks to the increasingly widespread adoption of the GPU Driven Pipeline in games, subdividing a model's vertex data into finer-grained Clusters (also called Meshlets), so that each Cluster fits the cache size of the Vertex Processing stage and lends itself to various kinds of culling (Frustum Culling, Occlusion Culling, and Backface Culling), has gradually become best practice for optimizing complex scenes, and GPU vendors have come to acknowledge this new vertex-processing flow.
However, the traditional GPU Driven Pipeline relies on Compute Shaders for culling. The culled results must be stored in a GPU Buffer and fed back into the GPU's Graphics Pipeline as Vertex/Index Buffers via APIs such as ExecuteIndirect, which adds invisible read/write overhead. In addition, the vertex data is read twice: once by the Compute Shader during culling, and again by the Graphics Pipeline through Vertex Attribute Fetch when drawing.
For these reasons, and to further improve the flexibility of vertex processing, NVIDIA first introduced the concept of the Mesh Shader, hoping to gradually remove some fixed-function units of the traditional vertex-processing stage (VAF, PD, and other hardware units) and hand their work over to developers through a programmable pipeline (Task Shader/Mesh Shader).
Schematic of Cluster
In the traditional GPU Driven Pipeline, culling depends on CS, and the culled data is passed back to the vertex-processing pipeline through VRAM
In a Mesh Shader-based pipeline, Cluster culling becomes part of the vertex-processing stage, avoiding unnecessary Vertex Buffer loads/stores
3. Are these enough?
So far, the number of models, vertices, and triangles has been greatly optimized. However, high-precision models made of pixel-sized small triangles put new pressure on the rendering pipeline: rasterization pressure and overdraw pressure.
Does software rasterization have a chance to beat hardware rasterization?
To answer this question, you first need to understand what hardware rasterization does and what application scenarios it was designed for (read this article if you are interested). Put simply: traditional rasterization hardware was designed under the assumption that input triangles are much larger than one pixel. Based on this assumption, hardware rasterization is usually hierarchical.
Taking the NVIDIA rasterizer as an example, a triangle usually goes through two stages of rasterization: Coarse Raster and Fine Raster. The former takes a triangle as input and, treating 8×8 pixels as one block, rasterizes the triangle into several blocks (this can also be understood as coarse rasterization onto a framebuffer 1/8 × 1/8 the original size).
At this stage, occluded blocks are culled wholesale using a low-resolution Z-Buffer, which NVIDIA calls Z Cull. Blocks that survive Z Cull are sent on to the Fine Raster stage, which finally generates the pixels for shading. In the Fine Raster stage, the familiar Early-Z test takes place. Because Mip-Map sampling needs to know each pixel's neighbors, using the difference of the sampled UVs as the basis for selecting the Mip-Map level, the final output of Fine Raster is not individual pixels but small 2×2 Pixel Quads.
For triangles close to pixel size, the waste in hardware rasterization is obvious. First, the Coarse Raster stage is almost useless, because such triangles are usually smaller than 8×8. The situation is even worse for long, thin triangles: one triangle often spans multiple blocks, yet cannot be culled by Coarse Raster, adding extra computation. Moreover, whereas for large triangles the Quad-based Fine Raster only generates a few useless pixels along the edges (a small fraction of the triangle's area), for small triangles a Pixel Quad can in the worst case generate four times as many pixels as the triangle covers. These extra pixels still go through the Pixel Shader stage, greatly reducing the proportion of effective pixels in a WARP.
Rasterization waste of small triangles due to Pixel Quad
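The quad overhead described above is easy to reproduce numerically. The following toy Python sketch (illustrative only, not engine code; the point-in-triangle test and quad alignment are simplified assumptions) rasterizes a roughly one-pixel triangle and counts how many pixel-shader lanes the 2×2 quad rule actually launches:

```python
# Toy illustration of helper-lane waste from 2x2 pixel quads on tiny triangles.

def covered_pixels(tri):
    """Return pixel coordinates whose centers fall inside the triangle."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    def edge(ax, ay, bx, by, px, py):
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    pixels = set()
    for x in range(int(min(x0, x1, x2)), int(max(x0, x1, x2)) + 2):
        for y in range(int(min(y0, y1, y2)), int(max(y0, y1, y2)) + 2):
            px, py = x + 0.5, y + 0.5  # pixel center
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
               (w0 <= 0 and w1 <= 0 and w2 <= 0):
                pixels.add((x, y))
    return pixels

def shaded_lanes(pixels):
    """Every covered pixel drags in its whole 2x2 quad (even-aligned)."""
    quads = {(x // 2, y // 2) for (x, y) in pixels}
    return 4 * len(quads)

# A triangle covering roughly one pixel:
tiny = [(10.1, 10.1), (11.4, 10.2), (10.2, 11.4)]
cov = covered_pixels(tiny)
print(len(cov), shaded_lanes(cov))  # 1 covered pixel, but 4 shader lanes
```

One visible pixel costs four shader invocations, the 4x worst case mentioned above.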
For the above reasons, under the specific premise of pixel-sized small triangles, software rasterization (based on Compute Shaders) really does have a chance to beat hardware rasterization. This is one of Nanite's core optimizations, and it improves UE5's small-triangle rasterization efficiency by a factor of three.
Overdraw has long been a performance bottleneck in graphics rendering, and optimizations around it keep emerging. On mobile there is the familiar Tile-Based Rendering architecture; in the evolution of the rendering pipeline, Z-Prepass, Deferred Rendering, Tile-Based Rendering, and Clustered Rendering have all been proposed. These different pipeline frameworks actually solve the same problem: once the number of light sources and the complexity of materials grow beyond a point, how to avoid heavy branching in shaders and reduce useless overdraw. On this topic, see the author's earlier article.
Generally speaking, a deferred rendering pipeline needs a set of Render Targets called the G-Buffer; these textures store all the material information needed for lighting. In today's AAA games the material types are complex and varied, and the G-Buffer information that must be stored grows year by year. Taking the 2009 game "Killzone 2" as an example, its G-Buffer layout is as follows:
Excluding the Lighting Buffer, the G-Buffer proper uses 4 textures, totaling 16 Bytes/Pixel. By 2016, the G-Buffer layout of "Uncharted 4" looked like this:
Here the G-Buffer uses 8 textures, i.e. 32 Bytes/Pixel. In other words, at the same resolution, the growth in material complexity and fidelity alone has doubled the bandwidth required by the G-Buffer, and that is before accounting for game resolutions that have also risen year by year.
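A quick back-of-envelope calculation makes the bandwidth growth concrete (the resolutions and per-pixel sizes below are just the figures from the text; overdraw multiplies each number further):

```python
# G-Buffer footprint for a single full write of every pixel.
def gbuffer_mb(width, height, bytes_per_pixel):
    return width * height * bytes_per_pixel / (1024 * 1024)

# 16 B/px ("Killzone 2" era) vs 32 B/px ("Uncharted 4" era) at 1080p:
print(round(gbuffer_mb(1920, 1080, 16), 1))  # ~31.6 MB
print(round(gbuffer_mb(1920, 1080, 32), 1))  # ~63.3 MB
```

With an overdraw factor of, say, 3, each frame writes (and later reads) several times these amounts.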
For scenes with heavy overdraw, the read/write bandwidth of G-Buffer rendering often becomes the performance bottleneck. Academia therefore proposed a new rendering pipeline: the Visibility Buffer. Algorithms based on a Visibility Buffer no longer generate a bloated G-Buffer; instead they use a Visibility Buffer with much lower bandwidth overhead, which typically needs the following information:
(1) Instance ID, indicating which Instance the current pixel belongs to (16~24 bits);
(2) Primitive ID, indicating which triangle of that Instance the current pixel belongs to (8~16 bits);
(3) Barycentric Coord, representing the position of the current pixel within the triangle, expressed in barycentric coordinates (16 bits);
(4) Depth Buffer, representing the depth of the current pixel (16~24 bits);
(5) Material ID, indicating which material the current pixel belongs to (8~16 bits).
With the above, we only need to store about 8~12 Bytes/Pixel to represent the material information of all geometry in the scene. At the same time, we must maintain global vertex and material tables that store the vertex data of all geometry in the current frame, along with material parameters and textures.
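To make the layout tangible, here is a hypothetical packing of such a Visibility Buffer entry into two 32-bit words. The bit widths (20-bit instance, 12-bit primitive, two 16-bit fixed-point barycentrics) are illustrative choices within the ranges listed above, not any particular engine's format:

```python
# Hypothetical Visibility Buffer packing: two 32-bit words per pixel.

def pack_visibility(instance_id, primitive_id, bary_u, bary_v):
    """Word 0: 20-bit instance ID | 12-bit primitive ID.
    Word 1: two 16-bit fixed-point barycentrics (w = 1 - u - v is implicit)."""
    assert instance_id < (1 << 20) and primitive_id < (1 << 12)
    word0 = (instance_id << 12) | primitive_id
    u16 = int(bary_u * 65535.0 + 0.5)
    v16 = int(bary_v * 65535.0 + 0.5)
    word1 = (u16 << 16) | v16
    return word0, word1

def unpack_visibility(word0, word1):
    instance_id = word0 >> 12
    primitive_id = word0 & 0xFFF
    bary_u = (word1 >> 16) / 65535.0
    bary_v = (word1 & 0xFFFF) / 65535.0
    return instance_id, primitive_id, bary_u, bary_v

w0, w1 = pack_visibility(1234, 56, 0.25, 0.5)
print(unpack_visibility(w0, w1))  # ids exact, barycentrics quantized to ~1/65535
```

Depth and Material ID would occupy additional channels, as in the list above.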
In the lighting/shading stage, we only need to index into the global Vertex Buffer with the Instance ID and Primitive ID to fetch the relevant triangle; then, using the pixel's barycentric coordinates, interpolate the vertex attributes in the Vertex Buffer (UV, Tangent Space, etc.) to obtain per-pixel values; then index the material information via the Material ID, perform texture sampling and similar operations, and feed everything into the lighting calculation to complete shading. This family of methods is sometimes called Deferred Texturing.
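The shading step just described can be sketched as follows. The global geometry table here is a made-up Python structure standing in for the real GPU-side buffers; only the indexing-and-interpolation pattern is the point:

```python
# Sketch of the Deferred Texturing shading step: fetch the triangle via
# (InstanceID, PrimitiveID), then interpolate vertex attributes with the
# stored barycentric coordinates.

# Stand-in "global" geometry table: instance_id -> per-vertex UVs, 3 per triangle.
vertex_uv = {
    7: [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),   # triangle 0
        (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],  # triangle 1
}

def shade_pixel(instance_id, primitive_id, bary):
    """Interpolate the UV for one pixel from Visibility Buffer contents."""
    u, v = bary
    w = 1.0 - u - v
    i0 = primitive_id * 3
    uv0, uv1, uv2 = vertex_uv[instance_id][i0:i0 + 3]
    return (w * uv0[0] + u * uv1[0] + v * uv2[0],
            w * uv0[1] + u * uv1[1] + v * uv2[1])

print(shade_pixel(7, 0, (0.5, 0.5)))  # midpoint of the uv1-uv2 edge -> (0.5, 0.5)
```

The same pattern extends to normals, tangents, and any other attribute; the Material ID then selects which shader consumes the result.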
The following is the rendering pipeline process based on G-Buffer:
This is the rendering pipeline process based on Visibility-Buffer:
Intuitively, the Visibility Buffer reduces the storage bandwidth of the information needed for shading (G-Buffer -> Visibility Buffer). It also defers the reading of geometry and texture data to the shading stage, so occluded pixels never read that data at all; only vertex positions are needed up front. For these two reasons, the Visibility Buffer's bandwidth overhead in complex, high-resolution scenes is greatly reduced compared with a traditional G-Buffer. However, maintaining global geometry and material data increases the complexity of engine design and reduces the flexibility of the material system, and it sometimes requires graphics API features such as Bindless Textures that not all hardware platforms support yet, which hurts compatibility.
4. The Implementation in Nanite
Rome was not built in a day. Any technological breakthrough in a mature academic and engineering field builds on the thinking and practice of predecessors, which is why we have spent so much space on the technical background. Nanite is an excellent piece of engineering built on a synthesis of those earlier solutions and the computing power of current hardware, created to meet the needs of next-generation game technology.
Its core idea can be broken into two parts: optimization of vertex processing and optimization of pixel processing. The vertex-processing optimization follows the GPU Driven Pipeline; the pixel-processing optimization follows the Visibility Buffer, combined with software rasterization. With the help of RenderDoc captures of UE5's "Valley of the Ancient" demo and the related source code, we can get a glimpse of how Nanite really works. The overall algorithm flow is shown in the figure:
Instance Cull & Persistent Cull
Having covered the development of the GPU Driven Pipeline in detail, Nanite's implementation is not hard to understand: in a preprocessing step, each Nanite Mesh is cut into several Clusters of 128 triangles each, and the whole mesh is organized into a BVH (Bounding Volume Hierarchy) tree whose leaf nodes are Clusters. Culling happens in two steps: frustum culling and HZB-based occlusion culling. Instance Cull works per Mesh; for each Mesh that survives Instance Cull, the root node of its BVH is sent to the Persistent Cull stage for hierarchical culling (if a BVH node is culled, none of its children are processed).
This raises a question: how do we map the culling tasks of the Persistent Cull phase onto Compute Shader threads? The simplest way is to give each BVH tree its own thread, i.e. one thread per Nanite Mesh. But because mesh complexity varies, the node count and depth of each BVH differ greatly; such an arrangement would give each thread a very different amount of work, threads would wait on each other, and parallelism would suffer. Could we instead assign one thread per BVH node that needs processing? That would be ideal, but we cannot know in advance how many nodes will be processed, because the culling is hierarchical and dynamic.
Nanite's solution is to launch a fixed number of threads, each of which fetches BVH nodes to cull from a global FIFO task queue. If a node survives culling, all of its children are appended to the end of the queue, and the thread loops back to fetch new nodes from the global queue until the queue is empty and no new nodes are produced. This is the classic multi-threaded producer-consumer model, except that each thread acts as both producer and consumer. Through this model, Nanite keeps the processing time of all threads approximately equal.
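The queue-driven traversal above can be sketched in a few lines. This single-threaded Python model uses a 1-D interval as a stand-in bounding volume and a trivial overlap test in place of frustum/HZB culling; on the GPU, a fixed pool of persistent threads pops from the shared queue concurrently:

```python
from collections import deque

# Minimal sketch of persistent culling: a FIFO of BVH nodes where each
# surviving node enqueues its children and leaves emit their clusters.

class Node:
    def __init__(self, bounds, children=(), cluster=None):
        self.bounds = bounds          # (min_x, max_x): 1-D "AABB" for the demo
        self.children = list(children)
        self.cluster = cluster        # leaf payload, None for inner nodes

def visible(bounds, frustum):
    """Trivial 1-D overlap test standing in for frustum/HZB culling."""
    return bounds[0] <= frustum[1] and bounds[1] >= frustum[0]

def persistent_cull(roots, frustum):
    queue = deque(roots)              # global FIFO task queue
    clusters = []
    while queue:
        node = queue.popleft()        # this "thread" consumes a task...
        if not visible(node.bounds, frustum):
            continue                  # culled: children are never processed
        if node.cluster is not None:
            clusters.append(node.cluster)
        queue.extend(node.children)   # ...and produces new tasks
    return clusters

leaf_a = Node((0, 1), cluster="A")
leaf_b = Node((5, 6), cluster="B")
root = Node((0, 6), children=[leaf_a, leaf_b])
print(persistent_cull([root], frustum=(0, 2)))  # -> ['A']
```

The termination condition on real hardware is subtler (the queue being momentarily empty does not mean other threads will not push more work), which is exactly why the producer-consumer framing matters.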
The entire culling phase is divided into two passes: Main Pass and Post Pass (a console variable can restrict it to the Main Pass only). The logic of the two passes is basically the same; the only difference is that the Main Pass's occlusion culling uses an HZB built from the previous frame's data, while the Post Pass uses the HZB of the current frame, built after the Main Pass ends. This prevents the previous frame's HZB from erroneously culling meshes that are actually visible.
It should be noted that Nanite does not use Mesh Shaders. On the one hand, Mesh Shader support is not yet widespread; on the other hand, because Nanite rasterizes in software, the output of a Mesh Shader would still have to be written back to a GPU Buffer as input for the software rasterizer, so compared with the CS-based scheme it would not save much bandwidth.
After culling, each Cluster is sent to a different rasterizer according to its screen-space size: large triangles and non-Nanite meshes still use hardware rasterization, while small triangles go through a software rasterizer written in Compute Shader. Nanite's Visibility Buffer is an R32G32_UINT texture (8 Bytes/Pixel): bits 0~6 of the R channel store the Triangle ID, bits 7~31 store the Cluster ID, and the G channel stores 32-bit depth:
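The bit layout just described can be expressed directly. This is a plain Python transcription of the stated layout (7 bits of triangle index, matching the 128 triangles per Cluster, plus 25 bits of Cluster ID in R, and depth in G):

```python
# Packing for the layout described above: R = cluster ID (25 bits) | triangle
# index within the cluster (7 bits, since 128 triangles -> 7 bits); G = depth.

def pack_nanite_vis(cluster_id, triangle_id, depth_uint32):
    assert triangle_id < 128 and cluster_id < (1 << 25)
    r = (cluster_id << 7) | triangle_id
    g = depth_uint32
    return r, g

def unpack_nanite_vis(r, g):
    return r >> 7, r & 0x7F, g  # (cluster_id, triangle_id, depth)

r, g = pack_nanite_vis(cluster_id=42, triangle_id=127, depth_uint32=3_000_000_000)
print(unpack_nanite_vis(r, g))  # -> (42, 127, 3000000000)
```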
The logic of the software rasterizer is relatively simple. Based on a scanline algorithm, each Cluster launches its own Compute Shader group. In its initial stage, the CS computes all Clip Space vertex positions and caches them in Shared Memory. Each thread in the CS then reads the Index Buffer of its triangle and the transformed vertex positions, computes the triangle's edges from those positions, performs backface culling and small-triangle (sub-pixel) culling, then completes the Z-test with an atomic operation and writes the data into the Visibility Buffer. It is worth mentioning that, to keep the software rasterization logic simple and efficient, Nanite meshes do not support skeletal animation, vertex animation, or Masked materials.
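Why can a single atomic complete the Z-test? One common technique, which matches descriptions of Nanite's rasterizer, is to pack depth into the high bits of a 64-bit word with the payload below it, so that an atomic max simultaneously resolves depth and writes the winning fragment (this assumes a reversed-Z convention where nearer means a larger depth value). A Python emulation, with `max` standing in for the GPU atomic:

```python
# Emulated atomic depth test: packed = (depth << 32) | payload, resolved
# with max(). With reversed-Z, the nearest fragment has the largest value,
# so one atomic max per fragment performs Z-test + visibility write together.

FB_W, FB_H = 4, 4
framebuffer = [0] * (FB_W * FB_H)  # one packed 64-bit value per pixel

def write_fragment(x, y, depth_uint32, payload_uint32):
    packed = (depth_uint32 << 32) | payload_uint32
    i = y * FB_W + x
    framebuffer[i] = max(framebuffer[i], packed)  # GPU: InterlockedMax

# Two fragments contend for pixel (1, 1); the nearer one (larger reversed-Z) wins:
write_fragment(1, 1, depth_uint32=1000, payload_uint32=0xAAAA)
write_fragment(1, 1, depth_uint32=2000, payload_uint32=0xBBBB)
packed = framebuffer[1 * FB_W + 1]
print(packed >> 32, hex(packed & 0xFFFFFFFF))  # depth 2000, payload 0xbbbb
```

Because depth occupies the most significant bits, the comparison never lets a farther fragment's payload overwrite a nearer one, regardless of arrival order.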
To keep the data structure as compact as possible and reduce read/write bandwidth, all the data the software rasterizer needs is stored in the single Visibility Buffer. But in order to composite with the pixels produced by hardware rasterization elsewhere in the scene, we ultimately need to write the information inside the Visibility Buffer out to the unified Depth/Stencil Buffer and Motion Vector Buffer. This stage consists of several full-screen passes:
(1) Emit Scene Depth/Stencil/Nanite Mask/Velocity Buffer. Depending on the RenderTarget data the final scene needs, this step outputs up to four buffers. The Nanite Mask uses 0/1 to mark whether the current pixel is a regular Mesh or a Nanite Mesh (determined from the Cluster ID in the Visibility Buffer). For Nanite pixels, the depth in the Visibility Buffer is converted from UINT to float and written into the Scene Depth Buffer; depending on whether the Nanite Mesh accepts decals, the corresponding Stencil Value is written into the Scene Stencil Buffer; and the pixel's Motion Vector is computed from its position in the previous frame and written into the Velocity Buffer. Non-Nanite pixels are simply discarded and skipped.
Scene Depth/Stencil Buffer
(2) Emit Material Depth. This step generates a Material ID Buffer, but instead of storing it in a UINT texture, the UINT Material ID is converted to float and stored in a Depth/Stencil Target with the D32S8 format (we will explain the reason later). In theory this supports up to 2^32 materials (in practice only 14 bits are used for the Material ID), and the Nanite Mask is written into the Stencil Buffer.
Material Depth Buffer
Classify Materials & Emit G-Buffer
We have already described the principle of the Visibility Buffer in detail. One implementation of the shading phase maintains a global material table holding material parameters and the indices of related textures; each pixel looks up its material via the Material ID, parses the material information, and fetches texture data through techniques such as Virtual Texture or Bindless Texture/Texture Array. That is feasible for a simple material system, but UE has a very complex one: each material can use a different Shading Model, and even within one Shading Model the material parameters can be wired together through arbitrarily complex material-graph computations. This Blueprint-based mode of dynamically generating material shader code clearly cannot be implemented with the scheme above.
To keep each material's shader code dynamically generated from the material editor, each material's PS must be executed at least once. But in screen space we only have Material ID information, so instead of running the corresponding material shader while drawing objects one by one as before, Nanite executes material shaders in screen space, decoupling the visibility computation from the material evaluation. This is the origin of the name Deferred Material. It creates a new performance problem, though: a scene can contain thousands of materials, and drawing each with a full-screen pass would cause enormous overdraw bandwidth. Reducing this meaningless overdraw became the new challenge.
For this reason, in the Base Pass drawing stage Nanite does not issue a full-screen pass per material; it divides the screen into 8×8 blocks. For example, at a screen size of 800×600, each material draw generates 100×75 blocks, each mapped to its position on screen. To be able to cull entire blocks, after Emit Targets Nanite launches a CS that gathers the Material IDs present in each block. Because the depth values corresponding to Material IDs are sorted in advance, this CS only records the minimum and maximum Material Depth in each 8×8 block as a Material ID Range, stored in an R32G32_UINT texture:
Material ID Range
With this texture, each material's VS samples the Material ID Range at its own block's position. If the current material's ID lies inside the Range, the material's PS runs as usual; otherwise no pixel in the block uses this material, and the whole block can be culled. To do so, the VS simply sets the output vertex position to NaN, and the GPU discards the corresponding triangles. Since a block usually contains only a few distinct materials, this method effectively reduces unnecessary overdraw.
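The per-block range test is conservative but cheap, as the following sketch shows. The buffer contents and sizes are invented for the demo; the point is the min/max gather and the interval test that decides whether a block survives:

```python
# Sketch of per-tile material culling: compute min/max Material ID per 8x8
# block, then let a material's block survive only if its ID is in the range.

TILE = 8

def material_range_per_tile(material_id_buffer, width, height):
    """Return {tile_coord: (min_id, max_id)} over 8x8 blocks."""
    ranges = {}
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            ids = [material_id_buffer[y][x]
                   for y in range(ty, min(ty + TILE, height))
                   for x in range(tx, min(tx + TILE, width))]
            ranges[(tx // TILE, ty // TILE)] = (min(ids), max(ids))
    return ranges

def tile_survives(ranges, tile, material_id):
    lo, hi = ranges[tile]
    return lo <= material_id <= hi  # False: cull the whole 8x8 block

# 16x8 screen: left tile uses materials {3, 5}, right tile only material 9.
buf = [[3 if x < 8 else 9 for x in range(16)] for _ in range(8)]
buf[0][0] = 5
ranges = material_range_per_tile(buf, 16, 8)
print(tile_survives(ranges, (0, 0), 4))  # True: 4 falls in [3, 5] (conservative)
print(tile_survives(ranges, (1, 0), 4))  # False: right tile is all material 9
```

Note the first case: material 4 is nowhere in the left tile, yet the block survives because the test only checks the range; the per-pixel Depth-Equal test described next catches these false positives.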
In fact, the idea of classifying screen blocks to reduce material branching and simplify rendering logic is not new. When "Uncharted 4" implemented its deferred lighting, the materials spanned multiple Shading Models. To avoid launching a separate full-screen CS per Shading Model, they likewise divided the screen into 16×16 blocks, recorded the range of Shading Models present in each block, and launched per-block CS work with the lighting shader specialized for that range, avoiding multiple full-screen passes or an Uber Shader full of branching logic and thereby greatly improving deferred-lighting performance.
Per-block statistics of the Shading Model range in Uncharted 4
After the block-by-block culling, the Material Depth Buffer comes into play. In the Base Pass PS phase, the Material Depth Buffer is bound as the Depth/Stencil Target with Depth/Stencil Test enabled and the compare function set to Equal. The PS executes only when the current pixel's Material ID equals the material being drawn (Depth Test passes) and the pixel belongs to a Nanite Mesh (Stencil Test passes). Thus, with hardware Early Z/Stencil, we get per-pixel Material ID culling for free. The principle of the whole drawing and culling scheme is shown in the figure below:
Red indicates the area that has been culled
The entire Base Pass is divided into two parts. First, the G-Buffer of non-Nanite meshes is drawn; this part still runs in Object Space, consistent with UE4's logic. Then the G-Buffer of Nanite meshes is drawn following the process above. The additional vertex attributes the material needs (UV, Normal, Vertex Color, etc.) are obtained by using the pixel's Cluster ID and Triangle ID to index the corresponding vertex positions, which are transformed to Clip Space; from the Clip Space vertex positions and the current pixel's depth, the pixel's barycentric coordinates and the Clip Space position gradients (DDX/DDY) are computed. Substituting the barycentric coordinates and gradients into the various vertex attributes yields all the interpolated attributes and their gradients (the gradients can then be used to compute the sampled Mip-Map level).
This concludes our analysis of Nanite's technical background and complete implementation logic.
[1] "A Macro View of Nanite"
[2] "A Brief Analysis of the UE5 Nanite Implementation" (UE5 Nanite实现浅析)
[3] "Vulkan API Overhead Test Added to 3DMark"
[4] "Mesh Shading: Towards Greater Efficiency of Geometry Processing"
[5] "A Trip Through the Graphics Pipeline"
[6] "Nanite | Inside Unreal"
[7] "Tile-Based Rendering"
[8] "The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading"
[9] "Triangle Visibility Buffer"
[10] "Bindless Texture"
[11] "Deferred Lighting in Uncharted 4"
Thanks to the author Luocheng for the contribution, and to UWA for the translation. You are welcome to forward and share this article; please do not reprint it without the author's authorization. If you have unique insights or discoveries, please contact us and discuss them together.