Analysis of UE5 Rendering Technology: Nanite

Discussion in 'Game Development (Technical)' started by UWA4d, Feb 22, 2022.

    1. Introduction
    After Epic released the UE5 technology demo in early 2021, discussion around UE5 has never stopped. The related technical discussion has mainly centered on two new features: the global illumination technology Lumen, and Nanite, the technology behind extremely high model detail. There are already articles [1][2] that analyze Nanite in more detail. Starting from a RenderDoc capture and the UE5 source code, combined with some existing technical material, this article aims to give an intuitive, overview-level understanding of Nanite and to clarify its algorithmic principles and design ideas, without going into too many source-level implementation details.

    2. What do we need for next-generation model rendering?
    To analyze the technical points of Nanite, we must first start from the technical requirements. Over the past ten years, AAA game development has gradually converged on two main directions: interactive cinematic narrative and open worlds. To present cutscenes realistically, character models need to be exquisite; for a sufficiently flexible and rich open world, map size and object counts have grown exponentially. Both trends greatly raise the requirements on the fineness and complexity of the scene: the number of objects must be large, and each model must be sufficiently detailed.

    There are usually two bottlenecks in rendering complex scenes:
    1. The CPU-side validation and the CPU-GPU communication overhead incurred by each Draw Call;
    2. Overdraw caused by inaccurate culling, and the resulting waste of GPU computing resources.
    In recent years, optimization of rendering technology has often revolved around these two problems, and a degree of technical consensus has formed in the industry.

    To address the overhead of CPU-side validation and state switching, we now have a new generation of graphics APIs (Vulkan, DX12, and Metal). They are designed so that the driver does less validation work on the CPU; different kinds of work are dispatched to the GPU through different queues (Compute/Graphics/DMA Queue); developers are required to handle CPU-GPU synchronization themselves; and multi-core, multi-threaded CPUs can be fully used to submit commands to the GPU. Thanks to these optimizations, the number of Draw Calls that the new generation of graphics APIs can handle has increased by an order of magnitude compared with the previous generation (DX11, OpenGL) [3].
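    To make the multi-threaded submission model concrete, here is a minimal Vulkan-flavored C++ sketch; the per-thread work split and the RecordSceneChunk callback are assumptions of the sketch, not part of any particular engine:

```cpp
// Sketch: each worker thread records draws into its own command buffer
// (each buffer must come from a command pool owned by that thread), and the
// main thread submits all recorded buffers to the graphics queue in one call.
#include <vulkan/vulkan.h>
#include <cstdint>
#include <thread>
#include <vector>

void SubmitSceneMultithreaded(VkQueue graphicsQueue,
                              std::vector<VkCommandBuffer>& perThreadCmds,
                              void (*RecordSceneChunk)(VkCommandBuffer, uint32_t)) {
    std::vector<std::thread> workers;
    for (uint32_t t = 0; t < perThreadCmds.size(); ++t) {
        workers.emplace_back([&, t] {
            VkCommandBufferBeginInfo begin{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
            vkBeginCommandBuffer(perThreadCmds[t], &begin);
            RecordSceneChunk(perThreadCmds[t], t);   // record this thread's slice of the scene
            vkEndCommandBuffer(perThreadCmds[t]);
        });
    }
    for (auto& w : workers) w.join();

    // One submission containing all command buffers; fences/semaphores (CPU-GPU
    // synchronization) are the application's responsibility under these APIs.
    VkSubmitInfo submit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
    submit.commandBufferCount = static_cast<uint32_t>(perThreadCmds.size());
    submit.pCommandBuffers    = perThreadCmds.data();
    vkQueueSubmit(graphicsQueue, 1, &submit, VK_NULL_HANDLE);
}
```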

    Another optimization direction is to reduce data traffic between the CPU and GPU, and to cull triangles that do not contribute to the final image more accurately. The GPU Driven Pipeline was born from this idea. For more information on GPU Driven Pipelines and culling, see the author's earlier article [4].

    As GPU Driven Pipelines have been applied more and more widely in games, splitting a model's vertex data into finer-grained Clusters (also called Meshlets), so that the granularity of each Cluster better fits the cache size of the vertex processing stage and lends itself to various kinds of culling (Frustum Culling, Occlusion Culling, and Backface Culling), has gradually become the best practice for optimizing complex scenes, and GPU vendors have gradually embraced this new vertex processing flow.
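    As a concrete illustration, here is a minimal sketch of a per-cluster record and a frustum test against its bounding sphere; the field names and the 128-triangle budget are illustrative assumptions, not UE5's actual data layout:

```cpp
// Sketch of per-cluster data plus a frustum test against its bounding sphere.
// Field layout and the 128-triangle budget are illustrative, not engine-accurate.
#include <array>
#include <cstdint>

struct float3 { float x, y, z; };
struct Plane  { float3 n; float d; };             // dot(n, p) + d >= 0 means "inside"

struct Cluster {
    float3   boundsCenter;                        // bounding sphere for frustum/occlusion tests
    float    boundsRadius;
    float3   coneAxis;                            // normal cone for cluster backface culling
    float    coneCutoff;
    uint32_t firstIndex;                          // offset into the shared index buffer
    uint32_t triangleCount;                       // e.g. at most 128 triangles per cluster
};

// True if the cluster's bounding sphere is at least partially inside the frustum.
bool ClusterInFrustum(const Cluster& c, const std::array<Plane, 6>& frustum) {
    for (const Plane& p : frustum) {
        float dist = p.n.x * c.boundsCenter.x + p.n.y * c.boundsCenter.y +
                     p.n.z * c.boundsCenter.z + p.d;
        if (dist < -c.boundsRadius)               // fully behind one plane -> culled
            return false;
    }
    return true;
}
```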

    However, the traditional GPU Driven Pipeline relies on Compute Shaders for culling. The data that survives culling has to be stored in GPU buffers and then fed back to the GPU's graphics pipeline through APIs such as ExecuteIndirect, which adds hidden read/write overhead. In addition, the vertex data is read repeatedly (once by the Compute Shader before culling, and again by the graphics pipeline through Vertex Attribute Fetch when drawing).
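    Roughly, the hand-off looks like the following sketch (reusing the Cluster and frustum helpers from the previous sketch): the culling pass appends one argument record per surviving cluster into a buffer, and the graphics pipeline later consumes that buffer via ExecuteIndirect. The argument layout mirrors D3D12's indexed-draw indirect arguments; the function and buffer names are assumptions of the sketch.

```cpp
// Sketch: what a CS-culling pass conceptually appends for each surviving cluster.
// On the GPU this buffer lives in VRAM and is consumed by ExecuteIndirect without
// a CPU round trip -- but it still costs an extra buffer write and a later read.
#include <array>
#include <cstdint>
#include <vector>

struct DrawIndexedArgs {                          // mirrors D3D12_DRAW_INDEXED_ARGUMENTS
    uint32_t indexCountPerInstance;
    uint32_t instanceCount;
    uint32_t startIndexLocation;
    int32_t  baseVertexLocation;
    uint32_t startInstanceLocation;
};

void EmitSurvivingClusters(const std::vector<Cluster>& clusters,
                           const std::array<Plane, 6>& frustum,
                           std::vector<DrawIndexedArgs>& indirectArgs) {
    for (const Cluster& c : clusters) {
        if (!ClusterInFrustum(c, frustum)) continue;          // culled clusters emit nothing
        indirectArgs.push_back({ c.triangleCount * 3, 1, c.firstIndex, 0, 0 });
    }
}
```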

    For the above reasons, and in order to further improve the flexibility of vertex processing, NVIDIA first introduced the concept of the Mesh Shader [5], with the goal of gradually removing some fixed-function units from the traditional vertex processing stage (VAF, PD, and other hardware units) and handing that work over to developers through a programmable pipeline (Task Shader/Mesh Shader).
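    The division of labor can be sketched conceptually as below; real task/mesh shaders are written in HLSL/GLSL and run on the GPU, so this plain C++ (again reusing the earlier Cluster helpers) is only a mental model with illustrative names:

```cpp
// Conceptual model of the Task/Mesh Shader split, written as plain C++ for clarity.
#include <array>
#include <cstdint>
#include <vector>

// "Task shader" stage: decide per cluster whether any mesh-shader work is launched.
std::vector<uint32_t> TaskStage(const std::vector<Cluster>& clusters,
                                const std::array<Plane, 6>& frustum) {
    std::vector<uint32_t> visible;
    for (uint32_t i = 0; i < clusters.size(); ++i)
        if (ClusterInFrustum(clusters[i], frustum))
            visible.push_back(i);                 // only surviving clusters reach the mesh stage
    return visible;
}

// "Mesh shader" stage: for one surviving cluster, fetch and transform its vertices
// and emit triangles directly to the rasterizer -- no intermediate VRAM round trip.
void MeshStage(const Cluster& cluster) {
    // transform cluster.triangleCount * 3 indices' worth of vertices, emit primitives
    (void)cluster;
}
```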

    [IMG]

    Schematic of Cluster



    [IMG]

    In the traditional GPU Driven Pipeline, culling relies on CS, and the culled data is passed back to the vertex processing pipeline through VRAM



    [IMG]

    In the Mesh Shader based pipeline, Cluster culling becomes part of the vertex processing stage, reducing unnecessary Vertex Buffer Loads/Stores

    3. Are these enough?
    So far, the problems of object count and of vertex and triangle count have been greatly optimized and improved. However, high-precision models and pixel-sized small triangles put new pressure on the rendering pipeline: rasterization and overdraw pressure.

    Does software rasterization have a chance to beat hardware rasterization?
    To answer this question, you first need to understand what hardware rasterization does and what kind of input it was designed for. Read [6] if you are interested in the details. Put simply, traditional rasterization hardware was designed under the assumption that input triangles are much larger than a pixel, and based on this assumption the hardware rasterization process is usually hierarchical.

    Taking NVIDIA's rasterizer as an example, a triangle usually goes through two rasterization stages: Coarse Raster and Fine Raster. The former takes a triangle as input and rasterizes it into a number of 8×8-pixel blocks (you can also think of it as coarse rasterization against a framebuffer at 1/8 × 1/8 of the original resolution).
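    A minimal sketch of what the coarse stage decides, using only the triangle's bounding box (real hardware also tests block corners against the triangle's edges); the function and type names are assumptions of the sketch:

```cpp
// Sketch: enumerate the 8x8 screen blocks a triangle's bounding box overlaps.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Tri2D { float x[3], y[3]; };               // screen-space vertex positions

std::vector<std::pair<int, int>> CoarseRasterBlocks(const Tri2D& t, int fbWidth, int fbHeight) {
    float minX = std::min({t.x[0], t.x[1], t.x[2]});
    float maxX = std::max({t.x[0], t.x[1], t.x[2]});
    float minY = std::min({t.y[0], t.y[1], t.y[2]});
    float maxY = std::max({t.y[0], t.y[1], t.y[2]});

    int bx0 = std::max(0, static_cast<int>(minX) / 8);
    int bx1 = std::min((fbWidth  - 1) / 8, static_cast<int>(maxX) / 8);
    int by0 = std::max(0, static_cast<int>(minY) / 8);
    int by1 = std::min((fbHeight - 1) / 8, static_cast<int>(maxY) / 8);

    std::vector<std::pair<int, int>> blocks;
    for (int by = by0; by <= by1; ++by)
        for (int bx = bx0; bx <= bx1; ++bx)
            blocks.emplace_back(bx, by);          // a pixel-sized triangle yields a single block
    return blocks;
}
```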

    At this stage, blocks that are occluded are culled wholesale against a low-resolution Z-Buffer; on NVIDIA hardware this is called Z Cull. Blocks that survive Z Cull are sent to the next stage, Fine Raster, which finally generates the pixels used for shading calculations. The Fine Raster stage is where the familiar Early Z takes place. Because Mip-Map sampling needs information from each pixel's neighbors, using the difference of the sampled UVs as the basis for choosing the Mip-Map level, the final output of Fine Raster is not a single pixel but a small 2×2 Pixel Quad.
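    To illustrate why the quad exists, here is a minimal sketch of choosing a mip level from the UV differences inside one 2×2 quad, using the standard log2-of-the-largest-derivative rule; the quad layout and the square-texture assumption are part of the sketch:

```cpp
// Sketch: derive a mip level from per-quad UV derivatives, i.e. what the hardware's
// ddx/ddy across a 2x2 pixel quad provides. The texture is assumed to be square.
#include <algorithm>
#include <cmath>

struct UV { float u, v; };

// Quad layout assumed: [0]=top-left, [1]=top-right, [2]=bottom-left, [3]=bottom-right.
float MipLevelFromQuad(const UV quad[4], float textureSize) {
    float dudx = (quad[1].u - quad[0].u) * textureSize;    // horizontal UV change, in texels
    float dvdx = (quad[1].v - quad[0].v) * textureSize;
    float dudy = (quad[2].u - quad[0].u) * textureSize;    // vertical UV change, in texels
    float dvdy = (quad[2].v - quad[0].v) * textureSize;

    float lenX = std::sqrt(dudx * dudx + dvdx * dvdx);
    float lenY = std::sqrt(dudy * dudy + dvdy * dvdy);
    float rho  = std::max(lenX, lenY);                     // texels stepped per pixel
    return std::max(0.0f, std::log2(rho));                 // clamp to level 0 when magnifying
}
```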

    For triangles close to pixel size, the waste in hardware rasterization is obvious. First, the Coarse Raster stage is almost useless, because such triangles are usually smaller than 8×8 pixels. The situation is even worse for long, thin triangles: a single triangle often spans multiple blocks, cannot be culled by Coarse Raster, and adds extra computation. In addition, for a large triangle, the quad-based Fine Raster stage only generates a small number of useless pixels along the triangle's edges, a small fraction of the triangle's area; but for a small triangle, the Pixel Quads can in the worst case generate pixels covering four times the triangle's area, all of which still go through the Pixel Shader, greatly reducing the proportion of effective pixels in a warp.
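    The waste can be quantified with a tiny, hypothetical helper: count the 2×2 quads a set of covered pixels touches and compare shaded lanes against pixels actually covered; a one-pixel triangle touches one quad, so only 1 of 4 shaded lanes is useful:

```cpp
// Sketch: quad-shading efficiency = covered pixels / (quads touched * 4 lanes).
#include <set>
#include <utility>

float QuadShadingEfficiency(const std::set<std::pair<int, int>>& coveredPixels) {
    std::set<std::pair<int, int>> quads;
    for (const auto& p : coveredPixels)
        quads.insert({p.first / 2, p.second / 2});         // the 2x2 quad owning this pixel
    if (quads.empty()) return 1.0f;
    return static_cast<float>(coveredPixels.size()) / (quads.size() * 4.0f);
}
```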

    [IMG]

    Rasterization waste of small triangles due to Pixel Quad

    For the above reasons, software rasterization (based on Compute Shaders) does have a chance to beat hardware rasterization under the specific premise of pixel-sized small triangles. This is one of Nanite's core optimizations, and it improves UE5's rasterization efficiency for small triangles by a factor of three [7].
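    To give a flavor of what a compute-based rasterizer does per small triangle, here is a minimal CPU-side sketch using edge functions and a packed depth-plus-ID visibility value. It is in the spirit of this approach, not Nanite's actual code; the 64-bit packing, the reverse-Z convention, and all names are assumptions of the sketch:

```cpp
// Sketch: edge-function rasterization of one small screen-space triangle, keeping
// the nearest triangle per pixel in a visibility buffer. On the GPU the final write
// would be an atomic (e.g. a 64-bit InterlockedMax); here a plain max stands in.
#include <algorithm>
#include <cstdint>
#include <vector>

struct ScreenVert { float x, y, z; };             // z in [0,1], larger = closer (reverse-Z style)

inline float EdgeFn(const ScreenVert& a, const ScreenVert& b, float px, float py) {
    return (px - a.x) * (b.y - a.y) - (py - a.y) * (b.x - a.x);
}

void RasterizeSmallTri(const ScreenVert v[3], uint32_t triangleID,
                       std::vector<uint64_t>& visBuffer, int width, int height) {
    int minX = std::max(0, static_cast<int>(std::min({v[0].x, v[1].x, v[2].x})));
    int maxX = std::min(width  - 1, static_cast<int>(std::max({v[0].x, v[1].x, v[2].x})));
    int minY = std::max(0, static_cast<int>(std::min({v[0].y, v[1].y, v[2].y})));
    int maxY = std::min(height - 1, static_cast<int>(std::max({v[0].y, v[1].y, v[2].y})));

    float area = EdgeFn(v[0], v[1], v[2].x, v[2].y);
    if (area <= 0.0f) return;                     // backfacing or degenerate (winding-dependent)

    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            float px = x + 0.5f, py = y + 0.5f;   // pixel center
            float w0 = EdgeFn(v[1], v[2], px, py);
            float w1 = EdgeFn(v[2], v[0], px, py);
            float w2 = EdgeFn(v[0], v[1], px, py);
            if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f) continue;   // outside the triangle
            float z = (w0 * v[0].z + w1 * v[1].z + w2 * v[2].z) / area;
            // Depth in the high bits, triangle ID in the low bits: a single max keeps
            // the closest triangle's ID per pixel.
            uint64_t packed = (static_cast<uint64_t>(z * 0x7FFFFFFF) << 32) | triangleID;
            uint64_t& dst = visBuffer[static_cast<size_t>(y) * width + x];
            dst = std::max(dst, packed);
        }
    }
}
```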

    For the remaining details, I recommend reading the full article at https://blog.en.uwa4d.com/2022/02/15/analysis-of-ue5-rendering-technology-nanite/.
     
