Disclaimer: This article uses information from Diligent Graphics official GitHub repository.

Background

This article assumes a prior knowledge of 3D graphics programming APIs (Direct3D, OpenGL, Vulkan, etc.)

Introduction

Graphics Processing Units (GPUs) have come a long way from very specialized accelerators capable of performing very narrow set of operations to very generic high-performance computing devices. They still contain a set of fixed-function stages such as rasterizer or blending stage, but the general trend is that more and more of these stages are moving to general-purpose computation units.

One of the most recent steps in GPU evolution was the generalization of the vertex processing pipeline, by replacing it with two new fully programmable stages - mesh and amplification shaders.

Originally, hardware was only capable of performing a fixed set of operations on input vertices. An application was only able to set different transformation matrices (such as world, camera, projection, etc.) and instruct the hardware how to transform input vertices with these matrices. This was very limiting in what an application could do with vertices, so to generalize the stage, vertex shaders were introduced.

Vertex shaders were a huge improvement over fixed-function vertex transform stage because now developers were free to implement any vertex process algorithm. There was however a big limitation - a vertex shader takes exactly one vertex as input and produces exactly one vertex as output. Implementing more complex algorithms that would require processing entire primitives or generate them entirely on the GPU was not possible.

This is where geometry shaders were introduced, which was an optional stage after the vertex shaders. Geometry shader takes the whole primitive as an input and may output zero, one or more primitives. Developers were originally very enthusiastic about this stage and tried to used it for geometry processing (level-of-detail selection, tessellation, etc.), however it quickly become clear that the stage was largely impractical for any of these tasks due to very poor performance. As a matter of fact, the stage found very little adoption in the graphics community ever since it was introduced.

To give applications more flexibility and enable efficient geometry processing on the GPU, two new programmable and one fixed-function stage were later added: hull and domain shaders, and a tessellator stage capable of producing a predefined number of tessellations. The hull shader was working on surface patches (linear, triangular or square). Its responsibility was to process the patch as a whole by transforming its control points and instruct the following fixed-function tessellator how to subdivide the patch into basic primitives (triangles or lines). The domain shader was going after the tessellator and was similar to vertex shader in a way it was working on one vertex and producing one vertex in the tessellation generated by the tessellator.

During its evolution, the vertex processing pipeline absorbed the following stages:

Primitive assembly stage, which is where vertex data is fetched from one or multiple vertex buffers, assembled together and passed over to the vertex shader
Vertex shader, where vertices are typically transformed from object space to camera view space, but can also be animated or processed in any other way. Vertex shader takes exactly one vertex as input and produces exactly one vertex as output
Tessellation stage (optional), where an input primitive (quad or triangle patch) is broken up into smaller primitives using a set of predefined rules.
- Hull shader processes the patch control points and defines tessellation factors
- Fixed-function tessellator breaks up the patch domain into basic primitives
- Domain shader is called for every new vertex in tessellation generated by tessellator
Geometry shader stage (optional), where whole primitives (triangles, lines or points) are processed and another primitives are output.

It should be now clear that the vertex processing pipeline became a pile of multiple programmable and fixed-function stages that were designed at some time to solve some specific problem and required another generalization to make it more general purpose and efficient.

Mesh Shading Pipeline

This is where mesh shaders come into play. The mesh shading pipeline replaces all the stages above with just two new programmable shading stages, one of which is optional. Mesh shaders are exposed through Direct3D12 and Vulkan APIs and as of the time of this writing were supported by NVidia RTX GPUs.

The required stage is called mesh shader and comes after the optional amplification stage that we will talk about later. The mesh shader alone can mostly replace all stages in the old vertex processing pipeline. It can transform vertices, generate tessellation, process primitives as a whole. Mesh shader invocations run in compute groups very similar to compute shader invocations. The threads in the compute group can communicate and share the results between them. However, unlike the compute shader, the number of threads in the compute group is limited (e.g., no more than 128 threads in Direct3D12).

The first stage is optional and is called amplification (or task shader). It also runs in compute groups very much alike compute shaders. The goal of this stage is to produce the mesh shader invocations. The amplification shader can be used to perform such tasks as view-frustum culling, level-of-detail selection, etc.

In the rest of this article, we will take a look at mesh shaders in Diligent Engine, a modern cross-platform low-level graphics library and see how the new stages can be used to implement a basic view frustum culling and object LOD selection on the GPU. For an introduction to Diligent Engine, please refer to this article.

Amplification Shader

As we discussed above, amplification shader is the first new programmable stage and in our application will be doing view frustum culling and level-of-detail (LOD) selection.
The shader looks similar to a compute shader and executes 32 threads per group:

#define GROUP_SIZE 32

[numthreads(GROUP_SIZE,1,1)]
void main(in uint I  : SV_GroupIndex,
          in uint wg : SV_GroupID)
{

Since, unlike traditional vertex pipeline, mesh and amplification shaders don't have any built-in input attributes except for the thread ID and group ID, we will use structured buffer to pass draw command arguments down to them. A real application may use transformation, mesh ID, material ID and other attributes for each object. In our case, for simplicity, we will only provide the objeсt position in a 2D grid (BasePos), its scale (Scale), and time offset (TimeOffset) that is used by the shader for animation:

struct DrawTask
{
    float2 BasePos;
    float  Scale;
    float  TimeOffset;
};
StructuredBuffer<DrawTask> DrawTasks;

In the shader, we will need to use some global data: a view matrix and cotangent of the half of the field of view (FOV) of the camera to calculate the cube detail level (LOD); six frustum planes for frustum culling; elapsed time to animate cube positions. This information is stored in a regular constant buffer:

struct Constants
{
    float4x4 ViewMat;
    float4x4 ViewProjMat;
    float4   Frustum[6];
    float    CoTanHalfFov;
    float    ElapsedTime;
    uint     FrustumCulling;
    uint     Padding;
};
cbuffer cbConstants
{
    Constants g_Constants;
}

Another piece of information that the amplification shader will use is the object geometry, which is provided through another constant buffer. The shader only needs the radius of the circumscribed sphere, which it will use for frustum culling:

struct CubeData
{
    float4 SphereRadius;
    ...
};
cbuffer cbCubeData
{
    CubeData g_CubeData;
}

A read-write-buffer Statistics will be used to count the number of visible cubes after the frustum culling. The value will not be used in the shaders, but will be read back on the CPU to show the counter in the UI:

RWByteAddressBuffer Statistics;

The data that the amplification shader invocations will be working on (vertex positions, scale, LOD)
will be stored in a shared memory. Each thread in the group will work on its own element:

struct Payload
{
    float PosX[GROUP_SIZE];
    float PosY[GROUP_SIZE];
    float PosZ[GROUP_SIZE];
    float Scale[GROUP_SIZE];
    float LODs[GROUP_SIZE];
};
groupshared Payload s_Payload;

To get a unique index in each thread, we will use the s_TaskCount shared variable. At the start of the shader, we reset the counter to zero. We only write the value from the first thread in the group to avoid data races and then issue a barrier to make the value visible to other threads and make sure that all threads are at the same step.

groupshared uint s_TaskCount;

[numthreads(GROUP_SIZE,1,1)]
void main(in uint I : SV_GroupIndex,
          in uint wg : SV_GroupID)
{
    // Reset the counter from the first thread in the group
    if (I == 0)
    {
        s_TaskCount = 0;
    }

    // Flush the cache and synchronize
    GroupMemoryBarrierWithGroupSync();
    ...

The shader reads the thread-specific values and computes the position of the object using its draw task arguments. The gid is the global index of the draw task data:

    const uint gid   = wg * GROUP_SIZE + I;
    DrawTask   task  = DrawTasks[gid];
    float3     pos   = float3(task.BasePos, 0.0).xzy;
    float      scale = task.Scale;
    float      timeOffset  = task.TimeOffset;

    // Simple animation
    pos.y = sin(g_Constants.CurrTime + timeOffset);

It then performs frustum culling using the object position and if the object is visible, atomically increments the shared s_TaskCount value and computes the LOD.
The index variable returned by InterlockedAdd stores the index to access the arrays in the payload. The index is guaranteed to be unique for all threads, so that they will all be working on different array elements:

    if (g_Constants.FrustumCulling == 0 || IsVisible(pos, g_CubeData.SphereRadius.x * scale))
    {
        uint index = 0;
        InterlockedAdd(s_TaskCount, 1, index);
        
        s_Payload.PosX[index]  = pos.x;
        s_Payload.PosY[index]  = pos.y;
        s_Payload.PosZ[index]  = pos.z;
        s_Payload.Scale[index] = scale;
        s_Payload.LODs[index]  = CalcDetailLevel(pos, g_CubeData.SphereRadius.x * scale);
    }

IsVisible() function calculates the signed distance from each frustum plane to the sphere and computes
its visibility by comparing the distances to the sphere radius. The LOD calculation (CalcDetailLevel function) is based on computing the object bounding sphere radius in screen space. For detailed description of the algorithm, see the link in References section.

After the payload has been written, we need to issue another barrier to wait until all threads reach the same point.
After that we can safely read the s_TaskCount value. The first thread in the group atomically adds this value to the global Statistics counter. Note that this is much faster than incrementing the counter from each thread because it minimizes the access to global memory.

The final step of the amplification shader is calling the DispatchMesh() function with the number of groups and the payload that will spawn an s_TaskCount mesh shader invocations.
The DispatchMesh() function must be called exactly once per amplification shader and ends the amplification shader group's execution.

    GroupMemoryBarrierWithGroupSync();

    if (I == 0)
    {
        // Update statistics from the first thread
        uint orig_value;
        Statistics.InterlockedAdd(0, s_TaskCount, orig_value);
    }
    
    DispatchMesh(s_TaskCount, 1, 1, s_Payload);

Mesh Shader

The purpose of the mesh shader in this example is to compute the vertex positions, very much like vertex shader in a traditional pipeline, and also to output primitives. Unlike vertex shader though, mesh shader invocations run in compute groups very much like compute shaders and can share the data between threads.

We will use 24 threads out of the 32 maximum threads available. We will produce 24 vertices and 12 primitives with 36 indices. SV_GroupIndex indicates the mesh shader invocation index (0 to 23 in our case). SV_GroupID indicates the amplification shader output (0 to `s_TaskCount-1`).

[numthreads(24,1,1)]
[outputtopology("triangle")]
void main(in uint I   : SV_GroupIndex,
          in uint gid : SV_GroupID,
          in  payload  Payload  payload,
          out indices  uint3    tris[12],
          out vertices PSInput  verts[24])
{
    // Only the input values from the first active thread are used.
    SetMeshOutputCounts(24, 12);

We read the amplification shader output using the group id (gid):

float3 pos;
float  scale = payload.Scale[gid];
float  LOD   = payload.LODs[gid];
pos.x = payload.PosX[gid];
pos.y = payload.PosY[gid];
pos.z = payload.PosZ[gid];

The mesh shader uses the same cube constant buffer as the amplification shader, but it also uses the cube vertex attributes and indices. Each mesh shader thread works on only one output vertex identified by the group index I.
Much like regular vertex shader, it transforms the vertex using the view-projection matrix:

verts[I].Pos = mul(float4(pos + g_CubeData.Positions[I].xyz * scale, 1.0), 
               g_CubeData.ViewProjMat);
verts[I].UV  = g_CubeData.UVs[I].xy;

In our example, LOD doesn't affect the vertex count, and we simply display it as color. A real amplification would want to adjust the tessellation based on the LOD value:

float4 Rainbow(float factor)
{
    float  h   = factor / 1.35;
    float3 col = float3(abs(h * 6.0 - 3.0) - 1.0, 2.0 - abs(h * 6.0 - 2.0), 
                 2.0 - abs(h * 6.0 - 4.0));
    return float4(clamp(col, float3(0.0, 0.0, 0.0), float3(1.0, 1.0, 1.0)), 1.0);
}
...

verts[I].Color = Rainbow(LOD);

Finally, we output primitives (the 6 cube faces consisting of 12 triangles total). Only the first 12 threads write the indices. Note that we must not access the array outside of its bounds.

if (I < 12)
{
    tris[I] = g_CubeData.Indices[I].xyz;
}

Preparing the Cube Data

To make cube geometry available to the shaders, we will put it into a constant buffer. A real application may need to use a structured or an unordered access buffer if the size of the data is large. Note that all elements in the array in the constant buffer must be 16-byte aligned.

struct CubeData
{
    float4 sphereRadius;
    float4 pos[24];
    float4 uv[24];
    uint4  indices[36 / 3];
};

const float4 CubePos[] =
{
    float4(-1,-1,-1,0), float4(-1,+1,-1,0), float4(+1,+1,-1,0), float4(+1,-1,-1,0),
    ...
};

const float4 CubeUV[] = 
{
    float4(0,1,0,0), float4(0,0,0,0), float4(1,0,0,0), float4(1,1,0,0),
    ...
};

const uint4 Indices[] =
{
    uint4{2,0,1,0},    uint4{2,3,0,0},
    ...
};

CubeData Data;
Data.sphereRadius = float4{length(CubePos[0] - CubePos[1]) * std::sqrt(3.0f) * 0.5f, 0, 0, 0};
std::memcpy(Data.pos, CubePos, sizeof(CubePos));
std::memcpy(Data.uv, CubeUV, sizeof(CubeUV));
std::memcpy(Data.indices, Indices, sizeof(Indices));

BufferDesc BuffDesc;
BuffDesc.Name          = "Cube vertex & index buffer";
BuffDesc.Usage         = USAGE_STATIC;
BuffDesc.BindFlags     = BIND_UNIFORM_BUFFER;
BuffDesc.uiSizeInBytes = sizeof(Data);

BufferData BufData;
BufData.pData    = &Data;
BufData.DataSize = sizeof(Data);

m_pDevice->CreateBuffer(BuffDesc, &BufData, &m_CubeBuffer);

Initializing the Pipeline State

The initialization of amplification and mesh shaders is largely identical to initialization of shaders of other types (see this article). The only difference is the new shader types (SHADER_TYPE_AMPLIFICATION and SHADER_TYPE_MESH). Similar is true for pipeline state initialization. Notice also the new PIPELINE_TYPE_MESH pipeline type. Some fields of the GraphicsPipeline struct like LayoutElements and PrimitiveTopology are irrelevant for mesh shaders and are ignored. The mesh shader pipeline state uses the same pixel shader stage as the traditional vertex pipeline.

PSODesc.PipelineType = PIPELINE_TYPE_MESH;

RefCntAutoPtr<IShader> pAS;
{
    ShaderCI.Desc.ShaderType = SHADER_TYPE_AMPLIFICATION;
    ShaderCI.EntryPoint      = "main";
    ShaderCI.Desc.Name       = "Mesh shader - AS";
    ShaderCI.FilePath        = "cube.ash";
    m_pDevice->CreateShader(ShaderCI, &pAS);
}

RefCntAutoPtr<IShader> pMS;
{
    ShaderCI.Desc.ShaderType = SHADER_TYPE_MESH;
    ShaderCI.EntryPoint      = "main";
    ShaderCI.Desc.Name       = "Mesh shader - MS";
    ShaderCI.FilePath        = "cube.msh";
    m_pDevice->CreateShader(ShaderCI, &pMS);
}

RefCntAutoPtr<IShader> pPS;
{
    ShaderCI.Desc.ShaderType = SHADER_TYPE_PIXEL;
    ShaderCI.EntryPoint      = "main";
    ShaderCI.Desc.Name       = "Mesh shader - PS";
    ShaderCI.FilePath        = "cube.psh";
    m_pDevice->CreateShader(ShaderCI, &pPS);
}
    
...

PSODesc.GraphicsPipeline.pAS = pAS;
PSODesc.GraphicsPipeline.pMS = pMS;
PSODesc.GraphicsPipeline.pPS = pPS;
m_pDevice->CreatePipelineState(PSOCreateInfo, &m_pPSO);

Draw Task Data Initialization

As we discussed earlier, amplification shaders read their arguments from the structured buffer. We initialize the draw tasks data by distributing the cubes on a 2D grid and generating random scales and animation time offsets:

struct DrawTask
{
    float2 BasePos;
    float  Scale;
    float  TimeOffset;
};

const int2          GridDim{128, 128};
FastRandReal<float> Rnd{0, 0.f, 1.f};

std::vector<DrawTask> DrawTasks;
DrawTasks.resize(GridDim.x * GridDim.y);

for (int y = 0; y < GridDim.y; ++y)
{
    for (int x = 0; x < GridDim.x; ++x)
    {
        int   idx = x + y * GridDim.x;
        auto& dst = DrawTasks[idx];

        dst.BasePos.x  = (x - GridDim.x / 2) * 4.f + (Rnd() * 2.f - 1.f);
        dst.BasePos.y  = (y - GridDim.y / 2) * 4.f + (Rnd() * 2.f - 1.f);
        dst.Scale      = Rnd() * 0.5f + 0.5f; // 0.5 .. 1
        dst.TimeOffset = Rnd() * PI_F;
    }
}

We use the structured buffer because the data size may be larger than supported by a constant buffer:

BufferDesc BuffDesc;
BuffDesc.Name          = "Draw tasks buffer";
BuffDesc.Usage         = USAGE_DEFAULT;
BuffDesc.BindFlags     = BIND_SHADER_RESOURCE;
BuffDesc.Mode          = BUFFER_MODE_STRUCTURED;
BuffDesc.uiSizeInBytes = sizeof(DrawTasks[0]) * static_cast<Uint32>(DrawTasks.size());

BufferData BufData;
BufData.pData    = DrawTasks.data();
BufData.DataSize = BuffDesc.uiSizeInBytes;

m_pDevice->CreateBuffer(BuffDesc, &BufData, &m_pDrawTasks);

Rendering

To issue the draw command, we first prepare the data that will be required by the mesh shader: we calculate the field of view (FOV) of the camera and cotangent of the half FOV, as these values are used to build the projection matrix and to calculate LODs in the shader.

const float m_FOV            = PI_F / 4.0f;
const float m_CoTanHalfFov   = 1.0f / std::tan(m_FOV * 0.5f);

and upload the values into the constant buffer:

MapHelper<Constants> CBConstants(m_pImmediateContext, m_pConstants, MAP_WRITE, MAP_FLAG_DISCARD);
CBConstants->ViewMat        = m_ViewMatrix.Transpose();
CBConstants->ViewProjMat    = m_ViewProjMatrix.Transpose();
CBConstants->CoTanHalfFov   = m_LodScale * m_CoTanHalfFov;
CBConstants->FrustumCulling = m_FrustumCulling;
CBConstants->ElapsedTime    = m_ElapsedTime;
CBConstants->Animate        = m_Animate;

We use ExtractViewFrustumPlanesFromMatrix() function to calculate the view frustum planes from the view-projection matrix. Notice that the planes are not normalized and we need to normalize them to use for sphere culling. The resulting plane attributes are written into the constant buffer as well:

ViewFrustum Frustum;
ExtractViewFrustumPlanesFromMatrix(m_ViewProjMatrix, Frustum, false);

for (uint i = 0; i < _countof(CBConstants->Frustum); ++i)
{
    Plane3D plane  = Frustum.GetPlane(ViewFrustum::PLANE_IDX(i));
    float   invlen = 1.0f / length(plane.Normal);
    plane.Normal *= invlen;
    plane.Distance *= invlen;

    CBConstants->Frustum[i] = plane;
}

We are now finally ready to launch the amplification shader, which is very similar to dispatching compute groups.
The shader runs 32 threads per group, so we need to dispatch just m_DrawTaskCount / ASGroupSize groups
(in this example, the task count is always a multiple of 32):

DrawMeshAttribs drawAttrs{m_DrawTaskCount / ASGroupSize, DRAW_FLAG_VERIFY_ALL};
m_pImmediateContext->DrawMesh(drawAttrs);

And that's it! The full sample source code is available on GitHub.

Credits

Thanks to Andrey Zhirnov for implementing mesh shaders in Diligent Engine and working on the initial version of this article.

References

History

23^rd September, 2020 - Initial version