GPU Rasterizer Pattern
Visualizing the pattern in which the fragment processors are called for the individual fragments is a kind of ‘Hello World’ for atomic counters. The order in which the fragments generated by a draw call are created and the fragment shaders are called is not defined, so the hardware can choose whichever order best matches the architecture. In theory all fragments can be shaded in parallel; in practice it depends on the GPU’s ALU count and other factors. Before atomic operations, no fragment shader was allowed to modify shared resources or read the output of other fragment shaders in the same pass; this constraint was necessary to ensure the best parallelism.
With OpenGL 4.2, atomic operations were introduced that changed all that, but as the hardware is still tuned for massive parallelism, using these operations is quite expensive. Performance is not a problem, however, if you just want to visualize the pattern in which fragments are created and shaders are executed.
My test application works in two passes:
In the first pass each vertex shader and fragment shader invocation reads the same atomic counter and increments it. It’s a 32 bit integer, so each shader execution knows its ‘position’ in the pipeline. In the case of a vertex shader this number gets passed on to the fragment stage, so each fragment shader also knows the unique number of one of the involved vertices (as these numbers are integers I passed them on as flat, so only the number of the provoking vertex gets transmitted, but that inaccuracy is fine for this). Both numbers get written out into a two channel 32 bit integer render target. The color of the rendering gets written to a normal 8 bit per channel RGBA texture.
The second pass now gets called repeatedly with a ‘time’ uniform that goes from 0 to the value the counter had after pass one. The step width that gets added to ‘time’ each frame controls the speed of the animation (as does the framerate, but we can limit that to 60Hz). In the fragment shader, all fragments whose stored fragment shader counter is below ‘time’ get the color from the color texture. If only the stored vertex shader counter is below ‘time’ (but not the FS counter), the fragment gets colored grey. Otherwise it is discarded.
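The per-pixel decision of the second pass can be sketched on the CPU. This is only a Python illustration, not the shader code of the test application. It assumes that a shader invocation counts as ‘already happened’ once ‘time’ has passed its stored counter value; the names `classify` and `replay` are made up for this sketch.

```python
def classify(vs_count, fs_count, time):
    """Return what the replay pass would show for one stored pixel.

    vs_count / fs_count are the two values written in pass one
    (provoking-vertex counter and fragment counter). Assumption:
    an invocation is 'done' once time has passed its counter value.
    """
    if fs_count < time:
        return "color"    # fragment shader already ran: show the stored color
    if vs_count < time:
        return "grey"     # only the provoking vertex has been processed so far
    return "discard"      # nothing has happened for this pixel yet


def replay(pixels, total, step):
    """Yield one list of per-pixel states per frame while 'time' advances by 'step'."""
    for time in range(0, total + 1, step):
        yield [classify(vs, fs, time) for vs, fs in pixels]


# Two hypothetical pixels, given as (vertex counter, fragment counter) pairs.
frames = list(replay([(0, 3), (1, 8)], total=9, step=3))
```

A larger `step` yields fewer frames and therefore a faster animation, which matches the role of the step width described above.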
This way we get a ‘replay’ of the rasterization:
Note that the latest fragments are additionally highlighted in bright green. The Killeroo is not fully rendered yet (the geometry got transformed in batches); some of the grey areas are already filled with fragments.
Here is a close-up of the tail (it was the last part to get rasterized):
The pattern itself is best visible when texturing a quad consisting of just two triangles:
The smallest blocks consist of 4*8 fragments, which is not surprising as the NVidia GTX 580 that rendered this image has 32 ALUs (the marketing division of NVidia calls them CUDA cores) per shader processor (or Streaming Multiprocessor, if you like NVidia’s term). I can only assume that the same SM is responsible for a whole 16*16 fragment block; these blocks are clearly visible because each one gets finished quickly as soon as one of its 4*8 blocks is started. Once a block is finished, the next one that gets started right away is the one to the upper right of the last one.
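The block sizes can be checked with a bit of arithmetic. The Python sketch below only restates the numbers from the image; it is not a claim about how the hardware addresses fragments. It assumes the small blocks are 4 wide and 8 high (the orientation is a guess) and that the 16*16 blocks are aligned to the window origin; `locate` is a hypothetical helper.

```python
BLOCK_W, BLOCK_H = 4, 8    # smallest visible block; orientation is an assumption
TILE_W, TILE_H = 16, 16    # larger block that appears to be handled by one SM


def locate(x, y):
    """Map a fragment coordinate to (tile index, block index inside that tile)."""
    tile = (x // TILE_W, y // TILE_H)
    block = ((x % TILE_W) // BLOCK_W, (y % TILE_H) // BLOCK_H)
    return tile, block


# One small block holds as many fragments as the SM has ALUs.
fragments_per_block = BLOCK_W * BLOCK_H

# Number of small blocks that make up one 16*16 block.
blocks_per_tile = (TILE_W // BLOCK_W) * (TILE_H // BLOCK_H)
```

With these numbers a small block holds 32 fragments, one per ALU, and a 16*16 block is made up of 8 of them.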
Minor update: Similar experiments can also be found here: