from geometry to pixels

Understanding the parallelism of GPUs

A lot of tasks in (3D) graphics are independent of each other, so the idea of parallelizing those tasks is not new. But while parallel processors are nowadays common in every desktop PC and even in newer smartphones, the way GPUs parallelize their work is quite different. Understanding how a GPU works helps in understanding its performance bottlenecks and is key to designing algorithms that fit the GPU architecture.

I will mostly focus on how the programmable parts of the GPUs are designed and less on the remaining fixed-function parts. So let’s take a look at what is needed to build a GPU for current graphics APIs:

We need a processor that can run arbitrary code: our shaders or compute kernels. We will focus on floating-point performance as these operations are needed often in graphics. Hardware support for more exotic functions such as tan, sin, pow, sqrt etc. is also needed. We will end up with some logic to decode the instruction to execute, some registers, a floating-point ALU (working with ints is also needed) and a cache. If we copied the design of a multi-core CPU, we would copy & paste the whole thing until the die space is filled up. OK, CPUs are not that simple, but let’s stick with this simplification for a moment.

In reality, modern CPU cores are quite complex: they speed up program execution by reordering independent operations on the fly (out-of-order execution), they try to guess the outcome of a branch before it has been evaluated to keep the deep pipelines filled (branch prediction), and they implement parallel concepts within a single core: SIMD and simultaneous multithreading. A GPU core is simpler when it comes to execution logic: it executes in order and tries less to be clever. While this means that a CPU can handle a single thread better, the GPU design saves a lot of space on the die, which can be used to add more cores, from which parallel applications benefit more. But GPUs also implement the last two tricks (SIMD and simultaneous multithreading), and in even more extreme ways.

Why do it once if you can do it twice?

Let’s look at them individually: SIMD stands for single instruction, multiple data. One instruction is performed on a vector of operands at a time, which is why we talk about vector processors. This idea is not new: there were vector supercomputers in the 1970s (the CDC STAR-100 or the Texas Instruments ASC). The trick is that an instruction only has to be decoded once, and multiple ALUs can then perform the work on multiple data elements at once. This design was used for graphics in 1990 with the Intel i750; this chip could store two 8-bit values in one 16-bit register and perform the same operation on both in parallel.

Around the same time, in 1989, Intel also introduced a CPU with a new instruction set: the Intel i860, with a VLIW architecture, which could also run SIMD-like instructions. This chip actually ended up on a ‘graphics card’, the RealityEngine, in 1993 as what we would now call a vertex shader.

Even though these chips weren’t very successful, the technique was added in 1996 to the Pentium MMX, which brought vector integer operations on 64-bit wide registers (each could hold e.g. eight 8-bit values or four 16-bit values). AMD introduced 3DNow in 1998, adding floating-point operations in a similar way. As graphics cards back then were mostly rasterizers and the geometric transformations were still performed by the driver on the CPU, these vector float operations had the potential to speed up 3D graphics. Intel didn’t adopt 3DNow but instead extended its own set of vector operations with float operations, gave it larger (128-bit) registers and named this SSE in 1999, first implemented on the Pentium 3. The latest iteration on Ivy Bridge has 256-bit wide registers which can operate on 8 floats or 4 doubles in parallel (it has also been renamed to AVX).

When we have a loop doing the same stuff with a lot of data points and the loop iterations are independent of each other, SIMD can help. Think about blending two images together: each new pixel is the weighted sum of the two input pixels from the two input images. The operations for all pixels are exactly the same, just the data is different. If, however, there is a branch inside this loop, things get tricky. If all simultaneously evaluated data points take the same branch, everything is fine. But if even one point branches differently, we have to evaluate the two paths sequentially. This can be done by evaluating both branches and masking out the data points that took the other branch. Of course, the wider our SIMD registers, the higher the performance (given that the CPU can get the data from memory fast enough), but also the higher the risk that a branch will screw up the parallelism.
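To make the masking idea concrete, here is a minimal sketch in plain Python that imitates a 4-wide SIMD unit with lists. The kernel (`x * 2` if `x > 0`, else `-x`) and all function names are made up for illustration; real hardware does this with mask registers, not Python lists.

```python
def simd_select(mask, if_true, if_false):
    """Per-lane select: take if_true where the mask is set, else if_false."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def branchy_kernel_simd(xs):
    """Vector version of the scalar code: y = x * 2 if x > 0 else -x.

    Returns the per-lane results and the number of branch paths that
    had to be executed for the whole vector.
    """
    mask = [x > 0 for x in xs]
    passes = 0
    then_vals = [0.0] * len(xs)
    else_vals = [0.0] * len(xs)
    if any(mask):        # at least one lane takes the 'then' path
        then_vals = [x * 2.0 for x in xs]   # computed for ALL lanes
        passes += 1
    if not all(mask):    # at least one lane takes the 'else' path
        else_vals = [-x for x in xs]        # computed for ALL lanes
        passes += 1
    # Mask out the results of the path each lane did not take.
    return simd_select(mask, then_vals, else_vals), passes


uniform, uniform_passes = branchy_kernel_simd([1.0, 2.0, 3.0, 4.0])
divergent, divergent_passes = branchy_kernel_simd([1.0, -2.0, 3.0, 4.0])
print(uniform, uniform_passes)      # [2.0, 4.0, 6.0, 8.0] 1
print(divergent, divergent_passes)  # [2.0, 2.0, 6.0, 8.0] 2
```

A single lane that branches differently is enough to double the number of passes for the whole vector, which is exactly the risk that grows with the SIMD width.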

Transforming vertices and shading pixels are tasks that do (mostly) the same stuff on different data points, so they can be accelerated by SIMD. When you write a shader or a compute kernel, you write the inner part of a loop (over all elements, fragments, vertices…). The compiler can now just merge N shader calls into one stream of SIMD instructions. On a current Intel CPU with AVX this would mean shading 8 vertices in one go; on a current NVidia GPU it means shading one ‘warp’ of 32 vertices (AMD calls its equivalent a ‘wavefront’). Yup, the SIMD width is 32 floats or 1024 bits. This means that shading 32 fragments is as fast as shading just one! But it also means that shading just one fragment is as slow as shading 32…
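The ‘shader as loop body’ view can be sketched like this. The `shade` function is a hypothetical stand-in for a real shader, and the batching loop only counts how many SIMD batches would be issued; it does not model real hardware scheduling.

```python
WARP_SIZE = 32  # SIMD width from the text (one NVidia 'warp')


def shade(fragment):
    """Hypothetical per-fragment shader: the 'inner part of the loop'."""
    return fragment * 0.5


def run_in_warps(fragments, warp_size=WARP_SIZE):
    """Apply shade() to all fragments in batches of warp_size.

    Returns the results and the number of batches issued. A partly
    filled batch still counts as one full issue; its unused lanes
    simply idle.
    """
    results = []
    batches = 0
    for start in range(0, len(fragments), warp_size):
        batch = fragments[start:start + warp_size]
        results.extend(shade(f) for f in batch)  # conceptually one SIMD stream
        batches += 1
    return results, batches


_, batches_for_32 = run_in_warps([1.0] * 32)
_, batches_for_1 = run_in_warps([1.0])
_, batches_for_33 = run_in_warps([1.0] * 33)
print(batches_for_32, batches_for_1, batches_for_33)  # 1 1 2
```

One fragment and 32 fragments both cost a single batch, while the 33rd fragment forces a second, almost empty one.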

If you have to wait, do something else!

Another technique that we also know from CPUs is simultaneous multithreading, or ‘Hyper-Threading’ as Intel’s marketing guys call it. The basic idea is this: CPUs are fast, really fast, but memory is slow and can’t keep up. This is why data is cached in on-die caches and even quicker registers! But every once in a while the CPU can’t calculate because it has to wait for data. At other times it can calculate but the memory bus is idle. So why not switch from a thread that waits for data to another one that is ready? Well, for one thing, the decision would need slow OS intervention, and the switch would need to swap in the data of the other thread … from memory. To make this idea work, the decision has to be made by the CPU alone and the data has to reside on the CPU the whole time. For this, the registers have to be (at least) duplicated and some additional logic has to be added. Note that the ALUs are not doubled, so only one thread is actually calculating stuff at any given time, but the ALUs can be kept busy even if one thread stalls while waiting for memory.
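A toy simulation shows why switching between resident threads pays off. Everything here is an assumption made for illustration: every thread repeats ‘compute for a few cycles, then wait for memory’, there is a single ALU, switching is free, and the cycle counts are invented.

```python
def alu_utilization(num_threads, compute_cycles, wait_cycles, total_cycles=800):
    """Fraction of cycles in which the single ALU does useful work.

    Each thread alternates between compute_cycles of ALU work and
    wait_cycles of waiting for memory. In every cycle the ALU runs the
    first thread whose data is available, or idles if none is ready.
    """
    ready_at = [0] * num_threads              # cycle when each thread may run again
    work_left = [compute_cycles] * num_threads
    busy = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                work_left[t] -= 1
                busy += 1
                if work_left[t] == 0:         # thread now stalls on memory
                    work_left[t] = compute_cycles
                    ready_at[t] = cycle + 1 + wait_cycles
                break                         # only one thread uses the ALU per cycle
    return busy / total_cycles


one = alu_utilization(1, compute_cycles=2, wait_cycles=6)
two = alu_utilization(2, compute_cycles=2, wait_cycles=6)
four = alu_utilization(4, compute_cycles=2, wait_cycles=6)
print(one, two, four)  # 0.25 0.5 1.0
```

With these made-up numbers a single thread keeps the ALU busy only a quarter of the time, while four resident threads hide the memory latency completely.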

Some chips can hold more than two threads on the CPU, like the SPARC T4 with 8 threads per core. On a GPU this can be even more extreme: instead of a fixed set of registers per thread, GPUs can have large register files, and the number of threads that fit is limited by the total number of registers all threads need together. This number can go into the hundreds.
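As a rough sketch of that relationship: if every thread needs the same number of registers, the number of resident threads is just a division. The register file size of 64k entries is borrowed from the Kepler example further down; the per-thread register counts are invented, and real hardware has additional limits that this ignores.

```python
def max_resident_threads(register_file_size, registers_per_thread):
    """How many threads fit into the register file at the same time."""
    if registers_per_thread <= 0:
        raise ValueError("a thread needs at least one register")
    return register_file_size // registers_per_thread


REGISTER_FILE_SIZE = 64 * 1024  # 32-bit registers, figure from the Kepler example

light_shader = max_resident_threads(REGISTER_FILE_SIZE, 32)   # hypothetical small shader
heavy_shader = max_resident_threads(REGISTER_FILE_SIZE, 128)  # hypothetical large shader
print(light_shader, heavy_shader)  # 2048 512
```

The practical consequence is that a shader which needs many registers leaves fewer threads to switch to while waiting for memory.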

But there is a difference between simultaneous multithreading on CPUs and on (at least some) GPUs: the threads on a CPU are completely independent of each other, e.g. they can belong to different applications. On some GPUs they have to belong to the same shader or kernel. While this looks like a major drawback, it’s actually not that bad: if you think of a 16k triangle mesh, one SIMD thread will evaluate (to stick with the earlier number) 32 vertex shader “threads” simultaneously, so in total 512 such threads have to be started (for 16k vertices), which gives enough opportunity for lots of simultaneous multithreading. This limitation can also be exploited in a couple of ways: constants can be placed once in the register file and don’t have to be copied for each thread. Uniforms can be cached for all threads as those are read-only anyway. Memory and texture access patterns will most likely be similar and thus make the best use of the shared (texture) cache. The number of threads that can run in parallel is easy to determine as each thread uses the same number of registers. The threads can even share one instruction cache.

We have the ingredients, let’s make a GPU!

Let’s put everything together. At the lowest level, we have a very wide SIMD processor; if it can operate on N floats in parallel, NVidia, AMD and co. would count that as N ‘cores’. Actually, an NVidia Kepler chip (e.g. in a GeForce 680) has processors that can issue six 32-float wide SIMD instructions at once in a Streaming Multiprocessor (SMX), counting as 192 ‘cores’ at the marketing department. These processors can often run fewer of the more complex operations, such as trigonometric functions, in parallel than simple floating-point operations; those have to be done more sequentially in a ‘special function ALU’ (called ‘special function unit’, SFU, at NVidia). This is done to save transistors on less-used operations and to add more ALUs and wider SIMD for the more common operations.

Each of these processors has a (texture) cache and a register file, and runs multiple threads in parallel with simultaneous multithreading. Fixed-function blocks can also be added here, e.g. texture units or even the fixed functions of the geometry pipeline (vertex fetch, tessellator, viewport transform etc.), as is done on recent NVidia hardware.

Several of these processors are then placed on the GPU die, together with the remaining parts of the fixed-function pipeline, more cache and the memory controllers. Of course the whole thing also gets a command processor that distributes the workload coming from the host to the various processors. This way it’s easy to build different versions of the chip for the low-end to high-end market: just choose a different number of SIMD processors (and/or change some parameters of those processors as well, e.g. the SIMD width).

Crunch the numbers on the number crunchers:

We already started to use the NVidia Kepler architecture as an example, so let’s complete this: what I called a SIMD processor is called an SMX or streaming multiprocessor here. It has in fact 192 floating-point ALUs (‘CUDA cores’) and 32 special function units. Threads are run in batches of 32, called a ‘warp’. The register file holds 64k 32-bit values (e.g. floats or ints) for all threads. It also has a texture cache, a uniform cache, an instruction cache and 64 KB that can be used as an L1 cache or shared memory (in case it’s used for compute), plus 16 texture units to get in more data. All fixed functions up to the viewport transformation are handled here as well. To get to the rasterization, we have to look one level up: two SMX and one shared ‘Raster Engine’ form one ‘Graphics Processing Cluster’, and up to four of those are placed on one die together with 512 KB of L2 cache, the memory controllers and a ‘GigaThread Engine’ to keep the SMX busy. To sum it up, we get 8 SMX with six 32-wide SIMD lanes each, for a total of 1536 ‘cores’. Each SMX can run a different shader or kernel, so no more than 8 different programs are running at any time, but each of them in thousands of threads doing the same work on different data points.

At AMD, the latest architecture has processors called ‘Graphics Core Next’ (GCN Compute Units): each has just four SIMD lanes, which are only 16 floats wide. Simultaneous multithreading works a bit differently here: a GCN is limited to 10 threads per SIMD lane and thus 40 threads per processor (vs. ‘whatever fits in the register file’ on NVidia hardware), but these don’t have to belong to the same shader/kernel, making this design more flexible (and in fact a little bit more CPU-like). Each processor has 64 KB of local memory and 16 KB of L1 cache. The fixed-function parts for geometry and rasterization sit ‘globally’ on the die, together with an L2 cache. Where our GeForce example has only 8 SMX processors, AMD puts in 32 GCN processors. So here the total is 32 GCNs with four 16-wide SIMD lanes each, for a total of 2048 cores.

Just for fun, let’s compare this to a (4-core) Core i7 Ivy Bridge: what’s called a ‘core’ on a CPU is what I called a ‘processor’ in the GPU examples above, as individual SIMD ALUs are called ‘cores’ there. So if an i7 were a GPU, we would count: 4 processors with two threads each, and each processor has an 8-float wide SIMD unit. This gives a total of 32 ‘cores’ and a maximum of 64 ‘threads’ in parallel. Still not much, even when taking the higher clock rate into account.
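The ‘marketing core’ arithmetic for the three chips can be written down in a few lines. The `Chip` class and its field names are invented for this sketch; the numbers are the ones used in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    processors: int   # SMX, GCN processors or CPU cores
    simd_units: int   # what the text calls 'SIMD lanes' per processor
    simd_width: int   # floats per SIMD unit

    @property
    def marketing_cores(self) -> int:
        """Every SIMD ALU counted as one 'core'."""
        return self.processors * self.simd_units * self.simd_width


chips = [
    Chip("NVidia Kepler (GeForce 680)", processors=8, simd_units=6, simd_width=32),
    Chip("AMD GCN", processors=32, simd_units=4, simd_width=16),
    Chip("Core i7 Ivy Bridge", processors=4, simd_units=1, simd_width=8),
]

core_counts = {chip.name: chip.marketing_cores for chip in chips}
for name, cores in core_counts.items():
    print(f"{name}: {cores} 'cores'")
# NVidia Kepler (GeForce 680): 1536 'cores'
# AMD GCN: 2048 'cores'
# Core i7 Ivy Bridge: 32 'cores'
```

Counting the i7’s two hardware threads per core on top gives 4 × 2 × 8 = 64 parallel ‘threads’, the figure quoted above.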

If that’s not enough performance for your graphics needs, you can always put two GPU dies on one graphics card and/or install multiple cards to work together…



2 thoughts on “Understanding the parallelism of GPUs”
  • sriravic says:

    680 is kepler architecture. It has 192 cores per SM. Fermi has 32 cores per SM. Please do correct the mistake in the section “We have the ingredients, lets make a GPU!”.

    • Robert says:

      Thanks for the note, yes I was talking about a Kepler (it’s fixed now). The Kepler has 192 ‘cores’ but a warp still is 32 floats wide, so it has a SIMD width of 32 floats but can run 6 of those in parallel in one SMX. Hope this makes it clearer.
      Fermi had also a 32 float SIMD but could only issue one instruction at a time (at least on most Fermis, the 560Ti has 48 ALUs per SM).
