What’s the big deal with Apple’s Metal API?
During the WWDC 2014 keynote Apple surprised us all by announcing a new 3D graphics API called Metal (try to google that…). This one is not a new high-level API on top of OpenGL ES (like SceneKit) but a new low-level rendering and compute API which can replace OpenGL ES in games. It’s said to be up to 10 times faster than OpenGL ES (or to be more precise: able to issue up to 10 times as many draw calls) and is only available on iOS devices with the latest A7 processor.
The announcement of Metal has (again) started discussions (on other blogs and on Twitter) about whether newer, closer-to-the-metal APIs are needed to replace OpenGL. This post is not intended to take part in that discussion, but to explain why it is going on by describing what sets Metal apart from OpenGL ES (which it could replace on iOS). To understand what is or isn’t special about Apple’s Metal API we have to look at some background on how graphics APIs and GPUs work.
How GPUs and graphics drivers work
A naive user could assume that an API call directly does something on the GPU or makes something happen inside the GPU (reconfiguring it in case of state changes, or starting to draw something). Even more naively, she/he could assume that the GPU has actually finished that work when the API call returns. In fact, these assumptions are far from what really happens. If the driver executed the rendering commands as soon as they were created and waited for the rendering to complete before returning from the API call, neither the CPU nor the GPU could work efficiently, as one processor would always block the other.
As a simple improvement the GPU could run asynchronously: the GPU would not block the CPU and the API calls could return nearly instantly. In this scenario the GPU might still not be used 100%, as it might have to wait for the CPU to create render calls (think of the start of a frame), while later in the frame multiple commands might have to wait for others to complete. This is one reason why most graphics drivers collect all draw calls (and other tasks which should be executed on the GPU, e.g. data transfers, state changes etc.) for the whole frame before sending them to the GPU. Those buffered commands are then sent at the beginning of the next frame and thus use the GPU as efficiently as possible. Of course, this adds one frame of latency: while the CPU creates work for this frame, the last frame gets rendered on the GPU. In fact, more than one frame can get buffered to achieve a higher utilisation of the GPU and thus higher framerates, at the cost of even more latency.
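The trade-off between throughput and latency can be made concrete with a toy timeline. The following is only a sketch under simplifying assumptions: every frame costs the same fixed CPU and GPU time, work is submitted once per frame, and the function name `simulate` and its parameters are made up for this illustration. It is not how any particular driver schedules work.

```python
def simulate(cpu_ms, gpu_ms, frames, max_queued):
    """Toy model of a CPU producing frames and a GPU consuming them.

    max_queued is the number of finished-but-unrendered frames the CPU may
    run ahead of the GPU (0 means fully synchronous).
    Returns (total time, worst latency from CPU start to GPU finish).
    """
    cpu_done = []
    gpu_done = []
    worst_latency = 0
    for i in range(frames):
        # The CPU starts frame i once it has finished recording frame i-1 ...
        cpu_start = cpu_done[i - 1] if i > 0 else 0
        # ... and once the GPU has finished the frame that frees a queue slot.
        blocker = i - 1 - max_queued
        if blocker >= 0:
            cpu_start = max(cpu_start, gpu_done[blocker])
        cpu_done.append(cpu_start + cpu_ms)
        # The GPU starts frame i when it is submitted and frame i-1 is done.
        gpu_start = max(cpu_done[i], gpu_done[i - 1] if i > 0 else 0)
        gpu_done.append(gpu_start + gpu_ms)
        worst_latency = max(worst_latency, gpu_done[i] - cpu_start)
    return gpu_done[-1], worst_latency


for queued in (0, 1, 2):
    total, latency = simulate(cpu_ms=8, gpu_ms=10, frames=100, max_queued=queued)
    print(f"max_queued={queued}: total={total} ms, worst latency={latency} ms")
```

With these assumed numbers the synchronous case needs 1800 ms for 100 frames, while one buffered frame brings that down to 1008 ms at slightly higher latency. Because the frame times are constant in this model, a second buffered frame only adds latency; deeper queues pay off when frame times fluctuate.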
This is called pipelining. Some API calls, namely those querying data back from the GPU, force the GPU to catch up with the CPU, effectively blocking the CPU and stalling the pipeline. It’s only safe to query “old” data from the GPU, data that is a few frames old, as we can assume that it is already present. Occlusion queries, for example, should be read back at least one frame after they were issued, as we have to assume that a query that has just ended will take at least one frame to get processed by the GPU (in OpenGL you can check whether a query result is already available and thus find out that it can take one to three frames until the results are ready).
Another mistake in our naive assumption (other than the time of command execution) concerns what some of the state-changing calls are doing. For example, to define the vertex data inputs to a vertex shader, the shader itself has to declare the inputs, but a vertex array object (VAO) also has to define the buffers from which to pull what kind of data. Note that these definitions can differ in some ways; for example, a value from an integer buffer can get converted to a float value inside the shader. Configuring and binding the VAO does not set some magic flags inside the GPU to pull and convert the data for you. On most GPUs (probably on all recent ones) the driver adds some shader code to your vertex shader to perform the pulling and conversion, as that is the best way to use the existing hardware. Daniel Rakos has found in his “OpenGL Insights” article “Programmable Vertex Pulling” that if the data pulling from the buffers is performed inside the shader by his own code, the performance is identical to what the “fixed function” pulling achieves on AMD as well as NVIDIA GPUs.
Conversion of fragment outputs to the correct texture format as well as some rarely used texture sampling techniques are other states that can get implemented by changing the application provided shaders. So when will the driver patch the shaders? Whenever a corresponding state change happens? Imagine shader A is bound, a couple of states are changed (e.g. the bound VAO gets configured) and the driver would update and recompile the shader for each GL call just to find that the application will switch to shader B just before the actual draw call happens…
To ensure that only those shader patching operations happen that are actually needed, the driver waits for the first draw call (as we have learned, it waits for a whole frame anyway unless a read-back of GPU results forces it to work otherwise) and caches the patched shader, so the next time this configuration is needed a precompiled shader is ready. Some applications set all states they might need once at start-up and issue one unimportant draw call per combination to ensure all cacheable states are created and cached by the driver. This way, no delays caused by those expensive driver operations occur during the actual run loop of the application.
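The lazy patching and caching described above can be sketched in a few lines. This is a conceptual model only: `PatchingDriver` and its methods are invented for this illustration and do not correspond to any real driver interface, and the “compilation” is just a counter.

```python
class PatchingDriver:
    """Toy driver that patches shaders lazily, at draw time, and caches the result."""

    def __init__(self):
        self.bound_shader = None
        self.vertex_format = None
        self._cache = {}          # (shader, vertex format) -> patched shader
        self.compile_count = 0    # how often the expensive path was taken

    def use_program(self, name):
        # State changes only record the new state; nothing is compiled yet.
        self.bound_shader = name

    def set_vertex_format(self, attributes):
        self.vertex_format = tuple(attributes)

    def draw(self):
        # Only now is the final state combination known.
        key = (self.bound_shader, self.vertex_format)
        if key not in self._cache:
            self.compile_count += 1  # stands in for the expensive patch + compile
            self._cache[key] = f"{self.bound_shader} patched for {self.vertex_format}"
        return self._cache[key]


driver = PatchingDriver()
driver.use_program("A")
driver.set_vertex_format(["int3"])
driver.set_vertex_format(["float3"])  # intermediate states cost nothing
driver.use_program("B")
driver.draw()                         # one compile, for ("B", ("float3",))
driver.draw()                         # cache hit
print(driver.compile_count)
```

The start-up “warm-up” draw calls mentioned above correspond to calling `draw()` once per state combination before the run loop starts, so that every later draw is a cache hit.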
So far we have learned that at least two important things happen behind the scenes to fit (e.g.) OpenGL to modern GPUs: State changes can get complex if a new state combination is required and GPU operations are delayed for a considerable amount of time.
Number two has another side effect: copying data to the “graphics card”. In case of a “classical” desktop GPU (you know, the ones that take a PCI/AGP/PCI-E slot, or two, or three) with its own graphics memory, the driver has to copy the data to that memory. In case of a shared-memory system (mobile GPUs, but also Intel’s integrated GPUs and consoles) this is a memory copy to a new location. Updating a uniform buffer, for example, consists of mapping the buffer data into the application’s memory space (a pointer to the data gets provided), editing the data and unmapping it. As the GPU runs asynchronously and might be one or more frames behind, the driver can’t provide a pointer to the actual data but only to a copy, which we can update and which the driver will then set aside to upload when the current frame gets processed.
Per application, one stream of actual commands for one frame to be executed by the GPU is produced and then sent in one go to the GPU (well, actually to the driver, which has to handle command buffers from multiple applications and schedule between them, but let’s keep the example simple).
More details about what actually happens in the graphics pipeline can be found in Fabian Giesen’s excellent series “A trip through the Graphics Pipeline 2011”.
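A minimal model of that map/unmap behaviour might look as follows. The class `DriverBuffer` and the method `begin_gpu_frame` are hypothetical names for this sketch; real drivers use more elaborate schemes, but the principle of handing out a copy while the GPU still reads the old contents is the same.

```python
class DriverBuffer:
    """Toy buffer: the CPU edits a copy, the GPU keeps reading the old data."""

    def __init__(self, data):
        self._gpu_visible = bytes(data)  # contents used by frames already queued
        self._pending = None             # update waiting for the next GPU frame
        self._mapped = None              # copy currently handed to the application

    def map(self):
        # Hand out a private, editable copy instead of the live data.
        self._mapped = bytearray(self._gpu_visible)
        return self._mapped

    def unmap(self):
        # Set the edited copy aside; the GPU does not see it yet.
        self._pending = bytes(self._mapped)
        self._mapped = None

    def begin_gpu_frame(self):
        # When the GPU reaches the frame that recorded the update, upload it.
        if self._pending is not None:
            self._gpu_visible = self._pending
            self._pending = None

    def gpu_read(self):
        return self._gpu_visible


uniforms = DriverBuffer(b"\x00\x00")
view = uniforms.map()
view[0] = 7
uniforms.unmap()
print(uniforms.gpu_read())   # still the old contents
uniforms.begin_gpu_frame()
print(uniforms.gpu_read())   # now the updated contents
```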
Why another programming model can have benefits
We can see that a lot of complexity is hidden from the programmer and that a lot of tricks (probably way more than I have mentioned here) have to be performed to hide what is actually going on. Some of those tricks make the life of the developer simpler; others force him/her to find ways to trick the driver (e.g. the mentioned “useless” draw calls to force the driver to cache states early on) or to learn the possible side effects of API calls (for example which calls can stall the GPU and how to force a stall to reduce latency).
Some graphics APIs now try to remove most of these tricks by exposing more of the actual complexity, in some cases by leaving it to the program to solve the resulting problems. It’s been said that the graphics API of the PS3 went in this direction (as I’m not a PS3 developer I don’t get access to the documents to check, and even if I were one, the NDA prohibits all devs from describing any details). Mantle is going in this direction (we will see more about how AMD does it when the documents get released), as will Microsoft with DirectX 12, and now Apple is doing the same with Metal.
So what has changed?
Command buffers are exposed, and the application has to fill those buffers and commit them to the command queue, which will execute the buffers in order on the GPU. This way the application has full control over when the work is sent to the GPU and how many frames of delay it is willing to add (thus adding latency but increasing GPU utilisation). Buffering GPU commands and sending them asynchronously in the next frame has to be implemented by the application itself.
As it is clear that those buffers are not executed right away (during creation) and that multiple buffers can get created and then added in a specific order to the command queue to be executed, an application can start building them in multiple threads in parallel. It is also more obvious to the programmer which results of the computations are available and which are not.
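The idea of recording several command buffers in parallel and still committing them in a fixed order can be sketched like this. The classes below are plain Python stand-ins named after the concepts (Metal’s real objects are MTLCommandQueue and MTLCommandBuffer); `record_pass` and `build_frame` are hypothetical helpers, and Python threads only illustrate the structure, not the speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


class CommandBuffer:
    """Toy command buffer: just a labelled list of recorded commands."""

    def __init__(self, label):
        self.label = label
        self.commands = []

    def draw(self, mesh):
        self.commands.append(("draw", mesh))


class CommandQueue:
    """Toy queue: buffers are executed in the order they were committed."""

    def __init__(self):
        self.submitted = []

    def commit(self, command_buffer):
        self.submitted.append(command_buffer)


def record_pass(label, meshes):
    # Recording touches only this buffer, so passes can be recorded concurrently.
    command_buffer = CommandBuffer(label)
    for mesh in meshes:
        command_buffer.draw(mesh)
    return command_buffer


def build_frame(queue, passes):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(record_pass, label, meshes) for label, meshes in passes]
        # Commit in the intended order, no matter which thread finished first.
        for future in futures:
            queue.commit(future.result())


queue = CommandQueue()
build_frame(queue, [("shadow", ["tree", "rock"]), ("opaque", ["tree"]), ("ui", [])])
print([cb.label for cb in queue.submitted])
```

Keeping recording and committing as separate steps is what makes this possible: recording needs no shared state, while the commit order alone decides the execution order on the GPU.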
State is now organised in state objects: switching between them is cheap, while creating them is more expensive. For example, the MTLRenderPipelineState holds the shaders and all states that are implemented by patching the shaders (e.g. vertex shader input). This way it’s clear that this object should get created at start-up and cannot be modified later on, which eliminates shader recompilation during the game loop.
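The difference from the lazily patched state of OpenGL can be illustrated with an immutable object that is validated once, when it is built. This is a Python analogy, not the Metal API: the class `RenderPipelineState`, the factory `make_pipeline_state` and the format names are assumptions made up for this sketch.

```python
from dataclasses import dataclass, FrozenInstanceError

# Hypothetical set of render target formats, for illustration only.
SUPPORTED_COLOR_FORMATS = {"rgba8", "bgra8"}


@dataclass(frozen=True)
class RenderPipelineState:
    """Immutable bundle of everything that would require shader patching."""
    vertex_shader: str
    fragment_shader: str
    vertex_format: tuple
    color_format: str


def make_pipeline_state(vertex_shader, fragment_shader, vertex_format, color_format):
    # All validation (and, in a real API, compilation) happens once, up front.
    if not vertex_format:
        raise ValueError("a pipeline needs at least one vertex attribute")
    if color_format not in SUPPORTED_COLOR_FORMATS:
        raise ValueError(f"unsupported color format: {color_format}")
    return RenderPipelineState(vertex_shader, fragment_shader,
                               tuple(vertex_format), color_format)


state = make_pipeline_state("basic_vs", "basic_fs", ["float3", "float2"], "bgra8")
try:
    state.color_format = "rgba8"      # changing it later is not possible
except FrozenInstanceError:
    print("pipeline state is immutable; build a new one instead")
```

Since the object can never change after creation, switching to it during the run loop needs no further checks, which is the property the paragraph above attributes to Metal’s state objects.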
Another benefit of having a new API is that it does not have to be compatible with older versions and thus has less redundancy in it. Let’s look at how uniforms are set in OpenGL, for example: you can use the default uniform block and call one of the glUniform() functions after binding the shader, or use one of the glProgramUniform() functions without binding the shader (functions that were added later). You can also use a uniform buffer (another later addition) and update all uniforms at once (no shader binding needed). In Metal you set uniforms in a way similar to the uniform buffer approach.
As the Metal API is “designed” for the A7 chip, it is intended to run on a shared-memory system. This means that the CPU and GPU can directly access the same data without the need to go over the PCIe bus. Updating or modifying buffer data on one of the processors to be used on the other one can thus be done very efficiently, as long as it can be guaranteed that the data isn’t in use by the other processor. Metal gives the program direct access to buffers from the CPU; it is the responsibility of the program to ensure that the data is not used by the GPU at the same time. This is a very nice feature to have and can be used to mix the computations of CPU and GPU. On the PS3, also a shared-memory system with a close-to-the-metal API, some games used the Cell SPUs to perform post-processing effects on the framebuffers, which would normally be done by fragment shaders on the GPU. The GPU was at its limits, there were still CPU resources available, and access to the same data was basically free (no memory transfer). The same could be done with Metal (the A7 CPU isn’t such a SIMD “beast” as the Cell, but tasks better suited to a CPU can be interleaved with GPU rendering and compute tasks).
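One common way for an application to take on that responsibility is to cycle through a small ring of buffers and to wait on a semaphore until the GPU has released the oldest one. The sketch below models this in Python; `FrameRing`, `acquire` and `gpu_finished` are hypothetical names, and the “GPU” is simulated by calling `gpu_finished()` by hand. It assumes that the GPU finishes frames in the order they were submitted.

```python
import threading


class FrameRing:
    """Ring of CPU-writable buffers guarded by a counting semaphore."""

    def __init__(self, slots, size):
        self.buffers = [bytearray(size) for _ in range(slots)]
        self._free = threading.Semaphore(slots)  # slots not in use by the GPU
        self._next = 0

    def acquire(self, timeout=None):
        """Wait until a slot is free, then return (index, buffer) for CPU writes."""
        if not self._free.acquire(timeout=timeout):
            raise TimeoutError("all buffers are still in use by the GPU")
        index = self._next
        self._next = (self._next + 1) % len(self.buffers)
        return index, self.buffers[index]

    def gpu_finished(self):
        """Call when the GPU has completed the oldest frame still in flight."""
        self._free.release()


ring = FrameRing(slots=3, size=64)
index, buffer = ring.acquire()   # safe to write: the GPU is not reading this slot
buffer[0] = 42
# ... encode and submit the frame that reads ring.buffers[index] ...
ring.gpu_finished()              # normally triggered by a completion callback
```

The number of slots is the same knob as the number of buffered frames discussed earlier: more slots let the CPU run further ahead, at the cost of latency and memory.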
How is it ten times faster?
Each draw costs some time on the CPU and some time on the GPU. The Metal API reduces the time spent on the CPU by making state handling simpler and thus reducing the checks the driver has to perform to find out whether a state combination is valid. Precomputing states also helps: not only can the error checks be done at state build time, the state change itself also requires fewer API calls. Being able to build command buffers in parallel also increases the number of possible draw calls if the application is CPU bound.
The rendering on the GPU, on the other hand, is not faster: an application that only makes a few calls to draw very large meshes will not benefit.
Could the same be done with OpenGL?
GDC 14 featured a great talk named “Approaching Zero Driver Overhead” by Cass Everitt, John McDonald, Graham Sellers and Tim Foley. The general idea is to reduce the work of the driver in OpenGL by doing more work per draw call and to use newer GL objects and fewer GL calls to be more efficient. With multi draw indirect, for example, a lot of different draws can get started by just one function call, with the parameters stored in GPU memory. State changes can get reduced by bindless APIs (e.g. all textures are available to the shaders at all times, so no rebinding of textures is needed).
When all resources needed for a lot of draws are bound at the same time (e.g. all textures are available as texture arrays) or don’t have to be bound (bindless textures), and all vertex data gets pulled from one VAO or programmatically by the shader itself, one multi draw indirect can replace a lot of small draw calls. The parameter lists for this, as well as the contents of uniform buffers or shader buffers which can provide additional parameters per “draw”, can be built in parallel on different threads, as the creation of those buffers does not require any GL calls. Objects can be immutable (immutable textures), so only the content can get changed but not other parameters like the texture size, number of mipmaps etc. This works like all objects in Metal and has the advantage that fewer possible object state changes result in fewer error checks for the driver.
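Building such a parameter list really is just filling memory. The sketch below packs records in the layout that glMultiDrawArraysIndirect reads (count, instanceCount, first, baseInstance, each an unsigned 32-bit integer) on several threads and concatenates them. The mesh dictionaries, the helper names and the use of baseInstance to carry an object id are assumptions for this example, and the little-endian byte order is assumed to match the target platform. Uploading the bytes and issuing the draw would still need GL calls, which are not shown.

```python
import struct
from concurrent.futures import ThreadPoolExecutor

# One indirect draw record: count, instanceCount, first, baseInstance.
COMMAND = struct.Struct("<4I")


def pack_commands(meshes):
    """Pack one record per mesh; needs no GL context, so any thread can do it."""
    return b"".join(
        COMMAND.pack(m["vertex_count"], 1, m["first_vertex"], m["object_id"])
        for m in meshes
    )


def build_indirect_buffer(meshes, workers=4):
    """Split the mesh list into chunks, pack them in parallel, keep the order."""
    chunk = max(1, -(-len(meshes) // workers))  # ceiling division
    chunks = [meshes[i:i + chunk] for i in range(0, len(meshes), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so the draw order is preserved.
        return b"".join(pool.map(pack_commands, chunks))


meshes = [
    {"vertex_count": 36, "first_vertex": 0, "object_id": 0},
    {"vertex_count": 6, "first_vertex": 36, "object_id": 1},
]
data = build_indirect_buffer(meshes)
print(len(data) // COMMAND.size, "draw records,", len(data), "bytes")
```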
Some of these ideas require extensions and newer OpenGL versions, but it would be possible to bring a lot of this to OpenGL ES as well. What we don’t get is the direct handling of command buffers with all its pros and cons.
OpenGL can develop in a direction with bindless objects and direct state access (in the core GL specs) where few draw calls can achieve everything needed. The driver overhead would be minimal, and GL could compete with “closer to the metal” APIs. Ideally it would also remove the then outdated parts of the API to better direct the programmers to only use the efficient ways and possibly also to reduce driver complexity.
How big are our chances to see this? The “deprecation” mechanism introduced to OpenGL was later removed again, so nothing was actually deprecated. Then a core profile was introduced that had a cleaner API, but the “compatibility” or “full” profile including everything is still supported on all systems except Macs. We might see a “modern core” profile with only the new and recommended functions, but it probably has to be a 100% compatible subset of the “full” version supporting everything back to glBegin(). This might limit the options for future OpenGL evolution and might make alternative, cleaner and closer-to-the-metal APIs more attractive…