I already talked about the swap test, which helps us determine if velvet graphics is indeed possible. The second step I usually take is to benchmark how much graphics we can put on screen before things start to stutter.
If we focus on pure graphics for now, there are two things worth looking at: fill rate, which is the system's ability to put pixels on screen, and the number of draw calls, which is the number of times we can tell the system to draw something.
When working with software graphics, fill rate is usually the biggest obstacle. The CPU needs to process millions of pixels per frame, potentially both reading and writing. The memory bus will be heavily taxed, and this will eat from the performance budget of the rest of the application. When you have a dedicated GPU, however, fill rate is usually not the problem. I’m speaking now in the context of UIs, as both industrial 3D applications and games can easily push the GPU beyond its limits.
A couple of things affect our pixel throughput:
Source-over alpha blending requires that we take our source pixel and mix it with the destination pixel. The destination pixel is both read and written, so there is a bit of extra work involved. In addition, some GPUs can do even further optimizations like hidden surface removal and early z-test. Such tricks are great for overdraw performance, but they only work when blending is turned off (as in glDisable(GL_BLEND)). On a desktop with a discrete graphics card, chances are this will never be a problem, but for an onboard laptop or mobile/embedded GPU, it just might.
Another aspect that can greatly reduce the throughput of an application is textures. Textures are memory blobs and need to be fetched from graphics memory. The GPU doesn’t always have a large cache, so a lot of cycles are spent just fetching texture memory. Working with small textures or small regions of a large texture will in most cases be cheap, while working with full screen images will take its toll.
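To put a rough number on the extra work that blending adds, here is a back-of-the-envelope sketch in Python. It assumes 32-bit pixels and counts only framebuffer reads and writes; it ignores texture fetches, caches, framebuffer compression and tile-based renderers, so treat it as an order-of-magnitude estimate rather than a model of any particular GPU. The function name and its parameters are made up for this illustration.

```python
def fill_traffic_bytes_per_second(width, height, layers, blended,
                                  fps=60, bytes_per_pixel=4):
    """Rough framebuffer traffic for `layers` fullscreen rectangles per frame.

    Assumption: an opaque fill writes each pixel once, while a source-over
    blend reads the destination pixel and then writes it back.
    """
    accesses_per_pixel = 2 if blended else 1
    return width * height * bytes_per_pixel * accesses_per_pixel * layers * fps


# One fullscreen layer on a 1920x1080 screen at 60 fps.
opaque = fill_traffic_bytes_per_second(1920, 1080, layers=1, blended=False)
blended = fill_traffic_bytes_per_second(1920, 1080, layers=1, blended=True)

print(f"opaque:  {opaque / 1e6:.0f} MB/s")
print(f"blended: {blended / 1e6:.0f} MB/s")
```

Under these assumptions every blended fullscreen layer costs twice the framebuffer traffic of an opaque one, and that is before the GPU loses the chance to skip hidden pixels.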
The part of the GPU pipeline that decides the color of a pixel is called a fragment shader. There are other shader stages, like the vertex shader, but the ratio of pixels to vertices is usually so high that the fragment shader is the one that ends up counting. With user interfaces, the majority of fragment shaders are quite simple. The Qt Quick Scene Graph will typically alternate between “colored pixel”, “textured pixel” and “distance field textured pixel”. They translate into a few GPU instructions each. Compared to 3D graphics, we don’t have to deal with lights, normals, bump mapping, shadows and all the other goodness that modern 3D has to offer. There are a couple of exceptions though. For instance anything involving blurring, like Qt Graphical Effects’ GaussianBlur or DropShadow, requires a lot of texture samples to produce a single output pixel. It is very likely that an onboard laptop chip is not capable of running live Gaussian Blur with 32 samples at 60 fps.
So all in all, assuming that the underlying OpenGL stack works properly with vsynced swap and all, raw GPU throughput will usually not be the biggest problem (for user interfaces). Let’s look at some numbers.
The benchmark can be found here. It creates a fullscreen Qt Quick window and draws a variety of stuff into it. The goal is to see how far we can go with a specific testcase while sustaining a perfect 60 fps. Skipping one frame every 2-3 seconds is in this case considered a failure. Even though we're testing raw graphics performance, I use Qt Quick because it is easy to put things together, and for the testcases I've written, the delta between raw OpenGL and the OpenGL that the scene graph produces is small enough to not impact the results. When it does, I will make a comment about it.
Note: If you try the benchmark, you will see that it does skip frames while increasing and reducing complexity. I'll get back to this in a later post.
I've run the benchmarks on the following hardware:
- Desktop Computer: Intel i7-3770K @ 3.50GHz, 24 GB RAM, NVidia GT-210, 1920x1080 screen, Ubuntu 14.04, proprietary driver. Tested with kwin w/o compositor. Also tested with compositor and with unity.
- MacBook Pro: Early 2011, 4 GB RAM, Intel HD 3000 or AMD Radeon HD 6750M, 1680x1050 screen, OS X 10.9.5
- iPad Retina Mini: A7, PowerVR G6430, 2048x1536 screen, iOS 7.1
- Jolla: Snapdragon 400 1.4 GHz dual-core, Adreno 305, 540x960 screen, Sailfish OS u9 (public RC)
What we can see in the graphs confirms what I already talked about. Solid opaque fills are generally extremely cheap; even the mobile GPUs can do 40-50 fullscreen rectangles on top of each other. The same goes for opaque textured fills, though we should keep in mind here that the scene graph renders opaque content front-to-back with the z buffer enabled, so GPUs that implement early-z will end up only having to read the front-most texture. Blending is worse, with blended textures being the most costly, though still decent. It is, however, something that is worth taking note of. Take, for instance, the following example based on Qt Quick Controls:
The left image is how it is rendered normally, and the right image is rendered using QSG_VISUALIZE=overdraw in the environment. As can be seen from the visualization, the background is opaque (green) and can be ignored. Then there are the three group boxes stacked on top of each other, each rendered as a separate blended (red) texture. If this or a similar pattern of background stacking was used on a machine that has either a low-end GPU or a GPU that doesn't match its screen size, it would eat a large part of the performance budget. And this is before we start adding the actual application content.
The graphs also seem to indicate that both the kwin and unity compositors are quite bad for performance. I could maybe tolerate a 10-20% drop for going through the composition step, but what I'm measuring does not seem right. If anybody knows what's up with that, please give me a ping.
I should also mention that the iPad didn't start skipping frames when I reached 30 opaque textures. It ran out of memory and was killed!
When running similar benchmarks previously, I have seen embedded chips with overdraw performance as low as 1.5. That means the application can fill the screen 1.5 times before it starts stuttering. In terms of content, that means a background image, a few icons and some text. Not a lot of flexibility. When working with such a system, the application will need to be very careful about what gets drawn if a sustained 60 fps UI is desirable.
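To get a feel for how quickly a budget of 1.5 is used up, the overdraw of a scene can be estimated by summing the pixel areas of everything drawn and dividing by the screen area. The sketch below does that in Python for a made-up scene on a 540x960 screen. The item sizes are hypothetical, and the estimate ignores clipping against the screen edges and any hidden opaque content the GPU manages to skip.

```python
def overdraw_factor(screen_size, rects):
    """Total drawn pixel area divided by the screen area.

    `rects` is a list of (width, height) tuples, one per drawn item.
    """
    screen_w, screen_h = screen_size
    drawn_pixels = sum(w * h for w, h in rects)
    return drawn_pixels / (screen_w * screen_h)


# Hypothetical scene: background image, header bar, eight icons, one text block.
scene = [(540, 960), (540, 96)] + [(96, 96)] * 8 + [(480, 200)]

factor = overdraw_factor((540, 960), scene)
budget = 1.5
print(f"overdraw {factor:.2f}, within budget: {factor <= budget}")
```

In this example the background alone accounts for 1.0 of the budget, which is why stacking a second fullscreen layer on such a chip is not an option.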
What about more complex shaders?
I mentioned that complex shaders would be problematic, especially on the less powerful GPUs. To test this, I incorporated the GaussianBlur from QtGraphicalEffects into the benchmark and tested how many samples I could have before it started to stutter. Now, the Gaussian blur is implemented as a two-pass algorithm. That means that in addition to doing a lot of texture sampling, we’ll also be rendering the equivalent of the screen several times per frame: first into the FBO which will be the source for the first pass, then blurring in one direction into a second FBO, and finally blurring the second FBO in the other direction onto the screen.
The Jolla managed 55-ish fps with 2 and 3 samples and i7/kwin/composited maxed out at 30 fps, which is why they are marked with 0. Neither managed to sustain a velvet frame rate. The only chip that managed to run with a high sample count was the discrete graphics chip on the MacBook. This is in line with what is expected: as the complexity of the per-pixel operation grows, so do the requirements on the graphics hardware. What we can read from this is that these kinds of operations need to be used with some care, applied to smaller areas or otherwise in moderation.
There are alternatives though. For instance, it is possible to do fast blurring even on low-end hardware using a combination of downscaling, simplified blurring and upscaling, such as this. For drop shadows, prefer to use pre-generated ones. Only use a live drop shadow if the area is small and there are few instances, or if you know you’ll be running on a high-end chip. Keep in mind that cheating looks equally good. It just runs faster!
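The cost of the three steps described above can be estimated with a little arithmetic. The Python sketch below assumes that each blur pass reads `samples` texels per output pixel, which is a simplification; the actual shader may fetch fewer by relying on linear filtering. The function names are made up for this illustration.

```python
def two_pass_blur_cost(width, height, samples):
    """Per-frame cost estimate for a separable (two-pass) fullscreen blur."""
    pixels = width * height
    return {
        # Source FBO, horizontal blur into a second FBO, vertical blur to screen.
        "fullscreen_passes": 3,
        # Assumption: each of the two blur passes reads `samples` texels per pixel.
        "texture_samples": 2 * samples * pixels,
    }


def single_pass_blur_samples(width, height, samples):
    """Texel reads for a naive, non-separable samples x samples kernel."""
    return samples * samples * width * height


cost = two_pass_blur_cost(1920, 1080, samples=32)
naive = single_pass_blur_samples(1920, 1080, samples=32)
print(f"two-pass: {cost['texture_samples'] / 1e6:.0f} M texel reads per frame")
print(f"naive:    {naive / 1e6:.0f} M texel reads per frame")
```

The two-pass approach is far cheaper than a naive 2D kernel, but under these assumptions a fullscreen 32-sample blur still needs well over a hundred million texel reads per frame, on top of filling the screen three times.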
Number of Draw Calls
The other factor I mentioned up top that is worth looking at is the number of draw calls. When compared to software graphics, this is where hardware-based APIs are much worse. In many ways, the problem to solve is the inverse. With software graphics, at least when looking at QPainter with its software rasterizer, draw calls are cheap and the impact of state changes in the graphics pipeline is small. With OpenGL, and DirectX for that matter, pipeline changes and draw calls can be quite bad. As part of the benchmarks, I’ve created a bunch of unique ShaderEffect items. The scene graph renderer cannot batch these, so they will all be scheduled using one glDrawElements() call per item. Let’s look at some numbers:
One conclusion to draw from this is that without any form of abstraction layer to handle batching, it is going to be hard to do complex controls, such as a table, using a hardware accelerated API. In fact, this is one of the primary reasons we’ve been pushing the raster engine for the widgets stack in Qt. If we compare this to items that do batch, one of Qt’s autotests will create 100,000 rectangles and can translate them at 60 fps, so the difference is quite significant.
Something else to take note of is that the scene graph is in no way perfect. There are several types of QML level operations which will force items to be rendered one by one. ShaderEffect is one. Item::clip is another. The following is a visualization of the “Itemviews” page of the Qt Quick Controls “Gallery” example using QSG_VISUALIZE=batches in the environment:
At first glance this looks a bit like a Christmas tree. If we look beyond that, we see that the majority of the list is drawn in three separate colors. That means that the various background and text elements of the list view have been batched together. Running the same page with QSG_RENDERER_DEBUG=render, we can see from the console output that 109 nodes were compressed to 8 batches. If we added clipping to those list entries, those 109 nodes would be drawn using separate glDrawXxx() calls and eat quite a bit out of our performance budget.
Another thing that breaks batching is large images, as these do not fit into the scene graph’s built-in texture atlas. If you are curious whether the application’s images are atlassed, run the application with QSG_ATLAS_OVERLAY=1 and look for tinted images.
Benchmarks are one thing and the real world is another, but keeping basic rules in mind based on findings in benchmarks can greatly help the overall performance of the resulting application. It is one of the premature optimizations that do pay off. If application performance starts slipping, it can take a lot of work to get it back.
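The effect of clipping on the draw call count can be illustrated with a toy model in Python. This is not how the scene graph renderer actually works; the real batching logic is considerably more involved. The sketch only assumes that consecutive nodes sharing a material can be merged into one draw call, and that a clipped node always needs a call of its own. The node tuples and the function name are hypothetical.

```python
from itertools import groupby


def count_draw_calls(nodes):
    """Count draw calls for `nodes`, a list of (material, clipped) in draw order.

    Toy model: a run of identical unclipped nodes becomes one batch,
    while every clipped node is drawn with its own call.
    """
    calls = 0
    for (material, clipped), run in groupby(nodes):
        run_length = len(list(run))
        calls += run_length if clipped else 1
    return calls


# A hypothetical list view with 30 rows: 30 backgrounds followed by 30 labels.
batched = [("rect", False)] * 30 + [("text", False)] * 30
clipped = [("rect", True)] * 30 + [("text", True)] * 30

print("unclipped:", count_draw_calls(batched))
print("clipped:  ", count_draw_calls(clipped))
```

Even in this simplified model, turning on clipping for every row takes the list from a couple of draw calls to one call per node.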