Game Performance Improvements in Latest Mac OS X Update
Aug 17, 2010
When we launched Steam on Mac OS X back in May, there was a lot of buzz about performance, particularly relative to Windows running on the same machine. While we met our goal of making sure all of our customers had an acceptable gaming experience at launch, we have spent a large chunk of effort in the intervening months working with Apple and their GPU vendors to close the performance gap with Windows. The combination of changes in our code and the latest graphics update available from Apple today removes a variety of software bottlenecks, resulting in significant graphics performance enhancements for Mac gamers.
In addition to low-level implementation changes that have improved performance across the board, Apple has also removed some implementation inefficiencies, which allows us to improve visual quality, most notably in the area of GPU occlusion queries.
When we first released our Source engine games on the Mac, we had to turn occlusion queries off, but with the latest update to 10.6.4 we can turn them back on, giving players higher visual quality. If you're not familiar with occlusion queries, an occlusion query is a mechanism for an application (running on the CPU) to ask the GPU via OpenGL, for some set of draw calls, how many pixels are actually drawn to the screen after shading, z-buffering, etc.
We use GPU occlusion queries for a variety of effects in our games, but one of the easiest to explain is light glows. It's a simple technique that just about every game uses, but it produces a surprisingly convincing effect. What we do is issue queries for light-emitting objects in our scene, such as light bulbs or a disc representing the sun. A given light source may be partly or wholly occluded by other geometry in the scene, and we use the occlusion query to determine how occluded it is. The percentage of a given light source's screen area which is actually visible is used to scale the intensity of an additive glow sprite, which is drawn over the frame without any z-buffering. Because its intensity is attenuated with the occlusion of the light source geometry, the glow sprite fades gently in and out as it becomes more or less occluded by the rest of the scene. The result looks like a complex optical effect, which helps to convince your brain that you're looking at something truly bright.
To illustrate the importance of using light source occlusion to drive glow sprite intensity, we've created a little clip from Half-Life 2: Episode One, which shows the effect with and without the query. The left side of the movie shows a convincing glare effect due to the scaling of the glow sprite intensity, while the right side of the movie gives a very crude effect, with the glow showing through parts of the scene, breaking the illusion.
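The glow scaling described above can be sketched in a few lines. This is only an illustration and not Source engine code: the function names, the sample counts and the clamping behavior are assumptions made for the example.

```python
def glow_intensity(visible_samples, total_samples, max_intensity=1.0):
    """Scale a glow sprite's intensity by how much of its light source is visible.

    visible_samples: result of the occlusion query for the light source geometry.
    total_samples:   samples the light source would cover if nothing occluded it
                     (hypothetical; how this is obtained is not covered here).
    """
    if total_samples <= 0:
        return 0.0
    fraction = visible_samples / total_samples
    fraction = min(max(fraction, 0.0), 1.0)  # guard against odd counts
    return max_intensity * fraction


def additive_blend(frame_value, glow_value, intensity):
    """Add the attenuated glow on top of the frame, saturating at 1.0."""
    return min(1.0, frame_value + glow_value * intensity)
```

With a light source that would cover 400 samples and has 100 visible, `glow_intensity(100, 400)` gives a quarter-strength glow; as the source slides behind geometry the value falls smoothly to zero, which is what produces the gentle fade.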
From a technical standpoint, the key to the occlusion query feature is that our game can be written in such a way that it receives the result of a given query asynchronously, at some later point (usually during a subsequent frame). So, while we are using the GPU to perform this computation for us, we aren't stopping CPU execution to wait for the result; we can pick up the result later, since it's OK for this algorithm that the query results are a fraction of a second "stale". If implemented properly by the graphics API, occlusion queries cause no synchronization between the GPU and CPU, allowing both processors to stay busy doing work.
Unfortunately, prior to the latest software update, occlusion queries caused the CPU-side driver to synchronize with the GPU, perhaps multiple times per frame. This caused large amounts of time during which either the GPU or the CPU was doing no work, significantly reducing system throughput and consequently the game's framerate. This behavior also caused the application's CPU thread to stop processing as it waited for the driver thread to synchronize with the GPU, resulting in significant loss of CPU throughput, as shown in the Shark capture in the image below.
In this image, the green horizontal bar is Portal's thread and the pink bar is Apple's OpenGL driver thread. At the point that we've selected on the timeline, you can see from the callstack that we're querying the driver to find out whether an earlier GPU occlusion query has completed. Before the driver can even respond to the game thread, it flushes all of its queued work to the GPU in order to synchronize with the GPU, causing the big gap in the timeline on our thread (the big gap in the green bar), during which our game can do no further processing. The way that the occlusion query mechanism is designed, this CPU-GPU synchronization is unnecessary and, with the new software update from Apple, that big gap in the timeline is gone.
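The non-blocking pattern described above can be modeled with a small simulation. This is a toy sketch, not OpenGL and not our engine code: `SimulatedGpu`, `GlowQuery` and `run_frames` are hypothetical names, and the fixed frame latency is an assumption. In real OpenGL, the corresponding idea is to check `GL_QUERY_RESULT_AVAILABLE` with `glGetQueryObjectuiv` before reading `GL_QUERY_RESULT`.

```python
from collections import deque


class SimulatedGpu:
    """Toy GPU: a query's result becomes available `latency_frames` after it is issued."""

    def __init__(self, latency_frames=1):
        self.latency = latency_frames
        self.frame = 0
        self._queries = {}  # query id -> (frame when ready, value)
        self._next_id = 0

    def issue_query(self, visible_samples):
        qid = self._next_id
        self._next_id += 1
        self._queries[qid] = (self.frame + self.latency, visible_samples)
        return qid

    def result_available(self, qid):
        return self.frame >= self._queries[qid][0]

    def get_result(self, qid):
        ready_frame, value = self._queries.pop(qid)
        if self.frame < ready_frame:
            # A real driver would block here; that is the stall to avoid.
            raise RuntimeError("reading this result now would stall the CPU")
        return value

    def end_frame(self):
        self.frame += 1


class GlowQuery:
    """Keeps using the last known result until a newer one has finished."""

    def __init__(self, gpu):
        self.gpu = gpu
        self.pending = deque()
        self.last_visible = 0

    def update(self, visible_this_frame):
        # Collect finished queries, oldest first, without ever waiting.
        while self.pending and self.gpu.result_available(self.pending[0]):
            self.last_visible = self.gpu.get_result(self.pending.popleft())
        # Issue this frame's query; its result is picked up in a later frame.
        self.pending.append(self.gpu.issue_query(visible_this_frame))
        return self.last_visible


def run_frames(visible_per_frame, latency_frames=1):
    """Return the (possibly stale) sample count the game uses on each frame."""
    gpu = SimulatedGpu(latency_frames)
    glow = GlowQuery(gpu)
    used = []
    for visible in visible_per_frame:
        used.append(glow.update(visible))
        gpu.end_frame()
    return used
```

Calling `run_frames([100, 50, 0])` shows the game consuming each result one frame late while never waiting on the simulated GPU. That staleness is the price of keeping both processors busy, and for a glow fade it is not noticeable.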
Floating Point Validation
Apple has some very nice performance analysis tools that allow us to diagnose performance issues like the occlusion query stall described above. Using these tools, we found another area where the driver spends a significant amount of time: code that validates the floating-point parameters we hand off to OpenGL to drive the logic in our GPU-side shader code. If you're not familiar with the way that processors encode floating point numbers, a few special bit patterns are reserved to represent the exceptional results that can occur when code inadvertently does something nonsensical, like dividing by zero. Unfortunately, because of the way the OpenGL specification is written, Apple must spend valuable CPU time doing floating point validation to guarantee that their OpenGL implementation behaves correctly in these exceptional cases, as required by the OpenGL conformance tests. Of course, in a high performance application like a game, the CPU time spent in Apple's driver validating floating point data can really add up. We have been able to measure performance improvements in this area with the latest software update, but we anticipate even greater speedups if Apple implements the uniform_buffer_object extension and GLSL 1.3 in a future update. With these additional features, we will be able to sidestep this particular CPU bottleneck, allowing us to win back a bunch of CPU time and, ultimately, performance.
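For readers curious what those special bit patterns look like, here is a minimal sketch for 32-bit IEEE 754 floats. It illustrates the kind of per-value check involved and is not Apple's driver code; the function names are made up for the example.

```python
import struct


def float32_bits(x):
    """Return the 32-bit IEEE 754 encoding of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def classify_float32(x):
    """Classify x by its bit pattern: an all-ones exponent is reserved."""
    bits = float32_bits(x)
    exponent = (bits >> 23) & 0xFF  # 8 exponent bits
    mantissa = bits & 0x7FFFFF      # 23 mantissa bits
    if exponent == 0xFF:
        return "inf" if mantissa == 0 else "nan"
    return "finite"


def all_finite(values):
    """The kind of check that must touch every parameter on every upload."""
    return all(classify_float32(v) == "finite" for v in values)
```

A check like `all_finite` is cheap for one value, but a game uploads a large number of shader parameters per frame, so the cost is paid over and over on the CPU.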
We are seeing dramatic performance improvements on iMac (Late 2009 and Mid 2010), Mac mini (Early 2009 and Mid 2010), Mac Pro (Early 2009), MacBook (Early 2009 and Mid 2010), MacBook Pro (15-inch, Mid 2010) and MacBook Pro (17-inch, Mid 2010) models. Depending on the game, video settings and hardware, we have measured frame rate improvements from 15% to 120% on these systems. On older systems, we are generally already operating at the limits of the hardware, so it is not obvious that any significant performance improvements can be achieved in the future.
We're very excited about the performance improvements that Apple and the GPU vendors have been able to deliver this summer and we are working with them to further improve performance.