I’ve been playing with some 2D API ideas built on top of Flash’s Stage3D and Actionscript 3.0. I call it Backstage2D, the GPU augmented flash display list.
Currently, Backstage2D’s code base is mostly a playground for proof of concept of some API ideas. Some stuff in this post may not match the git repo (for example, I’m still using “layer” instead of “surface”). There’s a bunch left to do, but it is working enough to run a modified version of MoleHill_BunnMark that some folks from Adobe put together (I actually lifted most of my GPU code from that example code, heh). The BunnyMark example was adapted from Iain Lobb’s BunnyMark, with some additions from Phillipe Elsass. You can view the Backstage2D version of BunnyMark here (and check out the original BunnyMark MoleHill here).
Fork Backstage2D at GitHub.
The rest of this post describes the thought process that went into Backstage2D.
The Flash AS3 display list API is not the best way to utilize the massively parallel capabilities of a GPU, and deal with the other limitations of a CPU/GPU architecture. The display list’s deeply nestable DisplayObject metaphor, and all the fun filters and blend modes just doesn’t translate well to very parallel, flat GPU hardware renderer. All of this is especially true on mobile like iPhones, iPads and Android devices, and that’s the primary target for Backstage2D.
With an API like the traditional flash display list, it’s easy to create situations that can’t easily be batched due to branching operations and other things which change the GPU state, and break parallel processing – slowing everything down. You see this in Adobe AIR’s GPU render mode, where seemingly random things can have a huge negative impact on performance. Behind the scenes AIR attempts to break the content into batches to speed things up. The use of certain features, or normal features in certain ways can drop you out of a batch. When performance degradation happens, it’s not always clear why. Because of that, to get great performance you must target just a subset of the normal features, and apply a lot of discipline to make sure everything keeps working smoothly.
I wanted do something different. I wanted to play with an API that is intentionally unlike the Flash display list – one designed to help the implementor (Flash developer or designer) understand how to arrange their content, so that it renders very quickly, even on mobile devices – and still get the benefits of all the glorious Flash stuff we are all used to.
Here are some of the primary principles I came up with, which impact the API:
- In order to take advantage of the parallel nature of GPUs, we need to batch many Quads (think, DisplayObject) into batches. The API should make batches easy to understand and use, so there’s no guessing about what’s going on.
- GPUs like shallow content – they draw a lot triangles all at the same time. There is no nesting on the GPU, so while some form of organization is necessary, the infinite nesting model must be reigned in.
- Backstage2D shouldn’t do too much in an automagic kind of way. Guessing about the impact of nesting things a certain way, or using a blend mode, or the performance impact of using certain features translates into extra effort and cost during production because of the unpredictable negative impact features can have on performance. Features should work as you expect them to, and the performance impacts of doing certain things should be clear.
- Think of the GPU as a remote server that you send instructions to. Uploading things like Textures to the GPU from system memory is slow (especially on mobile). Backstage2D should make these stress points clear.
- Flash’s vector engine is tops, and working with Bitmaps (and sprite sheets) sucks! The API should enable the continued use of the display list, in GPU friendly ways. Drawing vector art on the GPU is hard, and is ugly anyway. So leverage the CPU rasterizer, and make sure the API makes the GPU upload bandwidth and render time overhead clear.
- Backstage2D objects shouldn’t look like traditional display list objects – we’ll use names other than Sprite, MovieClip, DisplayObject, etc.
Of these, batching is the starting point, since it is the most necessary for advanced performance, and effects how data must be organized the most. You can draw each Quad (think Flash DisplayObject, or Sprite) individually by setting up the vertex, program, texture, etc. data for each quad, and calling drawTriangle for each and every Sprite. But the GPU can’t optimize to run in parallel if you do that – most of its processing cores end up underutilized in that model.
Batching let’s more than one quad be draw simultaneously, but there are limitations – Every item in a batch must use the same vector data (a giant array of x,y,other data), single texture and other state information, like blend modes. Additionally, the entire batch must be drawn without interruption, which means you can’t insert items from other batches (with other state settings like a different blend mode) in the middle of the batch.
So batches resemble layers, or surfaces. The model for Backstage2D will be a series of stacked surfaces, instead of a deeply nested tree structure starting at the root.
In this paradigm, the surface gets batched, and the children it contains get rendered in parallel – perfect for GPUs. To eliminate batch breaking APIs, certain “state changing” operations can be applied to only an entire layer, not to each element – operations like blend mode settings – or adding and removing elements from a surface. The limitations of surface API should help the implementor understand the impact of doing certain things. If you need to have 100 elements, and every other element has a blend mode of Multiply, while the one below it has a blend mode of Normal – in the traditional Flash API, this is fine, and can actually run pretty well. On the GPU, all 100 elements must be rendered individually in 100 distinct surfaces. Having that many surfaces feels heavy because it is heavy.
Texture changes are one of the things that break batching – a Shader can deal with only one texture (well, actually up to 8 – but that makes the pixel shader more expensive to run), so a set of elements in a batch must be combined into a sprite sheet or texture atlas. If you’ve tried to use a texture atlas in another 2D rendering engine, you may have noticed these are a pain to deal with – and usually it requires setting them up manually before compilation. This is one thing that Backstage2D handles for you – at runtime – in an automagic kind of way.
This feature was actually done for a bunch of reasons. I’d like to add is a resolution (screen DPI) independent measurement mode, where assets get generated on each device an app runs on, from high quality vector art, for exactly the necessary DPI the system is running at, and scaled to real life sizes. Type specified at 12-point, should truly measure at 12-point.
Additionally, Flash vector art looks great (especially with the new 16X16 quality modes), but they look their best when rendered to match the screen exactly. Resizing prerendered vector art can ruin the beautiful anti-aliasing in vector art. Proper sizing can also help performance with older hardware like an iPhone 3GS, which is actually pretty capable, but doesn’t cope well with iPhone 4 retina screen sized material (4x more pixels than will be displayed).
Setting all this up is expensive – especially generating the sprite sheet. But just setting up vector data and loading even predigested textures is already expensive enough that you wouldn’t want to do those tasks while your app is running some smooth animation – it will cause missed frames, and your users will notice. So Backstage2d’s API should guid the user to avoid doing expensive things while an app or animation is running. It exposes a build, load, and/or upload commands per layer. That way, the implementer always knows that what they are doing is computationally expensive (down the road, the plan would be to move much of that into concurrency – more on that another time).
The characteristics of this are very different from normal Flash, which is to load only the minimum of whats needed, when it’s needed, and try to keep as much as you can off the display list. In the Backstage2D model (in the standard surface type anyway), an entire surface, and all it’s children (called “Actors” to avoid colliding with AS3’s “Sprite”, etc.) gets rendered up front to a big TextureAtlas, and stored in memory or on disk. How to optimize and organize your assets to avoid running out of memory becomes an entirely different matter from the way to optimize for the CPU. A surface will have an associated sprite sheet bitmapData asset though, which can be measured.
With these restrictions in mind, the idea would be to create a variety of surface types to suit differing kinds of content. For classic content, a standard static Quads surface (done!), still frame animations (sprite sheet animations – generated at runtime), tweened animations (inverse kinematics – the bone tool), and streaming animations (dynamic MovieClips, large MovieClips, or video) – maybe even some surfaces useful for standard UI, like scroll panes. For more advanced 2D assets, a variety of different mesh layer types could be added (that’s where GPU stage3D programming gets fun!).
I’d love to flesh this out with more features, including an animation subsystem that would include a couple of different Animation display types. Alas, free time is short, and I’ll probably never get to it. But I already spent a lot of time on this (I broke my foot, and was couch bound for a while) so thought I’d share where I got to.