It’s not “ordering by Z and texture” but “grouping by Z and texture”. Every render with a Z of 1 will get sent before every render with a Z of 2, and so on. That’s why I said you end up with an array of multimaps.
Not exactly. The optimization assumes that the principal rendering bottleneck is the overhead involved in sending scene data to the graphics card, an assumption that is borne out by testing data. It intends to minimize the number of drawing calls by delaying primitives and sending them in batches, ordered by Z and texture.
You avoid having to cache “the entire GL state” by the simple expedient of flushing the to-do buffer if a call comes in that changes the GL state. All you need to keep cached is the map of textures to arrays of coordinates.
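To make the flush rule concrete, here is a minimal sketch in C. The `Queue` type and the `queue_*` functions are hypothetical names made up for illustration (not SDL or OpenGL API), and a single integer blend mode stands in for whatever GL state a real renderer would track.

```c
#include <assert.h>

enum { MAX_PENDING = 128 };

/* Hypothetical to-do buffer: only counts what a real one would store. */
typedef struct {
    int blend_mode;   /* stand-in for the GL state that affects drawing */
    int pending;      /* primitives queued but not yet submitted */
    int flushes;      /* number of batch submissions so far */
} Queue;

/* Submit everything queued so far as one batch. */
static void queue_flush(Queue *q)
{
    if (q->pending == 0)
        return;                 /* nothing queued, nothing to send */
    /* A real backend would hand its vertex arrays to the GPU here. */
    q->flushes++;
    q->pending = 0;
}

/* A state change flushes first, so queued primitives draw under the old state. */
static void queue_set_blend_mode(Queue *q, int mode)
{
    if (mode == q->blend_mode)
        return;                 /* state unchanged: keep batching */
    queue_flush(q);
    q->blend_mode = mode;
}

/* Queue one primitive; flush early only if the buffer is full. */
static void queue_draw(Queue *q)
{
    assert(q->pending <= MAX_PENDING);
    if (q->pending == MAX_PENDING)
        queue_flush(q);
    q->pending++;
}
```

Because every queued primitive was recorded under the state that is current at flush time, nothing about the state needs to be stored per primitive.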
And transparency works fine as long as you have a Z parameter to order by. Things get drawn on top of each other in the prescribed order. I’ve been using this for a while now. The system works.
From: John <john at leafygreengames.com>
To: sdl at lists.libsdl.org
Sent: Monday, April 15, 2013 8:36 PM
Subject: Re: [SDL] External dependencies in the renderer?
Ok, so the optimization assumes that a rendering bottleneck is the cost of switching textures, and intends to minimize the number of texture switches by delaying primitives, then re-ordering them by texture and Z.
I’ve seen this before. It can be done, but there are caveats. The biggest challenge is you need to cache the entire GL state for each delayed primitive. The implementation is effectively an “intermediate mode” layer unto itself. The layer is a massive todo buffer with three phases: queue everything, analyze (re-order) the queue, then execute the queue as a batch. If you don’t choose the batch size wisely, it’s possible to lose any parallelism that you might have had when GL calls were mixed in with scene graph calls. The second challenge is to support transparency and other effects that depend on multiple passes in a specific order, or that play games with the z-buffer (or other tests).
On 04/15/2013 10:41 PM, Mason Wheeler wrote:
Here’s the basic idea.
The internals of SDL’s rendering API are atrocious, to put it bluntly. It does everything in Immediate Mode, which modern versions of OpenGL and Direct3D have moved away from because it’s so slow. GLES doesn’t even support Immediate Mode, so if you look at SDL’s GLES renderer, it does the closest thing it can find to Immediate Mode, sending one call to OpenGL every time someone calls SDL_RenderCopy.
The way to do rendering fast is to keep the number of library calls to a minimum, and pass as much data as possible all at once in an array. Of course, that’s not the way people use SDL; they use SDL to draw a bunch of sprites, one at a time. So to be fast, SDL has to keep track of the bookkeeping for them.
The way to do this is with a multimap, mapping textures to lists of drawing coordinates. You turn SDL_RenderCopy into an operation that adds a pair of rects to a texture’s mapped list, and SDL_RenderPresent into an operation that iterates over the multimap and, for each texture, builds two arrays of vertices (one for screen coordinates and one for texture coordinates) as buffers and passes them to the renderer all at once.
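A rough sketch of that idea in C might look like the following. Everything here is hypothetical and invented for illustration: `Rect`, `TextureBatch`, `BatchRenderer`, `batch_copy` and `batch_present` are not SDL names, textures are plain integer ids, and a linear search over a small fixed array stands in for the hash-based multimap. The "draw call" is only counted, since the actual submission depends on the backend.

```c
#include <assert.h>

/* Hypothetical stand-ins; none of these are SDL types. */
typedef struct { float x, y, w, h; } Rect;

enum { MAX_TEXTURES = 16, MAX_QUADS = 64 };

/* One multimap entry: a texture and every rect pair queued for it. */
typedef struct {
    int  texture_id;
    int  quad_count;
    Rect src[MAX_QUADS];    /* texture-space rects */
    Rect dst[MAX_QUADS];    /* screen-space rects  */
} TextureBatch;

typedef struct {
    TextureBatch batches[MAX_TEXTURES];
    int batch_count;
    int draw_calls;         /* counts simulated submissions */
} BatchRenderer;

/* Linear search used for brevity; a hash lookup would replace this. */
static TextureBatch *find_or_add_batch(BatchRenderer *r, int texture_id)
{
    for (int i = 0; i < r->batch_count; ++i)
        if (r->batches[i].texture_id == texture_id)
            return &r->batches[i];
    assert(r->batch_count < MAX_TEXTURES);
    TextureBatch *b = &r->batches[r->batch_count++];
    b->texture_id = texture_id;
    b->quad_count = 0;
    return b;
}

/* Plays the role of SDL_RenderCopy: records the rect pair, draws nothing. */
static void batch_copy(BatchRenderer *r, int texture_id, Rect src, Rect dst)
{
    TextureBatch *b = find_or_add_batch(r, texture_id);
    assert(b->quad_count < MAX_QUADS);
    b->src[b->quad_count] = src;
    b->dst[b->quad_count] = dst;
    b->quad_count++;
}

/* Expand a rect into four corner vertices, written as x, y pairs. */
static void rect_to_vertices(Rect q, float *out)
{
    out[0] = q.x;       out[1] = q.y;
    out[2] = q.x + q.w; out[3] = q.y;
    out[4] = q.x + q.w; out[5] = q.y + q.h;
    out[6] = q.x;       out[7] = q.y + q.h;
}

/* Plays the role of SDL_RenderPresent: one submission per texture. */
static void batch_present(BatchRenderer *r)
{
    static float screen_verts[MAX_QUADS * 8];
    static float tex_verts[MAX_QUADS * 8];

    for (int i = 0; i < r->batch_count; ++i) {
        TextureBatch *b = &r->batches[i];
        for (int q = 0; q < b->quad_count; ++q) {
            rect_to_vertices(b->dst[q], &screen_verts[q * 8]);
            rect_to_vertices(b->src[q], &tex_verts[q * 8]);
        }
        /* A real backend would bind the texture and draw both arrays here. */
        r->draw_calls++;
    }
    r->batch_count = 0;     /* start the next frame with an empty map */
}
```

With this shape, three copies spread over two textures produce two submissions at present time instead of three.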
I’ve got a Delphi implementation that sped up my rendering significantly, about 3x faster than stock SDL rendering. With a multimap in C, I could port this concept to the SDL internals.
The one tricky thing here, the concept that my renderer has that SDL doesn’t, is Z-order. If you’re no longer deterministically drawing in the order in which draw calls are received, but instead grouping them by texture, which are in turn sorted by hash order (essentially random), you need a Z-order parameter to make sure the right things draw on top of the right things, and what you end up with is an array of multimaps.
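One way to picture the “array of multimaps” is the sketch below, again in C and again with hypothetical names (`Layer`, `ZRenderer`, `z_copy`, `z_present`). It assumes a small fixed range of integer Z values and uses insertion order within a layer in place of hash order. Instead of drawing, it logs the order in which the groups would be submitted.

```c
#include <assert.h>

enum { MAX_Z = 4, MAX_TEXTURES = 8, MAX_LOG = 32 };

/* Quads queued for one texture within one Z layer. */
typedef struct { int texture_id; int quad_count; } Group;

/* One "multimap": texture -> queued quads (only counted here). */
typedef struct {
    Group groups[MAX_TEXTURES];
    int   group_count;
} Layer;

typedef struct {
    Layer layers[MAX_Z];        /* the array of multimaps, indexed by Z */
    int   log_z[MAX_LOG];       /* submission order, for inspection */
    int   log_texture[MAX_LOG];
    int   log_count;
} ZRenderer;

/* Queue one quad for the given texture at the given Z. */
static void z_copy(ZRenderer *r, int z, int texture_id)
{
    assert(z >= 0 && z < MAX_Z);
    Layer *l = &r->layers[z];
    for (int i = 0; i < l->group_count; ++i) {
        if (l->groups[i].texture_id == texture_id) {
            l->groups[i].quad_count++;
            return;
        }
    }
    assert(l->group_count < MAX_TEXTURES);
    l->groups[l->group_count].texture_id = texture_id;
    l->groups[l->group_count].quad_count = 1;
    l->group_count++;
}

/* Walk layers from lowest Z to highest; one submission per texture group. */
static void z_present(ZRenderer *r)
{
    for (int z = 0; z < MAX_Z; ++z) {
        Layer *l = &r->layers[z];
        for (int i = 0; i < l->group_count; ++i) {
            assert(r->log_count < MAX_LOG);
            r->log_z[r->log_count] = z;
            r->log_texture[r->log_count] = l->groups[i].texture_id;
            r->log_count++;
        }
        l->group_count = 0;     /* layer is empty for the next frame */
    }
}
```

Texture order inside a layer is arbitrary, but everything at a lower Z is always submitted before anything at a higher Z, which is what keeps overlapping sprites stacked correctly.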
I know it probably sounds very complicated, but it’s only a few hundred lines of code (plus the implementations of the hash and the dynamic array, because C doesn’t have them built in) and it makes rendering much faster.
Mason