That’s indeed interesting. Here, for reference, is the SDL 1.2 version:
http://www.libsdl.org/cgi/viewvc.cgi/branches/SDL-1.2/src/video/SDL_video.c?view=markup
Where it calls video->FlipHWSurface if it has the SDL_DOUBLEBUF flag…
What do you mean? It seems that in SDL 1.2, it is a hardware page flip (or
do I misinterpret the source there?).
… but the windib driver doesn’t have a FlipHWSurface at all (it’s
NULL, I expect it just doesn’t apply the SDL_DOUBLEBUF flag to any
surface it creates), the x11 implementation is “return(0)” (and
probably doesn’t apply the flag either), and the Quartz driver (on Mac
OS X) seems to have some sort of emulation set up, not really doing
anything in hardware…
Page flipping is a rather ancient method, at this point in time, and
even when it’s available (I think the windx5 SDL driver has it?), it’s
strictly fullscreen, for obvious reasons. This was useful in the era
of DOS games, where all you had was basic VGA register twiddling,
which only included setting the video mode, setting the display offset
(a full screen down, and you have page flipping), and querying whether
we were vblanked (the cathode ray being on and drawing down the
screen, or off and going back up to the top). Oh, and changing the
video memory page, for those cards with more than 64K of video memory
(WOO!!!)…
Since the mid to late 90’s, video cards have had 2D accelerators which
could at least do solid block copies within the video memory, some of
them even with a bit to automatically vsync the command (so you send
the command whenever, and the video card will take care of waiting
until the vblank is on before doing it, and the CPU is free to keep
going with other stuff). So you just draw offscreen, maybe even in a
smaller than fullscreen buffer (for windowed mode), and have
accelerated blits that effectively take zero CPU to move it to the
screen. I’m pretty sure that resolution and 2D accelerator performance
went hand-in-hand well enough that you could pretty much always copy a
full screen from off-screen to on-screen within a vblank cycle, so
this was, for all intents and purposes, as fast as page flipping.
Now, graphics APIs don’t even let us access the video memory at all,
so not only is page flipping rarely useful, it’s rarely available to
us at all.
I didn’t inspect all the code paths, but for most platforms (at least
Mac OS X, X11 and the WinDIB driver, I think), you’re already
effectively getting double buffering, so if you’re doing double
buffering in your program, it’s become triple buffering at that point,
and you’re spending a lot of time moving pixels around…
When we used SDL_Flip in the beginning, I thought of it as a very fast
function, because a surface flip seemed like something that should be
done almost immediately. I think the name of the function leads to such a
thought.
For some systems (I don’t remember the settings anymore; I think hardware
surfaces were a requirement), this was also true.
If “in the beginning” was on Windows at the time SDL had the windx5
driver at higher priority than the windib one, then with fullscreen
and hardware surfaces, you could indeed get correctly working page
flipping.
For some others though, it was not. When we did some profiling, it seemed
that about 30% of the whole runtime of the program was spent in SDL_Flip!
For all those other cases (which are now most of the cases that you
probably care about in SDL 1.2, and all the time in 1.3), this is
now what happens. SDL_UpdateRect takes an amount of time proportional
to the area of the rectangle, so that’s why you can be clever and try
to update a smaller area than the whole screen (which is what the
SDL_Flip fallback does), and save a significant amount of time. A
tile-based scheme is fairly common, but there’s also a (minimal)
overhead per call to SDL_UpdateRect, so you don’t want 1x1 pixel tiles
either; there are diminishing returns as the tiles shrink, until the
per-call overhead costs more than you save.
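To make the tile idea concrete, here is a minimal sketch in plain C, not tied to SDL itself. All names (Rect, DirtyGrid, grid_mark, grid_collect) are hypothetical, and the 640x480 screen and 32-pixel tile size are assumptions picked for illustration. It marks the tiles touched by each draw, then merges horizontal runs of dirty tiles into rectangles, so that the number of update calls stays well below the number of tiles.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names throughout; Rect only mirrors the fields of SDL_Rect. */
enum { TILE = 32, TILES_X = 20, TILES_Y = 15 };  /* assumed 640x480 screen */

typedef struct { int x, y, w, h; } Rect;

typedef struct {
    unsigned char dirty[TILES_Y][TILES_X];  /* 1 = tile needs updating */
} DirtyGrid;

static void grid_clear(DirtyGrid *g)
{
    memset(g->dirty, 0, sizeof g->dirty);
}

/* Mark every tile touched by the pixel rectangle (x, y, w, h). */
static void grid_mark(DirtyGrid *g, int x, int y, int w, int h)
{
    assert(x >= 0 && y >= 0 && w > 0 && h > 0);
    int tx0 = x / TILE, ty0 = y / TILE;
    int tx1 = (x + w - 1) / TILE, ty1 = (y + h - 1) / TILE;
    if (tx1 >= TILES_X) tx1 = TILES_X - 1;  /* clip to the screen */
    if (ty1 >= TILES_Y) ty1 = TILES_Y - 1;
    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            g->dirty[ty][tx] = 1;
}

/* Merge each horizontal run of dirty tiles into one rectangle.
 * Returns the number of rectangles written; stops early if out[] is full. */
static int grid_collect(const DirtyGrid *g, Rect *out, int max_out)
{
    int n = 0;
    for (int ty = 0; ty < TILES_Y; ty++) {
        int tx = 0;
        while (tx < TILES_X) {
            if (!g->dirty[ty][tx]) { tx++; continue; }
            int start = tx;
            while (tx < TILES_X && g->dirty[ty][tx]) tx++;
            if (n == max_out) return n;
            out[n].x = start * TILE;
            out[n].y = ty * TILE;
            out[n].w = (tx - start) * TILE;
            out[n].h = TILE;
            n++;
        }
    }
    return n;
}
```

In an SDL 1.2 program, the collected rectangles could be converted to SDL_Rect and handed to SDL_UpdateRects in one call, then the grid cleared for the next frame. Which tile size works best is something to measure rather than guess.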
That was when we introduced another layer, because I thought I could make a
swap of two SDL_Surface pointers much faster. Now we have a kind of video
post processor, which has two SDL_Surfaces and a pointer to one of those. We
always draw to the pointed-to surface in our main loop. The video post
processor has its own thread, where it blits the (back) surface to the actual
video surface and where it does the SDL_Flip(). (Optionally, it can also do
some post processing; we have different post processors you can use, e.g.
some resampling, or scaling up, or whatever.) Then there is a flip()
function, which just swaps the pointer to the other surface.
This way, the slow SDL_Flip() runs mostly in parallel with the main loop,
and we can also do some post processing, also in parallel with the main loop.
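For readers following along, the scheme described above might be sketched roughly like this. It is not the actual code from the project: the names (PostProcessor, pp_flip, pp_present_once) are hypothetical, a small byte array stands in for each SDL_Surface, and a memcpy stands in for the blit plus SDL_Flip(). The presenter thread itself is left out; pp_present_once is what such a thread would call in its loop.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

enum { FRAME_BYTES = 64 };  /* tiny stand-in for a surface's pixel data */

typedef struct {
    unsigned char buf[2][FRAME_BYTES];  /* the two off-screen "surfaces" */
    int draw;                           /* index the main loop draws into */
    int pending;                        /* 1 when a finished frame awaits display */
    pthread_mutex_t lock;
} PostProcessor;

static void pp_init(PostProcessor *pp)
{
    assert(pp != NULL);
    memset(pp->buf, 0, sizeof pp->buf);
    pp->draw = 0;
    pp->pending = 0;
    int rc = pthread_mutex_init(&pp->lock, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Main loop: the buffer to draw the current frame into.
 * Only the main loop changes pp->draw, so it can read it without the lock. */
static unsigned char *pp_draw_target(PostProcessor *pp)
{
    return pp->buf[pp->draw];
}

/* Main loop: the frame is finished, so swap the roles of the two buffers.
 * This only blocks while the presenter is in the middle of a copy. */
static void pp_flip(PostProcessor *pp)
{
    pthread_mutex_lock(&pp->lock);
    pp->draw = 1 - pp->draw;
    pp->pending = 1;
    pthread_mutex_unlock(&pp->lock);
}

/* Presenter thread, one iteration: copy the finished frame to the screen.
 * Returns 1 if a frame was copied, 0 if there was nothing new. */
static int pp_present_once(PostProcessor *pp, unsigned char *screen)
{
    int copied = 0;
    pthread_mutex_lock(&pp->lock);
    if (pp->pending) {
        memcpy(screen, pp->buf[1 - pp->draw], FRAME_BYTES);
        pp->pending = 0;
        copied = 1;
    }
    pthread_mutex_unlock(&pp->lock);
    return copied;
}
```

Holding the lock for the whole copy keeps the main loop from flipping onto a buffer that is still being read; drawing into the other buffer carries on unblocked in the meantime. If the main loop flips twice before the presenter runs, the older frame is simply dropped.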
If the SDL_UpdateRect (hidden in SDL_Flip) is done in software (like
in many of the useful drivers now, unfortunately), then your strategy
is pretty good. That’s also what the SDL_ASYNCBLIT flag was created for,
which does something along the lines of what you’re doing, using
another thread for the memory copy, but that wasn’t so popular, and
has a few drawbacks… For single CPU/core machines, you still lose
that 30% of CPU time (and it actually probably increases to something
like 40%, because of cache misses and context switches), but instead
of it being spent all in one chunk, it’s automatically intermixed with
the rest of your code (the CPU will blit a little bit, then run a bit
of your game logic, then blit a little bit more, and so on).
For top-notch performance, you want to go with the new “renderer” SDL
1.3 API (stay clear of the SDL_compat.h file!), that uses "textures"
instead of surfaces (they’re the same kind of thing, really, not much
to do with 3D or anything, it’s just a name for “the new fancypants
surfaces”), and use the SDL_TEXTUREACCESS_STATIC flag on as many
textures as possible (you can’t access the pixels, but the accelerator
has a lot more flexibility to do whatever it takes to put it in video
memory and use hardware acceleration, like encode it in a special
format or whatever). This will have more variability than the SDL 1.2
API (or the “compat” API in 1.3), where it could give you 500 fps on
one machine and only 30 on another, but in any case, it should not be
worse than the 1.2 API performance, which did its best to emulate a
crummy old DOS-era state of things that didn’t really map all that
well and usually ended up doing things in software…

On Fri, Mar 27, 2009 at 6:54 AM, Albert Zeyer <albert.zeyer at rwth-aachen.de> wrote:
–
http://pphaneuf.livejournal.com/