Frank Ramsay wrote:
Darrell Johnson wrote:
[snip, a whole bunch of stuff about inline assembly and linear memory access]
Just a quick comment about using assembly: I’m working on an x86 CPU. If I use
assembly, then someone on a PPC CPU can’t use my code. Portability
is a huge part of the reason I’ve chosen SDL/C to code in.
A few little snippets can make a big difference. Certainly you should
write a portable C version first, but if you optimize one platform at a
time with maybe a couple hundred lines of assembly, you can get big
performance boosts for a small price.
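One common way to keep that split tidy is a guarded override: a portable C version that a platform-specific assembly path replaces only where it applies. This is just an illustrative sketch (the `USE_ASM` flag and `copy_pixels` name are mine, not from either poster); every other platform compiles the C fallback.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable C first, per-platform assembly later: the asm path is only
 * compiled in where it applies, and USE_ASM is a hypothetical opt-in flag. */
#if defined(__i386__) && defined(USE_ASM)
static void copy_pixels(uint8_t *dst, const uint8_t *src, size_t n)
{
    /* x86-only path, written in GCC extended inline asm */
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
}
#else
static void copy_pixels(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);   /* the portable version every platform gets */
}
#endif
```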
As for the fellow who was talking about using this type of optimization
in a Starcraft-type game, think for a second: how much of the screen do
you think you could typically reuse in Starcraft? Maybe a lot, in the
beginning phase, but when you get to the exciting major battles where
you want a high frame rate, the screen is filled with sprites and you
can’t reuse any of it. An inconsistent frame rate is worse than a plain
low frame rate; there’s nothing worse than thinking your machine
is fast enough to run a game at a certain resolution, then having it all
fall apart just when it gets interesting.
I didn’t say that a game should run at the fastest possible frame rate; you
put the screen refresh on a timer so it happens at 18(ish) fps on any speed of
computer, and use the other CPU cycles to do the business of running the game.
And you plan for a slow computer.
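As a sketch of that timer idea (the helper name and the ~55 ms interval for 18 fps are illustrative, not from either poster): a small function decides when the next frame is due, and everything between due frames goes to game logic.

```c
#include <stdint.h>

/* Hypothetical frame pacer: given a millisecond clock (SDL_GetTicks
 * would do), report when the next frame is due.  18 fps is roughly a
 * 55 ms interval; cycles between due frames run the game itself. */
static int frame_due(uint32_t now_ms, uint32_t *last_ms, uint32_t interval_ms)
{
    if (now_ms - *last_ms >= interval_ms) {
        *last_ms += interval_ms;  /* advance by the interval, not to 'now',
                                     so the long-run rate stays exact */
        return 1;
    }
    return 0;
}
```

In a real SDL loop you would feed this SDL_GetTicks() each iteration and redraw only when it returns nonzero.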
Please bear in mind that I wasn’t just responding to you. I believe
that only 3d games look better at higher frame rates, where they generate
sort of a pseudo-motion-blur effect as your eye superimposes multiple
frames. If 24 FPS is fast enough for a movie, it’s fast enough for a
realtime strategy game (I think 18 FPS might be a bit slow). 2d games
use pre-drawn animation, which is never animated at 60 FPS, so generally
at high frame rates you’re just drawing the same screen multiple times.
Nonetheless, wouldn’t it be annoying to find that your computer couldn’t
keep up with that 18(ish) fps when the screen filled up, even though it
worked okay before that? Dirty rectangle strategies can give you kind
of a false sense of security, setting you up for a rude shock when you
see the performance you end up with from scenes where most of the screen
needs updating. It’s best to make your updating as fast as possible,
then consider these kinds of strategies. In some cases (especially
older computers), you will get a performance gain from dirty rectangles,
but the tile and sprite drawing are more fundamental operations, and
they should be optimized first, lest your higher-level optimizations
hide their crustitude.
As for maintaining a reusable
background layer, you are talking about a whole extra full-screen blit
per frame! This is slower than an efficient full regeneration,
full-screen blit? (OK, I want to make sure we are on the same page here; I use the
term blit to refer to putting data on the actual video surface,
not into a buffer.)
You seem to assume that I’d copy the sprites directly to the screen (or to a HW buffer).
I don’t; it’s too many RAM-to-video-memory accesses. You stated (I believe) quite correctly that
video->video copying is faster than RAM->video. Well, RAM->RAM is faster still. Build your
entire display in RAM and do a single loop of memcpys to put it into video memory.
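A minimal sketch of that build-in-RAM-then-push approach, assuming an 8-bit surface whose pitch (bytes per scanline) may be wider than the visible width; the pitch is exactly why it ends up as a loop of per-row memcpys rather than one big one. All names here are mine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compose the whole frame in a system-RAM back buffer, then push it to
 * the (possibly pitched) video surface in one pass.  'pitch' is bytes
 * per video scanline, which may exceed the visible width. */
static void present(uint8_t *video, size_t pitch,
                    const uint8_t *backbuf, size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++)
        memcpy(video + y * pitch, backbuf + y * width, width);
}
```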
Blit is a phonetic spelling of BLT, short for BLock Transfer. A
memcpy, in C terms. I did not assume anything about whether you were
using system or video memory; BTW, I’m not entirely sure that RAM->RAM
is still significantly faster than RAM->Video, and there are many cases
in which Video->Video is not any faster than RAM->Video.
The problem with block transfers is that with modern systems you can
perform several cycles of computation between each memory access.
A memcpy (on 80x86) basically compiles to a “REP MOVSD” plus setup code,
which blits as fast as is possible. The problem with this is that it
wastes all those extra cycles you could use. While it is as fast as
possible, an assembly-coded copying loop is just as fast, with cycles to
spare (whether C loops can keep up is entirely up to the quality of the
optimizing compiler). This wasn’t true with older computers, on which
the “REP MOVSx” worked faster.
You can use those extra cycles to intelligently switch between copying
sources and theoretically draw the screen as quickly as you can copy it
(though this will, as I said, cause some cache misses, the problem
shouldn’t be too severe, especially if you don’t have a great many
different tiles).
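A toy version of that source switching, for one scanline covered by a single sprite span (a real renderer would walk a sorted span list; all names here are illustrative):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Instead of blitting the background and then blitting the sprite over
 * it, switch sources mid-copy so every destination pixel is written
 * exactly once. */
typedef struct {
    size_t x, w;           /* where the span starts, and its width */
    const uint8_t *pixels; /* the sprite's pixels for this scanline */
} Span;

static void render_scanline(uint8_t *dst, const uint8_t *bg,
                            size_t width, const Span *s)
{
    memcpy(dst, bg, s->x);                       /* background up to the span */
    memcpy(dst + s->x, s->pixels, s->w);         /* the sprite span itself */
    memcpy(dst + s->x + s->w, bg + s->x + s->w,  /* background after it */
           width - (s->x + s->w));
}
```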
btw, the groundlayer->buffer move is:
memcpy(backBuffer,groundBuffer,_bufferSize);
Wouldn’t it more typically be two of these, adding up to the
groundbuffer size? (don’t get me wrong, there wouldn’t be any
significant performance difference, the setup overhead is light, I’m
just nitpicking) Or do you not let the groundbuffer get split?
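For concreteness, here is what a split ground-buffer copy might look like if the cached ground layer were kept as a ring that scrolling merely offsets (a hypothetical layout, not necessarily Frank’s): the two memcpys together total exactly one row.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One destination row filled by two memcpys whose lengths sum to
 * row_bytes: from the scroll offset to the wrap point, then the
 * wrapped-around remainder from the start of the ring. */
static void copy_ground_row(uint8_t *dst, const uint8_t *ground,
                            size_t row_bytes, size_t offset)
{
    size_t first = row_bytes - offset;
    memcpy(dst, ground + offset, first);  /* from the offset to the wrap */
    memcpy(dst + first, ground, offset);  /* the wrapped-around remainder */
}
```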
I don’t think you can write a full-screen regeneration that is faster than that.
Certainly not faster, but nearly as fast (the difference being cache
misses from switching from one tile to another). Where the full
regeneration could be faster is that you can combine the sprite drawing
step with it. I’m assuming the backbuffer is a system RAM
double-buffer, which is then blitted into the video ram during the
retrace. If you’re not writing a bit here and a bit there, you can
efficiently write to a true page-flipper and drop the double-buffer blit
(if it’s available, which it should usually be).
The really great thing about a linear memory access renderer is that in
many cases you’ll be fast enough to do your rendering directly onto the
front buffer during the retrace (heresy! another case where fast
computers change the rules); of course, you have to test to be sure,
unless you want flickering. After all, if there’s more than enough
bandwidth to do a back-buffer blit to the screen each retrace and CPU
cycles to spare, why not render on the fly?
I have to admit, though, that I was thinking of something a little
different. There is not always an extra blit added by your method, but
it still holds true in some cases, if you
are using tiles (if you are using some sort of voxel engine or more
are using tiles (if you are using some sort of voxel engine or more
complex 3d system which isn’t well suited to hardware acceleration, you
just might find this kind of strategy worthwhile; but I was never
talking about that stuff).
Nope, isometric tiles are what I’m talking about; at least we’re on the same
page. I would have hated to think this debate came to nothing because we
were talking about different things.
A custom linear renderer is especially beneficial for non-square tiles.
Remember that writing one or two bytes is as slow as writing four
word-aligned bytes (on a 32 bit computer), so if you aren’t writing
pixels in aligned words, you’re slowing your writes down to a half or
quarter speed.
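A sketch of that alignment discipline for 8-bit pixels (my own spelling of it, not code from the thread): write the ragged leading and trailing bytes individually, and the aligned middle four bytes per store. The memcpy of a 4-byte word is the aliasing-safe way to ask the compiler for a single 32-bit store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill a span of 8-bit pixels: odd edge bytes one at a time, the
 * aligned middle in 32-bit word stores. */
static void fill_span(uint8_t *dst, size_t len, uint8_t color)
{
    uint32_t word = color * 0x01010101u;   /* pixel replicated four times */
    while (len && ((uintptr_t)dst & 3)) {  /* bytes up to 4-byte alignment */
        *dst++ = color; len--;
    }
    while (len >= 4) {                     /* the fast aligned middle */
        memcpy(dst, &word, 4);             /* compiles to one 32-bit store */
        dst += 4; len -= 4;
    }
    while (len) {                          /* ragged tail */
        *dst++ = color; len--;
    }
}
```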
I didn’t originally set out to prove the utter uselessness of background
caching and dirty rectangle methods, and I still don’t mean to.
However, I do want to make it clear that they do not /necessarily/ mean
a performance gain, and they severely limit the capabilities of the
engine (remember that with full-redraw, the background could just as
easily be made up entirely of animated tiles, and would take a minimal
performance hit from having huge numbers of sprites). Often it’s easier
to succumb to the temptation to write a clever and interesting
high-level optimization than it is to grind through the relative tedium
of cutting the fat away from basic operations. The very fastest engines
use both, but you can guess which one I would say comes first.
Linearizing your renderer is an unsexy process that involves computer
science fundamentals and hard concentration, not just slapping together
a few clever recipes. You have to sort your sprites efficiently and
maintain an ordered list that is accurate as you move through each
scanline, keep track of which tile you’re in and which sprites you’re
over, all with the utmost concern for efficiency to keep the overhead
down to a few extra cycles per pixel. It’s pretty trivial compared to
efficient 3d, of course, but much more than I see most doing. There are
endless thousands of library jockeys and cut-and-pasters who can make a
game run by gluing this bit to that, but very few true optimizers who
can get their algorithms and data structures right and code them
properly to squeeze the best possible performance out of their target
platforms. It’s a matter of practice; too many programmers are
impressed with their own ability to get things to work, and never
develop their ability to make really efficient code. They won’t develop
this ability with quick fixes and clever tricks.
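The sorted-list bookkeeping described above can be sketched in miniature (all of this is my illustration, not code from the thread): sort sprites by top edge once, then while walking scanlines, admit each sprite as its top arrives and retire it once its bottom has passed. The fixed-size active array is only for the sketch.

```c
#include <stdlib.h>

/* Per-scanline sprite bookkeeping.  Returns how many sprites are
 * active on the last line walked; a renderer would consult the active
 * list on every line instead. */
typedef struct { int top, bottom; } SpriteY;   /* bottom is exclusive */

static int cmp_top(const void *a, const void *b)
{
    return ((const SpriteY *)a)->top - ((const SpriteY *)b)->top;
}

static int active_after_walk(SpriteY *s, int n, int last_line)
{
    qsort(s, n, sizeof *s, cmp_top);  /* sort by top edge once, up front */
    int active[64];                   /* indices covering the current line */
    int n_active = 0, next = 0;
    for (int y = 0; y <= last_line; y++) {
        while (next < n && s[next].top <= y)   /* admit newly started */
            active[n_active++] = next++;
        for (int i = 0; i < n_active; ) {      /* retire finished */
            if (s[active[i]].bottom <= y)
                active[i] = active[--n_active];
            else
                i++;
        }
    }
    return n_active;
}
```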
It’s kind of depressing to see how many 2d games are as slow or slower
than 3d ones, even when the 2d engine is relatively low-res and very
simple. The reason is clear: most good graphics optimizers work in 3d,
and most of the professional 2d graphics programmers aren’t very good,
still working from the outdated notes the good ones made before they
moved to 3d.
Of course, full linear rendering locks you out of using such strategies
as dirty rectangles, and I wouldn’t really call it a totally fundamental
improvement, but optimizing your tile and sprite drawing even as
isolated functions should be tried before higher-level optimizations
(geez, I’m starting to sound like a broken record).
Cheers,
Darrell Johnson
P.S. I know, I said I’d drop the subject, but it looks like someone here
is interested in what I have to say, and it would be rude not to reply.