OK. This is one of the longest messages ever posted to the mailing list…
So, enjoy it.UNFORTUNATELY… I’ve now replaced a 6 line macro with a
180+ line macro (Ouch!)Argh. This cannot be the answer, too many branches that stall your
pipeline (not being regular enough to be predicted).I was thinking, “Holy carp, bad code” myself
OK… on alpha blitting, here’s some extracts from the Kosmos Online (free
MMORPG project; ask me if you want to help) archives on alpha blitting. The
first message is by Quetzal, following up with one by me, then one by
Quetzal, then another one by Quetzal where he “bows down before my idea”
My apologies for bad formatting; keep in mind that the times are only
accurate for Quetzal’s PPC machine. Don’t know how well it’d turn out on an
x86.----- Message #1 -----
Seeing how much optimization is a hot topic around here, I’m happy to
announce that I have sped up the 2d alpha blitting code I sent to the list
last week by almost 40%.
Last week it ran at about 7.5 million pixels/second. This week it goes
about 10.4 million pixels/second. (on 233Mhz G3 PPC)
Original code did the following roughly:for each pixel read in source pixel
read in destination pixel
break up source pixel into red, green, blue, and alpha component
break up destination pixel into red, green, and blue
multiply source compoments by source alpha
multiply destination compoments by 255-source alpha
shift and mask until it was back into a pixel store pixel into
destinationNow the realization I had was that the source components in a given pixel
were always multiplied by the same value (the source alpha). So instead
why not precompute that math! When the images are loaded (from PNG files
because PNG has great alpha channel support) each pixel is loaded, teh
components are broken out, the components are multipleid by the alpha, and
then stored back in. The alpha is replaced by 255-alpha. (actually, I
multiply by the source components by alpha+1… this is a slight
improvement over the older version which never resulted in pixel valuesof
255).
Now to blit we dofor each pixel read in source pixel read in destination
pixel
get alpha from source pixel (one shift)
get red, green, and blue from destination pixel
multiply red,green,and blue components by alpha
shift and mask them back into a pixel
destination = source_pixel + destinatino_pixel
Note that one add instruction can simultaneously add all components of the
pixel without fear of overflow (the previous multiplies and hifts insures
that none of the individual componetns added will be greater than 255).
This results in garbage being stored in the alpha component of the
destination pixel, but the original version just stored 0 there anyway.
It shouldn’t cause any problems and there is no real point to storing 0 or
0xff in the space. Typically, the destination pixel is in a scratch
buffer, being built up before being sent to the video memory.----- Message #2 -----
As I understand it, this is quite a clever algorithm and has probably
already been done, actually. My thoughts are:1) It's important that the inner loop which handles the pixel component
seperation be pipeline optimized (for Pentiums) as much as possible. I think
I got it down to 7 cycles once, but I can’t remember. Are there any MMX
instructions which handle this sort of thing? I think there might be.[[author’s note: still haven’t looked into this, and we probably don’t want
to optimize this way anyways; not until we get more stuff stabilized, then
we can start looking at wierd chipset-dependant optimizations (well, more of
them)]]2) Why not precompute that color splitting function, so that you have 3
64k tables of components and just do a look up? 192k of memory isn’t much of
an issue, although I’m not sure if there will be a performance gain on all
platforms… depends how heavily optimized the shifts are. You’ll have to
tell me on the PPC platform, I think it would still be a bit faster on the
x86… this is just a thought.----- Message #3 -----
1) It's important that the inner loop which handles the pixel
component
seperation be pipeline optimized (for Pentiums) as much as possible. I
think
I got it down to 7 cycles once, but I can’t remember. Are there any MMX
instructions which handle this sort of thing? I think there might be.I’m sure that this can be optimized by using MMX. At least, it should
be. As soon as I get a G4 processor, I’ll do the Altivec optimizations
myself drool. I’ll leave Pentium pipelines and MMX optimizations to
those who care grin2) Why not precompute that color splitting function, so that you have
3
64k tables of components and just do a look up? 192k of memory isn’t much
of
an issue, although I’m not sure if there will be a performance gain on all
platforms… depends how heavily optimized the shifts are. You’ll have to
tell me on the PPC platform, I think it would still be a bit faster on the
x86… this is just a thought.I’m confused. Are you referring to extracting the components from the
original pixel? Or the multiplication and shifting of the components by
the alpha value? Let me see… three 64k tables…I can’t see how tables could improve the dividing up of the pixel into
components. If you mean using tables instead of multiples and shifts,
then the tables would need to be 64k*4 or 262,144 bytes each. Or 756,432
bytes for all three. Then the destination pixel would be calculated asdestination = source + (table1[red][alpha] | table2[green][alpha] |
table3[blue][alpha])If the tables just contained bytes, thus 64k tables, then you would need
something likedestination = source + (table1[red][alpha]<<16 | table2[green][alpha]<<8
| table3[blue][alpha])Which doesn’t seem like much of a gain. Might be worth a try.
Here is the existing code (init stuff snipped)
#define kRedComponent(pix) ((pix>>16)&0xff)
#define kGreenComponent(pix) ((pix>>8)&0xff)
#define kBlueComponent(pix) ((pix>>0)&0xff)
#define kAlphaComponent(pix) ((pix>>24)&0xff)… for (j=0; j<src_height;
j++) {
for (i=0; i<len_x; i++) { unsigned int src_pix = src_data++;
unsigned int dst_pix = dst_data;
unsigned int dA = kAlphaComponent(src_pix);
unsigned int dR = kRedComponent(dst_pix);
unsigned int dG = kGreenComponent(dst_pix);
unsigned int dB = kBlueComponent(dst_pix); dR = (dAdR) & 0xFF00;
dG = (dAdG) & 0xFF00; dB = (dA*dB);
dst_pix = (dR<<8) | (dG) | (dB>>8); *dst_data++ = src_pix +
dst_pix;
} src_data += src_add_x; dst_data += dst_add_x; }On the PowerPC, a shift and a mask can be done in one instruction “rlwinm”
(Rotate Left Word Immediate then AND with Mask (I kid you not)). So the
kXComponent macros above reduce to one register to register assembly
instruction. On the Intel chips it is probably a win to remove the mask
from the kAlphaComponent definition…#define kAlphaComponent(pix) (pix>>24)
pix should be an unsigned int so correct code will be generated. (zero
fill left bits)----- Message #4 (OK, forgot about this one) -----
Well, I wasn’t convinced about using a lookup table. But I went ahead and
coded one anyway. My first crack at it resulted in a speed up from 10.4
million pixels/second to 11.35! Tweaked it a bit more and got it up to
12.6 million pixels/second. Nicholas, I bow down before your idea! Even
if what I did isn’t exactly what you had in mind. Here is what I havenow.static unsigned char table[256][256] = {{ (00)>>8, (01)>>8, (02)>>8,
… },
{ (10)>>8, (11)>>8, (12)>>8, …},…};
unsigned int src_pix = *src_data++; unsigned int dst_pix =
*dst_data;
unsigned int dR = kRedComponent(dst_pix);
unsigned int dG = kGreenComponent(dst_pix);
unsigned int dB = kBlueComponent(dst_pix);
unsigned char b = ((unsigned char)table) + ((src_pix>>16)&0xff00);
dR = b[dR]; dG = b[dG]; dB = b[dB];
dst_pix = (dR<<16) | (dG<<8) | dB; *dst_data++ = src_pix +
dst_pix;I can skip a step by getting dA by ((src_pix>>16)&0xff00) instead of
((src_pix>>24) and then later src_pix<<8.
the dA indexing can be shared by the other three lookups. The total size
of the table is only 64k, only one table. (easy to fit in L2 cache on any
modern CPU, and maybe even mostly in L1 cache). The three red green and
blue references exhibit some locality of reference (all within the same
256 byte region in the lookup table)Any more good suggestions? The original code did 7.5 million pixels per
second. If we eek it up to 15 million pixels/second I’ll be in heaven!
(The theoretical maximum is 18 million pixels/second because this is how
long it takes to do two reads, an add, and a store!!)
It is currently at 12.6 million pixels/second. Filling a 640x480 screen
this way could be done 41 times/second. (times on my 233Mhz G3 PowerPC,
512k L2 cache running at 116Mhz, 32k L1 data cache)----- Message transcripts end -----
Food for thought, Loren. IMO 12.5 million pixels/sec ain’t bad for the sort
of hardware he was using. My machine can probably handle about 80fps @
640x480 this way.Cheers,
Nicholas
----- Original Message -----
From: f91-men@nada.kth.se (Mattias Engdegard)
To: sdl at lokigames.com
Date: Thursday, May 04, 2000 5:32 PM
Subject: Re: [SDL] Bug in Alpha blitting???
static unsigned char table[256][256] = {{ (00)>>8, (01)>>8, (0*2)>>8,
This is on the large side, since it’ll blow most L1 caches… Being symmetric,
it can be cut in half, but at the cost of a branch. Maybe worth a try.
The special case of blending 2 surfaces with constant alpha=0.5 is particular
amenable to optimization, and you can even do 2 (or 4, even 8) pixels at the
same time, in portable C.
Good point though; table-based optimization shouldn’t be underestimated,
even in these days of memory bandwidth worship.
blitting???)
static unsigned char table[256][256] = {{ (00)>>8, (01)>>8, (0*2)>>8,
This is on the large side, since it’ll blow most L1 caches… Being
symmetric,
it can be cut in half, but at the cost of a branch. Maybe worth a try.Good point about it being symmetrical; but you’re right, there were some
little side threads out there that were all about cache misses/hits. Quetzal
actually thought that it would be slower, but in reality it seems to
survive. IMO it’s a tradeoff… cache misses vs. slower calculations. On
the PPC at least cache misses win.I don’t know what we can do about the branching… also a good idea.
The special case of blending 2 surfaces with constant alpha=0.5 is
particular
amenable to optimization, and you can even do 2 (or 4, even 8) pixels at
the
same time, in portable C.Yes, this is the classic alpha-blend. IMO a separate blitter might be
useful?Good point though; table-based optimization shouldn’t be underestimated,
even in these days of memory bandwidth worship.Absolutely! lookup tables are one of the best things you can do for your
source… especially if you can avoid cache misses. I see people who
calculate their sines, each time, by hand, and I want to throttle them.Nicholas
----- Original Message -----
From: f91-men@nada.kth.se (Mattias Engdegard)
To: sdl at lokigames.com
Cc: vining at pacificcoast.net
Date: Thursday, May 04, 2000 6:30 PM
Subject: Re: Alpha Blitting with Tables (was: Re: [SDL] Bug in Alpha