Alpha pixel routine

Hi, this is my current alpha pixel routine and it is SLOW. Does anybody have
any optimization ideas? I don’t want to use assembler.

The table covers color values 0 to 255 and alpha 0.0 to 1.0 in 0.1 increments.

inline void AlphaPixel(int x, int y, Uint32 col, int alpha, SDL_Surface
*sur)
{
int a;

unsigned int sR, sG, sB, dR, dB, dG;
Uint32 dest = *((Uint32 *)sur->pixels + y*sur->pitch/4 + x);
sR = (col >> 16) & 0xff;
sG = (col >> 8) & 0xff;
sB = (col) & 0xff;
dR = (dest >> 16) & 0xff;
dG = (dest >> 8) & 0xff;
dB = (dest) & 0xff;

a = 10 - alpha;

if (sR == 0) { dR = 0; } else { dR = (table[sR][alpha]); }
if (sG == 0) { dG = 0; } else { dG = (table[sG][alpha]); }
if (sB == 0) { dB = 0; } else { dB = (table[sB][alpha]); }

if (a > 0)
{
dR += (table[dR][alpha]);
dG += (table[sG][alpha]);
dB += (table[sB][alpha]);
}

*((Uint32 *)sur->pixels + y*sur->pitch/4 + x) = dR << 16 | dG << 8 | dB;
}

Piotr Dubla

Hi,

Of course you already know some assembly would speed this up a lot, but here
are a few improvements using standard C.

Organize your table with the alpha index first (i.e. unsigned char table[10][256]),
so the row for a given alpha only has to be located once per pixel:

unsigned char MyAlphaTable[10][256];

void
MyAlphaPixel( int x, int y, Uint32 col, int alpha, SDL_Surface * sur )
{
int a;
unsigned char sR, sG, sB, dR, dB, dG;
// do this calc ONCE!
Uint32 * pDest = ((Uint32*)sur->pixels + y*sur->pitch/4 + x);
Uint32 dest = *pDest;
unsigned char * pTable = &MyAlphaTable[alpha][0];

sR = (col >> 16) & 0xff;
sG = (col >> 8) & 0xff;
sB = (col) & 0xff;
dR = (dest >> 16) & 0xff;
dG = (dest >> 8) & 0xff;
dB = (dest) & 0xff;

a = 10 - alpha;

if (sR == 0) { dR = 0; } else { dR = (pTable[sR]); }
if (sG == 0) { dG = 0; } else { dG = (pTable[sG]); }
if (sB == 0) { dB = 0; } else { dB = (pTable[sB]); }

if (a > 0)
{
dR += (pTable[dR]);
dG += (pTable[dG]);
dB += (pTable[dB]);
}

*pDest = (((Uint32)dR) << 16) | (((Uint32)dG) << 8) | ((Uint32)dB);
}

Hope this helps.

> Hi, this is my current alpha pixel routine and it is SLOW. Does anybody have
> any optimization ideas? I don’t want to use assembler.
>
> The table covers color values 0 to 255 and alpha 0.0 to 1.0 in 0.1 increments.

For inspiration, look at SDL_RLEaccel.c and SDL_blit_A.c in the SDL sources.
SDL uses integer multiplies, but only one or two per pixel. Modern CPUs
(say, P6, PPC, etc.) have fairly fast integer multipliers, so this is done
in a couple of cycles. Your 64K lookup table may waste L1 D-cache space,
or (if it doesn’t fit) cause evil stalls, up to three times per pixel.
On the other hand, if it fits it can be quite fast. I imagine it would
run well on an HP PA8500, with its huge L1 cache :)
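To make the "one or two multiplies per pixel" idea concrete, here is a minimal sketch of the general technique. It is not copied from the SDL sources; the function name is made up for illustration, and it assumes a 0x00RRGGBB pixel layout and an alpha in the range 0..255 rather than the 0..10 steps used in the original routine.

```c
#include <stdint.h>

/* Blend one 0x00RRGGBB pixel over another using two integer multiplies.
 * alpha runs from 0 (keep dst unchanged) to 255 (almost entirely src).
 * Hypothetical helper, assuming the pixel layout named above. */
uint32_t blend_rgb888(uint32_t src, uint32_t dst, uint32_t alpha)
{
    /* Red and blue share one multiply: the 8 unused bits between them
     * give the blue product room so it does not spill into red. */
    uint32_t s_rb = src & 0x00ff00ffu;
    uint32_t d_rb = dst & 0x00ff00ffu;
    d_rb = (d_rb + (((s_rb - d_rb) * alpha) >> 8)) & 0x00ff00ffu;

    /* Green gets the second multiply. */
    uint32_t s_g = src & 0x0000ff00u;
    uint32_t d_g = dst & 0x0000ff00u;
    d_g = (d_g + (((s_g - d_g) * alpha) >> 8)) & 0x0000ff00u;

    return d_rb | d_g;
}
```

Each channel comes out as d + (s - d) * alpha / 256, rounded down, so there is no table to keep in cache and no branch. Dividing by 256 instead of 255 means alpha = 255 falls one step short of fully opaque, which is the usual price of using a shift.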

As has been mentioned before, hand-written assembly can be faster yet,
but it is not something I want to encourage people to do unless they have
exhausted all other reasonable alternatives.

> if (sR == 0) { dR = 0; } else { dR = (table[sR][alpha]); }
> if (sG == 0) { dG = 0; } else { dG = (table[sG][alpha]); }
> if (sB == 0) { dB = 0; } else { dB = (table[sB][alpha]); }

Drop the conditionals: the table entry for a zero component should already
be 0. Avoid branches in inner loops.

> if (a > 0)
> {
> dR += (table[dR][alpha]);
> dG += (table[sG][alpha]);
> dB += (table[sB][alpha]);
> }

I’m not sure what your intention was, but it looks like this can be folded
into the table as well.
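One possible reading of the routine's intent, sketched below, is a weighted sum: the source scaled by alpha plus the destination scaled by (10 - alpha). This is a guess, not the original author's code. The table name, the helper names and the choice of 11 alpha levels (0..10 inclusive) are assumptions made for the example. It needs no zero tests, because every row maps 0 to 0, and no "if (a > 0)", because row 0 is all zeros.

```c
#include <stdint.h>

#define ALPHA_LEVELS 11   /* assumed: alpha 0..10 stands for 0.0..1.0 */

/* Hypothetical table: blend_table[a][v] holds v * a / 10, rounded down. */
static uint8_t blend_table[ALPHA_LEVELS][256];

/* Fill the table once at startup. */
void init_blend_table(void)
{
    for (int a = 0; a < ALPHA_LEVELS; a++)
        for (int v = 0; v < 256; v++)
            blend_table[a][v] = (uint8_t)(v * a / 10);
}

/* Blend 0x00RRGGBB src over dst; alpha must be in 0..10. */
uint32_t blend_lookup(uint32_t src, uint32_t dst, int alpha)
{
    const uint8_t *s_row = blend_table[alpha];       /* source weight */
    const uint8_t *d_row = blend_table[10 - alpha];  /* destination weight */

    /* Two unconditional lookups per channel. The sum cannot exceed 255
     * because the two weights add up to 10. */
    uint32_t r = s_row[(src >> 16) & 0xff] + d_row[(dst >> 16) & 0xff];
    uint32_t g = s_row[(src >> 8) & 0xff] + d_row[(dst >> 8) & 0xff];
    uint32_t b = s_row[src & 0xff] + d_row[dst & 0xff];

    return (r << 16) | (g << 8) | b;
}
```

Call init_blend_table() once before the first blend. With alpha = 0 the destination comes back unchanged, and with alpha = 10 the source replaces it.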

> The table covers color values 0 to 255 and alpha 0.0 to 1.0 in 0.1 increments.

Sorry, I didn’t read your message properly and missed this. Such a small
alpha table should be less of a problem, if you swap the indices
(saves a shift and an add for every table access, which should amount to
a cycle or two). Or you could pad the alpha dimension to a power of two.
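The power-of-two option might look like the sketch below: keep the value-major layout, but round the alpha dimension up from 11 to 16 so the index is a single shift and OR. The names and the 11-level assumption are invented for the example, and whether it is measurably faster than swapping the indices is not something this sketch claims.

```c
#include <stdint.h>

#define ALPHA_SHIFT 4   /* 16 slots per color value, of which 11 are used */

/* Hypothetical flat table: entry (v << ALPHA_SHIFT) | a holds v * a / 10. */
static uint8_t padded_table[256 << ALPHA_SHIFT];

/* Fill the used slots once at startup; the 5 padding slots stay zero. */
void init_padded_table(void)
{
    for (int v = 0; v < 256; v++)
        for (int a = 0; a <= 10; a++)
            padded_table[(v << ALPHA_SHIFT) | a] = (uint8_t)(v * a / 10);
}

/* Scale one 8-bit channel by alpha / 10; alpha must be in 0..10. */
uint8_t scale_channel(unsigned value, unsigned alpha)
{
    return padded_table[(value << ALPHA_SHIFT) | alpha];
}
```

The padding costs 5 unused bytes per color value, so the table grows from 2816 bytes to 4 KB.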

In general, however, use SDL’s built-in alpha blitting functions as far
as possible. If they are optimized, your application will automatically
benefit from it.