Source Alpha enabled blitting - off by one?

Cheers again,

I have a question on the ALPHA_BLEND routine. It seems to be making a
slight calculation error on the RGB value calculations.

From SDL_blit.h

/* Blend the RGB values of two pixels based on a source alpha value */
#define ALPHA_BLEND(sR, sG, sB, A, dR, dG, dB)  \
do {                                            \
    dR = (((sR-dR)*(A))>>8)+dR;                 \
    dG = (((sG-dG)*(A))>>8)+dG;                 \
    dB = (((sB-dB)*(A))>>8)+dB;                 \
} while(0)

When blitting a white box (sR,sG,sB=255) onto a black box (dR,dG,dB=0)
with source alpha A=255 (fully opaque), we get a new dR,dG,dB of 254
rather than the correct 255, since (255*255)>>8 = 65025>>8 = 254.
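
The arithmetic above can be checked with a small sketch. The helper names blend_shift and blend_div are hypothetical and not part of SDL; blend_shift mirrors one channel of the macro, blend_div swaps the shift for a true division by 255.

```c
#include <stdint.h>

/* One channel of ALPHA_BLEND as written: >>8 divides by 256, not 255.
   Like the macro, this relies on the compiler's behaviour when
   right-shifting a negative value if s < d. */
static uint8_t blend_shift(uint8_t s, uint8_t d, uint8_t a)
{
    int diff = (int)s - (int)d;
    return (uint8_t)(((diff * a) >> 8) + d);
}

/* Same blend with an exact division by 255 (truncating toward zero). */
static uint8_t blend_div(uint8_t s, uint8_t d, uint8_t a)
{
    int diff = (int)s - (int)d;
    return (uint8_t)((diff * a) / 255 + d);
}
```

For white over black at A=255, blend_shift yields 254 while blend_div yields 255, which is the off-by-one described here.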

Correct would be to use /255 rather than the >>8 operation. This
is obviously done for speed reasons?

Wouldn’t it be better to precalculate the term (((sC-dC)*(A))/255) for
all values of (sC-dC) and (A) (adds a 128Kbyte lookup table), saving the
bitshift altogether and gaining the added correctness of the /255 term?
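
To make the memory cost of such a table concrete, here is a minimal sketch. The names blend_lut, init_blend_lut and blend_channel_lut are hypothetical and not part of SDL. This version stores each entry as an int16_t, because (sC-dC)*A/255 spans -255..255; that makes the table about 256 KB. Reaching roughly 128 KB, as estimated above, would require one byte per entry.

```c
#include <stdint.h>

/* Hypothetical table: blend_lut[diff + 255][A] = (diff * A) / 255,
   for diff in -255..255 and A in 0..255. */
static int16_t blend_lut[511][256];

static void init_blend_lut(void)
{
    for (int diff = -255; diff <= 255; ++diff)
        for (int a = 0; a < 256; ++a)
            /* C99 integer division truncates toward zero. */
            blend_lut[diff + 255][a] = (int16_t)((diff * a) / 255);
}

/* Blend one channel with a single table look-up and an add. */
static uint8_t blend_channel_lut(uint8_t s, uint8_t d, uint8_t a)
{
    return (uint8_t)(d + blend_lut[(int)s - (int)d + 255][a]);
}
```

init_blend_lut() has to run once before the first blend. Whether the look-up beats the shift depends on how much of the table stays in cache.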

Bye now
Andreas

Andreas Schiffler wrote:

Correct would be to use /255 rather than the >>8 operation. This
is obviously done for speed reasons?

yes

Wouldn’t it be better to precalculate the term (((sC-dC)*(A))/255) for
all values of (sC-dC) and (A) (adds a 128Kbyte lookup table), saving the
bitshift altogether and gaining the added correctness of the /255 term?

No. What could be done is to add a few terms of the power series
expansion to get better accuracy, but I don’t think it’s worth it
(this is for game graphics after all).
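
As a rough illustration of the power series idea: 1/255 = 1/256 + 1/256^2 + ..., so x/255 is approximately (x + (x >> 8)) >> 8. The sketch below is an assumption about what such code could look like, not what SDL does. The extra +1 is a common correction, added here, that makes the result match integer x/255 over the 0..65025 range that products of two 8-bit values can span. The names div255 and blend_series are hypothetical.

```c
#include <stdint.h>

/* x / 255 for 0 <= x <= 65025 using only shifts and adds:
   the first two terms of 1/255 = 1/256 + 1/256^2 + ..., plus 1. */
static uint32_t div255(uint32_t x)
{
    return (x + 1 + (x >> 8)) >> 8;
}

/* Blend one channel. s*a + d*(255-a) is always within 0..65025,
   so no negative value is ever shifted. */
static uint8_t blend_series(uint8_t s, uint8_t d, uint8_t a)
{
    uint32_t x = (uint32_t)s * a + (uint32_t)d * (255u - a);
    return (uint8_t)div255(x);
}
```

This costs one extra multiply, one shift and two adds per channel compared with the plain >>8 version.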

Fully opaque alpha is handled separately in the code, so that is not
a problem.

[…]

Correct would be to use /255 rather than the >>8 operation. This
is obviously done for speed reasons?

Well, considering that shift operations are anywhere from a few times to
tens of times faster than divisions on nearly all CPUs in existence (even
including most DSPs!) - yeah, it’s all about speed.

Still, it’s not all that bad WRT accuracy. I don’t think you’ll see the
difference on any “consumer” hardware. (And high end machines often use
10 bit RGB, which would just make this error even less significant -
provided SDL actually supports such modes.)

And as Mattias already pointed out, opaque blits aren’t done by this code
anyway.

Wouldn’t it be better to precalculate the term (((sC-dC)*(A))/255) for
all (sC-dC) and (A)'s (adds a 128Kbyte lookup table) saving the
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
No way!!! There are lots of CPUs out there with only 128 kB of cache, and
even on CPUs with a bigger cache, this table would keep getting thrown out
of the cache, causing massive cache miss slowdowns during alpha blits.
Besides, I wouldn’t be surprised if even fetching hot table data from the
L2 cache is slower than shift operations.

In short, never use LUTs on current CPUs, if there’s any remotely
reasonable way around. In case you really need to do calculations that
cannot be approximated with fast integer code, well, FPUs have grown
incredibly fast for a reason.

//David Olofson — Programmer, Reologica Instruments AB

.- M A I A -------------------------------------------------.
| Multimedia Application Integration Architecture |
| A Free/Open Source Plugin API for Professional Multimedia |
`----------------------------> http://www.linuxdj.com/maia -'
.- David Olofson -------------------------------------------.
| Audio Hacker - Open Source Advocate - Singer - Songwriter |
`-------------------------------------> http://olofson.net -'

On Tuesday 13 November 2001 20:17, Andreas Schiffler wrote:

David,

In short, never use LUTs on current CPUs, if there’s any remotely
reasonable way around.

Good point, never thought of it this way.

I’ll hunt down any LUT >256 bytes in my code now and throw it out …

Cheers
Andreas

Well, if you already spent the time hacking the code, it might be a good
idea to benchmark the code first - there are cases where even bigger
LUTs (as long as they fit in the cache) can speed things up.

Of course, this is assuming that the LUTs replace calculations that take
quite a few operations. Things like "multiplication with surface pitch"
stopped being a candidate for LUTs somewhere around the time Pentium MMX
and K6 got a 3 cycle multiply instruction. (I’m suspecting that such a
LUT is often a loss even on the pre-MMX Pentium, where a MUL takes 10
cycles IIRC. Depends on the cache situation.)

To be entirely correct, the actual rule is more like “LUTs where most
entries are used several times without the LUT being flushed might be
faster than not using a LUT”. Consider keeping LUTs of up to a few kB,
provided they’re used frequently, and for massive “bursts” of look-ups
with little other cache thrashing stuff going on.

For example, an index->RGB LUT can be acceptable from the cache POV; fits
in cache and the whole table is used many times over while converting a
full screen. But then again, that’s one of the situations that cannot be
solved without a LUT anyway, so it’s not really a good example.
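
The index->RGB case could look roughly like the sketch below. The function names and the 0x00RRGGBB pixel layout are assumptions for illustration, not SDL's API. The table has 256 entries of 4 bytes (1 kB), and every pixel in the burst reads from it.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack 8-bit channels as 0x00RRGGBB (an assumed pixel layout). */
static uint32_t pack_rgb(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Convert 8-bit palette indices to packed RGB through a 256-entry LUT.
   The whole table is reused many times over one full-screen pass. */
static void convert_indexed(const uint8_t *src, uint32_t *dst,
                            size_t count, const uint32_t palette[256])
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = palette[src[i]];
}
```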

//David Olofson — Programmer, Reologica Instruments AB

On Tuesday 13 November 2001 22:59, Andreas Schiffler wrote:

David,

In short, never use LUTs on current CPUs, if there’s any remotely
reasonable way around.

Good point, never thought of it this way.

I’ll hunt down any LUT >256 bytes in my code now and throw it out …

On Tuesday, 13 November 2001 22:09, you wrote:

In short, never use LUTs on current CPUs, if there’s any remotely
reasonable way around. In case you really need to do calculations that
cannot be approximated with fast integer code, well, FPUs have grown
incredibly fast for a reason.

This is certainly true for straightforward calculations. I don’t think that
applies to calculations where a lot of branching takes place, since branches
(especially if they’re hard to predict) can waste a lot of time on modern
CPUs.

In the end you’ll probably have to benchmark critical code sections anyway.

cu,
Nicolai


SDL mailing list
SDL at libsdl.org
http://www.libsdl.org/mailman/listinfo/sdl

Would a LUT be good for sines and cosines?

----- Original Message -----
From: david.olofson@reologica.se (David Olofson)
To:
Sent: Tuesday, November 13, 2001 11:56 PM
Subject: Re: [SDL] Source Alpha enabled blitting - off by one?
On Tuesday 13 November 2001 22:59, Andreas Schiffler wrote:

On Tuesday, 13 November 2001 22:09, you wrote:

In short, never use LUTs on current CPUs, if there’s any remotely
reasonable way around. In case you really need to do calculations
that cannot be approximated with fast integer code, well, FPUs have
grown incredibly fast for a reason.

This is certainly true for straightforward calculations. I don’t think
that applies to calculations where a lot of branching takes place,
since branches (especially if they’re hard to predict) can waste a lot
of time on modern CPUs.

Yeah, those branches… Indeed, a LUT + switch() construct is just one
LUT look-up and one branch, which is probably going to be faster than
even relatively simple “multiple branch operations”. (BTW, a properly
written switch() is actually a LUT in itself.)
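
A minimal sketch of the LUT + switch() pattern, applied to the alpha case from this thread. All names are hypothetical, and with only three classes a pair of if()s would do just as well; the pattern pays off when the table collapses many conditions into one look-up.

```c
#include <stdint.h>

enum alpha_class { ALPHA_SKIP, ALPHA_COPY, ALPHA_MIX };

/* Hypothetical classification table: one entry per alpha value. */
static uint8_t alpha_class_lut[256];

static void init_alpha_class_lut(void)
{
    for (int a = 0; a < 256; ++a)
        alpha_class_lut[a] = ALPHA_MIX;
    alpha_class_lut[0] = ALPHA_SKIP;    /* fully transparent */
    alpha_class_lut[255] = ALPHA_COPY;  /* fully opaque */
}

/* One look-up and one branch select the code path for a channel. */
static uint8_t blend_dispatch(uint8_t s, uint8_t d, uint8_t a)
{
    switch (alpha_class_lut[a]) {
    case ALPHA_SKIP:
        return d;
    case ALPHA_COPY:
        return s;
    default:
        return (uint8_t)((s * a + d * (255 - a)) / 255);
    }
}
```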

In the end you’ll probably have to benchmark critical code sections
anyway.

Yes, especially in cases where a LUT replaces only a few instructions or
a few if()s. (Better be suspicious, though - the difference between two
ways of doing the same thing can sometimes be much bigger than one
would guess by looking at the code!)

//David Olofson — Programmer, Reologica Instruments AB

On Wednesday 14 November 2001 16:17, Nicolai Hähnle wrote: