Slowdowns....Speedups!

brn wrote:

and only add the X coordinate. For example, we’ll build a table of y values
for a surface with the dimensions 5x5:
int y_table[5];
int y;
for (y = 0; y < 5; y++) {
    y_table[y] = y * 5;   /* offset of the first pixel on line y */
}

On a 640x480 screen, your multiplication table will trash the l1 cache…
A linear pointer increment (e.g. sam’s code) is faster.

c.

How big is the l1 cache?

This code is not for blitting anyway (whereas the code you refer to as “sam’s
code” is for blitting). It was only for a put-pixel function, for referencing
one pixel at one specific location without multiplication. There are other
ways of computing the actual element; you can even do it with bit-shifting,
but I found that to be slower. I’ve noticed that in SDL’s code
there is multiplication done to reference the initial pointers for the src and
destination surfaces during a blit. This could be avoided by using a
lookup table. Surely, even if it trashes the “l1” cache, it has to be faster than
multiplication, especially for very small things you are blitting, such as
sprites?
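
As a rough sketch of the put-pixel idea described above, assuming an 8-bit surface stored as one contiguous array (the names `pixels`, `build_y_table` and `put_pixel` are made up for illustration and are not SDL functions), the table of line offsets could be used like this:

```c
#include <stddef.h>
#include <stdint.h>

enum { SURF_W = 5, SURF_H = 5 };        /* the 5x5 example surface */

static uint8_t pixels[SURF_W * SURF_H]; /* hypothetical 8-bit surface */
static size_t  y_table[SURF_H];         /* y_table[y] == y * SURF_W */

/* Fill the table once, e.g. when the surface is created. */
static void build_y_table(void)
{
    size_t offset = 0;
    for (int y = 0; y < SURF_H; y++) {
        y_table[y] = offset;
        offset += SURF_W;   /* running sum, so no multiply here either */
    }
}

/* Plot one pixel: a table lookup plus an add instead of y * SURF_W + x.
 * No bounds checking; the caller must pass coordinates inside the surface. */
static void put_pixel(int x, int y, uint8_t color)
{
    pixels[y_table[y] + x] = color;
}
```

build_y_table() has to run once before the first put_pixel() call. Whether the lookup actually beats the multiply depends on the CPU and on whether the table is still in the cache when it is needed.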

I’d also considered that, instead of making a table of y-values, you could make a
table of pointers to each y-line on the screen, but I’ve never tested it speed-wise.
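
A minimal sketch of that pointer-table variant, assuming a 640x480 8-bit framebuffer whose pitch equals its width (`screen`, `row_table`, `build_row_table` and `put_pixel_rows` are hypothetical names used only for this example):

```c
#include <stdint.h>

enum { SCREEN_W = 640, SCREEN_H = 480 };

static uint8_t  screen[SCREEN_W * SCREEN_H]; /* hypothetical 8-bit framebuffer */
static uint8_t *row_table[SCREEN_H];         /* row_table[y] points at the start of line y */

/* Record the address of each line once, up front. */
static void build_row_table(void)
{
    uint8_t *row = screen;
    for (int y = 0; y < SCREEN_H; y++) {
        row_table[y] = row;
        row += SCREEN_W;    /* pitch assumed equal to the width in bytes */
    }
}

/* Plot one pixel through the line pointer; no bounds checking. */
static void put_pixel_rows(int x, int y, uint8_t color)
{
    row_table[y][x] = color;
}
```

Compared with a table of offsets, this saves adding the base address on each access, at the cost of rebuilding the table whenever the framebuffer address changes.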

Paul Lowe
spazz at ulink.net

How big is the l1 cache?

It’s usually pretty small, e.g. 128-512K
I’m not actually sure if I am thinking of the L1 or L2 cache.
I think the L1 cache is much smaller than that.
You could look at some CPU specs and find out.

I’ve noticed that in SDL’s code
there is multiplication done to reference the initial pointers for the src and
destination surfaces during a blit. This could be avoided by using a
lookup table. Surely, even if it trashes the “l1” cache, it has to be faster than
multiplication, especially for very small things you are blitting, such as
sprites?

Actually the multiplication is faster than a table lookup because it’s
only done once at the beginning of an inner loop, and uses values that
at that point are already usually in registers. A memory fetch is much
slower than a single (or a few) multiplies.

Table lookups tend to be faster than multiplies if the table is in the
cache, and you are doing several calculations, but pointer incrementing
is faster on the current hardware than any other solution.
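
To illustrate the point about multiplying once and then incrementing pointers, here is a minimal sketch. It is not SDL's actual blit code; `fill_rect` and its parameters are hypothetical, it assumes an 8-bit surface, and it does no clipping:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a w x h rectangle at (x, y) on an 8-bit surface.
 * pitch is the number of bytes from one line to the next. */
static void fill_rect(uint8_t *pixels, size_t pitch,
                      int x, int y, int w, int h, uint8_t color)
{
    /* The only multiply: locate the first pixel of the rectangle. */
    uint8_t *row = pixels + (size_t)y * pitch + (size_t)x;

    for (int j = 0; j < h; j++) {
        uint8_t *p = row;
        for (int i = 0; i < w; i++)
            *p++ = color;   /* step across the line */
        row += pitch;       /* step down one line */
    }
}
```

The multiply is paid once per call, so its cost is spread over every pixel the loops touch, while a table lookup in the same place would add a memory fetch.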

I can’t wait for SIMD instructions… GRIN

I’d also considered that, instead of making a table of y-values, you could make a
table of pointers to each y-line on the screen, but I’ve never tested it speed-wise.

I think that is faster, but not by very much.

-Sam Lantinga				(slouken at devolution.com)

Lead Programmer, Loki Entertainment Software
“Any sufficiently advanced bug is indistinguishable from a feature”
– Rich Kulawiec

How big is the l1 cache?
It’s usually pretty small, e.g. 128-512K
I’m not actually sure if I am thinking of the L1 or L2 cache.
I think the L1 cache is much smaller than that.

That’s the L2 cache… I just remember the Pentium L1 cache was 32 bytes.
ah, the good old times where people wrote texture mapping algorithms that
would try to fit as much as possible in a 32 byte cache :)

(see for example my tunnel3d demo in OpenPTC :)

c.

brn wrote:

How big is the l1 cache?
It’s usually pretty small, e.g. 128-512K
I’m not actually sure if I am thinking of the L1 or L2 cache.
I think the L1 cache is much smaller than that.

That’s the L2 cache… I just remember the Pentium L1 cache was 32 bytes.
ah, the good old times where people wrote texture mapping algorithms that
would try to fit as much as possible in a 32 byte cache :)

(see for example my tunnel3d demo in OpenPTC :)

c.

There’s some confusion here, I think…

The original Pentium had separate data and instruction L1 caches,
two-way 8KB each; Pentium with MMX doubled this to 16KB each, four-way.
I believe PII and PPro had the same, and I think PIII has doubled this
again.

Rumor has it that the Celeron 128 is way cool since it has a 128K L2
cache that runs at full core clock speed, making it essentially a huge
(by Intel standards) L1 cache. This was to make up for not having any
L2 cache at all in the original Celeron.

Most Pentium-class machines I’ve seen have at least 256K of L2 and 512K
is pretty common.

Each cache line is 32 bytes; maybe that’s the 32 you’re remembering…
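
For a sense of scale, here is a small sketch of the arithmetic, assuming the y-table from earlier in the thread holds one 32-bit entry per line of a 480-line screen (the entry size and the helper names are assumptions made for this example):

```c
#include <stddef.h>
#include <stdint.h>

enum {
    CACHE_LINE_BYTES = 32,        /* line size quoted above */
    L1_DATA_BYTES    = 8 * 1024,  /* original Pentium data cache, as quoted above */
    SCREEN_H         = 480
};

/* Number of cache lines needed to hold `bytes` bytes (rounded up). */
static size_t cache_lines_for(size_t bytes)
{
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Footprint of a y-table with one 32-bit entry per screen line. */
static size_t y_table_bytes(void)
{
    return SCREEN_H * sizeof(uint32_t);
}
```

Under those assumptions the table is 1920 bytes, or 60 of the 256 lines in an 8KB data cache, so it fits but has to share the cache with whatever pixel data is being touched.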

Gary Scillian @Gary_Scillian
"There’s a seeker born every minute." - Firesign Theatre