Slowdowns. (A Bit off topic)

I have a rendering loop which performs two 640x480x16 blits per frame,
each between SDL_Surfaces in system memory, then finally a screen
update. When combined with some smaller blits for a GUI I get about
25fps in non-DGA mode. However, when I draw to one of the software
surfaces (writing about 4*640+4*480 pixels) it slows to approximately 12
fps. My rendering loop is not that computationally intensive. Is
there any way I can improve the rate at which I write to system memory?
I use vertical scanlines because of a number of camera restrictions
which save a great deal of computation. Currently the pixel drawing
code looks like pxl[x+px+(y+py)*w] = pshade( src[px+py*srcw], light,
specular ); (pshade is an inline Gouraud shading function). Any ideas?

Stuart

Stuart Anderson wrote:

I use vertical scanlines because of a number of camera restrictions
which save a great deal of computation. Currently the pixel drawing
code looks like pxl[x+px+(y+py)*w] = pshade( src[px+py*srcw], light,
specular ); (pshade is an inline Gouraud shading function). Any ideas?

DUDE. You are drawing your pixels all the wrong way. That is a very slow
way of doing it. Read up on some stuff I wrote, it talks about optimizing
this:

(copied from a tutorial I wrote a while back):

Precalculation of Referencing Pixels
--------------------------------------

Multiplication is a heavy task for microprocessors to perform. You can
avoid the multiplication required to place a pixel by creating a table with
one entry for every possible y value. Each entry holds the index of the
first pixel in that row, i.e. the element at (0, y). In this fashion you
avoid the multiplication and only add the x coordinate. For example, we’ll
build a table of y values for a surface with the dimensions 5x5:

    int y_table[5];

    for (y = 0; y < 5; y++) {
        y_table[y] = y * 5;   /* index of the first pixel in row y */
    }

    pixel = surface[y_table[y] + x];

is much faster than:

    pixel = surface[(y * width) + x];

I really hope this helps, I’ve implemented this into a graphics library I
wrote a while back:

http://SparkGL.netpedia.net

Paul Lowe
spazz at ulink.net

I use vertical scanlines because of a number of camera restrictions
which save a great deal of computation. Currently the pixel drawing
code looks like pxl[x+px+(y+py)*w] = pshade( src[px+py*srcw], light,
specular ); (pshade is an inline Gouraud shading function). Any ideas?

This is actually a fairly computationally expensive operation, especially
since writing columns vertically may destroy data caching advantages in
the CPU. The large number of variables used in the loop means that they
cannot all be kept in registers on the x86, so you may be performing
memory loads in addition to the actual pixel operations.
Consider also the number of pixels you are actually operating on:
640*480 = 307200, or about a third of a million pixel operations.

Some optimizations I use when writing blitting code:

Instead of pxl[calculated_offset], use a pointer to the pixels directly,
and increment that pointer by a fixed amount:

    while ( height-- ) {
        for ( c=width; c; --c ) {
            if ( 1 ) {
                *dst = map[*src];
            }
            dst++;
            src++;
        }
        src += srcskip;
        dst += dstskip;
    }

It may be that you are not drawing pixels in a tight loop, so this
optimization may be less important for you, but code written this
way is very close in speed to hand-tuned assembly that I’ve seen,
especially when loop unrolling optimizations are added.
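
For readers who want to try the pointer-stepping loop above, here is a minimal, self-contained sketch of it. The function name blit_mapped, the 8-bit source format, the 256-entry map, and the convention that pitches are counted in pixels rather than bytes are all assumptions made for illustration; none of this is an SDL call. The skip values are simply the pitch minus the width of the block being copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of SDL): copies a width x height block of
 * 8-bit source pixels into a 16-bit destination through a lookup table,
 * walking both buffers with pointers. Pitches are in pixels, not bytes. */
void blit_mapped(uint16_t *dst, size_t dst_pitch,
                 const uint8_t *src, size_t src_pitch,
                 size_t width, size_t height,
                 const uint16_t map[256])
{
    assert(width <= dst_pitch && width <= src_pitch);

    /* Pixels to skip at the end of each row to reach the next row. */
    const size_t dstskip = dst_pitch - width;
    const size_t srcskip = src_pitch - width;

    for (size_t rows = height; rows; --rows) {
        for (size_t c = width; c; --c) {
            *dst++ = map[*src++];
        }
        /* Do not step past the block after the final row. */
        if (rows > 1) {
            src += srcskip;
            dst += dstskip;
        }
    }
}
```

With a real SDL surface the pitch is reported in bytes, so it would have to be divided by the pixel size before being passed to a helper like this one.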

Anyone else have optimization tips?

Stuart

Good luck!

-Sam Lantinga				(slouken at devolution.com)

Lead Programmer, Loki Entertainment Software
--
“Any sufficiently advanced bug is indistinguishable from a feature”
-- Rich Kulawiec

Sam Lantinga wrote:

Consider also the number of pixels you are actually operating on:
640*480 = 307200, or about a third of a million pixel operations.

Actually I only draw the new background where the original screen has
been destroyed; the rest is just an offset blit of the old background,
so I (usually) draw a group of 4*480 or 4*640 pixels depending on the
direction and distance of each frame’s scroll. I use vertical scan
lines because this lets me use tables for a lot of the shading and
eliminate a few linear interpolations. I will try using your pointer
scheme; however, I will have to use dst+=scrnskip; in my inner loop.
The loop I am drawing in is 4 deep: Tiles X, Tiles Y, Pixels in Tile x,
Pixels in Tile y. The corner of each tile is shifted on the y-axis and
the image data is mapped to the new polygon and shaded.
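
A vertical scanline written with the pointer scheme might look roughly like the sketch below. The name draw_column and its parameters are hypothetical, the pitch is assumed to be in pixels, and the shading call is replaced by a plain copy to keep the example short. The point is only that the row multiplication happens once per column and the inner loop adds a fixed stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes `count` pixels down one column starting at
 * (x, y), reading them from a contiguous run of source pixels.
 * `pitch` is the destination row length in pixels. */
void draw_column(uint16_t *pixels, size_t pitch, size_t x, size_t y,
                 const uint16_t *src, size_t count)
{
    assert(x < pitch);
    if (count == 0) {
        return;
    }

    /* One multiplication per column instead of one per pixel. */
    uint16_t *dst = pixels + y * pitch + x;

    *dst = *src++;
    while (--count) {
        dst += pitch;      /* step down exactly one row */
        *dst = *src++;
    }
}
```

This removes the per-pixel index arithmetic but not the cache behaviour: each step still lands one full row further on in memory.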

Stuart.

I wrote a small program to empirically test each of these pixel writing
methods on my machine.

ptr++ in inner loop, 45-50fps (60 with -O2 -funroll-loops)

ptr+=width; in inner, ptr-=640*480-1; in outer, 20fps (same with -O2
-funroll-loops)

ptr[x+ymap[y]] gives 20fps (same with -O2 -funroll-loops)

ptr[x+Y*width] gives 20fps (same with -O2 -funroll-loops)

So it seems that incrementing the pointer is the only method that will,
in my case, give significant speed increases. (same with -O2
-funroll-loops)
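
The test program itself was not posted, so the following is only a sketch of how three of the addressing methods above could be set up for comparison. The function names, the fixed 640x480 size, and the pattern function are assumptions for illustration; the column-stepping variant and all timing code are left out. Each function fills the same buffer layout, so the results can be compared before timing anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { W = 640, H = 480 };

/* Arbitrary test pattern so the three fills can be checked against
 * each other. */
static uint16_t pattern(size_t x, size_t y)
{
    return (uint16_t)(x ^ (y * 3u));
}

/* Method 1: pointer incremented in the inner loop. */
void fill_increment(uint16_t *buf)
{
    assert(buf != NULL);
    uint16_t *p = buf;
    for (size_t y = 0; y < H; ++y)
        for (size_t x = 0; x < W; ++x)
            *p++ = pattern(x, y);
}

/* Method 2: row offsets looked up in a precomputed table. */
void fill_table(uint16_t *buf)
{
    static size_t ymap[H];
    assert(buf != NULL);
    for (size_t y = 0; y < H; ++y)
        ymap[y] = y * W;
    for (size_t y = 0; y < H; ++y)
        for (size_t x = 0; x < W; ++x)
            buf[x + ymap[y]] = pattern(x, y);
}

/* Method 3: row offset multiplied out for every pixel. */
void fill_multiply(uint16_t *buf)
{
    assert(buf != NULL);
    for (size_t y = 0; y < H; ++y)
        for (size_t x = 0; x < W; ++x)
            buf[x + y * W] = pattern(x, y);
}
```

An optimizing compiler may turn all three into very similar machine code, so timings on a current compiler and CPU may not match the figures quoted above.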

and only add the X coordinate. For example, we’ll build a table of y values
for a surface with the dimensions 5x5.:
int y_table[5];
for(y = 0; y < 5; y++) {
y_table[y] = (y * 5);

On a 640x480 screen, your multiplication table will trash the L1 cache…
A linear pointer increment (e.g. Sam’s code) is faster.

c.