Advice for performance bottlenecks

I have ported SDL 1.2.7 (video, timer, joystick) to run on the XBOX, and
thought I would try out a reasonable game such as Doom to see how I
fared. I downloaded sdldoom 1.10 (as ported by Sam) and after commenting
out most of the network stuff, it runs the initial menus just fine. When
you start the actual game, though, it is very slow. For example, when
you fire the gun it takes around 5 seconds for the screen to finish
repainting.

An obvious candidate was in my implementation of XBOX_UpdateRects(), but
when I log how long it takes my code to redraw the screen, it is only
around 18ms to draw 640x400 pixels.

I realise there are loads of things that can cause performance problems,
but I was wondering if there are a few common places that would be worth
me looking at before anything else? For example, is there an easy way for
me to log the current fps calculation? Any ideas or thoughts would be
much appreciated. Thanks a lot.
Craig Edwards

Knowing nothing about sdldoom, I’ll venture you should probably write in
your own fps calculator, and start logging within sdldoom in other
places to see where the problem creeps in. It could be some low level
interaction between SDL and the XBox GNU C implementation.
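Something as simple as the sketch below would do (just a rough idea, assuming you call it once per rendered frame from wherever you flip or update the screen; the function name is made up):

#include <stdio.h>
#include "SDL.h"

/* Rough fps logger: call once per rendered frame; prints the rate
   about once a second using SDL_GetTicks(). */
static void log_fps(void)
{
    static Uint32 last_ticks = 0;
    static int frames = 0;
    Uint32 now;

    ++frames;
    now = SDL_GetTicks();
    if (last_ticks == 0) {
        last_ticks = now;
    } else if (now - last_ticks >= 1000) {
        printf("fps: %.1f\n", frames * 1000.0 / (now - last_ticks));
        frames = 0;
        last_ticks = now;
    }
}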

-TomT64

Craig Edwards wrote:

I have ported SDL 1.2.7 (video, timer, joystick) to run on the XBOX,

and thought I would try out a reasonable game such as Doom to see how
I fared. I downloaded sdldoom 1.10 (as ported by Sam) and after
commenting out most of the network stuff, it runs the initial menus
just fine. When you start the actual game, though, it is very
slow. For example, when you fire the gun it takes around 5 seconds
for the screen to finish repainting.

An obvious candidate was in my implementation of XBOX_UpdateRects(),
but when I log how long it takes my code to redraw the screen, it is
only around 18ms to draw 640x400 pixels.

I realise there are loads of things that can cause performance
problems, but I was wondering if there are a few common places that
would be worth me looking at before anything else? For example, is
there an easy way for me to log the current fps calculation? Any
ideas or thoughts would be much appreciated. Thanks a lot.

Unfortunately, I don’t know diddley-squat about sdldoom myself either :)
So far, what you describe is pretty much what I was doing… I was curious
to see whether there are some common performance pitfalls that
SDL-developers come across. Thanks for your thoughts.

On Sun, 05 Sep 2004 01:18:51 -0700, TomT64 wrote:

Knowing nothing about sdldoom, I’ll venture you should probably write in
your own fps calculator, and start logging within sdldoom in other
places to see where the problem creeps in. It could be some low level
interaction between SDL and the XBox GNU C implementation.


Craig Edwards

In Stella (an Atari 2600 emulator), I’ve come across similar performance
problems between Linux and Windows.

I use a combination of dirty rectangles and SDL_UpdateRects to fill
regions of the screen with certain colours (the 2600 has no concept of
a sprite, only a framebuffer).
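Roughly, the pattern is the sketch below (simplified from memory; the rect cap and function names are just for illustration):

#include "SDL.h"

#define MAX_DIRTY 256          /* illustrative cap on dirty rects per frame */

static SDL_Rect dirty[MAX_DIRTY];
static int num_dirty = 0;

/* Fill one region with a solid colour and remember it as dirty. */
static void fill_region(SDL_Surface *screen, Sint16 x, Sint16 y,
                        Uint16 w, Uint16 h, Uint32 color)
{
    SDL_Rect r;
    r.x = x; r.y = y; r.w = w; r.h = h;
    SDL_FillRect(screen, &r, color);
    if (num_dirty < MAX_DIRTY)
        dirty[num_dirty++] = r;
}

/* Once per frame, push only the changed regions to the display. */
static void flush_dirty(SDL_Surface *screen)
{
    if (num_dirty > 0) {
        SDL_UpdateRects(screen, num_dirty, dirty);
        num_dirty = 0;
    }
}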

I felt this code was highly optimized, and when run under Linux the
whole emulation (including everything, such as video and audio updates
and emulation processing itself) uses only 4-6% of the CPU! The exact
same code under Windows (on the exact same hardware) uses 50-60% CPU.

It’s a nut I haven’t managed to crack yet. I knew SDL wasn’t as
optimized on Windows, but that’s ridiculous. Luckily, I implemented an
OpenGL rendering mode which performs the same on all platforms.

Sorry for the ‘me too’ answer, but you’re definitely not alone.

Steve

On September 5, 2004 08:56 am, Craig Edwards wrote:

On Sun, 05 Sep 2004 01:18:51 -0700, TomT64 wrote:

Knowing nothing about sdldoom, I’ll venture you should probably
write in your own fps calculator, and start logging within sdldoom
in other places to see where the problem creeps in. It could be
some low level interaction between SDL and the XBox GNU C
implementation.

Unfortunately, I don’t know diddley-squat about sdldoom myself either
:) So far, what you describe is pretty much what I was doing… I
was curious to see whether there are some common performance pitfalls
that SDL-developers come across. Thanks for your thoughts.

Hello !

I have ported SDL 1.2.7 (video, timer, joystick)
to run on the XBOX,

Woah ! SDL on two consoles PS2 and XBOX. Great.

CU

Stephen Anthony wrote:

In Stella (an Atari 2600 emulator), I’ve come across similar performance
problems between Linux and Windows.

I use a combination of dirty rectangles and SDL_UpdateRects to fill
regions of the screen with certain colours (the 2600 has no concept of
a sprite, only a framebuffer).

I felt this code was highly optimized, and when run under Linux the
whole emulation (including everything, such as video and audio updates
and emulation processing itself) uses only 4-6% of the CPU! The exact
same code under Windows (on the exact same hardware) use 50-60% CPU.

It’s funny that you say that, since people usually say the opposite
(because they are using blitting heavily, which is faster under
windows/directx).

It’s a nut I haven’t managed to crack yet. I knew SDL wasn’t as
optimized on Windows, but that’s ridiculous. Luckily, I implemented an
OpenGL rendering mode which performs the same on all platforms.

I guess you’re using a hardware surface, or a doublebuffer. Did you try
to use a software single buffered video surface instead (asking for a
double buffered surface implicitly requests a hardware surface) ? You
might also want to try the windib backend that’s more appropriate for
this kind of thing.

In short, a hardware video surface can’t stand too much pixel-level
access, but is very appropriate for fast blitting. A software video
surface, OTOH, is appropriate for direct pixel access.
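For reference, the difference is only in the flags you pass to SDL_SetVideoMode() (the resolution and bpp below are just examples, and you would pick one call or the other):

#include "SDL.h"

SDL_Surface *screen;

void init_video(int use_hardware)
{
    if (use_hardware) {
        /* Hardware double-buffered surface: fast blits and SDL_Flip(),
           but direct pixel access crosses the bus and is slow. */
        screen = SDL_SetVideoMode(640, 480, 16,
                                  SDL_HWSURFACE | SDL_DOUBLEBUF | SDL_FULLSCREEN);
    } else {
        /* Software surface: pixels live in system memory, so per-pixel
           access is cheap; changes are pushed with SDL_UpdateRects(). */
        screen = SDL_SetVideoMode(640, 480, 16, SDL_SWSURFACE);
    }
}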

Stephane

Craig Edwards wrote:

I have ported SDL 1.2.7 (video, timer, joystick) to run on the XBOX,
and thought I would try out a reasonable game such as Doom to see how
I fared. I downloaded sdldoom 1.10 (as ported by Sam) and after
commenting out most of the network stuff, it runs the initial menus
just fine. When you start the actual game, though, it is very
slow. For example, when you fire the gun it takes around 5 seconds
for the screen to finish repainting.

An obvious candidate was in my implementation of XBOX_UpdateRects(),

What did you put in your updaterects func ?

but when I log how long it takes my code to redraw the screen, it is
only around 18ms to draw 640x400 pixels.

I realise there are loads of things that can cause performance
problems, but I was wondering if there are a few common places that
would be worth me looking at before anything else? For example, is
there an easy way for me to log the current fps calculation? Any
ideas or thoughts would be much appreciated. Thanks a lot.

Wild guess : sdldoom uses a single buffered surface, and thus tries to
draw directly to the video memory. This is very bad (and very slow) to
the point that this could explain the speed problem you’re seeing.

There are some solutions, however :

  • use the geforce blitting engine to copy a back surface to screen.
    Depending on which API you have access to, that might be possible
    (windib ? directx ? what APIs can you use if you don’t have the devkit
    ?). If you don’t have access to an API doing this, you can do it
    directly through mmio (like is done in the fbcon backend). If you know
    the base address of the mmio registers I can help you for the rest.
  • reduce the slowdown by doing carefully aligned memory access to copy
    the back surface to screen. That won’t be as good as the first solution,
    however.
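A rough sketch of the second idea (assuming a 32bpp shadow surface and an already-mapped framebuffer pointer; all names here are illustrative):

#include "SDL.h"

/* Copy one dirty rectangle from a 32bpp shadow surface into the
   framebuffer using whole 32-bit words, so every write is aligned. */
static void copy_rect_aligned(SDL_Surface *shadow, Uint8 *fb,
                              int fb_pitch, SDL_Rect *r)
{
    int row;
    Uint8 *src = (Uint8 *)shadow->pixels + r->y * shadow->pitch + r->x * 4;
    Uint8 *dst = fb + r->y * fb_pitch + r->x * 4;

    for (row = 0; row < r->h; ++row) {
        Uint32 *s = (Uint32 *)src;
        Uint32 *d = (Uint32 *)dst;
        int words = r->w;               /* one 32-bit word per pixel */
        while (words--)
            *d++ = *s++;
        src += shadow->pitch;
        dst += fb_pitch;
    }
}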

Stephane

Stephen Anthony wrote:

In Stella (an Atari 2600 emulator), I’ve come across similar
performance problems between Linux and Windows.

I use a combination of dirty rectangles and SDL_UpdateRects to fill
regions of the screen with certain colours (the 2600 has no concept
of a sprite, only a framebuffer).

I felt this code was highly optimized, and when run under Linux the
whole emulation (including everything, such as video and audio
updates and emulation processing itself) uses only 4-6% of the CPU!
The exact same code under Windows (on the exact same hardware) use
50-60% CPU.

It’s funny that you say that, since people usually say the opposite
(because they are using blitting heavily, which is faster under
windows/directx).

I know, it’s weird. But the thing is, the code was ported from direct
Xlib code. I’m thinking that since it was originally designed and
tested in Linux, something there was implicitly assumed and that’s why
the same code in Windows is so slow. It probably needs to be
rewritten.

It’s a nut I haven’t managed to crack yet. I knew SDL wasn’t as
optimized on Windows, but that’s ridiculous. Luckily, I implemented
an OpenGL rendering mode which performs the same on all platforms.

I guess you’re using a hardware surface, or a doublebuffer. Did you
try to use a software single buffered video surface instead (asking
for a double buffered surface implicitly requests a hardware surface)

No, actually I am using a software single-buffered mode. When I tried
the double-buffered mode in Windows, it went even slower!

You might also want to try the windib backend that’s more
appropriate for this kind of things.

I’ve never heard of that, and will try it out.

In short, a hardware video surface can’t stand too much pixel-level
access, but are very appropriate for fast blitting. A software video
surface, OTOH, is appropriate for direct pixel access.

If I create an array of dirty SDL_Rects every frame and then update them
all with SDL_UpdateRects() once per frame, which approach would work
best? I think software video surface would be best in this case, and
that’s why I use it.

I still have to profile the code to see exactly what’s happening, but I
can’t understand why it works so well in Linux and so poorly in
Windows. Heck, the Linux software rendering is even faster than OpenGL
mode! (which makes sense, since software mode uses dirty updates and
OpenGL updates the whole texture).

Steve

On September 5, 2004 02:41 pm, Stephane Marchesin wrote:

Stephen Anthony wrote:

In short, a hardware video surface can’t stand too much pixel-level
access, but are very appropriate for fast blitting. A software video
surface, OTOH, is appropriate for direct pixel access.

If I create an array of dirty SDL_Rects every frame and then update them
all with SDL_UpdateRects() once per frame, which approach would work
best? I think software video surface would be best in this case, and
that’s why I use it.

Is there any performance difference between windows fullscreen and
windowed ? When in windowed mode, the data is sent when doing
updaterects. When in fullscreen mode, the data is already there and
updaterects does nothing. Maybe you should have different settings for
windowed/fullscreen ?

In the end, it seems to me that letting the user choose the rendering
method is the way to go. I can’t see a way to extrapolate

I still have to profile the code to see exactly what’s happening, but I
can’t understand why it works so well in Linux and so poorly in
Windows. Heck, the Linux software rendering is even faster than OpenGL
mode! (which makes sense, since software mode uses dirty updates and
OpenGL updates the whole texture).

Then you should try 8bpp textures and glTexSubImage2D. I guess the 2600
has less than 256 colors ;) so that will do just fine.
FYI, 8bpp textures are faster than direct pixel access under X11 here
(that makes sense too, since you’re sending 8bpp values to the card, and
the card converts those to the display bpp, instead of sending bigger
already converted values which eat up the video card’s bandwidth).
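A sketch of how that could look (this assumes the GL_EXT_paletted_texture extension is available and that glColorTableEXT can be called directly; the texture size and names are only examples):

#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>

/* One-time setup: an 8bpp paletted texture.  Needs the
   GL_EXT_paletted_texture extension (check the extension string first). */
static void setup_indexed_texture(GLuint tex_id, const GLubyte palette[256 * 4])
{
    glBindTexture(GL_TEXTURE_2D, tex_id);
    glColorTableEXT(GL_TEXTURE_2D, GL_RGBA8, 256,
                    GL_RGBA, GL_UNSIGNED_BYTE, palette);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_COLOR_INDEX8_EXT, 512, 256, 0,
                 GL_COLOR_INDEX, GL_UNSIGNED_BYTE, NULL);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
}

/* Every frame: re-upload only the emulated 320x200 area, one byte per
   pixel, and let the card expand it through the palette. */
static void upload_frame(const GLubyte *framebuffer)
{
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 320, 200,
                    GL_COLOR_INDEX, GL_UNSIGNED_BYTE, framebuffer);
}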

Stephane

Stephen Anthony wrote:

In short, a hardware video surface can’t stand too much pixel-level
access, but are very appropriate for fast blitting. A software
video surface, OTOH, is appropriate for direct pixel access.

If I create an array of dirty SDL_Rects every frame and then update
them all with SDL_UpdateRects() once per frame, which approach
would work best? I think software video surface would be best in
this case, and that’s why I use it.

Is there any performance difference between windows fullscreen and
windowed ? When in windowed mode, the data is sent when doing
updaterects. When in fullscreen mode, the data is already there and
updaterects does nothing. Maybe you should have different settings
for windowed/fullscreen ?

Doesn’t seem to be. When there are a lot of changes onscreen, I can
really see the slowdown in the rendering in both cases. I don’t yet
have the option of doing a frameskip, but since it works so well in
Linux, I’d really like to find out how to fix it (introducing frameskip
is really only a hack in this case anyway). The rendering/updating
shouldn’t be that slow, and skipping frames is a cheat.

In the end, it seems to me that letting the user choose the rendering
method is the way to go. I can’t see a way to extrapolate

Well, they can choose between software and OpenGL mode, and in fact I
recommend for Windows users to use OpenGL mode until I can figure this
out. Other than that, I still haven’t figured out how to fix the
software mode. It’s due for a partial rewrite for the next release
anyway, so that may take care of it.

Heck, the Linux software rendering is even faster than
OpenGL mode! (which makes sense, since software mode uses dirty
updates and OpenGL updates the whole texture).

Then you should try 8bpp textures and glTexSubImage2D. I guess the
2600 has less than 256 colors ;) so that will do just fine.

Actually, some modes have exactly 256 colors, so including those and
some colors for alpha-blending will push it to 16-bit, which is what
I’m using now.

FYI, 8bpp textures are faster that direct pixel access under X11 here
(that makes sense too, since you’re sending 8bpp values to the card,
and the card converts those to the display bpp, instead of sending
bigger already converted values which eat up the video card’s
bandwidth).

Since the software mode uses 4-6% and OpenGL uses 9-12%, I don’t think
it’s worth the trouble to go to paletted textures (and the texture is
only 320x200 anyway). Not to mention that I don’t have to use dirty
updates in OpenGL mode, so a lot of the tricky dirty rectangle update
detection and rendering code just disappears. I wish I could get rid
of software mode altogether. OpenGL is just so nice :)

Anyway, thanks for the info
Steve

On September 5, 2004 03:16 pm, Stephane Marchesin wrote:

What did you put in you updaterects func ?

I took some of the code from the nanox port and massaged it a bit (my code
is below). Although the XBOX supports several video modes, to keep things
simple I keep it in the default 640x480 mode and then centre it based on
the requested resolution when calling SDL_SetVideoMode(). Once I am happy
that everything works, I will look at supporting other modes. Apologies
for the quality of the code… it has been a few years since I have done
any serious C programming.

Wild guess : sdldoom uses a single buffered surface, and thus tries to
draw directly to the video memory. This is very bad (and very slow) to
the point that this could explain the speed problem you’re seeing.

That is exactly what I am doing, and I thought it would be very slow
too… however, my timing info indicates that each frame (sdldoom seems to
ask for the whole screen to be redrawn every time) takes about 18-19ms to
draw.

Depending on which API you have access to, that might be possible
(windib ? directx ? what APIs can you use if you don’t have the devkit
?).

I don’t have access to any APIs :( Nearly all functions supplied by the
XBOX kernel are memory/disk/io related.

If you don’t have access to an API doing this, you can do it directly
through mmio (like is done in the fbcon backend). If you know the base
address of the mmio registers I can help you for the rest.

This I do know and is what I plan to use to support more video modes in
the future. I freely admit that I don’t know much about mmio, so I may
take you up on your offer at some stage.

Thanks very much for the thoughts. Last night after I posted I inserted
some tracing info into the sdldoom code and found that I am only really
getting about 2 fps - yikes! I have found a function that is called a
lot (NetUpdate()) that takes between 100-300ms each time. I will put my
tracing statements in that guy tonight and see how we go. Thanks again

// center the screen - dodgy!!!
int VIDEO_BUFFER_ADDR = 0xF0040240
    + (((SCREEN_HEIGHT - this->screen->h)/2) * (SCREEN_WIDTH * SCREEN_PIXELWIDTH))
    + (((SCREEN_WIDTH - this->screen->w)/2) * SCREEN_PIXELWIDTH);

// These are the values for the incoming image
xinc = this->screen->format->BytesPerPixel;
yinc = this->screen->pitch;

for (i = 0; i < numrects; ++i)
{
    int start = times(NULL);
    int x = rects[i].x;
    int y = rects[i].y;
    int w = rects[i].w;
    int h = rects[i].h;
    src = this->screen->pixels + y*yinc + x*xinc;
    dest = (unsigned char *)VIDEO_BUFFER_ADDR;
    destinc = SCREEN_WIDTH * SCREEN_PIXELWIDTH;

    unsigned char *ptrsrc, *ptrdst;
    for (j = h; j > 0; --j, src += yinc, dest += destinc)
    {
        ptrsrc = src;
        ptrdst = dest;
        for (k = w; k > 0; --k)
        {
            unsigned char r, g, b;
            if (this->screen->format->BytesPerPixel == 1)
                SDL_GetRGB(*ptrsrc, this->screen->format, &r, &g, &b);
            else if (this->screen->format->BytesPerPixel == 2)
                SDL_GetRGB(*(unsigned short *)ptrsrc, this->screen->format, &r, &g, &b);
            else
                SDL_GetRGB(*(unsigned int *)ptrsrc, this->screen->format, &r, &g, &b);
            *ptrdst++ = b;
            *ptrdst++ = g;
            *ptrdst++ = r;
            *ptrdst++ = 0;
            ptrsrc += xinc;
        }
    }
    printf("BLAH Updating screen x=%d y=%d w=%d h=%d time=%d\n",
           x, y, w, h, times(NULL) - start);
}

On Sun, 05 Sep 2004 19:25:00 +0200, Stephane Marchesin <stephane.marchesin at wanadoo.fr> wrote:


Craig Edwards

Yikes! The code that polls the XBOX controller for input (somebody else
wrote it) had some sleep statements in there that took approximately 70ms
out of a normal 77ms execution time when polling the joystick. When I
removed those sleeps, things started running like a champion!

I do still have one hitch, though, which is that once you press fire, it
just keeps firing… Now that the joystick poll is no longer sleeping,
there must be some problem with my logic that fires off events saying the
button has been pressed. I need to check my logic in that area. Anyway,
thanks for the advice… I appreciate it!

On Mon, 06 Sep 2004 18:11:24 +1000, Craig Edwards wrote:

Thanks very much for the thoughts. Last night after I posted I inserted
some tracing info into the sdldoom code and found that I am only really
getting about 2 fps - yikes! I have found a function that is called a
lot (NetUpdate()) that takes between 100-300ms each time. I will put
my tracing statements in that guy tonight and see how we go. Thanks
again


Craig Edwards

Craig Edwards wrote:

On Mon, 06 Sep 2004 18:11:24 +1000, Craig Edwards wrote:

Thanks very much for the thoughts. Last night after I posted I
inserted some tracing info into the sdldoom code and found that I am
only really getting about 2 fps - yikes! I have found a function
that is called a lot (NetUpdate()) that takes between 100-300ms
each time. I will put my tracing statements in that guy tonight and
see how we go. Thanks again

Yikes! The code that polls the XBOX controller for input (somebody
else wrote it) had some sleep statements in there that took
approximately 70ms out of a normal 77ms execution time when polling
the joystick. When I removed those sleeps, things start running like
a champion!

I do still have one hitch, though, which is that once you press fire,
it just keeps firing… Now that the joystick poll is no longer
sleeping, there must be some problem with my logic that fires off
events saying the button has been pressed. I need to check my logic
in that area. Anyway, thanks for the advice… I appreciate it!

Nice you got it working !

Now, you still might want to get h/w accelerated blits because, if you
put it into perspective, 18ms to render a frame means at most 55fps.
As for my offer to help you with low level graphics stuff - yes, it
stands, and I even find that quite interesting. I’m doing some low level
video programming already, and the 2d functions of the geforce/tnt
series are documented (you can find examples on how to program that chip
in the XFree86/Xorg sources, and also in the old 3D Utah GLX driver).
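About the auto-fire you mention: the usual cause is posting a button event on every poll instead of only on a state change. A rough sketch of edge-triggered reporting, assuming your backend uses the internal SDL_PrivateJoystickButton() helper from SDL_joystick_c.h inside the SDL source tree (the button count and names are illustrative):

#include "SDL.h"
#include "SDL_joystick_c.h"   /* internal SDL 1.2 header: SDL_PrivateJoystickButton() */

#define XBOX_NUM_BUTTONS 8    /* illustrative */

static Uint8 prev_state[XBOX_NUM_BUTTONS];

/* Called from your SDL_SYS_JoystickUpdate(): post an event only when a
   button changes state, never on every poll while it is held down. */
static void report_buttons(SDL_Joystick *joystick, const Uint8 *cur_state)
{
    int i;
    for (i = 0; i < XBOX_NUM_BUTTONS; ++i) {
        if (cur_state[i] != prev_state[i]) {
            SDL_PrivateJoystickButton(joystick, (Uint8)i,
                                      cur_state[i] ? SDL_PRESSED : SDL_RELEASED);
            prev_state[i] = cur_state[i];
        }
    }
}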

Stephane