Video Performance Question, and a few possible SDL Video (DX) bugs

Hey guys! I would really appreciate it if you gurus would look at this and tell me what you think.

First, take a look at this snippet from my profiler:

      Func      Func+Child            Hit
      Time     %        Time     %    Count  Function
-------------------------------------------------------------------------
220372.625  87.2  220372.625  87.2   303936  SDLVideoOutputDevice::tempStretchWhile(unsigned short * &,int &) (sdlvideooutputdevice.obj)
  8359.949   3.3    8359.949   3.3   303936  SDLVideoOutputDevice::tempStretchFor(unsigned short * &) (sdlvideooutputdevice.obj)
  5202.653   2.1   13197.765   5.2  9269599  ProcessorBus::tick(void) (processorbus.obj)
  1866.695   0.7    1866.695   0.7  3063223  std::basic_istream<char,struct std::char_traits<char> >::get(void) (msvcp60d.dll)
  1569.172   0.6    2207.665   0.9  5906398  AY38914::tick(void) (ay38914.obj)
1569.172 0.6 2207.665 0.9 5906398 AY38914::tick(void) (ay38914.obj)
… (1042 more modules snipped for brevity…)

I think we all see my performance problem. Now, here’s the code in that function:
void SDLVideoOutputDevice::tempStretchWhile(PUINT16& backBuffer, int& y)
{
    /* surface pitch is in bytes; halve it to count 16-bit pixels */
    int pitch = (m_pSurface->pitch >> 1);
    while (m_dy < 0) {
        m_dy += m_dyinc1;
        /* copy the previous output line into the current one */
        memcpy(backBuffer, backBuffer-(m_pSurface->pitch >> 1),
               m_OutputWidth<<1);
        backBuffer += pitch;
        y--;
    }
    m_dy += m_dyinc2;
}
Basically, this is the vertical part of the bitmap stretch code (I temporarily broke it out into a separate function just to help identify the exact cause of my slow performance). It copies the previous line the appropriate number of times. I put temporary variables in here to make sure the while loop runs for the proper number of iterations (usually 3 per source line).
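
For readers who want to see the row-repetition idea in isolation, here is a minimal sketch. It is not the Bliss32 code: the name stretchRows, the err accumulator, and the tightly packed buffers are all assumptions made for illustration, and it copies each row from a separate source buffer rather than from the previous output line.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: stretch srcH rows of 16-bit pixels to dstH rows
// (dstH >= srcH) by repeating rows, using an integer error accumulator.
// Both buffers are assumed tightly packed (pitch == width pixels).
std::vector<std::uint16_t> stretchRows(const std::vector<std::uint16_t>& src,
                                       std::size_t width,
                                       std::size_t srcH,
                                       std::size_t dstH)
{
    std::vector<std::uint16_t> dst(width * dstH);
    if (width == 0 || srcH == 0 || dstH < srcH) {
        return dst;  // nothing sensible to copy
    }
    std::size_t srcRow = 0;
    std::size_t err = 0;
    for (std::size_t dstRow = 0; dstRow < dstH; ++dstRow) {
        // Emit the current source row once more.
        std::memcpy(dst.data() + dstRow * width,
                    src.data() + srcRow * width,
                    width * sizeof(std::uint16_t));
        // Advance to the next source row once enough output rows were emitted.
        err += srcH;
        if (err >= dstH) {
            err -= dstH;
            ++srcRow;
        }
    }
    return dst;
}
```

With srcH = 192 and dstH = 768, each source row is emitted four times, which is consistent with one original line plus three repeats.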

At 1024x768 (stretching a 320x192 bitmap), the program’s running like @%!& on my 936MHz Athlon w/ 32MB TNT2. So, I tried this experiment:

void SDLVideoOutputDevice::tempStretchWhile(PUINT16& backBuffer, int& y)
{
    int pitch = (m_pSurface->pitch >> 1);
    int pitchdif = pitch - m_OutputWidth;
    while (m_dy < 0) {
        m_dy += m_dyinc1;
        PUINT16 source = backBuffer-pitch;
        for(int i = 0; i < m_OutputWidth; i++) {
            *backBuffer++ = *source++;
        }
        backBuffer += pitchdif;
        y--;
    }
    m_dy += m_dyinc2;
}

This code runs just as badly as before. So, I tried this:

void SDLVideoOutputDevice::tempStretchWhile(PUINT16& backBuffer, int& y)
{
    int pitch = (m_pSurface->pitch >> 1);
    int pitchdif = pitch - m_OutputWidth;
    int test = 0;
    while (m_dy < 0) {
        m_dy += m_dyinc1;
        PUINT16 source = backBuffer-pitch;
        for(int i = 0; i < m_OutputWidth; i++) {
            backBuffer++;
            test = *source++;
        }
        backBuffer += pitchdif;
        y--;
    }
    m_dy += m_dyinc2;
}

The difference here is that I’m not actually writing to the backbuffer. This is still pretty much as slow as the original code.

Finally, to satisfy my curiosity, I tried this:
void SDLVideoOutputDevice::tempStretchWhile(PUINT16& backBuffer, int& y)
{
    int pitch = (m_pSurface->pitch >> 1);
    int pitchdif = pitch - m_OutputWidth;
    int test = 0;
    while (m_dy < 0) {
        m_dy += m_dyinc1;
        PUINT16 source = backBuffer-pitch;
        for(int i = 0; i < m_OutputWidth; i++) {
            *backBuffer++ = 0;
        }
        backBuffer += pitchdif;
        y--;
    }
    m_dy += m_dyinc2;
}

The point here is that I’m not reading the surface memory. This runs much faster (still not as fast as I would have hoped, though).

So, the question is: Why are surface reads so slow? My guess is that, even though I’m specifying SDL_SWSURFACE, they might be getting created in hardware memory? The program (Bliss32 Intellivision Emulator) in question runs about the same speed in a window, but that could be due to the flip blit overhead.

That’s when I started looking at SDL.

I thought I had found the culprit here, in CreateRGBSurface:

if ( ((flags&SDL_HWSURFACE) == SDL_SWSURFACE) ||
     (video->AllocHWSurface(this, surface) < 0) ) {
    if ( surface->w && surface->h ) {
        surface->pixels = malloc(surface->h*surface->pitch);
        if ( surface->pixels == NULL ) {
            SDL_FreeSurface(surface);
            SDL_OutOfMemory();
            return(NULL);
        }
        /* This is important for bitmaps */
        memset(surface->pixels, 0, surface->h*surface->pitch);
    }
}
I might be misreading that, but it looks like it’s trying to allocate a hardware surface even when HWSURFACE is not set?!?!

Unfortunately, I modified my copy of it, and it made no apparent difference. So, I’m thinking I just haven’t found the right piece of code yet.

Here’s something else that struck me as odd in the SDL code:

/* in DX5_SetVideoMode */
video->flags |= SDL_SWSURFACE;

/* in DSp_SetVideoMode */
current->flags |= SDL_DOUBLEBUF | SDL_SWSURFACE; /* only front buffer is in VRAM */

Since SDL_SWSURFACE = 0x0, shouldn’t those be:

/* in DX5_SetVideoMode */
video->flags &= ~SDL_HWSURFACE;

/* in DSp_SetVideoMode */
current->flags |= SDL_DOUBLEBUF & ~SDL_HWSURFACE; /* only front buffer is in VRAM */

???

If anyone out there can shed some light on my problem, I’d greatly appreciate it.

Thanks in advance,
-Jesse Litton
@Evil

So, the question is: Why are surface reads so slow? My guess is that, even though I’m specifying SDL_SWSURFACE, they might be getting created in hardware memory? The program (Bliss32 Intellivision Emulator) in question runs about the same speed in a window, but that could be due to the flip blit overhead.

That’s when I started looking at SDL.

I thought I had found the culprit here, in CreateRGBSurface:

if ( ((flags&SDL_HWSURFACE) == SDL_SWSURFACE) ||
(video->AllocHWSurface(this, surface) < 0) ) {
}
I might be misreading that, but it looks like it’s trying to allocate a hardware surface even when HWSURFACE is not set?!?!

No, it’s misleading due to the use of SDL_SWSURFACE, but the code is correct.

Unfortunately, I modified my copy of it, and it made no apparent difference. So, I’m thinking I just haven’t found the right piece of code yet.

Here’s something else that struck me as odd in the SDL code:

/* in DX5_SetVideoMode */
video->flags |= SDL_SWSURFACE;

This is a no-op, again misleading, but the flags are cleared above so it
should be okay. You can double check the code by putting a print statement
in the code that actually allocates a hardware surface and make sure it’s
not being called.
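
To make the flag arithmetic in this exchange concrete, here is a small standalone sketch. The constants are assumptions meant to mirror SDL 1.2's SDL_video.h (the thread itself only confirms that SDL_SWSURFACE is 0x0), and the helper names are hypothetical. It illustrates why the CreateRGBSurface test reads correctly: the comparison against zero is true exactly when the hardware bit is clear, and because `||` short-circuits, the right-hand AllocHWSurface call is only reached when that comparison is false.

```cpp
#include <cstdint>

// Assumed values, mirroring SDL 1.2's SDL_video.h for illustration only.
constexpr std::uint32_t kSwSurface = 0x00000000u;  // SDL_SWSURFACE
constexpr std::uint32_t kHwSurface = 0x00000001u;  // SDL_HWSURFACE
constexpr std::uint32_t kDoubleBuf = 0x40000000u;  // SDL_DOUBLEBUF

// True when the hardware bit is NOT set; same shape as the
// (flags & SDL_HWSURFACE) == SDL_SWSURFACE test quoted above.
constexpr bool isSoftwareSurface(std::uint32_t flags)
{
    return (flags & kHwSurface) == kSwSurface;
}

// OR-ing in a zero-valued flag changes nothing (the "no-op").
constexpr std::uint32_t orInSoftware(std::uint32_t flags)
{
    return flags | kSwSurface;
}

// Actually forcing a software surface means clearing the hardware bit.
constexpr std::uint32_t clearHardware(std::uint32_t flags)
{
    return flags & ~kHwSurface;
}
```

One side observation under the same assumptions: an expression of the form `flags |= X & ~H` only ORs in X with H masked out of X; it does not clear H from `flags`.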

/* in DSp_SetVideoMode */
current->flags |= SDL_DOUBLEBUF | SDL_SWSURFACE; /* only front buffer is in VRAM */

I think this is wrong. You shouldn’t get SDL_DOUBLEBUF and not SDL_HWSURFACE.
SDL_DOUBLEBUF really means that you have access to page flipped video memory.
Darrell, can you fix that?

See ya!
-Sam Lantinga, Software Engineer, Blizzard Entertainment

Thanks for the response Sam!

I finally decided to stop fighting SDL’s tendency to make HWSURFACE’s (since
this is preferable in most cases), and instead re-wrote my code to generate
the output in a temporary buffer, then to copy that buffer to the
backbuffer, avoiding the video RAM reads entirely.
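
A rough sketch of that approach, for anyone following along. This is not the actual Bliss32 change: copyFrameToSurface and its parameters are hypothetical, and the destination is just a raw pointer plus a pitch in bytes. In SDL 1.2 those would presumably come from the locked surface's pixels and pitch fields. The point is that the stretched frame is built entirely in ordinary system memory, and the destination is only ever written, never read.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copy a tightly packed 16-bit frame (built in system
// memory) into a destination whose rows are dstPitchBytes apart.
// The destination is write-only here; all reads hit the temporary frame.
void copyFrameToSurface(const std::vector<std::uint16_t>& frame,
                        std::size_t width,
                        std::size_t height,
                        std::uint8_t* dst,
                        std::size_t dstPitchBytes)
{
    const std::size_t rowBytes = width * sizeof(std::uint16_t);
    for (std::size_t y = 0; y < height; ++y) {
        std::memcpy(dst + y * dstPitchBytes,
                    frame.data() + y * width,
                    rowBytes);
    }
}
```

Any padding between rowBytes and dstPitchBytes is left untouched, so the same routine works whether or not the destination rows are tightly packed.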

Thanks again!

-J

----- Original Message -----

From: slouken@devolution.com (Sam Lantinga)
To:
Sent: Friday, September 14, 2001 8:50 PM
Subject: Re: [SDL] Video Performance Question, and a few possible SDL Video
(DX) bugs…

