Hey guys! I would really appreciate it if you gurus would look at this and tell me what you think.
First, take a look at this snippet from my profiler:
       Func                Func+Child        Hit
   Time        %        Time        %       Count   Function
-----------------------------------------------------------------------------
220372.625   87.2   220372.625   87.2     303936   SDLVideoOutputDevice::tempStretchWhile(unsigned short * &,int &) (sdlvideooutputdevice.obj)
  8359.949    3.3     8359.949    3.3     303936   SDLVideoOutputDevice::tempStretchFor(unsigned short * &) (sdlvideooutputdevice.obj)
  5202.653    2.1    13197.765    5.2    9269599   ProcessorBus::tick(void) (processorbus.obj)
  1866.695    0.7     1866.695    0.7    3063223   std::basic_istream<char,struct std::char_traits<char> >::get(void) (msvcp60d.dll)
  1569.172    0.6     2207.665    0.9    5906398   AY38914::tick(void) (ay38914.obj)
… (1042 more modules snipped for brevity)
I think we all see my performance problem. Now, here’s the code in that function:
void SDLVideoOutputDevice::tempStretchWhile(PUINT16& backBuffer, int& y)
{
    int pitch = (m_pSurface->pitch >> 1);   // surface pitch in 16-bit pixels
    while (m_dy < 0) {
        m_dy += m_dyinc1;
        // duplicate the previous output line
        memcpy(backBuffer, backBuffer - pitch, m_OutputWidth << 1);
        backBuffer += pitch;
        y--;
    }
    m_dy += m_dyinc2;
}
Basically, this is the vertical piece of the bitmap stretch code (I temporarily broke it out into a separate function just to help identify the exact cause of my slow performance). It copies the previous line the appropriate number of times. I put temporary variables in here to make sure the while loop runs for the proper number of iterations (usually 3 per source line).
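To make the stepping logic easier to follow, here's the same idea boiled down to a standalone sketch (hypothetical names, none of the class state; this is not the actual Bliss32 code):

#include <cstring>

// Bresenham-style vertical stretch in isolation: each of srcH source
// lines is emitted one or more times so the output comes to exactly
// dstH lines. "err" plays the role of m_dy.
void stretchVertical(const unsigned short* src, unsigned short* dst,
                     int width, int srcH, int dstH)
{
    int err = -dstH;
    for (int sy = 0; sy < srcH; ++sy) {
        const unsigned short* line = src + sy * width;
        while (err < 0) {       // duplicate this source line
            memcpy(dst, line, width * sizeof(unsigned short));
            dst += width;
            err += srcH;        // like m_dyinc1
        }
        err -= dstH;            // like m_dyinc2
    }
}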
At 1024x768 (stretching a 320x192 bitmap), the program's running like @%!& on my 936 MHz Athlon w/ 32 MB TNT2. So, I tried this experiment:
void SDLVideoOutputDevice::tempStretchWhile(PUINT16& backBuffer, int& y)
{
    int pitch = (m_pSurface->pitch >> 1);
    int pitchdif = pitch - m_OutputWidth;
    while (m_dy < 0) {
        m_dy += m_dyinc1;
        PUINT16 source = backBuffer - pitch;
        // same copy as before, pixel by pixel instead of memcpy
        for (int i = 0; i < m_OutputWidth; i++) {
            *backBuffer++ = *source++;
        }
        backBuffer += pitchdif;
        y--;
    }
    m_dy += m_dyinc2;
}
This code runs just as badly as before. So, I tried this:
void SDLVideoOutputDevice::tempStretchWhile(PUINT16& backBuffer, int& y)
{
    int pitch = (m_pSurface->pitch >> 1);
    int pitchdif = pitch - m_OutputWidth;
    int test = 0;
    while (m_dy < 0) {
        m_dy += m_dyinc1;
        PUINT16 source = backBuffer - pitch;
        // read from the surface, but never write to it
        for (int i = 0; i < m_OutputWidth; i++) {
            backBuffer++;
            test = *source++;
        }
        backBuffer += pitchdif;
        y--;
    }
    m_dy += m_dyinc2;
}
The difference here is that I’m not actually writing to the backbuffer. This is still pretty much as slow as the original code.
Finally, to satisfy my curiosity, I tried this:
void SDLVideoOutputDevice::tempStretchWhile(PUINT16& backBuffer, int& y)
{
    int pitch = (m_pSurface->pitch >> 1);
    int pitchdif = pitch - m_OutputWidth;
    while (m_dy < 0) {
        m_dy += m_dyinc1;
        // write to the surface, but never read from it
        for (int i = 0; i < m_OutputWidth; i++) {
            *backBuffer++ = 0;
        }
        backBuffer += pitchdif;
        y--;
    }
    m_dy += m_dyinc2;
}
The point here is that I’m not reading the surface memory. This runs much faster (still not as fast as I would have hoped, though).
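So the reads are the expensive part. Assuming that holds up, the workaround that suggests itself is to never read the surface at all: keep the most recently rendered line in a system-memory scratch buffer and duplicate from that. A rough sketch, with a hypothetical m_lineBuf member (ordinary cached RAM holding a copy of the last line written):

void SDLVideoOutputDevice::tempStretchWhile(PUINT16& backBuffer, int& y)
{
    int pitch = (m_pSurface->pitch >> 1);
    while (m_dy < 0) {
        m_dy += m_dyinc1;
        // m_lineBuf is a hypothetical system-memory copy of the previous
        // line, so the surface is only ever written, never read back
        memcpy(backBuffer, m_lineBuf, m_OutputWidth << 1);
        backBuffer += pitch;
        y--;
    }
    m_dy += m_dyinc2;
}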
Still, the question remains: why are surface reads so slow? My guess is that, even though I'm specifying SDL_SWSURFACE, the surfaces might be getting created in video memory anyway. The program in question (the Bliss32 Intellivision Emulator) runs at about the same speed in a window, but that could be due to the flip blit overhead.
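As a sanity check, SDL records what it actually allocated in the surface's flags, so something like this should settle where the memory lives:

#include <stdio.h>
#include "SDL.h"

/* After SDL_SetVideoMode / SDL_CreateRGBSurface, the flags field
   reflects what SDL actually gave you, not just what you asked for. */
void reportSurfaceLocation(SDL_Surface* surface)
{
    if (surface->flags & SDL_HWSURFACE)
        printf("surface is in video memory\n");
    else
        printf("surface is in system memory\n");
}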
That’s when I started looking at SDL.
I thought I had found the culprit here, in SDL_CreateRGBSurface:
if ( ((flags&SDL_HWSURFACE) == SDL_SWSURFACE) ||
     (video->AllocHWSurface(this, surface) < 0) ) {
    if ( surface->w && surface->h ) {
        surface->pixels = malloc(surface->h*surface->pitch);
        if ( surface->pixels == NULL ) {
            SDL_FreeSurface(surface);
            SDL_OutOfMemory();
            return(NULL);
        }
        /* This is important for bitmaps */
        memset(surface->pixels, 0, surface->h*surface->pitch);
    }
}
I might be misreading that, but it looks like it's trying to allocate a hardware surface even when SDL_HWSURFACE is not set?!?!
Unfortunately, I modified my copy of it, and it made no apparent difference. So, I’m thinking I just haven’t found the right piece of code yet.
Here’s something else that struck me as odd in the SDL code:
/* in DX5_SetVideoMode */
video->flags |= SDL_SWSURFACE;
/* in DSp_SetVideoMode */
current->flags |= SDL_DOUBLEBUF | SDL_SWSURFACE; /* only front buffer is in VRAM */
Since SDL_SWSURFACE == 0x0, ORing it into the flags is a no-op. Shouldn't those be:
/* in DX5_SetVideoMode */
video->flags &= ~SDL_HWSURFACE;
/* in DSp_SetVideoMode */
current->flags |= SDL_DOUBLEBUF;
current->flags &= ~SDL_HWSURFACE; /* only front buffer is in VRAM */
???
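For reference, here are the relevant defines from SDL_video.h (SDL 1.2), which is why ORing in SDL_SWSURFACE can't possibly change anything:

#define SDL_SWSURFACE 0x00000000 /* Surface is in system memory */
#define SDL_HWSURFACE 0x00000001 /* Surface is in video memory */

flags |= SDL_SWSURFACE;   /* no-op: ORing in zero does nothing    */
flags &= ~SDL_HWSURFACE;  /* this is what actually clears the bit */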
If anyone out there can shed some light on my problem, I’d greatly appreciate it.
Thanks in advance,
-Jesse Litton
@Evil