[Win32] DDraw sub-optimal performance

I wrote a test program yesterday to compare SDL’s and Allegro’s performance
under Win32. To my horror I found out that SDL is, in the best case, twice
as slow as Allegro (even when colour depths match, and even in full-screen
mode). Ouch! I’m on a 1.4GHz T-Bird running Win2k with a Matrox G450 card.
I checked SDL_GetVideoInfo() and SDL_VideoDriverName() – it’s using directx
alright and everything other than Alpha is accelerated (HW/SW/SW->HW
blitting, etc). I can’t get SDL to take less than 30%-60% CPU in any mode
(full-screen/hw-surface/double-buffered, windowed/sw-surface, etc.), while
with Allegro I get <5% CPU in full-screen mode, and 27% in windowed mode.
My first guess would have been that Allegro does any bpp conversion via
DDraw whereas SDL does it “manually”, but that would not explain the <5%
(Allegro) vs. 30% (SDL) performance hit while in full-screen mode. Any
thoughts? Am I doing something particularly stupid here?!

########################################################
SDL code
########################################################

#include <stdlib.h>
#include "SDL.h"

int main(int argc, char *argv[]) {
  SDL_Init(SDL_INIT_AUDIO | SDL_INIT_VIDEO | SDL_INIT_TIMER);

  SDL_Surface *screen;
  screen = SDL_SetVideoMode(640, 480, 24,
                            SDL_HWSURFACE|SDL_DOUBLEBUF|SDL_FULLSCREEN);

  atexit(SDL_Quit);

  SDL_ShowCursor(SDL_DISABLE);

  SDL_Event event;

  int i, j, c = 0;

  while(1) {
    SDL_LockSurface(screen);
    for (i = 0; i < screen->h; i++) {
      for (j = 0; j < (screen->w*3); j++) {
        ((unsigned char *)(screen->pixels))[i * screen->pitch + j] = c;
      }
    }
    c+=16;
    SDL_UnlockSurface(screen);
    SDL_Flip(screen);

    if (SDL_PollEvent(&event) && (((event.type == SDL_KEYDOWN) &&
        (event.key.keysym.sym == SDLK_ESCAPE)) ||
        (event.type == SDL_QUIT)))
    {
      exit(0);
    }

    SDL_Delay(30);

  }
}

########################################################
Allegro code
########################################################

#include <stdlib.h>
#include "allegro.h"

int main(int argc, char *argv[]) {
  allegro_init();
  install_keyboard();
  install_timer();

  set_display_switch_mode(SWITCH_BACKGROUND);

  set_color_depth(24);

  set_gfx_mode(GFX_AUTODETECT_FULLSCREEN, 640, 480, 0, 0);

  BITMAP* pages[2];
  pages[0] = create_video_bitmap(640, 480);
  pages[1] = create_video_bitmap(640, 480);
  int cpage = 0;

  int i, j, c = 0;

  while (1) {
    BITMAP* pixels = pages[cpage];

    acquire_bitmap(pixels);
    for (i = 0; i < pixels->h; i++) {
      unsigned long address = bmp_write_line(pixels, i);
      bmp_select(pixels);
      for (j = 0; j < (pixels->w*3); j++) {
        bmp_write8(address+j, c);
      }
      bmp_unwrite_line(pixels);
    }
    c+=16;
    release_bitmap(pixels);

    show_video_bitmap(pixels);
    cpage = 1 - cpage;

    poll_keyboard();

    if (key[KEY_ESC])
      break;

    rest(30);

  }

  return 0;
}

END_OF_MAIN();

Am I doing something particularly stupid here?!

Well, off the top of my head I notice that you’re asking SDL for a 24 bpp
mode. That’s actually the slowest video pixel format, and almost always
has to be converted to the native video format. Try using 0 with the
SDL_ANYFORMAT flag to use the “best” video format. Or even try 32, which
is probably what your video card is set to.

The other thing I notice is that you ask for a hardware surface, and then
lock and touch each pixel directly. This means that if SDL does indeed
give you access to the video memory, as you requested, then each memory
access has to go over the PCI bus … a very slow operation.

Allegro is probably giving you a 32 bpp software surface and then using
hardware accelerated bitblt to get it to the screen. SDL will do the same,
if you ask it to. :)

See ya,
-Sam Lantinga, Software Engineer, Blizzard Entertainment
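
A minimal sketch of the drawing side of that advice in plain C, with no SDL calls: fill an ordinary system-memory buffer at 32 bpp, one 32-bit store per pixel, stepping by the pitch between rows. The name fill_gray32 and its parameters are made up for illustration. With SDL the buffer would presumably be a software surface's pixels and pitch, and a single blit or flip would then move the finished frame to the screen.

```c
#include <stdint.h>

/* Fill a w x h block of 32-bit pixels with the grey level `gray`.
 * `pixels` points at the first row and `pitch_bytes` is the distance in
 * bytes from one row to the next (assumed here to be a multiple of 4).
 * Bytes between the end of a row and the start of the next are left
 * untouched. */
void fill_gray32(void *pixels, int w, int h, int pitch_bytes, uint8_t gray)
{
    /* Replicate the grey level into the three colour channels. */
    uint32_t value = ((uint32_t)gray << 16) | ((uint32_t)gray << 8) | gray;
    uint8_t *row = (uint8_t *)pixels;
    int x, y;

    for (y = 0; y < h; y++) {
        uint32_t *p = (uint32_t *)row;   /* one store per pixel, not three */
        for (x = 0; x < w; x++) {
            p[x] = value;
        }
        row += pitch_bytes;              /* step by the pitch, not by w * 4 */
    }
}
```

Whether this is faster than the 24 bpp byte loop on a given machine is something to measure rather than assume.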

while(1) {
SDL_LockSurface(screen);
for (i = 0; i < screen->h; i++) {
for (j = 0; j < (screen->w*3); j++) {
((unsigned char *)(screen->pixels))[i * screen->pitch + j] = c;

Might very well be your unoptimized approach here, while with allegro you
are leaving allegro to do the work, and it’s optimized. Try this instead:

int x, y, w, inc, c = 0;
Uint8 *ptr8;

w = screen->w;
inc = screen->pitch - w * 3;
ptr8 = (Uint8 *) screen->pixels;

for (y=0; y<screen->h; y++) {
  for (x=0; x<w; x++) {
    *ptr8++ = c;
    *ptr8++ = c;
    *ptr8++ = c;
  }

  ptr8 += inc;
}

Should be a lot faster.


SDL mailing list
SDL at libsdl.org
http://www.libsdl.org/mailman/listinfo/sdl

----- Original Message -----
From: vlad.romascanu@ericsson.ca (Vlad Romascanu (LMC))
To:
Sent: Sunday, March 17, 2002 12:51 AM
Subject: [SDL] [Win32] DDraw sub-optimal performance

“Vlad Romascanu (LMC)” <Vlad.Romascanu at ericsson.ca> wrote in message
news:mailman.1016418856.22397.sdl at libsdl.org

screen = SDL_SetVideoMode(640, 480, 24,
SDL_HWSURFACE|SDL_DOUBLEBUF|SDL_FULLSCREEN);

while(1) {
SDL_LockSurface(screen);
for (i = 0; i < screen->h; i++) {
for (j = 0; j < (screen->w*3); j++) {
((unsigned char *)(screen->pixels))[i * screen->pitch + j] = c;
}
}
c+=16;
SDL_UnlockSurface(screen);

Direct pixel access is inherently slow (and the extra multiplications
per pixel aren’t helping). If you want to fill the screen with one color,
use ‘SDL_FillRect’. If you want to test the relative performance of
the two APIs, find a more realistic test case.
--
Rainer Deyke | root at rainerdeyke.com | http://rainerdeyke.com

Heck, if you are going to set it all to 0, just use memset(surf->pixels,0,surf->pitch*surf->h);

locking and unlocking surf around that, if needed…
--
-==-
Jon Atkins
http://jcatki.2y.net/

Heck, if you are going to set it all to 0, just use memset(surf->pixels,0,surf->pitch*surf->h);

locking and unlocking surf around that, if needed…

No, don’t ever do that. There may be other surfaces outside surf->w that
you’ll be zeroing (possibly used by other applications).

See ya,
-Sam Lantinga, Software Engineer, Blizzard Entertainment
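
A minimal sketch of a clear that stays inside the surface, assuming the usual layout where each row holds w pixels followed by padding up to the pitch. It writes only w * bytes_per_pixel bytes per row and skips whatever lies between rows. The name clear_rows and its parameters are hypothetical; within SDL itself, the SDL_FillRect call suggested earlier in the thread is the library route.

```c
#include <stdint.h>
#include <string.h>

/* Zero a w x h pixel area one row at a time.
 * Only the first w * bytes_per_pixel bytes of each row are written; the
 * remaining pitch_bytes - w * bytes_per_pixel bytes of every row, which
 * may belong to something else, are never touched. */
void clear_rows(void *pixels, int w, int h, int pitch_bytes,
                int bytes_per_pixel)
{
    uint8_t *row = (uint8_t *)pixels;
    size_t row_bytes = (size_t)w * (size_t)bytes_per_pixel;
    int y;

    for (y = 0; y < h; y++) {
        memset(row, 0, row_bytes);
        row += pitch_bytes;
    }
}
```

The difference from the single memset over pitch * h is exactly the padding bytes at the end of every row.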

Rainer Deyke wrote:

Direct pixel access is inherently slow (and the extra multiplications
per pixel aren’t helping). If you want to fill the screen with one color,
use ‘SDL_FillRect’. If you want to test the relative performance of
the two APIs, find a more realistic test case.

If you look at my original message you’ll see that with Allegro I can do as
little as <5% CPU (including the for loops, multiplications, video memory
access, etc.), so how on earth is filling the screen in a for loop
responsible for the 30-70% performance loss when using SDL? For my
particular application doing pixel-level manipulation is the only realistic
case. I see your point that doing it directly into a HW surface hits
performance, though, but for some obscure reason even if I specify a SW
surface it’s just as slow. And, BTW, removing the for loops completely
makes for 0% improvement in performance (flip is the CPU hog). Also, as I
said in my reply to Sam’s mail, I see SDL doing proper clipping in windowed
mode whereas Allegro doesn’t, and if we take into account the fact that
asking for SW<->HW surfaces makes no significant performance difference I’d
say SDL is not taking full advantage of acceleration. Should I explicitly
ask SDL for an overlay?

V.

Jonathan Atkins wrote:

Heck, if you are going to set it all to 0, just use
memset(surf->pixels,0,surf->pitch*surf->h);
locking and unlocking surf around that, if needed…

Look, the for loop has nothing to do with performance. And I’m not 0-ing
pixels out, BTW. I can very well comment out the two for loops and only
leave the flip, and it’s still eating 30-70% CPU (trust me, I did that to
find who the CPU hog is, and it’s not LockSurface, UnlockSurface or the for
loops – it’s SDL_Flip). I suspect SDL is doing some very unoptimized
copying + conversion from pixels[] into the ddraw surface behind the scenes
(the LowerBlitWhatever function?). Can someone comment on this?!
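
One way to put numbers behind a claim like this is to time each call separately. Below is a minimal sketch in standard C; average_ms and count_calls are hypothetical names, and count_calls only stands in for whatever is being measured (the fill loop, the flip, and so on). Note the assumption about clock(): the C standard defines it as processor time, but some C runtimes, Microsoft's among them, report elapsed wall-clock time instead, so check what yours does before reading the result as CPU load.

```c
#include <time.h>

/* Hypothetical stand-in for the call being measured: it only counts how
 * often it was invoked. */
void count_calls(void *arg)
{
    int *counter = (int *)arg;
    (*counter)++;
}

/* Run fn(arg) `repeats` times and return the average clock() time per
 * call in milliseconds. Returns 0.0 if repeats is not positive. */
double average_ms(void (*fn)(void *), void *arg, int repeats)
{
    clock_t start, end;
    int i;

    if (repeats <= 0) {
        return 0.0;
    }
    start = clock();
    for (i = 0; i < repeats; i++) {
        fn(arg);
    }
    end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / repeats;
}
```

Wrapping the flip and the fill loop in separate functions and comparing their averages would show which one dominates.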

Jason Hoffoss wrote:

Might very well be your unoptimized approach here, while with allegro you
are leaving allegro to do the work, and it’s optimized. Try this instead:

Please look closer. In the Allegro code I actually do more work (I call
one function for every byte and an additional two function calls per
scan-line, as opposed to the SDL code where I only do a plain assignment for
every byte). The multiplication is peanuts. Anyway, completely removing the
nested for loops brings 0.0000% improvement (see above), so that is not
the reason.

Any more insights?!

V.

Sam Lantinga wrote:

Well, off the top of my head I notice that you’re asking SDL for a 24 bpp
mode. That’s actually the slowest video pixel format, and almost always
has to be converted to the native video format. Try using 0 with the
SDL_ANYFORMAT flag to use the “best” video format. Or even try 32, which
is probably what your video card is set to.

I tried HW double-buffered, HW plain, and SW surfaces, in both full-screen
and windowed mode (all 4-6 combinations). I was not able to get less than
30% CPU overhead (the worst case, windowed mode, can eat as much as 70%).
24bpp should not matter in full-screen mode (in windowed mode my card is set
to 32bpp – I chose 24bpp on purpose to test conversion, but even if I
choose 0 [take desktop bpp] there is <5% improvement of the overhead).

I configured the Allegro surface the same way (24bpp). The most overhead I
get there is 30%, so I’d say something is definitely different. As a
side-note I can see that SDL clips properly any tool-tips that occur over
the surface (e.g. tool-tips for the window close/minimize/maximize buttons
that pop up over the surface), whereas Allegro couldn’t care less (no
clipping, the tool-tip is overwritten within 30ms). Also, software mouse
cursors (cursors with the “shadow” enabled, as opposed to the "plain"
cursors that get hw-accelerated on my card) flicker like crazy with Allegro
no matter where on the screen they are, whereas I get no flicker with SDL.
So it definitely looks like things are being done differently by Allegro and
SDL.

Cheers,
Vlad.

“Vlad Romascanu (LMC)” <Vlad.Romascanu at ericsson.ca> wrote in message
news:mailman.1016505310.2975.sdl at libsdl.org

And, BTW, removing the for loops completely
makes for 0% improvement in performance (flip is the CPU hog).

Let’s say that you actually got a hardware surface and double
buffering. (Check the ‘flags’ attribute of the screen surface to be
sure.) If this is the case, ‘SDL_Flip’ will perform two actions (not
necessarily in the order in which I describe them):

  • It will set a register on the graphics card. This takes
    approximately no time at all, especially compared with your ‘for’
    loops.

  • It will wait until the vertical retrace so that graphics operation
    on the new back buffer don’t interfere with the CRT. This can take up
    to 1000/‘n’ ms, where ‘n’ is your monitor refresh rate. 60 Hz is a
    common low-end refresh rate, resulting in a delay of up to 16 ms. And
    this may very well be CPU time, since modern graphics cards often
    don’t have a vertical retrace interrupt. But this won’t normally
    interfere all that much with overall performance, since you can only
    update the screen once per refresh anyway.
--
Rainer Deyke | root at rainerdeyke.com | http://rainerdeyke.com
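
The arithmetic in the second bullet, as a minimal C sketch. Both function names are made up for illustration. The second function rests on two assumptions that are not established here: that the wait for the retrace is a busy-wait, and that the rest of the loop is just a sleep of delay_ms with negligible drawing time. Under those assumptions it gives an upper bound on the share of each loop iteration spent spinning, not a measurement.

```c
/* Upper bound, in milliseconds, on how long a flip that waits for the
 * vertical retrace can block: one full refresh period, 1000 / n. */
double max_retrace_wait_ms(double refresh_hz)
{
    return 1000.0 / refresh_hz;
}

/* Upper bound on the fraction of a loop iteration spent waiting for the
 * retrace, for a loop that sleeps delay_ms and then flips. Assumes the
 * wait burns CPU and that drawing time is negligible. */
double max_wait_fraction(double refresh_hz, double delay_ms)
{
    double wait = max_retrace_wait_ms(refresh_hz);
    return wait / (delay_ms + wait);
}
```

For a 60 Hz display and the 30 ms delay used in the test program, the bound works out to about 16.7 ms per flip and roughly 36% of each iteration.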

  • It will wait until the vertical retrace so that graphics operation
    on the new back buffer don’t interfere with the CRT. This can take up
    to 1000/‘n’ ms, where ‘n’ is your monitor refresh rate. 60 Hz is a
    common low-end refresh rate, resulting in a delay of up to 16 ms. And
    this may very well be CPU time, since modern graphics cards often
    don’t have a vertical retrace interrupt. But this won’t normally
    interfere all that much with overall performance, since you can only
    update the screen once per refresh anyway.

SDL doesn’t wait for vsync, as far as I know, hardware surface or not.

It sounds like he’s not getting a hardware surface to me anyhow.

--ryan.

“Ryan C. Gordon” wrote in message
news:mailman.1016509686.5552.sdl at libsdl.org

  • It will wait until the vertical retrace so that graphics operation
    on the new back buffer don’t interfere with the CRT. This can take up
    to 1000/‘n’ ms, where ‘n’ is your monitor refresh rate. 60 Hz is a
    common low-end refresh rate, resulting in a delay of up to 16 ms. And
    this may very well be CPU time, since modern graphics cards often
    don’t have a vertical retrace interrupt. But this won’t normally
    interfere all that much with overall performance, since you can only
    update the screen once per refresh anyway.

SDL doesn’t wait for vsync, as far as I know, hardware surface or not.

SDL-1.2.3/src/video/windx5/SDL_dx5video.c, function
‘DX5_FlipHWSurface’, line 1930:

result = IDirectDrawSurface3_Flip(dd_surface,NULL,DDFLIP_WAIT);

‘DDFLIP_WAIT’ indicates that the function should not return until the
flip is complete (i.e. after the vertical retrace).
--
Rainer Deyke | root at rainerdeyke.com | http://rainerdeyke.com

SDL-1.2.3/src/video/windx5/SDL_dx5video.c, function
‘DX5_FlipHWSurface’, line 1930:

result = IDirectDrawSurface3_Flip(dd_surface,NULL,DDFLIP_WAIT);

I stand corrected. Sorry 'bout that.

--ryan.