SDL & SMP

slouken · May 2, 2000, 2:49pm

Kostas Gewrgiou wrote:

Hello Sam

I know that the proper place to send this would have been the SDL
mailing list, but I still haven’t managed to subscribe.

I was browsing the GGI changelog today and I noticed the following:

1999-12-28 Marcus Sundberg [marcus at ggi-project.org]

display/X/mode.c:
Not calling XSync() directly after XShmPutImage() can give significant
performance improvements on SMP machines. Instead we call it before,
to prevent the X server from being flooded by ShmPut requests.

The recent SDL_ASYNCBLIT flag does more or less the same, but the above
might work better in some cases (and with no performance loss on
single-CPU systems).
Unfortunately I don’t have an SMP machine to run any benchmarks on, so I can’t
tell if it wins over asynchronous video updates, but I can’t see a reason
why both can’t go in.

Actually, the SDL_ASYNCBLIT flag does exactly that, but not on
single-CPU machines, since benchmarking showed that the X server and
application contend for the CPU in this situation. Also, if you do it
exactly the way GGI does it, you get update artifacts as the X server
displays memory you are writing to. GGI gets away with this by
emulating a vertical retrace.

Thanks anyway!
-Sam Lantinga, Lead Programmer, Loki Entertainment Software

P.S. I’ll add you to the SDL mailing list.
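
The ordering question discussed in this post can be sketched with a toy model. This is not Xlib code: `PutPipeline`, `model_put` and `model_sync` are hypothetical stand-ins for the request queue, XShmPutImage() and XSync(), and the model pessimistically assumes the server makes no progress until the client syncs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the client/server request pipeline; not Xlib code. */
typedef struct {
    int in_flight;      /* puts sent that the server has not finished yet */
    int max_in_flight;  /* highest value in_flight has ever reached */
} PutPipeline;

/* Stands in for XShmPutImage(): queue one more put request. */
static void model_put(PutPipeline *p)
{
    assert(p != NULL);
    p->in_flight++;
    if (p->in_flight > p->max_in_flight)
        p->max_in_flight = p->in_flight;
}

/* Stands in for XSync(): return only once the server has drained its queue. */
static void model_sync(PutPipeline *p)
{
    assert(p != NULL);
    p->in_flight = 0;
}

/* Each frame_* function returns the number of puts still in flight
 * while the client goes on to render its next frame. */

/* Sync directly after the put: nothing overlaps with the client. */
static int frame_sync_after(PutPipeline *p)
{
    model_put(p);
    model_sync(p);
    return p->in_flight;
}

/* Sync before the put (the GGI changelog ordering): one put overlaps. */
static int frame_sync_before(PutPipeline *p)
{
    model_sync(p);
    model_put(p);
    return p->in_flight;
}

/* No sync at all: requests pile up, i.e. the server is flooded. */
static int frame_no_sync(PutPipeline *p)
{
    model_put(p);
    return p->in_flight;
}
```

With the sync placed before the put, exactly one request stays in flight while the client renders the next frame. That is where the overlap on an SMP machine comes from, and it is also why the client can end up writing to memory the server is still displaying.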

Pierre_Phaneuf · May 2, 2000, 4:11pm

Sam Lantinga wrote:

Not calling XSync() directly after XShmPutImage() can give significant
performance improvements on SMP machines. Instead we call it before,
to prevent the X server from being flooded by ShmPut requests.

The recent SDL_ASYNCBLIT flag does more or less the same, but the above
might work better in some cases (and with no performance loss on
single-CPU systems).
Unfortunately I don’t have an SMP machine to run any benchmarks on, so I can’t
tell if it wins over asynchronous video updates, but I can’t see a reason
why both can’t go in.

Actually, the SDL_ASYNCBLIT flag does exactly that, but not on
single-CPU machines, since benchmarking showed that the X server and
application contend for the CPU in this situation. Also, if you do it
exactly the way GGI does it, you get update artifacts as the X server
displays memory you are writing to. GGI gets away with this by
emulating a vertical retrace.

We had an improvement in performance from doing it this way (XSync before
the XShmPutImage), because in a real-life game/application, there are
some places where the X server can slice in its work (for example, when
we select() on the network connections, the process gives up its timeslice
and lets the X server work a bit in what would otherwise be unused CPU time).
In a pure benchmark application, this could be different.

How do you know that a machine is single or multi CPU? Why would you
care? Did you know that this code on an Alpha 21264 would probably be
faster?

As for the artifacts, you should ask for the completion event when you
use XShmPutImage, marking the surface as “busy” as well, then remove the
"busy" flag when you get the completion event for that surface. Disallow
locking the surface while it is “busy”. No artifacts.
Pierre Phaneuf
Systems Exorcist
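
The busy-flag bookkeeping suggested in this post might look roughly like the sketch below. It only models the state changes and makes no X calls. `ModelSurface` and the three functions are hypothetical names, and SDL's real surface structure has no such field. In real Xlib code the put would be issued with the final `send_event` argument of XShmPutImage() set to True.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical surface record used only for this sketch. */
typedef struct {
    unsigned long shmseg; /* id of the shared segment backing the surface */
    bool busy;            /* true while the server may still read the segment */
} ModelSurface;

/* Call right after issuing the put with completion events requested. */
static void surface_put_issued(ModelSurface *s)
{
    assert(s != NULL);
    s->busy = true;
}

/* Call from the event loop for each completion event.
 * Returns true if the event belonged to this surface and cleared the flag. */
static bool surface_on_completion(ModelSurface *s, unsigned long event_shmseg)
{
    assert(s != NULL);
    if (s->busy && event_shmseg == s->shmseg) {
        s->busy = false;
        return true;
    }
    return false;
}

/* Non-waiting lock: refuses while the surface is busy instead of blocking. */
static bool surface_try_lock(const ModelSurface *s)
{
    assert(s != NULL);
    return !s->busy;
}
```

An event loop would call `surface_on_completion()` for each event whose type equals `XShmGetEventBase(dpy) + ShmCompletion`, passing the `shmseg` field of the `XShmCompletionEvent`. Note that this sketch gives a try-lock, not a lock that blocks until the surface is writable.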

Mattias_Engdegard · May 2, 2000, 5:28pm

We had an improvement in performance from doing it this way (XSync before
the XShmPutImage), because in a real-life game/application, there are
some places where the X server can slice in its work (for example, when
we select() on the network connections, the process gives up its timeslice
and lets the X server work a bit in what would otherwise be unused CPU time).
In a pure benchmark application, this could be different.

As you say, this depends strongly on the application. Many (most?)
real-time games won’t block on select() anyway, and if the only other task
in need of CPU is the X server, then XSync() is a fair way to explicitly
schedule it on a uniprocessor box. On the other hand, when the client
actually sleeps, deferring XSync may be better.

I have a related problem: a CPU-intensive client using X, and a game
server frequently running on the same box. I wish for a tri-CPU box.

How do you know that a machine is single or multi CPU? Why would you
care? Did you know that this code on an Alpha 21264 would probably be
faster?

Why? (The 21[34]64 is another matter, but they are vapourware so far.)
As for how to detect SMP, see SDL_x11image.c; there’s only code for Linux
and Solaris so far.
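
For readers wondering what such a check can look like, here is a minimal sketch that asks the C library for the number of online processors. It assumes a system that provides `sysconf(_SC_NPROCESSORS_ONLN)`, a common extension available on Linux and Solaris. It is not necessarily what SDL_x11image.c does, and `online_cpu_count` and `is_smp` are hypothetical names.

```c
#include <assert.h>
#include <unistd.h>

/* Returns the number of online CPUs, or 1 if it cannot be determined. */
static long online_cpu_count(void)
{
    long n = -1;
#ifdef _SC_NPROCESSORS_ONLN
    /* Extension to POSIX sysconf(); returns -1 on failure. */
    n = sysconf(_SC_NPROCESSORS_ONLN);
#endif
    return n < 1 ? 1 : n;
}

/* Nonzero when more than one CPU is online. */
static int is_smp(void)
{
    long n = online_cpu_count();
    assert(n >= 1);
    return n > 1;
}
```

Falling back to 1 when the count is unknown keeps the conservative single-CPU behaviour on platforms that lack the query.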

As for the artifacts, you should ask for the completion event when you
use XShmPutImage, marking the surface as “busy” as well, then remove the
"busy" flag when you get the completion event for that surface. Disallow
locking the surface while it is “busy”. No artifacts.

other than those from the lack of vertical retrace synchronization :(.
Yes, this would be easy to add, and probably perform better than the
current XSync-on-lock mechanism.

slouken · May 2, 2000, 6:41pm

As for the artifacts, you should ask for the completion event when you
use XShmPutImage, marking the surface as “busy” as well, then remove the
"busy" flag when you get the completion event for that surface. Disallow
locking the surface while it is “busy”. No artifacts.

other than those from the lack of vertical retrace synchronization :(.
Yes, this would be easy to add, and probably perform better than the
current XSync-on-lock mechanism.

That requires the display lock code to poll events to see if the completion
event has occurred (bad). The lock semantics are such that it blocks until
the surface is available for writing. Thus XSync() is appropriate there.
If SDL had non-waiting locks, this might be appropriate.

I’m interested in whether or not it would make a difference in real SDL
applications, but for the next few weeks I will have very little time
to look into it.

See ya,
-Sam Lantinga, Lead Programmer, Loki Entertainment Software

Mattias_Engdegard · May 2, 2000, 8:16pm

That requires the display lock code to poll events to see if the completion
event has occurred (bad). The lock semantics are such that it blocks until
the surface is available for writing. Thus XSync() is appropriate there.
If SDL had non-waiting locks, this might be appropriate.

XSync is much slower than an X11 event poll, since it requires 2 full context
switches and a whole bunch of system calls, even if nothing needs to be done.
On an SMP box it’s slightly better, but not by much.

I’m interested in whether or not it would make a difference in real SDL
applications, but for the next few weeks I will have very little time
to look into it.

Right; it should be tested first.