X11 performance

This isn’t specific to SDL, but rather about platform parity between
Windows and Linux. I’m really starting to get pissed off.

Take two computers:

Windows machine:

  • Pentium 200 MMX
  • 66 MHz system bus
  • Voodoo Banshee 16 MB
  • 96 MB of RAM

Linux machine:

  • Pentium 225 MMX
  • 75 MHz system bus
  • Matrox Millenium G200 SD 8 MB (also a Voodoo2 12 MB, but unrelated)
  • 96 MB of RAM

We do some blitting tests (at 640x480 in 16 bit), only to find out the
following: my machine can barely move around 30 megabytes per second
while the Windows machine can do 474 freakin’ megabytes per second!!!

We’re talking 2D here. We did the test both in fullscreen and windowed
modes on Windows, so telling me to use DGA isn’t going to help (and in
reality, it isn’t faster, or only barely so, certainly nothing like 10-15
times faster).

Obviously, X isn’t using video memory to store the pixmap (and using
the hardware blitter to copy it).

Also, note that he gets 56 megabytes per second blitting from a surface
in system memory, which is almost twice as fast. I suspect the X server
of using memcpy() instead of some bus-mastered or DMA transfer to do the
blitting…

Oh yes, our test program is basically this: create a 640x480 window and
two same-sized Pixmaps, one painted white and the other painted black,
then XCopyArea each of them in succession to the window. I have to use
an XSync to prevent flooding the X server.
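
Roughly, the test looks like this (a simplified sketch rather than the
exact program; it assumes the default visual and depth, and times a
fixed five-second run):

```c
/* Simplified sketch of the benchmark: two full-size Pixmaps, copied
 * alternately to the window, with an XSync after every copy. */
#include <X11/Xlib.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    Display *dpy = XOpenDisplay(NULL);
    int scr = DefaultScreen(dpy);
    int depth = DefaultDepth(dpy, scr);

    Window win = XCreateSimpleWindow(dpy, RootWindow(dpy, scr), 0, 0,
                                     640, 480, 0, 0, BlackPixel(dpy, scr));
    XSelectInput(dpy, win, ExposureMask);
    XMapWindow(dpy, win);

    XEvent ev;                        /* wait until the window is mapped */
    do { XNextEvent(dpy, &ev); } while (ev.type != Expose);

    GC gc = XCreateGC(dpy, win, 0, NULL);
    Pixmap white = XCreatePixmap(dpy, win, 640, 480, depth);
    Pixmap black = XCreatePixmap(dpy, win, 640, 480, depth);
    XSetForeground(dpy, gc, WhitePixel(dpy, scr));
    XFillRectangle(dpy, white, gc, 0, 0, 640, 480);
    XSetForeground(dpy, gc, BlackPixel(dpy, scr));
    XFillRectangle(dpy, black, gc, 0, 0, 640, 480);

    time_t start = time(NULL);
    long frames = 0;
    while (time(NULL) - start < 5) {
        XCopyArea(dpy, (frames & 1) ? white : black, win, gc,
                  0, 0, 640, 480, 0, 0);
        XSync(dpy, False);            /* wait, or requests just pile up */
        frames++;
    }

    /* 640 * 480 * 2 bytes per frame at 16 bpp */
    printf("%.1f MB/s\n", frames * 640.0 * 480.0 * 2.0 / (5.0 * 1e6));

    XCloseDisplay(dpy);
    return 0;
}
```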

Raster (of Enlightenment fame) told me that my test ran at around 210
megabytes per second (accelerated) on XFree86 4.0, but I’d like to know
what can be done for XFree86 3.3.x. Am I missing something big? What
about this, reported by my X server at startup (with “xaa_benchmark”):

(--) SVGA: Using XAA (XFree86 Acceleration Architecture)
(--) SVGA: XAA: Solid filled rectangles
(--) SVGA: XAA: Screen-to-screen copy
(--) SVGA: XAA: 8x8 color expand pattern fill
(--) SVGA: XAA: CPU to screen color expansion (TE/NonTE imagetext, TE/NonTE polytext)
(--) SVGA: XAA: Using 9 128x128 areas for pixmap caching
(--) SVGA: XAA: Caching tiles and stipples
(--) SVGA: XAA: General lines and segments
(--) SVGA: XAA: Dashed lines and segments

CPU to framebuffer                         45.71 Mpix/sec   (91.42 MB/s)
10x1 solid rectangle fill                  19.80 Mpix/sec   (39.60 MB/s)
40x40 solid rectangle fill                190.37 Mpix/sec  (380.74 MB/s)
400x400 solid rectangle fill              240.61 Mpix/sec  (481.22 MB/s)
10x10 screen copy                          59.18 Mpix/sec  (118.36 MB/s)
40x40 screen copy                         149.13 Mpix/sec  (298.26 MB/s)
400x400 screen copy                       197.84 Mpix/sec  (395.68 MB/s)
400x400 aligned screen copy (scroll)      200.39 Mpix/sec  (400.78 MB/s)
10x10 8x8 color expand pattern fill       106.10 Mpix/sec  (212.20 MB/s)
400x400 8x8 color expand pattern fill     243.43 Mpix/sec  (486.86 MB/s)
10x10 CPU-to-screen color expand            5.36 Mpix/sec   (10.72 MB/s)
416x400 CPU-to-screen color expand        235.11 Mpix/sec  (470.22 MB/s)
10x10 screen-to-screen color expand        66.76 Mpix/sec  (133.52 MB/s)

Where the f**k is that 395 MB/s I see for screen copy??? Or that 91 MB/s
for “CPU to framebuffer”??? If I can get even half of those numbers,
I’ll be a happy camper.

Anybody got an idea?

--
Pierre Phaneuf
Ludus Design, http://ludusdesign.com/
“First they ignore you. Then they laugh at you.
Then they fight you. Then you win.” – Gandhi

This isn’t specific to SDL, but rather about platform parity between
Windows and Linux. I’m really starting to get pissed off.

Take two computers:

Windows machine:

  • Pentium 200 MMX
  • 66 MHz system bus
  • Voodoo Banshee 16 MB
  • 96 MB of RAM

Linux machine:

  • Pentium 225 MMX
  • 75 MHz system bus
  • Matrox Millenium G200 SD 8 MB (also a Voodoo2 12 MB, but unrelated)
  • 96 MB of RAM

We do some blitting tests (at 640x480 in 16 bit), only to find out the
following: my machine can barely move around 30 megabytes per second
while the Windows machine can do 474 freakin’ megabytes per second!!!

Based on your benchmarks, it looks like Windows is creating both pixmaps
in video memory and doing hardware accelerated (video <-> video) blits.
The Banshee is very fast at this. :)

X11 on the other hand, is creating both pixmaps in system memory and
performing (host --> video) blits using the CPU to transfer the data.
In addition, you have the X protocol overhead.

This is currently typical of Windows v.s. X11, but in real-world
2D games, most of the time the drawing is all done to an offscreen buffer
in system memory, and then blitted to the video card. This makes the
difference between Windows and X11 much lower.
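
In SDL terms, that pattern looks roughly like this (a sketch, not code
taken from SDL or from any shipping game):

```c
/* Compose each frame in a system-memory surface, then push the whole
 * thing to the screen with a single blit + update. SDL 1.x API. */
#include "SDL.h"

int main(int argc, char *argv[])
{
    if (SDL_Init(SDL_INIT_VIDEO) < 0)
        return 1;
    SDL_Surface *screen = SDL_SetVideoMode(640, 480, 16, SDL_SWSURFACE);
    if (screen == NULL)
        return 1;

    /* Offscreen buffer in system memory, same pixel format as the screen. */
    SDL_Surface *back = SDL_CreateRGBSurface(SDL_SWSURFACE, 640, 480, 16,
                                             screen->format->Rmask,
                                             screen->format->Gmask,
                                             screen->format->Bmask, 0);
    int running = 1;
    while (running) {
        SDL_Event ev;
        while (SDL_PollEvent(&ev))
            if (ev.type == SDL_QUIT)
                running = 0;

        /* ... draw the whole scene into 'back' with the CPU ... */
        SDL_FillRect(back, NULL, SDL_MapRGB(back->format, 0, 0, 0));

        /* One system -> video transfer per frame. */
        SDL_BlitSurface(back, NULL, screen, NULL);
        SDL_UpdateRect(screen, 0, 0, 0, 0);
    }
    SDL_FreeSurface(back);
    SDL_Quit();
    return 0;
}
```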

I believe that XFree86 4.0 is more likely to use video memory for pixmap
surfaces, and has more support for taking advantage of DMA busmastering,
but I haven’t checked, so I could be wrong.

-Sam Lantinga				(slouken at devolution.com)

Lead Programmer, Loki Entertainment Software
--
“Any sufficiently advanced bug is indistinguishable from a feature”
– Rich Kulawiec

In article <83tris$7ku$1 at news.lokigames.com>,
Sam Lantinga writes:

This isn’t specific to SDL, but rather about platform parity between
Windows and Linux. I’m really starting to get pissed off.

Take two computers:

Windows machine:

  • Pentium 200 MMX
  • 66 MHz system bus
  • Voodoo Banshee 16 MB
  • 96 MB of RAM

Linux machine:

  • Pentium 225 MMX
  • 75 MHz system bus
  • Matrox Millenium G200 SD 8 MB (also a Voodoo2 12 MB, but unrelated)
  • 96 MB of RAM

We do some blitting tests (at 640x480 in 16 bit), only to find out the
following: my machine can barely move around 30 megabytes per second
while the Windows machine can do 474 freakin’ megabytes per second!!!

Based on your benchmarks, it looks like Windows is creating both pixmaps
in video memory and doing hardware accelerated (video <-> video) blits.
The Banshee is very fast at this. :)

Actually most modern cards are fast at doing blits (faster than memory
copy anyway, except for a couple of old cards I believe).

X11 on the other hand, is creating both pixmaps in system memory and
performing (host --> video) blits using the CPU to transfer the data.
In addition, you have the X protocol overhead.

The fact that you don’t use any hardware acceleration in X11 explains
the huge performance difference by itself. Read and write accesses across
the bus are much slower than the internal blitter of any modern card. I’d
say that the X protocol overhead is almost nothing compared to this…

This is currently typical of Windows v.s. X11, but in real-world
2D games, most of the time the drawing is all done to an offscreen buffer
in system memory, and then blitted to the video card. This makes the
difference between Windows and X11 much lower.

This is a very bad approach. The problem I see in SDL is that the X11
implementation does not take advantage at all of the accelerations offered
by the X server. Having worked on XFree86 for a few months, I think that the
performance could be improved if we could find a way to store SDL surfaces in
X objects (like Pixmaps), instead of in a malloc()’ed chunk of system memory.
The fact is that XAA (the X Acceleration Architecture) manages a pixmap cache
in offscreen video memory, which can then be blitted at maximum speed by
the X server. This is already done in XFree86 3.3.5, but will likely be even
better in 4.0 …
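
Schematically, the idea would be something along these lines (a sketch of
the approach, not existing SDL code):

```c
/* Keep a surface in a server-side Pixmap, so the server is free to cache
 * it in offscreen video memory, and blit it with XCopyArea every frame. */
#include <X11/Xlib.h>

Pixmap cache_surface(Display *dpy, Window win, GC gc, XImage *img,
                     int w, int h, int depth)
{
    Pixmap pix = XCreatePixmap(dpy, win, w, h, depth);
    /* Upload the pixel data once (through the X stream here). */
    XPutImage(dpy, pix, gc, img, 0, 0, 0, 0, w, h);
    return pix;
}

void draw_frame(Display *dpy, Window win, GC gc, Pixmap pix, int w, int h)
{
    /* From now on this is a server-side copy; if the Pixmap sits in video
     * memory, the card's blitter can do the work. */
    XCopyArea(dpy, pix, win, gc, 0, 0, w, h, 0, 0);
}
```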

I believe that XFree86 4.0 is more likely to use video memory for pixmap
surfaces, and has more support for taking advantage of DMA busmastering,
but I haven’t checked, so I could be wrong.

AFAIK we could take advantage of pixmap surfaces in memory right now, but
I don’t think there is really DMA support yet, because this needs help from
the kernel and is thus an unportable feature (and only some chipsets may
allow this, in fact).

--
Stephane Peter
Programmer
Loki Entertainment Software

“Microsoft has done to computers what McDonald’s has done to gastronomy”

Newsgroups: loki.open-source.sdl

We do some blitting tests (at 640x480 in 16 bit), only to find out the
following: my machine can barely move around 30 megabytes per second
while the Windows machine can do 474 freakin’ megabytes per second!!!

I find myself not having much faith in that particular test. Are you sure
that it’s not your testing code? I doubt it’s simply a question of Linux
being that much slower. It could also be anything from inefficient
compilers (MSVC does quite well, I’m not sure about GCC or whatever you’re
using) to optimization levels to the fact that the video cards are different
and the computers are different.

Oh yes, our test program is basically this: create a 640x480 window and
two same-sized Pixmaps, one painted white and the other painted black,
then XCopyArea each of them in succession to the window. I have to use
an XSync to prevent flooding the X server.

Are you doing any sort of synchronization with the Windows version? If not,
that might be a problem. XSync will automatically slow your program down so
that it can’t output as much as it otherwise might, whereas your Windows
code doesn’t. In that case (and if it really is necessary, I’m not sure),
this represents a design problem with X rather than a slowdown.

Anybody got an idea?

Try those.


Pierre Phaneuf
Ludus Design, http://ludusdesign.com/
“First they ignore you. Then they laugh at you.
Then they fight you. Then you win.” – Gandhi

Nicholas

----- Original Message -----
From: pp@ludusdesign.com (Pierre Phaneuf)
To: sdl at lokigames.com
Date: Thursday, December 23, 1999 10:06 AM
Subject: [SDL] X11 performance

Hi There.

We’re talking 2D here. We did the test both in fullscreen and windowed
modes on Windows, so telling me to use DGA isn’t going to help (and in
reality, it isn’t faster, or only barely so, certainly nothing like 10-15
times faster).

When I first got into SDL, I tried to make a pretty splash screen, to learn
SDL, with alpha blending, etc.
As this is a dual-boot machine, I tested the program both in X and in
Windows, and was surprised to find that the X version, in DGA, was far
faster than the Windows version. (P120, 32 MB RAM, S3 ViRGE)
So I was quite surprised to read your e-mail.
Unfortunately I haven’t got the numbers here, as this was a while ago.

Anybody got an idea?

Well, one problem I often had with X was getting the program to use DGA.
Sometimes it just wanted to use Xlib. I guess my X is set up a little
badly. Possibly it’s not in fact using DGA in the fullscreen test.
Also, I would suggest trying the test on identical hardware configurations,
as one little piece of hardware can make a massive difference in benchmark
numbers.
Incidentally, are you running any other programs on the Linux computer?
It’s not your server or something?

Good luck,

Timothy Downs

Sam Lantinga wrote:

We do some blitting tests (at 640x480 in 16 bit), only to find out the
following: my machine can barely move around 30 megabytes per second
while the Windows machine can do 474 freakin’ megabytes per second!!!

Based on your benchmarks, it looks like Windows is creating both pixmaps
in video memory and doing hardware accelerated (video <-> video) blits.
The Banshee is very fast at this. :)

My fast-clocked Millenium G200 is also very fast at this, and most video
cards have very respectable numbers in this area, generally a lot
faster than system -> video blits thru the CPU. I would expect at least
100 megabytes per second from a very cheap low end card.

X11 on the other hand, is creating both pixmaps in system memory and
performing (host --> video) blits using the CPU to transfer the data.
In addition, you have the X protocol overhead.

X protocol overhead is dwarfed by the size of the copying, sending only
a few messages per frame (the XCopyArea, XSync and checking for events,
quitting on the first event I get).

The question is why isn’t X creating the pixmaps in video memory? I
never touch their content, and as they are regular X Pixmaps (not shared
memory or anything), I do not even have access to the pixel data! (so it
could do hidden optimizations like putting the pixmap in video memory)

At least, if it created the pixmaps in system memory, why is the
XCopyArea so slow when the XAA benchmark reports 90-something megabytes
per second for CPU to video?

This is currently typical of Windows v.s. X11, but in real-world
2D games, most of the time the drawing is all done to an offscreen buffer
in system memory, and then blitted to the video card. This makes the
difference between Windows and X11 much lower.

Well, our real world 2D game is a Sierra Quest-style prototype and the
fact is that the Win32 prototype puts the scene (which changes only once
in a while) in video memory, and the Linux version doesn’t, which makes
the difference between “silky smooth playing” and “unplayably jerky”.
That is a difference.

For something like a Doom clone (which updates most of the screen at
every frame), I suppose shared pixmaps would do, but many game types can
benefit from video memory caching, like tile-based games for example.
Using the hardware acceleration can make the difference between 320x240
and 800x600.

I believe that XFree86 4.0 is more likely to use video memory for pixmap
surfaces, and has more support for taking advantage of DMA busmastering,
but I haven’t checked, so I could be wrong.

Yes, Raster ran my benchmark and it gave numbers in the 200 megabytes
per second range, with a goofy Xinerama setup to boot…

--
Pierre Phaneuf
Ludus Design, http://ludusdesign.com/
“First they ignore you. Then they laugh at you.
Then they fight you. Then you win.” – Gandhi

Stephane Peter wrote:

Note that this stuff is raw Xlib; I ask here because I know many good
Xlib hackers (working on SDL) hang out here.

X11 on the other hand, is creating both pixmaps in system memory and
performing (host --> video) blits using the CPU to transfer the data.
In addition, you have the X protocol overhead.

The fact that you don’t use any hardware acceleration in X11 explains
the huge performance difference by itself. Read and write accesses across
the bus are much slower than the internal blitter of any modern card. I’d
say that the X protocol overhead is almost nothing compared to this…

Well… “you don’t use any hardware acceleration” should be more like
“X11 doesn’t use hardware acceleration”! Is there any way I can make
XFree86 do the right thing?

I know the internal blitter of my video card is an order of magnitude
faster than what I get now! Now, how do I get to use it???

This is currently typical of Windows v.s. X11, but in real-world
2D games, most of the time the drawing is all done to an offscreen buffer
in system memory, and then blitted to the video card. This makes the
difference between Windows and X11 much lower.

This is a very bad approach. The problem I see in SDL is that the X11
implementation does not take advantage at all of the accelerations offered
by the X server. Having worked on XFree86 for a few months, I think that the
performance could be improved if we could find a way to store SDL surfaces in
X objects (like Pixmaps), instead of in a malloc()’ed chunk of system memory.
The fact is that XAA (the X Acceleration Architecture) manages a pixmap cache
in offscreen video memory, which can then be blitted at maximum speed by
the X server. This is already done in XFree86 3.3.5, but will likely be even
better in 4.0 …

I know this is a bad approach. We decided that for a modern 2D game, we
had to use hardware acceleration as much as possible to obtain a
playable game at high resolutions. In the past, people programmed straight
to bare VGA, and then to some SVGA modes, always using the CPU to do the
transfers. But we think that today everyone who plays games has a video
card capable of rectangle filling and blitting, and we decided to design
with that in mind.

We achieved extraordinary results with some native DirectX tests through
careful hardware usage. Our way of using the hardware resulted in a very
unusual 2D library, with features like memory management that are more
often found in 3D libraries than in 2D ones. For example, where most 2D
libraries give you access to the pixel data of surfaces, we found we could
optimize better by disallowing direct surface access and having the
library user “upload” the surface data instead, similar to texture
management in OpenGL.
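
To give an idea of the shape of that interface, it is something along
these lines (illustrative names only, not our actual API):

```c
/* Hypothetical "upload"-style 2D surface interface, sketched here just
 * to illustrate the design; none of these names are real. */
typedef struct Surface Surface;   /* opaque: no pixel access exposed */

/* Create an empty surface; the library decides where it lives
 * (video memory if possible, system memory otherwise). */
Surface *surface_create(int width, int height, int bpp);

/* Upload pixel data into the surface, much like glTexImage2D does for
 * textures. After this call the caller's buffer can be freed or reused. */
void surface_upload(Surface *s, const void *pixels, int pitch);

/* Blit surface-to-surface or surface-to-screen; since the pixels stay on
 * the library's side, the blit is always eligible for hardware
 * acceleration. */
void surface_blit(Surface *src, int sx, int sy, int w, int h,
                  Surface *dst, int dx, int dy);
```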

We found out that Lock()ing hardware DirectX surfaces with a Riva TNT
was extremely painful, as the TNT driver emulates direct access to the
hardware surface by first copying it completely into system memory!
So our “upload surfaces” design turned out very well and also seemingly
had a nice side effect: where before we simulated hardware surfaces in
X11 with XShm, we now had a design that was very similar to the design
of X11, mapping very nicely! We still use XShm rather than the X stream
for uploading data to the X server, but the main tool is now XCopyArea!

We expected that the X server would be even easier than DirectX, since
it would do its own memory management. But it turned out that XFree86
3.3.x never puts anything in video memory, it seems!!!

I believe that XFree86 4.0 is more likely to use video memory for pixmap
surfaces, and has more support for taking advantage of DMA busmastering,
but I haven’t checked, so I could be wrong.

AFAIK we could take advantage of pixmap surfaces in memory right now, but
I don’t think there is really DMA support yet, because this needs help from
the kernel and is thus an unportable feature (and only some chipsets may
allow this, in fact).

How do I take advantage of the XAA pixmap cache then? I am not some
clueless guy looking for a quick fact; rather, I am looking for somebody
familiar with the internals of XFree86 who could tell me what magic I am
missing to get pixmaps to go into video memory… Something in the graphics
context? Help!!!

--
Pierre Phaneuf
Ludus Design, http://ludusdesign.com/
“First they ignore you. Then they laugh at you.
Then they fight you. Then you win.” – Gandhi

Stephane Peter wrote:

I believe that XFree86 4.0 is more likely to use video memory for pixmap
surfaces, and has more support for taking advantage of DMA busmastering,
but I haven’t checked, so I could be wrong.

AFAIK we could take advantage of pixmap surfaces in memory right now, but
I don’t think there is really DMA support yet, because this needs help from
the kernel and is thus an unportable feature (and only some chipsets may
allow this, in fact).

BTW, note the benchmark numbers I got by using the “xaa_benchmark” option
on my X server:

CPU to framebuffer 45.71 Mpix/sec (91.42 MB/s)

It is one of the slowest numbers, the exceptions being the 10x1 solid
rectangle fill and the 10x10 CPU-to-screen color expand (which I have no
idea what it is, but it sounds unaccelerated and is quite a small size).

This is still three times what I am actually getting while doing an
XCopyArea of a 640x480 non-XShm Pixmap to a Window. How could I get a
better number too??? Heck, even the X server admits it can do better!!!

--
Pierre Phaneuf
Ludus Design, http://ludusdesign.com/
“First they ignore you. Then they laugh at you.
Then they fight you. Then you win.” – Gandhi

My fast-clocked Millenium G200 is also very fast at this, and most video
cards have very respectable numbers in this area, generally a lot
faster than system -> video blits thru the CPU. I would expect at least
100 megabytes per second from a very cheap low end card.

I’m not sure about megabytes per second, but I’m doing 640x480 updates at
16-bit color and getting about 17 fps without transparent blitting, and
10 fps with about 30% transparent blitting.

X protocol overhead is dwarfed by the size of the copying, sending only
a few messages per frame (the XCopyArea, XSync and checking for events,
quitting on the first event I get).

Under X, I went from using XCopyArea() with pixmaps to using shared
memory XImages (thus getting rid of the X protocol for blitting onto a
buffer) and saved lots of time. SDL uses shared memory when available.
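
For anyone who hasn't used it, the MIT-SHM setup looks roughly like this
(a sketch with all error checking left out, assuming the server supports
the extension):

```c
#include <X11/Xlib.h>
#include <X11/extensions/XShm.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create an XImage whose pixel data lives in a SysV shared memory
 * segment that the X server attaches to as well. */
XImage *create_shm_image(Display *dpy, XShmSegmentInfo *shminfo,
                         int w, int h)
{
    int scr = DefaultScreen(dpy);
    XImage *img = XShmCreateImage(dpy, DefaultVisual(dpy, scr),
                                  DefaultDepth(dpy, scr), ZPixmap,
                                  NULL, shminfo, w, h);
    shminfo->shmid = shmget(IPC_PRIVATE, img->bytes_per_line * img->height,
                            IPC_CREAT | 0777);
    shminfo->shmaddr = img->data = shmat(shminfo->shmid, NULL, 0);
    shminfo->readOnly = False;
    XShmAttach(dpy, shminfo);         /* the server maps the same segment */
    return img;
}

/* Per frame: draw into img->data with the CPU, then hand the result to
 * the server without pushing the pixels through the X protocol stream. */
void push_frame(Display *dpy, Window win, GC gc, XImage *img, int w, int h)
{
    XShmPutImage(dpy, win, gc, img, 0, 0, 0, 0, w, h, False);
    XSync(dpy, False);
}
```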

The problem here is that both the client program (your game) and the X
server must share and access the exact same video memory.

I do not believe XFree86 3.x even supports using video memory as shared
memory, so that both the client can manipulate the data and the server can
display it.

--
Brian

Nicholas Vining wrote:

We do some blitting tests (at 640x480 in 16 bit), only to find out the
following: my machine can barely move around 30 megabytes per second
while the Windows machine can do 474 freakin’ megabytes per second!!!

I find myself not having much faith in that particular test. Are you sure
that it’s not your testing code? I doubt it’s simply a question of Linux
being that much slower. It could also be anything from inefficient
compilers (MSVC does quite well, I’m not sure about GCC or whatever you’re
using) to optimization levels to the fact that the video cards are different
and the computers are different.

It isn’t that “Linux is that much slower” or that the optimizations are
that far off. It is just that X isn’t using the hardware acceleration
present in my video card.

For example, I get around 30 fps playing Quake 3 with hardware
acceleration (I have a Voodoo2) and I would probably get something like
0.1-1 fps without hardware acceleration. I am now experiencing the 2D
equivalent of this and it hurts a LOT.

Oh yes, our test program is basically this: create a 640x480 window and
two same-sized Pixmaps, one painted white and the other painted black,
then XCopyArea each of them in succession to the window. I have to use
an XSync to prevent flooding the X server.

Are you doing any sort of synchronization with the Windows version? If not,
that might be a problem. XSync will automatically slow your program down so
that it can’t output as much as it otherwise might, whereas your Windows
code doesn’t. In that case (and if it really is necessary, I’m not sure),
this represents a design problem with X rather than a slowdown.

We use the DirectX blitter in blocking mode. XSync does NOT slow down
the computer! What XSync does is wait for the X server to do what it was
asked to do, in this case, finish the blitting operation. This is to
simulate the blocking blitter we used in DirectX, because XCopyArea is
non-blocking (when XCopyArea returns, it only means that the X server
has been asked to do the blitting, not that it is done).

If you do not use XSync in a benchmark like mine, the program can emit
at least a few hundred, if not thousands, of blitting requests per
second, while the X server can only execute 10 or 20 of them per second.
Letting the program run for a few seconds leads to the X server being
out of control for about 1 or 2 minutes, which isn’t very cool.

What is cool about this is that in a regular program, you can call
XCopyArea right before doing the “real work” of the program, then call
XSync right before sending another XCopyArea (to make sure the first one
is completed; if it is already finished, the XSync will return
immediately).
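
In code, the pattern is roughly this (a sketch, not our actual loop; the
game-logic call is just a placeholder):

```c
#include <X11/Xlib.h>

extern void update_game_logic(void);   /* placeholder for the "real work" */

void frame_loop(Display *dpy, Window win, Pixmap scene, GC gc)
{
    for (;;) {
        XCopyArea(dpy, scene, win, gc, 0, 0, 640, 480, 0, 0);
        XFlush(dpy);           /* push the request to the server now */

        update_game_logic();   /* runs while the server performs the blit */

        XSync(dpy, False);     /* returns at once if the copy is already done */
    }
}
```

--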
Pierre Phaneuf
Ludus Design, http://ludusdesign.com/
“First they ignore you. Then they laugh at you.
Then they fight you. Then you win.” – Gandhi

Timothy Downs wrote:

We’re talking 2D here. We did the test both in fullscreen and windowed
modes on Windows, so telling me to use DGA isn’t going to help (and in
reality, it isn’t faster, or only barely so, certainly nothing like 10-15
times faster).

When I first got into SDL, I tried to make a pretty splash screen, to learn
SDL, with alpha blending, etc.
As this is a dual-boot machine, I tested the program both in X and in
Windows, and was surprised to find that the X version, in DGA, was far
faster than the Windows version. (P120, 32 MB RAM, S3 ViRGE)
So I was quite surprised to read your e-mail.
Unfortunately I haven’t got the numbers here, as this was a while ago.

I am getting really tired of people who do not know about the inner
workings telling me that DGA is faster. It is a fact that DGA will
not use hardware acceleration. Maybe I can double the speed of a blit
if the driver for my video card is really dumb (and if it is smart,
DGA will be slower!), but only maybe. A sure fact is that
video->video blitting is still a solid 1000% faster.

I could get my colleague to make a very similar benchmark in DirectX and
send it to you; I am pretty sure that if your S3 ViRGE has enough
memory, it will produce benchmark numbers DRAMATICALLY better than what
you’ll get in X. If it doesn’t have enough video memory to store the
offscreen surface, you should get numbers similar to or slightly better
(as in 50% to 200% faster) than what you’ll get in X.

--
Pierre Phaneuf
Ludus Design, http://ludusdesign.com/
“First they ignore you. Then they laugh at you.
Then they fight you. Then you win.” – Gandhi

What is cool about this is that in a regular program, you can call
XCopyArea right before doing the “real work” of the program, then call
XSync right before sending another XCopyArea (to make sure the first one
is completed; if it is already finished, the XSync will return
immediately).

Is there a way to take advantage of this from within SDL?

--
Brian

hayward at slothmud.org wrote:

My fast-clocked Millenium G200 is also very fast at this, and most video
cards have very respectable numbers in this area, generally a lot
faster than system -> video blits thru the CPU. I would expect at least
100 megabytes per second from a very cheap low end card.

I’m not sure about megabytes per second, but I’m doing 640x480 updates at
16-bit color and getting about 17 fps without transparent blitting, and
10 fps with about 30% transparent blitting.

You are getting around 10 megabytes per second (640 * 480 * 2 (16 bits),
times 17 frames per second) for your non-transparent blit. I am guessing
you have a machine around 133-150 MHz with a not-so-hot but
not-so-crappy video card (S3 ViRGE range).

If I am right, the hardware you have is probably capable of over 100 fps
for the non-transparent blitting. Are you happy with using only 15-17%
of your hardware?

I paid for 100% of the performance, not 15%; I guess that is your case
too.
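
(Spelling the arithmetic out, assuming 2 bytes per pixel at 16 bpp:)

```c
/* Back-of-the-envelope throughput: width * height * bytes per pixel * fps. */
#include <stdio.h>

int main(void)
{
    double mb_per_sec = 640.0 * 480.0 * 2.0 * 17.0 / 1e6;
    printf("%.1f MB/s\n", mb_per_sec);   /* about 10.4 MB/s */
    return 0;
}
```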

X protocol overhead is dwarfed by the size of the copying, sending only
a few messages per frame (the XCopyArea, XSync and checking for events,
quitting on the first event I get).

Under X, I went from using XCopyArea() with pixmaps to using shared
memory XImages (thus getting rid of the X protocol for blitting onto a
buffer) and saved lots of time. SDL uses shared memory when available.

Argh… I know my stuff, okay? The fact is that the X protocol overhead
for an XCopyArea is similar to that of an XShmPutImage, okay? And I would
easily venture that the XCopyArea code path in the X server is better
optimized than the XShmPutImage one, okay?

The problem here is that both the client program (your game) and the X
server must share and access the exact same video memory.

Shared memory is a very large improvement over XPutImage. XPutImage is
responsible for really putting the “slow and stupid” in X11.
XShmPutImage should be considered “expected normal speed”.

Sharing the memory where the pixmap data is stored is useful when you
modify the data often. A game like XDoom, where almost all of the screen
is updated every frame (the status bar at the bottom isn’t), benefits
most from shared memory. A tile-based game using all precomputed
graphics in tiles wouldn’t profit much.

I do not believe XFree86 3.x even supports using video memory as shared
memory, so that both the client can manipulate the data and the server can
display it.

No, SysV shared memory cannot share video memory, you are right. But
XFree86 3.x does support sharing the video memory; this is what DGA does.
It is more complicated than sharing ordinary memory, which is why it
requires root permissions and freaks out so often on some video cards.

--
Pierre Phaneuf
Ludus Design, http://ludusdesign.com/
“First they ignore you. Then they laugh at you.
Then they fight you. Then you win.” – Gandhi

You are getting around 10 megabytes per second (640 * 480 * 2 (16 bits),
times 17 frames per second) for your non-transparent blit. I am guessing
you have a machine around 133-150 MHz with a not-so-hot but
not-so-crappy video card (S3 ViRGE range).

If I am right, the hardware you have is probably capable of over 100 fps
for the non-transparent blitting. Are you happy with using only 15-17%
of your hardware?

I have a K6-3 400 MHz with a Matrox G200 card.

Argh… I know my stuff, okay? The fact is that the X protocol overhead
for an XCopyArea is similar to that of an XShmPutImage, okay? And I would
easily venture that the XCopyArea code path in the X server is better
optimized than the XShmPutImage one, okay?

I’m not telling you that you don’t know what you’re talking about; all I’m
doing is sharing my own experience working with Xlib.

This is just one real-world example: I had a 2D tiled game that used
XCopyArea() on a 640x480x16bpp window. My performance doubled going from
XCopyArea() to XShmPutImage(). I can’t tell you why, as I haven’t
looked at the optimizations in the XFree86 server code. Maybe you can
explain why, since you seem to know better than me?

No, SysV shared memory cannot share video memory, you are right. But
XFree86 3.x does support sharing the video memory; this is what DGA does.
It is more complicated than sharing ordinary memory, which is why it
requires root permissions and freaks out so often on some video cards.

I see, thanks for clearing that up.

--
Brian

hayward at slothmud.org wrote:

You are getting around 10 megabytes per second (640 * 480 * 2 (16 bits),
times 17 frames per second) for your non-transparent blit. I am guessing
you have a machine around 133-150 MHz with a not-so-hot but
not-so-crappy video card (S3 ViRGE range).

If I am right, the hardware you have is probably capable of over 100 fps
for the non-transparent blitting. Are you happy with using only 15-17%
of your hardware?

I have a K6-3 400 MHz with a Matrox G200 card.

Hmm, then you are doing something really badly. Raw blitting (no
transparency) runs at 49 fps here using XShmPutImage and 50 fps using
XCopyArea. I have an Intel Pentium 166 MMX, overclocked to 225 MHz (with
a 75 MHz bus clock instead of 66), and a Matrox Millenium G200 SD video
card (the PCI one, not the AGP one) with 8 megs of video memory.

This is with a 640x480 window in 16 bit, giving nearly 30 megabytes per
second. My AMD 486DX4/120 with an S3 ViRGE does around 7 megabytes per
second with the exact same test program.

I’m not telling you that you don’t know what you’re talking about; all I’m
doing is sharing my own experience working with Xlib.

Sorry if I sounded a bit rude, but it just seems a lot of people fancy
themselves Xlib experts these days and run around wildly telling me
to use DGA or XShm, as if they knew and/or I didn’t… Next thing,
they’ll tell me to use threading to improve my program! ;)

(threads are my other performance nightmare, besides X)

This is just one real-world example: I had a 2D tiled game that used
XCopyArea() on a 640x480x16bpp window. My performance doubled going from
XCopyArea() to XShmPutImage(). I can’t tell you why, as I haven’t
looked at the optimizations in the XFree86 server code. Maybe you can
explain why, since you seem to know better than me?

Did you change anything in the GC? Some options in the GC can add
plenty of overhead to both XCopyArea and XShmPutImage.
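
For example (just a guess at the kind of thing that could differ, not
something you said you changed), leaving graphics exposures enabled makes
every XCopyArea generate GraphicsExpose or NoExpose events that somebody
has to read:

```c
#include <X11/Xlib.h>

/* A GC tuned for blitting: don't report (No)Expose events per copy. */
GC make_blit_gc(Display *dpy, Drawable d)
{
    XGCValues v;
    v.graphics_exposures = False;
    return XCreateGC(dpy, d, GCGraphicsExposures, &v);
}
```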

I would venture some more by saying that shared pixmaps should be a tad
faster than shared images. Raster told me that is the case, but he is
often over-enthusiastic about new stuff he discovers. He was boasting
that his code didn’t need MTRRs because he did the write-combining
himself by writing in 32-bit chunks, whereas MTRRs actually combine
larger chunks and don’t even apply to shared memory… :)

But for the shared pixmap vs. shared image question, I would agree (a
small improvement though, in the 2%-5% range).

--
Pierre Phaneuf
Ludus Design, http://ludusdesign.com/
“First they ignore you. Then they laugh at you.
Then they fight you. Then you win.” – Gandhi

I would venture some more by saying that shared pixmaps should be a tad
faster than shared images. Raster told me that is the case, but he is
often over-enthusiastic about new stuff he discovers. He was boasting
that his code didn’t need MTRRs because he did the write-combining
himself by writing in 32-bit chunks, whereas MTRRs actually combine
larger chunks and don’t even apply to shared memory… :)

But for the shared pixmap vs. shared image question, I would agree (a
small improvement though, in the 2%-5% range).

That’s correct. Also, many X servers do not support X shared memory
pixmaps while they do support shared memory images.

-Sam Lantinga				(slouken at devolution.com)

Lead Programmer, Loki Entertainment Software
--
“Any sufficiently advanced bug is indistinguishable from a feature”
– Rich Kulawiec

What is cool about this is that in a regular program, you can call
XCopyArea right before doing the “real work” of the program, then call
XSync right before sending another XCopyArea (to make sure the first one
is completed; if it is already finished, the XSync will return
immediately).

Is there a way to take advantage of this from within SDL?

Currently I call XSync() after XShmPutImage(), because the semantics
are “when the update call returns, the requested update is visible”

-Sam Lantinga				(slouken at devolution.com)

Lead Programmer, Loki Entertainment Software
--
“Any sufficiently advanced bug is indistinguishable from a feature”
– Rich Kulawiec



Pierre, your basic complaint is that X does not allow you precise
control over which pixmaps are put into video memory and which are not.
This is unfortunately a limitation of the X server, and though I have
sent a message to the XFree86 development list, I do not expect this
to change.

You might take a look at the new SDL framebuffer console driver, which
is in its infancy but has direct access to the entire video memory and
acceleration for supported video cards. Currently, both 3Dfx and Matrox
cards are supported by the fbcon driver.
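
From the application side, the request is the usual SDL one; whether it
is honoured depends on the driver (fbcon or DGA can satisfy it, a plain
X11 window cannot):

```c
#include "SDL.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    if (SDL_Init(SDL_INIT_VIDEO) < 0)
        return 1;

    /* Ask for a fullscreen hardware surface. */
    SDL_Surface *screen = SDL_SetVideoMode(640, 480, 16,
                                           SDL_HWSURFACE | SDL_FULLSCREEN);
    if (screen == NULL) {
        SDL_Quit();
        return 1;
    }

    if (screen->flags & SDL_HWSURFACE)
        printf("got a real hardware surface\n");
    else
        printf("fell back to a software surface\n");

    SDL_Quit();
    return 0;
}
```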

See ya!
-Sam Lantinga (slouken at devolution.com)

Lead Programmer, Loki Entertainment Software
--
“Any sufficiently advanced bug is indistinguishable from a feature”
– Rich Kulawiec

Is there a way to take advantage of this from within SDL?

Currently I call XSync() after XShmPutImage(), because the semantics
are “when the update call returns, the requested update is visible”

-Sam Lantinga (slouken at devolution.com)

Maybe provide a video mode flag SDL_DEFER_UPDATE_SYNC for
SDL_SetVideoMode(), and provide an SDL_Sync() routine?

Would only be useful on platforms that can make use of such a feature, but
it might be a little additional help for those of us who want to fine-tune
the currently under-performing X11 environment.
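
Usage could look something like this (hypothetical, of course, since
neither the flag nor SDL_Sync() exists yet; both are declared below as
placeholders purely for illustration):

```c
#include "SDL.h"

/* Proposed API from this thread; neither of these exists in SDL today. */
#define SDL_DEFER_UPDATE_SYNC 0x10000000   /* placeholder value */
extern void SDL_Sync(void);                /* proposed routine */

extern void run_game_logic(void);          /* stand-in for the real work */

/* The mode would be requested with something like:
 *   SDL_SetVideoMode(640, 480, 16, SDL_SWSURFACE | SDL_DEFER_UPDATE_SYNC);
 */
void frame(SDL_Surface *screen, SDL_Surface *scene)
{
    SDL_BlitSurface(scene, NULL, screen, NULL);
    SDL_UpdateRect(screen, 0, 0, 0, 0);    /* would no longer imply XSync */

    run_game_logic();                      /* overlaps with the server's blit */

    SDL_Sync();                            /* wait here, right before the next update */
}
```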

I would be willing to provide a patch, if this is deemed a worthy
addition, to be incorporated into 1.0.2.

Looking through, I see XSync is called in SDL_x11mouse.c, SDL_x11wm.c and
SDL_x11video.c.

X11_MITSHMUpdate() and X11_NormalUpdate() seem to be the most important
ones though. Looking at the other places XSync is called, there are some
places you definitely would not want to defer a sync.

--
Brian

In article <3870D218.B3C90EFE at ludusdesign.com>,
Pierre Phaneuf writes:

Stephane Peter wrote:

Note that this stuff is raw Xlib; I ask here because I know many good
Xlib hackers (working on SDL) hang out here.

X11 on the other hand, is creating both pixmaps in system memory and
performing (host --> video) blits using the CPU to transfer the data.
In addition, you have the X protocol overhead.

The fact that you don’t use any hardware acceleration in X11 explains
the huge performance difference by itself. Read and write accesses across
the bus are much slower than the internal blitter of any modern card. I’d
say that the X protocol overhead is almost nothing compared to this…

Well… “you don’t use any hardware acceleration” should be more like
“X11 doesn’t use hardware acceleration”! Is there any way I can make
XFree86 do the right thing?

I know the internal blitter of my video card is an order of magnitude
faster than what I get now! Now, how do I get to use it???

Well, the fact is that the X server is the only software responsible for
this. Only the X server decides how to accelerate things, and there is no
way that I know of in the current X11 spec that allows a client to force
the server to use acceleration (X11 was not designed with games in mind!).

Although most if not all X11 drivers (especially in XFree86) at least have
blits and rectangle fills hardware accelerated (otherwise the performance
would be utterly slow; try the framebuffer-only X server to get an idea)…

This is currently typical of Windows v.s. X11, but in real-world
2D games, most of the time the drawing is all done to an offscreen buffer
in system memory, and then blitted to the video card. This makes the
difference between Windows and X11 much lower.

This is a very bad approach. The problem I see in SDL is that the X11
implementation does not take advantage at all of the accelerations offered
by the X server. Having worked on XFree86 for a few months, I think that the
performance could be improved if we could find a way to store SDL surfaces in
X objects (like Pixmaps), instead of in a malloc()’ed chunk of system memory.
The fact is that XAA (the X Acceleration Architecture) manages a pixmap cache
in offscreen video memory, which can then be blitted at maximum speed by
the X server. This is already done in XFree86 3.3.5, but will likely be even
better in 4.0 …

I know this is a bad approach. We decided that for a modern 2D game, we
had to use hardware acceleration as much as possible to obtain a
playable game at high resolutions. In the past, people programmed straight
to bare VGA, and then to some SVGA modes, always using the CPU to do the
transfers. But we think that today everyone who plays games has a video
card capable of rectangle filling and blitting, and we decided to design
with that in mind.

We achieved extraordinary results with some native DirectX tests through
careful hardware usage. Our way of using the hardware resulted in a very
unusual 2D library, with features like memory management that are more
often found in 3D libraries than in 2D ones. For example, where most 2D
libraries give you access to the pixel data of surfaces, we found we could
optimize better by disallowing direct surface access and having the
library user “upload” the surface data instead, similar to texture
management in OpenGL.

This mostly works the same way in X11, although you don’t have any control
over it (which is a shame for game developers!). However, with a bit of
knowledge of how X servers work (and mostly XFree86 in this case), you can
arrange things so that you increase the chances that hardware acceleration
will be used (using Pixmaps that may be stored in video memory, for
instance)…

Doing things the DirectX way is almost impossible in X, because it is just
plain dirty and in contradiction with what X11 stands for. The only
solution would be to use a new X11 extension, something like DGA 2.0 …

[…]

I believe that XFree86 4.0 is more likely to use video memory for pixmap
surfaces, and has more support for taking advantage of DMA busmastering,
but I haven’t checked, so I could be wrong.

AFAIK we could take advantage of pixmap surfaces in memory right now, but
I don’t think there is really DMA support yet, because this needs help from
the kernel and is thus an unportable feature (and only some chipsets may
allow this, in fact).

How do I take advantage of the XAA pixmap cache then? I am not some
clueless guy looking for a quick fact; rather, I am looking for somebody
familiar with the internals of XFree86 who could tell me what magic I am
missing to get pixmaps to go into video memory… Something in the graphics
context? Help!!!

Here is a post from Mark Vojkovich, one of the main XFree86 developers:

-------------------
On Mon, 3 Jan 2000, Sam Lantinga wrote:

Is there any way to tell that a pixmap has been cached in hardware and
will be HW accelerated in blitting? Is there any way to request a
pixmap that must reside in acceleratable video memory?

Nope. I can tell you under what conditions XAA will currently
put a pixmap in offscreen memory though. If it has an area
larger than 64000 pixels and there’s room for it (and provided
the driver has allowed this) it will get stuck in offscreen
memory. It can get kicked out by the server at any time to
make room for something else though.

That’s assuming you’re talking about using Pixmaps for
back buffers. Pixmaps used as GC tiles are handled a little
differently.

                            Mark.
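
To put numbers on that condition: a 640x480 back buffer is 307,200 pixels,
comfortably past the 64,000-pixel mark, so it is at least eligible for
offscreen memory (eligible, not guaranteed):

```c
/* The eligibility test Mark describes; meeting it is necessary but not
 * sufficient (free offscreen room and driver support also matter). */
int xaa_may_cache(int width, int height)
{
    return (long)width * height > 64000;
}

/* xaa_may_cache(640, 480) -> 1 (307,200 px); xaa_may_cache(200, 200) -> 0. */
```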


Stephane Peter
Programmer
Loki Entertainment Software

“Microsoft has done to computers what McDonald’s has done to gastronomy”