X11 again

Hi guys. Does anyone know how to use hardware acceleration under X11?
The second question is complicated: when I create an image in shared
memory, must I align each row of my image for faster blitting? I mean
the following: if I don’t align the image and the hardware requires the
image to be aligned, the X server performs the alignment itself, so we
get a performance penalty.

Is there anybody experienced in X11 programming whom I can talk to, by
ICQ or email?
--
With best regards, Razor.X.Jackie
"The choice is yours… walk now and live or stay and die"

Maybe this needs to be placed on the front page of the SDL website, as
many times as it’s been asked in the last month on this list. :)

SDL does not currently support hardware acceleration under X11. That
will be available when Sam gets the time to code support for DGA 2.0
(XFree86 4.0).
--
Brian

On Sat, 25 Mar 2000, razorjack wrote:

Hi guys. Does anyone know how to use hardware acceleration under X11?
The second question is complicated: when I create an image in shared
memory, must I align each row of my image for faster blitting? I mean
the following: if I don’t align the image and the hardware requires the
image to be aligned, the X server performs the alignment itself, so we
get a performance penalty.

Is there anybody experienced in X11 programming whom I can talk to, by
ICQ or email?


With best regards, Razor.X.Jackie
"The choice is yours… walk now and live or stay and die"

Hi guys. Does anyone know how to use hardware acceleration under X11?

Which functions are accelerated is up to the X server. The following functions
are typically accelerated:

  • various drawing primitives: (thin) lines, ellipses, arcs
  • text blitting
  • filling areas with a solid colour, and sometimes with a pattern
  • copying rectangular areas from one spot on the screen to another
    (and as a special case, scrolling)
  • handling the mouse cursor :)

Less often you will also find the following:

  • copying/filling shaped objects (using a clip mask)
  • copying between pixmaps and the screen (not so common because pixmaps often
    reside in ordinary RAM)

Then you have special things like transparent overlays, hardware 3D support,
etc., all of which are server- and hardware-dependent.
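
For concreteness, these primitives correspond to ordinary Xlib calls. A
small illustration, assuming dpy/win/gc come from the usual
XOpenDisplay/XCreateSimpleWindow/XCreateGC setup; whether any given call
is actually accelerated depends, as said, on the server:

    #include <X11/Xlib.h>

    /* Illustrative only: the Xlib entry points behind the primitives
       listed above. */
    void demo_primitives(Display *dpy, Window win, GC gc)
    {
        XDrawLine(dpy, win, gc, 0, 0, 100, 100);             /* thin line */
        XDrawArc(dpy, win, gc, 10, 10, 80, 40, 0, 360 * 64); /* ellipse; angles in 1/64 degree */
        XDrawString(dpy, win, gc, 10, 60, "text", 4);        /* text blitting */
        XFillRectangle(dpy, win, gc, 0, 0, 50, 50);          /* solid fill */
        XCopyArea(dpy, win, win, gc, 0, 0, 50, 50, 60, 60);  /* screen-to-screen copy */
    }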

The second question is complicated: when I create an image in shared
memory, must I align each row of my image for faster blitting? I mean
the following: if I don’t align the image and the hardware requires the
image to be aligned, the X server performs the alignment itself, so we
get a performance penalty.

You don’t have a choice, since XShmCreateImage does not let you specify
the quantum of a scanline. You have to accept its bitmap_pad.

Note that XShmPutImage is usually not accelerated in any particular way;
normally it is just the CPU shoveling data across the bus.
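
For reference, a minimal sketch of the MIT-SHM setup being described,
error handling omitted: the server chooses the row pitch, and you read
it back from bytes_per_line rather than specifying it.

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <X11/Xlib.h>
    #include <X11/extensions/XShm.h>

    XImage *create_shm_image(Display *dpy, XShmSegmentInfo *shminfo,
                             int width, int height)
    {
        int scr = DefaultScreen(dpy);
        XImage *img = XShmCreateImage(dpy, DefaultVisual(dpy, scr),
                                      DefaultDepth(dpy, scr), ZPixmap,
                                      NULL, shminfo, width, height);

        /* The server decided the pitch; honour img->bytes_per_line
           when writing pixel data. */
        shminfo->shmid = shmget(IPC_PRIVATE,
                                img->bytes_per_line * img->height,
                                IPC_CREAT | 0777);
        shminfo->shmaddr = img->data = shmat(shminfo->shmid, NULL, 0);
        shminfo->readOnly = False;
        XShmAttach(dpy, shminfo);
        return img;
    }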

razorjack wrote:

Hi guys. Does anyone know how to use hardware acceleration under X11?
The second question is complicated: when I create an image in shared
memory, must I align each row of my image for faster blitting? I mean
the following: if I don’t align the image and the hardware requires the
image to be aligned, the X server performs the alignment itself, so we
get a performance penalty.

Align each row? There is a parameter in the XImage structure about this,
I don’t remember which (bytes_per_line, perhaps). If you ignore it and it
works, I guess the pitch is the same as the width…

As for getting hardware acceleration under X11, good luck! I know you
won’t get much with shared memory. I think XFree86 3.3.x uses something
crummy, along the lines of memcpy(), to blit from shared memory to the
framebuffer (30 MB/s on my system). As far as I can see, XCopyArea does
the same. The only thing that is very fast is window-to-window XCopyArea
(if everything is visible, maybe 10 times faster). Maybe a tile-based
game could save some time by blitting every copy of a tile after the
first from the first one, using XCopyArea.
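
A sketch of that trick, assuming a row of identical, fully visible
tiles: only the first copy crosses the bus, the rest are server-side
copies.

    #include <X11/Xlib.h>
    #include <X11/extensions/XShm.h>

    void draw_tile_row(Display *dpy, Window win, GC gc, XImage *tile,
                       int tile_w, int tile_h, int y, int count)
    {
        /* First instance goes through shared memory... */
        XShmPutImage(dpy, win, gc, tile, 0, 0, 0, y,
                     tile_w, tile_h, False);
        /* ...the rest are duplicated with (fast) window-to-window copies. */
        for (int i = 1; i < count; i++)
            XCopyArea(dpy, win, win, gc, 0, y, tile_w, tile_h,
                      i * tile_w, y);
    }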

XFree86 4.0 is much better in some regards. I think that if you are very
lucky and use a large enough non-shared Pixmap, it might get allocated
in off-screen video memory and XCopyArea will be very fast (just like
window-to-window).

Is there anybody experienced in X11 programming whom I can talk to, by
ICQ or email?

If you want to discuss it with me by e-mail, feel free!
--
Pierre Phaneuf
Systems Exorcist

Maybe a tile-based
game could save some time by blitting every copy of a tile after the
first from the first one, using XCopyArea.

During my incessant experimental quest for the ultimate way of doing
tile-based games under X11, I have indeed considered doing this. You don’t
get double-buffering though, so it can only be used for unobscured tiles;
I didn’t find it very useful.

But Sam’s xf86vidmode-trick for full-screen displays in X11 gives us something
more: we can use the rest of the screen area for scratch space! Both fast
blitting and page flipping (if the X11 root size is at least twice that
of the fullscreen window, in any direction). Too bad it isn’t portable,
but it would be a cool and useful trick.

If you have a multi-layer device, you can “hide” your scratch windows
under overlays (which can be painted decoratively or even used for something
in the game). (On a single-layer frame buffer you will just get loads of
GraphicsExpose events, so you can detect this and fall back on pixmaps.)
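
A rough sketch of that detection: with graphics_exposures enabled on the
GC (the default), XCopyArea reports obscured source regions via
GraphicsExpose events and sends NoExpose when the copy was clean.

    #include <X11/Xlib.h>

    Bool copy_was_clean(Display *dpy, Window src, Window dst, GC gc,
                        unsigned int w, unsigned int h)
    {
        XEvent ev;
        Bool clean = True;

        XCopyArea(dpy, src, dst, gc, 0, 0, w, h, 0, 0);
        XSync(dpy, False);          /* make sure the events have arrived */
        while (XCheckTypedEvent(dpy, GraphicsExpose, &ev))
            clean = False;          /* part of the source was obscured */
        while (XCheckTypedEvent(dpy, NoExpose, &ev))
            ;                       /* discard the "copy was clean" events */
        return clean;               /* if False, fall back on pixmaps */
    }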

XFree86 4.0 is much better in some regards. I think that if you are very
lucky and use a large enough non-shared Pixmap, it might get allocated
in off-screen video memory and XCopyArea will be very fast (just like
window-to-window).

Yes, other servers do this as well, I think. But on the other hand,
XFree86 4.0 finally has a useful DGA…

Mattias Engdegård wrote:

Hi guys. Does anyone know how to use hardware acceleration under X11?

Which functions are accelerated is up to the X server. The following
functions are typically accelerated:

  • filling areas with a solid colour, and sometimes with a pattern

True, this is accelerated. Not sure about the pattern (it might depend on
the video card or the size of the pattern; I didn’t play much with this),
but fillrects are definitely accelerated on many cards.

  • copying/filling shaped objects (using a clip mask)

Really? On what server did you see that?

  • copying between pixmaps and the screen (not so common because pixmaps
    often reside in ordinary RAM)

XFree86 3.3.x never does that, but 4.0 does (if the pixmap is big
enough, with enough free video memory and the right phase of the moon).

If you have an image that you load once and don’t change often (like a
set of tiles for example), load it into a Pixmap and PLEASE drop the
shared stuff! Use shared images/pixmaps for communication with the X
server (like uploading to a Pixmap), but Pixmaps are definitely better
if you can use them, as they have more potential for the X server to
optimize and accelerate them.
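
A sketch of that approach, error handling omitted: upload the tile set
once (a plain XPutImage is fine for a one-time transfer), then blit from
the server-side Pixmap, which the server is free to keep in video memory.

    #include <X11/Xlib.h>

    Pixmap upload_tiles(Display *dpy, Window win, GC gc, XImage *tiles)
    {
        Pixmap pm = XCreatePixmap(dpy, win, tiles->width, tiles->height,
                                  DefaultDepth(dpy, DefaultScreen(dpy)));
        XPutImage(dpy, pm, gc, tiles, 0, 0, 0, 0,
                  tiles->width, tiles->height);
        return pm;
    }

    /* Per frame, tiles then come from the pixmap instead of client
       memory, e.g.:
       XCopyArea(dpy, pm, win, gc, src_x, src_y, TILE_W, TILE_H, x, y); */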

You don’t have a choice, since XShmCreateImage does not let you specify
the quantum of a scanline. You have to accept its bitmap_pad.

Note that XShmPutImage is usually not accelerated in any particular way;
normally it is just the CPU shoveling data across the bus.

Yes, at least in XFree86 3.3.x, at the usual glacial PIO transfer rate.
I think 4.0 has some system->video transfers accelerated in some way for
some video cards (using more efficient stuff like bus mastering and/or
DMA).
--
Pierre Phaneuf
Systems Exorcist

  • copying/filling shaped objects (using a clip mask)

Really? On what server did you see that?

I didn’t observe it personally, but a friend told me some HP server did it.
Sorry for not double-checking it. I wouldn’t expect to see it on PC-class
video boards (they are more likely to have support for colour-keyed blits).

  • copying between pixmaps and the screen (not so common because pixmaps
    often reside in ordinary RAM)

XFree86 3.3.x never does that, but 4.0 does (if the pixmap is big
enough, with enough free video memory and the right phase of the moon).

I don’t know about XFree86, but I suppose XSun does (at least it does
not support shared pixmaps).

If you have an image that you load once and don’t change often (like a
set of tiles for example), load it into a Pixmap and PLEASE drop the
shared stuff! Use shared images/pixmaps for communication with the X
server (like uploading to a Pixmap), but Pixmaps are definitely better
if you can use them, as they have more potential for the X server to
optimize and accelerate them.

I agree in principle, but often (in games) you need pixel-level access
to the image being composed. It would be sweet to let the X server do all
the frame rendering, but there aren’t enough primitives for that. As you
say, masked blits are rarely (if ever) accelerated, and then there’s all
the cool magic you can do with alpha blending, lighting, etc.
GLX admittedly changes the picture somewhat, but it’s not present everywhere.

Some simple games run well with X11-based rendering, but I’ve usually been
able to get better performance with MIT-SHM and direct pixel blasting.
The single biggest problem I have is the inability to synchronize to
vertical refresh.

Yes, at least in XFree86 3.3.x, at the usual glacial PIO transfer rate.
I think 4.0 has some system->video transfers accelerated in some way for
some video cards (using more efficient stuff like bus mastering and/or
DMA).

I was under the impression that even a less-than-recent CPU should be able
to saturate the bus rather well. But given the abysmal XShmPutImage rates
these days, I’m more than happy if you are right.

Mattias Engdegård wrote:

I didn’t observe it personally, but a friend told me some HP server did it.
Sorry for not double-checking it. I wouldn’t expect to see it on PC-class
video boards (they are more likely to have support for colour-keyed blits).

And Xlib has no color-keyed blit operation. Darn. (yeah, I KNOW ABOUT
DGA)

XFree86 3.3.x never does that, but 4.0 does (if the pixmap is big
enough, with enough free video memory and the right phase of the moon).

I don’t know about XFree86, but I suppose XSun does (at least it does
not support shared pixmaps).

Hmm, I didn’t benchmark XSun… Hmm, just did a “x11perf
-copypixwin500”, and I’m not sure… It looks fast (it is faster on the
Sun SparcStation 5 than on my overclocked Pentium 225 with Matrox G200),
but not by much (something like 20 MB/s vs 17 MB/s on my Pentium). My
Pentium is about two or three times as fast, so this is nice performance
anyway. I ran it through the network also (sucky Solaris doesn’t have
x11perf). So it sounds accelerated, yes, thank God (or whoever).

I agree in principle, but often (in games) you need pixel-level access
to the image being composed. It would be sweet to let the X server do all
the frame rendering, but there aren’t enough primitives for that. As you
say, masked blits are rarely (if ever) accelerated, and then there’s all
the cool magic you can do with alpha blending, lighting, etc.
GLX admittedly changes the picture somewhat, but it’s not present
everywhere.

Some simple games run well with X11-based rendering, but I’ve usually been
able to get better performance with MIT-SHM and direct pixel blasting.
The single biggest problem I have is the inability to synchronize to
vertical refresh.

Yes, it depends on the game. A tile-and-sprite-based game doesn’t need
to update its tiles and sprites all the time. You could upload a single
largish pixmap for each sprite that would contain all the animated
frames (and probably also upload its precalculated clip mask).
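
A sketch of such a masked sprite blit, assuming the animation frames sit
side by side in one pixmap with a matching 1-bit mask: shift the clip
origin so the right frame of the mask lands under the destination.

    #include <X11/Xlib.h>

    void blit_sprite(Display *dpy, Window win, GC gc,
                     Pixmap frames, Pixmap mask,
                     int frame, int w, int h, int x, int y)
    {
        XSetClipMask(dpy, gc, mask);
        /* Map mask pixel (frame * w, 0) onto destination (x, y). */
        XSetClipOrigin(dpy, gc, x - frame * w, y);
        XCopyArea(dpy, frames, win, gc, frame * w, 0, w, h, x, y);
        XSetClipMask(dpy, gc, None);    /* restore the GC */
    }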

Yes, at least in XFree86 3.3.x, at the usual glacial PIO transfer rate.
I think 4.0 has some system->video transfers accelerated in some way for
some video cards (using more efficient stuff like bus mastering and/or
DMA).

I was under the impression that even a less-than-recent CPU should be able
to saturate the bus rather well. But given the abysmal XShmPutImage rates
these days, I’m more than happy if you are right.

No, they don’t saturate the bus. The problem is with wait states, I
think. While it doesn’t go much faster with DMA or bus mastering, you
can start calculating the next frame while it transfers, so you get a
better overall framerate. With a DirectX test we did, one of my friends
was getting nearly twice as many MB/s blitting from system memory to
video memory as I could do in Linux on a faster machine with a better
video card. Aww…
--
Pierre Phaneuf
Systems Exorcist

Hmm, I didn’t benchmark XSun… Hmm, just did a “x11perf
-copypixwin500”, and I’m not sure… It looks fast (it is faster on the
Sun SparcStation 5 than on my overclocked Pentium 225 with Matrox G200),
but not by much (something like 20 MB/s vs 17 MB/s on my Pentium).

The old 143 MHz UltraSparc I’m sitting in front of does 298 copypixwin500/s,
almost 75 MB/s, but only 104 shmput500/s (both in 8bpp).

Yes, it depends on the game. A tile-and-sprite-based game doesn’t need
to update its tiles and sprites all the time. You could upload a single
largish pixmap for each sprite that would contain all the animated
frames (and probably also upload its precalculated clip mask).

I actually tried that for a tile-and-sprite-based game, and got better
framerates doing it by hand. I think my RLE-encoded sprites were way faster
than the clip-mask-based X11 blitting. And I can still do pixel effects :)
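
Purely for illustration, a blit loop for a hypothetical run-length
format along those lines (each run is a transparent-skip count, an
opaque count, then that many literal pixels):

    typedef unsigned char u8;

    /* Blit one encoded sprite row into an 8bpp destination row. */
    void rle_blit_row(u8 *dst, const u8 *rle, int runs)
    {
        for (int i = 0; i < runs; i++) {
            dst += *rle++;              /* transparent run: just advance */
            int n = *rle++;             /* opaque run: copy n pixels */
            while (n--)
                *dst++ = *rle++;
        }
    }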

No, they don’t saturate the bus. The problem is with wait states, I
think. While it doesn’t go much faster with DMA or bus mastering, you
can start calculating the next frame while it transfers, so you get a
better overall framerate. With a DirectX test we did, one of my friends
was getting nearly twice as many MB/s blitting from system memory to
video memory as I could do in Linux on a faster machine with a better
video card. Aww…

Very interesting! X11 should use that a lot more. There should be no reason
XShmPutImage has to use PIO — it could use the same bus mastering/DMA
as DirectX, and you can even ask for an event to be sent upon completion.
Does XFree86 4.0 do better in this regard? I have no modern PC to try it
on, alas.

Mattias Engdegård wrote:

Hmm, I didn’t benchmark XSun… Hmm, just did a “x11perf
-copypixwin500”, and I’m not sure… It looks fast (it is faster on the
Sun SparcStation 5 than on my overclocked Pentium 225 with Matrox G200),
but not by much (something like 20 MB/s vs 17 MB/s on my Pentium).

The old 143 MHz UltraSparc I’m sitting in front of does 298 copypixwin500/s,
almost 75 MB/s, but only 104 shmput500/s (both in 8bpp).

Very reasonable. This looks to me like the pixmap is stored in main
memory and you are getting some kind of non-PIO transfer to the
framebuffer (bus master/DMA). Or maybe you just have a very good bus.

BTW, if your UltraSparc is old, what should I call my SparcStation 5? :)

Yes, it depends on the game. A tile-and-sprite-based game doesn’t need
to update its tiles and sprites all the time. You could upload a single
largish pixmap for each sprite that would contain all the animated
frames (and probably also upload its precalculated clip mask).

I actually tried that for a tile-and-sprite-based game, and got better
framerates doing it by hand. I think my RLE-encoded sprites were way faster
than the clip-mask-based X11 blitting. And I can still do pixel effects :)

Okay, maybe not for sprites, but background tiles could do well. Oh, but
then you would have to XShmPutImage a sprite on the background, which
wouldn’t do; you want to merge the two beforehand. I know window-to-window
is very fast, so there is a lot to gain there: if you have any
repetitive image on the screen, blit it once through shared memory, then
XCopyArea it everywhere else.

Of course, if shared memory is faster, go for it!

No, they don’t saturate the bus. The problem is with wait states, I
think. While it doesn’t go much faster with DMA or bus mastering, you
can start calculating the next frame while it transfers, so you get a
better overall framerate. With a DirectX test we did, one of my friends
was getting nearly twice as many MB/s blitting from system memory to
video memory as I could do in Linux on a faster machine with a better
video card. Aww…

Very interesting! X11 should use that a lot more. There should be no reason
XShmPutImage has to use PIO — it could use the same bus mastering/DMA
as DirectX, and you can even ask for an event to be sent upon completion.

Barring any DMA-related memory limitation or stuff like that, this
should work. I think this is the big problem, but I don’t know enough
about bus transfers. But an asynchronous DMA transfer that would trigger
an event upon completion (as XCopyArea and XShmPutImage both can already
do) would be sweet.
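
For reference, a minimal sketch of that completion mechanism for
XShmPutImage: pass send_event = True and wait for the ShmCompletion
event before touching the shared buffer again. (A real program would
stash other events for later instead of discarding them.)

    #include <X11/Xlib.h>
    #include <X11/extensions/XShm.h>

    void put_and_wait(Display *dpy, Window win, GC gc, XImage *img,
                      unsigned int w, unsigned int h)
    {
        int completion = XShmGetEventBase(dpy) + ShmCompletion;
        XEvent ev;

        XShmPutImage(dpy, win, gc, img, 0, 0, 0, 0, w, h, True);
        /* Until this event arrives, the shared segment is still being
           read by the server and must not be overwritten. */
        do {
            XNextEvent(dpy, &ev);
        } while (ev.type != completion);
    }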

Does XFree86 4.0 do better in this regard? I have no modern PC to try it
on, alas.

I have no idea.
--
Pierre Phaneuf
Systems Exorcist

I know window-to-window
is very fast, so there is a lot to gain there: if you have any
repetitive image on the screen, blit it once through shared memory, then
XCopyArea it everywhere else.

That only works if nothing is on top of the repeated tiles, not even a bullet.
Whether this is common enough to be worthwhile depends on the game, of course.

Barring any DMA-related memory limitation or stuff like that, this
should work. I think this is the big problem, but I don’t know enough
about bus transfers. But an asynchronous DMA transfer that would trigger
an event upon completion (as XCopyArea and XShmPutImage both can already
do) would be sweet.

In practice the server would have to busy-wait polling for completion,
unless we want to take the pains to make a device that delivers
interrupts to userspace. But considering the alternative (PIO), it would
still be a big win.

Very interesting! X11 should use that a lot more. There should be no reason
XShmPutImage has to use PIO — it could use the same bus mastering/DMA
as DirectX, and you can even ask for an event to be sent upon completion.

I really wish it were possible to DMA a sprite from client memory right into
the framebuffer, but right now that’s pretty difficult… PCI DMA requires
that the source bytes lie in a physically contiguous region of memory, which
is very hard to arrange unless you’re the kernel (or your graphics card
supports scatter/gather, which is pretty rare AFAIK). AGP DMA lifts this
restriction, but you’d still want some kernel support.

DMA between user space and the bus is such a useful thing that it’s being
worked on for 2.4; search the kernel list for “kiobuf”. (BTW this feature
has long been a part of Windows; maybe explaining those wicked DirectX blit
rates =)

It might be possible for the X server to allocate some medium-size static
DMA buffers; the clients could draw in them and blit from there. I’m not
sure if the speed gain would be worth the amount of driver hacking, but at
least it’d get you out of PIO mode.

These problems really show the weaknesses of the PC architecture. True
"workstation" hardware usually has much more robust support for shoveling
data around the system. A few years ago I watched a low-end SGI machine play
uncompressed 2048 x 1536 film frames at 24fps… Well, we’re getting closer
=)

Dan

razorjack wrote:

Hi guys. Does anyone know how to use hardware acceleration under X11?
The second question is complicated: when I create an image in shared
memory, must I align each row of my image for faster blitting? I mean
the following: if I don’t align the image and the hardware requires the
image to be aligned, the X server performs the alignment itself, so we
get a performance penalty.

Now that I saw it mentioned by Sam, fbcon has hardware acceleration with
some drivers (definitely not all). I find it unusable as the main
display subsystem for a consumer commercial game (too complicated to set
up), but as an alternative for people who go out of their way to set it
up, it could be very nice.
--
Pierre Phaneuf
Systems Exorcist

Dan Maas wrote:

I really wish it were possible to DMA a sprite from client memory right into
the framebuffer, but right now that’s pretty difficult… PCI DMA requires
that the source bytes lie in a physically contiguous region of memory, which
is very hard to arrange unless you’re the kernel (or your graphics card
supports scatter/gather, which is pretty rare AFAIK). AGP DMA lifts this
restriction, but you’d still want some kernel support.

Yeah, I know… And isn’t there a “below the 16 megs limit” thing for
DMA also? Or is that only for 8-bit DMA or something like that?

DMA between user space and the bus is such a useful thing that it’s being
worked on for 2.4; search the kernel list for “kiobuf”. (BTW this feature
has long been a part of Windows; maybe explaining those wicked DirectX blit
rates =)

Yes, I’m pretty anxious to see that kind of stuff coming to Linux,
because right now, it sucks pretty badly…

It might be possible for the X server to allocate some medium-size static
DMA buffers; the clients could draw in them and blit from there. I’m not
sure if the speed gain would be worth the amount of driver hacking, but at
least it’d get you out of PIO mode.

I think the DRI kernel module has some way to let processes mmap it to
get DMA buffers, with authentication through the X server using
ioctl()s. It is being touted very loudly as a 3D/OpenGL solution, but
I’m pretty sure we could use that stuff to dramatically improve 2D
performance. We’d just need to look at how to do it. This would
require a client-side video driver library, etc…

These problems really show the weaknesses of the PC architecture. True
"workstation" hardware usually has much more robust support for shoveling
data around the system. A few years ago I watched a low-end SGI machine play
uncompressed 2048 x 1536 film frames at 24fps… Well, we’re getting closer
=)

Yeah, PCs suck pretty badly. At least we can avoid Windows and run
Linux instead… :-)
--
Pierre Phaneuf
Systems Exorcist

Yeah, I know… And isn’t there a “below the 16 megs limit” thing for
DMA also? Or is that only for 8-bit DMA or something like that?

Only for ISA DMA. PCI cards have access to all physical RAM, although you
still need to worry about contiguous buffers. AGP chipsets do their own
scatter/gather so you don’t even need that. The GART kernel module nicely
exports this functionality.

I think the DRI kernel module has some way to let processes mmap it to
get DMA buffers, with authentication through the X server using
ioctl()s. It is being touted very loudly as a 3D/OpenGL solution, but
I’m pretty sure we could use that stuff to dramatically improve 2D
performance. We’d just need to look at how to do it. This would
require a client-side video driver library, etc…

Yes, it would be neat to take advantage of DRI. That looks like the stable
long-term solution. (Actually, as a short-term hack, you could build on
Utah-GLX… It already has AGP and direct rendering; just export some 2D
functions =)

Personally I have the ambition to write a whole windowing system someday
from the ground up - forget X entirely and use direct hardware OpenGL for
everything… =)

Thanks for your comments,
Dan

Personally I have the ambition to write a whole windowing system someday
from the ground up - forget X entirely and use direct hardware OpenGL for
everything… =)

If you don’t mind working on a windowing system that supports network
connections (à la X server on one machine and X client on another), but
is coded with Mesa at its core, look at
http://www.berlin-consortium.org/; they are working on something much
more advanced than X.

Alain