X11 again

I think that for stable real-time operation on Linux
(e.g. you are doing a backup in the background and want to use the remaining CPU
to play your favorite foobar game at full speed, which is supposed to use less
than 100% of the CPU),
we really need to get the X server out of the 2D rendering cycle during the
running phase.
It would even be cool to be able to mmap() the framebuffer into memory,
even while you are in windowed mode (not fullscreen), and then do your
own clipping.
OK, I know this can leave some garbage on the screen sometimes when you move
windows, just like kwinTV, which maps the TV card’s picture into the
framebuffer memory. (Sometimes when I overlap the TV window with another window
and then move the other window, a part of the TV picture remains on the latter.)
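
To make that concrete, here is a minimal sketch of what mmap()'ing the framebuffer
looks like through the standard Linux fbdev interface (/dev/fb0). This is only an
illustration of the idea, not the X/DGA path discussed below, and any clipping is
still entirely up to the application:

```c
/* Minimal sketch: map the Linux framebuffer device into the process
 * and touch it directly, roughly what "mmap() the framebuffer into
 * mem" would look like via /dev/fb0. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/fb.h>

int main(void)
{
    int fd = open("/dev/fb0", O_RDWR);
    if (fd < 0) { perror("open /dev/fb0"); return 1; }

    struct fb_var_screeninfo var;
    struct fb_fix_screeninfo fix;
    if (ioctl(fd, FBIOGET_VSCREENINFO, &var) < 0 ||
        ioctl(fd, FBIOGET_FSCREENINFO, &fix) < 0) {
        perror("ioctl");
        return 1;
    }

    /* Map the whole visible framebuffer read/write. */
    size_t len = fix.smem_len;
    unsigned char *fb = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
    if (fb == MAP_FAILED) { perror("mmap"); return 1; }

    /* Fill the first scanline as a trivial "rendering" step. */
    memset(fb, 0xff, var.xres * (var.bits_per_pixel / 8));

    munmap(fb, len);
    close(fd);
    return 0;
}
```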

my wishes (on Linux):

  • a set of drivers which allows full direct access to 2D hardware, without any
    intervention of the X server (since the X server runs with SCHED_OTHER as a
    regular process, and performance will suck when you stress your system with
    other background load).
    Maybe the ideal situation would be a set of functions to get access (mmap, DMA)
    to the gfx card’s framebuffer memory, implemented not as a client/server
    system but as a library; that means a SCHED_FIFO process (realtime process)
    (assuming the low-latency extensions get into the official kernel) would
    receive full graphics performance, no matter whether you have heavy system
    load in the background or not.
    Of course these accelerations would be really nice in windowed mode too
    (with the method described above, manual clipping), so that the user gets
    full performance regardless of whether they are working in fullscreen mode
    or not.

  • making the X server “realtime-capable”:
    separate the “deterministic” functions from the
    “non-deterministic” ones.
    That means the X server should run with two threads:
    one doing the “slow things” (running with normal priority), like memory
    allocation/deallocation, font management, etc.;
    the other doing the rendering, which should run with SCHED_RR at the lowest
    realtime priority. In this case the X server can preempt other “disturbing”
    tasks and let you achieve smooth performance when doing X11 rendering,
    independently of the background load (disk I/O, CPU, etc.) you put on your
    machine. (A minimal scheduling sketch follows this list.)
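
As a side note, here is a minimal sketch of what putting a process (or the
hypothetical rendering thread) under a realtime scheduling class looks like on
Linux; it assumes root privileges and is only meant to illustrate the calls
involved:

```c
/* Minimal sketch: put the calling process under SCHED_RR at the
 * lowest realtime priority, as suggested for the rendering thread
 * above (a SCHED_FIFO app would do the same with SCHED_FIFO and a
 * higher priority). Needs root to succeed. */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp;
    sp.sched_priority = sched_get_priority_min(SCHED_RR);

    if (sched_setscheduler(0, SCHED_RR, &sp) < 0) {
        perror("sched_setscheduler");
        return 1;
    }

    /* From here on this process preempts every SCHED_OTHER task
     * (backups, compiles, x11perf, ...) whenever it is runnable. */
    return 0;
}
```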

But that is not enough: to achieve the above, the X11 dispatching loop
(the one select()'ing on the clients’ file descriptors) has to
be modified in order to support a prioritized event system.

E.g. you have an oscilloscope simulation which should be as smooth as possible,
even if you run x11perf at the same time:

  • run your app with SCHED_FIFO / SCHED_RR

  • tell the X server you need higher priority for X11 events

  • the X server rendering process, which is supposed to run with SCHED_RR, now
    knows that your app’s X11 messages have to be processed with higher priority
    than the ones from x11perf.
    The X server could simply use some prioritization mechanism (round-robin,
    FIFO, etc.) to defer x11perf’s X11 messages in order to process the ones of
    your app (see the dispatch-loop sketch after this list).
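
To illustrate the idea only (this is not the real X server dispatch code), a
prioritized dispatch loop could look roughly like this; the client lists and
process_requests() are invented names:

```c
/* Sketch of a prioritized dispatch loop: clients sit in two
 * hypothetical lists, high_prio and low_prio; after select()
 * returns, ready high-priority clients are always serviced before
 * any low-priority one. */
#include <sys/select.h>

struct client { int fd; struct client *next; };

extern struct client *high_prio, *low_prio;      /* hypothetical client lists */
extern void process_requests(struct client *c);  /* hypothetical request decoder */

static void service(struct client *list, fd_set *ready)
{
    for (struct client *c = list; c; c = c->next)
        if (FD_ISSET(c->fd, ready))
            process_requests(c);
}

void dispatch_loop(void)
{
    for (;;) {
        fd_set ready;
        int maxfd = -1;

        FD_ZERO(&ready);
        for (struct client *c = high_prio; c; c = c->next) {
            FD_SET(c->fd, &ready);
            if (c->fd > maxfd) maxfd = c->fd;
        }
        for (struct client *c = low_prio; c; c = c->next) {
            FD_SET(c->fd, &ready);
            if (c->fd > maxfd) maxfd = c->fd;
        }

        if (select(maxfd + 1, &ready, NULL, NULL, NULL) <= 0)
            continue;

        service(high_prio, &ready);  /* e.g. the oscilloscope app */
        service(low_prio, &ready);   /* e.g. x11perf */
    }
}
```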

With the above two methods, one could design games using the first method
and “GUI” apps (the ones using standard X11 calls) using the second, both with
the best possible graphics performance, independently of background activity.

Can SGI machines do similar things?
(I know SGI’s realtime scheduling is quite good; I heard that audio with 5-7 ms
latency is possible even on older machines, but I don’t know if the
realtime story applies to video too.)

Benno.

On Tue, 28 Mar 2000, Dan Maas wrote:

Yeah, I know… And isn’t there a “below the 16 megs limit” thing for
DMA also? Or is this only 8 bit DMA or something like that?

Only for ISA DMA. PCI cards have access to all physical RAM, although you
still need to worry about contiguous buffers. AGP chipsets do their own
scatter/gather so you don’t even need that. The GART kernel module nicely
exports this functionality.

I think the DRI kernel module has some way to let processes mmap it to
get DMA buffers, with authentication through the X server using
ioctl()s. It is being touted very loudly as a 3D/OpenGL solution, but
I’m pretty sure we could use that stuff to dramatically improve 2D
performance. We’d just need to look at how to do it. This would
require a client-side video driver library, etc…

Yes, it would be neat to take advantage of DRI. That looks like the stable
long-term solution. (Actually, as a short-term hack, you could build on
Utah-GLX… It already has AGP and direct rendering; just export some 2D
functions =)

Personally I have the ambition to write a whole windowing system someday
from the ground up - forget X entirely and use direct hardware OpenGL for
everything… =)

Thanks for your comments,
Dan

we really need to get the X server out of the 2D rendering cycle during the
running phase.

I too have been interested in different types of optimizations that could
speed up X. Basically you have two problems: latency and bandwidth. Latency
is the time between the user clicking on a button, and the button graphic
becoming highlighted/depressed on screen. Bandwidth is how many 640x480
32-bit images you can display per second. Just to throw out some numbers,
I’d really love to see latency below 10ms and bandwidth at least 30MB/sec…

Traditional X uses a UNIX domain socket between client and server, a pretty
bad form of data transport (from the viewpoint of optimizing latency and
bandwidth). In the best case, latency isn’t too bad; all that separates a
client drawing request from the server carrying it out is a single context
switch. As the system load increases, latency gets worse since there is more
contention for the CPU… Note that Windows NT used a similar transport
method through v3.51; however, NT got better performance under heavy loads
thanks to some nasty scheduler tricks. Specifically, when a client sent a
drawing request to the server, it could use a system call that would DEMAND
the server thread to be run next, bypassing all other threads waiting for
the CPU. One could theoretically implement this on Linux by, say, adding a
system call to switch right to the server, instead of what the scheduler has
planned. (someday when I get a chance I’ll try this out; it should reduce
worst-case latency to a single context switch, plus the time for
encoding/decoding the drawing request into the IPC buffer. I believe a
context switch would still allow really smoking performance, especially with
shared-memory transport… has anyone timed context switches?)
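
For the curious, the usual way to time them is a pipe ping-pong between two
processes. The sketch below is only a rough, lmbench-style measurement, not a
precise benchmark:

```c
/* Two processes ping-pong one byte over a pair of pipes; each round
 * trip costs two context switches plus the pipe overhead. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define ROUNDS 100000

int main(void)
{
    int p2c[2], c2p[2];
    char b = 'x';

    if (pipe(p2c) < 0 || pipe(c2p) < 0) { perror("pipe"); return 1; }

    if (fork() == 0) {               /* child: echo everything back */
        for (;;) {
            if (read(p2c[0], &b, 1) != 1) _exit(0);
            write(c2p[1], &b, 1);
        }
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ROUNDS; i++) {
        write(p2c[1], &b, 1);        /* wake the child... */
        read(c2p[0], &b, 1);         /* ...and wait for its reply */
    }
    gettimeofday(&t1, NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%.2f us per round trip (>= 2 switches)\n", us / ROUNDS);
    return 0;
}
```

Each round trip includes two context switches plus the pipe overhead, so the
per-switch cost is at most half the printed number.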

Bandwidth, on the other hand, really blows with UNIX sockets. You haven’t
got a chance at displaying full-motion video at 30fps, since all that data
must be serialized and then passed through the tiny socket buffer. (A while
ago I measured the throughput of a UNIX pipe at ~1 MB/sec, ugh). The obvious
way around this is to share memory between client and server, as in X-SHM. A
large enough shared segment could contain all drawing requests and bitmap
data; then you could also switch to a simple semaphore as the IPC mechanism.
Together with a switch_to_server_thread_NOW() system call, this would
approximate NT 3.51 pretty closely. (you could additionally optimize the
server->hardware path by, say, directly DMA’ing between the shared memory
bitmaps and the framebuffer; this could push your rendering bandwidth into
the hundreds of MB/sec on AGP cards!)
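
To sketch the client side of such a transport (the keys and request layout here
are invented, the server is assumed to have created and initialized the segment
and semaphore, and a real design would need a proper ring of requests):

```c
/* Hypothetical client side of the shared-memory transport: a System V
 * shared segment holds drawing requests and bitmap data, and a single
 * semaphore wakes the server. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>

#define SHM_KEY  0x0DA5                /* invented keys */
#define SEM_KEY  0x0DA6
#define SHM_SIZE (4 * 1024 * 1024)     /* "several MB" per client */

int main(void)
{
    int shmid = shmget(SHM_KEY, SHM_SIZE, IPC_CREAT | 0600);
    int semid = semget(SEM_KEY, 1, IPC_CREAT | 0600);
    if (shmid < 0 || semid < 0) { perror("shmget/semget"); return 1; }

    char *buf = shmat(shmid, NULL, 0);
    if (buf == (char *)-1) { perror("shmat"); return 1; }

    /* Deposit a drawing request plus bitmap data directly in shared
     * memory; nothing is serialized through a socket buffer. */
    memcpy(buf, "BLIT ...", 8);

    /* Kick the server: a V() on the semaphore is the whole IPC. */
    struct sembuf kick = { 0, +1, 0 };
    semop(semid, &kick, 1);

    shmdt(buf);
    return 0;
}
```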

Client-side rendering, the method you outlined, is currently in use by
Windows NT4, as well as the XFree 4.0 DRI drivers. I honestly don’t believe
going this far is necessary, since client-side hardware access is very, very
hard to manage, and all it buys you is one measly context switch. You really
don’t want to try multiplexing MMU-less consumer graphics hardware across
many simultaneously rendering processes. Presenting a uniform view of the
hardware to each process at the very least requires lots of kernel support;
you’d need a DRI-like interface that dispatches DMA buffers to the card,
plus some synchronization mechanisms. And how do you manage video memory? Do
you really want to write kernel code to swap bitmaps in and out of card
memory on context switches? And what about security?

For those of us yearning for Windows-like 2D and 3D graphics speed on Linux,
I believe a more realistic approach is a minimalistic client-server
architecture. As in X, the server manages all hardware access and input
devices. Set up several MB of shared memory for each client to transmit
bitmaps and drawing commands. Synchronize the client and server processes
with a simple UNIX semaphore, and modify the Linux scheduler to guarantee
that the server process can run immediately after the client deposits
drawing requests in shared memory. All the server has to do is decode the
simple drawing commands and start DMA’ing bitmaps right into the
framebuffer. You could also head more in the direction of DRI, and allow the
client to build up DMA buffers itself; that might make more sense for 3D
API’s where lots of number-crunching might have to occur to go from drawing
commands to register programming.
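
A purely hypothetical sketch of the server side of such a scheme, just to show
how little decoding work is involved (the command format and
blit_to_framebuffer() are stand-ins):

```c
/* Block on the client's semaphore, decode a fixed-size command header
 * out of the shared segment, and hand the bitmap to the blitter. */
#include <stdint.h>
#include <sys/sem.h>

struct draw_cmd {                 /* invented wire format */
    uint32_t op;                  /* e.g. 1 = blit */
    uint32_t x, y, w, h;          /* destination rectangle */
    uint32_t pixel_offset;        /* offset of pixels within the segment */
};

extern void blit_to_framebuffer(const void *pixels,
                                uint32_t x, uint32_t y,
                                uint32_t w, uint32_t h); /* stand-in */

void server_loop(int semid, char *shm)
{
    struct sembuf wait = { 0, -1, 0 };

    for (;;) {
        semop(semid, &wait, 1);   /* sleep until a client kicks us */
        struct draw_cmd *cmd = (struct draw_cmd *)shm;
        if (cmd->op == 1)
            blit_to_framebuffer(shm + cmd->pixel_offset,
                                cmd->x, cmd->y, cmd->w, cmd->h);
    }
}
```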

I’ve seriously thought about implementing the above, and I might try it when
I get some free time this summer. I can’t wait for the day when dragging
windows around on Linux will be just as fast as NT…

Comments are welcome,
Dan

In the best case, latency isn’t too bad; all that separates a
client drawing request from the server carrying it out is a single context
switch.

Just went and RTFM… Context switches shouldn’t be a problem at all.
They’re less than 10 microseconds. You would definitely have to be careful
that your CPU cache doesn’t go to heck, but the scheduler trickery could
probably take care of that. (you don’t want to give SETI at home a timeslice in
the middle of a rendering operation =)

As far as I can see, one server process accessing hardware on behalf of many
clients would not differ much from many clients rendering directly.

Dan

I’d really love to see latency below 10ms and bandwidth at least 30MB/sec…

10ms latency shouldn’t be a problem even now, if you mean the time
between a mouse click and a button flash. You might even get it over a
network. Surely you have higher goals than that :-)

NT got better performance under heavy loads
thanks to some nasty scheduler tricks. Specifically, when a client sent a
drawing request to the server, it could use a system call that would DEMAND
the server thread to be run next, bypassing all other threads waiting for
the CPU. One could theoretically implement this on Linux by, say, adding a
system call to switch right to the server, instead of what the scheduler has
planned.

Aw… just to get rid of a reschedule? I don’t think it pays off.
If the only eligible process is the newly woken up X server, that is going
to be very fast anyway. The scheduler cost drowns in the cost of a context
switch (switch + TLB flush + cache effects).

The obvious
way around this is to share memory between client and server, as in X-SHM. A
large enough shared segment could contain all drawing requests and bitmap
data; then you could also switch to a simple semaphore as the IPC mechanism.

Many X servers already allow for using shared memory as a local transport
mechanism (XSun, for instance). But I don’t think the gain is as large
nowadays, since the sockets/pipes are highly tuned (I did some experiments
with this some time ago). MIT-SHM still pays off well, since it bypasses the
communication entirely.

Benno Senoner wrote:

  • a set of drivers which allows full direct access to 2D hardware,
    without any intervention of the X server (since the X server runs
    with SCHED_OTHER as a regular process, and performance will suck when
    you stress your system with other background load). Maybe the ideal
    situation would be a set of functions to get access (mmap, DMA) to
    the gfx card’s framebuffer memory, implemented not as a
    client/server system but as a library; that means a SCHED_FIFO
    process (realtime process) (assuming the low-latency extensions get
    into the official kernel) would receive full graphics performance,
    no matter whether you have heavy system load in the background or not.
    Of course these accelerations would be really nice in windowed mode
    too (with the method described above, manual clipping), so that the
    user gets full performance regardless of whether they are working in
    fullscreen mode or not.

DGA will mmap the framebuffer, but everything else goes through the X
server. For things like an accelerator call for a large blit, I think
that it isn’t that bad to have the X server do the call, but I think
this is because things currently suck so much that I would be pleased
with little (like SOME acceleration).

I think that all the versions of DGA are restricted to full-screen.

What is really interesting is that new DRM module in the 2.3.x
kernel… Hmm… This is exactly how direct rendering works: the DRM
module allows the X client to do mmap/DMA to the video hardware (with
authorization from the X server), letting it do its own clipping for
windowed mode. This also means that a full duplicate of the video driver
has to be available to the X client, but this can be hidden in a library
(like libdri.so does).

--
Pierre Phaneuf
Systems Exorcist

Dan Maas wrote:

encoding/decoding the drawing request into the IPC buffer. I believe a
context switch would still allow really smoking performance, especially with
shared-memory transport… has anyone timed context switches?)

The problem I find with context switches isn’t all that much their own
performance, but the hit they impart on other things. My main problem is
that it completely wrecks the cache. I work hard to make optimized
blitting and culling routines, aligning everything and making sure all
my structures fit in the cache, and what do I get? It gets kicked
out and I have a processor doing wait-states, waiting for the data to be
BACK when it could have stayed there the whole time.

Even without realtime scheduling, you would get higher performance if
you removed the context switch. You’d only get a hitch once in a while
when a background process had to do something, but when they are all
sleeping, you don’t get pre-empted and switched. If you use realtime
scheduling, then you don’t get pre-empted, period.

I personally don’t mind that much getting pre-empted once in a while by
a syslogd wake-up, enough that I’m not thinking that hard about realtime
scheduling. Now, using realtime scheduling has some effects (at least,
your cache doesn’t get fucked up), but the X server (which is doing
the bulk of the job) gets the same mistreatment. It could be even
WORSE, because now it has to contend with a hungry, over-prioritized
realtime process!

Client-side rendering, the method you outlined, is currently in use by
Windows NT4, as well as the XFree 4.0 DRI drivers. I honestly don’t believe
going this far is necessary, since client-side hardware access is very, very
hard to manage, and all it buys you is one measly context switch. You really
don’t want to try multiplexing MMU-less consumer graphics hardware across
many simultaneously rendering processes. Presenting a uniform view of the

I think most video cards (including the cheaper top-end professional
offerings) do not have MMUs. The only ones I know of are in the ccNUMA SGI
workstations, which are obviously MMUish stuff.

hardware to each process at the very least requires lots of kernel support;
you’d need a DRI-like interface that dispatches DMA buffers to the card,
plus some synchronization mechanisms. And how do you manage video memory? Do
you really want to write kernel code to swap bitmaps in and out of card
memory on context switches? And what about security?

It’s already there and done (that kernel support), for DRI. Yes, it was
a lot of work. And it’s done, isn’t it wonderful? So now, let’s USE it!

For the security, this is covered by the DRM. The /dev/drm device won’t
talk to a process without being given a cookie that comes from the X
server. I think that root privilege is not required, but they
recommend making this device accessible only to a group, and putting the
appropriate people in that group (like the “floppy” group older Red
Hat used to control access to the /dev/fd0 device).
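
If I read the DRI code right, the cookie handshake boils down to roughly the
sketch below, using libdrm’s drmGetMagic()/drmAuthMagic() from xf86drm.h; how
the magic actually travels to the X server (the DRI protocol request) is
hand-waved here as a stand-in function:

```c
#include <xf86drm.h>

extern int send_magic_to_x_server(drm_magic_t magic); /* stand-in for the DRI request */

int authenticate_with_drm(int drm_fd)
{
    drm_magic_t magic;

    /* Client side: ask the DRM device for a one-time cookie... */
    if (drmGetMagic(drm_fd, &magic) != 0)
        return -1;

    /* ...hand it to the X server, which (being already authenticated)
     * calls drmAuthMagic() on our behalf. Only then will the DRM
     * device accept mmap/DMA requests on drm_fd. */
    return send_magic_to_x_server(magic);
}
```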

For those of us yearning for Windows-like 2D and 3D graphics speed on Linux,
I believe a more realistic approach is a minimalistic client-server
architecture. As in X, the server manages all hardware access and input
devices. Set up several MB of shared memory for each client to transmit
bitmaps and drawing commands. Synchronize the client and server processes
with a simple UNIX semaphore, and modify the Linux scheduler to guarantee
that the server process can run immediately after the client deposits
drawing requests in shared memory. All the server has to do is decode the
simple drawing commands and start DMA’ing bitmaps right into the
framebuffer. You could also head more in the direction of DRI, and allow the
client to build up DMA buffers itself; that might make more sense for 3D
API’s where lots of number-crunching might have to occur to go from drawing
commands to register programming.

This would be a lot of work (work yet to be done, as opposed to the lot
of work already done on DRI). It would be highly unportable (regarding that
“schedule this process next” thing). This actually sounds a lot like
what X currently is (with XShm), the only addition being the kernel
scheduler modification (which you referred to as “nasty” earlier, not a
good sign, eh?).

Note that there is a flaw in the current system: DMA buffers have to be
physically contiguous (AGP lifts that restriction, though), so DMA directly
from shared memory is usually not possible.

--
Pierre Phaneuf
Systems Exorcist

Mattias Engdegård wrote:

NT got better performance under heavy loads
thanks to some nasty scheduler tricks. Specifically, when a client
sent a drawing request to the server, it could use a system call that
would DEMAND the server thread to be run next, bypassing all other
threads waiting for the CPU. One could theoretically implement this
on Linux by, say, adding a system call to switch right to the server,
instead of what the scheduler has planned.

Aw… just to get rid of a reschedule? I don’t think it pays off.
If the only eligible process is the newly woken up X server, that is going
to be very fast anyway. The scheduler cost drowns in the cost of a context
switch (switch + TLB flush + cache effects).

True, thinking about this, in a system doing only this (no daemon or
other process trying hard to run), after doing an operation, there is
probably only one other process on the run queue, the X server itself
(which is marked for wakeup from its ‘select()’).

One thing that might be good is a mechanism I heard some OSes had,
that you could map a whole running process just like a .so and call
entry points in your own context…

--
Pierre Phaneuf
Systems Exorcist

OK, good point, perhaps the context switch is more severe than I thought…
The DRI might make a nice base to build upon. A few things like
synchronizing with the X server could be left behind; a user-mode window
manager could handle the cursor etc. through the direct API. I’m not sure
how aggressive DRI is with DMA; here I’d simply require AGP hardware and use
the gart mechanism to go right from user space. (I really, really want to
see 2048x1536 video playback at 60fps someday…)

Speaking of API, what sort of interface should be exported to user
applications? In a perfect world, I’d love to see just a single bit-blit
function that takes a bitmap, width, height, and destination coordinates
(and maybe an alpha-blending option). Absolutely all drawing would go
through that one function…
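
For illustration, that single entry point might look something like this in C
(all names invented):

```c
/* One possible shape for the single-entry-point drawing API: every
 * client draws by pushing a bitmap at a destination rectangle,
 * optionally alpha-blended. */
#include <stdint.h>

typedef struct {
    uint32_t width, height;
    uint32_t stride;        /* bytes per scanline */
    void    *pixels;        /* 32-bit RGBA */
} bitmap_t;

/* Blit src into the caller's window at (dst_x, dst_y); if use_alpha is
 * nonzero, blend using the bitmap's alpha channel instead of copying. */
int blit(const bitmap_t *src, int dst_x, int dst_y, int use_alpha);
```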

What do you think about exporting (a subset of) OpenGL? This was my original
thought, but texture management seems to be the problem. How would one allow
different processes to share texture memory on the card? Hmm, maybe if there
were an upper limit of clients or something…

Well, enough mouthing off for me. The proof, as always, lies in running code
=). I’ll let you know if I get a chance…

Regards,
Dan

I think that all the versions of DGA are restricted to full-screen.

What is really interesting is that new DRM module in the 2.3.x
kernel… Hmm… This is exactly how direct rendering works: the DRM
module allows the X client to do mmap/DMA to the video hardware (with
authorization from the X server), letting it do its own clipping for
windowed mode. This also means that a full duplicate of the video driver
has to be available to the X client, but this can be hidden in a library
(like libdri.so does).

Does this mean that under Linux 2.4 + XFree 4.0 you will get the mmap()ed
windowed access? That would really rock.

That is exactly what we would need to get windowed full-motion
performance under high load:
the low-latency patches already allow for <5ms latencies WORST CASE,
regardless of the load.
That means if your DVD software decoder takes 50% of the CPU,
you could watch dropout-free video while doing heavy computations or
disk I/O in the background, just as with a hardware decoder.
That would be too cool, since the same concept applies to games.
(You play a game which takes <100% of the CPU at full speed/smoothness,
regardless of the background system load)

Benno.

Benno Senoner wrote:

What is really interesting is that new DRM module in the 2.3.x
kernel… Hmm… This is exactly how direct rendering works: the DRM
module allows the X client to do mmap/DMA to the video hardware (with
authorization from the X server), letting it do its own clipping for
windowed mode. This also means that a full duplicate of the video driver
has to be available to the X client, but this can be hidden in a library
(like libdri.so does).

Does this mean that under Linux 2.4 + XFree 4.0 you will get the mmap()ed
windowed access? That would really rock.

Possibly. Right now, the DRI is all-out on 3D, I don’t think there is
anything for 2D in there. But what it basically does is give you access
to the hardware directly from the client, so there should be a way to
hack something.

That would be too cool, since the same concept applies to games.
(You play a game which takes <100% of the CPU at full speed/smoothness,
regardless of the background system load)

Hmm… Sweeet! :-)

--
Pierre Phaneuf
Systems Exorcist