Low performance in DGA

Hi,
it’s nothing important, just strange:
I have a program that runs in 800x600, 16bpp. Blitting is 3 times slower when
run as root in DGA mode than when run in a window!
My X server is XFree86 3.3.2 on an S3 Trio64V+; it supports both 800x600 and 1024x768
(default) in 16bpp, and the Linux kernel is 2.0.36 – so no surface emulation should be
present and DGA access should be the fastest possible (I lock the surface, memcpy the
data and unlock it). But it is 3 times slower than drawing into a window (memory
surface) AND updating it so that it is displayed…
Any idea?
I came across this problem some time ago but I thought the problem was in
using SDL’s blit. Now I use memcpy instead and the results are the same, if not worse :(
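
The lock/memcpy/unlock path I mean looks roughly like this - only a sketch against the SDL surface calls, with the `pixels` back buffer standing in for the program’s own data:

```c
/* Sketch of the lock/memcpy/unlock path described above.  "pixels" is a
 * hypothetical 800x600 16bpp back buffer; the rest is plain SDL. */
#include <string.h>
#include "SDL.h"

static void blit_frame(SDL_Surface *screen, const Uint16 *pixels)
{
    int y;

    if (SDL_MUSTLOCK(screen) && SDL_LockSurface(screen) < 0)
        return;

    /* copy row by row so the surface pitch is respected */
    for (y = 0; y < screen->h; y++) {
        memcpy((Uint8 *)screen->pixels + y * screen->pitch,
               pixels + y * screen->w,
               screen->w * 2);              /* 2 bytes per pixel at 16bpp */
    }

    if (SDL_MUSTLOCK(screen))
        SDL_UnlockSurface(screen);

    SDL_UpdateRect(screen, 0, 0, 0, 0);     /* display the whole frame */
}
```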

Could the problem be in the S3 server? Is it possible that it emulates DGA somehow? Of
course I could try a 2.2 kernel; maybe framebuffer support would help…

Does anyone have experience with DGA? I mean, HOW much faster does a common program
run in DGA mode?

regards
Vasek

Off-topic P.S.: A recent question on mutexes reminded me of one thing I still don’t
understand - imagine you have 2 threads, one of them doing ‘heavy processing’
and the second doing something like “for (;;) {delay(1); if
(thread_1_finished()) break;}”. I would expect thread one to run twice as
slowly (because thread 2 is still doing something!) but that’s not true - when I
tested it, it ran almost as fast as in a single-threaded application…

Might it be caused by the delay function, which could give its slice
of time to the other threads?

Tomas Andrle / red_hatred

On Mon, 19 Jul 1999, Vaclav Slavik wrote:

Of course I could try a 2.2 kernel, maybe framebuffer support would help…

Don’t expect performance increases with the framebuffer device. It is really
only a stopgap measure until your new video board is supported by the
XFree86 servers, as long as it is VESA 2.0 compliant. And you get a nice
penguin picture during boot. But the XFree86-specific accelerated server
for your video board will be best for you.

Regarding why a thread that only sleeps doesn’t waste CPU time and slow down
a thread that is actually doing something:

The delay() function causes suspension of the thread, so even though you think
it does something, the only time it actually runs is during the brief period
between the end of one pass through the loop and the start of the next. In other
words, the thread voluntarily gives up its allocated CPU timeslice when it calls
delay() (or any blocking I/O function, for that matter). Thus, the first thread
still gets most of the CPU time.
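
In SDL terms the situation looks roughly like the sketch below (using the SDL 1.x thread calls SDL_CreateThread, SDL_Delay and SDL_WaitThread; the worker function is invented for illustration). The polling side spends nearly all of its time suspended inside SDL_Delay, so the worker keeps almost the whole CPU:

```c
/* Sketch: one CPU-bound thread plus a poller that only checks a flag.
 * The poller sleeps 1 ms per iteration, so it is suspended almost all
 * of the time and barely competes for the CPU. */
#include "SDL.h"
#include "SDL_thread.h"

static volatile int work_done = 0;

static int heavy_worker(void *unused)       /* the "heavy processing" thread */
{
    long i;
    double x = 0.0;
    for (i = 0; i < 100000000L; i++)
        x += i * 0.5;                       /* stand-in for real work */
    work_done = 1;
    return (x > 0) ? 0 : 1;
}

int main(int argc, char *argv[])
{
    SDL_Thread *t;

    SDL_Init(0);
    t = SDL_CreateThread(heavy_worker, NULL);

    while (!work_done)                      /* the polling side */
        SDL_Delay(1);                       /* suspend; give the slice back */

    SDL_WaitThread(t, NULL);
    SDL_Quit();
    return 0;
}
```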

Warren E. Downs
______________________________ Reply Separator _________________________________
Subject: [SDL] low performance in DGA
Date: 7/19/99 11:12 PM

Off-topic P.S.: A recent question on mutexes reminded me of one thing I still don’t
understand - imagine you have 2 threads, one of them doing ‘heavy processing’
and the second doing something like “for (;;) {delay(1); if
(thread_1_finished()) break;}”. I would expect thread one to run twice as
slowly (because thread 2 is still doing something!) but that’s not true - when
I tested it, it ran almost as fast as in a single-threaded application…

it’s nothing important, just strange:
I have a program that runs in 800x600, 16bpp. Blitting is 3 times slower when
run as root in DGA mode than when run in a window!

What’s happening is each pixel access is turning into a separate bus access.
Since the PCI bus is much slower than the CPU, this translates into slow
memory access. The X server has access to the card’s blitter, so it can
dramatically speed up the video transfer.

If you run the 2.2 kernel, and recompile SDL, it will take advantage of
the MTRR support and speed up your DGA access by a factor of 2 or more.
I don’t think anyone has done any definitive benchmarks on this.
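
For the curious, the 2.2 kernel’s MTRR support boils down to marking the card’s linear framebuffer as write-combining through /proc/mtrr. A rough sketch follows; the base address and size are made up and must match your card’s real framebuffer aperture:

```c
/* Sketch: mark a (hypothetical) framebuffer region as write-combining
 * through the 2.2 kernel's /proc/mtrr interface.  Needs root, and the
 * base/size values must match the card's real linear framebuffer. */
#include <stdio.h>

int main(void)
{
    FILE *mtrr = fopen("/proc/mtrr", "w");
    if (mtrr == NULL) {
        perror("/proc/mtrr");   /* no MTRR support in the kernel, or not root */
        return 1;
    }
    /* placeholder base/size - look the real values up for your board */
    fprintf(mtrr, "base=0xf8000000 size=0x400000 type=write-combining\n");
    fclose(mtrr);
    return 0;
}
```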

-Sam Lantinga				(slouken at devolution.com)

Lead Programmer, Loki Entertainment Software
--
“Any sufficiently advanced bug is indistinguishable from a feature”
– Rich Kulawiec

Off-topic P.S.: A recent question on mutexes reminded me of one thing I still don’t
understand - imagine you have 2 threads, one of them doing ‘heavy processing’
and the second doing something like “for (;;) {delay(1); if
(thread_1_finished()) break;}”. I would expect thread one to run twice as
slowly (because thread 2 is still doing something!) but that’s not true - when I
tested it, it ran almost as fast as in a single-threaded application…

When the second thread is sitting in the delay, the processor time is
being used by thread 1. It doesn’t actually block other threads when
delaying.

Also, a better way would be to use a thread function to wait for the other
thread’s completion, if you can (saves that tiny bit on the testing and
for loop). In many cases, a mutex would work in place of that code.

And for the other person who asked, mutexes are used for thread
locking/synchronization. You create a mutex, and then threads try to lock
it. Only one thread can hold the lock at a time, and the others wait
in turn to get the lock.
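
A minimal sketch of both points against the SDL 1.x thread API (SDL_CreateThread/SDL_WaitThread, and SDL_mutexP/SDL_mutexV for lock/unlock); the worker function and counter are invented for illustration:

```c
#include "SDL.h"
#include "SDL_thread.h"

static SDL_mutex *lock;              /* protects shared_counter */
static int shared_counter = 0;

static int worker(void *unused)      /* stand-in for the real work */
{
    int i;
    for (i = 0; i < 100000; i++) {
        SDL_mutexP(lock);            /* only one thread holds the lock at a time */
        shared_counter++;
        SDL_mutexV(lock);
    }
    return 0;
}

int main(int argc, char *argv[])
{
    SDL_Thread *t;
    int status;

    SDL_Init(0);
    lock = SDL_CreateMutex();
    t = SDL_CreateThread(worker, NULL);

    /* instead of a delay/poll loop, block until the thread finishes */
    SDL_WaitThread(t, &status);

    SDL_DestroyMutex(lock);
    SDL_Quit();
    return status;
}
```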

Justin

At 06:22 PM 7/19/99 -0700, you wrote:

it’s nothing important, just strange:
I have a program that runs in 800x600, 16bpp. Blitting is 3 times slower when
run as root in DGA mode than when run in a window!

What’s happening is each pixel access is turning into a separate bus access.
Since the PCI bus is much slower than the CPU, this translates into slow
memory access. The X server has access to the card’s blitter, so it can
dramatically speed up the video transfer.

If you run the 2.2 kernel, and recompile SDL, it will take advantage of
the MTRR support and speed up your DGA access by a factor of 2 or more.
I don’t think anyone has done any definitive benchmarks on this.

MTRR made a noticeable difference for me when I turned it on. I don't
think it will make a difference for large blits across the bus though, just
lots of small memory accesses and small blits, since they will get grouped
together and sent across the bus at once. At least I assume that's how it works.

-Mongoose
WPI student majoring in Computer Science
This message sent from Windoze… ugh.

Sam Lantinga wrote:

it’s nothing important, just strange:
I have a program that runs in 800x600, 16bpp. Blitting is 3 times slower when
run as root in DGA mode than when run in a window!

What’s happening is each pixel access is turning into a separate bus access.
Since the PCI bus is much slower than the CPU, this translates into slow
memory access. The X server has access to the card’s blitter, so it can
dramatically speed up the video transfer.

Sounds probable, thanks for your answer. (And thank you all who answered :)
However this doesn’t answer my question: why does exactly the same code perform faster
when run under DX? I understood that DGA maps video memory into the address
space, and thus performance should be the same as with SVGAlib or (DOS or DX)
framebuffer access?

Vasek

Justin Bradford wrote:

When the second thread is sitting in the delay, the processor time is
being used by thread 1. It doesn’t actually block other threads when
delaying.

So this is what usleep() [used internally in SDL_Delay] does? I thought it just
did some do-nothing cycle (since it works with MICROseconds…).
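
The difference between a do-nothing cycle and what a sleeping delay actually does looks roughly like this - an illustration only, not SDL’s or libc’s actual source:

```c
#include <stddef.h>
#include <sys/time.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* A do-nothing cycle: the thread stays runnable the whole time, so the
 * scheduler keeps handing it timeslices and it competes with other threads. */
static void busy_delay_ms(long ms)
{
    clock_t end = clock() + (ms * CLOCKS_PER_SEC) / 1000;
    while (clock() < end)
        ;                               /* burn CPU until the time is up */
}

/* A sleeping delay: select() with no descriptors just blocks in the kernel
 * until the timeout expires, so the thread is off the run queue meanwhile. */
static void sleeping_delay_ms(long ms)
{
    struct timeval tv;
    tv.tv_sec  = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    select(0, NULL, NULL, NULL, &tv);
}
```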

Also, a better way would be to use a thread function to wait for the other
thread’s completion, if you can (saves that tiny bit on the testing and
for loop). In many cases, a mutex would work in place of that code.

Of course, the example was oversimplified. The second thread would do something
useful - e.g. move the mouse cursor ;) - or, a real-life example:
thread 1 is decompressing video images as fast as possible, placing them into a
16-image ring buffer;
thread 2 is watching the current position in the WAV being played and displays one image
each time the WAV position reaches some value (every 1/15 second, for example). Something
like “wait, lock mutex, check time, unlock mutex, do the real work if needed… etc.”

Vasek

Sounds probable, thanks for your answer. (And thank you all who answered :)
However this doesn’t answer my question: why does exactly the same code perform faster
when run under DX? I understood that DGA maps video memory into the address
space, and thus performance should be the same as with SVGAlib or (DOS or DX)
framebuffer access?

The performance is the same as SVGAlib (and I think DOS) framebuffer access.
DirectX has access to the card’s accelerated blitter, and enables things like
the MTRR support. There might be more to it that I don’t know.

-Sam Lantinga				(slouken at devolution.com)

Lead Programmer, Loki Entertainment Software
--
“Any sufficiently advanced bug is indistinguishable from a feature”
– Rich Kulawiec