Optimization 101 (fwd)

Since we’re talking about line drawing, here’s some experiments I did a
while ago. It might be helpful. Ignore the rambling, out of context parts.

The program in question is Toby, which is kinda like a better version of
LOGO: http://icculus.org/toby/

(And no need to point out the obvious “don’t use a 32-bit surface”. The
rest still applies. :) )

Hope this is interesting, if not helpful.

--ryan.

---------- Forwarded message ----------
Date: Fri, 13 Apr 2001 18:20:02 -0700 (PDT)
Subject: Optimization 101

Linux hackers boggle my mind.

There’s a device driver that redirects all ioctl calls that command a
CD-ROM to play an audio track to XMMS, so you can play Quake 1 with MP3s
instead of the original audio tracks.

On a slightly more practical note, I wanted to ramble a bit more about
optimizations.

Remember the puzzle I presented you about blitting a diagonal line from a
linear framebuffer to the screen with the greatest speed?

At the time, I thought the fastest method was something like this:

SDL_UpdateRect(screen, 0, 0, half_width, half_height);
SDL_UpdateRect(screen, half_width, half_height, half_width, half_height);

That gave me the best balance between setup-and-talk-to-the-X-server
overhead and the actual amount of data that gets moved to video RAM.

My test program was:
Draw a square, 160 pixels per side, rotate 1 degree to the right, draw
again, for 360 degrees. Each line of each square must be on the screen
before the next can begin drawing.
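
In rough C, the harness looks something like this (a sketch, not the actual
Toby code; draw_line() here is a hypothetical stand-in for whatever renders
one line and gets it onto the screen before the next one starts):

/* Sketch of the benchmark: spin a square with 160-pixel sides through 360
   one-degree steps, drawing every edge with draw_line(), a hypothetical
   stand-in for "render the line and push it to the screen". */
#include <math.h>

extern void draw_line(int x1, int y1, int x2, int y2);

void spin_square_test(int cx, int cy)
{
    const double pi = 3.14159265358979;
    const double radius = 80.0 * sqrt(2.0);  /* center to corner, 160px sides */
    int deg, i;

    for (deg = 0; deg < 360; deg++)
    {
        int x[4], y[4];
        for (i = 0; i < 4; i++)   /* corners of the square, rotated 'deg' degrees */
        {
            double a = (deg * pi / 180.0) + (pi / 4.0) + (pi / 2.0) * i;
            x[i] = cx + (int) (radius * cos(a));
            y[i] = cy + (int) (radius * sin(a));
        }
        for (i = 0; i < 4; i++)   /* each edge must hit the screen before the next */
            draw_line(x[i], y[i], x[(i + 1) % 4], y[(i + 1) % 4]);
    }
}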

This not only guaranteed that there would be lines drawn at every possible
whole number angle, but also in each direction (positive and negative
direction on both X and Y axes). It’s a good test case. In the completed
Java version of Toby, this takes about 330 milliseconds to run, but this
also has the overhead of interpreting Toby code (which I’m pleased to say
is pretty fast, but it’s still an interpreter). I ran this a couple of
times to get the best speed once all the important code chunks got JIT’ed.

The C++-based binary has reasonable compiler optimizations, no debugging
code or printfs, etc. Normally there is an abstraction between SDL and
TurtleSpace, but for these tests, I hardcoded SDL calls into the abstract
base class. All these numbers are best case, taken from several runs. A
multitasking OS and abstractions over video RAM (especially an X server)
gave me variations of up to 100 ms per run. ymmv.

The source code itself is an unoptimized Bresenham line algorithm that
renders the line and then blits the whole 640x480 screen with every line
drawn. It takes 78159ms to run this on my 333MHz Celeron with XFree86
3.3.5. Yes, 78 seconds, versus the Java version’s 330 milliseconds. Yikes.
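
For reference, that naive version amounts to something like the sketch
below: a plain Bresenham loop poking into an assumed 32-bit software
surface, followed by a whole-screen update (surface locking and error
checking omitted for brevity):

#include <stdlib.h>
#include "SDL.h"

/* Poke one pixel into an assumed 32-bits-per-pixel surface.
   (A real version should honor SDL_MUSTLOCK and other depths.) */
static void put_pixel(SDL_Surface *s, int x, int y, Uint32 color)
{
    Uint32 *row = (Uint32 *) ((Uint8 *) s->pixels + (y * s->pitch));
    row[x] = color;
}

/* Unoptimized Bresenham, all octants, then blit EVERYTHING. */
static void draw_line_naive(SDL_Surface *screen, int x0, int y0,
                            int x1, int y1, Uint32 color)
{
    int dx = abs(x1 - x0), sx = (x0 < x1) ? 1 : -1;
    int dy = -abs(y1 - y0), sy = (y0 < y1) ? 1 : -1;
    int err = dx + dy;

    for (;;)
    {
        int e2;
        put_pixel(screen, x0, y0, color);
        e2 = 2 * err;
        if (e2 >= dy)
        {
            if (x0 == x1) break;
            err += dy; x0 += sx;
        }
        if (e2 <= dx)
        {
            if (y0 == y1) break;
            err += dx; y0 += sy;
        }
    }

    /* The painfully slow part: all zeros means "update the whole screen". */
    SDL_UpdateRect(screen, 0, 0, 0, 0);
}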

Blitting just the rectangle that bounds the line takes about 2850ms start
to finish. Definitely an improvement, but still WAY too slow.
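
That version just swaps the whole-screen update for the bounding box of the
endpoints, something like this (a sketch; x0/y0/x1/y1 are the original
endpoints, saved before the Bresenham loop walks them):

/* Update only the rectangle that bounds the line's two endpoints. */
static void update_line_bbox(SDL_Surface *screen, int x0, int y0,
                             int x1, int y1)
{
    SDL_UpdateRect(screen,
                   (x0 < x1) ? x0 : x1,    /* left edge */
                   (y0 < y1) ? y0 : y1,    /* top edge */
                   abs(x1 - x0) + 1,       /* width */
                   abs(y1 - y0) + 1);      /* height */
}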

The test case you came up with (and I agreed with at the time), of
blitting two quarters of the bounding rectangle, dropped us to a run of
1817ms. More than a full second shaved off, which ain’t bad at all.

Now, something I hadn’t explored at the time was the SDL_UpdateRects()
function. This takes an array of rectangles. I used the exact same code as
above, except the two UpdateRect calls get their data smooshed into a
two-element SDL_Rect array and passed into the other function. The result?
About 1590ms. That’s a savings of 227ms of overhead just in talking to the
X server on a local system, even with the MIT shm extension in use.
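
In code, the only difference is batching the same two rectangles into a
single call (a sketch, reusing the half_width/half_height values from
above):

/* Same two quarters as before, but sent in one request. */
SDL_Rect rects[2];

rects[0].x = 0;           rects[0].y = 0;
rects[0].w = half_width;  rects[0].h = half_height;
rects[1].x = half_width;  rects[1].y = half_height;
rects[1].w = half_width;  rects[1].h = half_height;

SDL_UpdateRects(screen, 2, rects);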

Remember that even with the MIT shm extension, there is still data pushed
through a socket between the app and the X server. It doesn’t push the
actual IMAGE through the socket, but it has to pass handles, geometry, etc.

And it shows. So with UpdateRects(), plural, we can send all the
rectangles to update in one burst, and let the other process do all its
work sooner. Also, there’s probably an XFlush() call on each of those
SDL_UpdateRect() calls, so Xlib can’t even queue up the requests.
Actually, that’s probably the bottleneck right there.

I started tweaking. If the real latency was in the individual transmits,
would 3 rectangles be that much slower than 2? I was theorizing that, if
SDL_UpdateRects() was chopping out one part of the processing bulk, the
rest of the win was just reducing the number of pixels that need to get
touched in video memory. This is only partially right, which I’ll
explain in a moment.

I took my code to split the line’s bounding rectangle into quarters and
rewrote it to be more general. It would split the bounding area into an
arbitrary but equal number of smaller rectangles. After some
experimentation, I found that splitting into ten-by-ten chunks gave you
the best boost: there are ten rectangles blitted, which minimized the
unnecessary pixel redrawing, but didn’t choke on transmit. After ten, up
to about 25, there was no noticeable difference in performance (but
obviously you spend that much more time filling in SDL_Rect structs), and
after 25, you tend to get slower performance…I guess that’s the
saturation point. At any rate, it’s a bell curve, and 10 was a good
number. Run time with this method: 750ms, roughly half the previous run.
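
Here’s roughly what the generalized splitter boils down to, under my
reading of it (a sketch: carve the line’s bounding box into n equal steps
along the line and batch the n sub-rectangles into one update; like the
real code, it can miss a pixel or two at chunk boundaries):

/* Split the line's bounding box into 'n' sub-rectangles that follow the
   line, and push them all to the X server in one SDL_UpdateRects() call. */
static void update_line_chunks(SDL_Surface *screen, int x0, int y0,
                               int x1, int y1, int n)
{
    SDL_Rect rects[32];   /* assumes n <= 32; 10 turned out to be the sweet spot */
    int i;

    for (i = 0; i < n; i++)
    {
        int cx0 = x0 + ((x1 - x0) * i) / n;
        int cx1 = x0 + ((x1 - x0) * (i + 1)) / n;
        int cy0 = y0 + ((y1 - y0) * i) / n;
        int cy1 = y0 + ((y1 - y0) * (i + 1)) / n;
        rects[i].x = (Sint16) ((cx0 < cx1) ? cx0 : cx1);
        rects[i].y = (Sint16) ((cy0 < cy1) ? cy0 : cy1);
        rects[i].w = (Uint16) (abs(cx1 - cx0) + 1);
        rects[i].h = (Uint16) (abs(cy1 - cy0) + 1);
    }

    SDL_UpdateRects(screen, n, rects);
}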

Next theory: Most lines don’t use that many runs to render. That is, if I
have a line that goes from (10, 10) to (110, 13) then I’ve drawn a line
that spans 100 pixels horizontally, but only 3 vertically. Since there’s
no antialiasing involved, we can break that down to being three horizontal
runs of about 33 pixels each…or three rectangles that are about 33
pixels wide and 1 pixel high. The added benefit to this method is that we
touch NOTHING on the screen except the line. This isn’t so important to
Toby, but there are lots of other applications that might have incorrect or
half-rendered data in parts of the surface the line passes by. I gave this
a try: 937ms. Hmm.
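
A sketch of the run-based update, assuming a left-to-right, X-major line
(the Y-major case is the mirror image): walk the line one row at a time and
hand each run to the X server as a rectangle one pixel high.

/* Update an X-major line as a batch of 1-pixel-high horizontal runs.
   (Assumes x0 <= x1 and abs(y1 - y0) <= (x1 - x0).) */
static void update_line_runs(SDL_Surface *screen, int x0, int y0,
                             int x1, int y1)
{
    SDL_Rect rects[480];    /* worst case: one run per row of a 640x480 screen */
    int dx = x1 - x0;
    int dy = abs(y1 - y0), sy = (y0 < y1) ? 1 : -1;
    int err = dx / 2;
    int nrects = 0, run_start = x0, x, y = y0;

    for (x = x0; x <= x1; x++)
    {
        err -= dy;
        if ((err < 0) || (x == x1))    /* this row's run ends at x */
        {
            rects[nrects].x = (Sint16) run_start;
            rects[nrects].y = (Sint16) y;
            rects[nrects].w = (Uint16) ((x - run_start) + 1);
            rects[nrects].h = 1;
            nrects++;
            run_start = x + 1;
            y += sy;
            err += dx;
        }
    }

    SDL_UpdateRects(screen, nrects, rects);
}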

What I didn’t take into account is that when you draw a line from (0, 0)
to (160, 160) then you’ve just updated 160 rectangles that are 1 pixel
each. No good.

It’s also important to remember that fewer pixels aren’t necessarily
better; there might be hardware acceleration in the X server that, if it
exists at all, is optimized for moving bigger rectangles, and not
collections of smaller ones, so sooner or later, moving fewer pixels
actually gets more expensive.

I found mixing these two solutions worked rather well: check if the line
will consist of more than 10 individual runs, and if so, split it up into
10 chunks of rectangles and do the update that way. Otherwise, update just
the exact pixels of the runs. Choosing the best optimization on-the-fly is
fun, and at least a little profitable: 741ms.
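
The decision itself is cheap, since the number of runs in a line is just
min(|dx|, |dy|) + 1. A sketch, wired up to the two helpers above (the run
helper, as noted, only handles the left-to-right X-major case; a real
version needs the mirrored cases too):

/* Pick the cheaper update strategy per line: exact runs for lines with
   few runs, ten bounding-box chunks for everything else. */
static void update_line(SDL_Surface *screen, int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0), dy = abs(y1 - y0);
    int runs = ((dx < dy) ? dx : dy) + 1;   /* one run per minor-axis step */

    if (runs > 10)
        update_line_chunks(screen, x0, y0, x1, y1, 10);
    else
        update_line_runs(screen, x0, y0, x1, y1);
}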

To add to this fun, I added special cases. We can accelerate codepaths
that need to render horizontal, vertical, and single-pixel lines, since
there’s no reason to measure runs or chop these up into multiple
rectangles. For my test case, there wasn’t a discernible speedup, since
there are only a few horizontal and vertical lines, and no single-pixel
lines. This isn’t necessarily the case for the usual Toby program (note
the gradient example; every pixel is set by the turtle moving forward one
dot at a time.)
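
The special cases are just early-outs in front of the general path (a
sketch):

/* Degenerate lines are already a single rectangle; skip the run math. */
static void update_line_fast(SDL_Surface *screen, int x0, int y0,
                             int x1, int y1)
{
    if ((x0 == x1) && (y0 == y1))            /* single pixel */
        SDL_UpdateRect(screen, x0, y0, 1, 1);
    else if (y0 == y1)                       /* horizontal line */
        SDL_UpdateRect(screen, (x0 < x1) ? x0 : x1, y0, abs(x1 - x0) + 1, 1);
    else if (x0 == x1)                       /* vertical line */
        SDL_UpdateRect(screen, x0, (y0 < y1) ? y0 : y1, 1, abs(y1 - y0) + 1);
    else
        update_line(screen, x0, y0, x1, y1); /* general case, as above */
}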

There’s still more that could be done. The line drawing algorithm I’m
using was originally a standard Bresenham, which is considered extremely
fast (in running and coding…it’s like 10 lines of C code), but Bresenham
has another line algorithm which is much more complex but is also
apparently much faster; at the time Abrash was writing about it, the VGA
controller couldn’t keep up with the algorithm.

The code that chops up the line’s bounding rectangle into smaller
rectangles tends to miss pixels, but bugs aside, it can be optimized more,
and special-cased for lines with a major axis of X vs Y.

It also might be worth investigating if there’s a way to determine what
lines MUST make it immediately to the screen. This would probably take
some sort of branch prediction in the interpreter, and hooks into the
TurtleSpace class that allow the interpreter to enable/disable blits to
screen. That’s no good, because ideally, the Toby interpreter isn’t bound
to the Toby language, or even to a language that uses TurtleSpace or any
graphics at all, and that abstraction is broken if it requires hooks at
this level. I’ll have to think on it.

That being said, the slowdown isn’t necessarily my code at this point.
Sure, there’s still tweaks to be made, but there’s the inevitable
bottleneck of moving from an abstract (32-bit) framebuffer to the
(possibly 16 or 24 or 8 bit) X server. X as a network protocol is optimized
for rendering graphics primitives, and, if you MUST have bitmapped
graphics, then the preferred tools are pixmaps, which are stored on the
server. For the tinkering that I’m doing via SDL, I can’t use pixmaps,
because I need to twiddle bits, which I can’t do to a pixmap, since it’s
on a different computer.

But, hey, that’s why Toby’s rendering interface is abstracted. I hacked up
a GTK+ version of TurtleSpaceRenderer. GDK (which sits under GTK+) is just
a thin wrapper over Xlib, so when you call gdk_draw_line(), you’re getting
access to the X line primitive with only the overhead of a few function
calls.
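
The whole renderer more or less reduces to calls like this (a sketch of the
shape of it, not the actual TurtleSpaceRenderer code; GTK+ 1.x/2.x-era
API):

#include <gtk/gtk.h>

/* One gdk_draw_line() is one X protocol request; the server draws the
   line itself, with whatever acceleration it has. */
static gint expose_cb(GtkWidget *widget, GdkEventExpose *event, gpointer data)
{
    gdk_draw_line(widget->window,            /* the X drawable */
                  widget->style->black_gc,   /* a ready-made graphics context */
                  10, 10, 160, 160);         /* endpoints; X does the rest */
    return TRUE;
}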

How long does this take to render the image that takes Java 330ms and my
framebuffer 741ms? Six (6) milliseconds.

Draw your own conclusions.

--ryan.

Ryan C. Gordon wrote:

Some things said in here helped a lot. Using SDL_UpdateRects instead of
SDL_UpdateRect made a big difference. I’m just updating a rect around the
line, and already it’s dropped from 12 1/4 seconds (drawing 1000 lines at
640x480) to about 3 1/3 seconds. I will try more things in here, but things
are looking good already! :) Thanks for posting this!

--
Mik Mifflin
.doog era skeynoM