SDL 1.3 on Mac OS X speed hit

Hi there,

I recently tried to use SDL 1.3 (latest SVN development version) with
ScummVM on Mac OS X. The main motivation is that we are looking into
vsync-enabled fullscreen graphics code for various reasons.

And the current SDL 1.2 Quartz backend is so totally, utterly messed
up (I feel entitled to say that, as I contributed my share of code to
it and share the blame :)) that it's virtually impossible to add that
there. In fact, while it has code which tries to implement VBL
syncing, the way it's done, combined with the way CoreGraphics works
with LCD screens, means that the VBL syncing code actually increases
the tearing effect instead of hiding it.

Anyway, I briefly considered ripping out the existing fullscreen code
and replacing it with something new, when I discovered that SDL 1.3
already does that. Great. Very elegant new code, I must say, kudos to
Sam and Ryan and everybody else responsible.

However, when I tried ScummVM with the new API (using the SDL_compat
compatibility layer), I was disappointed to see that it suddenly took
up 90+% of my CPU power; the whole thing was unbearably sluggish.
Upon investigating with Shark, it turned out that it spends virtually
all of that time doing internal texture conversions, in a function
called glgProcessPixels. Ouch!

Some googling quickly revealed the following insightful pointers:
<http://developer.apple.com/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_performance/chapter_13_section_4.html>
<http://developer.apple.com/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_performance/chapter_13_section_2.html#//apple_ref/doc/uid/TP40001987-CH213-SW23>

So, in short: if you use 8888 mode for your texture data, you are
fine; if you use 1555 mode, it's still OK; anything else will cause
you PAIN PAIN PAIN. Sure enough, quickly hacking the ScummVM code to
allocate a (1)555 (15 bpp) surface instead of a 565 (16 bpp) surface
gave a big speed boost; the app was usable again (but still took
about 50% CPU time).
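
For illustration, the ScummVM-side hack boils down to the depth argument
passed to SDL_SetVideoMode(); here is a minimal, self-contained sketch
(the 320x200 size and the SDL_SWSURFACE flag are placeholders, not what
ScummVM actually requests):

#include "SDL.h"

int main(int argc, char *argv[])
{
    SDL_Surface *screen;

    if (SDL_Init(SDL_INIT_VIDEO) < 0)
        return 1;

    /* 16 bpp yields an RGB565 surface, which Apple's GL stack converts
     * on the CPU (the glgProcessPixels hot spot). */
    /* screen = SDL_SetVideoMode(320, 200, 16, SDL_SWSURFACE); */

    /* 15 bpp yields an RGB555 ("1555") surface, one of the formats the
     * driver can upload without an expensive conversion. */
    screen = SDL_SetVideoMode(320, 200, 15, SDL_SWSURFACE);
    if (screen == NULL)
        return 1;

    SDL_Quit();
    return 0;
}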

My question now: would it be possible to take this into account in
the compat layer, or rather in the SDL Cocoa OpenGL code? Given that
Apple specifically documents this bottleneck… and it seems that SDL
itself does a much better job at these bitmap conversions, at least on
my PowerBook G4 1.5 GHz with Radeon 9700 XT onboard graphics. As a
quick test of that claim, I changed SDL_compat.c, replacing in line 487
SDL_VideoTexture = SDL_CreateTexture(desired_format,
SDL_TEXTUREACCESS_LOCAL, width, height);
by
SDL_VideoTexture = SDL_CreateTexture(SDL_PIXELFORMAT_RGB888,
SDL_TEXTUREACCESS_LOCAL, width, height);
and the speed also went up to “fast enough” again (although the
colors were incorrect this way).

Bye,
Max

Anyway, I briefly considered ripping out the existing fullscreen code
and replacing it with something new, when I discovered that SDL 1.3
already does that. Great. Very elegant new code, I must say, kudos to
Sam and Ryan and everybody else responsible.

Thanks!

So, in short: if you use 8888 mode for your texture data, you are
fine; if you use 1555 mode, it's still OK; anything else will cause
you PAIN PAIN PAIN.

My question now: would it be possible to take this into account in
the compat layer, or rather in the SDL Cocoa OpenGL code?

Absolutely. The idea was for the OpenGL driver to suggest some
formats and to do any necessary conversion to get to the optimal format.

Out of curiosity, is the RGB888 code path significantly slower than SDL 1.2?

Can you enter a note in bugzilla so we remember to look at this?

Thanks!
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

So, in short: if you use 8888 mode for your texture data, you are
fine; if you use 1555 mode, it's still OK; anything else will cause
you PAIN PAIN PAIN.

My question now: would it be possible to take this into account in
the compat layer, or rather in the SDL Cocoa OpenGL code?

Absolutely. The idea was for the OpenGL driver to suggest some
formats and to do any necessary conversion to get to the optimal format.

OK, sounds reasonable.

Out of curiosity, is the RGB888 code path significantly slower
than SDL 1.2?

Yes, about 2-3 times slower. And it still seems to spend 50% of the
CPU time in glgProcessPixels, specifically in a method including
"GLGConverter_BGRA8_RGBA8" in its signature. So that sounds as if SDL
uses the "wrong" byte order (in the eyes of the Apple OpenGL
implementation, that is). In particular, Apple wants BGRA8 while SDL
supplies RGB8.

So I just changed SDL_compat.c to pass SDL_PIXELFORMAT_BGRA8888
instead, hoping that would help, but no change, the same slow down in
the same spot still occurs.
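
For reference, the combination Apple's documents describe as the fast
upload path is BGRA-ordered data handed to GL with the reversed packed
type. A minimal sketch of what such an upload looks like (the texture
target, handle, and dimensions here are placeholders, not the actual
SDL_renderer_gl.c code):

#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

/* Apple's documented fast path: BGRA ordering combined with
 * GL_UNSIGNED_INT_8_8_8_8_REV. Other format/type combinations may fall
 * back to a CPU-side conversion inside glgProcessPixels. */
static void upload_frame(GLuint texture, int w, int h, const void *pixels)
{
    glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texture);
    glTexSubImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, 0, 0, w, h,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
}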

Can you enter a note in bugzilla so we remember to look at this?

I will do it, once http://bugzilla.libsdl.org/ works again – or
anything else in the libsdl.org domain, like the SVN server or the
website… :-/

Cheers,
Max

On 23 Jul 2007, at 02:46, Sam Lantinga wrote:

So I just changed SDL_compat.c to pass SDL_PIXELFORMAT_BGRA8888
instead, hoping that would help, but no change, the same slow down in
the same spot still occurs.

The site that you linked to has a recommended pixel format. Can you
modify SDL_renderer_gl.c to use the recommended format and see if that
works?

I will do it, once http://bugzilla.libsdl.org/ works again – or
anything else in the libsdl.org domain, like the SVN server or the
website… :-/

Can you privately tell me what IP address you’re trying to connect
from? It’s working for me, and I know a few IP ranges were blocked
for denial of service attacks.

See ya,
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

However, when I tried ScummVM with the new API (using the SDL_compat
compatibility layer), I was disappointed to see that it suddenly took
up 90+% of my CPU power; the whole thing was unbearably sluggish.
Upon investigating with Shark, it turned out that it spends virtually
all of that time doing internal texture conversions, in a function
called glgProcessPixels. Ouch!

I just spent the day optimizing the renderer for Mac OS X, and found
some interesting things.

First of all, can you update code and see if it’s faster for you?

Second, it turns out that the new MacBook Pro actually has the OpenGL
driver from Leopard back-ported to Tiger. One of the interesting features
I noticed is that it has optimized functions for texture conversion,
including dynamically generated vector assembly code for unusual combinations.

See ya!
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

However, when I tried ScummVM with the new API (using the SDL_compat
compatibility layer), I was disappointed to see that it suddenly took
up 90+% of my CPU power; the whole thing was unbearably sluggish.
Upon investigating with Shark, it turned out that it spends virtually
all of that time doing internal texture conversions, in a function
called glgProcessPixels. Ouch!

I just spent the day optimizing the renderer for Mac OS X, and found
some interesting things.

First of all, can you update code and see if it’s faster for you?

I first had to convince it to compile. Lots of conditional code still
refers to SDL_HWSURFACE, including the SDL_ALTIVEC_BLITTERS stuff in
src/video/SDL_blit_A.c – so I just removed the checks for
SDL_HWSURFACE (hoping that this is correct).

After that, I was able to compile SVN head on my PowerBook G4 1.5 GHz.
So, in short, it's now a lot better than it was (i.e. ScummVM now
runs at an acceptable speed). Looking at the CPU usage, it's still
quite a bit higher. In my "idle" test case, it goes from about 8% CPU
to 12%. But in a "load" case (involving lots of full screen updates
for a scrolling+fading effect), it went from 20% to 70%. It was still
smooth enough, but that is nevertheless quite a serious regression. The
majority of that time is spent in glgProcessColor(), by the way (65%
or so), which in turn is called by glgProcessPixels().

Second, it turns out that the new MacBook Pro actually has the OpenGL
driver from Leopard back-ported to Tiger. One of the interesting features
I noticed is that it has optimized functions for texture conversion,
including dynamically generated vector assembly code for unusual
combinations.

Hm, yeah. Of course I am running on a PowerPC machine… do you
still have one around for testing? Might be worth investigating that
difference…

Bye,
Max

On 12 Aug 2007, at 09:10, Sam Lantinga wrote:

I first had to convince it to compile. Lots of conditional code still
refers to SDL_HWSURFACE, including the SDL_ALTIVEC_BLITTERS stuff in
src/video/SDL_blit_A.c – so I just removed the checks for
SDL_HWSURFACE (hoping that this is correct).

Yep, I’ll check this change in shortly.

After that, I was able to compile SVN head on my PowerBook G4 1.5 GHz.
So, in short, it's now a lot better than it was (i.e. ScummVM now
runs at an acceptable speed). Looking at the CPU usage, it's still
quite a bit higher. In my "idle" test case, it goes from about 8% CPU
to 12%. But in a "load" case (involving lots of full screen updates
for a scrolling+fading effect), it went from 20% to 70%. It was still
smooth enough, but that is nevertheless quite a serious regression. The
majority of that time is spent in glgProcessColor(), by the way (65%
or so), which in turn is called by glgProcessPixels().

Well, I’m glad it’s improved. I’m still looking at optimizations.
I have an old G4 iMac, and on that system my changes yesterday dropped
FPS from 78 to 36, so this may be a case where something that’s good
for the new drivers is bad for the old drivers. I’m going to try the
texture range extension today to see if I can get AGP transfers directly
from system memory.

See ya!
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

Well, I’m glad it’s improved. I’m still looking at optimizations.
I have an old G4 iMac, and on that system my changes yesterday dropped
FPS from 78 to 36, so this may be a case where something that’s good
for the new drivers is bad for the old drivers. I’m going to try the
texture range extension today to see if I can get AGP transfers directly
from system memory.

We could probably get a win by not having to block in glTexSubImage()
when we can use GL_ARB_pixel_buffer_object. Right now it probably does
all the data format conversions before glTexSubImage() returns, since it
has to dereference the data right then.

Pushing the data into a pixel buffer object will let the GL defer that
until it needs it, possibly pushing the work onto the GPU or at least
another CPU core.

That work would best be done as calls to glMapBufferARB() in
GL_LockTexture(), and the unmap in Unlock (extra credit for taking
advantage of GL_APPLE_flush_buffer_range there if it exists). You still
use glTex*Image(), but you add the map/unmap semantics and pass it an
offset instead of a pointer.
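
A minimal sketch of that lock/unlock idea, assuming the renderer has
already created a PBO and a texture for the surface (the names and the
fixed BGRA format below are placeholders, not SDL's actual
GL_LockTexture/GL_UnlockTexture internals):

#include <stddef.h>
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

/* Placeholder handles; in SDL these would live in the renderer's
 * per-texture data. */
static GLuint pbo, texture;

/* "Lock": map the pixel buffer object and hand its pointer to the caller. */
static void *lock_texture(size_t size)
{
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
    /* Orphan the previous contents so the GL doesn't stall on them. */
    glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, size, NULL, GL_STREAM_DRAW_ARB);
    return glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
}

/* "Unlock": unmap, then point glTexSubImage2D() at an offset (0) into the
 * bound PBO instead of a client pointer, so the conversion/transfer can be
 * deferred and run asynchronously. */
static void unlock_texture(int w, int h)
{
    glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB);
    glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texture);
    glTexSubImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, 0, 0, w, h,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, (const GLvoid *) 0);
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
}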

It doesn’t change the basic problem that it seems we’re feeding the GL a
data format it needs to convert on the CPU, but PBOs are going to be
your fastest path to move data around, especially on Mac OS X, and can
probably mitigate the bottleneck somewhat even in the worst case.

Generating pixel data in a different format would resolve this specific
bottleneck, though.

Been spending way too much time reading about this recently. :)

–ryan.

Below is a fix for SDL_SetGammaRamp(). The problem I ran into is that pal
can be NULL, which crashes when trying to dereference elements of the
SDL_Palette structure. Search for BUGFIX to find the line I changed.

Ken Rogoway
Homebrew Software
http://www.homebrewsoftware.com/

int SDL_SetGammaRamp(const Uint16 *red, const Uint16 *green, const Uint16 *blue)
{
	int succeeded;
	SDL_VideoDevice *video = current_video;
	SDL_VideoDevice *this = current_video;
	SDL_Surface *screen = SDL_PublicSurface;

	/* Verify the screen parameter */
	if ( !screen ) {
		SDL_SetError("No video mode has been set");
		return -1;
	}

	/* Lazily allocate the gamma tables */
	if ( !video->gamma ) {
		SDL_GetGammaRamp(0, 0, 0);
	}

	/* Fill the gamma table with the new values */
	if ( red ) {
		SDL_memcpy(&video->gamma[0*256], red,
			256*sizeof(*video->gamma));
	}
	if ( green ) {
		SDL_memcpy(&video->gamma[1*256], green,
			256*sizeof(*video->gamma));
	}
	if ( blue ) {
		SDL_memcpy(&video->gamma[2*256], blue,
			256*sizeof(*video->gamma));
	}

	/* Gamma correction always possible on split palettes */
	if ( (screen->flags & SDL_HWPALETTE) == SDL_HWPALETTE ) {
		SDL_Palette *pal = screen->format->palette;

		/* If physical palette has been set independently, use it */
		if ( video->physpal ) {
			pal = video->physpal;
		}

		/* BUGFIX: pal can be NULL. */
		if ( pal ) {
			SDL_SetPalette(screen, SDL_PHYSPAL,
				pal->colors, 0, pal->ncolors);
		}
		return 0;
	}

	/* Try to set the gamma ramp in the driver */
	succeeded = -1;
	if ( video->SetGammaRamp ) {
		succeeded = video->SetGammaRamp(this, video->gamma);
	} else {
		SDL_SetError("Gamma ramp manipulation not supported");
	}
	return succeeded;
}


Well, I’m glad it’s improved. I’m still looking at optimizations.
I have an old G4 iMac, and on that system my changes yesterday dropped
FPS from 78 to 36, so this may be a case where something that’s good
for the new drivers is bad for the old drivers. I’m going to try the
texture range extension today to see if I can get AGP transfers directly
from system memory.

So it turns out that on my G4 iMac if I use the client-storage extension,
the texture data is asynchronously copied from system memory, which plays
all sorts of havoc with a screen that is continually being updated. It
looks like the Core 2 MacBook Pro driver always blocks until the copy is
complete, and ignores the other optimizing extensions at this point.

sigh

I changed the Cocoa video mode code to report the optimal texture format
as the screen format, so at least we can get the fast path working by default.
Of course this exposed holes in the blit functions - having a screen surface
with an alpha channel isn't something SDL does well, so that'll give me
something to chew on while the MacBook Pro drivers are updated and I do
some research on the options for OpenGL extensions.

Ryan, the extensions I was trying would allow asynchronous DMA directly from
SDL surface memory to the GPU. Pixel buffer objects would involve one extra
copy above that, but may solve some of the other issues. Do you feel like
trying it out? I don’t really know how to use that extension yet.

Thanks!
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

Sam Lantinga wrote:

Well, I’m glad it’s improved. I’m still looking at optimizations.
I have an old G4 iMac, and on that system my changes yesterday dropped
FPS from 78 to 36, so this may be a case where something that’s good
for the new drivers is bad for the old drivers. I’m going to try the
texture range extension today to see if I can get AGP transfers directly
from system memory.

So it turns out that on my G4 iMac if I use the client-storage extension,
the texture data is asynchronously copied from system memory, which plays
all sorts of havoc with a screen that is continually being updated. It
looks like the Core 2 MacBook Pro driver always blocks until the copy is
complete, and ignores the other optimizing extensions at this point.

FWIW, there’s a sample that allows some comparison of the upload speeds:
http://developer.apple.com/samplecode/TexturePerformanceDemo/index.html

Both the GL_STORAGE_CACHED_APPLE hint and GL_UNPACK_CLIENT_STORAGE_APPLE
are commented out.

With frame_rate set to 600 I get these results:

Both settings off: C2D gf7600 ~800MB/sec, G4 mac mini ~170MB/sec
Both settings on : C2D gf7600 ~1200MB/sec, G4 mac mini ~480MB/sec

I didn’t see any screen garbage.
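
For anyone who wants to try the same two settings in their own code, the
calls are roughly as follows. This is only a sketch; the rectangle texture
target and the texture/size/pixel variables are placeholder assumptions,
not the demo's actual code:

#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

/* Enable Apple's client-storage and cached-storage hints for one texture.
 * The pixel data must stay valid in the application's memory afterwards,
 * because the driver may copy from it asynchronously. */
static void setup_client_storage(GLuint texture, int w, int h, const void *pixels)
{
    glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texture);
    glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_STORAGE_HINT_APPLE,
                    GL_STORAGE_CACHED_APPLE);
    glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGBA, w, h, 0,
                 GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
}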

Cheers,
Frank.
--
Need a break? http://criticalmass.sf.net/

FWIW, there’s a sample that allows some comparison of the upload speeds:
http://developer.apple.com/samplecode/TexturePerformanceDemo/index.html

Yep, there’s a more advanced demo I’ve been working with here:
http://developer.apple.com/samplecode/TextureRange/index.html

I’m going to adapt it to do something similar to testsprite and play around
with it.

See ya!
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

Both the GL_STORAGE_CACHED_APPLE hint and GL_UNPACK_CLIENT_STORAGE_APPLE
are commented out.

I think they consider them deprecated, even though the developer notes
were updated recently. The WWDC'07 sessions on OpenGL optimization are
available for free through ADC-on-iTunes (er, with a free registration
at developer.apple.com), and they never mention them, but they DO push
vertex_buffer_object and pixel_buffer_object very heavily.

Which makes sense, since the ARB extensions are more portable, more
flexible, and better designed.

–ryan.

Ryan, the extensions I was trying would allow asynchronous DMA directly from
SDL surface memory to the GPU. Pixel buffer objects would involve one extra
copy above that, but may solve some of the other issues. Do you feel like
trying it out? I don’t really know how to use that extension yet.

Here’s a first shot at it. It’s not ready for prime time, and I’ve only
tested it with testsprite2.c, which isn’t exactly the heaviest pusher of
texture upload performance. :)

I also added support for GL_ARB_texture_non_power_of_two (favoring it
over GL_TEXTURE_RECTANGLE if supported) and
GL_APPLE_flush_buffer_range (only DMA the changed parts of a pixel
buffer instead of the whole thing on unmap).
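
Detecting those extensions is just a substring check on the GL extensions
string. A rough sketch of how the target choice could look (the names here
are mine, not the patch's; real code would cache the glGetString() result
instead of querying it repeatedly):

#include <string.h>
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

static int has_extension(const char *name)
{
    const char *ext = (const char *) glGetString(GL_EXTENSIONS);
    return ext != NULL && strstr(ext, name) != NULL;
}

static GLenum pick_texture_target(void)
{
    /* Prefer plain GL_TEXTURE_2D with non-power-of-two sizes when the
     * driver allows it; otherwise fall back to the rectangle target. */
    if (has_extension("GL_ARB_texture_non_power_of_two"))
        return GL_TEXTURE_2D;
    return GL_TEXTURE_RECTANGLE_ARB;
}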

This probably needs at least minor cleanups and testing, and it might be
worth breaking each path out into separate function pointers to reduce
the if/elseif clutter.

Performance notes coming shortly.

–ryan.

[Attachment: SDL-glpbo-etc-RYAN1.diff
http://lists.libsdl.org/pipermail/sdl-libsdl.org/attachments/20070814/498cde04/attachment.asc]

Here’s a first shot at it. It’s not ready for prime time, and I’ve only
tested it with testsprite2.c, which isn’t exactly the heaviest pusher of
texture upload performance. :)

I’ve been testing it with testsprite, which is a perfect test case for
streaming textures. The SDL_compat code sets up a single streaming
texture and tries to set it up with no locking required. What performance
are you seeing in that case?

As an aside, I'm becoming more and more aware that locking is required if
we're going to do any kind of asynchronous blit acceleration. :)

See ya!
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

Performance notes coming shortly.

These are for anyone who wants to tackle them; some are probably
only small or theoretical wins. There are probably others, but here's a
start.

  • Every call to SDL_SelectRenderer() appears to result in a MakeCurrent
    call, even if we haven’t changed GL contexts, and even if there’s only
    one GL context total. The high-level SDL call should cache the last
    selected renderer and return immediately if the new selection matches
    the old one. There are platforms where making a GL context current is
    extremely expensive, probably even if it’s already current.
    MoveSprites() in testsprite2.c is doing this when there’s only one window.

  • Never, never call a glGet*() entry point if you can avoid it, including
    glGetError(). In the new multithreaded GL on Mac OS X 10.5 (and some
    existing 10.4-based hardware), this will force all threads to
    synchronize, basically nullifying the multithreading boost, even if you
    just call glGetError() once a frame. glGetError() is mostly good for
    detecting programming errors rather than runtime states, so you can live
    without it in production code (#ifdef _DEBUG those sections). State you
    set yourself should be shadowed in local variables instead of retrieved
    with glGetInteger(), etc., or queried once at startup for things like the
    max texture size.

  • Turn on the multithreaded GL. :) Here's the code:

#if MACOSX // or whatever
/* Requires <OpenGL/OpenGL.h> for CGLGetCurrentContext(), CGLEnable()
   and kCGLCEMPEngine. */
CGLContextObj ctx = CGLGetCurrentContext();
const CGLError err = CGLEnable(ctx, kCGLCEMPEngine);
if (err == kCGLNoError) {
    printf("Enabled threaded GL engine\n");
} else {
    printf("Couldn't enable threaded GL engine: err==%d\n", (int) err);
}
#endif

…then it does the right thing (nothing on single CPU systems, etc).
Assuming your code is well-designed (avoid glGet*, etc), it does the
rest behind the scenes with your existing code for fairly significant
speed boosts (almost 2x in certain cases…even complicated things like
World of Warcraft seem to get a very good boost from this).

  • Don’t use glBegin()/glEnd() to put polygons to the screen. Apple
    engineers implied that just having a vertex_buffer_object with one
    rectangle in it will be faster, since there’s all sorts of state that
    has to be built, coordinated, pushed to the hardware, and immediately
    discarded every time in a glBegin/glEnd pair. Better to bind a
    vertex_buffer and draw from that, or at least a client-side vertex
    array. At least wrap that thing in a display list! Any of these options
    adds just a little bit more state to each SDL texture. I think glBegin
    probably nullifies the multithreaded GL, too. Set up the vertices as if
    you are rendering to (0,0-w,h) when creating the texture. Set up the
    texcoords to the full size of the image by default, and cache both of
    these details in a vertex buffer or display list or whatnot. Now make
    use of glTranslate*() and glMatrixMode() to adjust from those base
    positions without rebuilding the vertices every time. I suspect this is
    the biggest bottleneck in testsprite2 at the moment. (A rough sketch of
    the vertex-array version follows this list.)

  • Keep track of state you’ve set, only set it when it changes (if
    texture X is already bound to target Y, calling glBindTexture(Y, X) is a
    waste of time). Not all GL implementations are smart enough to make
    these into no-ops. This is probably a small win overall, though.
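
As promised above, a minimal sketch of drawing one textured quad from a
client-side vertex array instead of a glBegin()/glEnd() pair; the same
arrays could be uploaded once into a vertex_buffer_object and reused. The
function name and the assumption of a plain 2D texture target are mine,
not SDL's actual renderer code:

#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

/* Draw a w-by-h textured quad at the origin from client-side arrays.
 * Per-sprite placement would then be a glPushMatrix()/glTranslatef()/
 * glPopMatrix() around this call instead of rebuilding vertex data. */
static void draw_texture(GLuint texture, GLfloat w, GLfloat h)
{
    const GLfloat vertices[8]  = { 0, 0,  w, 0,  w, h,  0, h };
    const GLfloat texcoords[8] = { 0, 0,  1, 0,  1, 1,  0, 1 };

    glBindTexture(GL_TEXTURE_2D, texture);
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, vertices);
    glTexCoordPointer(2, GL_FLOAT, 0, texcoords);
    glDrawArrays(GL_TRIANGLE_FAN, 0, 4);
    glDisableClientState(GL_TEXTURE_COORD_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}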

testsprite2 is only using 50% of the CPU on this G4 iBook (10.4), which
means we’re leaving a LOT of performance on the table, probably waiting
for the GPU to sync up with all those glBegin/glEnd calls.

–ryan.

  • Every call to SDL_SelectRenderer() appears to result in a MakeCurrent
    call, even if we haven’t changed GL contexts, and even if there’s only
    one GL context total. The high-level SDL call should cache the last
    selected renderer and return immediately if the new selection matches
    the old one. There are platforms where making a GL context current is
    extremely expensive, probably even if it’s already current.
    MoveSprites() in testsprite2.c is doing this when there’s only one window.

Not Mac-related, but anyway…

A few years ago, I remember stumbling across ATI drivers on Windows
which leaked memory on every call to wglMakeCurrent. More info here:

I believe the bug has been fixed for quite some time now, but as it’s
likely that there are still a few PCs out there with those drivers,
wglMakeCurrent is probably something SDL will want to avoid calling
more often than necessary.

When wxWindows was calling this function per frame, my app was leaking
about a megabyte a minute! :D

Pete

Performance notes coming shortly.

Wow, good stuff there. I can handle the high-level stuff (avoiding glGet
calls, eliminating duplicate set calls, etc.). The lower-level things
(setting up matrix transforms, eliminating glBegin/glEnd, etc.) are things
that would take me a fair amount of time to do, since I'm not really an
expert at OpenGL. Is that something you'd be interested in doing when
you have time? Or is there someone else with Mac OS X and OpenGL
experience who'd like to take a look at these? :)

Also, did you look at testsprite’s performance with your patch? I’m
curious how an extra copy from system memory to a pbo and then an async
transfer in the GL driver stacks up against direct DMA from system memory.

BTW, thanks so much for looking at this. I’d been wrestling for a few
days trying to get streaming textures on the fast OpenGL path for good
SDL 1.2 compatibility performance.

Thanks!
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

Performance notes coming shortly.

  • Every call to SDL_SelectRenderer() appears to result in a MakeCurrent
    call, even if we haven’t changed GL contexts, and even if there’s only
    one GL context total. The high-level SDL call should cache the last
    selected renderer and return immediately if the new selection matches
    the old one. There are platforms where making a GL context current is
    extremely expensive, probably even if it’s already current.
    MoveSprites() in testsprite2.c is doing this when there’s only one window.

Okay, this is implemented.

  • Never, never call a glGet*() entry point if you can avoid it, including
    glGetError(). In the new multithreaded GL on Mac OS X 10.5 (and some
    existing 10.4-based hardware), this will force all threads to
    synchronize, basically nullifying the multithreading boost, even if you
    just call glGetError() once a frame. glGetError() is mostly good for
    detecting programming errors rather than runtime states, so you can live
    without it in production code (#ifdef _DEBUG those sections). State you
    set yourself should be shadowed in local variables instead of retrieved
    with glGetInteger(), etc., or queried once at startup for things like the
    max texture size.

I’ll switch this over once the VBO/PBO stuff is done.

  • Turn on the multithreaded GL. :) Here's the code:

All set, pending VBO/PBO changes to take advantage of it.

  • Don’t use glBegin()/glEnd() to put polygons to the screen. Apple
    engineers implied that just having a vertex_buffer_object with one
    rectangle in it will be faster, since there’s all sorts of state that
    has to be built, coordinated, pushed to the hardware, and immediately
    discarded every time in a glBegin/glEnd pair.

I did a test on Windows OpenGL last year, and it was a wash. I opted to
leave it as glBegin and glEnd for simplicity, but since it’s much better
on Mac OS X, yes let’s use a VBO.

  • Keep track of state you’ve set, only set it when it changes (if
    texture X is already bound to target Y, calling glBindTexture(Y, X) is a
    waste of time). Not all GL implementations are smart enough to make
    these into no-ops. This is probably a small win overall, though.

Once the VBO/PBO stuff is done I’ll go through and clean this up. I’ll
wait until then since I’m not sure how much code will change between now
and then.

Thanks!
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

I did a test on Windows OpenGL last year, and it was a wash. I opted to
leave it as glBegin and glEnd for simplicity, but since it’s much better
on Mac OS X, yes let’s use a VBO.

I just moved this to a VBO, and it doesn't appear to help testsprite2
after all, at least on this machine. Let's hold off on that for now,
since it adds complexity.

I'm not sure where this thing is stalling now in that test. I guess
it'll need further research.

–ryan.