I keep hearing from people who complain that SDL is slow on MacOSX.
Having shipped commercial projects using it, I could never understand
why they’d say that. I think I figured out why: the GL codepath is fast,
but the 2D codepaths are not.
I was surprised to find that a 2D MacOSX project I am working on was
sitting in BlitNtoN() for 55% of my CPU time, so I set out to optimize a
The project in question blits a 32-bit surface to the screen surface
once per frame, usually the whole 640x480 area (but less, in some
cases). Preconverting or producing source surfaces in screen format
isn’t practical, so the conversion gets done in SDL_BlitSurface(). The
application wants to write exclusively to a BGRA8888 surface. MacOS is
handing me an ARGB8888 surface, so a blit requires some basic swizzling
but no serious conversion. Since SDL has no optimized blitters for
anything but MMX-based CPUs, we fall into BlitNtoN(), which is
inefficient for several reasons, even for scalar code.
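To make the "basic swizzling" concrete, here is a minimal scalar sketch in C. It is not the code from the patch. It assumes SDL-style packed names, where the first-named channel sits in the most significant byte of the 32-bit pixel, so going from BGRA8888 to ARGB8888 is just a byte reversal. The function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert one packed BGRA8888 pixel to ARGB8888.
 * Assumes the first-named channel is the most significant byte, so
 * B,G,R,A -> A,R,G,B reverses the four bytes of the pixel. */
static uint32_t bgra_to_argb(uint32_t p)
{
    return (p >> 24)
         | ((p >> 8) & 0x0000FF00u)
         | ((p << 8) & 0x00FF0000u)
         | (p << 24);
}

/* Hypothetical helper: swizzle one row of n pixels from src into dst. */
static void swizzle_row(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] = bgra_to_argb(src[i]);
    }
}
```

A per-pixel loop like this is the scalar baseline that a vectorized blitter would replace; it does no format lookups or per-channel masking from a pixel format struct.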
Attached is a patch to add the start of Altivec-based blitters. Besides
the needed structure, I’ve filled in just the one function, which
swizzles from one 8888 format to another, and even there, just the
format I need for my project vs what OSX gives me. New swizzlers can
reuse the same function, at the cost of 64 bytes of data per swizzler.
It’s a total win.
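The idea of one shared function driven by per-format data can be sketched in portable C. This is an assumption-laden illustration, not the patch: it emulates a byte permute over a 16-byte (4-pixel) block with a lookup table, which is the shape of input an Altivec permute takes. The struct, table, and function names are hypothetical, and how the patch arrives at its 64 bytes per swizzler is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-swizzler data: entry i names the source byte (0..15)
 * that lands in output byte i of a 4-pixel block. */
typedef struct {
    uint8_t idx[16];
} swizzle_table;

/* Example table: reverse the four bytes of every pixel. */
static const swizzle_table reverse_bytes_per_pixel = {
    { 3, 2, 1, 0,  7, 6, 5, 4,  11, 10, 9, 8,  15, 14, 13, 12 }
};

/* One function serves every format; only the table changes.
 * dst and src must not overlap. */
static void swizzle_pixels(uint8_t *dst, const uint8_t *src,
                           size_t npixels, const swizzle_table *t)
{
    size_t b;

    /* Whole 4-pixel blocks: the part a vector unit would do in one op. */
    while (npixels >= 4) {
        for (b = 0; b < 16; b++) {
            dst[b] = src[t->idx[b]];
        }
        dst += 16;
        src += 16;
        npixels -= 4;
    }

    /* Leftover pixels: assumes the table repeats the same pattern per
     * pixel, so the first four entries describe a single pixel. */
    while (npixels > 0) {
        for (b = 0; b < 4; b++) {
            dst[b] = src[t->idx[b]];
        }
        dst += 4;
        src += 4;
        npixels--;
    }
}
```

Supporting another 8888 layout would then mean adding another table and passing it to the same function.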
Other blitters (16-bit handlers, etc) would need more work, but are
The end result was hard to gauge, since Shark seems to kill the
performance boost you’d get from cache prefetching. Even before I added
prefetching, the CPU time spent in the blitter dropped from 55% to about
13%, and my framerate went from a consistent 25-27 to 150-300. Once I
added prefetching, it went up to over 4500. Not a typo. Like I said,
- Needs other swizzle data filled in.
- Needs non-32-bit blitters written.
- Move this to a separate file; SDL_blit_N.c is getting cluttered.
- vec_dst gives a HUGE improvement on a G4, but apparently stalls the
pipeline on a G5. Someone should fix that by figuring out how to toggle
use_software_prefetch to 0 on a G5 system (and how to do that on
- Configure.in should let you enable/disable the altivec code, and
should let non-Macs (AmigaOS, PowerPC Linux, etc) use it. Right now it’s
a hardcoded #define to turn it on.
- Configure.in must add -faltivec to gcc’s CFLAGS or it won’t
compile. I hacked the generated Makefile because I’m lazy.
- Someone should have MacOSX builds compile with -O3 instead of -O2
(this comes from Apple’s general recommendation that -O3 is a significant
boost over -O2, unlike, say, x86 Linux). -falign-loops=32 can be a big
help in some cases (especially in the blitters on a G5, if I had to guess).
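For the prefetch item above, a runtime switch could look roughly like the sketch below. Only the name use_software_prefetch comes from the notes; everything else is an assumption. In particular, GCC's __builtin_prefetch stands in for vec_dst so the example builds without Altivec, and the cpu_is_g5 argument is a placeholder, since how to detect a G5 is exactly the open question.

```c
#include <stddef.h>
#include <stdint.h>

/* Runtime switch: 1 = issue prefetch hints, 0 = skip them. */
static int use_software_prefetch = 1;

/* Hypothetical hook: the caller supplies cpu_is_g5 from whatever CPU
 * detection ends up being used; that detection is not shown here. */
static void set_prefetch_for_cpu(int cpu_is_g5)
{
    use_software_prefetch = cpu_is_g5 ? 0 : 1;
}

/* Copy n pixels, hinting the cache 8 pixels ahead when enabled.
 * __builtin_prefetch is a stand-in for vec_dst in this sketch. */
static void copy_row_with_optional_prefetch(uint32_t *dst,
                                            const uint32_t *src,
                                            size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (use_software_prefetch && (i % 8) == 0 && i + 8 < n) {
            __builtin_prefetch(src + i + 8);
        }
#endif
        dst[i] = src[i];
    }
}
```

Checking the flag inside the loop keeps the sketch short; a real blitter would more likely pick between two loop variants once per blit.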
If someone wants to give this patch some love, I’d like to get it into