[PATCH] Altivec blitters

I keep hearing from people who complain that SDL is slow on MacOSX.
Having shipped commercial projects using it, I couldn’t ever understand
why they’d say that. I think I figured out why: the GL codepath is fast,
but the 2D codepaths are not.

I was surprised to find that a 2D MacOSX project I am working on was
sitting in BlitNtoN() for 55% of my CPU time, so I set out to optimize a
little.

The project in question blits a 32-bit surface to the screen surface
once per frame, usually the whole 640x480 area (but less, in some
cases). Preconverting or producing source surfaces in screen format
isn’t practical, so the conversion gets done in SDL_BlitSurface(). The
application wants to write exclusively to a BGRA8888 surface. MacOS is
handing me an ARGB8888 surface, so a blit requires some basic swizzling
but no serious conversion. Having no optimized blitters for anything but
MMX-based CPUs, we fall into BlitNtoN, which is inefficient for several
reasons, even for scalar code.

Attached is a patch to add the start of Altivec-based blitters. Besides
the needed structure, I’ve filled in just the one function, which
swizzles from one 8888 format to another, and even there, just the
format I need for my project vs what OSX gives me. New swizzlers can
reuse the same function, at the cost of 64 bytes of data per swizzler.
It’s a total win.

Other blitters (16-bit handlers, etc) would need more work, but are
possible.

The end result was hard to gauge, since Shark seems to kill the
performance boost you’d get from cache prefetching, but before adding
that, the CPU time spent in the blitter dropped from 55% to about 13%.
My framerate went from around a consistent 25-27 to 150-300. Once I
added prefetching, it went up to over 4500. Not a typo. Like I said,
total win.

Improvements needed:

  • Needs other swizzle data filled in.
  • Needs non 32-bit blitters written.
  • Move this to a separate file; SDL_blit_N.c is getting cluttered.
  • vec_dst gives a HUGE improvement on a G4, but apparently stalls the
    pipeline on a G5. Someone should fix that by figuring out how to toggle
    use_software_prefetch to 0 on a G5 system (and how to do that on
    non-MacOS platforms).
  • Configure.in should let you enable/disable the altivec code, and
    should let non-Macs (AmigaOS, PowerPC Linux, etc) use it. Right now it’s
    a hardcoded #define to turn it on.
  • Configure.in must add -faltivec to gcc’s CFLAGS or it won’t
    compile…I hacked the generated Makefile because I’m lazy.
  • Someone should have MacOSX builds compile with -O3 instead of -O2
    (this follows Apple’s general recommendation that -O3 is a significant
    boost over -O2, unlike, say, x86 Linux). -falign-loops=32 can be a big
    help in some cases (especially in the blitters on a G5, if I had to guess).

If someone wants to give this patch some love, I’d like to get it into
CVS eventually.

–ryan.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed…
Name: SDL-altivec-swizzle-RYAN-1.diff
URL: http://lists.libsdl.org/pipermail/sdl-libsdl.org/attachments/20050106/21b4136e/attachment.txt

Only half related to this patch, but thinking about some older SDL
discussion wrt portable vector assembly, it occurred to me that
prefetch could be used in a portable fashion. So I wrapped up a
prefetch macro that does this on ia64 and ia32 (the ia32 one also works
in x86_64 mode, of course), with actually quite a bit of success.

So, what about having SDL blitters with portable prefetch support as a
second choice of optimization? If you bring the PowerPC prefetch into
the mix, that’s 4 architectures supported. This would allow fairly good
optimizations for a wider range of systems than just ia32 CPUs, with a
lot less work.

Also, if it’s 2D and you have OpenGL (who doesn’t on OSX?), try glSDL :)

Stephane

This is cool! I think there should be three code paths though:

  • G3 (just use normal C code)
  • G4 (use vec_dst)
  • G5 (don’t use vec_dst)

At startup, you should be able to determine the current architecture
and pick the right function pointers.

-bob

On Jan 6, 2005, at 10:15, Ryan C. Gordon wrote:

Very cool! I’d love to include these in SDL, once you’ve fixed the FIXMEs. :)

See ya,
-Sam Lantinga, Software Engineer, Blizzard Entertainment

At startup, you should be able to determine the current architecture
and pick the right function pointers.

Determining the existence of a vector unit is easy, but I’ll have to
look up how to figure out what to do with the G5. The code should run
and choose the C/Altivec path correctly for G3 vs G4/G5 machines right
now, though.

I actually have a significantly better version of this patch coming
soon. I’ll try to address these issues.

–ryan.

From http://developer.apple.com/hardware/ve/g5.html: take a look at
/usr/include/sys/sysctl.h. hw.cputype, or possibly something more
specific to the performance issue such as the cache line size, might be
what you need.

I’m looking forward to your patch. I maintain several Mac OS X ports
of pygame software that would be greatly enhanced by this (they don’t
use OpenGL), and this will give me the excuse to update them :)

-bob

On Jan 16, 2005, at 14:10, Ryan C. Gordon wrote:

Here’s another revision of my Altivec blitter patch. This one cleans up
a lot of the FIXMEs and generalizes the 32bit-to-32bit swizzler to be
able to convert between any 8888 format generically. It obsoletes the
previous patch, and can be applied directly to CVS.

I’ve also added a new program to the test directory named
testblitspeed.c; it’s got about a million options, but to get an idea of
how the Altivec code path did against the standard C version, I ran this:

./testblitspeed --dstbpp 32 --dstwidth 640 --dstheight 480 --srcbpp 32
--srcwidth 640 --srcheight 480 --seconds 10 --dstrmask 0x00FF0000
--dstgmask 0x0000FF00 --dstbmask 0x000000FF --dstamask 0x00000000
--srcrmask 0x000000FF --srcgmask 0x00FF0000 --srcbmask 0x0000FF00
--srcamask 0x00000000

The Altivec code was more than 3 times faster than the C codepath in the
above test.

testblitspeed is in CVS, the Altivec patch is attached. I’d like to hear
from PowerPC users that aren’t MacOS-based to make sure this compiles
cleanly and functions elsewhere.

Ideally, we’d get 32->16 (or more importantly, 16->32) Altivec blitters
in here to complete the speed boost for the rest of the feasible
scenarios, but I have no plans to do these.

–ryan.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed…
Name: SDL-altivec-swizzle-RYAN-2.diff
URL: http://lists.libsdl.org/pipermail/sdl-libsdl.org/attachments/20050215/a43e5ecf/attachment.txt

Hi Ryan,

From your patch:

  /* !!! FIXME: find some math to do this without conditionals! */
  Uint32 srcashift = ( (srcamask & 0x000000FF) ? 0  :
                       (srcamask & 0x0000FF00) ? 8  :
                       (srcamask & 0x00FF0000) ? 16 :
                       (srcamask & 0xFF000000) ? 24 : 0 );

  Uint32 dstashift = ( (dstamask & 0x000000FF) ? 3 :
                       (dstamask & 0x0000FF00) ? 2 :
                       (dstamask & 0x00FF0000) ? 1 : 0 );

Here is a quick proposal. Conditionals are easier to read; I don’t know
if they are really less efficient.

Uint32 srcashift = (((((srcamask >>  8) & 0xFF) + 0xFF) >> 8) *  8) |
                   (((((srcamask >> 16) & 0xFF) + 0xFF) >> 8) * 16) |
                   ((((srcamask >> 24)          + 0xFF) >> 8) * 24);

Uint32 dstashift = ((((dstamask         & 0xFF) + 0xFF) >> 8) * 3) |
                   (((((dstamask >>  8) & 0xFF) + 0xFF) >> 8) * 2) |
                   (((((dstamask >> 16) & 0xFF) + 0xFF) >> 8));

Regards,

Xavier

PS: Please note I actually didn’t even try to compile that code…

Hi,

I have to agree with Xavier, the conditionals are much easier to read;
if you go with the shifting tricks, just make sure that there’s a note
explaining what it does in the code.

I tried Xavier’s example, it does compile :)

-Jon

SDL mailing list
SDL at libsdl.org
http://www.libsdl.org/mailman/listinfo/sdl

Here’s my revised version of the patch
http://redivi.com/~bob/SDL-altivec-swizzle-bob-1.diff.

It’s largely just a cleanup, but provides a marginal amount of extra
functionality. Only tested on Mac OS X 10.3:

  • I made sure the configure.in checks to see that the syntax extension
    you’re using compiles.
  • It no longer tries to execute the code (in the case you’re compiling
    altivec support on a G3 for some reason)
  • It checks for altivec on darwin (I assume you were testing from
    Xcode?)
  • Checks a sysctl to see if it should use prefetch or not (if L3 cache
    present, or not OS X, it uses prefetch – optimal for G4)
  • Instead of changing the size of the blit function table I changed one
    of the fields to be a bitflag rather than a bool… right now MMX is 1,
    Altivec is 2, and don’t-use-prefetch is 4.
  • prefetch and no-prefetch 32-32 blits are separate functions (could be
    the same function with userdata I guess).

Using the same test as above, I was able to reproduce the 3x speed bump
on a dual 2ghz G5 (with second CPU disabled cause it’s broken, argh).

I’m going to profile some real-world SDL games (specifically the ones
that I sort-of-maintain OS X ports for) to see which of the other blit
functions I should vectorize, if any.

-bob

On Feb 15, 2005, at 7:03, Ryan C. Gordon wrote:

  • I made sure the configure.in checks to see that the syntax extension
    you’re using compiles.

Apparently Apple’s GCC uses -faltivec, but the FSF GCC uses -maltivec
and specifies vector constants differently; can someone on linux/ppc
please test this patch?

  • It no longer tries to execute the code (in the case you’re compiling
    altivec support on a G3 for some reason)

You need to put it in a separate, non-inline function if you use vector
intrinsics: GCC inserts an Altivec opcode at the top of the function if
it sees a vector thing…so it’ll still crash on a G3 as-is; the “if
(0)” isn’t enough to prevent it.

  • It checks for altivec on darwin (I assume you were testing from
    Xcode?)

I tested from the command line, but not on a real Darwin system, just
Panther.

  • Checks a sysctl to see if it should use prefetch or not (if L3 cache
    present, or not OS X, it uses prefetch – optimal for G4)

I got a huge boost on my powerbook (L2, but not L3 cache) with the
prefetch. On a G5, the prefetch instructions cause pipeline stalls
(which seems a really silly design decision from where I’m sitting, but
whatever), so those should always avoid the prefetch. The G5, however,
starts automatically prefetching when you touch a few cachelines
linearly, which we do in this function, so it should get the same result
as long as you don’t try to force it with vec_dst().

I’m not sure how to check for this reliably; there’s a way to ask MacOS
"am I on a G5?" but I’m not sure what that does when you are one day on
a G6…there might be a sysctl or Gestalt to query if there’s an
automatic prefetch, though.

The existence of G5-style prefetching is the only time we should avoid
vec_dst*, though.

  • prefetch and no-prefetch 32-32 blits are separate functions (could be
    the same function with userdata I guess).

There were three conditionals regardless of dataset; I’m not really sure
it’s worth splitting it into a separate function.

Using the same test as above, I was able to reproduce the 3x speed bump
on a dual 2ghz G5 (with second CPU disabled cause it’s broken, argh).

A broken G5? That sucks!

I’m going to profile some real-world SDL games (specifically the ones
that I sort-of-maintain OS X ports for) to see which of the other blit
functions I should vectorize, if any.

There’s probably a bunch of games that want to write to a 16-bit surface
regardless of the screen format…there are also a LOT of people that
think running their system in 16-bit color will give them a better
framerate.

I can’t think of a useful way to vectorize 8-bit blits, but there’s
probably some clever way to do this.

–ryan.

  • It no longer tries to execute the code (in the case you’re
    compiling altivec support on a G3 for some reason)

You need to put it in a separate, non-inline function if you use
vector intrinsics: GCC inserts an Altivec opcode at the top of the
function if it sees a vector thing…so it’ll still crash on a G3
as-is; the “if (0)” isn’t enough to prevent it.

Ah, right. D’oh! I’ve never actually even used a G3, so I have no way
of testing this.

  • It checks for altivec on darwin (I assume you were testing from
    Xcode?)

I tested from the command line, but not on a real Darwin system, just
Panther.

Then how did the configure.in possibly turn on Altivec? Did you patch
the makefile or configure script directly?

  • Checks a sysctl to see if it should use prefetch or not (if L3
    cache present, or not OS X, it uses prefetch – optimal for G4)

I got a huge boost on my powerbook (L2, but not L3 cache) with the
prefetch. On a G5, the prefetch instructions cause pipeline stalls
(which seems a really silly design decision from where I’m sitting,
but whatever), so those should always avoid the prefetch. The G5,
however, starts automatically prefetching when you touch a few
cachelines linearly, which we do in this function, so it should get
the same result as long as you don’t try to force it with vec_dst().

I’m not sure how to check for this reliably; there’s a way to ask
MacOS “am I on a G5?” but I’m not sure what that does when you are one
day on a G6…there might be a sysctl or Gestalt to query if there’s
an automatic prefetch, though.

The existence of G5-style prefetching is the only time we should avoid
vec_dst*, though.

My 1ghz titanium powerbook G4 has 1mb of l3 cache:
% sysctl hw.l3cachesize
hw.l3cachesize: 1048576

I think it’s safe enough to use the version w/o prefetch on anything
that looks like a G5… when/if another processor comes out, then we can
worry about fixing the test (if it also does not have L3).

  • prefetch and no-prefetch 32-32 blits are separate functions (could
    be the same function with userdata I guess).

There were three conditionals regardless of dataset; I’m not really
sure it’s worth splitting it into a separate function.

It was easiest this way because I didn’t feel like reading more of the
SDL headers to see what the userdata member of that struct was :)

-bob

On Feb 19, 2005, at 7:39 AM, Ryan C. Gordon wrote:

  • Checks a sysctl to see if it should use prefetch or not (if L3
    cache present, or not OS X, it uses prefetch – optimal for G4)

I got a huge boost on my powerbook (L2, but not L3 cache) with the
prefetch. On a G5, the prefetch instructions cause pipeline stalls
(which seems a really silly design decision from where I’m sitting,
but whatever), so those should always avoid the prefetch. The G5,
however, starts automatically prefetching when you touch a few
cachelines linearly, which we do in this function, so it should get
the same result as long as you don’t try to force it with vec_dst().

What kind of huge boost were you seeing? On my 1ghz titanium powerbook
G4, I am seeing something in the range of 5-10%. Beyond the margin of
error, but I’m not sure I’d call it huge in comparison to the 230-266%
difference between with and without altivec :)

(test.sh runs testblitspeed with the same options you posted before,
but at an interval of 4 seconds)

this acts like a G5, no prefetch

% env SDL_ALTIVEC_BLIT_FEATURES=6 ./test.sh
672 blits took 3857 ms (174 fps).
688 blits took 3860 ms (178 fps).

this is the same as the default for the G4, with prefetch

% env SDL_ALTIVEC_BLIT_FEATURES=2 ./test.sh
738 blits took 3862 ms (191 fps).
730 blits took 3862 ms (189 fps).

this has no acceleration

% env SDL_ALTIVEC_BLIT_FEATURES=0 ./test.sh
280 blits took 3862 ms (72 fps).
289 blits took 3841 ms (75 fps).

hw.cputype: 18
hw.cpusubtype: 11
hw.cachelinesize: 32
hw.l1icachesize: 32768
hw.l1dcachesize: 32768
hw.l2cachesize: 262144
hw.l3cachesize: 1048576

Looking up the hw.cpu* values in <mach/machine.h>, this is a:
CPU_TYPE_POWERPC : CPU_SUBTYPE_POWERPC_7450

-bob

On Feb 19, 2005, at 7:39 AM, Ryan C. Gordon wrote:

Then how did the configure.in possibly turn on Altivec? Did you patch
the makefile or configure script directly?

Yeah, I did. I guess I should have tested that. :slight_smile:

–ryan.

I’ve revised the patch again
http://redivi.com/~bob/SDL-altivec-swizzle-bob-2.diff.

After trying it on Blob Wars, I saw that the original calc_swizzle32
wasn’t implemented correctly, so I rewrote it. I also noticed that a
lot of the memory shuffling that it does might not be necessary on OS
X, but since it’s so much faster as-is I’m not going to prematurely
optimize that.

It still has some issues, for example, the configure test that checks
to see whether altivec code will compile is probably going to fail on a
G3 because main() will have vector stuff in it. I can’t really test
that, so I won’t fix it. It’s also not tested anywhere but OS X, so it
probably won’t work elsewhere unless someone with another PPC platform
is willing to step up and test.

Now, off to do some more profiling and perhaps implement the 16bit
blitters :)

-bob

On Feb 18, 2005, at 11:14 PM, Bob Ippolito wrote:

Updated again http://redivi.com/~bob/SDL-altivec-swizzle-bob-3.diff.

This patch adds an accelerated source alpha blit and uses the 32->32
swizzle more often (when alpha is used in the dest; it was NO_COPY only
before). Per-pixel alpha is pretty trivial too; I’ll implement that as
soon as I can find an app to test it with. The two blits we’ve
accelerated so far seem to hit the hot spots in Blob Wars.

-bobOn Feb 19, 2005, at 11:58 PM, Bob Ippolito wrote:

On Feb 18, 2005, at 11:14 PM, Bob Ippolito wrote:

On Feb 15, 2005, at 7:03, Ryan C. Gordon wrote:

Here’s another revision of my Altivec blitter patch. This one
cleans up a lot of the FIXMEs and generalizes the 32bit-to-32bit
swizzler to be able to convert between any 8888 format generically.
It obsoletes the previous patch, and can be applied directly to
CVS.

I’ve also added a new program to the test directory named
testblitspeed.c; it’s got about a million options, but to get an
idea of how the Altivec code path did against the standard C
version, I ran this:

./testblitspeed --dstbpp 32 --dstwidth 640 --dstheight 480 --srcbpp 32
--srcwidth 640 --srcheight 480 --seconds 10 --dstrmask 0x00FF0000
--dstgmask 0x0000FF00 --dstbmask 0x000000FF --dstamask 0x00000000
--srcrmask 0x000000FF --srcgmask 0x00FF0000 --srcbmask 0x0000FF00
--srcamask 0x00000000

The Altivec code was more than 3 times faster than the C codepath
in the above test.

testblitspeed is in CVS, the Altivec patch is attached. I’d like to
hear from PowerPC users that aren’t MacOS-based to make sure this
compiles cleanly and functions elsewhere.

Ideally, we’d get 32->16 (or more importantly, 16->32) Altivec
blitters in here to complete the speed boost for the rest of the
feasible scenarios, but I have no plans to do these.

Here’s my revised version of the patch
http://redivi.com/~bob/SDL-altivec-swizzle-bob-1.diff.

It’s largely just a cleanup, but provides a marginal amount of extra
functionality. Only tested on Mac OS X 10.3:

  • I made sure the configure.in checks to see that the syntax
    extension you’re using compiles.
  • It no longer tries to execute the code (in case you’re compiling
    altivec support on a G3 for some reason)
  • It checks for altivec on darwin (I assume you were testing from
    Xcode?)
  • Checks a sysctl to see if it should use prefetch or not (if L3
    cache present, or not OS X, it uses prefetch – optimal for G4)
  • Instead of changing the size of the blit function table I changed
    one of the fields to be a bitflag rather than a bool… right now MMX
    is 1, Altivec is 2, and don’t-use-prefetch is 4.
  • prefetch and no-prefetch 32-32 blits are separate functions (could
    be the same function with userdata I guess).
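The L3-cache probe might look roughly like this (a sketch assuming Darwin’s "hw.l3cachesize" sysctl name; not the patch’s actual code):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: use prefetch only when an L3 cache is present (optimal for
   G4), or when we're not on OS X at all. Assumes the "hw.l3cachesize"
   sysctl; the real patch may differ. */
#ifdef __APPLE__
#include <sys/sysctl.h>
static int use_prefetch(void)
{
    uint64_t size = 0;
    size_t len = sizeof(size);
    if (sysctlbyname("hw.l3cachesize", &size, &len, NULL, 0) != 0)
        return 0;  /* sysctl unavailable: play it safe, no prefetch */
    return size > 0;
}
#else
static int use_prefetch(void) { return 1; }  /* not OS X: use prefetch */
#endif
```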

Using the same test as above, I was able to reproduce the 3x speed
bump on a dual 2GHz G5 (with the second CPU disabled because it’s
broken, argh).

I’m going to profile some real-world SDL games (specifically the
ones that I sort-of-maintain OS X ports for) to see which of the
other blit functions I should vectorize, if any.


Another update http://redivi.com/~bob/SDL-altivec-swizzle-bob-4.diff.

This adds per-pixel alpha if the source and destination are 32 bits,
with a special case for (A)RGB -> ARGB. Frozen Bubble and Super Tux
benefit from these optimizations.
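For reference, the per-channel math a per-pixel alpha blit performs is roughly the usual >>8 approximation (a scalar sketch, not the patch’s vector code):

```c
#include <stdint.h>

/* Scalar sketch of a per-pixel alpha blend for one 8-bit channel:
   dst = dst + (src - dst) * alpha / 256 (the common >>8 shortcut).
   The Altivec path would do this for many channels per instruction. */
static uint8_t blend_channel(uint8_t s, uint8_t d, uint8_t a)
{
    return (uint8_t)(d + (((s - d) * a) >> 8));
}
```

Note the >>8 divides by 256 rather than 255, so a fully opaque blend lands one count short of the source value; that off-by-one is the usual trade for avoiding a divide.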

I’m going to take a look at key blits when source and dest are 32 bits;
Solar Wolf will benefit from that. I haven’t seen 16bit surfaces used
in any of the apps I’ve tested, so I don’t think I’m going to bother
with that after all. If there is an app that uses 1555 16bit surfaces
<-> 32bit, then it would really benefit from Altivec since it has
instructions specifically for doing that… but it’s not worth
implementing if nothing uses it.

-bob

I think I’m done with the Altivec speedups for now:
http://redivi.com/~bob/SDL-altivec-swizzle-bob-5.diff.

This adds 32bit->32bit color key blit acceleration.
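A scalar sketch of what a 32->32 color-key blit does (illustrative only; the Altivec version can do the masked compare and select four pixels at a time with vec_cmpeq and vec_sel):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy only the pixels whose masked RGB value differs from the key;
   keyed pixels leave the destination untouched. */
static void key_blit32(const uint32_t *src, uint32_t *dst, size_t n,
                       uint32_t key, uint32_t rgbmask)
{
    for (size_t i = 0; i < n; i++)
        if ((src[i] & rgbmask) != key)
            dst[i] = src[i];
}
```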

-bob

My app uses 16 bit surfaces exclusively [1] (though it could switch to
32 bit at the cost of memory bandwidth, as I run HQ2X-like scalers on
surfaces before blitting them to screen); another popular app that is
16 bit only is ScummVM.

Fred

[1] fuse-emulator.sourceforge.net

In this case, he’s looking for 16-bit apps that use a specific (and, in
my opinion, rare) format: 1555, which Altivec has special opcodes to
support. Most things that require a 16-bit surface use 565, as far as I
can tell.
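For context: the special support is AltiVec’s vec_packpx (vpkpx), which packs 32-bit 8888 pixels down to 16-bit 1555 in a single instruction. A scalar sketch of such a packing (taking the alpha MSB here; vec_packpx’s exact alpha-bit selection may differ):

```c
#include <stdint.h>

/* Pack one 0xAARRGGBB pixel into 1555: one alpha bit (the MSB here)
   and the top five bits of each color channel. */
static uint16_t pack_1555(uint32_t argb)
{
    return (uint16_t)(((argb >> 16) & 0x8000u)   /* A: bit 31 -> bit 15 */
                    | ((argb >> 9)  & 0x7C00u)   /* R: bits 23-19 -> 14-10 */
                    | ((argb >> 6)  & 0x03E0u)   /* G: bits 15-11 -> 9-5 */
                    | ((argb >> 3)  & 0x001Fu)); /* B: bits 7-3 -> 4-0 */
}
```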

–ryan.