[PATCH] Altivec blitters

I think I’m done with the Altivec speedups for now:
http://redivi.com/~bob/SDL-altivec-swizzle-bob-5.diff.

Thanks for all your hard work on this, Bob!

–ryan.

Apps that use 16-bit OpenGL textures, and modify them while they’re still in
the loaded-into-an-SDL_Surface form, would benefit from that, I think. If the
hardware supports it, why not support it anyhow?

On Sunday 20 February 2005 08:40 am, Bob Ippolito wrote:

I’m going to take a look at key blits when source and dest are 32 bits,
Solar Wolf will benefit from that. I haven’t seen 16bit surfaces used
in any of the apps I’ve tested, so I don’t think I’m going to bother
with that after all. If there is an app that uses 1555 16bit surfaces
<-> 32bit, then it would really benefit from Altivec since it has
instructions specifically for doing that… but it’s not worth
implementing if nothing uses it.
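
For readers unfamiliar with the 1555 format mentioned here, the conversion can be sketched in scalar C as below. This is only an illustration: the function names and the ARGB bit layouts are assumptions, not code from SDL or from the patch. AltiVec does have pixel pack/unpack operations (`vec_packpx`, `vec_unpackh`/`vec_unpackl`) that handle several such pixels per instruction, which is presumably what "instructions specifically for doing that" refers to.

```c
#include <stdint.h>

/* Expand one ARGB1555 pixel to ARGB8888 (both layouts are assumed). */
static uint32_t argb1555_to_argb8888(uint16_t p)
{
    uint32_t a = (p & 0x8000u) ? 0xFFu : 0x00u;   /* 1-bit alpha -> 0 or 255 */
    uint32_t r = (p >> 10) & 0x1Fu;
    uint32_t g = (p >> 5) & 0x1Fu;
    uint32_t b = p & 0x1Fu;

    /* Replicate the top bits into the low bits so that 0x1F maps to 0xFF. */
    r = (r << 3) | (r >> 2);
    g = (g << 3) | (g >> 2);
    b = (b << 3) | (b >> 2);

    return (a << 24) | (r << 16) | (g << 8) | b;
}

/* Pack one ARGB8888 pixel down to ARGB1555 by keeping the top bits. */
static uint16_t argb8888_to_argb1555(uint32_t p)
{
    uint32_t a = (p >> 31) & 0x01u;
    uint32_t r = (p >> 19) & 0x1Fu;
    uint32_t g = (p >> 11) & 0x1Fu;
    uint32_t b = (p >> 3) & 0x1Fu;

    return (uint16_t)((a << 15) | (r << 10) | (g << 5) | b);
}
```

With these definitions a 1555 pixel survives a round trip through 8888 unchanged, which makes a handy sanity check for any vectorized version.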


Patrick “Diablo-D3” McFarland || pmcfarland at downeast.net
"Computer games don’t affect kids; I mean if Pac-Man affected us as kids, we’d
all be running around in darkened rooms, munching magic pills and listening to
repetitive electronic music." – Kristian Wilson, Nintendo, Inc, 1989

Because writing SIMD code isn’t really that easy if you don’t do it
regularly, and I only have so much free time I’m willing to spend on
this.

-bob

On Feb 21, 2005, at 12:46 AM, Patrick McFarland wrote:

On Sunday 20 February 2005 08:40 am, Bob Ippolito wrote:

Because writing SIMD code isn’t really that easy if you don’t do it
regularly, and I only have so much free time I’m willing to spend on
this.

I’d say this is a fairly rare practice; most GL games don’t do any
serious texture manipulation on the CPU, and if they do, it’ll be
startup overhead rather than frame-by-frame expense. If someone does the
work, that’s great, but if your time is limited, I would say your effort
is better spent elsewhere.

The next question is: what needs to be done still before this is
reasonable to commit to CVS? It sounds like this patch is getting pretty
solid. As far as I know the TODO list is:

  1. Make sure it compiles on a G3.
  2. Make sure it compiles on Linux/ppc (or something not MacOSX)
  3. Maybe break the blitters off into a separate file (since it’s getting
    pretty big for an ifdef, imho).

Obviously, this isn’t your responsibility to fix, just looking for an
idea about what’s left to do.

–ryan.

I had originally implemented (3) as a separate header file, but
modifying existing files turned out to be easier in that I didn’t have
to add source files for autoconf/Xcode and generating diffs is just
easier like this. Also, if it is refactored in that way, the
MMX/3DNow/etc. should also be split off into separate files.

There are some more things I want to do to the alpha blitters (more
SIMD color keying stuff), but I may not have time to work on that for a
week or two. What would really help me is if someone could show me how
to benchmark/test all these different blits (may require some code)…
it didn’t seem like the testblitter script had options to do what I
needed to do, so I was testing this stuff (for correctness) against
real applications. None of the applications I tried were any good for
benchmarking purposes.

  1. The surface alpha and per-pixel alpha blending SIMD algorithms I
    wrote actually do correct blending (scaled by 1/255, not 1/256)…
    except on the edges where the scalar algorithm is used. If someone
    could go clean up the scalar implementations that it uses to also use
    this higher precision algorithm, that’d be great, because it might
    cause some banding artifacts as-is.
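
To illustrate the 1/255 versus 1/256 difference, here is a minimal scalar sketch. It is not taken from the patch or from SDL's scalar blitters; the function names are hypothetical, and the add-128-and-fold trick is just one well-known way to get a correctly rounded divide by 255 without an actual division.

```c
#include <stdint.h>

/* Exact round(x / 255) for 0 <= x <= 255 * 255, without a divide. */
static uint8_t div255_round(uint32_t x)
{
    uint32_t t = x + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Blend scaled by 1/255: alpha 255 returns src exactly. */
static uint8_t blend_255(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return div255_round((uint32_t)src * alpha +
                        (uint32_t)dst * (255u - alpha));
}

/* Blend scaled by 1/256: a plain shift, cheaper but slightly lossy. */
static uint8_t blend_256(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return (uint8_t)(((uint32_t)src * alpha +
                      (uint32_t)dst * (256u - alpha)) >> 8);
}
```

In this sketch, blending a fully opaque white source (255, alpha 255) over black gives 255 with `blend_255` but only 254 with `blend_256`. Where one blit mixes the two paths, as on the scalar edges of a vectorized row, that off-by-one is the kind of mismatch that could show up as a visible seam.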

-bob

On Feb 21, 2005, at 12:44 PM, Ryan C. Gordon wrote:

I had originally implemented (3) as a separate header file, but
modifying existing files turned out to be easier in that I didn’t have
to add source files for autoconf/Xcode and generating diffs is just
easier like this. Also, if it is refactored in that way, the
MMX/3DNow/etc. should also be split off into separate files.

Eh, too much trouble, we’ll leave that as-is. :)

There are some more things I want to do to the alpha blitters (more
SIMD color keying stuff), but I may not have time to work on that for a
week or two. What would really help me is if someone could show me how
to benchmark/test all these different blits (may require some code)…
it didn’t seem like the testblitter script had options to do what I
needed to do, so I was testing this stuff (for correctness) against
real applications. None of the applications I tried were any good for
benchmarking purposes.

Tell me exactly what you want testblitspeed.c to do and I’ll implement
it. I just used it to say “load the bitmap in this format, make the
destination surface in this format, and blit as fast as you can”. The
--screen option slowed it down, but let you see if the blit code was
totally broken. For what I was implementing, this was good enough, but I
never tried it with, say, a 16 bit surface or alpha blending.

–ryan.

I don’t actually know the SDL API very well, so I’m not sure which
surface/blit flags need to be set… I’ve only used SDL surfaces myself
via pygame. Other than that, I’ve only ported other people’s open
source software that uses SDL and contributed some OS X specific
patches to SDL and pygame (basically at the Objective-C/Quartz level,
so it had little to do with surfaces, etc.).

I need code that triggers the blits of the following varieties. I
haven’t really read your test code and it didn’t seem to have any help
so I don’t know if it can do any of these beyond the obvious
32bit-32bit that you demonstrated in a previous email (I included it in
the list for completeness).

  • Per pixel 32bit-32bit alpha blit (Blit32to32PixelAlphaAltivec)
  • Per pixel ARGB888-(A)RGB888 alpha blit (BlitRGBtoRGBPixelAlphaAltivec)
  • Surface 32bit-32bit alpha blit (Blit32to32SurfaceAlphaAltivec)
  • Surface RGB888-(A)RGB888 alpha blit (BlitRGBtoRGBSurfaceAlphaAltivec)
  • 32bit-32bit blit (ConvertAltivec32to32_noprefetch /
    ConvertAltivec32to32_prefetch)
  • 32bit-32bit key color blit (Blit32to32KeyAltivec)
  • 32bit-32bit key color surface alpha blit
    (Blit32to32SurfaceAlphaKeyAltivec)
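
Independently of how these blits get triggered through SDL, a plain scalar reference for one of the cases above can serve as a correctness oracle when comparing against SIMD output. The sketch below covers only the 32bit-32bit key color case. The function name, the `rgb_mask` parameter and the pitch-in-pixels convention are all assumptions made for illustration; none of this is SDL API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Reference 32-bit -> 32-bit colour-key blit (hypothetical helper).
 * Pixels whose masked value equals the masked key are skipped, all
 * others are copied. rgb_mask selects the bits that take part in the
 * comparison, e.g. 0x00FFFFFF to ignore an alpha byte. Pitches are
 * given in pixels rather than bytes to keep the sketch short.
 */
static void ref_blit32_key(const uint32_t *src, size_t src_pitch,
                           uint32_t *dst, size_t dst_pitch,
                           size_t w, size_t h,
                           uint32_t key, uint32_t rgb_mask)
{
    for (size_t y = 0; y < h; ++y) {
        for (size_t x = 0; x < w; ++x) {
            uint32_t p = src[y * src_pitch + x];
            if ((p & rgb_mask) != (key & rgb_mask)) {
                dst[y * dst_pitch + x] = p;
            }
        }
    }
}
```

Running the same source and destination through this loop and through the optimized blitter, then comparing the destination buffers, gives a pass/fail answer that doesn't depend on eyeballing a real application.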

I think that one or two of these may not be in the last diff I posted
because they’re not finished.

I especially want to be able to benchmark these blits so I can see if
it’s worthwhile to special-case (A)RGB-(A)RGB… without being able to
get a good look at them in action, I can’t say whether it’s worthwhile
to have all that copied code lying around… the Altivec unit might just
be sitting there hungry for memory and be perfectly capable of a couple
extra permutes even in the trivial case without a slowdown.

-bob

On Feb 21, 2005, at 6:35 PM, Ryan C. Gordon wrote:

Bob Ippolito wrote:

I need code that triggers the blits of the following varieties. (…)

Maybe this will be useful: http://delirare.com/files/sdlbench.c.gz

It’s just a small tool I threw together in a hurry a year ago or so, so
it may have some bugs, but it seems to work =). It takes an image,
creates a surface with identical width, height and masks, does a single
blit to possibly cache data, and then does repeated offscreen software
blits of the image to the cloned surface. This is repeated for normal,
rle, colorkey and colorkey+rle with surface alpha 0, 0x80 and 0x7f, plus
rgba and rgba+rle (you’ll get times for all of these in the output, of
course).
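
The measuring pattern described above (one untimed warm-up blit, then a timed loop) looks roughly like the sketch below. This is not sdlbench's source, which isn't reproduced in this thread. The callback type, the `copy_job` struct and the use of `clock()` are assumptions for illustration; a real SDL version would call `SDL_BlitSurface` in the callback and would more likely time with `SDL_GetTicks`.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical callback type: performs one blit using ctx. */
typedef void (*blit_fn)(void *ctx);

/* Run fn once untimed to warm the caches, then time `iterations`
   calls. Returns the elapsed CPU time in milliseconds. */
static double time_blits(blit_fn fn, void *ctx, int iterations)
{
    fn(ctx);                               /* warm-up, not timed */

    clock_t start = clock();
    for (int i = 0; i < iterations; ++i) {
        fn(ctx);
    }
    clock_t end = clock();

    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC;
}

/* Stand-in "blit": a straight memory copy that counts its calls. */
struct copy_job {
    const void *src;
    void *dst;
    size_t bytes;
    unsigned long calls;
};

static void copy_blit(void *ctx)
{
    struct copy_job *job = ctx;
    memcpy(job->dst, job->src, job->bytes);
    job->calls++;
}
```

Calling `time_blits(copy_blit, &job, 100)` performs 101 copies in total, the first of which is excluded from the reported time.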

The bit depth tested depends on the input image(s), so you can test
every depth SDL supports. Not all tests make sense for all bitdepths
(e.g., the rgba tests mean nothing if the surface doesn’t have alpha).

It doesn’t do blits between different bit depths though, or RGB->ARGB or
ARGB->RGB.

To compile:
gcc -O2 `sdl-config --cflags --libs` -lSDL_image sdlbench.c -o sdlbench

- Gerry

This tool is great! I haven’t profiled it yet, so I’m not sure what
kind of code coverage I’m getting, but the Altivec enhancements that
Ryan and I have done are clearly a HUGE win. Take a look!

% env SDL_DISABLE_CPUFEATURES=1 SDL_ALTIVEC_BLIT_FEATURES=0 ./sdlbench 100 ./black817-480x360-3.5.png
./black817-480x360-3.5.png (480x360, 32-bit) x 100:
   normal      rle     ckey ckey+rle
[  819ms][  667ms][  772ms][  691ms] surface alpha 0xff
[  808ms][  899ms][  806ms][  894ms] surface alpha 0x80
[  925ms][ 1054ms][  802ms][  905ms] surface alpha 0x7f
     rgba rgba+rle
[  808ms][  836ms]

% env ./sdlbench 100 ./black817-480x360-3.5.png
./black817-480x360-3.5.png (480x360, 32-bit) x 100:
   normal      rle     ckey ckey+rle
[  205ms][  172ms][  261ms][  261ms] surface alpha 0xff
[  398ms][  398ms][  351ms][  346ms] surface alpha 0x80
[  367ms][  367ms][  420ms][  351ms] surface alpha 0x7f
     rgba rgba+rle
[  371ms][  351ms]

One thing to note is that the Altivec versions (if they’re being used)
do accurate 1/255 alpha blending, where the scalar versions do a
slightly lossy 1/256.

These timings are on my 1 GHz G4 PowerBook, with a Deployment framework
build of SDL (-O3 -mtune=G4, etc.).

The image is
http://www.libpng.org/pub/png/img_png/black817-480x360-3.5.png.

These include some changes that were not in the last diff… I imported
SDL 1.2.8 into my public svn repository
http://svn.red-bean.com/bob/SDL-altivec/trunk/ and I’ve been working
off that.

-bob

On Feb 22, 2005, at 7:47 AM, Gerry wrote:

It seems that this benchmark is only exercising:
Blit32to32PixelAlphaAltivec
Blit32to32KeyAltivec
ConvertAltivec32to32_prefetch (since this is a G4)

Which leaves the following two flavors (four implementations):
Blit32to32SurfaceAlphaKeyAltivec
Blit32to32SurfaceAlphaAltivec
BlitRGBtoRGBPixelAlphaAltivec
BlitRGBtoRGBSurfaceAlphaAltivec

The two RGB functions I don’t care so much about, they’re probably not
a big win anyway since vec_perm isn’t all that expensive. The pixel
alpha one maybe, since there are likely to be more cycles of
computation than load stalls going on, but it’s probably not that much
different than the generic 32->32.

Using the benchmark tool with jpeg images ended up with 3 bytes per
pixel source and destination surfaces, which are not optimized at all
(I guess they could be, but it would be ugly).

-bob

On Feb 24, 2005, at 2:53 AM, Bob Ippolito wrote:

On Feb 22, 2005, at 7:47 AM, Gerry wrote:

Here are 2 GHz G5 timings (which, at a glance, seem to be ~3-4 times
faster with the same Altivec optimizations, a little more of an
improvement than on the G4):

% env SDL_DISABLE_CPUFEATURES=1 SDL_ALTIVEC_BLIT_FEATURES=0 ./sdlbench 100 ./black817-480x360-3.5.png
./black817-480x360-3.5.png (480x360, 32-bit) x 100:
   normal      rle     ckey ckey+rle
[  228ms][  227ms][  237ms][  234ms] surface alpha 0xff
[  310ms][  309ms][  309ms][  311ms] surface alpha 0x80
[  317ms][  322ms][  311ms][  313ms] surface alpha 0x7f
     rgba rgba+rle
[  311ms][  309ms]

% env ./sdlbench 100 ./black817-480x360-3.5.png
./black817-480x360-3.5.png (480x360, 32-bit) x 100:
   normal      rle     ckey ckey+rle
[   69ms][   69ms][   76ms][   77ms] surface alpha 0xff
[   88ms][   82ms][   81ms][   80ms] surface alpha 0x80
[   82ms][   81ms][   82ms][   80ms] surface alpha 0x7f
     rgba rgba+rle
[   83ms][   80ms]
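
To put numbers on the "~3-4 times" estimate, the per-test speedup is just the time with the CPU features disabled divided by the time with Altivec enabled. The sketch below hard-codes the timings from the four sdlbench runs posted in this thread; the array and function names are made up for illustration.

```c
#include <stddef.h>

/* Timings in ms copied from the sdlbench runs in this thread, in
   reading order: three surface-alpha rows of four, then rgba and
   rgba+rle. "scalar" means the run with the CPU features disabled. */
static const int g4_scalar[14]  = {819, 667, 772, 691, 808, 899, 806,
                                   894, 925, 1054, 802, 905, 808, 836};
static const int g4_altivec[14] = {205, 172, 261, 261, 398, 398, 351,
                                   346, 367, 367, 420, 351, 371, 351};
static const int g5_scalar[14]  = {228, 227, 237, 234, 310, 309, 309,
                                   311, 317, 322, 311, 313, 311, 309};
static const int g5_altivec[14] = {69, 69, 76, 77, 88, 82, 81,
                                   80, 82, 81, 82, 80, 83, 80};

/* Find the smallest and largest speedup across n paired timings. */
static void speedup_range(const int *scalar, const int *simd, size_t n,
                          double *lo, double *hi)
{
    *lo = *hi = (double)scalar[0] / (double)simd[0];
    for (size_t i = 1; i < n; ++i) {
        double s = (double)scalar[i] / (double)simd[i];
        if (s < *lo) *lo = s;
        if (s > *hi) *hi = s;
    }
}
```

On these figures the ratios run from about 1.9x to 4.0x on the G4 and from about 3.0x to 4.0x on the G5. The G5 results are more uniform across the test types.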

-bob

On Feb 24, 2005, at 2:53 AM, Bob Ippolito wrote:

On Feb 22, 2005, at 7:47 AM, Gerry wrote:

Bob Ippolito wrote:
ConvertAltivec32to32_prefetch (since this is a G4)

Which leaves the following two flavors (four implementations):
Blit32to32SurfaceAlphaKeyAltivec
Blit32to32SurfaceAlphaAltivec

Can Altivec also be used to optimize 16(BE/LE)<->16(LE/BE) and (more
importantly) 16->15 (with and without alpha or colorkey)?

I don’t know the Altivec hardware…

Many SDL games use a 16-bit pixel depth, which has a few performance
problems on OS X, since most Macs don’t offer a 16-bit fullscreen mode,
only 15-bit or 32-bit.

In those situations the game runs faster windowed than fullscreen
(16->32 is faster than 16->15)!

Bye,
Gabry
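
For reference, the two 16-bit conversions asked about above are simple per pixel in scalar code. The sketch below assumes the usual RGB565 and RGB555 bit layouts; the function names are hypothetical and this is not SDL's implementation. A vectorized version would apply the same bit operations to eight pixels per 128-bit register.

```c
#include <stdint.h>

/* RGB565 -> RGB555: red and the top five green bits shift down by
   one, the least significant green bit is dropped, blue stays put. */
static uint16_t rgb565_to_rgb555(uint16_t p)
{
    return (uint16_t)(((p >> 1) & 0x7FE0u) | (p & 0x001Fu));
}

/* Swap the two bytes of a 16-bit pixel (big-endian <-> little-endian). */
static uint16_t swap16(uint16_t p)
{
    return (uint16_t)((p << 8) | (p >> 8));
}
```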

Yes, Altivec can do damn near anything when you have lots of data.

If you want 16/15 bit acceleration, then someone is going to have to
provide me with some examples of software that uses 16 bit surfaces,
because I haven’t seen any.

-bob

On Feb 24, 2005, at 12:38 PM, Gabriele Greco wrote: