Opteron MMX patches for SDL_blit.c and SDL_blit_A.c

The inline MMX assembly in SDL_blit.c and SDL_blit_A.c compiles and runs fine
unmodified on an AMD Opteron. Unfortunately, the inline assembly in
SDL_yuv_mmx.c and SDL_blit_N.c isn’t directly compatible.

I’ve included diffs for SDL_blit.c and SDL_blit_A.c that allow the MMX
assembly to be compiled when USE_ASMBLIT, __x86_64__, and __GNUC__ are all
defined. All I had to modify were typedefs; the inline assembly itself wasn’t
touched.
Attachments:
SDL_blit.diff (text/x-diff, 487 bytes)
http://lists.libsdl.org/pipermail/sdl-libsdl.org/attachments/20040330/8e753a6a/attachment.diff
SDL_blit_A.diff (text/x-diff, 1326 bytes)
http://lists.libsdl.org/pipermail/sdl-libsdl.org/attachments/20040330/8e753a6a/attachment-0001.diff
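
The attached diffs are not reproduced in this archive, so the following is only a rough sketch of what a compile-time guard of the kind described above could look like. The macro name EXAMPLE_MMX_BLIT and the typedef example_mmx_word are hypothetical and are not taken from the SDL sources; only USE_ASMBLIT, __x86_64__ and __GNUC__ come from the message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guard: enable the inline MMX path on 32-bit and 64-bit x86
 * when building with GCC and USE_ASMBLIT is defined. */
#if defined(USE_ASMBLIT) && defined(__GNUC__) && \
    (defined(__i386__) || defined(__x86_64__))
#define EXAMPLE_MMX_BLIT 1
#else
#define EXAMPLE_MMX_BLIT 0
#endif

/* An MMX register is 64 bits wide whatever the pointer width is, so a
 * fixed-width typedef describes it on both targets (hypothetical name). */
typedef uint64_t example_mmx_word;

/* Returns 1 when the MMX path would be compiled in, 0 otherwise. */
static int example_mmx_blit_enabled(void)
{
    assert(sizeof(example_mmx_word) == 8);
    return EXAMPLE_MMX_BLIT;
}
```

Built without -DUSE_ASMBLIT, example_mmx_blit_enabled() returns 0 and the plain C path would be used.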

Tyler Montbriand wrote:

> The inline MMX assembly in SDL_blit.c and SDL_blit_A.c compiles and runs fine
> unmodified under AMD Opteron. The inline assembly in SDL_yuv_mmx.c and
> SDL_blit_N.c unfortunately isn’t directly compatible.
>
> I’ve included diffs from SDL_blit.c and SDL_blit_A.c that allow the MMX
> assembly to be compiled when USE_ASMBLIT, __x86_64__, and __GNUC__ are all
> defined. All I had to modify was typedefs, the inline assembly itself wasn’t
> touched.

Interesting.

Now, for some questions:
Do the MMX sound mixing routines in src/audio/SDL_mixer_mmx.c work?
Do the MMX blitting routines in src/video/SDL_RLEaccel.c work?

I can see why the NASM code wouldn’t work on an Opteron, but I can’t see
why the inline asm in SDL_yuv_mmx.c doesn’t work.
Do you have any idea why these two don’t work?
I’m quite busy until the end of next week, but I’m interested in
debugging it after that. Would you be interested in helping?

Stephane

Tyler Montbriand wrote:

> The inline MMX assembly in SDL_blit.c and SDL_blit_A.c compiles and runs fine
> unmodified under AMD Opteron. The inline assembly in SDL_yuv_mmx.c and
> SDL_blit_N.c unfortunately isn’t directly compatible.

Just a comment here:

Does SDL_memcpy[MMX|SSE] actually outperform glibc’s memcpy()? Generally
a given platform’s memcpy is way more complicated than these simple
two-instruction loops, and is much more highly tuned. Can someone who is
sufficiently bored take some time to benchmark these? I’d be inclined to
remove them outright if they are slower or roughly equivalent to glibc’s
memcpy().

–ryan.

> Can someone who is sufficiently bored take some time to benchmark
> these? I’d be inclined to remove them outright if they are slower or
> roughly equivalent to glibc’s memcpy().

Ok, apparently I’m sufficiently bored, because I benchmarked it.

The short of it: the MMX/SSE versions are staying.

With glibc 2.3.2, which happens to be what’s on my system at the moment,
SDL’s SSE version of memcpy() is significantly faster, and the MMX
version is slightly faster.

Depending on how your glibc is built, the difference might be REALLY big (if
glibc is built for a stock 386, its memcpy() is just a rep/movsb block).
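
For readers unfamiliar with the instruction, a byte-wise "rep movsb" copy can be sketched with GCC inline assembly roughly as follows. This is an illustration only and is not glibc's actual source; the function name copy_rep_movsb is made up for this example, and the code falls back to memcpy() on non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-wise copy of n bytes from src to dst; returns dst like memcpy(). */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    assert(dst != NULL && src != NULL);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "rep movsb" copies (r/e)cx bytes from [(r/e)si] to [(r/e)di].
     * The "+D", "+S" and "+c" constraints bind dst, src and n to those
     * registers and tell the compiler that the instruction updates them. */
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
#else
    memcpy(dst, src, n);
#endif
    return ret;
}
```

Because the register constraints are sized by the compiler, the same source builds for both 32-bit and 64-bit x86.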

I haven’t benchmarked an Opteron in long (64-bit) mode… glibc has a
different memcpy implementation for that target, which appears to use MMX
and might be faster.

I think I’m fairly bothered by this…a lot of code counts on a
platform’s memcpy() to be as fast as anything a programmer could hope to
roll on her own, and I don’t think that’s an unreasonable expectation.

–ryan.

Ryan C. Gordon wrote:

> Just a comment here:
>
> Does SDL_memcpy[MMX|SSE] actually outperform glibc’s memcpy()? Generally
> a given platform’s memcpy is way more complicated than these simple
> two-instruction loops, and is much more highly tuned. Can someone who is
> sufficiently bored take some time to benchmark these? I’d be inclined to
> remove them outright if they are slower or roughly equivalent to glibc’s
> memcpy().

Yep, I did some benchmarking while writing those routines. The outcome
is that it depends on the size of the block being copied. libc memcpy is
fine for big block copies (it’s optimized for at least 4K); the same is true
for SDL_memcpySSE (which outperforms libc memcpy on my Athlon XP for
aligned blocks, BTW :)
But for small block copies (which is the real-life usage of these
functions in SDL), the SDL_memcpyMMX routine is faster than libc memcpy
or SDL_memcpySSE. I’ll have to benchmark that on different platforms,
but I think SDL_memcpySSE could be removed in favor of SDL_memcpyMMX.
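
Since the result depends on block size, any comparison needs to time each routine at several sizes. A minimal harness might look like the sketch below; it is not the benchmark actually used in this thread, and the names copy_fn and seconds_per_copy are invented for the example. A candidate routine would need a small wrapper with the memcpy()-style signature shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any routine with a memcpy()-style signature can be timed. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Average CPU seconds per call of fn() on a block of `size` bytes. */
static double seconds_per_copy(copy_fn fn, size_t size, long iterations)
{
    /* Calling through a volatile pointer keeps the compiler from
     * eliding or hoisting the copies out of the loop. */
    copy_fn volatile call = fn;
    unsigned char *src, *dst;
    clock_t start, stop;
    long i;

    assert(size > 0 && iterations > 0);
    src = malloc(size);
    dst = malloc(size);
    assert(src != NULL && dst != NULL);
    memset(src, 0xA5, size);

    start = clock();
    for (i = 0; i < iterations; i++)
        call(dst, src, size);
    stop = clock();

    assert(memcmp(dst, src, size) == 0); /* sanity check on the copy */
    free(src);
    free(dst);
    return (double)(stop - start) / CLOCKS_PER_SEC / (double)iterations;
}
```

Comparing, say, seconds_per_copy(memcpy, 64, 10000000) against the same call with size 4096 or larger would show the small-block versus big-block behaviour on a given machine. clock() has coarse resolution, so iteration counts need to be large for small blocks.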

I’ll look into that when I have some free time.
Well… I have been thinking about doing this for months, indeed :(

Stephane

On Wednesday 31 March 2004 05:06, Stephane Marchesin wrote:

> Do the mmx sound mixing routines in src/audio/SDL_mixer_mmx.c work ?
> Do the mmx blitting routines in src/video/SDL_RLEaccel.c work ?

I’ll try and see if they will compile later today.

> I can see why nasm code wouldn’t work on an opteron, however I can’t see
> why the inline asm in SDL_yuv_mmx.c doesn’t work.
> Do you have an idea why these two don’t work ?

It assumes that pointers are 32-bit. On an Opteron they are 64-bit, ergo,
trying to load a pointer value into %eax gives you an invalid operand size.
The equivalent 64-bit register under amd64 is %rax.

It should be possible to modify those blocks to support amd64, but I’m not
particularly looking forward to rewriting those scary, enormous ASM
blocks. :/
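
One way to avoid hard-coding %eax or %rax is to let the compiler choose the register through an operand constraint. The sketch below illustrates the idea; it is not code from SDL_yuv_mmx.c, and the function name pointer_via_register is made up. With the "r" constraint GCC picks a 32-bit register on i386 and a 64-bit one on amd64, so the same source assembles on both. On non-x86 targets the example falls back to plain C.

```c
#include <assert.h>
#include <stdint.h>

/* Moves a pointer through a general-purpose register and returns its value. */
static uintptr_t pointer_via_register(const void *p)
{
    uintptr_t out;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* %0 and %1 expand to registers of pointer width, so the assembler
     * infers the operand size; writing "movl %1, %%eax" here instead
     * would be rejected on amd64 because the source register is 64-bit. */
    __asm__("mov %1, %0" : "=r"(out) : "r"(p));
#else
    out = (uintptr_t)p;
#endif
    assert(sizeof(out) == sizeof(p));
    return out;
}
```

Blocks that name specific registers throughout, as large hand-written MMX routines tend to, cannot be fixed this mechanically, which is presumably why the rewrite looks unappealing.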

Thanks! I’ve applied them to CVS.

See ya,
-Sam Lantinga, Software Engineer, Blizzard Entertainment