[PATCH] Re: SDL_memcpy variants used in SDL_BlitCopy

Dmitry_Yakimov · September 13, 2005, 12:00pm

Hi,

Regarding MMX copying, AMD has a heavily optimized routine:
http://www.cs.virginia.edu/stream/FTP/Contrib/AMD/memcpy_amd.asm

There is also an Intel/AMD MMX routine, used in Linux sources:
http://grace-ist.org/horde/chora/co.php?r=1.1&f=xine-lib/src/xine-utils/memcpy.c&Horde=56b36958409aa348f1b989d45973dd9f

I suppose SDL should borrow ideas from http://grace-ist.org
and use separate AMD and Intel routines.

Here is a simple patch that unrolls the SDL_memcpyMMX loop:

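/* Copy len bytes from 'from' to 'to', 64 bytes per iteration via MMX.
   The MMX state is left dirty: emms must run before any x87 FPU code. */
static inline void SDL_memcpyMMX(char *to, char *from, int len)
{
    int i;

    for (i = 0; i < len / 64; i++) {
        __asm__ __volatile__ (
            /* load 64 bytes from the source into mm0..mm7 */
            "movq (%0), %%mm0\n"
            "movq 8(%0), %%mm1\n"
            "movq 16(%0), %%mm2\n"
            "movq 24(%0), %%mm3\n"
            "movq 32(%0), %%mm4\n"
            "movq 40(%0), %%mm5\n"
            "movq 48(%0), %%mm6\n"
            "movq 56(%0), %%mm7\n"
            /* store the same 64 bytes to the destination */
            "movq %%mm0, (%1)\n"
            "movq %%mm1, 8(%1)\n"
            "movq %%mm2, 16(%1)\n"
            "movq %%mm3, 24(%1)\n"
            "movq %%mm4, 32(%1)\n"
            "movq %%mm5, 40(%1)\n"
            "movq %%mm6, 48(%1)\n"
            "movq %%mm7, 56(%1)\n"
            : : "r" (from), "r" (to) : "memory");
        from += 64;
        to += 64;
    }
    /* copy the remaining 0..63 bytes with the plain routine */
    if (len & 63)
        SDL_memcpy(to, from, len & 63);
}

--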
Best regards,
Dmitry Yakimov, ISDEF member
ActiveKitten.com
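
For readers without an x86/MMX toolchain, the structure of the patch above (full 64-byte blocks first, then a tail of 0 to 63 bytes) can be sketched in portable C. This is only an illustration: `copy_blocks_then_tail` is a hypothetical name, the fixed-size `memcpy` stands in for the eight `movq` load/store pairs, and nothing here is SDL code or says anything about speed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical portable stand-in for the unrolled MMX loop (not SDL code). */
static void copy_blocks_then_tail(unsigned char *to, const unsigned char *from,
                                  size_t len)
{
    size_t blocks = len / 64;  /* number of full 64-byte blocks */
    size_t tail = len & 63;    /* leftover bytes, always 0..63 */
    size_t i;

    assert(to != NULL && from != NULL);

    for (i = 0; i < blocks; i++) {
        /* One fixed-size 64-byte copy per iteration, standing in for the
           eight movq loads and eight movq stores of the patch. */
        memcpy(to, from, 64);
        from += 64;
        to += 64;
    }
    /* Same fallback as the patch: the remainder goes to the plain copy. */
    if (tail)
        memcpy(to, from, tail);
}
```

Calling it with a 200-byte buffer runs the block loop three times (192 bytes) and hands the last 8 bytes to the tail copy.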

Stephane_Marchesin · September 13, 2005, 1:06pm

> Regarding MMX copying, AMD has a heavily optimized routine:
> http://www.cs.virginia.edu/stream/FTP/Contrib/AMD/memcpy_amd.asm
>
> There is also an Intel/AMD MMX routine, used in Linux sources:
> http://grace-ist.org/horde/chora/co.php?r=1.1&f=xine-lib/src/xine-utils/memcpy.c&Horde=56b36958409aa348f1b989d45973dd9f
>
> I suppose SDL should borrow ideas from http://grace-ist.org
> and use separate AMD and Intel routines.
>
> Here is a simple patch that unrolls the SDL_memcpyMMX loop:

Actually, for the kind of memcpy that SDL does (small blocks on average), the rolled version is faster than the unrolled one. The same goes for the “heavily optimized” memcpy routines you can find on the net: these are aimed at large blocks.

Stephane

Dmitry_Yakimov · September 13, 2005, 1:50pm

???,

> Actually, for the kind of memcpy that SDL does (small blocks on average), the rolled version is faster than the unrolled one. The same goes for the “heavily optimized” memcpy routines you can find on the net: these are aimed at large blocks.
>
> Stephane

As far as I can see in the SDL sources, SDL_memcpy is used in SDL_BlitCopy.
Then:

    /* Check for special "identity" case -- copy blit */
    if ( surface->map->identity && blit_index == 0 ) {
        surface->map->sw_data->blit = SDL_BlitCopy;

So in most cases the blocks will be huge: backgrounds, sprites and so on.
Please give more detail on the point that “the rolled version is faster than the
unrolled one”.

--
Best regards,
Dmitry Yakimov, ISDEF member
ActiveKitten.com

Stephane_Marchesin · September 13, 2005, 2:29pm

>> Actually, for the kind of memcpy that SDL does (small blocks on average), the rolled version is faster than the unrolled one. The same goes for the “heavily optimized” memcpy routines you can find on the net: these are aimed at large blocks.
>>
>> Stephane
>
> As far as I can see in the SDL sources, SDL_memcpy is used in SDL_BlitCopy.
> Then:
>
>     /* Check for special "identity" case -- copy blit */
>     if ( surface->map->identity && blit_index == 0 ) {
>         surface->map->sw_data->blit = SDL_BlitCopy;
>
> So in most cases the blocks will be huge: backgrounds, sprites and so on.
> Please give more detail on the point that “the rolled version is faster than the
> unrolled one”.

Long story: when I wrote that memcpy code, I initially had an unrolled version, and I benchmarked it against a rolled version and against the plain libc version. To my surprise, the rolled version was faster. The reason is that these highly optimized versions are made for page-sized (4096-byte) chunks, while the SDL_BlitCopy function copies one line at a time, so the block size we’re interested in is the average size of a line (for 100 pixels, that’s between 100 and 400 bytes depending on the pixel format). So, for SDL’s purposes, the rolled memcpy is the right compromise.

Heh, people suspect this code of being slow from time to time (slower than the unrolled version, slower than the libc version…), but I have yet to be proven wrong.

Stephane
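
To put rough numbers on the row sizes mentioned above, here is a small arithmetic sketch. It only counts how a row splits between the 64-byte unrolled loop posted earlier in the thread and its tail copy. The helper names are hypothetical, and this is not a benchmark: it does not show which version is faster, only how few full blocks a typical row contains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes in one row of `width` pixels. */
static size_t row_bytes(size_t width, size_t bytes_per_pixel)
{
    /* Pixel sizes discussed in the thread: 1 to 4 bytes per pixel. */
    assert(bytes_per_pixel >= 1 && bytes_per_pixel <= 4);
    return width * bytes_per_pixel;
}

/* Full iterations a 64-byte-per-step loop runs for a row of `len` bytes. */
static size_t unrolled_iterations(size_t len)
{
    return len / 64;
}

/* Bytes left over for the fallback copy after the 64-byte loop. */
static size_t tail_bytes(size_t len)
{
    return len % 64;
}
```

For a 100-pixel row this gives 1 full block plus 36 tail bytes at 1 byte per pixel, and 6 full blocks plus 16 tail bytes at 4 bytes per pixel, so a noticeable share of each row bypasses the unrolled loop entirely.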

Ricardo_Cruz · September 13, 2005, 4:45pm

Pardon me, but may I ask why SDL tries to roll its own version of
memcpy()?
I mean, wouldn’t you expect the authors of your system’s C library to have made
every possible optimization? If some systems don’t, maybe you should
send them a patch or, as a last resort, make your own memcpy() if they are closed
systems.

Cheers,
Ricardo

On Tuesday, 13 September 2005 15:29, Stephane Marchesin wrote:

>>> Actually, for the kind of memcpy that SDL does (small blocks on
>>> average), the rolled version is faster than the unrolled one. The same goes
>>> for the “heavily optimized” memcpy routines you can find on the net: these
>>> are aimed at large blocks.
>>>
>>> Stephane
>>
>> As far as I can see in the SDL sources, SDL_memcpy is used in SDL_BlitCopy.
>> Then:
>>
>>     /* Check for special "identity" case -- copy blit */
>>     if ( surface->map->identity && blit_index == 0 ) {
>>         surface->map->sw_data->blit = SDL_BlitCopy;
>>
>> So in most cases the blocks will be huge: backgrounds, sprites and so on.
>> Please give more detail on the point that “the rolled version is faster
>> than the unrolled one”.
>
> Long story: when I wrote that memcpy code, I initially had an unrolled
> version, and I benchmarked it against a rolled version and against the plain
> libc version. To my surprise, the rolled version was faster. The reason is that
> these highly optimized versions are made for page-sized (4096-byte)
> chunks, while the SDL_BlitCopy function copies one line at a time, so the
> block size we’re interested in is the average size of a line (for
> 100 pixels, that’s between 100 and 400 bytes depending on the pixel
> format). So, for SDL’s purposes, the rolled memcpy is the right
> compromise.
>
> Heh, people suspect this code of being slow from time to time (slower than the
> unrolled version, slower than the libc version…), but I have yet to be
> proven wrong.
>
> Stephane

--
Freedom is what you do with what’s been done to you.
-- Jean-Paul Sartre

Catatonic_Porpoise · September 13, 2005, 6:53pm

Ricardo Cruz wrote:

> Pardon me, but may I ask why SDL tries to roll its own version of
> memcpy()?
> I mean, wouldn’t you expect the authors of your system’s C library to have made
> every possible optimization?

Not meaning to be rude, but did you read the message you replied to? It
answers your question.

(“Because SDL copies small memory blocks, and the usual optimizations
are only good for large memory blocks.”)

Graue

Stephane_Marchesin · September 13, 2005, 8:16pm

Catatonic Porpoise wrote:

> Ricardo Cruz wrote:
>
>> Pardon me, but may I ask why SDL tries to roll its own version of
>> memcpy()?
>> I mean, wouldn’t you expect the authors of your system’s C library
>> to have made every possible optimization?
>
> Not meaning to be rude, but did you read the message you replied to? It
> answers your question.
>
> (“Because SDL copies small memory blocks, and the usual optimizations
> are only good for large memory blocks.”)

Well, there’s a second reason. Most distros are compiled for i386, i586
or i686, so their memcpy implementation is at most rep movsd, and MMX is not
used since it’s not part of i686 (remember that i686 includes the Pentium Pro,
which doesn’t have MMX). I’m not sure about source distros; these could
have an edge since you build your own libc, but for some reason
they don’t (not Gentoo, at least).

Also, choosing the right memcpy version at runtime is not an option, for
obvious performance reasons when doing small copies.

Stephane
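
The difference between what a binary was compiled for and what the CPU actually offers can be sketched as follows. This assumes GCC or Clang on x86: the `__MMX__` macro and the `__builtin_cpu_supports` builtin are compiler features, not something mentioned in this thread, and the function names are hypothetical. A libc built for a baseline without MMX would correspond to `built_with_mmx()` returning 0 even on a CPU that reports MMX.

```c
#include <assert.h>

/* 1 if the compiler was allowed to emit MMX for this translation unit
   (GCC and Clang define __MMX__ in that case), else 0. */
static int built_with_mmx(void)
{
#if defined(__MMX__)
    return 1;
#else
    return 0;
#endif
}

/* 1 if the running CPU reports MMX, 0 if it does not,
   -1 if this sketch cannot tell on the current compiler/target. */
static int cpu_reports_mmx(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("mmx") ? 1 : 0;
#else
    return -1;
#endif
}

/* Sanity check: code built with MMX enabled should only be running
   on a CPU that has it (or on a target where we cannot tell). */
static void check_baseline(void)
{
    if (built_with_mmx())
        assert(cpu_reports_mmx() != 0);
}
```

The runtime query is the kind of check that has a cost if repeated for every small copy, which is why where it is performed matters.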

icculus · September 13, 2005, 10:25pm

> I mean, wouldn’t you expect the authors of your system’s C library to have made
> every possible optimization? If some systems don’t, maybe you should
> send them a patch or, as a last resort, make your own memcpy() if they are closed
> systems.

We did make our own memcpy for closed systems.

Linux’s C library’s memcpy sucks badly (or at least did when I last
checked some time ago).

Apple’s C runtime, however, is hand-tuned for every CPU they support, so
you are less likely to beat it, and if you do, you’ll lose to it on some
other generation of chips.

I assume Microsoft’s is similar to Apple’s on all the popular chips, but
haven’t checked.

--ryan.

Alex_Volkov · September 13, 2005, 10:57pm

Stephane Marchesin wrote:

> Well, there’s a second reason. Most distros are compiled for i386, i586
> or i686, so their memcpy implementation is at most rep movsd, and MMX is not
> used since it’s not part of i686 (remember that i686 includes the Pentium Pro,
> which doesn’t have MMX). I’m not sure about source distros; these could
> have an edge since you build your own libc, but for some reason
> they don’t (not Gentoo, at least).

I am testing inlined SDL_memcpy variants against an inlined ‘rep movsd’ (not
libc/msvcrt memcpy()), so no function calls are involved.

I just tested blocks of different sizes, with some interesting results.
For short rows (0x100-0x200 bytes):
With srcskip=0, the MMX version is faster than the inlined ‘rep movsd’, and
the SSE version is horribly slow.
With srcskip>0x20, the MMX version is slower than the inlined ‘rep movsd’,
and the SSE version is faster.

The SSE version is much faster when row length and srcskip are 0x40-aligned.

For longer rows (above 0x300 bytes):
The MMX version performs about the same as the inlined ‘rep movsd’ or slower,
with minor variations caused by srcskip.
With srcskip=0, the SSE version is quite a bit slower.
With srcskip around 0x40, the SSE version is faster, but as srcskip
increases, prefetch stops covering the next row and it slows down to a
crawl.

Again, this is all on a dual Xeon 1.7, where the SMP architecture could affect
the results.
Draw your own conclusions ;). But IMHO, the SSE version should never be
called with srcskip=0: the speed loss is too great. It should also not be
called when prefetch stops covering the major portion of the next row, or
you get a performance penalty as well.

> Also, choosing the right memcpy version at runtime is not an option,
> for obvious performance reasons when doing small copies.

The version is already being chosen at runtime based on SDL_HasXXX, so I am
not sure what you mean. Rejecting the SSE version based on srcskip and row
width would not be terribly expensive.

Alex.
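
A selection step along the lines suggested above could look roughly like this. It is a sketch under loud assumptions: the enum, the function name and both thresholds are hypothetical, the cut-offs are loosely read off the dual Xeon 1.7 observations in this message, and the unreported range between srcskip=0 and srcskip=0x20 is simply treated like srcskip=0. The intent is that it runs once per blit, not once per row.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { COPY_PLAIN, COPY_MMX, COPY_SSE } copy_variant;

/* Hypothetical thresholds, not measured values from SDL. */
#define SHORT_ROW_MAX   0x200  /* upper end of the "short rows" range */
#define SMALL_SKIP_MAX  0x20   /* at or below this, treat like srcskip=0 (assumption) */
#define SSE_SKIP_MAX    0x100  /* assumed point where prefetch stops covering the next row */

/* Pick a copy routine once per blit from CPU features and row geometry. */
static copy_variant choose_copy(int has_mmx, int has_sse,
                                size_t row_len, size_t srcskip)
{
    assert(row_len > 0);

    /* SSE was only reported faster with a moderate, non-zero srcskip. */
    if (has_sse && srcskip > SMALL_SKIP_MAX && srcskip <= SSE_SKIP_MAX)
        return COPY_SSE;

    /* MMX was only reported faster on short rows with srcskip=0. */
    if (has_mmx && srcskip <= SMALL_SKIP_MAX && row_len <= SHORT_ROW_MAX)
        return COPY_MMX;

    /* Everything else falls back to the plain copy (e.g. 'rep movsd'). */
    return COPY_PLAIN;
}
```

Because the decision is a couple of integer comparisons made before the row loop starts, its cost is spread over every row of the blit.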

Alex_Volkov · September 13, 2005, 11:03pm

Ryan C. Gordon wrote:

> We did make our own memcpy for closed systems.
>
> Linux’s C library’s memcpy sucks badly (or at least did when I last
> checked some time ago).
>
> Apple’s C runtime, however, is hand-tuned for every CPU they support, so
> you are less likely to beat it, and if you do, you’ll lose to it on some
> other generation of chips.
>
> I assume Microsoft’s is similar to Apple’s on all the popular chips, but
> haven’t checked.

Microsoft’s implementation is a simple ‘rep movsd’, either inlined (if
intrinsic optimizations are enabled) or called from the library. The static
library, MSVCRT, and intrinsic implementations are similar (at least in VC6).

Alex.

Clemens_Kirchgattere · September 16, 2005, 5:33pm

“Alex Volkov” wrote:

> Linux’s C library’s memcpy sucks badly (or at least did when I last
> checked some time ago).

I thought gcc plugs in its own memcpy, strcmp, … when compiling with
-O2. So wouldn’t gcc be the one to blame?

c.