OT: optimizing movebits()

Trying to find a better way of doing this:

void movebits1( void* dstBase, unsigned dstOff,
const void* srcBase, unsigned srcOff,
unsigned length ) {
unsigned char* d = (unsigned char*) dstBase;
const unsigned char* s = (const unsigned char*) srcBase;

d += dstOff / 8;
dstOff %= 8;

s += srcOff / 8;
srcOff %= 8;

while (length--)	{
	if (dstOff == 8)	{
		dstOff = 0;
		++d;
	}
	if (srcOff == 8)	{
		srcOff = 0;
		++s;
	}
	*d &= ~(1 << dstOff);
	*d |= ((*s >> srcOff++) & 1) << dstOff++;
}

}

This function takes a base pointer and bit offset for two blocks of
memory and copies bits from one to the other for a certain number of
bits, without wiping out the surrounding data.

I plan on using this to move data between samples in different
formats, so the amount of data moving around per call is fairly low in
that case, but I’m picky.

I’d like to find something reasonably efficient for large memory
blocks as well…

Anyone have any ideas?

Some CPUs support SIMD instructions that can rearrange bits with a
high degree of throughput. Altivec is one that comes to mind.On Fri, Jun 25, 2010 at 1:45 PM, Kenneth Bull wrote:

Trying to find a better way of doing this:

void movebits1( void* dstBase, unsigned dstOff,
? ? ? ? ? ? ? ? ? ? ? ?const void* srcBase, unsigned srcOff,
? ? ? ? ? ? ? ? ? ? ? ?unsigned length ) ? ? ? {
? ? ? ?unsigned char* d = (unsigned char*) dstBase;
? ? ? ?const unsigned char* s = (const unsigned char*) srcBase;

? ? ? ?d += dstOff / 8;
? ? ? ?dstOff %= 8;

? ? ? ?s += srcOff / 8;
? ? ? ?srcOff %= 8;

? ? ? ?while (length–) ? ? ? ?{
? ? ? ? ? ? ? ?if (dstOff == 8) ? ? ? ?{
? ? ? ? ? ? ? ? ? ? ? ?dstOff = 0;
? ? ? ? ? ? ? ? ? ? ? ?++d;
? ? ? ? ? ? ? ?}
? ? ? ? ? ? ? ?if (srcOff == 8) ? ? ? ?{
? ? ? ? ? ? ? ? ? ? ? ?srcOff = 0;
? ? ? ? ? ? ? ? ? ? ? ?++s;
? ? ? ? ? ? ? ?}
? ? ? ? ? ? ? ?*d &= ~(1 << dstOff);
? ? ? ? ? ? ? ?*d |= ((*s >> srcOff++) & 1) << dstOff++;
? ? ? ?}
}

This function takes a base pointer and bit offset for two blocks of
memory and copies bits from one to the other for a certain number of
bits, without wiping out the surrounding data.

I plan on using this to move data between samples in different
formats, so the amount of data moving around per call is fairly low in
that case, but I’m picky.

I’d like to find something reasonably efficient for large memory
blocks as well…

Anyone have any ideas?


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


http://codebad.com/

I’ll take a look, though I’d prefer to stick to C/C++ if possible.
I’m hoping there’s a better algorithm than what I’m using here. I’ve
got a couple for if srcOff == 0 or dstOff == 0, but not for the
general case.

Here’s one for dstOff == 0 (untested…):
int i = 0, j = srcOff;
while (i < length) {
if (j % 8 != 0) {
d[i / 8] = s[j / 8] >> (j % 8);
j += 8 - (j % 8);
i += 8 - (j % 8);
}
else {
d[i / 8] |= s[j / 8] << (i % 8);
j += 8 - (i % 8);
i += 8 - (i % 8);
}
}

Doing it like this for the general case would be a lot more
complicated though, and I’m not sure if it would actually be faster
than doing it bit by bit.

Here’s another:
int i = 0;
for (int mx = (srcOff + length + 7) / 8 - 1; i < mx; ++i) {
d[i] = (unsigned char) (((unsigned short)(s + i)) >> srcOff);
}
if (i * 8 < length) {
d[i] = s[i] >> srcOff;
}

Converting this one to the general case though, it gets harder to
avoid a potential segfault.On 25/06/2010, Donny Viszneki <donny.viszneki at gmail.com> wrote:

Some CPUs support SIMD instructions that can rearrange bits with a
high degree of throughput. Altivec is one that comes to mind.

Yeah, I’ve optimized this a few dozen times in my life. I’m not really
interested in writing the code again. And, as Donny said the newer
CPUs have instructions to optimize this. So did the DEC 10 and 20 was
back in the day.

The key is a combination of special casing and loop unrolling. Heavy
use of Duff’s device is called for. I’ll post more later. My wife just
told me that dinner is ready. :slight_smile:

Bob PendletonOn Fri, Jun 25, 2010 at 12:45 PM, Kenneth Bull wrote:

Trying to find a better way of doing this:

void movebits1( void* dstBase, unsigned dstOff,
? ? ? ? ? ? ? ? ? ? ? ?const void* srcBase, unsigned srcOff,
? ? ? ? ? ? ? ? ? ? ? ?unsigned length ) ? ? ? {
? ? ? ?unsigned char* d = (unsigned char*) dstBase;
? ? ? ?const unsigned char* s = (const unsigned char*) srcBase;

? ? ? ?d += dstOff / 8;
? ? ? ?dstOff %= 8;

? ? ? ?s += srcOff / 8;
? ? ? ?srcOff %= 8;

? ? ? ?while (length–) ? ? ? ?{
? ? ? ? ? ? ? ?if (dstOff == 8) ? ? ? ?{
? ? ? ? ? ? ? ? ? ? ? ?dstOff = 0;
? ? ? ? ? ? ? ? ? ? ? ?++d;
? ? ? ? ? ? ? ?}
? ? ? ? ? ? ? ?if (srcOff == 8) ? ? ? ?{
? ? ? ? ? ? ? ? ? ? ? ?srcOff = 0;
? ? ? ? ? ? ? ? ? ? ? ?++s;
? ? ? ? ? ? ? ?}
? ? ? ? ? ? ? ?*d &= ~(1 << dstOff);
? ? ? ? ? ? ? ?*d |= ((*s >> srcOff++) & 1) << dstOff++;
? ? ? ?}
}

This function takes a base pointer and bit offset for two blocks of
memory and copies bits from one to the other for a certain number of
bits, without wiping out the surrounding data.

I plan on using this to move data between samples in different
formats, so the amount of data moving around per call is fairly low in
that case, but I’m picky.

I’d like to find something reasonably efficient for large memory
blocks as well…

Anyone have any ideas?


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


±----------------------------------------------------------

Amiga Blitter programming! Ahh the memories…

You can do a lot with a 16bit barrel roll operation, start word masking, and end word masking…On 06/26/2010 02:52 PM, Bob Pendleton wrote:

Yeah, I’ve optimized this a few dozen times in my life. I’m not really
interested in writing the code again. And, as Donny said the newer
CPUs have instructions to optimize this. So did the DEC 10 and 20 was
back in the day.

The key is a combination of special casing and loop unrolling. Heavy
use of Duff’s device is called for. I’ll post more later. My wife just
told me that dinner is ready. :slight_smile:

Bob Pendleton

On Fri, Jun 25, 2010 at 12:45 PM, Kenneth Bull wrote:

Trying to find a better way of doing this:

void movebits1( void* dstBase, unsigned dstOff,
const void* srcBase, unsigned srcOff,
unsigned length ) {
unsigned char* d = (unsigned char*) dstBase;
const unsigned char* s = (const unsigned char*) srcBase;

   d += dstOff / 8;
   dstOff %= 8;

   s += srcOff / 8;
   srcOff %= 8;

   while (length--)        {
           if (dstOff == 8)        {
                   dstOff = 0;
                   ++d;
           }
           if (srcOff == 8)        {
                   srcOff = 0;
                   ++s;
           }
           *d &= ~(1 << dstOff);
           *d |= ((*s >> srcOff++) & 1) << dstOff++;
   }

}

This function takes a base pointer and bit offset for two blocks of
memory and copies bits from one to the other for a certain number of
bits, without wiping out the surrounding data.

I plan on using this to move data between samples in different
formats, so the amount of data moving around per call is fairly low in
that case, but I’m picky.

I’d like to find something reasonably efficient for large memory
blocks as well…

Anyone have any ideas?


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


LordHavoc
Author of DarkPlaces Quake1 engine - http://icculus.org/twilight/darkplaces
Co-designer of Nexuiz - http://alientrap.org/nexuiz
"War does not prove who is right, it proves who is left." - Unknown
"Any sufficiently advanced technology is indistinguishable from a rigged demo." - James Klass
"A game is a series of interesting choices." - Sid Meier

Amiga Blitter programming! ?Ahh the memories…

Damn straight!

Bob PendletonOn Sat, Jun 26, 2010 at 6:25 PM, Forest Hale wrote:

You can do a lot with a 16bit barrel roll operation, start word masking, and end word masking…

On 06/26/2010 02:52 PM, Bob Pendleton wrote:

Yeah, I’ve optimized this a few dozen times in my life. I’m not really
interested in writing the code again. And, as Donny said the newer
CPUs have instructions to optimize this. So did the DEC 10 and 20 was
back in the day.

The key is a combination of special casing and loop unrolling. Heavy
use of Duff’s device is called for. I’ll post more later. My wife just
told me that dinner is ready. :slight_smile:

Bob Pendleton

On Fri, Jun 25, 2010 at 12:45 PM, Kenneth Bull wrote:

Trying to find a better way of doing this:

void movebits1( void* dstBase, unsigned dstOff,
? ? ? ? ? ? ? ? ? ? ? ?const void* srcBase, unsigned srcOff,
? ? ? ? ? ? ? ? ? ? ? ?unsigned length ) ? ? ? {
? ? ? ?unsigned char* d = (unsigned char*) dstBase;
? ? ? ?const unsigned char* s = (const unsigned char*) srcBase;

? ? ? ?d += dstOff / 8;
? ? ? ?dstOff %= 8;

? ? ? ?s += srcOff / 8;
? ? ? ?srcOff %= 8;

? ? ? ?while (length–) ? ? ? ?{
? ? ? ? ? ? ? ?if (dstOff == 8) ? ? ? ?{
? ? ? ? ? ? ? ? ? ? ? ?dstOff = 0;
? ? ? ? ? ? ? ? ? ? ? ?++d;
? ? ? ? ? ? ? ?}
? ? ? ? ? ? ? ?if (srcOff == 8) ? ? ? ?{
? ? ? ? ? ? ? ? ? ? ? ?srcOff = 0;
? ? ? ? ? ? ? ? ? ? ? ?++s;
? ? ? ? ? ? ? ?}
? ? ? ? ? ? ? ?*d &= ~(1 << dstOff);
? ? ? ? ? ? ? ?*d |= ((*s >> srcOff++) & 1) << dstOff++;
? ? ? ?}
}

This function takes a base pointer and bit offset for two blocks of
memory and copies bits from one to the other for a certain number of
bits, without wiping out the surrounding data.

I plan on using this to move data between samples in different
formats, so the amount of data moving around per call is fairly low in
that case, but I’m picky.

I’d like to find something reasonably efficient for large memory
blocks as well…

Anyone have any ideas?


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


LordHavoc
Author of DarkPlaces Quake1 engine - http://icculus.org/twilight/darkplaces
Co-designer of Nexuiz - http://alientrap.org/nexuiz
"War does not prove who is right, it proves who is left." - Unknown
"Any sufficiently advanced technology is indistinguishable from a rigged demo." - James Klass
"A game is a series of interesting choices." - Sid Meier


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


±----------------------------------------------------------

First off, can you tell use what kind of formats you are moving around
and what you are doing as you move them around? Optimizing the lowest
level of the task doesn’t get you nearly as much performance
improvement as optimizing at the top level.

Bob PendletonOn Fri, Jun 25, 2010 at 12:45 PM, Kenneth Bull wrote:

Trying to find a better way of doing this:

void movebits1( void* dstBase, unsigned dstOff,
? ? ? ? ? ? ? ? ? ? ? ?const void* srcBase, unsigned srcOff,
? ? ? ? ? ? ? ? ? ? ? ?unsigned length ) ? ? ? {
? ? ? ?unsigned char* d = (unsigned char*) dstBase;
? ? ? ?const unsigned char* s = (const unsigned char*) srcBase;

? ? ? ?d += dstOff / 8;
? ? ? ?dstOff %= 8;

? ? ? ?s += srcOff / 8;
? ? ? ?srcOff %= 8;

? ? ? ?while (length–) ? ? ? ?{
? ? ? ? ? ? ? ?if (dstOff == 8) ? ? ? ?{
? ? ? ? ? ? ? ? ? ? ? ?dstOff = 0;
? ? ? ? ? ? ? ? ? ? ? ?++d;
? ? ? ? ? ? ? ?}
? ? ? ? ? ? ? ?if (srcOff == 8) ? ? ? ?{
? ? ? ? ? ? ? ? ? ? ? ?srcOff = 0;
? ? ? ? ? ? ? ? ? ? ? ?++s;
? ? ? ? ? ? ? ?}
? ? ? ? ? ? ? ?*d &= ~(1 << dstOff);
? ? ? ? ? ? ? ?*d |= ((*s >> srcOff++) & 1) << dstOff++;
? ? ? ?}
}

This function takes a base pointer and bit offset for two blocks of
memory and copies bits from one to the other for a certain number of
bits, without wiping out the surrounding data.

I plan on using this to move data between samples in different
formats, so the amount of data moving around per call is fairly low in
that case, but I’m picky.

I’d like to find something reasonably efficient for large memory
blocks as well…

Anyone have any ideas?


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


±----------------------------------------------------------

I’ll be using this as a fallback for whatever formats I don’t
specifically implement.
This will be going into a block of code that separates and recombines
channels (or individual samples within a channel) within an image or
audio recording (or whatever).

So, for example, you have an RGBA image and you want to extract the
alpha channel and save it as a greyscale image. So you create a
greyscale image with the same dimensions as the RGBA image and the
same number of bits in it’s luminance channel as the alpha channel in
the RGBA image, then do greyImage[“L”].SetData(rgbaImage[“A”]); to
copy the data from the alpha channel to the luminance channel.

Basically, I’m writing something absurdly general. If I were
targeting a specific format, I probably wouldn’t need something like
this.On 27/06/2010, Bob Pendleton wrote:

First off, can you tell use what kind of formats you are moving around
and what you are doing as you move them around? Optimizing the lowest
level of the task doesn’t get you nearly as much performance
improvement as optimizing at the top level.

Trying to find a better way of doing this:

void movebits1( void* dstBase, unsigned dstOff,
const void* srcBase, unsigned srcOff,
unsigned length ) {

Here’s movebits2, which should be a bit faster. I still need to do
benchmarks, and I haven’t touched Bob’s Duff’s Device suggestion yet.
I’m not sure if what I’ve done with the switch statements would be an
improvement over ifs or not. I guess it depends on how the compiler
optimizes conditional expressions.

In this version I normalize the offsets so that dstOff is always 0
before moving most of the data. It makes things much simpler.

void movebits2( void* dstBase, unsigned dstOff,
const void* srcBase, unsigned srcOff,
unsigned length ) {
unsigned char* d;
const unsigned char* s;

d = (unsigned char*) dstBase;
s = (const unsigned char*) srcBase;

// normalize offset and base pointers
d += dstOff >> 3;
dstOff &= 7;
s += srcOff >> 3;
srcOff &= 7;

// deal with short transfers
switch (((dstOff + length <= 8) << 1) |
	(srcOff + length <= 8))	{
	
	case 3:
		*d = (*d & ~((~(255 << length)) << dstOff)) | (((*s >> srcOff) &

~(255 << length)) << dstOff);
return;
case 2:
d = (d & ~((~(255 << length)) << dstOff)) | (((unsigned char)
(
(unsigned short
)s >> srcOff) & ~(255 << length)) << dstOff);
return;
case 1:
(unsigned short)d = ((unsigned short)d & ~((~(0xFFFF <<
length)) << dstOff)) | (((*s >> srcOff) & ~(255 << length)) <<
dstOff);
return;
}

// deal with dstOff
if (dstOff != 0)	{
	*d = (*d & ~(255 << dstOff)) | ((unsigned char) (*(unsigned short*)s

srcOff) << dstOff);
++d;

	// adjust length and srcOff
	length -= 8 - dstOff;
	srcOff += 8 - dstOff;
	dstOff = 0;
	
	s += srcOff >> 3;
	srcOff &= 7;
}

// copy remaining bits
switch (((unsigned)((length & 7) == 0) << 1) |
	(unsigned)(srcOff == 0))	{

	case 0:	{	/* default */
		unsigned l2 = 255 << (length & 7);
		for (length = length >> 3; length--; ++d, ++s)	{
			*d = (unsigned char)(*(unsigned short*)s >> srcOff);
		}
		*d = (*d & l2) | ((unsigned char)(*(unsigned short*)s >> srcOff) & ~l2);
		break;
	}
	case 1:	{	/* srcOff == 0 */
		unsigned l1 = length >> 3, l2 = 255 << (length & 7);
		memcpy(d, s, l1); d += l1; s += l1;
		*d = (*d & l2) | (*s & ~l2);
		break;
	}
	
	case 2:		/* (length & 7) == 0 */
		for (length = length >> 3; length--; ++d, ++s)	{
			*d = (unsigned char)(*(unsigned short*)s >> srcOff);
		}
		break;
		
	case 3:		/* moving bytes */
		memcpy(d, s, length >> 3);
		
}

}

There’s a test program attached.
-------------- next part --------------
#include
#include
#include
#include <conio.h>

#include

void printbits(const void* base, unsigned offset, unsigned length) {
const unsigned char* b = (const unsigned char*) base;
b += offset / 8;
offset %= 8;

for (; length--; ++offset)	{
	if (offset == 8)	{
		offset = 0;
		++b;
	}
	std::printf("%d", (*b >> offset) & 1);
}

}

int cmpbits(const void* base1, unsigned off1,
const void* base2, unsigned off2,
unsigned length) {
const unsigned char* b1;
const unsigned char* b2;

b1 = (unsigned char*) base1;
b2 = (const unsigned char*) base2;

while (length--)	{
	b1 += off1 >> 3;
	off1 &= 7;
	
	b2 += off2 >> 3;
	off2 &= 7;
	
	switch ((((*b1 >> off1++) & 1) << 1) |
			(*b2 >> off2++) & 1)	{
		case 1:
			return -1;
		case 2:
			return 1;
	}
}
return 0;

}

//bit by bit
void movebits1( void* dstBase, unsigned dstOff,
const void* srcBase, unsigned srcOff,
unsigned length ) {
unsigned char* d = (unsigned char*) dstBase;
const unsigned char* s = (const unsigned char*) srcBase;

d += dstOff / 8;
dstOff %= 8;

s += srcOff / 8;
srcOff %= 8;

while (1)	{
	*d = (*d & ~(1 << dstOff)) | (((*s >> srcOff) & 1) << dstOff);
	
	if (!--length)	{
		break;
	}
	
	++srcOff;
	++dstOff;
	
	d += dstOff >> 3;
	dstOff &= 7;
	s += srcOff >> 3;
	srcOff &= 7;
}

}

void movebits2( void* dstBase, unsigned dstOff,
const void* srcBase, unsigned srcOff,
unsigned length ) {
unsigned char* d;
const unsigned char* s;

d = (unsigned char*) dstBase;
s = (const unsigned char*) srcBase;

// normalize offset and base pointers
d += dstOff >> 3;
dstOff &= 7;
s += srcOff >> 3;
srcOff &= 7;

// deal with short transfers
switch (((dstOff + length <= 8) << 1) |
		(srcOff + length <= 8))	{
	case 3:
		*d = (*d & ~((~(255 << length)) << dstOff)) | (((*s >> srcOff) & ~(255 << length)) << dstOff);
		return;
	case 2:
		*d = (*d & ~((~(255 << length)) << dstOff)) | (((unsigned char) (*(unsigned short*)s >> srcOff) & ~(255 << length)) << dstOff);
		return;
	case 1:
		*(unsigned short*)d = (*(unsigned short*)d & ~((~(0xFFFF << length)) << dstOff)) | (((*s >> srcOff) & ~(255 << length)) << dstOff);
		return;
}

// deal with dstOff
if (dstOff != 0)	{
	*d = (*d & ~(255 << dstOff)) | ((unsigned char) (*(unsigned short*)s >> srcOff) << dstOff);
	++d;
	
	// adjust length and srcOff
	length -= 8 - dstOff;
	srcOff += 8 - dstOff;
	dstOff = 0;
	
	s += srcOff >> 3;
	srcOff &= 7;
}

// copy remaining bits
switch (((unsigned)((length & 7) == 0) << 1) | 
		(unsigned)(srcOff == 0))	{

	case 0:	{	/* default */
		unsigned l2 = 255 << (length & 7);
		for (length = length >> 3; length--; ++d, ++s)	{
			*d = (unsigned char)(*(unsigned short*)s >> srcOff);
		}
		*d = (*d & l2) | ((unsigned char)(*(unsigned short*)s >> srcOff) & ~l2);
		break;
	}
	case 1:	{	/* srcOff == 0 */
		unsigned l1 = length >> 3, l2 = 255 << (length & 7);
		memcpy(d, s, l1); d += l1; s += l1;
		*d = (*d & l2) | (*s & ~l2);
		break;
	}
	
	case 2:		/* (length & 7) == 0 */
		for (length = length >> 3; length--; ++d, ++s)	{
			*d = (unsigned char)(*(unsigned short*)s >> srcOff);
		}
		break;
		
	case 3:		/* moving bytes */
		memcpy(d, s, length >> 3);
		
}

}

using namespace std;

int main(int, char**) {
unsigned char buffer[16];
unsigned i, A, B, L;

srand(time(0));

while (!_kbhit())	{
	for (i = 0; i < sizeof(buffer); ++i)	{
		buffer[i] = (unsigned char) rand();
	}
	
	L = rand() % (8 * sizeof(buffer) / 3);
	A = rand() % (8 * sizeof(buffer) - L);
	do	{
		B = rand() % (8 * sizeof(buffer) - L);
	} while ((A <= B && A+L > B) || (B <= A && B+L > A));
	
	printf("\nA = %d, B = %d, L = %d\n", A, B, L);
	printf("A&7 = %d, B&7 = %d, L&7 = %d\n", A&7, B&7, L&7);

	for (i = 0; i < 64; ++i)	{
		printf((i&7)?"%c":(i&63)?" %c":"\n%c",
			(A <= i && A+L > i)? 'A':
			(B <= i && B+L > i)? 'B': ' '
		);
	}
	for (i = 0; i < sizeof(buffer); ++i)	{
		printf((i&7)?" ":"\n");
		printbits(buffer, i*8, 8);
	}
	for (i = sizeof(buffer)*8 - 64; i < sizeof(buffer) * 8; ++i)	{
		printf((i&7)?"%c":(i&63)?" %c":"\n%c",
			(A <= i && A+L > i)? 'A':
			(B <= i && B+L > i)? 'B': ' '
		);
	}
	
	printf("\nA -> ");
	printbits(buffer, A, L);
	printf("\nB -> ");
	printbits(buffer, B, L);
	
	printf("\n\nMoving A to B...\n");
	movebits2(buffer, B, buffer, A, L);
	
	for (i = 0; i < 64; ++i)	{
		printf((i&7)?"%c":(i&63)?" %c":"\n%c",
			(A <= i && A+L > i)? 'A':
			(B <= i && B+L > i)? 'B': ' '
		);
	}
	for (i = 0; i < sizeof(buffer); ++i)	{
		printf((i&7)?" ":"\n");
		printbits(buffer, i*8, 8);
	}
	for (i = sizeof(buffer)*8 - 64; i < sizeof(buffer) * 8; ++i)	{
		printf((i&7)?"%c":(i&63)?" %c":"\n%c",
			(A <= i && A+L > i)? 'A':
			(B <= i && B+L > i)? 'B': ' '
		);
	}
	
	printf("\nA -> ");
	printbits(buffer, A, L);
	printf("\nB -> ");
	printbits(buffer, B, L);
	printf("\n");
	
	if (cmpbits(buffer, A, buffer, B, L) != 0)	{
		printf ("\nA and B don't match\n");
		return -1;
	}
}

return 0;

}On 25/06/2010, Kenneth Bull <@Kenneth_Bull> wrote:

One of the more obvious improvements is to change it to use unsigned int (or on some platforms, unsigned long long may be preferable for 64bit processing), unsigned char is not the memory bus width
(or even close!), it’s less efficient to operate on a type smaller than the memory bus width, especially one so much smaller… There is rarely any reason to operate on bytes when processing stream
data of this nature.

Another odd consideration is that on certain architectures (particularly Pentium 4 for example) the shifting operations are slow… But there isn’t a lot you can do about that.On 06/27/2010 09:52 PM, Kenneth Bull wrote:

On 25/06/2010, Kenneth Bull wrote:

Trying to find a better way of doing this:

void movebits1( void* dstBase, unsigned dstOff,
const void* srcBase, unsigned srcOff,
unsigned length ) {

Here’s movebits2, which should be a bit faster. I still need to do
benchmarks, and I haven’t touched Bob’s Duff’s Device suggestion yet.
I’m not sure if what I’ve done with the switch statements would be an
improvement over ifs or not. I guess it depends on how the compiler
optimizes conditional expressions.


LordHavoc
Author of DarkPlaces Quake1 engine - http://icculus.org/twilight/darkplaces
Co-designer of Nexuiz - http://alientrap.org/nexuiz
"War does not prove who is right, it proves who is left." - Unknown
"Any sufficiently advanced technology is indistinguishable from a rigged demo." - James Klass
"A game is a series of interesting choices." - Sid Meier

Or you could use (u)intmax_t.

Be sure to get a copy of stdint.h if you’re using a Microsoft
development suite that doesn’t include this standard header.On Mon, Jun 28, 2010 at 2:14 AM, Forest Hale wrote:

One of the more obvious improvements is to change it to use unsigned int (or on some platforms, unsigned long long may be preferable for 64bit processing), unsigned char is not the memory bus width
(or even close!), it’s less efficient to operate on a type smaller than the memory bus width, especially one so much smaller… ?There is rarely any reason to operate on bytes when processing stream
data of this nature.


http://codebad.com/

One of the more obvious improvements is to change it to use unsigned int (or
on some platforms, unsigned long long may be preferable for 64bit
processing), unsigned char is not the memory bus width
(or even close!), it’s less efficient to operate on a type smaller than the
memory bus width, especially one so much smaller… There is rarely any
reason to operate on bytes when processing stream
data of this nature.

The way I understand it, unsigned int would be 64 bits in a 64 bit
compiler and 32 on a 32 bit compiler, unsigned long is always 32,
unsigned long long is always 64.

I could use unsigned int, then throw in an if (sizeof(int) == 64)
which would be optimized out at compile time.

Another odd consideration is that on certain architectures (particularly
Pentium 4 for example) the shifting operations are slow… But there isn’t
a lot you can do about that.

I doubt it’s slower than converting it to multiplication…On 28/06/2010, Forest Hale wrote:

Probably wouldn’t want uintmax_t. The idea is to match the bus size.
Using a 64 bit type on a 32 bit bus wouldn’t help. I’m not sure if it
would reduce the number of instructions for large buffers once
compiled either. It would reduce the number of iterations, but I’d be
better off to use loop unwinding or leave it up to compiler
optimizations.On 28/06/2010, Donny Viszneki <donny.viszneki at gmail.com> wrote:

Or you could use (u)intmax_t.

Be sure to get a copy of stdint.h if you’re using a Microsoft
development suite that doesn’t include this standard header.

I have benchmarks!

These are the results for my eMachines T6216 (Athlon 64 3200+ at 2.01
GHz) running Windows XP Home (32 bit). I have a few fancy flags in
boot.ini that probably affect scheduling…

The benchmark uses QueryPerformanceCounter() so it’s Windows only for
the moment. Shouldn’t be hard to convert it if anyone wants to try it
on Linux.

I compiled using VC10 compiler in the Windows SDK build environment.

The benchmark was compiled like so:
cl /c /O2tx /arch:SSE2 /EHsc bittest.cpp
link /out:bittest.exe bittest.obj

and like so:
cl /c /EHsc bittest.cpp
link /out:bittest.exe bittest.obj

The revised bittest.cpp and results are attached.
-------------- next part --------------
#include
#include
#include
#include <conio.h>

#include

void printbits(const void* base, unsigned offset, unsigned length) {
const unsigned char* b = (const unsigned char*) base;
b += offset / 8;
offset %= 8;

for (; length--; ++offset)	{
	if (offset == 8)	{
		offset = 0;
		++b;
	}
	std::printf("%d", (*b >> offset) & 1);
}

}

int cmpbits(const void* base1, unsigned off1,
const void* base2, unsigned off2,
unsigned length) {
const unsigned char* b1;
const unsigned char* b2;

b1 = (unsigned char*) base1;
b2 = (const unsigned char*) base2;

while (length--)	{
	b1 += off1 >> 3;
	off1 &= 7;
	
	b2 += off2 >> 3;
	off2 &= 7;
	
	switch ((((*b1 >> off1++) & 1) << 1) |
			(*b2 >> off2++) & 1)	{
		case 1:
			return -1;
		case 2:
			return 1;
	}
}
return 0;

}

//bit by bit
void movebits1( void* dstBase, unsigned dstOff,
const void* srcBase, unsigned srcOff,
unsigned length ) {
unsigned char* d = (unsigned char*) dstBase;
const unsigned char* s = (const unsigned char*) srcBase;

while (length--)	{
	d += dstOff >> 3;
	dstOff &= 7;
	s += srcOff >> 3;
	srcOff &= 7;
	
	*d = (*d & ~(1 << dstOff)) | (((*s >> srcOff) & 1) << dstOff);
	
	++srcOff;
	++dstOff;
}

}

void movebits2( void* dstBase, unsigned dstOff,
const void* srcBase, unsigned srcOff,
unsigned length ) {
unsigned char* d;
const unsigned char* s;

d = (unsigned char*) dstBase;
s = (const unsigned char*) srcBase;

// normalize offset and base pointers
d += dstOff >> 3;
dstOff &= 7;
s += srcOff >> 3;
srcOff &= 7;

// deal with short transfers
switch (((dstOff + length <= 8) << 1) |
		(srcOff + length <= 8))	{
	case 3:
		*d = (*d & ~((~(255 << length)) << dstOff)) | (((*s >> srcOff) & ~(255 << length)) << dstOff);
		return;
	case 2:
		*d = (*d & ~((~(255 << length)) << dstOff)) | (((unsigned char) (*(unsigned short*)s >> srcOff) & ~(255 << length)) << dstOff);
		return;
	case 1:
		*(unsigned short*)d = (*(unsigned short*)d & ~((~(0xFFFF << length)) << dstOff)) | (((*s >> srcOff) & ~(255 << length)) << dstOff);
		return;
}

// deal with dstOff
if (dstOff != 0)	{
	*d = (*d & ~(255 << dstOff)) | ((unsigned char) (*(unsigned short*)s >> srcOff) << dstOff);
	++d;
	
	// adjust length and srcOff
	length -= 8 - dstOff;
	srcOff += 8 - dstOff;
	dstOff = 0;
	
	s += srcOff >> 3;
	srcOff &= 7;
}

// copy remaining bits
switch (((unsigned)((length & 7) == 0) << 1) | 
		(unsigned)(srcOff == 0))	{

	case 0:	{	/* default */
		unsigned l2 = 255 << (length & 7);
		for (length = length >> 3; length--; ++d, ++s)	{
			*d = (unsigned char)(*(unsigned short*)s >> srcOff);
		}
		*d = (*d & l2) | ((unsigned char)(*(unsigned short*)s >> srcOff) & ~l2);
		break;
	}
	case 1:	{	/* srcOff == 0 */
		unsigned l2 = 255 << (length & 7);
		memcpy(d, s, length = length >> 3); d += length; s += length;
		*d = (*d & l2) | (*s & ~l2);
		break;
	}
	
	case 2:		/* (length & 7) == 0 */
		for (length = length >> 3; length--; ++d, ++s)	{
			*d = (unsigned char)(*(unsigned short*)s >> srcOff);
		}
		break;
		
	case 3:		/* moving bytes */
		memcpy(d, s, length >> 3);
		
}

}

int test() {
unsigned char buffer[16];
unsigned i, A, B, L;

srand(time(0));

while (!_kbhit())	{
	for (i = 0; i < sizeof(buffer); ++i)	{
		buffer[i] = (unsigned char) rand();
	}
	
	L = rand() % (8 * sizeof(buffer) / 3);
	A = rand() % (8 * sizeof(buffer) - L);
	do	{
		B = rand() % (8 * sizeof(buffer) - L);
	} while ((A <= B && A+L > B) || (B <= A && B+L > A));
	
	printf("\nA = %d, B = %d, L = %d\n", A, B, L);

	for (i = 0; i < 64; ++i)	{
		printf((i&7)?"%c":(i&63)?" %c":"\n%c",
			(A <= i && A+L > i)? 'A':
			(B <= i && B+L > i)? 'B': ' '
		);
	}
	for (i = 0; i < sizeof(buffer); ++i)	{
		printf((i&7)?" ":"\n");
		printbits(buffer, i*8, 8);
	}
	for (i = sizeof(buffer)*8 - 64; i < sizeof(buffer) * 8; ++i)	{
		printf((i&7)?"%c":(i&63)?" %c":"\n%c",
			(A <= i && A+L > i)? 'A':
			(B <= i && B+L > i)? 'B': ' '
		);
	}
	
	printf("\nA -> ");
	printbits(buffer, A, L);
	printf("\nB -> ");
	printbits(buffer, B, L);
	
	printf("\n\nMoving A to B...\n");
	movebits2(buffer, B, buffer, A, L);
	
	for (i = 0; i < 64; ++i)	{
		printf((i&7)?"%c":(i&63)?" %c":"\n%c",
			(A <= i && A+L > i)? 'A':
			(B <= i && B+L > i)? 'B': ' '
		);
	}
	for (i = 0; i < sizeof(buffer); ++i)	{
		printf((i&7)?" ":"\n");
		printbits(buffer, i*8, 8);
	}
	for (i = sizeof(buffer)*8 - 64; i < sizeof(buffer) * 8; ++i)	{
		printf((i&7)?"%c":(i&63)?" %c":"\n%c",
			(A <= i && A+L > i)? 'A':
			(B <= i && B+L > i)? 'B': ' '
		);
	}
	
	printf("\nA -> ");
	printbits(buffer, A, L);
	printf("\nB -> ");
	printbits(buffer, B, L);
	printf("\n");
	
	if (cmpbits(buffer, A, buffer, B, L) != 0)	{
		printf ("\nA and B don't match\n");
		return 1;
	}
}

return 0;

}

#include <windows.h>

int bench(unsigned count, size_t size) {
unsigned i;
srand(time(0));

// printf ("\nCreating dataset… “);
unsigned char* buffer = new unsigned char[size];
for (i = 0; i < size; ++i) {
buffer[i] = rand();
// printf((i&7)?” “:”\n");
// printbits(buffer, i*8, 8);
}

// printf ("\n\nInitializing variables…\n");
unsigned* vars = new unsigned[count*3];
for (i = count; i–; ) {
unsigned& A = vars[i*3+0];
unsigned& B = vars[i*3+1];
unsigned& L = vars[i*3+2];

	L = rand() % (8 * size / 3);
	A = rand() % (8 * size - L);
	do	{
		B = rand() % (8 * size - L);
	} while ((A <= B && A+L > B) || (B <= A && B+L > A));

// printf("%3d %3d %3d\t", A, B, L);
}

printf ("\n\nRunning benchmark (%u iterations, %u byte buffer)...", count, size);
long long freq = 0, start = 0, end = 0;
QueryPerformanceFrequency((LARGE_INTEGER*) &freq);

printf("\nmovebits1: ");
QueryPerformanceCounter((LARGE_INTEGER*) &start);

for (i = count; i--; )	{
	movebits1(buffer, vars[i*3+0], buffer, vars[i*3+1], vars[i*3+2]);
}

QueryPerformanceCounter((LARGE_INTEGER*) &end);
printf ("%lf seconds", (double)(end-start)/freq);

printf("\nmovebits2: ");
QueryPerformanceCounter((LARGE_INTEGER*) &start);

for (i = count; i--; )	{
	movebits2(buffer, vars[i*3+0], buffer, vars[i*3+1], vars[i*3+2]);
}

QueryPerformanceCounter((LARGE_INTEGER*) &end);
printf ("%lf seconds", (double)(end-start)/freq);

printf("\nempty loop:");
unsigned junk = 0;
QueryPerformanceCounter((LARGE_INTEGER*) &start);

for (i = count; i--; )	{
	junk |= vars[i*3+0]|vars[i*3+1]|vars[i*3+2];
}

QueryPerformanceCounter((LARGE_INTEGER*) &end);
printf ("%lf seconds\n", (double)(end-start)/freq, junk);

delete vars;
delete buffer;
return 0;

}

int main(int, char**) {
bench(1000000, 256);
bench(1000000, 128);
bench(1000000, 64);
bench(1000000, 32);
bench(1000000, 16);
bench(1000000, 8);
bench(1000000, 4);
bench(1000000, 2);
bench(1000000, 1);
return 0;
}
-------------- next part --------------

Running benchmark (1000000 iterations, 256 byte buffer)…
movebits1: 1.810916 seconds
movebits2: 0.101904 seconds
empty loop:0.006286 seconds

Running benchmark (1000000 iterations, 128 byte buffer)…
movebits1: 0.905211 seconds
movebits2: 0.072261 seconds
empty loop:0.006125 seconds

Running benchmark (1000000 iterations, 64 byte buffer)…
movebits1: 0.456868 seconds
movebits2: 0.058807 seconds
empty loop:0.006458 seconds

Running benchmark (1000000 iterations, 32 byte buffer)…
movebits1: 0.237964 seconds
movebits2: 0.049728 seconds
empty loop:0.006458 seconds

Running benchmark (1000000 iterations, 16 byte buffer)…
movebits1: 0.124081 seconds
movebits2: 0.042296 seconds
empty loop:0.006100 seconds

Running benchmark (1000000 iterations, 8 byte buffer)…
movebits1: 0.069005 seconds
movebits2: 0.036983 seconds
empty loop:0.006107 seconds

Running benchmark (1000000 iterations, 4 byte buffer)…
movebits1: 0.040075 seconds
movebits2: 0.030821 seconds
empty loop:0.005774 seconds

Running benchmark (1000000 iterations, 2 byte buffer)…
movebits1: 0.026830 seconds
movebits2: 0.022643 seconds
empty loop:0.006423 seconds

Running benchmark (1000000 iterations, 1 byte buffer)…
movebits1: 0.016274 seconds
movebits2: 0.022905 seconds
empty loop:0.006127 seconds
-------------- next part --------------

Running benchmark (1000000 iterations, 256 byte buffer)…
movebits1: 4.619098 seconds
movebits2: 0.389204 seconds
empty loop:0.012779 seconds

Running benchmark (1000000 iterations, 128 byte buffer)…
movebits1: 2.296133 seconds
movebits2: 0.228250 seconds
empty loop:0.012435 seconds

Running benchmark (1000000 iterations, 64 byte buffer)…
movebits1: 1.168352 seconds
movebits2: 0.146105 seconds
empty loop:0.012440 seconds

Running benchmark (1000000 iterations, 32 byte buffer)…
movebits1: 0.586166 seconds
movebits2: 0.104490 seconds
empty loop:0.012716 seconds

Running benchmark (1000000 iterations, 16 byte buffer)…
movebits1: 0.307077 seconds
movebits2: 0.080317 seconds
empty loop:0.012499 seconds

Running benchmark (1000000 iterations, 8 byte buffer)…
movebits1: 0.170761 seconds
movebits2: 0.067020 seconds
empty loop:0.012712 seconds

Running benchmark (1000000 iterations, 4 byte buffer)…
movebits1: 0.091559 seconds
movebits2: 0.053966 seconds
empty loop:0.012461 seconds

Running benchmark (1000000 iterations, 2 byte buffer)…
movebits1: 0.055960 seconds
movebits2: 0.038912 seconds
empty loop:0.012841 seconds

Running benchmark (1000000 iterations, 1 byte buffer)…
movebits1: 0.032166 seconds
movebits2: 0.036359 seconds
empty loop:0.012471 secondsOn 28/06/2010, Kenneth Bull <@Kenneth_Bull> wrote:

Here’s movebits2, which should be a bit faster. I still need to do
benchmarks, and I haven’t touched Bob’s Duff’s Device suggestion yet.
I’m not sure if what I’ve done with the switch statements would be an
improvement over ifs or not. I guess it depends on how the compiler
optimizes conditional expressions.

Has there been a 32 bit or 64 bit bus machine built in a the last 5
years? Or even longer? The native word size and the bus size no longer
have much in common. Look at how the caches handle memory references
if you want to optimize transfer rate. Mostly doesn’t matter in this
case because you are reading fields in a format. They are either
memory aligned or not. If they are, then great. If not, oh well.

Lets say I am trying to get the Green field out of a RGB565 color
image. The green field starts at bit 5 which is never going to be
aligned. OTOH, if you are writing to an 8 bit gray scale image then
you are going to write to 8 bit aligned addresses and you will get
good speed.

Basically, worry about it later.

Bob PendletonOn Mon, Jun 28, 2010 at 10:16 AM, Kenneth Bull wrote:

On 28/06/2010, Donny Viszneki <donny.viszneki at gmail.com> wrote:

Or you could use (u)intmax_t.

Be sure to get a copy of stdint.h if you’re using a Microsoft
development suite that doesn’t include this standard header.

Probably wouldn’t want uintmax_t. ?The idea is to match the bus size.
Using a 64 bit type on a 32 bit bus wouldn’t help. ?I’m not sure if it
would reduce the number of instructions for large buffers once
compiled either. ?It would reduce the number of iterations, but I’d be
better off to use loop unwinding or leave it up to compiler
optimizations.

±----------------------------------------------------------

Cool! Interesting results! Let us know if you run the benchmarks on
another type of machine :)On Mon, Jun 28, 2010 at 11:39 AM, Kenneth Bull wrote:

On 28/06/2010, Kenneth Bull wrote:

Here’s movebits2, which should be a bit faster. ?I still need to do
benchmarks, and I haven’t touched Bob’s Duff’s Device suggestion yet.
I’m not sure if what I’ve done with the switch statements would be an
improvement over ifs or not. ?I guess it depends on how the compiler
optimizes conditional expressions.

I have benchmarks!

These are the results for my eMachines T6216 (Athlon 64 3200+ at 2.01
GHz) running Windows XP Home (32 bit). ?I have a few fancy flags in
boot.ini that probably affect scheduling…

The benchmark uses QueryPerformanceCounter() so it’s Windows only for
the moment. ?Shouldn’t be hard to convert it if anyone wants to try it
on Linux.

I compiled using VC10 compiler in the Windows SDK build environment.

The benchmark was compiled like so:
cl /c /O2tx /arch:SSE2 /EHsc bittest.cpp
link /out:bittest.exe bittest.obj

and like so:
cl /c /EHsc bittest.cpp
link /out:bittest.exe bittest.obj

The revised bittest.cpp and results are attached.


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


http://codebad.com/

The native word size still matters though, even if the bus size
doesn’t. Doing one 32 bit operation instead of four 8 bit operations
is an improvement. However, I’d still have to fall back to 8 bits
near the beginning or end of the buffer if it’s an odd size or
alignment to avoid segfaults.On 28/06/2010, Bob Pendleton wrote:

Has there been a 32 bit or 64 bit bus machine built in a the last 5
years? Or even longer? The native word size and the bus size no longer
have much in common. Look at how the caches handle memory references
if you want to optimize transfer rate. Mostly doesn’t matter in this
case because you are reading fields in a format. They are either
memory aligned or not. If they are, then great. If not, oh well.

To clarify, I was not saying that 32bit or 64bit would be bus width (128bit is the norm today), I was saying that 8bit is a waste of iterations because it is so small in comparison to the bus width
that it is unlikely to even be able to saturate the bus - in other words it is an artificial bottleneck on the algorithm.

For better optimization suggestions I would need to know what the purpose of this function is - are you doing blitting operations in 1bit planar for FAX machines or printers or what? Because if the
format is in fact NOT 1bit planar, this algorithm is probably a complete waste of time and a different approach would be more optimal.On 06/28/2010 09:02 AM, Bob Pendleton wrote:

On Mon, Jun 28, 2010 at 10:16 AM, Kenneth Bull wrote:

On 28/06/2010, Donny Viszneki <donny.viszneki at gmail.com> wrote:

Or you could use (u)intmax_t.

Be sure to get a copy of stdint.h if you’re using a Microsoft
development suite that doesn’t include this standard header.

Probably wouldn’t want uintmax_t. The idea is to match the bus size.
Using a 64 bit type on a 32 bit bus wouldn’t help. I’m not sure if it
would reduce the number of instructions for large buffers once
compiled either. It would reduce the number of iterations, but I’d be
better off to use loop unwinding or leave it up to compiler
optimizations.

Has there been a 32 bit or 64 bit bus machine built in a the last 5
years? Or even longer? The native word size and the bus size no longer
have much in common. Look at how the caches handle memory references
if you want to optimize transfer rate. Mostly doesn’t matter in this
case because you are reading fields in a format. They are either
memory aligned or not. If they are, then great. If not, oh well.

Lets say I am trying to get the Green field out of a RGB565 color
image. The green field starts at bit 5 which is never going to be
aligned. OTOH, if you are writing to an 8 bit gray scale image then
you are going to write to 8 bit aligned addresses and you will get
good speed.

Basically, worry about it later.

Bob Pendleton

±----------------------------------------------------------


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


LordHavoc
Author of DarkPlaces Quake1 engine - http://icculus.org/twilight/darkplaces
Co-designer of Nexuiz - http://alientrap.org/nexuiz
"War does not prove who is right, it proves who is left." - Unknown
"Any sufficiently advanced technology is indistinguishable from a rigged demo." - James Klass
"A game is a series of interesting choices." - Sid Meier

Well, if speaking of alignment, know that malloc always returns 8-byte aligned data on all the platforms I’m aware of, this means that even if the buffer provided is slightly above an 8-byte aligned
address, you can read down to the 8-byte aligned address without ill-effect.

However this is mostly a matter of how you define the algorithm, what its proper inputs and outputs are.

Know also that on most CPU architectures (but not x86) you can get a crash from accessing 16bit or 32bit on unaligned addresses, this may significantly affect the design of your algorithm.

In general you want to define your algorithms with more optimal restrictions on inputs and outputs to ensure that a fast approach is used in the vast majority of cases, requiring that the addresses
are provided as unsigned int pointer rather than unsigned char pointer would be a perfectly reasonable restriction in most algorithms of this nature, then whether that would cause an alignment crash
is no longer a concern of your algorithm, the blame is shifted to the caller.On 06/28/2010 09:34 AM, Kenneth Bull wrote:

On 28/06/2010, Bob Pendleton wrote:

Has there been a 32 bit or 64 bit bus machine built in a the last 5
years? Or even longer? The native word size and the bus size no longer
have much in common. Look at how the caches handle memory references
if you want to optimize transfer rate. Mostly doesn’t matter in this
case because you are reading fields in a format. They are either
memory aligned or not. If they are, then great. If not, oh well.

The native word size still matters though, even if the bus size
doesn’t. Doing one 32 bit operation instead of four 8 bit operations
is an improvement. However, I’d still have to fall back to 8 bits
near the beginning or end of the buffer if it’s an odd size or
alignment to avoid segfaults.


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


LordHavoc
Author of DarkPlaces Quake1 engine - http://icculus.org/twilight/darkplaces
Co-designer of Nexuiz - http://alientrap.org/nexuiz
"War does not prove who is right, it proves who is left." - Unknown
"Any sufficiently advanced technology is indistinguishable from a rigged demo." - James Klass
"A game is a series of interesting choices." - Sid Meier

Use orc :slight_smile: http://code.entropywave.com/projects/orc/