OT: optimizing movebits()

This is not for a specific format. It will be used as a fallback when
no specific implementation exists and I am dealing with densely packed
fields of non-power-of-2 sizes. Refer back to my earlier reply to
Bob.
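
Concretely, the operation in question is along these lines; a naive
bit-at-a-time reference version (the signature, names and bit order here
are illustrative only, not the movebits1/2/3 code discussed below):

#include <cstddef>

// Copy 'length' bits from 'src' starting at bit offset 'srcBit' to 'dst'
// starting at bit offset 'dstBit'.  Bits are numbered MSB-first within
// each byte (an assumption for this sketch), and the two bit ranges are
// assumed not to overlap.
static void movebits_naive(unsigned char* dst, std::size_t dstBit,
                           const unsigned char* src, std::size_t srcBit,
                           std::size_t length)
{
    for (std::size_t i = 0; i < length; ++i) {
        std::size_t s = srcBit + i;
        std::size_t d = dstBit + i;
        unsigned char bit = (src[s / 8] >> (7 - (s % 8))) & 1u;
        if (bit)
            dst[d / 8] |= static_cast<unsigned char>(1u << (7 - (d % 8)));
        else
            dst[d / 8] &= static_cast<unsigned char>(~(1u << (7 - (d % 8))));
    }
}

An optimized version moves word-sized groups of bits per iteration instead
of single bits, which is where the questions about shift widths and
alignment below come from.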

Another example:
Having extracted the 5-bit red channel of a 16-bit RGB image into a
densely packed buffer, extract 100 samples starting at index 27.

ChannelCopy red(rgbImage["R"]);
// …
ChannelCopy blah(5, 100);
blah.SetData(red.GetRange(27, 100));

or ChannelCopy blah(red.GetRange(27, 100)); since GetRange would return
a Channel object.

On 28/06/2010, Forest Hale wrote:

For better optimization suggestions I would need to know what the purpose of
this function is: are you doing blitting operations in 1-bit planar for FAX
machines or printers, or what? Because if the format is in fact NOT 1-bit
planar, this algorithm is probably a complete waste of time and a different
approach would be more optimal.

I’ll have to try that. If the compilation is fast enough, it might be
a good option.

To be honest, I’m really not impressed with GStreamer, but that might
not be Orc’s fault.

On 28/06/2010, René Dudfield wrote:

Use Orc :) http://code.entropywave.com/projects/orc/

Based on Ken’s benchmark results, I’d say that the primary factor
determining performance here is that his CPU can only bit shift 8 bit
values natively.

On Mon, Jun 28, 2010 at 12:34 PM, Kenneth Bull wrote:

On 28/06/2010, Bob Pendleton wrote:

Has there been a 32 bit or 64 bit bus machine built in the last 5
years? Or even longer? The native word size and the bus size no longer
have much in common. Look at how the caches handle memory references
if you want to optimize transfer rate. Mostly doesn’t matter in this
case because you are reading fields in a format. They are either
memory aligned or not. If they are, then great. If not, oh well.

The native word size still matters though, even if the bus size
doesn’t. Doing one 32-bit operation instead of four 8-bit operations
is an improvement. However, I’d still have to fall back to 8 bits
near the beginning or end of the buffer if it’s an odd size or
alignment to avoid segfaults.
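
The pattern being described here (single bytes at the unaligned edges, full
words in the middle) looks roughly like the following byte-level sketch; it
is a simplified illustration of the idea, not the bit-level movebits itself,
and it ignores strict-aliasing concerns for clarity:

#include <cstddef>
#include <cstdint>

// Invert every byte in a buffer: bytes one at a time until the pointer is
// 32-bit aligned, then one 32-bit operation instead of four 8-bit ones,
// then bytes again for whatever is left at the end.
static void invert_bytes(unsigned char* p, std::size_t n)
{
    // Head: bytes before the first 32-bit-aligned address.
    while (n > 0 && (reinterpret_cast<std::uintptr_t>(p) & 3u) != 0) {
        *p = static_cast<unsigned char>(~*p);
        ++p; --n;
    }
    // Body: aligned 32-bit words.
    std::uint32_t* w = reinterpret_cast<std::uint32_t*>(p);
    while (n >= 4) {
        *w = ~*w;
        ++w; n -= 4;
    }
    // Tail: remaining bytes at the end of the buffer.
    p = reinterpret_cast<unsigned char*>(w);
    while (n > 0) {
        *p = static_cast<unsigned char>(~*p);
        ++p; --n;
    }
}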



The way I plan on using this, the base pointers will be exactly what’s
returned by new / malloc and both byte and bit offset will be handled
by the offset arguments. So if malloc is giving me 32 or 64 bit
aligned addresses, the base pointers will still be aligned when they
get to this function.
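
For illustration, the decomposition on the callee side might look something
like this (names here are hypothetical):

#include <cstddef>
#include <cstdint>

// With the base pointer guaranteed to be malloc/new aligned, a bit offset
// splits into an aligned 32-bit word index plus a bit position inside that
// word, so the function never has to dereference an unaligned address.
// (How the bit position maps to memory also depends on byte order, which
// the real function has to pin down.)
inline std::uint32_t* word_of(void* base, std::size_t bitOffset,
                              unsigned& bitInWord)
{
    bitInWord = static_cast<unsigned>(bitOffset % 32);          // 0..31
    return static_cast<std::uint32_t*>(base) + bitOffset / 32;  // aligned
}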

Using unsigned int or unsigned long base pointers sounds reasonable.
This is in a Utility.hpp header that may or may not find its way into
the API, so I may not be the only one using this function. It would
make for messier code though (more pointer casts).

On 28/06/2010, Forest Hale wrote:

Know also that on most CPU architectures (but not x86) you can get a crash
from accessing 16-bit or 32-bit values at unaligned addresses; this may
significantly affect the design of your algorithm.

In general you want to define your algorithms with tighter restrictions on
inputs and outputs, so that a fast approach is used in the vast majority of
cases. Requiring that the addresses be provided as unsigned int pointers
rather than unsigned char pointers would be a perfectly reasonable
restriction in most algorithms of this nature; then whether that would cause
an alignment crash is no longer a concern of your algorithm, and the blame
is shifted to the caller.
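
For completeness, another common way around the alignment issue is to read
wider values through memcpy, which the compiler can lower to a single load
on architectures where that is safe (a sketch only; this is an alternative
to restricting the pointer type, not part of the suggestion above):

#include <cstdint>
#include <cstring>

// Read a 32-bit value from a possibly unaligned address without risking an
// alignment fault; memcpy is always legal here, whatever the alignment.
static std::uint32_t load_u32(const unsigned char* p)
{
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}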

No… I’m pretty sure it can do better than that.
SHL works on 8, 16 and 32 bit values (at least).

The algorithm in movebits1 doesn’t really take advantage of that much,
but most of the shifts in movebits2 are 16 bits.

I’m working on a movebits3 that works on 32 bits. If I were doing
this in assembly language I could probably get up to 64 bits with ye
olde MMX, or 128 with SSE.

On 28/06/2010, Donny Viszneki <donny.viszneki at gmail.com> wrote:

Based on Ken’s benchmark results, I’d say that the primary factor
determining performance here is that his CPU can only bit shift 8 bit
values natively.

I thought your benchmarks showed using different sized data types for
shifting the data. Is that not the case?

On Mon, Jun 28, 2010 at 1:55 PM, Kenneth Bull wrote:

The algorithm in movebits1 doesn’t really take advantage of that much,
but most of the shifts in movebits2 are 16 bits.



‘long’ is guaranteed to be at least as long as ‘int’, so there is no
compiler or OS anywhere with 64 bit ‘int’ and 32 bit ‘long’. The two
most common models are LLP64 (32 bit int and long, 64 bit long long) and
LP64 (32 bit int, 64 bit long and long long). See
http://en.wikipedia.org/wiki/64-bit#Specific_C-language_data_models.

On 6/28/2010 09:08, Kenneth Bull wrote:

The way I understand it, unsigned int would be 64 bits in a 64 bit
compiler and 32 on a 32 bit compiler, unsigned long is always 32,
unsigned long long is always 64.


Rainer Deyke - rainerd at eldwood.com
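
A quick way to see which data model a given compiler actually uses is to
print the sizes directly (illustrative):

#include <iostream>

// On LP64 systems (typical 64-bit Linux/OS X) long is 8 bytes; on LLP64
// (64-bit Windows) it stays 4; int is 4 bytes on both.
int main()
{
    std::cout << "int:       " << sizeof(int)       << " bytes\n"
              << "long:      " << sizeof(long)      << " bytes\n"
              << "long long: " << sizeof(long long) << " bytes\n";
}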

No, that’s the number of bytes in the buffer containing all the bits
it’s fiddling with.
It affects the size of the blocks being moved, which is always less
than one third of the number of bits in the buffer.

The relevant code is here:

unsigned char* buffer = new unsigned char[size];
for (i = 0; i < size; ++i) {
    buffer[i] = rand();
}

and here:

L = rand() % (8 * size / 3);
A = rand() % (8 * size - L);
do {
    B = rand() % (8 * size - L);
} while ((A <= B && A+L > B) || (B <= A && B+L > A));

L is the length of the block being copied in bits, A is the source
offset and B is the destination offset. The ranges A : A+L and B :
B+L never overlap.

On 28/06/2010, Donny Viszneki <donny.viszneki at gmail.com> wrote:

On Mon, Jun 28, 2010 at 1:55 PM, Kenneth Bull wrote:

The algorithm in movebits1 doesn’t really take advantage of that much,
but most of the shifts in movebits2 are 16 bits.

I thought your benchmarks showed using different sized data types for
shifting the data. Is that not the case?