RLE Compression

Hi,

I have implemented RLE compression for per-pixel alpha surfaces (it could be adapted for color-keyed surfaces as well). The new implementation compresses horizontal runs of the same color (translucent and opaque). It does this using signed run lengths (a negative count means the next color is repeated). The current RLE acceleration only removes transparent pixels and groups opaque/translucent pixels for faster blitting.

So how much memory does it save? Of course that depends on the content.
On the application I am using it for, a GUI with a lot of translucency and in general large amounts of same color areas (e.g. text with large font sizes), it is a major win (savings of 50-90%).
In the worst case, e.g. dithered images, there is no gain and potentially a minimal loss (opaque runs in 16-bit color are now limited to 127 pixels instead of 255).

I did not do any performance measurements, but there should not be much difference: the blitting involves a few more conditionals, but on the other hand less memory is touched.

Additionally, this implementation uses a two-pass RLE conversion. First, the size of the resulting RLE-compressed data is computed (by running the same algorithm with a dummy output buffer). Then the surface is actually compressed into a memory block of exactly the right size. The RLE implementation in current SDL allocates a worst-case buffer (which in most cases is much too large) and reallocs it at the end, once the size is known. At least on Windows CE, with its limited virtual memory (max. 32 MB, around 24 MB usable), that leads to serious memory fragmentation.

On a normal PC, memory is usually not much of a problem, but on a memory-limited device like the NDS this might be useful.

I have attached a diff (against SDL 1.2.13). This is not meant to be final (especially the UnRLE code is not well tested). If somebody is interested, I might clean it up and test it a little bit more.

One problem I have is that with the current API (I was using 1.2.13, but I saw no changes in 1.3) it is not possible to RLE-compress a surface in place without creating a new surface and thereby allocating a large block of memory (DisplayFormatAlpha/ConvertSurface). As a workaround I used the internal function MapSurface to do this. Is there a better way?

Rudolf

-------------- next part --------------
A non-text attachment was scrubbed…
Name: rle.diff
Type: application/octet-stream
Size: 24988 bytes
Desc: not available
URL: http://lists.libsdl.org/pipermail/sdl-libsdl.org/attachments/20090410/e67897cc/attachment.obj

> I have implemented RLE compression for per-pixel alpha surfaces (it could be adapted for color-keyed surfaces as well). The new implementation compresses horizontal runs of the same color (translucent and opaque). It does this using signed run lengths (a negative count means the next color is repeated). The current RLE acceleration only removes transparent pixels and groups opaque/translucent pixels for faster blitting.

Interesting!

> Additionally, this implementation uses a two-pass RLE conversion. First, the size of the resulting RLE-compressed data is computed (by running the same algorithm with a dummy output buffer).
> Then the surface is actually compressed into a memory block of exactly the right size. The RLE implementation in current SDL allocates a worst-case buffer (which in most cases is much too large) and reallocs it at the end, once the size is known. At least on Windows CE, with its limited virtual memory (max. 32 MB, around 24 MB usable), that leads to serious memory fragmentation.

That’s a shame. Shrinking the most recently allocated chunk of memory
should not cause memory fragmentation!

> On a normal PC, memory is usually not much of a problem, but on a memory-limited device like the NDS this might be useful.

A good libc allocator should not have a hard time with either approach.

On Fri, Apr 10, 2009 at 4:57 AM, Rudolf Janz <rudolf.janz at gmx.net> wrote:


http://codebad.com/

> Additionally, this implementation uses a two-pass RLE conversion. First, the size of the resulting RLE-compressed data is computed (by running the same algorithm with a dummy output buffer).
> Then the surface is actually compressed into a memory block of exactly the right size. The RLE implementation in current SDL allocates a worst-case buffer (which in most cases is much too large) and reallocs it at the end, once the size is known. At least on Windows CE, with its limited virtual memory (max. 32 MB, around 24 MB usable), that leads to serious memory fragmentation.

> That’s a shame. Shrinking the most recently allocated chunk of memory
> should not cause memory fragmentation!
> A good libc allocator should not have a hard time with either approach.

It happens if there are other threads doing allocations. But even without other allocations, if realloc does not move the memory block, the amount of contiguous virtual memory is reduced. On a sane operating system you would have at least 2 GB of virtual memory, so this is no problem.
On Windows CE it is just 32 MB, with around 10-12 MB already prefragmented (because of an incredibly stupid shared-library loading mechanism).

In my case, avoiding the large estimates solved the out-of-memory problems.

Rudolf

> That’s a shame. Shrinking the most recently allocated chunk of memory
> should not cause memory fragmentation!
> A good libc allocator should not have a hard time with either approach.

> It happens if there are other threads doing allocations.

This makes sense. I think for most applications, a better allocation
heuristic would be to accept a little more average overhead, and avoid
the memory fragmentation from concurrent allocator accesses by giving
each thread its own space within which to allocate. (There’s certainly
room for more detailed heuristic analysis on how best to do that,
though.)

> But even without other allocations, if realloc does not move the memory block, the amount of contiguous virtual memory is reduced.

This is precisely the behavior I was criticizing. Good allocators
shouldn’t do this.

> On Windows CE it is just 32 MB, with around 10-12 MB already prefragmented (because of an incredibly stupid shared-library loading mechanism).

ugh

> In my case, avoiding the large estimates solved the out-of-memory problems.

:)

On Sun, Apr 12, 2009 at 8:28 AM, Rudolf Janz <rudolf.janz at gmx.net> wrote:


http://codebad.com/