Dear all,
I have been thinking about the problem of optimising bitblit operations, and I
believe that I have found a good general solution, though it is far from
complete.
The following is a general description of my proposed solution. I would be
grateful for any criticism, comments, suggestions, etc. I have a prototype
mostly implemented, which currently interprets a parsed tree structure,
though I am in the process of writing a simple compiler. Though my background
is in ARM coding and optimisation, I’m tackling x86 as best I can.
The correct model to use, I believe, is that of regex facilities in many
libraries, including GNU libc. In essence, the intended result is specified
in a suitable language. This is compiled, and may then be executed as and
when needed.
A blit is specified using a language vaguely inspired by LISP. The destination
pixel is specified in terms of the coordinates of that pixel. For example,
here is a ‘zero fill’ operation:
zerofill16bpp(x y)=(setwidth 0 16)
And here is a ‘fill with interesting colours’ operation:
myblit(x y)=(packbits (setwidth x 5) (setwidth y 6) (setwidth x 5))
General characteristics of the destination surface (bpp, etc) and blit rect
are supplied to the compiler. Flags allow optimisation for special cases. For
example, if the blit rect is fixed.
A blit can have 0 or more sources. Each of these have their own
characteristics and flags.
Here is a simple ‘copy’ blit. It copies from source 0:
copyblit16bpp(x y)=(fetch 0 x y)
More interesting, here is a simple 1-bit mask. Source 2 is the same as the
destination, source 1 is the mask, and source 0 is the image.
maskblit16bpp(x y)=(if (eq (fetch 1 x y) 1) (fetch 0 x y) (fetch 2 x y))
(Highly optimised image processing filters are also a distinct possibility.)
It is important to note a few important things:
- The language described is inherently parallel. It does not specify ordering.
- The operations described operate (theoretically) on pixels or usually small
numbers of bits. These can often be combined into a 32-bit or wider
operation. - The compiler can easily discover CPU features such as MMX and optimise at
runtime.
As such, it is relatively easy to optimise for specific features. (I say
’relatively’ easy, since optimisation is not particularly easy.)
I’d be grateful for any comments. Would people find this useful? Would anyone
be interested in collaborating on this?
Jake Waskett