Best way to blit 32-bit data to a non-32-bit screen

OK, first of all, I’d like to thank everyone who answered my OpenGL
questions, particularly Niels Wagenaar.

I now have another problem. The emulator I’m working on is based on an
8-bit palette. At the start of the program, the internal 256 color
palette is mapped to an SDL_Color palette that represents (as closely as
possible) those original colors. This works at 15, 16, 24, and 32-bit
color depth.

Now, we plan to go to a full 32-bit internal representation for the
internal palette so that we can use opacity, etc. My question is this:
Assuming each uInt32 in the internal framebuffer represents a pixel color
(in RGB 8888 format), what would be the fastest way to blit this to an
SDL surface that is not necessarily 32 bit?

The underlying core of the emulator is cross-platform, so the original
data isn’t stored in an SDL surface. Each update() call will receive the
framebuffer (an array of uInt32’s), and I have to take that, blit it to
an SDL surface, and update the screen. There will also be a dirty
bitmask for doing dirty updates.

So say I have to update a rectangle in the SDL surface at position (0,10)
and size (20, 50). If the SDL surface is on a 16-bit screen, and all I
have is an array of uInt32, what’s the fastest way to get that data into
the surface, assuming this could be done in multiple places on the
surface at 60 fps?

Or is it better to stay with a 256 color palette, and create other 256
color palettes for opacity (where each color in that array is
pre-computed to be 25% less bright)? And would doing it this way negate
moving to OpenGL rendering in the future?

I’m beginning to think that the latter is probably easier.

BTW, this emulator is for Linux and SDL.

Any info is greatly appreciated,
Steve

On Tue, 2003-10-07 at 17:11, Stephen Anthony wrote:

So say I have to update a rectangle in the SDL surface at position (0,10)
and size (20, 50). If the SDL surface is on a 16-bit screen, and all I
have is an array of uInt32, what’s the fastest way to get that data into
the surface, assuming this could be done in multiple places on the
surface at 60 fps?

To the best of my knowledge there is only one way to do that: you have
to extract the RGB values, shift them to match the size of the destination
field, and repack the pixel. That process requires a lot of operations.
If you really want to do that, I would look very carefully at the MMX
instruction set for a fast way to do it.
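
To make the repacking concrete, here is a rough sketch in plain C (no MMX)
of converting a rectangle of 0x00RRGGBB pixels to 16-bit 5-6-5. The buffer
and pitch names are only illustrative:

    #include <SDL.h>

    /* Convert a w x h rectangle of 0x00RRGGBB pixels to 16-bit 5-6-5.
     * src_pitch/dst_pitch are in pixels, not bytes. */
    static void repack_rect_8888_to_565(const Uint32 *src, int src_pitch,
                                        Uint16 *dst, int dst_pitch,
                                        int w, int h)
    {
        int x, y;
        for (y = 0; y < h; y++) {
            for (x = 0; x < w; x++) {
                Uint32 p = src[y * src_pitch + x];
                Uint32 r = (p >> 16) & 0xFF;
                Uint32 g = (p >>  8) & 0xFF;
                Uint32 b =  p        & 0xFF;
                /* drop the low bits: 8-8-8 -> 5-6-5 */
                dst[y * dst_pitch + x] =
                    (Uint16)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
            }
        }
    }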

Or is it better to stay with a 256 color palette, and create other 256
color palettes for opacity (where each color in that array is
pre-computed to be 25% less bright)?

Yes, much better. As you know, you can just index into a table and get
the pixel you need in the format you want. Very simple, always works.
The tables are small and fit easily in cache. A little loop unrolling
can be applied to make it even faster.
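
A minimal sketch of the table approach, assuming SDL 1.2 and a 16-bpp screen
surface (the palette and screen names, and the fixed Uint16 destination, are
assumptions):

    #include <SDL.h>

    static Uint16 lut[256];

    /* Build the 256-entry lookup table once, in the screen's own format. */
    static void build_lut(SDL_Surface *screen, const SDL_Color palette[256])
    {
        int i;
        for (i = 0; i < 256; i++)
            lut[i] = (Uint16)SDL_MapRGB(screen->format, palette[i].r,
                                        palette[i].g, palette[i].b);
    }

    /* Blitting is then one table lookup per pixel. */
    static void blit_indexed(const Uint8 *src, Uint16 *dst, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = lut[src[i]];
    }

In real code the destination type would follow the screen surface's
BytesPerPixel instead of being hard-coded to 16 bits.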

Does the machine you are emulating do most of its graphics with sprites?
Or is it mostly pixel bashing? Sprites can be accelerated very simply
without having to deal with any of these problems.

And would doing it this way negate
moving to OpenGL rendering in the future?

Don’t know enough about the project to answer that.

I’m beginning to think that the latter is probably easier.

That would be my bet.


Bob Pendleton


So say I have to update a rectangle in the SDL surface at position (0,10)
and size (20, 50). If the SDL surface is on a 16-bit screen, and all I
have is an array of uInt32, what’s the fastest way to get that data into
the surface, assuming this could be done in multiple places on the
surface at 60 fps?

To the best of my knowledge there is only one way to do that, you have
to extract the rgb values, shift to match the size of the destination
field, and repack the pixel. That process requires a lot of operations.
If you really want to do that I would look very carefully at the MMX
instruction set for a fast way to do it.

The Hermes library (http://www.clanlib.org/hermes/) does exactly that and
in a portable and fast way.

ciao, Ivan


The Hermes library (http://www.clanlib.org/hermes/) does exactly that and
in a portable and fast way.

FYI, SDL incorporates the Hermes MMX blitters (in addition to its own
carefully optimized C blitters)

See ya,
-Sam Lantinga, Software Engineer, Blizzard Entertainment

Now, we plan to go to a full 32-bit internal representation for the
internal palette so that we can use opacity, etc. My question is this:
Assuming each uInt32 in the internal framebuffer represents a pixel color
(in RGB 8888 format), what would be the fastest way to blit this to an
SDL surface that is not necessarily 32 bit?

Probably the fastest way to do this is to create a 32-bit SDL video surface
and let SDL convert from 32-bpp to the actual display depth. SDL contains
specially optimized blitters for converting from 32-bpp to common display
formats.
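
A minimal sketch of that suggestion with SDL 1.2: wrap the emulator's 32-bit
buffer in a surface and let SDL_BlitSurface() pick its optimized conversion
blitter (fb, FB_W, FB_H and the 0x00RRGGBB masks are assumptions about the
emulator's buffer):

    #include <SDL.h>

    /* Wrap the emulator's framebuffer; do this once, not every frame. */
    SDL_Surface *wrap_framebuffer(Uint32 *fb, int FB_W, int FB_H)
    {
        return SDL_CreateRGBSurfaceFrom(fb, FB_W, FB_H, 32, FB_W * 4,
                                        0x00FF0000,   /* R mask */
                                        0x0000FF00,   /* G mask */
                                        0x000000FF,   /* B mask */
                                        0);           /* no alpha */
    }

    /* Per frame: blit and flip; SDL converts to the screen depth. */
    void show_frame(SDL_Surface *screen, SDL_Surface *fbsurf)
    {
        SDL_BlitSurface(fbsurf, NULL, screen, NULL);
        SDL_Flip(screen);
    }

For dirty updates, pass per-rectangle SDL_Rects to SDL_BlitSurface() and
flush them with SDL_UpdateRects() instead of SDL_Flip().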

However, if your engine is up for it, 8-bpp is always faster, since there’s
much less data to move and fewer operations to get it to the display format.

See ya!
-Sam Lantinga, Software Engineer, Blizzard Entertainment

[snipped]

So say I have to update a rectangle in the SDL surface at position
(0,10) and size (20, 50). If the SDL surface is on a 16-bit screen,
and all I have is an array of uInt32, what’s the fastest way to get
that data into the surface, assuming this could be done in multiple
places on the surface at 60 fps?

To the best of my knowledge there is only one way to do that, you have
to extract the rgb values, shift to match the size of the destination
field, and repack the pixel. That process requires a lot of operations.
If you really want to do that I would look very carefully at the MMX
instruction set for a fast way to do it.

Yes, that’s why I do it right now at program startup and cache the LUT. I
didn’t think there was any other way to do it, but I just wanted to make
sure.

I hadn’t actually thought of creating separate LUT for opacity until I
wrote this message. Guess I sort of answered my own question. Sometimes
just structuring a question can give you the answer. :)

Does the machine you are emulating do most of its graphics with
sprites? Or is it mostly pixel bashing? Sprites can be accelerated very
simply without having to deal with any of these problems.

Well, it’s for the Stella emulator (Atari 2600), so there is no concept of
a sprite (at least not at the framebuffer level that the update() call
receives). It’s basically pixel access at 160x240. And the underlying
engine basically works by poking values into certain memory locations.

Thanks for the info,
Steve

Yes, after speaking with some other people, it seems to be a much better
option to stick with the 8-bit palette, and create other 8-bit palettes
for various opacity levels. Then the color information can be computed
at program start, and not every frame.
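
A small sketch of that startup computation (SDL 1.2; the level count, the
25% figure and all names are only illustrative):

    #include <SDL.h>

    #define NUM_LEVELS 2                /* e.g. normal and 25% dimmed */

    static Uint32 lut[NUM_LEVELS][256];

    /* Build one lookup table per opacity level, once at startup. */
    void build_palettes(SDL_Surface *screen, const SDL_Color base[256])
    {
        static const int percent[NUM_LEVELS] = { 100, 75 };
        int level, i;

        for (level = 0; level < NUM_LEVELS; level++)
            for (i = 0; i < 256; i++)
                lut[level][i] = SDL_MapRGB(screen->format,
                                           (Uint8)(base[i].r * percent[level] / 100),
                                           (Uint8)(base[i].g * percent[level] / 100),
                                           (Uint8)(base[i].b * percent[level] / 100));
    }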

Thanks for the info,
Steve

On October 8, 2003 03:18 am, Sam Lantinga wrote:

Now, we plan to go to a full 32-bit internal representation for the
internal palette so that we can use opacity, etc. My question is
this: Assuming each uInt32 in the internal framebuffer represents a
pixel color (in RGB 8888 format), what would be the fastest way to
blit this to an SDL surface that is not necessarily 32 bit?

Probably the fastest way to do this is to create a 32-bit SDL video
surface and let SDL convert from 32-bpp to the actual display depth.
SDL contains specially optimized blitters for converting from 32-bpp to
common display formats.

However, if your engine is up for it, 8-bpp is always faster, since
there’s much less data to move and fewer operations to get it to the
display format.

[snipped]

Does the machine you are emulating do most of its graphics with
sprites? Or is it mostly pixel bashing? Sprites can be accelerated very
simply without having to deal with any of these problems.

And would doing it this way negate
moving to OpenGL rendering in the future?

Don’t know enough about the project to answer that.

Forgot to answer this part before. What I meant is this: assuming that I
go with multiple 256-color LUTs (or one 512-color LUT), how would this be
sent to OpenGL? Can you send a framebuffer of palette indices and the
color lookup table itself to an OpenGL texture, or does it work only on
15/16/32-bit values?

Steve

On October 7, 2003 08:39 pm, Bob Pendleton wrote:

[snipped]

Does the machine you are emulating do most of its graphics with
sprites? Or is it mostly pixel bashing? Sprites can be accelerated very
simply without having to deal with any of these problems.

And would doing it this way negate
moving to OpenGL rendering in the future?

Don’t know enough about the project to answer that.

Forgot to answer this part before. What I meant is this; assuming that I
go with multiple 256 LUT’s (or one 512 color LUT), how would this be sent
to OpenGL? Can you send a framebuffer of palette indices and the color
lookup table itself to an OpenGL texture, or does it work only on
15/16/32-bit values?

Take a look at glPixelMap() and glPixelTransfer(). You can set color maps
and then have OpenGL do a copy from a frame buffer of indices to an RGB
frame buffer. The maximum size of the tables is implementation
dependent. The minimum size (according to OpenGL.org) is 32, but I
suspect it is usually larger, since 32 is nearly useless. On my NVIDIA
card it is 65536, large enough to handle 16-bit indices.
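
A minimal sketch of that path (plain OpenGL 1.x in RGBA mode; the palette
arrays, holding values in [0,1], are assumed to come from the emulator):

    #include <GL/gl.h>

    /* Load the 256-entry index-to-RGBA maps once.  In RGBA mode, color-index
     * data drawn with glDrawPixels() is converted through these tables. */
    void load_palette_maps(const GLfloat r[256], const GLfloat g[256],
                           const GLfloat b[256], const GLfloat a[256])
    {
        glPixelMapfv(GL_PIXEL_MAP_I_TO_R, 256, r);
        glPixelMapfv(GL_PIXEL_MAP_I_TO_G, 256, g);
        glPixelMapfv(GL_PIXEL_MAP_I_TO_B, 256, b);
        glPixelMapfv(GL_PIXEL_MAP_I_TO_A, 256, a);
    }

    /* Per frame: draw the 8-bit index buffer at the current raster position. */
    void draw_indexed_frame(const GLubyte *indices, int w, int h)
    {
        glPixelStorei(GL_UNPACK_ALIGNMENT, 1);   /* rows are tightly packed */
        glDrawPixels(w, h, GL_COLOR_INDEX, GL_UNSIGNED_BYTE, indices);
    }

The raster position is set with glRasterPos2i(), and glPixelZoom() can scale
the image, though only with unfiltered pixel replication.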

	Bob Pendleton

On Wed, 2003-10-08 at 06:51, Stephen Anthony wrote:


Forgot to answer this part before. What I meant is this; assuming
that I go with multiple 256 LUT’s (or one 512 color LUT), how would
this be sent to OpenGL? Can you send a framebuffer of palette
indices and the color lookup table itself to an OpenGL texture, or
does it work only on 15/16/32-bit values?

Take a look at glPixelMap() and glPixelTransfer. You can set color maps
and then have OpenGL do a copy from a frame buffer of indices to an rgb
frame buffer. The maximum size of the tables is implementation
dependent. The minimum size (according to OpenGL.org) is 32, but I
suspect it is usually larger since 32 is nearly useless. On my NVidia
card it is 65536, large enough to handle 16 bit indices.

Thanks, this sounds like exactly what I would need. Do you know if these
calls are hardware-accelerated in most OpenGL drivers?

Part of the reason for adding an OpenGL target is to get (basically) free
resizing and rotation with the associated filtering. Would this still be
possible with glPixelMap() and glPixelTransfer? Or would I have to use
textures instead? And if so, what are your suggestions for updating a
texture with colormap data?

On a related topic, since I would have two buffers (one containing 32-bit
values from the emulation core and the other a cached CLUT), which would
be faster in OpenGL? Using the CLUT with glPixelMap() and
glPixelTransfer(), or using the 32-bit values directly with a texture
(and associated texture commands)?

Thanks for the info,
Steve

On October 8, 2003 12:13 pm, Bob Pendleton wrote:

Forgot to answer this part before. What I meant is this; assuming
that I go with multiple 256 LUT’s (or one 512 color LUT), how would
this be sent to OpenGL? Can you send a framebuffer of palette
indices and the color lookup table itself to an OpenGL texture, or
does it work only on 15/16/32-bit values?

Take a look at glPixelMap() and glPixelTransfer. You can set color maps
and then have OpenGL do a copy from a frame buffer of indices to an rgb
frame buffer. The maximum size of the tables is implementation
dependent. The minimum size (according to OpenGL.org) is 32, but I
suspect it is usually larger since 32 is nearly useless. On my NVidia
card it is 65536, large enough to handle 16 bit indices.

Thanks, this sounds like exactly what I would need. Do you know if these
calls are hardware-accelerated in most OpenGL drivers?

I do not know. I would suggest looking at product specs and writing a
test program that you can send out to people to try so that you can
collect some data on that. If you do get good data on this, please
share. :)

Part of the reason for adding an OpenGL target is to get (basically) free
resizing and rotation with the associated filtering. Would this still be
possible with glPixelMap() and glPixelTransfer? Or would I have to use
textures instead? And if so, what are your suggestions for updating a
texture with colormap data?

I don’t know that answer either! I would suggest looking at pbuffers.

On a related topic, since I would have two buffers (one containing 32-bit
values from the emulation core and the other a cached CLUT), which would
be faster in OpenGL? Using the CLUT with glPixelMap() and
glPixelTransfer(), or using the 32-bit values directly with a texture
(and associated texture commands)?

I think the key here is that both require moving large amounts of data
from main memory to video memory. The CLUT approach moves 25% as much
data as the 32 bit approach. So, right there it has a 4 to 1 speed
advantage over the 32 bit approach. The rest of the analysis depends on
things that I don’t know.
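
(As a rough back-of-the-envelope figure: the 160x240 framebuffer mentioned
earlier is 38,400 pixels, so one full frame is about 38 KB at 8 bpp versus
about 150 KB at 32 bpp, or roughly 2.3 MB/s versus 9.2 MB/s at 60 fps.)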

Another thing to consider is that the 32-bit format will not be directly
usable on all systems. On Linux/X11, if I have selected a desktop depth
of 16 bits, I can only get 16-bit OpenGL visuals even though my video
card supports 32-bit visuals. That is true for windowed and full screen
applications, and it is pretty much true for all windowed OpenGL
applications on all OSes: the window visual depth must match the desktop
visual depth. (I have helped build systems where each window can have
its own visual, but I haven’t seen anything like that in 10 years now.
Of course, I haven’t been looking either. :) ) You cannot assume that you
can directly display a 32-bit color buffer. You have to adapt to what
the customer has and how they have it configured. It seems much easier
to expand 8-bit data to match the target visual than to shrink 32-bit
data down to 16 bits.
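
A small sketch of "adapt to what the customer has" with SDL 1.2: query the
desktop format before choosing a rendering path (call this after SDL_Init()
and before SDL_SetVideoMode()):

    #include <SDL.h>
    #include <stdio.h>

    /* Report the desktop depth so the renderer can pick its conversion path. */
    void report_desktop_depth(void)
    {
        const SDL_VideoInfo *info = SDL_GetVideoInfo();
        if (info && info->vfmt)
            printf("Desktop depth: %d bpp\n", info->vfmt->BitsPerPixel);
    }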

			Bob Pendleton


	Hope I've helped,
		Bob Pendleton


Stephen Anthony wrote:

Take a look at glPixelMap() and glPixelTransfer. You can set color maps
and then have OpenGL do a copy from a frame buffer of indices to an rgb
frame buffer. The maximum size of the tables is implementation
dependent. The minimum size (according to OpenGL.org) is 32, but I
suspect it is usually larger since 32 is nearly useless. On my NVidia
card it is 65536, large enough to handle 16 bit indices.

Thanks, this sounds like exactly what I would need. Do you know if these
calls are hardware-accelerated in most OpenGL drivers?

I don’t think those are accelerated, especially if they need format
conversion. However, having hardware acceleration or not for those
operations will not be your biggest speed issue. The bottleneck is the
AGP bus when uploading large surfaces to video memory.

Part of the reason for adding an OpenGL target is to get (basically) free
resizing and rotation with the associated filtering. Would this still be
possible with glPixelMap() and glPixelTransfer?

Nope, you won’t have any filtering/rotation with glDrawPixels. You’re
really drawing independent pixels.
If you want resizing, you should do it using an OpenGL texture, or you
could write a scaling video backend for SDL that would scale the picture
before displaying it (using software scaling, or OpenGL if available).
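
A minimal sketch of the texture route (OpenGL 1.1-era calls; FB_W/FB_H, the
RGBA byte order of the emulator buffer, and the identity projection filling
the window are all assumptions):

    #include <GL/gl.h>

    enum { FB_W = 160, FB_H = 240, TEX_SIZE = 256 };  /* power-of-two texture */
    static GLuint tex;

    void init_texture(void)
    {
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, TEX_SIZE, TEX_SIZE, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, NULL);   /* allocate only */
    }

    /* Per frame: upload the 32-bit frame into the texture's corner and draw
     * one filtered quad stretched over the viewport. */
    void draw_frame(const void *fb)
    {
        const GLfloat s = (GLfloat)FB_W / TEX_SIZE;
        const GLfloat t = (GLfloat)FB_H / TEX_SIZE;

        glBindTexture(GL_TEXTURE_2D, tex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, FB_W, FB_H,
                        GL_RGBA, GL_UNSIGNED_BYTE, fb);
        glEnable(GL_TEXTURE_2D);
        glBegin(GL_QUADS);
        glTexCoord2f(0, 0); glVertex2f(-1,  1);
        glTexCoord2f(s, 0); glVertex2f( 1,  1);
        glTexCoord2f(s, t); glVertex2f( 1, -1);
        glTexCoord2f(0, t); glVertex2f(-1, -1);
        glEnd();
    }

If the uInt32 pixels are 0x00RRGGBB, their in-memory byte order on a
little-endian machine is B,G,R,x, so a real port may need GL_BGRA
(OpenGL 1.2) or a swizzle instead of GL_RGBA.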

Or would I have to use
textures instead? And if so, what are your suggestions for updating a
texture with colormap data?

There’s an OpenGL extension (GL_EXT_paletted_texture) that does what you
want, but it’s not too widely supported (see
http://delphi3d.net/hardware/extsupport.php?extension=GL_EXT_paletted_texture).
Notably, ATI cards don’t have it.
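
For completeness, a heavily hedged sketch of how that extension is typically
used when it is advertised (the GL_COLOR_INDEX8_EXT token and the function
pointer setup are spelled out in case glext.h is not available; the function
is fetched at runtime with SDL 1.2's SDL_GL_GetProcAddress()):

    #include <SDL.h>
    #include <GL/gl.h>
    #include <string.h>

    #ifndef GL_COLOR_INDEX8_EXT
    #define GL_COLOR_INDEX8_EXT 0x80E5
    #endif

    typedef void (APIENTRY *ColorTableEXTFn)(GLenum target, GLenum internalFormat,
                                             GLsizei width, GLenum format,
                                             GLenum type, const GLvoid *table);
    static ColorTableEXTFn pglColorTableEXT;

    /* Returns 1 if the paletted-texture path can be used. */
    int init_paletted_texture(const GLubyte palette_rgb[256 * 3])
    {
        const char *ext = (const char *)glGetString(GL_EXTENSIONS);
        if (!ext || !strstr(ext, "GL_EXT_paletted_texture"))
            return 0;                              /* e.g. most ATI cards */

        pglColorTableEXT =
            (ColorTableEXTFn)SDL_GL_GetProcAddress("glColorTableEXT");
        if (!pglColorTableEXT)
            return 0;

        /* With the target texture bound, attach the 256-color palette; the
         * texture itself is then created with glTexImage2D(...,
         * GL_COLOR_INDEX8_EXT, ..., GL_COLOR_INDEX, GL_UNSIGNED_BYTE, data). */
        pglColorTableEXT(GL_TEXTURE_2D, GL_RGB8, 256, GL_RGB,
                         GL_UNSIGNED_BYTE, palette_rgb);
        return 1;
    }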

On a related topic, since I would have two buffers (one containing 32-bit
values from the emulation core and the other a cached CLUT), which would
be faster in OpenGL? Using the CLUT with glPixelMap() and
glPixelTransfer(), or using the 32-bit values directly with a texture
(and associated texture commands)?

David Olofson and I wrote such a benchmark for glSDL. It also benchmarks
how glDrawPixels and texture creation/blitting/deletion mix with
standard rendering:
http://icps.u-strasbg.fr/~marchesin/benchd1.tar.gz

And there are some quickly generated results for this as PostScript
(using a square surface; the X axis is texture width/height, the Y axis
is rendering time):
http://icps.u-strasbg.fr/~marchesin/bench.ps

gf4 is a GeForce 4 Ti 4200, firegl is an ATI Fire GL, drawpix means
using glDrawPixels, and tex means creating and drawing a texture.

So:

  • on an NVIDIA card you should use a texture
  • on an ATI card you should use glDrawPixels

Or at least, that’s what it seems, judging from our (ridiculously) small
test sample.

Feel free to add your results!

Stephane