HW acceleration

slouken · August 17, 1999, 8:37pm

Hi guys (and gals!)
I have, as usual, a couple of questions.
I’m wondering why SDL_GetVideoInfo can’t detect
any HW acceleration on my S3 (ViRGE DX).

DGA currently supports NO hardware acceleration.
This will change with XFree86 4.0

I know how to render a 640x480 screen from a 320x200 screen via the CPU.
But I was looking for something smarter than just taking each pixel and
writing it to a buffer twice (and the same for each scanline…)

If the 320x200 screen is 8-bit, you can build a 32-bit lookup table
which contains two pixels for every original pixel. The other
advantage of this is that you get 16-bit depth support for free.
Then, when you’re finished with a scanline, you can double it using
memcpy().

This is a very fast operation on modern systems, using software
surfaces.
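
As a rough illustration of the lookup-table approach, here is a minimal C sketch. It assumes an 8-bit source, a 16-bit destination, and a palette that has already been converted to the 16-bit screen format. The names build_double_table, scale2x_lut, palette16 and dst_pitch_words are hypothetical and not part of SDL.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build a table that maps each 8-bit palette index to two identical
 * 16-bit pixels packed into one 32-bit word.
 * palette16 is a hypothetical array of 256 colours in the 16-bit
 * screen format. */
static void build_double_table(const uint16_t palette16[256],
                               uint32_t table[256])
{
    for (int i = 0; i < 256; i++) {
        uint32_t p = palette16[i];
        /* Both halves hold the same pixel, so byte order does not matter. */
        table[i] = (p << 16) | p;
    }
}

/* Scale an 8-bit image (w x h) to a 16-bit image (2w x 2h).
 * src_pitch is in bytes; dst_pitch_words is in 32-bit words and must
 * be at least w, since each source pixel becomes one 32-bit word. */
static void scale2x_lut(const uint8_t *src, int w, int h, int src_pitch,
                        uint32_t *dst, int dst_pitch_words,
                        const uint32_t table[256])
{
    for (int y = 0; y < h; y++) {
        const uint8_t *s = src + (size_t)y * src_pitch;
        uint32_t *row0 = dst + (size_t)(2 * y) * dst_pitch_words;
        uint32_t *row1 = row0 + dst_pitch_words;

        /* One lookup writes two destination pixels. */
        for (int x = 0; x < w; x++)
            row0[x] = table[s[x]];

        /* Double the finished scanline. */
        memcpy(row1, row0, (size_t)w * sizeof(uint32_t));
    }
}
```

Each lookup produces two horizontally adjacent pixels, and the memcpy() supplies the second scanline, so the inner loop touches every source pixel only once.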

-Sam Lantinga				(slouken at devolution.com)

Lead Programmer, Loki Entertainment Software
--
“Any sufficiently advanced bug is indistinguishable from a feature”
-- Rich Kulawiec

slouken · August 17, 1999, 11:44pm

DGA currently supports NO hardware acceleration.
This will change with XFree86 4.0

Umh…so actually, no gfx board can perform a hw blit (or any other
hw acceleration) under SDL->DGA->X11?
Maybe I misunderstood your statement.

That’s correct.

-Sam Lantinga				(slouken at devolution.com)

Marco_Salvi · August 18, 1999, 2:10am

Hi guys (and gals!)
I have, as usual, a couple of questions.
I’m wondering why SDL_GetVideoInfo can’t detect
any HW acceleration on my S3 (ViRGE DX).
I mean, I can’t use ANY of the hw acceleration
provided by SDL: no hw blitting, no double buffering.

This is very strange, because I know for sure that the
S3 ViRGE supports hw blitting and double buffering.
I also used the test programs in the SDL/test dir to detect
my gfx board’s acceleration, to rule out mistakes
in my own detection code.
The result is the same.

Now what? Is there a way to force hw blitting, or a way
to make the detection work correctly?
Maybe this is not an SDL fault but a DGA one, I just don’t know.
Any hints?

The second question is: are there hw facilities to perform a 2x
scale transformation on a graphics buffer?
I’m porting to Linux a game originally written to run in 320x240
(full screen).
Now…the Linux version will also run in windowed mode.
The problem is that a 320x240 window on a 1024x768 screen is very tiny.
So…I thought I would do all the rendering in a 320x240 buffer,
and then (if the player wants…) scale it to a 640x480 buffer…
in a window or fullscreen.

I know how to render a 640x480 screen from a 320x200 screen via the CPU.
But I was looking for something smarter than just taking each pixel and
writing it to a buffer twice (and the same for each scanline…)

ciao,
Marco

Marco_Salvi · August 18, 1999, 4:52am

Hello Sam

Hi guys (and gals!)
I have, as usual, a couple of questions.
I’m wondering why SDL_GetVideoInfo can’t detect
any HW acceleration on my S3 (ViRGE DX).

DGA currently supports NO hardware acceleration.
This will change with XFree86 4.0

Umh…so actually, no gfx board can perform a hw blit (or any other
hw acceleration) under SDL->DGA->X11?
Maybe I misunderstood your statement.

I know how to render a 640x480 screen from a 320x200 screen via the CPU.
But I was looking for something smarter than just taking each pixel and
writing it to a buffer twice (and the same for each scanline…)

If the 320x200 screen is 8-bit, you can build a 32-bit lookup table
which contains two pixels for every original pixel. The other
advantage of this is that you get 16-bit depth support for free.
Then, when you’re finished with a scanline, you can double it using
memcpy().

This is a very fast operation on modern systems, using software
surfaces.

Thanks! It seems like a very smart idea.

ciao,
Marco

On 17-Aug-99, you wrote:

dcsin_at_islandnet.c · August 18, 1999, 12:28am

I know how to render a 640x480 screen from a 320x200 screen via the CPU.
But I was looking for something smarter than just taking each pixel and
writing it to a buffer twice (and the same for each scanline…)

Rather than writing it twice, why not four times?

If the 320x200 screen is 8-bit, you can build a 32-bit lookup table
which contains two pixels for every original pixel. The other
advantage of this is that you get 16-bit depth support for free.
Then, when you’re finished with a scanline, you can double it using
memcpy().

This is a very fast operation on modern systems, using software
surfaces.

Would the cache thrashing slow this method down a whole lot? Using
lookup tables isn’t ALWAYS a good idea. I think a shift/or might
actually be faster. There are even better ways in assembly, but that
sort of prevents cross-platform development.

shift/or method:

dest = (source << 8) | source;
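
Spelled out over a whole row, the shift/or idea might look like the following C sketch. It assumes 8-bit pixels and a destination addressed in 16-bit units; double_row_shift_or is a hypothetical name used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Double one row of n 8-bit pixels horizontally.
 * Each 16-bit destination element receives two copies of one source
 * pixel; both bytes are identical, so byte order does not matter. */
static void double_row_shift_or(const uint8_t *src, uint16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t p = src[i];
        dst[i] = (uint16_t)((p << 8) | p);
    }
}
```

This needs no table in memory, at the cost of one shift and one OR per source pixel.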

80386 method:

mov al, source ; load the 8-bit pixel
mov ah, al ; copy it into the high byte
mov dest, ax ; store both copies with one 16-bit write

Wow. It’s been a LOOONG time since I’ve used assembly.

On Tue, 17 Aug 1999 13:37:25 -0700, Sam wrote:

Chuck_Homic · August 18, 1999, 3:16pm

dest = (source << 8) | source;

80386 method:

mov al, source
mov ah, al
mov dest, ax

DEAR GOD, THE PIPELINE STALL!!!

On Wed, 18 Aug 1999 dcsin at islandnet.com wrote:

Wow. It’s been a LOOONG time since I’ve used assembly.

I see that. However, I agree that this is better than using a lookup
table. How about this asm instead:

lodsd ; read 4 bytes from [esi] into eax, and increment esi
mov ebx,eax ; save rest for later

mov edx,eax ; copy into edx and ebp for masking
mov ebp,eax
shl edx,16 ; move into position
shl ebp,8

and eax,0x000000FF ; isolate bitmasks, and merge into output pixels
and edx,0xFF000000
and ebp,0x00FFFF00
or eax,edx
or eax,ebp

stosd ; save eax to [edi], and increment edi

;; do something similar for next 16 bits in ebx
;;

stosd

This is off the top of my head, so you might come up with something that
uses fewer shifts and masks, and doesn’t use EBP as a scratch register
(however, I left ECX open for counting the loops) but the advantage here
is that it utilizes the pipelines better (I’m assuming a Pentium or
better, here). So even though it is twice the size, it can do several of
these operations simultaneously, and when you’re done, it has extended
four pixels instead of one, using all 32 bits of the CPU.
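
For readers who prefer C, the same shift-and-mask steps can be sketched portably. This is only an illustration of the masks used in the assembly above, not a tuned implementation. It assumes the first pixel sits in the low byte of the input word (as after a little-endian 32-bit load), and expand2 and expand4 are hypothetical names.

```c
#include <stdint.h>

/* Duplicate the two 8-bit pixels held in the low 16 bits of `two`.
 * Bytes from high to low: xx xx p1 p0  ->  p1 p1 p0 p0.
 * Anything in the upper 16 bits of `two` is masked away. */
static uint32_t expand2(uint32_t two)
{
    return (two & 0x000000FFu)            /* p0 stays in byte 0          */
         | ((two << 8) & 0x00FFFF00u)     /* p0 to byte 1, p1 to byte 2  */
         | ((two << 16) & 0xFF000000u);   /* p1 to byte 3                */
}

/* Expand four packed 8-bit pixels into two 32-bit output words. */
static void expand4(uint32_t in, uint32_t out[2])
{
    out[0] = expand2(in);        /* pixels 0 and 1 */
    out[1] = expand2(in >> 16);  /* pixels 2 and 3 */
}
```

The second call shifts the upper two pixels down first, which is one way to handle the half of the word that the assembly keeps aside in ebx.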

On second thought, I think it would be better to load 16 bits at a time
from the input, so it wouldn’t waste the bx register. Oh well.

(I apologize, I’ve been itching to write some asm for a while…)

-Chuck

dcsin_at_islandnet.c · August 19, 1999, 2:10pm

dest = (source << 8) | source;

80386 method:

mov al, source
mov ah, al
mov dest, ax

DEAR GOD, THE PIPELINE STALL!!!

Yeah, I thought as much. My assembly days were pre-pentium so I don’t
know much of the details of optimizing for them. Also, I’ve never
actually owned an Intel CPU - I’ve always bought AMDs so the details
are different for those.

Wow. It’s been a LOOONG time since I’ve used assembly.

I see that. However, I agree that this is better than using a lookup
table. How about this asm instead:

It was sort of meant to be pseudo-assembly

lodsd ; read 4 bytes from [esi] into eax, and increment esi
mov ebx,eax ; save rest for later

mov edx,eax ; load into dx and bp for masking
mov ebp,eax
shl edx,16 ; move into position
shl ebp,8

and eax,0x000000FF ; isolate bitmasks, and merge into output pixels
and edx,0xFF000000
and ebp,0x00FFFF00
or eax,edx
or eax,ebp

stosd ; save eax to [edi], and increment edi

;; do something similar for next 16 bits in ebx
;;

stosd

I don’t feel like dissecting that right now, but why the ANDs? Using
masks like that requires a memory access, which is of course slow.

This is off the top of my head, so you might come up with something that
uses fewer shifts and masks, and doesn’t use EBP as a scratch register
(however, I left ECX open for counting the loops) but the advantage here
is that it utilizes the pipelines better (I’m assuming a Pentium or
better, here). So even though it is twice the size, it can do several of
these operations simultaneously, and when you’re done, it has extended
four pixels instead of one, using all 32 bits of the CPU.

Maybe MMX would help out in this situation. Too bad I don’t know
anything about those instructions.

On second thought, I think it would be better to load 16 bits at a time
from the input, so it wouldn’t waste the bx register. Oh well.

Agreed. That could save a lot of memory accesses.

(I apologize, I’ve been itching to write some asm for a while…)

No problem. I’ve been wishing I had the time to learn Pentium
optimizations and MMX for a while now. Especially after writing some
stuff that could really use it (like a lovely little 2D bumpmapper).

On Wed, 18 Aug 1999 11:16:11 -0400 (EDT), you wrote:

On Wed, 18 Aug 1999 dcsin at islandnet.com wrote:

Tomas_Andrle · August 20, 1999, 10:40am

Hi guys (and gals!)
I have, as usual, a couple of questions.
I’m wondering why SDL_GetVideoInfo can’t detect
any HW acceleration on my S3 (ViRGE DX).

DGA currently supports NO hardware acceleration.
This will change with XFree86 4.0

So there might even be some stretched HW blitting? That would be really cool,
because blitting a 320x200 image to 1024x768 with virtually any
accelerator (like the S3 ViRGE) will look great because of interpolation (and it
will be fast, too).