YUV Overlay scaling

How can I downscale a 640x480 YUV420P image down to 320x240? I can 2x zoom a 320x240 image to 640x480, but it's a bit blocky. Can this be done in under 2-5 ms? Can it be done in hardware?

How can I downscale a 640x480 YUV420P image down to 320x240? I can 2x zoom a 320x240 image to 640x480, but it's a bit blocky. Can this be done in under 2-5 ms? Can it be done in hardware?

I didn't bother to check, but I think YUV420P would be close to an average of 16 bits/pixel.

A frame would take 640*480*2 bytes, and if you need to transfer it in 2 ms you need 640*480*2 / 0.002 / 2^20 = 292 MB/s of bandwidth. You should be able to show a frame in just under 5 ms with a Matrox G4XX combined with an Intel chipset and good drivers. Don't bother if you have a VIA KT133 chipset. You won't get even close - I have tried with both a Matrox G400 and an ATI Radeon, and both get me close to 30 MB/s with a YUV surface.

If you don't use Linux (I think none of the current display drivers support DMA for YUV surfaces) and happen to have DMA transfers for YUV surfaces, it might be a little easier to get under 5 ms.

BTW, the Matrox G400 driver for XFree 4.x had, at least half a year ago, a bug that corrupted the display when you used small YUV surfaces. I don't know if this has been fixed.

Also, if you want that 5 ms to be a hard limit, expect hard times. No desktop OS can guarantee this. You should get close with the preemptive kernel patches for Linux 2.4.10+.

-M

How can I downscale a 640x480 YUV420P image down to 320x240? I can 2x zoom a 320x240 image to 640x480, but it's a bit blocky. Can this be done in under 2-5 ms? Can it be done in hardware?

I didn't bother to check, but I think YUV420P would be close to an average of 16 bits/pixel.

12bpp

A frame would take 640*480*2 bytes, and if you need to transfer it in 2 ms you need 640*480*2 / 0.002 / 2^20 = 292 MB/s of bandwidth. You should be able to show a frame in just under 5 ms with a Matrox G4XX combined with an Intel chipset and good drivers. Don't bother if you have a VIA KT133 chipset. You won't get even close - I have tried with both a Matrox G400 and an ATI Radeon, and both get me close to 30 MB/s with a YUV surface.

What's wrong with the KT133/A? They can easily do 400-600 MB/s, and i815/BX do 300-450 MB/s?

If you don't use Linux (I think none of the current display drivers support DMA for YUV surfaces) and happen to have DMA transfers for YUV surfaces, it might be a little easier to get under 5 ms.

I want portability.

BTW, the Matrox G400 driver for XFree 4.x had, at least half a year ago, a bug that corrupted the display when you used small YUV surfaces. I don't know if this has been fixed.

Also, if you want that 5 ms to be a hard limit, expect hard times. No desktop OS can guarantee this. You should get close with the preemptive kernel patches for Linux 2.4.10+.

I'm capturing video + compressing with RTjpeg + displaying it + saving to disk. 25fps video capture leads to around 40 ms for a single cycle. A P3 866 + TNT2 M64 can handle 4 such streams. The last time I checked, it took me around 10 ms to merge 4 small images into a big YUV overlay and display it.

I'm fine with the YUV display speed, I just need to scale down the input images. With RGB data I would just skip every other pixel, but with YUV data I'm also going to need to merge stuff in the U and V planes, which means sum(u)/4, sum(v)/4 for each pixel. This will easily eat up 10 ms per frame :(

On Wed, 31 Oct 2001, Mikko Rantalainen wrote:

[…]

BTW, the Matrox G400 driver for XFree 4.x had, at least half a year ago, a bug that corrupted the display when you used small YUV surfaces. I don't know if this has been fixed.

Also, if you want that 5 ms to be a hard limit, expect hard times. No desktop OS can guarantee this. You should get close with the preemptive kernel patches for Linux 2.4.10+.

I'm capturing video + compressing with RTjpeg + displaying it + saving to disk. 25fps video capture leads to around 40 ms for a single cycle. A P3 866 + TNT2 M64 can handle 4 such streams. The last time I checked, it took me around 10 ms to merge 4 small images into a big YUV overlay and display it.

I'm fine with the YUV display speed, I just need to scale down the input images. With RGB data I would just skip every other pixel, but with YUV data I'm also going to need to merge stuff in the U and V planes, which means sum(u)/4, sum(v)/4 for each pixel. This will easily eat up 10 ms per frame :(

I’m not sure why you would need to do that with YUV but not with RGB…
(YUV and video stuff is pretty much the only field of graphics I haven’t
been playing with.)

Anyway, if you need to do that, use a right shift of 2 units instead of
divisions. (Division is the slowest ALU operation there is on most CPUs.)
And, instead of doing it all on one component at a time, use SIMD style
code. (Possible even with normal instructions, in pure C code.)

Just stuff as many components as you can fit into a 32-bit int, making sure you have a sufficient number of "scratch" bits between the fields. Then do the maths as usual, and finally clear the scratch bits and shift + OR the components together. This works for multiplications as well; just add the number of bits per component together to get the field+scratch size.

Have a look at the SDL blitting code for examples of this method.
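For what it's worth, here is a minimal sketch of that packing trick in plain C, assuming 8-bit samples stored two per 32-bit word in 16-bit fields, so each field has 8 scratch bits (the names are made up for illustration, not taken from SDL):

#include <stdint.h>

/* Rounded average of four packed values, two 8-bit components per
 * 32-bit word in the layout 0x00XX00YY. The sum of four samples is at
 * most 1020 (10 bits), so with 8 scratch bits per field nothing can
 * overflow into the neighbouring component. */
static uint32_t avg4_packed(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t sum = a + b + c + d;      /* per-field sums                */
    sum += 0x00020002;                 /* +2 per field for rounding     */
    return (sum >> 2) & 0x00FF00FF;    /* /4, then clear scratch bits   */
}

/* Spread two bytes into the 0x00XX00YY layout. */
static uint32_t spread2(const uint8_t *p)
{
    return ((uint32_t)p[0] << 16) | (uint32_t)p[1];
}

Two U (or V) samples get averaged per 32-bit add this way; MMX is the same idea widened to eight bytes per register.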

//David Olofson -- Programmer, Reologica Instruments AB

.- M A I A -------------------------------------------------.
| Multimedia Application Integration Architecture |
| A Free/Open Source Plugin API for Professional Multimedia |
`----------------------------> http://www.linuxdj.com/maia -'
.- David Olofson -------------------------------------------.
| Audio Hacker - Open Source Advocate - Singer - Songwriter |
`-------------------------------------> http://olofson.net -'

On Wednesday 31 October 2001 23:58, Robert D. wrote:

How can I downscale a 640x480 YUV420P image down to 320x240? I can 2x

I didn't bother to check, but I think YUV420P would be close to an average of 16 bits/pixel.

12bpp

OK. That makes it ~220 MB/s - not a big difference.

chipset and good drivers. Don't bother if you have a VIA KT133 chipset.

What's wrong with the KT133/A? They can easily do 400-600 MB/s, and i815/BX do 300-450 MB/s?

Well, I don't know for sure about the KT133A, but I have an MSI K7T Pro with the KT133 (without the A) and it's lagging in Linux. I cannot get more than 30 MB/s even through DGA. You can test your performance - just run "dga" and press "b" (and "q" to quit). Under Windows I get a nice write speed, but the Windows drivers use DMA for system memory to VRAM transfers. I haven't benchmarked since I upgraded from the G400 to an ATI Radeon, but an app that I use to filter a live TV stream runs pretty much as slowly as it did with the G400 under Linux. It takes something like 80% of the CPU, and half of that is used to show the YUV overlay. Since SDL internally doesn't copy overlay data when hardware support is available, it's the driver that should be using DMA instead of spending half the CPU copying the surface to video memory.

I'm still unable to figure out whether this is because the chipset sucks, because Linux doesn't initialize the chipset correctly, or simply because the display drivers suck. I'm running 2.4.5 and haven't checked if this is fixed in later kernels.

System memory to system memory transfer speed is what it should be, and OpenGL stuff is working at the expected performance level and reporting AGP 4x. Still, it takes forever for the CPU to write directly to VRAM.

If you don't use Linux (I think none of the current display drivers support DMA for YUV surfaces) and happen to have DMA transfers for YUV surfaces, it might be a little easier to get under 5 ms.

I want portability.

You don't have to code anything for DMA support. It's the video drivers that should use DMA, but they usually don't under Linux because X makes it hard. With YUV overlays the final image size doesn't matter, only the amount of data that needs to be transferred to the card.

Can you capture video in a portable way? I use bttv under Linux, but I think it doesn't run on other platforms.

I'm capturing video + compressing with RTjpeg + displaying it + saving to disk. 25fps video capture leads to around 40 ms for a single cycle. A P3 866 + TNT2 M64 can handle 4 such streams. The last time I checked, it took me around 10 ms to merge 4 small images into a big YUV overlay and display it.

Let me get this straight. You're able to capture (via bttv?) 640x480 at 25fps from 4 devices in parallel, compress all of that with RTjpeg, and save the results to disk. Huh? And I thought capturing 720x576 at 25fps, filtering it and showing the resulting image without dropping frames was hard.

I'm fine with the YUV display speed, I just need to scale down the input images. With RGB data I would just skip every other pixel, but with YUV data I'm also going to need to merge stuff in the U and V planes, which means sum(u)/4, sum(v)/4 for each pixel. This will easily eat up 10 ms per frame :(

If you don't require HQ downscaling, you can skip pixels in YUV too instead of averaging two or more pixels. I cannot find specs for YUV420P - is it the same as an 8-bit Y plane followed by 2x2-subsampled U and V planes? That is, U and V swapped relative to YV12. Just take every other pixel from the Y plane and every other pixel from the U/V planes, and discard half the lines.
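A minimal sketch of that skip-pixels approach in C, assuming the Y-then-2x2-subsampled-U-and-V layout speculated above (all names are made up for illustration):

#include <stdint.h>

/* Point-sampled 2:1 downscale of one plane: keep every other pixel on
 * every other line. For YUV420P, call this once per plane (the Y plane
 * at full size, U and V at half size in each direction). */
static void halve_plane(const uint8_t *src, int src_w, int src_h,
                        int src_pitch, uint8_t *dst, int dst_pitch)
{
    for (int y = 0; y < src_h / 2; y++) {
        const uint8_t *s = src + (size_t)(y * 2) * src_pitch;
        uint8_t *d = dst + (size_t)y * dst_pitch;
        for (int x = 0; x < src_w / 2; x++)
            d[x] = s[x * 2];
    }
}

/* 640x480 YUV420P -> 320x240: three calls, one per plane. */
static void halve_yuv420p(const uint8_t *y, const uint8_t *u,
                          const uint8_t *v,
                          uint8_t *oy, uint8_t *ou, uint8_t *ov)
{
    halve_plane(y, 640, 480, 640, oy, 320);   /* Y: 640x480 -> 320x240 */
    halve_plane(u, 320, 240, 320, ou, 160);   /* U: 320x240 -> 160x120 */
    halve_plane(v, 320, 240, 320, ov, 160);   /* V: 320x240 -> 160x120 */
}

Since this only reads a quarter of the source samples and does no arithmetic at all, it should stay well under the 10 ms budget.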

-M

I’m not sure why you would need to do that with YUV but not with RGB…
(YUV and video stuff is pretty much the only field of graphics I haven’t
been playing with.)

It’s faster to compress YUV images. RTjpeg compression requires YUV420P as
the input format. Also, by using YUV420P I reduce my PCI/RAM/VIDEO usage by 20%.

Anyway, if you need to do that, use a right shift of 2 units instead of
divisions. (Division is the slowest ALU operation there is on most CPUs.)

Obviously. That’s why I was trying to avoid the issue.

And, instead of doing it all on one component at a time, use SIMD style
code. (Possible even with normal instructions, in pure C code.)

How can I do that? I’ve done simple asm programming but nothing with MMX
or SIMD.

Just stuff as many components as you can fit into a 32-bit int, making sure you have a sufficient number of "scratch" bits between the fields. Then do the maths as usual, and finally clear the scratch bits and shift + OR the components together. This works for multiplications as well; just add the number of bits per component together to get the field+scratch size.
Have a look at the SDL blitting code for examples of this method.

Cool - any hints on which file has this code?

Can you capture video in a portable way? I use bttv under Linux, but I think it doesn't run on other platforms.

I wrote a small wrapper API on top of the v4l calls. All I need is open/start/capture/read calls. The capture() just sits in a thread, and the read() tells me which buffers are used.

I haven’t dared to look at DirectX video capture stuff yet.
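The wrapper itself isn't shown here, but an interface along those lines might look something like this (every name below is hypothetical, made up for illustration):

#include <stdint.h>

/* Hypothetical capture wrapper in the spirit of the
 * open/start/capture/read calls described above. */
typedef struct capture_dev capture_dev;    /* opaque; hides the v4l bits */

capture_dev   *cap_open(const char *device, int w, int h);
int            cap_start(capture_dev *dev); /* spawns the capture thread */
int            cap_read(capture_dev *dev);  /* index of a filled buffer,
                                               or -1 on error            */
const uint8_t *cap_buffer(capture_dev *dev, int index);
void           cap_close(capture_dev *dev);

Any other backend (v4l today, something else later) would then only need to implement the same five calls.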

I'm capturing video + compressing with RTjpeg + displaying it + saving to disk. 25fps video capture leads to around 40 ms for a single cycle. A P3 866 + TNT2 M64 can handle 4 such streams. The last time I checked, it took me around 10 ms to merge 4 small images into a big YUV overlay and display it.

Let me get this straight. You're able to capture (via bttv?) 640x480 at 25fps from 4 devices in parallel, compress all of that with RTjpeg, and save the results to disk.

4x 384x288 at 25fps on a P3 866. 640x480 should be fine on a dual P3 866 or a T'bird 1.4GHz+.

Keep in mind that RTjpeg is a really lousy compressor; each stream is about 8 Mbit/s! 4 streams of 640x480 could easily max out new 5400 RPM / old 7200 RPM IDE drives.

Huh? And I thought capturing 720x576 at 25fps, filtering it and showing the resulting image without dropping frames was hard.

If I go higher than 384x288 I'll probably need to de-interlace the images, otherwise there's going to be tearing during motion. I've got no idea how slow that'll be.

I simply want to put more streams on an 800x600 screen at the same time. I could run the thing at 1600x1200, but I'd just be wasting PCI/RAM/VIDEO/DISK bandwidth.

I'm fine with the YUV display speed, I just need to scale down the input images. With RGB data I would just skip every other pixel, but with YUV data I'm also going to need to merge stuff in the U and V planes, which means sum(u)/4, sum(v)/4 for each pixel. This will easily eat up 10 ms per frame :(

If you don't require HQ downscaling, you can skip pixels in YUV too instead of averaging two or more pixels. I cannot find specs for YUV420P - is it the same as an 8-bit Y plane followed by 2x2-subsampled U and V planes? That is, U and V swapped relative to YV12.

2x2 pixels

[RGB]
R1,R2, G1,G2, B1,B2
R3,R4, G3,G4, B3,B4

[YUV420P]
Y1,Y2, U1, V1
Y3,Y4,

U1/V1 are responsible for 4 Y pixels!

Just take every other pixel from the Y plane and every other pixel from the U/V planes, and discard half the lines.

I think that's gonna look really baaaaad :(

Huh? And I thought capturing 720x576 at 25fps, filtering it and showing the resulting image without dropping frames was hard.

If I go higher than 384x288 I'll probably need to de-interlace the images, otherwise there's going to be tearing during motion. I've got no idea how slow that'll be.

In fact, the "filtering" I mentioned before was deinterlacing & lousy noise removal. My deinterlacer takes about 20-25 ms per 720x576 YV12 frame depending on noise (branch prediction misses). My routine could easily be modified to convert interlaced 720x576 video to progressive 720x576 at twice the framerate - I would have done this, but it takes 12-15 ms to show an image via YUV overlay with my hardware, so I cannot display 720x576 at 50fps. An optimized MMX version should be able to deinterlace and remove noise in less than that; I think 15-18 ms should be close. The times above are for a Duron 650/KT133/Matrox G400.

You could also capture 640x288. Non-square pixels, of course.
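For illustration only (this is not Mikko's routine), a minimal "bob" style deinterlacer along those lines: each field is expanded to a full progressive frame, with the missing lines averaged from their neighbours, giving twice the frame rate:

#include <stdint.h>

/* Expand one field of an interlaced plane to a progressive plane of
 * the same size. Call once with top_field=1 and once with top_field=0
 * per captured frame to get two progressive frames. src and dst are
 * w x h planes with the same pitch. */
static void bob_field(const uint8_t *src, int w, int h, int pitch,
                      int top_field, uint8_t *dst)
{
    for (int y = 0; y < h; y++) {
        uint8_t *out = dst + (size_t)y * pitch;
        if ((y & 1) == !top_field) {
            /* this line belongs to the field we keep: copy it */
            const uint8_t *in = src + (size_t)y * pitch;
            for (int x = 0; x < w; x++)
                out[x] = in[x];
        } else {
            /* missing line: average the kept lines above and below */
            const uint8_t *a = src + (size_t)(y > 0 ? y - 1 : y + 1) * pitch;
            const uint8_t *b = src + (size_t)(y < h - 1 ? y + 1 : y - 1) * pitch;
            for (int x = 0; x < w; x++)
                out[x] = (uint8_t)((a[x] + b[x] + 1) >> 1);
        }
    }
}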

[RGB]
R1,R2, G1,G2, B1,B2
R3,R4, G3,G4, B3,B4

[YUV420P]
Y1,Y2, U1, V1
Y3,Y4,

[YV12]
Y1,Y2, V1, U1
Y3,Y4,

That is, exactly the same, but the V plane comes before the U plane. I can only wonder why we have so many YUV surface types…
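In code, the difference is nothing more than where the two chroma planes start. A small sketch, assuming YUV420P really is Y,U,V order as speculated above (the function name is made up):

#include <stddef.h>
#include <stdint.h>

/* Locate the three planes of a w x h planar 4:2:0 frame. YV12 and
 * YUV420P differ only in the order of the two chroma planes. */
static void get_planes(uint8_t *frame, int w, int h, int is_yv12,
                       uint8_t **y, uint8_t **u, uint8_t **v)
{
    size_t ysize = (size_t)w * h;
    size_t csize = ysize / 4;          /* 2x2-subsampled chroma plane */

    *y = frame;
    if (is_yv12) {                     /* YV12: Y, then V, then U */
        *v = frame + ysize;
        *u = frame + ysize + csize;
    } else {                           /* YUV420P: Y, then U, then V */
        *u = frame + ysize;
        *v = frame + ysize + csize;
    }
}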

U1/V1 are responsible for 4 Y pixels!

Just take every other pixel from the Y plane and every other pixel from the U/V planes, and discard half the lines.

I think that's gonna look really baaaaad :(

Yes it will, but you cannot[1] do any better with RGB either without averaging four or more samples (pixels in this case). With RGB you have to divide the sums of R, G and B for every output pixel (3 divides/pixel). With YUV420P you have to divide the sum of Y pixels and, for every fourth output pixel, the sums of U and V (per four output pixels that's 4 + 1 + 1 = 6 divides, i.e. 1.5 divides/pixel). I really cannot see how using RGB could save anything.

You could also purchase a display adapter that has 4 or more YUV overlay ports supporting YUV420P and use the YUV420P data you already have… I don't know of any such adapter, though…

About removing the divide: ((Y1 + Y2 + Y3 + Y4 + 2) >> 2) == round((Y1 + Y2 + Y3 + Y4)/4.0), where Y[1-4] are unsigned. The same applies to U and V. I'd suggest you use YUV422 or something like that for display if you think that color information matters. You really should be using MMX, as this is what it's designed for.
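Plugged into the downscaler, that rounded shift gives the 2x2 box average with no divides at all. A minimal sketch for one plane (hypothetical names, one component at a time; the packed or MMX variants discussed above would process several at once):

#include <stdint.h>

/* 2:1 box downscale of one plane: each output sample is
 * round((a+b+c+d)/4), computed as (a+b+c+d+2) >> 2. */
static void halve_plane_avg(const uint8_t *src, int src_w, int src_h,
                            int src_pitch, uint8_t *dst, int dst_pitch)
{
    for (int y = 0; y < src_h / 2; y++) {
        const uint8_t *s0 = src + (size_t)(y * 2) * src_pitch;
        const uint8_t *s1 = s0 + src_pitch;   /* line below             */
        uint8_t *d = dst + (size_t)y * dst_pitch;
        for (int x = 0; x < src_w / 2; x++) {
            unsigned sum = s0[x * 2] + s0[x * 2 + 1]
                         + s1[x * 2] + s1[x * 2 + 1];
            d[x] = (uint8_t)((sum + 2) >> 2); /* rounded divide by four */
        }
    }
}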

This is getting OT. Reply to me only.

-Mikko

________________________
[1] Well… if you have input as

R1,R2, G1,G2, B1,B2
R3,R4, G3,G4, B3,B4

you could output

R1, G4, B2

instead of

(R1+R2+R3+R4)/4, (G1+G2+G3+G4)/4, (B1+B2+B3+B4)/4

and hope for the best.