YUV speed?

Last night I finally got our DV decoder to use SDL. I can play 720x480
video in full-screen in a 2048x1536 modeline rather nicely, albeit
slowly… :wink:

My question is: what kind of speed should I be expecting? My first test
program looked like this:

    for (i = 0; i < 300; i++) {
        SDL_PumpEvents();
        SDL_LockYUVOverlay(overlay);
        memcpy(overlay->pixels, yuvdata, length);
        SDL_UnlockYUVOverlay(overlay);
        SDL_DisplayYUVOverlay(overlay, &screenrect);
    }

This took about 6-7 seconds for the 300 frames (real-time playback at
30 frames/sec would take 10) on my K6-2 running at 360 MHz, with a G400 MAX
dual-head, running XFree 4.0 from Rawhide. That seems awfully high to me:
the copy and display alone eat well over half of the real-time budget,
leaving very little room for actual processing (DV is about as heavy as
software MPEG-2 MP at ML, that is to say: very). Granted, the machine’s
running the stock RH6.2 kernel, i.e. no AGP kernel code. I’m attempting to
run it on 2.4.0-test1-ac3, but for some reason everything is segfaulting
under that kernel…
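
To put a number on "very little room left over", here is a minimal C sketch of the budget arithmetic. It assumes the 300-frame test is meant to be compared against 30 frames/sec playback; the helper names budget_fraction and ms_per_frame are made up for illustration and are not part of SDL.

```c
#include <assert.h>

/* Fraction of the real-time budget used: elapsed time divided by the
 * time the same number of frames would take at the target frame rate. */
static double budget_fraction(double elapsed_sec, unsigned frames, double fps)
{
    assert(frames > 0 && fps > 0.0);
    return elapsed_sec / ((double)frames / fps);
}

/* Average wall-clock time spent per frame, in milliseconds. */
static double ms_per_frame(double elapsed_sec, unsigned frames)
{
    assert(frames > 0);
    return elapsed_sec * 1000.0 / (double)frames;
}
```

With 300 frames in roughly 6.5 seconds, budget_fraction(6.5, 300, 30.0) comes out at about 0.65, i.e. around 22 ms of every 33 ms frame slot is spent just copying and displaying.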

Total bandwidth should be about 20MB/sec at 30 frames/sec, which shouldn’t
even make the bus break a sweat. Am I doing something else wrong? It’s
hard to tell from the smpeg code what the proper sequence is. smpeg seems
to do a SDL_Flip() in plaympeg.c, but I’m not sure how that works with
overlays.
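
As a rough check on the 20MB/sec figure, here is a minimal C sketch of the arithmetic. It assumes the frame is copied at 16 bits per pixel (packed 4:2:2 such as YUY2), which the post does not state; frame_bytes and bytes_per_second are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one frame for a given size and average bits per pixel.
 * YUY2 is packed 4:2:2 (16 bits/pixel); YV12 is planar 4:2:0 (12 bits/pixel). */
static size_t frame_bytes(size_t width, size_t height, size_t bits_per_pixel)
{
    assert((width * height * bits_per_pixel) % 8 == 0);
    return width * height * bits_per_pixel / 8;
}

/* Sustained copy rate in bytes per second at a given frame rate. */
static size_t bytes_per_second(size_t width, size_t height,
                               size_t bits_per_pixel, size_t fps)
{
    return frame_bytes(width, height, bits_per_pixel) * fps;
}
```

Under that assumption a 720x480 frame is 691,200 bytes, so 30 frames/sec works out to about 20.7 MB/sec. At 12 bits per pixel (YV12) the same calculation gives about 15.6 MB/sec.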

Sometime early this week I hope to get my hands on some HDTV material,
which I’ll be trying to display on this machine (but most definitely not
real-time), so I’ll have a killer test-case for the YUV overlays
(1920x1080). I also just noticed that I have to get the CVS XFree, since
there was a commit two days ago that enables texture-based overlays on
the G400. This is notable since I just ran across a table in CVS that says
regular overlay is limited to 1024x1024, whereas texture overlays can go
to 2046x2046.

I just spent a few minutes perusing the MGA driver in XFree CVS, and it
seems that the X server is doing memcpy’s of all the data. AGP isn’t used
for the transfer at all. This is a significant time drain: top(1) shows
almost equal usage by the test program and X. I would bet that a
profile of X during that time puts the vast majority of the cycles in
MGACopyData(), which is a simple striding memcpy. Things get worse if
you’re not using YUY2, since YV12 does a significant amount of work in
MGACopyMungedData():

    U32 *dst = (U32 *)dst1;
    for(j = 0; j < h; j++) {
        for(i = 0; i < w; i++) {
            dst[i] = src1[i << 1] | (src1[(i << 1) + 1] << 16) |
                     (src3[i] << 8) | (src2[i] << 24);
        }

    }

I’ll try to decipher what that loop does, and see whether it’s actually
needed or just a redundant format conversion (it looks vaguely like a
planar->packed conversion, maybe?). According to FourCC
(webartz.com/fourcc), the G400 supports YV12 natively in DirectX, so such
a conversion shouldn’t be necessary. Regardless, I’ve signed up for
Matrox’s developer program, in hopes of being able to get some kind of
hardware assisted copy set up from the XShm segment.
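
Reading the shifts in that loop on a little-endian machine, each 32-bit store would lay out its bytes as src1[2i], src3[i], src1[2i+1], src2[i]. That matches packed YUY2 (Y0 U Y1 V) if src1 is the luma plane and src3/src2 are the U and V planes. That mapping is an assumption, not something confirmed from the driver source. Under it, a planar-to-packed conversion could be sketched byte by byte as follows (pack_420_to_yuy2 and its parameters are hypothetical names, not the driver's):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack planar 4:2:0 (separate Y, U, V planes) into YUY2: bytes Y0 U Y1 V.
 * w and h must be even; each chroma row is shared by two luma rows.
 * Pitches are in bytes. */
static void pack_420_to_yuy2(const uint8_t *y, const uint8_t *u,
                             const uint8_t *v,
                             size_t y_pitch, size_t c_pitch,
                             uint8_t *dst, size_t dst_pitch,
                             size_t w, size_t h)
{
    assert(w % 2 == 0 && h % 2 == 0);
    for (size_t j = 0; j < h; j++) {
        const uint8_t *yrow = y + j * y_pitch;
        const uint8_t *urow = u + (j / 2) * c_pitch;
        const uint8_t *vrow = v + (j / 2) * c_pitch;
        uint8_t *out = dst + j * dst_pitch;
        /* One iteration emits two pixels (four bytes). */
        for (size_t i = 0; i < w / 2; i++) {
            out[4 * i + 0] = yrow[2 * i];
            out[4 * i + 1] = urow[i];
            out[4 * i + 2] = yrow[2 * i + 1];
            out[4 * i + 3] = vrow[i];
        }
    }
}
```

Writing single bytes instead of 32-bit words keeps the sketch independent of endianness, at the cost of the wider stores the quoted loop uses. Either way, every output byte is touched by the CPU, which would fit the observation that the YV12 path does more work than a plain copy.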

Anyway, you’ll probably be hearing more from me in the near future, as I
push as hard as I can on this stuff.

     Erik Walthinsen <@Erik_Walthinsen> - Staff Programmer @ OGI
    Quasar project - http://www.cse.ogi.edu/DISC/projects/quasar/

Video4Linux Two drivers and stuff - http://www.cse.ogi.edu/~omega/v4l2/
__
/ \ SEUL: Simple End-User Linux - http://www.seul.org/
| | M E G A Helping Linux become THE choice
\ / for the home or office user

Hello there. I have also been working on something myself for the last
few months. I haven’t been working with the G200/G400, nor X4.0, but under X3
I have been getting around 10-15 fps using 50% of a Celeron 533. I would be
most interested in any ideas that you might have for speeding this up.

-Benjamin Meyer


> Last night I finally got our DV decoder to use SDL. I can play 720x480
> video in full-screen in a 2048x1536 modeline rather nicely, albeit
> slowly… :wink:
>
> My question is: what kind of speed should I be expecting? My first test
> program looked like this:
>
>     for (i = 0; i < 300; i++) {
>         SDL_PumpEvents();
>         SDL_LockYUVOverlay(overlay);
>         memcpy(overlay->pixels, yuvdata, length);
>         SDL_UnlockYUVOverlay(overlay);
>         SDL_DisplayYUVOverlay(overlay, &screenrect);
>     }

If possible, lock the overlay and decode directly into overlay->pixels.
This will save a memcpy().

> Total bandwidth should be about 20MB/sec at 30 frames/sec, which shouldn’t
> even make the bus break a sweat. Am I doing something else wrong? It’s
> hard to tell from the smpeg code what the proper sequence is. smpeg seems
> to do a SDL_Flip() in plaympeg.c, but I’m not sure how that works with
> overlays.

What you’re probably running into is software YUV conversion to RGB.
SDL doesn’t have the X video extension support enabled by default
(there are other problems with XFree86 4.0 that prevent it), but if
you recompile from the source, you should get MUCH better framerates.

> Sometime early this week I hope to get my hands on some HDTV material,
> which I’ll be trying to display on this machine (but most definitely not
> real-time), so I’ll have a killer test-case for the YUV overlays
> (1920x1080). I also just noticed that I have to get the CVS XFree, since
> there was a commit two days ago that enables texture-based overlays on
> the G400. This is notable since I just ran across a table in CVS that says
> regular overlay is limited to 1024x1024, whereas texture overlays can go
> to 2046x2046.

Very cool.

> I just spent a few minutes perusing the MGA driver in XFree CVS, and it
> seems that the X server is doing memcpy’s of all the data. AGP isn’t used
> for the transfer at all. This is a significant time drain: top(1) shows
> almost equal usage by the test program and X. I would bet that a
> profile of X during that time puts the vast majority of the cycles in
> MGACopyData(), which is a simple striding memcpy. Things get worse if
> you’re not using YUY2, since YV12 does a significant amount of work in
> MGACopyMungedData():
>
>     U32 *dst = (U32 *)dst1;
>     for(j = 0; j < h; j++) {
>         for(i = 0; i < w; i++) {
>             dst[i] = src1[i << 1] | (src1[(i << 1) + 1] << 16) |
>                      (src3[i] << 8) | (src2[i] << 24);
>         }
>
>     }

This looks like converting between FOURCC formats. If there is a better
format for the hardware, use that. SDL should handle YUY2 and YV12
equally well, though I’ve only tested YV12, so if there are problems
please let me know.

> Regardless, I’ve signed up for
> Matrox’s developer program, in hopes of being able to get some kind of
> hardware assisted copy set up from the XShm segment.
>
> Anyway, you’ll probably be hearing more from me in the near future, as I
> push as hard as I can on this stuff.

Great, I’m really looking forward to it. :slight_smile:

See ya!
-Sam Lantinga, Lead Programmer, Loki Entertainment Software

On Tue, 13 Jun 2000, Sam Lantinga wrote:

> If possible, lock the overlay and decode directly into overlay->pixels.
> This will save a memcpy().

This was just a test app. The real decoder places macroblocks directly
into the overlay->pixels buffer.

> What you’re probably running into is software YUV conversion to RGB.
> SDL doesn’t have the X video extension support enabled by default
> (there are other problems with XFree86 4.0 that prevent it), but if
> you recompile from the source, you should get MUCH better framerates.

I’ve been working off the latest & greatest CVS since I started. I’m pretty
sure it’s doing the overlay. IIRC, I timed it with SDL_VIDEO_YUV_HWACCEL
turned off (or built without Xv support, I forget) and it got
significantly slower. Regardless, when the test app (repeated memcpy()s)
takes the same processor time as X, it’s a strong hint that Xv is being
properly utilized. I’ll add some debugging to SDL to verify once and for
all that it’s being used, though.

> Very cool.

Now I just have to get it off the Orb disk and somewhere useful… I
couldn’t get very much HD material today for a number of reasons, but I do
have some. I also got ahold of some dual-SD streams (2x 8Mbit 720x480
MPEG-2), which should be quite useful.

> This looks like converting between FOURCC formats. If there is a better
> format for the hardware, use that. SDL should handle YUY2 and YV12
> equally well, though I’ve only tested YV12, so if there are problems
> please let me know.

Both of them work fine on the G400, and they seem roughly equivalent in
speed, contrary to the indications in the XFree driver code… Luckily
I’m using the format supported directly in hardware, not that it makes
much difference ;-(

> Great, I’m really looking forward to it. :slight_smile:
> See ya!

I’m leaving for a month in Europe on Monday, but I’ll be trying to keep up
and even get some stuff done (hoping the Matrox stuff comes through before
I leave so I can print it out and read it on the plane…).

TTYL,
Omega