Threaded Disk Access Causes Process Blocking w/ 2.6 Kernel

The one reason I haven’t yet switched to Linux 2.6 is that threaded
disk access behaves differently than it does on 2.4.

On 2.4, when I use SDL_CreateThread(), I can use this thread to load
images from the disk and, aside from the CPU power it takes to decode
them, not notice any performance hit in my main drawing thread.
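
Concretely, the pattern looks something like this (a minimal sketch, not my
actual code; SDL_LoadBMP stands in for the real image decoding):

#include <SDL.h>
#include <SDL_thread.h>

/* Loader thread: blocks on disk I/O, leaving the main thread free to draw. */
static int loader_thread(void *arg)
{
    const char *path = (const char *)arg;
    SDL_Surface *img = SDL_LoadBMP(path);
    /* ... decode/convert here, then hand img off to the main thread ... */
    return img ? 0 : -1;
}

int main(int argc, char **argv)
{
    SDL_Init(SDL_INIT_VIDEO);
    SDL_Thread *loader = SDL_CreateThread(loader_thread, (void *)"image.bmp");

    /* ... main drawing loop runs here while the loader blocks on disk ... */

    int status;
    SDL_WaitThread(loader, &status);
    SDL_Quit();
    return 0;
}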

On 2.6, the image loading thread causes MAJOR slowdown. I very much
suspect that this thread is actually causing the entire process to block
when it’s waiting for data from the hard drive. The framerate will be
very skippy, and the sound will even skip when loading sufficiently
large images. Likewise, streaming music from the hard drive causes
pretty constant skippiness in the framerate.

Has anybody run into this problem before? Is there something special I
have to do, when I fork my thread, to make it not block the entire
process on disk access?

Thanks in advance.

-Chris

Make sure DMA access is enabled:

8:59pm root@zewt.pts/1 [~] hdparm -d /dev/hda

/dev/hda:
using_dma = 1 (on)

On Tue, Jun 22, 2004 at 03:56:32PM -0400, Chris Nelson wrote:

[…]


Glenn Maynard

[linux 2.6 causes blocking on threaded disk access]

Make sure DMA access is enabled:

8:59pm root@zewt.pts/1 [~] hdparm -d /dev/hda

/dev/hda:
using_dma = 1 (on)

Unfortunately, it was already enabled…

Might this be an issue I should take to the LKML? I’d really like to
exhaust all options here before going to them about this… They scare
me.

-Chris


On Wednesday, June 23, 2004 at 03:16, Chris Nelson wrote:

[…]

And what does this reveal about the PIO, DMA, and UDMA modes?
$ hdparm -i /dev/hda

Can you try this?

$ hdparm -t -T /dev/hda

This will run read-only speed tests and show the results.


Michel Nolard

hdparm -i /dev/hda produces identical results under 2.4.24 and 2.6.7:

linverse@KOS-MOS-2.6.7:~$ sudo hdparm -i /dev/hda

/dev/hda:

Model=HITACHI_DK23EB-40, FwRev=00K0A0C0, SerialNo=727825
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=78140160
IORDY=yes, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
Drive conforms to: ATA/ATAPI-5 T13 1321D revision 3:

 * signifies the current active mode

linverse@KOS-MOS-2.4.24:~$ sudo hdparm -i /dev/hda
Password:

/dev/hda:

Model=HITACHI_DK23EB-40, FwRev=00K0A0C0, SerialNo=727825
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=78140160
IORDY=yes, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
Drive conforms to: ATA/ATAPI-5 T13 1321D revision 3:

 * signifies the current active mode

hdparm -t -T /dev/hda does show 2.4.24 having a moderate advantage, but
nothing on the order of the problems I have been seeing.


linverse@KOS-MOS-2.6.7:~$ sudo hdparm -t -T /dev/hda

/dev/hda:
Timing buffer-cache reads: 1132 MB in 2.01 seconds = 564.39 MB/sec
Timing buffered disk reads: 84 MB in 3.06 seconds = 27.49 MB/sec

linverse@KOS-MOS-2.4.24:~$ sudo hdparm -t -T /dev/hda

/dev/hda:
Timing buffer-cache reads: 1276 MB in 2.00 seconds = 638.00 MB/sec
Timing buffered disk reads: 84 MB in 3.05 seconds = 27.54 MB/sec


I would expect the program to draw perfectly fast (though load things
slowly) if hda were slow under 2.4.24, because I’m loading things in
separate threads, so the main thread is free to draw as quickly as
possible, without waiting for the hard drive. In 2.6.7, though, it’s
exactly as if I were loading from the main thread.

Does anybody know how pthreads are supposed to be handled with 2.4 vs
2.6, with respect to blocking the entire process on disk access? I can’t
imagine they’d intentionally change this for the worse. Does anybody
else, with certainty, NOT have this problem with 2.6?

Chris Nelson wrote:

[…]

Well, I don’t think anyone can test this unless you can give us some
source code to reproduce it :)

Also, do you happen to use NPTL?

Stephane

Well, I don’t think anyone can test this unless you can give us some
source code to reproduce it :)

It turns out I was wrong. That’s the short version, to save people time;
the details of what I learned in my investigation follow.

I created a simple 85-line test program, which basically spawned a
thread that just read from a file over and over, 1 megabyte at a time. In
the main thread, I just had it loop and keep track of skipping/delay
statistics. If you don’t feed it a command line argument (any will do),
then it just loops without the reader thread, as a control in the
experiment.
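
The skeleton was roughly this (a from-memory sketch, not the exact 85 lines
that were attached later in the thread):

#include <SDL.h>
#include <SDL_thread.h>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

/* Reader thread: pull 1 MB at a time from ./input, forever. */
static int reader(void *)
{
    static char buf[1 << 20];
    for (;;) {
        int fd = open("input", O_RDONLY);
        if (fd < 0) return -1;
        while (read(fd, buf, sizeof buf) > 0)
            ;                      /* chew through the file, then reopen */
        close(fd);
    }
}

int main(int argc, char **argv)
{
    SDL_Init(SDL_INIT_TIMER);
    if (argc > 1)                  /* any argument enables the reader thread */
        SDL_CreateThread(reader, 0);

    Uint32 last = SDL_GetTicks();
    Uint32 end = last + 5000;      /* run the timing loop for five seconds */
    Uint32 frames = 0, biggest = 0, total = 0, skips = 0;
    while (SDL_GetTicks() < end) {
        Uint32 now = SDL_GetTicks();
        Uint32 dt = now - last;
        last = now;
        ++frames;
        if (dt > biggest) biggest = dt;
        if (dt >= 17) {            /* a "skip" is a frame over ~16.7 ms */
            total += dt;
            ++skips;
        }
    }
    printf("FPS: %u Biggest: %ums TotalTime: %u Number: %u\n",
           frames, biggest, total, skips);
    return 0;
}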

I expected 2.4 to behave nearly the same, with and without the reader
thread. I expected 2.6 to choke, and skip a lot, given a reader thread.

What instead happened was that both kernels behaved fine on both the
control run and the initial reader-thread run. The surprising find
was that, by repeatedly running the program with the reader thread, I
was able to increase skipping more and more with each iteration.

Luckily I had gkrellm open, and I saw what was happening. Linux wasn’t
going to the hard disk for the data, because it already had it cached
somewhere in memory. As such, the reader thread wasn’t able to yield
itself to the primary thread until the read() system call was finished
grabbing the cached data.

So, I withdraw my claim that threaded disk access behaves differently in
2.6. Sorry to waste time :)

One strange thing I did notice, however, is that SDL_GetTicks() behaves
a little differently. Specifically, I was getting the ILLUSION of a
halved framerate, compared to 2.4, because it was basically oscillating
between ~4ms and ~30 ms, each frame, and I was advancing the state of my
world based on it.
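
One common way to keep the world advancing smoothly despite jittery deltas
(a generic sketch, not what my code actually did) is to accumulate elapsed
time and step the simulation at a fixed rate:

#include <SDL.h>

int main(int argc, char **argv)
{
    SDL_Init(SDL_INIT_TIMER);
    const Uint32 STEP = 10;            /* world advances in fixed 10 ms ticks */
    Uint32 last = SDL_GetTicks(), acc = 0;
    for (;;) {
        Uint32 now = SDL_GetTicks();
        acc += now - last;
        last = now;
        while (acc >= STEP) {          /* catch up in fixed increments, so a   */
            /* update_world(STEP); */  /* 4 ms / 30 ms oscillation averages out */
            acc -= STEP;
        }
        /* draw_frame(); */
    }
    return 0;
}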

So, after all that investigation, I didn’t really find any answers that
will help me migrate to 2.6, so I’ll continue happily using 2.4. I
suppose I’ll look into loading data in a different process, using pipes
or shared memory, though that’s so much clunkier than threaded loading.
Oh well.
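
For reference, the process-based approach would look something like this (a
sketch only, with error handling trimmed; image.dat and the 64 KB chunk size
are made up for illustration):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fds[2];
    if (pipe(fds) < 0) return 1;

    pid_t pid = fork();
    if (pid == 0) {                    /* child: the loader process */
        close(fds[0]);
        int in = open("image.dat", O_RDONLY);
        char buf[65536];
        ssize_t n;
        while ((n = read(in, buf, sizeof buf)) > 0)
            write(fds[1], buf, n);     /* stream the data to the parent */
        close(in);
        close(fds[1]);
        _exit(0);
    }

    close(fds[1]);                     /* parent: poll the pipe between frames */
    fcntl(fds[0], F_SETFL, O_NONBLOCK);
    /* In the drawing loop, read(fds[0], ...) returns -1 with EAGAIN when no
       data is ready, so a frame never blocks waiting on the loader. */
    return 0;
}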

-Chris



On Wed, Jun 23, 2004 at 11:29:54PM -0400, Chris Nelson wrote:

I created a simple 85-line test program, which basically spawned a
thread that just read from a file over and over, 1 megabyte at a time.
[…]

Er, could you post this program, so we can see what you’re doing?

Luckily I had gkrellm open, and I saw what was happening. Linux wasn’t
going to the hard disk for the data, because it already had it cached
somewhere in memory. As such, the reader thread wasn’t able to yield
itself to the primary thread until the read() system call was finished
grabbing the cached data.

Grabbing data from cache shouldn’t take any significant time at all (compared
to a disk hit).

One strange thing I did notice, however, is that SDL_GetTicks() behaves
a little differently. Specifically, I was getting the ILLUSION of a
halved framerate, compared to 2.4, because it was basically oscillating
between ~4ms and ~30 ms, each frame, and I was advancing the state of my
world based on it.

SDL_GetTicks uses gettimeofday, I believe, which should give microsecond
precision (which SDL irritatingly kills to millisecond precision). You
might be having actual timer problems, though; I saw a lot of this in 2.6.5
on nForce boards.


Glenn Maynard

Er, could you post this program, so we can see what you’re doing?

Certainly. I’ve attached main.cpp and its Makefile. Drop them into a
directory, ln -s (don’t cp!) a big file to ./input, and finally execute
./test… Without a command line argument, there will be no reader
thread. With a command line argument, it will fork the thread. I’d be
interested in seeing your results. I’ve attached mine as RESULTS.

Grabbing data from cache shouldn’t take any significant time at all (compared
to a disk hit).

True. The theory is that the disk hit takes no CPU time, due to DMA,
whereas the cache hit does. An annoying tradeoff, actually.

SDL_GetTicks uses gettimeofday, I believe, which should give microsecond
precision (which SDL irritatingly kills to millisecond precision). You
might be having actual timer problems, though; I saw a lot of this in 2.6.5
on nForce boards.

Interesting. I don’t have an nForce board (I’m running on a Dell
Inspiron 8200)… I’ll probably just wait for this to be fixed in 2.6…
[Attachment: main.cpp: http://lists.libsdl.org/pipermail/sdl-libsdl.org/attachments/20040624/b4e1d0c1/attachment.cpp]
[Attachment: Makefile: http://lists.libsdl.org/pipermail/sdl-libsdl.org/attachments/20040624/b4e1d0c1/attachment.bin]

RESULTS:
Kernel    Reading   FPS       Biggest   TotalTime   Number
2.6.7     OFF       1258048        10           0        0
2.6.7     ON        1197145         2           0        0
2.6.7     REPEAT    0618549       211        2326       23
2.4.24    OFF       1289901         1           0        0
2.4.24    ON        1234525         6           0        0
2.4.24    REPEAT    0643808       101        2470       40

Are you using a particular distribution, e.g., Red Hat, SuSE, etc.?

While it is probably not the root cause of your issues, you should be
aware that several Linux distributions changed their pthread
implementation over the 2.4 -> 2.6 releases and are now using NPTL
(Native Posix Thread Library) instead of the older non-posix compliant
linuxthreads.
(For some ramifications see: http://kerneltrap.org/node.php?id=429, or
do a google search for nptl linuxthreads)
This has become such an issue in some of my company’s multi-threaded
products that one of our team, Peter, had to write a simple program to
test which one is implemented on a given machine.

If you are interested, I’ve attached the source, please leave the header
intact to give him credit.
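
Since attachments don’t always come through, the gist (a sketch of one common
approach, not Peter’s actual code) is that glibc will report its pthread
flavor through confstr(), and under LinuxThreads each thread also gets its
own getpid():

/* Build with: g++ detect.cpp -o detect -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pid_t thread_pid;

static void *record_pid(void *arg)
{
    thread_pid = getpid();    /* LinuxThreads: differs from the main PID */
    return 0;
}

int main(void)
{
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    char buf[128];
    if (confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, sizeof buf) > 0) {
        printf("libpthread: %s\n", buf);  /* e.g. "NPTL 0.60" or "linuxthreads-0.10" */
        return 0;
    }
#endif
    pthread_t t;
    pthread_create(&t, 0, record_pid, 0);
    pthread_join(t, 0);
    puts(thread_pid == getpid()
         ? "threads share one PID: NPTL"
         : "threads have different PIDs: LinuxThreads");
    return 0;
}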

Thanks,
Jason.

----- Original Message -----

From: sdl-bounces+jclark=ccpu.com@libsdl.org
[mailto:sdl-bounces+jclark=ccpu.com at libsdl.org] On Behalf Of Chris
Nelson
Sent: Wednesday, June 23, 2004 9:30 PM
To: A list for developers using the SDL library. (includes SDL-announce)
Subject: Re: [SDL] Threaded Disk Access Causes Process Blocking w/ 2.6
Kernel

[…]



[Attachment: detect_threading.c: http://lists.libsdl.org/pipermail/sdl-libsdl.org/attachments/20040624/5a021585/attachment.obj]

On Thu, Jun 24, 2004 at 04:08:37AM -0400, Chris Nelson wrote:

Grabbing data from cache shouldn’t take any significant time at all (compared
to a disk hit).

True. The theory is that the disk hit takes no CPU time, due to DMA,
whereas the cache hit does. An annoying tradeoff, actually.

A disk hit isn’t going to DMA directly into your buffer; it’s going to
go to a kernel buffer, and then get copied to your buffer.

The real difference, at least with your test case, is that the cached runs
are pulling in gigabytes of data (since the loop reads repeatedly), whereas
the non-cached runs spend the whole time reading slowly from the disk.

#include <SDL.h>
#include <SDL_thread.h>

Nothing ever yields the CPU, so both threads are contending for it
completely. One thread repeatedly reads as fast as it can, and the other
busy-loops over the timer, so it’s entirely up to the scheduler. This
doesn’t happen when hitting the disk, since the reader is blocking on I/O.

There may or may not be some kind of bug here: the scheduler should be
preempting the read long before 100 ms (which is what I’m seeing). It may
be that it decides they’re both non-interactive threads (correctly,
at least in this test case) and doesn’t preempt as often, to reduce
overhead. (That wouldn’t be correct in the case of a game main loop.)

Add a usleep(8000); to the timing loop to pretend you’re vsyncing; the skips
go away. (The timer thread is probably ending up much higher priority,
preempting the reading thread, since it’s not chewing CPU.)
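
Concretely, the suggested change to the timing loop (a sketch, not the exact
test program):

#include <SDL.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    SDL_Init(SDL_INIT_TIMER);
    Uint32 last = SDL_GetTicks(), end = last + 5000;
    while (SDL_GetTicks() < end) {
        Uint32 now = SDL_GetTicks();
        /* ... measure and record the delta as before ... */
        last = now;
        usleep(8000);   /* ~8 ms sleep per frame, as if waiting on vsync */
    }
    return 0;
}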

I sometimes see something like this in Windows. I load sounds in a thread,
and when in fullscreen or vsync is disabled, it gets choppy. That’s because
the threads are competing for time. (I don’t really know why, since it’s
hitting the disk; but it’s Windows, so I’m not likely to find out.) With
vsync enabled in fullscreen, it’s (almost) perfectly smooth; vsync yields
the CPU to the scheduler (at least with all sensible drivers), and the
threads no longer fight each other.

So, it’s almost certainly scheduler-related, but I couldn’t say if it’s a
bug or not. I can fairly guarantee that, if your program is portable,
you’ll see similar issues in Windows, though.


Glenn Maynard

I ran your program on my 2.6.3 system, using GCC 3.2, and an input of about 70
megs, and the results were:
Reader thread enabled…
Please wait 5 seconds… (Skip >= 16.7ms)

FPS: 1279436 Biggest: 1ms TotalTime: 0 Number: 0

And this was pretty constant. The highest I ever saw it go (I ran it about 10
times) was 4ms.

As an aside, why are you still using GCC 2.95?

-Sean Ridenour

[…]

On Thu, 2004-06-24 at 21:25, Sean Ridenour wrote:

I ran your program on my 2.6.3 system, using GCC 3.2, and an input of about 70
megs, and the results were:
FPS: 1279436 Biggest: 1ms TotalTime: 0 Number: 0

Hmm… It would seem, then, that your system is reading from disk every
time, and mine’s doing caching. I wonder why? Not especially important,
but it piques my curiosity. Thanks for testing.

As an aside, why are you still using GCC 2.95?

Pretty much out of laziness… When I was learning the STL (back when
2.95 was the latest), it allowed for certain questionable operations
that later versions of gcc didn’t. The path of least resistance was to
just keep using 2.95 when 3.0 broke compatibility… Someday I’ll go
back and change the code, so I can use the 3.0 series. Here’s an example
of what 3.0 doesn’t like:

std::vector<int> List;
// push_back a few entries
List.erase(&(List[3]));

The correct way, of course, is to call erase with an iterator.
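
That is (assuming an int element type for illustration):

std::vector<int> List;
// push_back a few entries
List.erase(List.begin() + 3);   // iterator form; accepted by GCC 3.x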

-Chris