Threading overhead

Pierre Phaneuf wrote:

Yes, you're right. And you can control timeouts on non-blocking accept, of
course. But threads on SMP machines give you more performance.

A few points:

  • not many people have SMP machines, and running more threads/processes
    than there are CPUs causes a noticeable overhead

Could some people look at this test program and tell me if I did anything
questionable? I compile the threaded and non-threaded versions with these
command lines:

gcc -o test1 thread-test.c
gcc -o test2 -DTHREADED thread-test.c -lpthread

Running each one five times in a row gives me these results on my
machine:

Running time: 35475 ms (single-threaded)
Running time: 35458 ms (single-threaded)
Running time: 35447 ms (single-threaded)
Running time: 35658 ms (single-threaded)
Running time: 35528 ms (single-threaded)

Running time: 40457 ms (multi-threaded)
Running time: 39970 ms (multi-threaded)
Running time: 38569 ms (multi-threaded)
Running time: 39422 ms (multi-threaded)
Running time: 40173 ms (multi-threaded)

Notice the 4-5 seconds more time it takes to execute and the variance
with the multi-threaded version.

I was expecting this, but I was wondering if any multithreading
proponent had anything to say on my code?
--
Pierre Phaneuf
http://ludusdesign.com/

Pierre Phaneuf wrote:

I was expecting this, but I was wondering if any multithreading
proponent had anything to say on my code?

Well, including the code could help, right? :-)
--
Pierre Phaneuf
http://ludusdesign.com/
-------------- next part --------------
#ifdef THREADED
#include <pthread.h>
#endif
#include <errno.h>
#include <string.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>

#define ITER 10000
#define BUFSIZE 65536

void* buf1;
void* buf2;
void* buf3;

#ifdef THREADED
pthread_mutex_t lock;

void* thread1(void* dummy) {
unsigned int count;

for(count = ITER; count > 0; count--) {
pthread_mutex_lock(&lock);
memcpy(buf3, buf1, BUFSIZE);
pthread_mutex_unlock(&lock);
}

return NULL;
}

void* thread2(void* dummy) {
unsigned int count;

for(count = ITER; count > 0; count--) {
pthread_mutex_lock(&lock);
memcpy(buf3, buf2, BUFSIZE);
pthread_mutex_unlock(&lock);
}

return NULL;
}
#else
void thread1() {
memcpy(buf3, buf1, BUFSIZE);
}

void thread2() {
memcpy(buf3, buf2, BUFSIZE);
}
#endif

int main(int argc, char** argv) {
#ifdef THREADED
pthread_t t1;
pthread_t t2;
#else
unsigned int count;
#endif
struct timeval time1;
struct timeval time2;
unsigned int msec;

buf1 = malloc(BUFSIZE);
buf2 = malloc(BUFSIZE);
buf3 = malloc(BUFSIZE);

#ifdef THREADED
pthread_mutex_init(&lock, NULL);
#endif

if(gettimeofday(&time1, NULL) == -1) {
perror(argv[0]);
exit(1);
}

#ifdef THREADED
errno = pthread_create(&t1, NULL, thread1, NULL);
if(errno != 0) {
perror(argv[0]);
exit(1);
}

errno = pthread_create(&t2, NULL, thread2, NULL);
if(errno != 0) {
perror(argv[0]);
exit(1);
}

errno = pthread_join(t1, NULL);
if(errno != 0) {
perror(argv[0]);
exit(1);
}

errno = pthread_join(t2, NULL);
if(errno != 0) {
perror(argv[0]);
exit(1);
}
#else
for(count = ITER; count > 0; count--) {
thread1();
thread2();
}
#endif

if(gettimeofday(&time2, NULL) == -1) {
perror(argv[0]);
exit(1);
}

free(buf1);
free(buf2);
free(buf3);

#ifdef THREADED
errno = pthread_mutex_destroy(&lock);
if(errno != 0) {
perror(argv[0]);
exit(1);
}
#endif

msec = (time2.tv_sec - time1.tv_sec) * 1000;
msec += (time2.tv_usec - time1.tv_usec) / 1000;

#ifdef THREADED
printf("Running time: %u ms (multi-threaded)\n", msec);
#else
printf("Running time: %u ms (single-threaded)\n", msec);
#endif

return 0;
}

How did you time your code?

On Sun, 09 Apr 2000, you wrote:

[snip]

-Garrett, WPI student majoring in Computer Science
"The fastest way to succeed is to look as if you’re playing by somebody
else’s rules, while quietly playing by your own." -Michael Konda

You could try using getrusage to time your program; that way you can see
the overhead a little better, if there is any.

On Sun, 09 Apr 2000, you wrote:

[snip]




-Garrett, WPI student majoring in Computer Science
"The fastest way to succeed is to look as if you’re playing by somebody
else’s rules, while quietly playing by your own." -Michael Konda

Notice the 4-5 seconds more time it takes to execute and the variance
with the multi-threaded version.

Running time: 7358 ms (multi-threaded)
Running time: 7303 ms (multi-threaded)
Running time: 7298 ms (multi-threaded)
Running time: 7294 ms (multi-threaded)
Running time: 7331 ms (multi-threaded)
Running time: 7291 ms (single-threaded)
Running time: 7326 ms (single-threaded)
Running time: 7279 ms (single-threaded)
Running time: 7295 ms (single-threaded)
Running time: 7281 ms (single-threaded)

for Solaris 2.6, gcc without optimization. Oh, yes, I filled your malloc()ed
buffers just to be sure that we don’t memcpy() from a single COW zero page,
but it didn’t really change the results.

I was expecting this, but I was wondering if any multithreading
proponent had anything to say on my code?

I'm not an MT proponent (rather the opposite), but your code is quite
sensitive to cache sizes, not only to the thread implementation. I'm
somewhat amazed that cache misses + kernel overhead eat as much as 0.5 ms
per iteration, though.

Hi,
Still hacking away at this audio application…it was taking
absurd amounts of CPU, until someone recommended that I run the EsounD esd
program. Now it takes much, much less CPU. Usually. At first, I thought I
had to have esd installed and running in order to install SDL; was I
wrong? Is SDL able to fall back on, say, OSS drivers?

Anyway, now that esd is up and running, it's working pretty well, except
for a few problems: there seems to be crackling, which we think is related
to clipping, that doesn't happen without esd running. Should I look into
the esd code to figure out how to work around this? Also,
sometimes when there’s a little hiccup in the machine’s CPU usage, the
sound skips, and proceeds to sound bubbly, while driving up the app’s CPU
usage. Also, I’m getting a bunch of errors in my log files that look like
this:
sound_open_dma: DMA channel 1 busy or not allocated (2)
Unable to grab(2) DMA1 for the audio driver
Any clues as to where I should start looking to debug this problem? Kernel
problem? Library problem? Sound card specific problem (I have an old
Gravis Ultrasound MAX)?

Thanks...
    -The Mighty Mike Master
    http://tmmm.simplenet.com

Mattias Engdegård wrote:

for Solaris 2.6, gcc without optimization. Oh, yes, I filled your malloc()ed
buffers just to be sure that we don’t memcpy() from a single COW zero page,
but it didn’t really change the results.

Freshly malloc()ed memory sometimes is not actually allocated, but this
is slightly different from COW: any access (including reading) to the
page will do the actual allocation, right?

I was expecting this, but I was wondering if any multithreading
proponent had anything to say on my code?

I'm not an MT proponent (rather the opposite), but your code is quite
sensitive to cache sizes, not only to the thread implementation. I'm
somewhat amazed that cache misses + kernel overhead eat as much as
0.5 ms per iteration, though.

As for the cache sizes, that's why I made the buffer size modifiable.
BTW, I specifically wanted cache misses to happen, to simulate unrelated
threads breaking the cache for each other, but they have the destination
buffer in common, to simulate locking interaction.

My impression is that for a game, it is cache misses that are the biggest
loss, as they mess up things like culling, blitting and mixing.
--
Pierre Phaneuf
http://ludusdesign.com/

Garrett wrote:

You could try using getrusage to time your program, that way you can see
overhead a little better if there is any.

I wanted mostly a look at wall-clock time, but I used "time" to see if
there was any noticeable difference between the two implementations, and
there was more system time. Here is an example of a single run that
resembles what I had with my multiple runs earlier today:

Running time: 35999 ms (single-threaded)
35.91user 0.02system 0:36.02elapsed 99%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (86major+32minor)pagefaults 0swaps

Running time: 40809 ms (multi-threaded)
39.32user 1.46system 0:40.84elapsed 99%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (104major+38minor)pagefaults 0swaps

I will add a switch for using getrusage.
--
Pierre Phaneuf
http://ludusdesign.com/

Freshly malloc()ed memory sometimes is not actually allocated, but this
is slightly different from COW, any access (including reading) to the
page will do the actual allocation, right?

Remember this was Solaris, and I wasn’t sure how it handled allocation and
whether it did overcommit. Setting fresh anonymous mappings to COW the zero
page is a quite reasonable implementation, and reading it would just read
one small (8K) physical page.

My impression is that for a game, it is cache misses that are the biggest
loss, as they mess up things like culling, blitting and mixing.

My impression is that the biggest loss in MT development is the debugging
nightmare :-). I agree about the cache importance, though.

The comparatively small overhead in my case probably stems from Solaris
threads being NxM (N API threads mapping on M kernel threads, N >= M),
effectively giving non-preemptive user-mode threading for your benchmark.
I suppose one thread would run to completion, then the other one.

In other words, your benchmark tests whether you are running Linux or not :-)

Hi,
Still hacking away at this audio application…it was taking
absurd amounts of CPU, until someone recommended that I run the EsounD esd
program. Now it takes much, much less CPU. Usually. At first, I thought I
had to have esd installed and running in order to install SDL; was I
wrong? Is SDL able to fall back on, say, OSS drivers?

SDL is using the OSS drivers, but it sounds like they are broken with your
sound card…

SDL runs a thread in a loop that calls select on the /dev/dsp file descriptor,
and calls your callback when data is ready for writing. To make this happen
with the minimum latency, SDL does a bunch of tricks in the driver. See
SDL_dspaudio.c for details.

See ya!
-Sam Lantinga, Lead Programmer, Loki Entertainment Software

Mattias Engdegård wrote:

Freshly malloc()ed memory sometimes is not actually allocated, but this
is slightly different from COW, any access (including reading) to the
page will do the actual allocation, right?

Remember this was Solaris, and I wasn’t sure how it handled allocation and
whether it did overcommit. Setting fresh anonymous mappings to COW the zero
page is a quite reasonable implementation, and reading it would just read
one small (8K) physical page.

Yes, I should write all over the buffer once before starting, so that
anything overcommitted is actually committed. Even though it doesn't
change anything on Solaris… 8K pages on UltraSPARC? I didn't know that… :-)

My impression is that for a game, it is cache misses that are the biggest
loss, as they mess up things like culling, blitting and mixing.

My impression is that the biggest loss in MT development is the debugging
nightmare :-). I agree about the cache importance, though.

Argh! Yeah, don't get me started on debugging! Notice that the locking
situation in this small piece of code is really simple, but that it might
not be the case in a real-life situation. The "no locking at all" in the
non-threaded version is actually infinitely simpler (and easier to
debug).

"Infinitely easier", just think about how much easier that is! :-)

The comparatively small overhead in my case probably stems from Solaris
threads being NxM (N API threads mapping on M kernel threads, N >= M),
effectively giving non-preemptive user-mode threading for your benchmark.
I suppose one thread would run to completion, then the other one.

Ahh, non-preemptive threading is okay with me! It is more complicated
than no threading at all (it can require holding locks across a blocking
call that passes control to another thread), but it has good performance.

I actually think that the NxM threading model is probably one of the most
efficient, with the number of kernel threads equal to the number of CPUs
you want to use. But I like being explicit as a way of being clear and
simple, so I have a slight preference for forking a real process with
shared-memory communication or spawning a thread for a single job.
Whatever is simpler is probably better.
--
Pierre Phaneuf
Systems Exorcist

Pierre Phaneuf wrote:

I actually think that the NxM threading model is probably one of the most
efficient, with the number of kernel threads being equal to the number
of CPUs you want to use. But I like being explicit as a way of being
clear and simple, I have a slight preference to forking a real process
with shared memory communication or spawning a thread for a single job.
Whatever is simpler is probably better.

In his recent Slashdot interview, Ingo Molnar more or less says that
"threads suck". A very interesting read on high-performance system
programming for Linux; check out answer #6 on threading:


“World domination. Fast.” – Linus Torvalds