Multithreading problem

Maybe this is not the perfect place to post this question, but as it
has at least loosely to do with SDL, I'll do it anyway. :)

I have multithreaded some CPU-intensive code in my renderer using SDL.
That code really is CPU intensive - it can cut framerates by 75% - so
I expected to see a real performance gain on multicore CPUs from
splitting it into two threads.

The threads are started at program start and wait for data to process.
The thread execution is pretty fine-grained though (i.e. the threads
are signalled to start quite often and do not process all that much
data before finishing and waiting for the next start signal). Due to
the overall structure of the renderer this can hardly be changed.
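As a rough illustration of that structure (persistent workers that block until signalled, then process a chunk and signal completion), here is a minimal sketch using Python's threading module; in C with SDL, SDL_CondWait/SDL_CondSignal play the same role. All names are hypothetical, not the actual renderer code:

```python
import threading

class Worker:
    """A persistent worker thread: started once, then repeatedly
    handed small chunks of work via a condition variable."""

    def __init__(self, func):
        self.func = func
        self.cond = threading.Condition()
        self.job = None          # data to process, set by the main thread
        self.result = None
        self.done = True
        self.alive = True
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            with self.cond:
                # Passively wait for the next start signal (no busy loop).
                while self.job is None and self.alive:
                    self.cond.wait()
                if not self.alive:
                    return
                job, self.job = self.job, None
            result = self.func(job)      # process outside the lock
            with self.cond:
                self.result = result
                self.done = True
                self.cond.notify_all()   # signal completion

    def start_job(self, data):
        with self.cond:
            self.job = data
            self.done = False
            self.cond.notify_all()       # the frequent "start" signal

    def wait_done(self):
        with self.cond:
            while not self.done:
                self.cond.wait()
            return self.result

    def stop(self):
        with self.cond:
            self.alive = False
            self.cond.notify_all()
        self.thread.join()
```

The renderer would create two of these at startup and, each frame, call `start_job` on both with half of the data, then `wait_done` on both.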

I have a dual-core AMD CPU, and the device manager shows two CPUs. Yet,
with fully optimized release code, the dual-threaded execution is about
25% slower than the single-threaded one.

Any clues as to why this is so (e.g. too fine-grained) or what I am
doing wrong (e.g. second core not recognized)?

I have tested all of this on WinXP Home SP2.

[karx11erx's original post, quoted in full, snipped]

If you are doing rendering that is very CPU intensive, it is probably
even more memory intensive. My bet is that you churn through megabytes
of memory. Using two CPUs doubles the amount of CPU horsepower
available, but it does not double the amount of cache memory available
and it does not double the memory bandwidth of your system. What it
does is double the load on the memory and the cache. It is very
possible that using two CPUs is causing cache thrashing that you did
not have with a single CPU. That could explain the slowdown. Using
multiple CPUs is great for CPU-intensive applications, but it is not
good for memory-intensive applications.

I am of course assuming that you are not creating the threads each
time… :) And also that when you said “signal” you mean you have the
threads passively waiting on a condition variable and not doing some
horrible kind of busy waiting.
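The difference Bob is drawing can be sketched like this, with Python's threading.Event standing in for an SDL condition variable or semaphore (helper names are made up for illustration):

```python
import threading

def busy_wait(flag):
    """The 'horrible' kind: the waiting thread spins at 100% CPU,
    competing with the worker thread for a core."""
    while not flag["set"]:
        pass                 # burns a core while waiting

def passive_wait(event):
    """The recommended kind: the thread sleeps inside the OS until it
    is signalled, consuming no CPU while blocked. SDL offers the same
    via SDL_CondWait or SDL_SemWait."""
    event.wait()
```

On a dual-core machine, a busy-waiting control thread can easily eat the very core the second worker was supposed to use.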

	Bob Pendleton

(replying to karx11erx's post of Sat, 2006-12-30 at 23:43 +0000)

SDL mailing list
SDL at libsdl.org
http://www.libsdl.org/mailman/listinfo/sdl



Bob Pendleton <bob pendleton.com> writes:

[karx11erx's original post and Bob's reply, quoted in full, snipped]

Bob,

thx for the input. The code doesn't read a lot of memory, though; it does a
lot of floating point calculations (it shoots a ray through the level and
determines which faces of the level the ray intersects). The background is
that I am trying to determine the faces closest to an object that are
shadowed by it, to do some volume stencil shadow clipping (because I have so
many lights that I cannot do regular, full multipass volume stencil shadow
rendering). So I take a line from the current light source through each lit
vertex of the object model and see where that line first hits a wall. That
means quite a few lines per object. So I have two threads, each processing
half of the object's lit vertices.
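The per-vertex test described here boils down to a ray-plane intersection. A simplified sketch of the math, not the poster's actual code (infinite planes only; real code would also check that the hit point lies inside the face polygon, and the helper names are hypothetical):

```python
def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def ray_plane_t(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Parameter t where origin + t*direction meets the plane,
    or None if the ray is (nearly) parallel to it."""
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None
    return dot(sub(plane_point, origin), plane_normal) / denom

def first_wall_hit(light, vertex, walls):
    """Smallest t > 1 along the light->vertex ray, i.e. the first
    wall the ray hits beyond the lit vertex."""
    d = sub(vertex, light)                 # ray direction
    best = None
    for p_point, p_normal in walls:
        t = ray_plane_t(light, d, p_point, p_normal)
        if t is not None and t > 1.0 and (best is None or t < best):
            best = t
    return best
```

The work per ray is a handful of multiply-adds, which matches the observation that the workload is floating-point bound rather than memory bound.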

My CPU has 1 MB of cache, and AFAIK that is per core. Even if not, it is
still twice the amount most single-core AMD CPUs have (I have a Linux box
with such a CPU).

My CPU has 1 MB of CPU cache, and afaik that is per core

Per-core cache is actually not that helpful if your threads are
accessing the same parts of memory. I'm not familiar with the specific
caching mechanisms of AMD's chips, but the basic problem is that
shared thread state, even basic semaphores or mutexes, has to be
synchronized between the two L1 caches, which is a whole lot more
time consuming than one or two threads using just one L1 cache. The
frequent signaling probably isn't helping much either, especially if
your threads are blocking in a tight loop.

It seems to me like you are using two threads to solve a one-thread
problem. The fact that you made thread execution so fine-grained
highlights this. Threads aren't a panacea, even on a multi-processor/
multi-core system. So I would either revise the threaded code to
involve less signaling and/or shared data, or improve the efficiency
of the original single-threaded algorithm if possible… Hope this
is of some help, but sorry if it's not.
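One way to cut both the signaling and the shared data, along the lines suggested above, is to hand each thread one large batch per frame and have it write into its own disjoint slice of the output. A minimal sketch (hypothetical names; it creates the threads per call for brevity, whereas a real renderer would keep them persistent):

```python
import threading

def trace_batch(vertices, trace, out, lo, hi):
    # Each thread owns out[lo:hi]; disjoint slices mean no locking
    # (and, in C, no cache lines ping-ponging between the two cores).
    for i in range(lo, hi):
        out[i] = trace(vertices[i])

def trace_all(vertices, trace, nthreads=2):
    """One dispatch and one join per frame, instead of many
    fine-grained start signals."""
    out = [None] * len(vertices)
    step = (len(vertices) + nthreads - 1) // nthreads
    threads = []
    for k in range(nthreads):
        lo, hi = k * step, min((k + 1) * step, len(vertices))
        t = threading.Thread(target=trace_batch,
                             args=(vertices, trace, out, lo, hi))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return out
```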

spencer

(replying to karx11erx's post of Dec 31, 2006, 1:17 AM)

Spencer Salazar <ssalazar CS.Princeton.EDU> writes:

[Spencer's reply, quoted in full, snipped]

I don't think this is a one-thread problem. It is parallelizable: there are
few memory accesses, but a lot of 3D math. What I think is that the thread
management overhead is so expensive that it costs more than what parallel
execution saves in execution time. The fact that I do get a speedup with the
unoptimized debug code suggests that.
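That trade-off can be put into a toy model: splitting the work only pays off when the per-dispatch work exceeds the fixed signal/wake/join overhead. The numbers below are purely illustrative, not measurements:

```python
def speedup(work_us, overhead_us, nthreads=2):
    """Estimated speedup for one dispatch: the work is split across
    nthreads, but every dispatch pays a fixed signal/wake/join cost.
    A value below 1.0 means the threaded version is slower."""
    serial = work_us
    parallel = work_us / nthreads + overhead_us
    return serial / parallel
```

This is consistent with the debug-build observation: the slow debug code does more work per chunk, so the fixed overhead matters less; in the optimized release build the chunks shrink and the overhead dominates.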

karx11erx wrote:

[karx11erx's original post, quoted in full, snipped]

Since you are wondering where your app is spending its time, how about
running it through a profiler?
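For native code on WinXP that would mean the compiler's profiler or a vendor tool; as a language-neutral sketch of the same workflow, here is the equivalent with Python's cProfile (the helper name is made up):

```python
import cProfile
import io
import pstats

def profile_report(fn, *args):
    """Run fn under the profiler and return the top-10 entries
    sorted by cumulative time, as a text report."""
    pr = cProfile.Profile()
    pr.enable()
    fn(*args)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(10)
    return buf.getvalue()
```

A report like this would show directly whether the time goes into the ray math or into the thread signaling machinery.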

Cheers,
--
Frank Becker - Need a break? http://criticalmass.sf.net/