Using SDL_atomic

Sik_the_hedgehog · March 2, 2015, 10:15pm

2015-03-02 17:13 GMT-03:00, Eirik Byrkjeflot Anonsen :

Note that in this case ‘shared_data’ is read before ‘atomic’ is tested.
Thus it might end up sending a stale value to dangerous(). When the SDL
documentation says “Seriously, here be dragons!”, it really means it

Wouldn’t this be an issue with mutexes as well, actually? I mean, if
the compiler can reorder around SDL_AtomicGet like that, it surely can
reorder around the function that calls the mutex lock as well for the
same reason. It’s not like the compiler knows that the shared data is
not safe to cache

2015-03-02 18:16 GMT-03:00, john skaller :

Then the “volatile” should, normally, prevent any compiler optimisations.

Except it won’t. The only thing it does is guarantee that it’ll write
to memory and that consecutive volatile accesses will be done in said
order (and now not even that thanks to processor-level reordering, it
needs to be cache-through as well for that to work). There’s a very
good reason why volatile doesn’t work at all for multithreading.

The only purpose of volatile is to access hardware ports. Anything
else won’t work as expected.
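
For reference, the ordering concern quoted above (the payload being read before the flag is tested) can be sketched with C11 atomics instead of volatile. This is only a minimal sketch under assumptions: the thread is about SDL's atomics, not C11, and the names `publish`, `try_consume`, `shared_data` and `ready` are hypothetical. A release store paired with an acquire load is one way to tell both the compiler and the CPU that the payload must be written before the flag and read after it:

```c
#include <stdatomic.h>

/* Hypothetical names for illustration; not taken from the code under discussion. */
static int shared_data;      /* plain, non-atomic payload */
static atomic_int ready;     /* flag guarding the payload, zero-initialized */

/* Writer: fill in the payload first, then publish it with a release store. */
void publish(int value)
{
    shared_data = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: the acquire load is ordered before the read of shared_data.
   Returns 1 and fills *out if data was published, 0 otherwise. */
int try_consume(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        return 0;
    *out = shared_data;
    return 1;
}
```

With a plain or volatile flag there is no such pairing, which is why the read of the payload may legally be hoisted above the test.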

Bob_Pendleton · March 2, 2015, 11:01pm

Just a couple things, since I pushed the atomics through acceptance and
wrote the first several versions before they were completely rewritten…

Just because you start another thread is no reason to believe that the
thread is running. In fact, since the total number of threads running on a
system is pretty much always larger than the number of cores you can be
sure that sometimes one thread will be running and one thread will not be
running. You have to use locks to make sure that a thread can actually
block and force another thread to run. The code as presented may never let
the save thread run at all because until it runs the flag will not be set
and unless you force the other thread to block once in a while it may never
stop running and let the flag be set.

Probably the best test of when to use an atomic versus a mutex is how long
the flag will stay in the locked state. If you are going to keep the flag
set for more than a few hundred cycles then use a mutex. Yes, I said
CYCLES. The time it takes to run at most a few hundred instructions in a
single core.
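
To make the "few hundred cycles" rule concrete, here is a rough sketch of the kind of lock that only makes sense for very short critical sections: a spinlock built on a C11 `atomic_flag`. The type and function names are hypothetical, and C11 is an assumption (SDL offers its own `SDL_AtomicLock`, `SDL_AtomicTryLock` and `SDL_AtomicUnlock` for the same job). A waiter burns CPU for as long as the lock is held, which is exactly why anything longer belongs under a mutex:

```c
#include <stdatomic.h>

/* Hypothetical minimal spinlock; only sensible for very short critical sections. */
typedef struct { atomic_flag held; } spinlock;

/* Try once; returns 1 if the lock was acquired, 0 if someone else holds it. */
int spin_try_lock(spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Busy-wait until the lock is free. The waiter never sleeps, so a holder
   that has been descheduled stalls everyone spinning on the lock. */
void spin_lock(spinlock *l)
{
    while (!spin_try_lock(l)) {
        /* spin */
    }
}

/* Release the lock; the release ordering publishes writes made while holding it. */
void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A lock would be declared as `spinlock l = { ATOMIC_FLAG_INIT };` and the guarded region kept to a handful of instructions.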

Oh, yeah, you should never use simple assignment to communicate between
threads. As was pointed out above, between the machine scheduling
instructions out of order and the compiler moving code all over the place
you can never be sure when, or if, an assignment will actually take place.
In fact if you set a value like flag = true; and then later say flag =
false, and do not check the value in between the two statements, the
compiler may just decide to eliminate flag = true because it has no effect.
If the flag is initialized to false then both statements can be removed
completely, unless you tell the compiler not to do that using the volatile
type modifier. A good dead code eliminator pass in the compiler can do
amazing things if you let it and are not aware that it exists.

Bob Pendleton

SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org

–
+----------------------------------------------------------

Bob Pendleton: writer and programmer
email: Bob at Pendleton.com
blog: www.TheGrumpyProgrammer.com

john_skaller · March 3, 2015, 1:23am

2015-03-02 17:13 GMT-03:00, Eirik Byrkjeflot Anonsen :

Note that in this case ‘shared_data’ is read before ‘atomic’ is tested.
Thus it might end up sending a stale value to dangerous(). When the SDL
documentation says “Seriously, here be dragons!”, it really means it

Wouldn’t this be an issue with mutexes as well, actually? I mean, if
the compiler can reorder around SDL_AtomicGet like that, it surely can
reorder around the function that calls the mutex lock as well for the
same reason. It’s not like the compiler knows that the shared data is
not safe to cache

It’s not really compiler reordering that’s the problem, rather
it’s the CPU and cache management.

In theory, mutexes just can’t work. I mean, they are specified
to ensure mutual exclusion, and they will do that, but in theory
that is of no value whatsoever, since it doesn’t lead to any ability
to share.

–
john skaller
@john_skaller
http://felix-lang.org

john_skaller · March 3, 2015, 1:33am

Just a couple things, since I pushed the atomics through acceptance and wrote the first several versions before they were completely rewritten…

Thanks!

Just because you start another thread is no reason to believe that the thread is running.

Probably the best test of when to use an atomic versus a mutex is how long the flag will stay in the locked state.

Which is of course no use for the problem you mentioned above

If you want to force a thread to wait for another thread you have to use
a condition variable or semaphore … that won’t force the other thread
to run but it will force the current one to wait UNTIL it runs (up
to a particular point).
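
A minimal sketch of that idea, assuming POSIX threads rather than SDL (SDL has `SDL_CondWait` and `SDL_CondSignal` in the same role), with hypothetical names throughout. The waiting thread sleeps on a condition variable until the worker has run up to the point where it sets the predicate. Nothing forces the worker to run, the waiter simply does not proceed until it has:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical hand-off structure: a predicate protected by a mutex, plus a
   condition variable to sleep on until the predicate changes. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    int             ready;
    int             value;
} handoff;

static handoff g_handoff = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0
};

/* Worker: produce a value, then wake whoever is waiting for it. */
static void *worker(void *arg)
{
    handoff *h = arg;
    pthread_mutex_lock(&h->lock);
    h->value = 42;
    h->ready = 1;
    pthread_cond_signal(&h->changed);
    pthread_mutex_unlock(&h->lock);
    return NULL;
}

/* Block the calling thread until the worker has reached its signal.
   Returns the produced value, or -1 if the thread could not be started. */
int run_handoff_demo(void)
{
    handoff *h = &g_handoff;
    pthread_t tid;
    int result;

    if (pthread_create(&tid, NULL, worker, h) != 0)
        return -1;

    pthread_mutex_lock(&h->lock);
    while (!h->ready)                 /* loop guards against spurious wakeups */
        pthread_cond_wait(&h->changed, &h->lock);
    result = h->value;
    pthread_mutex_unlock(&h->lock);

    pthread_join(tid, NULL);
    return result;
}
```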

In fact for SDL the correct control structure to provide is probably a
thing called a monitor. [Monitors are provided in Felix represented
as pchannels]

The only other way to “force” threads to run is to use a RTOS (Real time
operating system).

Sik_the_hedgehog · March 3, 2015, 7:07am

2015-03-02 22:33 GMT-03:00, john skaller :

The only other way to “force” threads to run is to use a RTOS (Real time
operating system).

Or to force a yield, which tells the OS that it’s a good place to make
the thread start waiting (i.e. tell the scheduler that it’s OK to
switch threads ahead of time). I know SDL_Sleep(0) on Windows manages
to do this, but for some reason I can’t get this to work on Linux (my
build is configured to use the wrong API, maybe?).

I know that using SDL_Sleep may be seen as a problem by some people
due to the unpredictability of sleeping (it waits at least the amount
of time specified but can wait more, even seconds if it wishes), but I
was messing with it and in practice it doesn’t really cause problems,
at least on modern Windows.
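
Outside SDL, POSIX exposes an explicit yield as `sched_yield()`. The sketch below is only an illustration under that assumption, with a hypothetical helper name: it polls an atomic flag and yields between checks. The yield is a hint and nothing more. The scheduler may hand the CPU straight back, and there is no guarantee that the thread which sets the flag is the one that gets to run:

```c
#include <sched.h>
#include <stdatomic.h>

/* Poll a flag, yielding the CPU between checks. Returns 1 if the flag was
   seen set after at most max_yields yields, 0 otherwise. Hypothetical helper. */
int wait_for_flag_yielding(atomic_int *flag, int max_yields)
{
    for (int i = 0; i < max_yields; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire) != 0)
            return 1;
        sched_yield();   /* a hint only: the scheduler may return at once */
    }
    /* One last look after the final yield. */
    return atomic_load_explicit(flag, memory_order_acquire) != 0;
}
```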

john_skaller · March 3, 2015, 12:33pm

2015-03-02 22:33 GMT-03:00, john skaller <@john_skaller>:

The only other way to “force” threads to run is to use a RTOS (Real time
operating system).

Or to force a yield, which tells the OS that it’s a good place to make
the thread start waiting

That’s useful but it still doesn’t force “the other” thread to run.

john_skaller · March 3, 2015, 12:41pm

I know SDL_Sleep(0) on Windows manages
to do this, but for some reason I can’t get this to work on Linux (my
build is configured to use the wrong API, maybe?).

Try setting it to 1 instead of 0 (you mean SDL_Delay I assume)
[The argument should be floating point but that’s another issue … :]

If you really want another thread to run, you need to give it some time.

Sik_the_hedgehog · March 3, 2015, 12:57pm

2015-03-03 9:33 GMT-03:00, john skaller :

That’s useful but it still doesn’t force “the other” thread to run.

Well, a RTOS wouldn’t help here either, you’d need a single tasking
system and have full control over each core (you simply have no way
to tell the scheduler to move onto the other thread, it can decide to
move to a different thread, or even to a different process)

To put it bluntly, you should always assume the other thread may
respond much later in the future if performance is a serious issue.
Incidentally this is also why the concept of critical sections exists,
they tell the scheduler that it’s the worst moment to switch away
since other threads are waiting for a resource to be unlocked.

2015-03-03 9:41 GMT-03:00, john skaller :

Try setting it to 1 instead of 0 (you mean SDL_Delay I assume)

Er yeah.

But if I recall correctly 0 on Windows does a yield anyway (it moves
onto other threads and returns as soon as the scheduler says so) and
SDL doesn’t seem to be filtering out the value. On Linux it’s a whole
different issue since first of all it depends on the underlying API
(i.e. one of the two APIs it supports just does a busy loop, so even a
huge delay will result in CPU hogging by the thread if SDL is using
that)

Eirik_Byrkjeflot_Ano · March 3, 2015, 3:44pm

john skaller writes:

2015-03-02 17:13 GMT-03:00, Eirik Byrkjeflot Anonsen <@Eirik_Byrkjeflot_Ano>:

Note that in this case ‘shared_data’ is read before ‘atomic’ is tested.
Thus it might end up sending a stale value to dangerous(). When the SDL
documentation says “Seriously, here be dragons!”, it really means it

Wouldn’t this be an issue with mutexes as well, actually? I mean, if
the compiler can reorder around SDL_AtomicGet like that, it surely can
reorder around the function that calls the mutex lock as well for the
same reason. It’s not like the compiler knows that the shared data is
not safe to cache

Correct, however, see below…

It’s not really compiler reordering that’s the problem, rather
it’s the CPU and cache management.

Similar problems, but compiler optimizations are more likely to cause
problems because they can do so much more. CPU and cache management will
not eliminate “unnecessary” code or execute code speculatively. A good
optimizing compiler can and will do both. And more.

In theory, mutexes just can’t work. I mean, they are specified
to ensure mutual exclusion, and they will do that, but in theory
that is of no value whatsoever, since it doesn’t lead to any ability
to share.

A correct implementation of posix mutexes does work, since they are
specified to do all the right things (mutual exclusion and full memory
barriers). However, as I referred to earlier in this thread: “Threads
Cannot be Implemented as a Library”
(http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf).

In practice, that just means that compilers have to recognize the
threading code and disable any optimizations around that code that would
break it. And I expect that’s exactly what they do. But it does make it
more likely for bugs to show up in this area, I’m sure.

This is unlike the situation with “volatile”, which has been documented
for a long time as not suitable for multi-threading. So I would not
trust any serious optimizing compiler to avoid dangerous optimizations
around those.

Of course, the situation is much better with C++11, which does provide
the necessary primitives (and memory model) to write multi-threaded
code.

eirik

john_skaller · March 3, 2015, 8:41pm

2015-03-03 9:33 GMT-03:00, john skaller <@john_skaller>:

That’s useful but it still doesn’t force “the other” thread to run.

Well, a RTOS wouldn’t help here either,

Sure it would. I’ve written one. RTOS can make hard guarantees
that threads run, and run in a particular time as well.

Next time you fly over a nuclear power plant … you’d better hope
both the plant and plane are controlled with critical components
running on a RTOS.

john_skaller · March 3, 2015, 9:23pm

Similar problems, but compiler optimizations are more likely to cause
problems because they can do so much more. CPU and cache management will
not eliminate “unnecessary” code or execute code speculatively.

Oh, but they do (execute code speculatively). In fact all modern
Intel CPUs do this.

It’s unlikely a compiler will do it because compilers can really
only schedule a single thread of control. All they do is try
to help the CPU do it.

“Modern” compilers are still quite stupid, at least in part
because they’re compiling a language which appears to have
been designed to defeat optimisation (namely, C).

[People doing high performance numerical work still use Fortran …]

A correct implementation of posix mutexes does work, since they are
specified to do all the right things (mutual exclusion and full memory
barriers).

I actually checked the specs and saw no mention of a memory barrier,
do you happen to have a link?

In practice, that just means that compilers have to recognize the
threading code and disable any optimizations around that code that would
break it. And I expect that’s exactly what they do.

Certainly not for languages like C. They don’t “recognise” anything.
C compilers are extremely dumb. They can barely optimise
basic primitives like memset … and when they do they break
all sorts of security code (clearing passwords out of memory …)
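
The password-clearing case can be sketched as follows. This is a commonly used workaround, not something the thread proposes, and the helper name is hypothetical. A plain `memset` on a buffer that is never read again is a dead store the optimizer may drop, so the sketch writes through a volatile-qualified pointer instead, which compilers are generally expected to leave alone. Where available, `memset_s` (C11 Annex K, optional) or `explicit_bzero` (glibc and the BSDs) are the purpose-built alternatives:

```c
#include <stddef.h>

/* Hypothetical helper: zero a buffer through a volatile pointer so that each
   store is treated as an access the compiler should not optimize away. */
void secure_zero(void *buf, size_t len)
{
    volatile unsigned char *p = buf;
    while (len--)
        *p++ = 0;
}
```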

Of course, the situation is much better with C++11, which does provide
the necessary primitives (and memory model) to write multi-threaded
code.

The situation is better with C++ because it provides much higher
level constructs and a stronger type system, as well as specifically
supporting threads. In addition the design is deliberate and modern
(although it still has to work in a framework which is poorly structured).

Bob_Pendleton · March 3, 2015, 11:04pm

Look, sharing works, threads work, because the hardware is designed to make
them work. Yes, you have to use operations that are recognized by the
hardware as memory barriers and in some cases the compiler also has to
recognize them so it does the right thing with instruction scheduling. But,
the machines do it right and the compilers do it right and if you do all
the right things it works and it works very well. I first did
multi-threaded code on a Univac 1108 in the early 70s (in fortran and
cobol) and the basic rules have not changed. (I’ve also implemented threads
on the 8080 and later machines.)

And, yes, you can implement a thread package as a library, but it needs to
have support from hardware and the OS.

But, learning to write multithreaded code is not easy. The natural
assumptions built into the human mind about how multiple threads "should"
act are completely different from the reality of how they DO act.

BTW, the main reason I wanted atomics in SDL was to implement atomic
reference counting. Reference counting has its problems, but it has very
nice properties for use in interactive programs. But, threads and reference
counts do not mix well unless you have atomic increment and decrement.
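
As an illustration of atomic reference counting, here is a minimal sketch using C11 atomics. The object type and function names are hypothetical, and C11 is an assumption; in SDL the same pattern is written with `SDL_AtomicIncRef` and `SDL_AtomicDecRef`. The decrement has to be a single atomic read-modify-write so that exactly one thread observes the count reaching zero and frees the object:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical reference-counted object. */
typedef struct {
    atomic_int refs;
    int        payload;
} shared_obj;

/* Create an object owned by one reference; returns NULL on allocation failure. */
shared_obj *obj_create(int payload)
{
    shared_obj *o = malloc(sizeof *o);
    if (o == NULL)
        return NULL;
    atomic_init(&o->refs, 1);
    o->payload = payload;
    return o;
}

/* Take another reference. Relaxed ordering is enough for the increment
   because the caller already holds a reference. */
void obj_retain(shared_obj *o)
{
    atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

/* Drop a reference. Returns 1 if this call freed the object, 0 otherwise. */
int obj_release(shared_obj *o)
{
    /* acq_rel: writes made by other owners are visible to the thread that
       ends up freeing the object. */
    if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_acq_rel) == 1) {
        free(o);
        return 1;
    }
    return 0;
}
```

With a plain `int` counter, two threads releasing at the same time could both read the same old value, and the object would be freed twice or not at all.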

Oh, yeah, I should not have said anything about “forcing” another thread to
run. You can’t do that. You can only stop your thread from running until
the other thread has run.

Bob Pendleton

Eirik_Byrkjeflot_Ano · March 4, 2015, 3:22pm

john skaller writes:

Similar problems, but compiler optimizations are more likely to cause
problems because they can do so much more. CPU and cache management will
not eliminate “unnecessary” code or execute code speculatively.

Oh, but they do (execute code speculatively). In fact all modern
Intel CPUs do this.

Yes, unfortunate choice of words there

The guarantees provided by Intel CPUs when they reorder your code are far
stronger than the guarantees of the C/C++ standards, though.

It’s unlikely a compiler will do it because compilers can really
only schedule a single thread of control. All they do is try
to help the CPU do it.

It is not only likely, it is absolutely guaranteed that modern compilers
will both eliminate unnecessary code and execute code speculatively.
That is, a compiler will calculate values just in case they may be
needed. And that means they will reorder the code in such a way that
code that should not be reachable will still be executed. So code that
is written like:

    if (a == 0)
        call_a_function(b);

can well be rewritten as:

    b_type tmp = b;
    if (a == 0)
        call_a_function(tmp);

if the compiler’s analysis shows that this is likely to typically be
faster. (And it doesn’t break the guarantees of the language, of
course.)

“Modern” compilers are still quite stupid, at least in part
because they’re compiling a language which appears to have
been designed to defeat optimisation (namely, C).

C does have features that make certain classes of optimization harder
(pointers, in particular). Some of those problems can be mitigated (e.g.
by using “restrict” in the places where the compiler needs that
guarantee.)
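
A small sketch of what “restrict” buys, with a hypothetical function name. By marking the three pointers `restrict` (C99 and later), the caller promises the arrays do not overlap, so the compiler no longer has to assume that a store to `dst[i]` might change a later `a[j]` or `b[j]`, and is free to keep values in registers or vectorize the loop:

```c
#include <stddef.h>

/* With restrict, the caller promises dst, a and b never overlap.
   Passing overlapping arrays is undefined behavior. */
void add_arrays(float *restrict dst,
                const float *restrict a,
                const float *restrict b,
                size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```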

[People doing high performance numerical work still use Fortran …]

True, though maybe as much from tradition as from actual advantages.
But yes, classic fortran has some restrictions that are useful for these
cases.

A correct implementation of posix mutexes does work, since they are
specified to do all the right things (mutual exclusion and full memory
barriers).

I actually checked the specs and saw no mention of a memory barrier,
do you happen to have a link?

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11

“synchronize thread execution and also synchronize memory with respect
to other threads.”
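
In practice that guarantee is what makes the ordinary pattern below work. It is a minimal sketch with hypothetical names: a plain, non-atomic, non-volatile counter is shared by two threads, and every access happens between `pthread_mutex_lock` and `pthread_mutex_unlock`. Because the lock and unlock calls synchronize memory, each thread sees the other's increments and none are lost:

```c
#include <pthread.h>
#include <stddef.h>

#define INCREMENTS_PER_THREAD 100000

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;   /* plain long: every access happens under counter_lock */

/* Thread body: bump the shared counter, taking the lock for each increment. */
static void *add_many(void *unused)
{
    (void)unused;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Run two incrementing threads; returns the final count, or -1 on error. */
long run_counter_demo(void)
{
    pthread_t t1, t2;
    long result;

    pthread_mutex_lock(&counter_lock);
    counter = 0;
    pthread_mutex_unlock(&counter_lock);

    if (pthread_create(&t1, NULL, add_many, NULL) != 0)
        return -1;
    if (pthread_create(&t2, NULL, add_many, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    pthread_mutex_lock(&counter_lock);
    result = counter;
    pthread_mutex_unlock(&counter_lock);
    return result;
}
```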

In practice, that just means that compilers have to recognize the
threading code and disable any optimizations around that code that would
break it. And I expect that’s exactly what they do.

Certainly not for languages like C. They don’t “recognise” anything.
C compilers are extremely dumb. They can barely optimise
basic primitives like memset … and when they do they break
all sorts of security code (clearing passwords out of memory …)

Yes, even not-very-modern C compilers recognize common patterns and
optimize them specifically. Because that is really effective.

For important constructs that are known to be broken unless the compiler
helps out, you can be sure the compiler authors will detect those cases
and protect them where necessary.

memset is also a library function and not a basic primitive. And as you
say, compilers do recognize them and optimize them. Or how about this
little surprise: What do you think this code compiles to (using current
gcc):

    printf("Hello world\n");

Turns out the final binary doesn’t call printf at all. It is instead
turned into:

    puts("Hello world");

I discovered this when my breakpoint on printf never triggered

Though, on second thoughts, I expect what actually happens with the
threading functions is that compiler-specific barriers have been added
to the code to ensure the tested compilers are unable to optimize the
code to breaking.

Of course, the situation is much better with C++11, which does provide
the necessary primitives (and memory model) to write multi-threaded
code.

The situation is better with C++ because it provides much higher
level constructs and a stronger type system, as well as specifically
supporting threads. In addition the design is deliberate and modern
(although it still has to work in a framework which is poorly structured).

C++11 in particular. Older versions do not have language support for
threads, and so need to deal with the problems of libraries supporting
threads.

eirik

Eirik_Byrkjeflot_Ano · March 4, 2015, 3:36pm

Bob Pendleton writes:

Look, sharing works, threads work, because the hardware is designed to make
them work. Yes, you have to use operations that are recognized by the
hardware as memory barriers and in some cases the compiler also has to
recognize them so it does the right thing with instruction scheduling.

Yes, and this is the problem. The C89 and C++03 specifications have very
limited provisions for memory barriers. This is intentional, because it
allows some seriously effective optimizations. Thus there is no way for
a library to implement proper threading primitives while only referring
to the language specifications. That’s essentially what Hans Boehm says
in the article.

Of course, if you are making a library for a particular version of a
particular compiler, you can usually figure out ways to force it not to
break your code

But,
the machines do it right and the compilers do it right and if you do all
the right things it works and it works very well.

Also true. In practice, compiler vendors will work with threading
library vendors to ensure that those threading libraries will fulfill
their promises. Because anything else would be truly stupid.

However, if you write your own threading primitives, you run into the
problem that you need those compiler-level memory barriers. Some
compilers actually provide that, but pure C89 and C++03 do not.
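
For illustration only, here is what such barriers look like when the language does provide them, using C11's `<stdatomic.h>` (an assumption, since the discussion is about C89 and C++03; the function names are hypothetical). Before C11 the compiler-level version was typically a vendor extension such as GCC's `__asm__ __volatile__("" ::: "memory")`, and SDL wraps the same ideas as `SDL_CompilerBarrier()`, `SDL_MemoryBarrierRelease()` and `SDL_MemoryBarrierAcquire()`:

```c
#include <stdatomic.h>

static int scratch;   /* hypothetical shared location */

/* atomic_signal_fence constrains compiler reordering only; it emits no CPU
   fence instruction, so on its own it does not order accesses between cores. */
int store_then_reload(int value)
{
    scratch = value;
    atomic_signal_fence(memory_order_seq_cst);  /* compiler-level barrier */
    return scratch;
}

/* atomic_thread_fence additionally orders memory operations as seen by other
   threads, when paired with atomic accesses on the other side. */
void full_fence(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}
```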

eirik

Bob_Pendleton · March 4, 2015, 6:52pm

Ah, yes, the standards do not provide the mechanism. That is very true.
The standards can not provide the mechanism because the mechanism is
always machine and sometimes OS specific. The standards define the basic
language that you are supposed to be able to count on from system to system
and machine to machine. The standards can not define things that must be
done differently for each cpu architecture or OS. The standard does not
even specify how many bits are in an int, it only specifies the minimum
number of bits in an int. I’ve worked on machines with 18 and 36 bit ints.
I saw C on a lisp machine with arbitrary length (as large as will fit in
virtual memory) ints. Thousand plus digit ints are cool. Just like SDL must
live with the lowest common denominator so must standards live with
defining what can be defined.

But, C compilers provide extensions that make it possible to implement all
sorts of things, such as thread libraries, even though the standards do not
and cannot contain those features.

Glad we got that straightened out. There is a huge difference between the
language specified in the standard and the language that actually gets
implemented.

An aside, I seriously dislike most every threading package I have ever
encountered because of just one thing. They do not implement threads that
work as people expect them to work. They implement them the ways the OS
scheduler works. Not at all the same thing. This confuses people fiercely.

Bob Pendleton

john_skaller · March 4, 2015, 10:55pm

Also true. In practice, compiler vendors will work with threading
library vendors to ensure that those threading libraries will fulfill
their promises. Because anything else would be truly stupid.

The truly stupid is typically exceedingly common.

Eirik_Byrkjeflot_Ano · March 5, 2015, 3:55pm

Bob Pendleton writes:

Ah, yes, the standards do not provide the mechanism. That is very true.
The standards can not provide the mechanism because the mechanism is
always machine and sometimes OS specific. The standards define the basic
language that you are supposed to be able to count on from system to system
and machine to machine. The standards can not define things that must be
done differently for each cpu architecture or OS.

True when these differences are visible to the code. However, for memory
coherence, all the code cares about is that it gets guarantees about
some specific level of memory coherence at some specific points of
source-level execution. So the compiler can hide away the details of
exactly how that is accomplished.

And in fact, some standards do provide such mechanisms. C++11 being the
most relevant example I know of (I don’t know whether C99 or C11 does).
I think the reason C89 and C++03 did not provide such mechanisms was
that they weren’t considered important at the time. And they thought
that it could be done as libraries.

The standard does not
even specify how many bits are in an int, it only specifies the minimum
number of bits in an int.

C++03 actually specifies a char to be 1 byte. Not that it helps any, as
it goes on to say that it is unspecified how many bits are in a byte

[…]

Glad we got that straightened out. There is a huge difference between the
language specified in the standard and the language that actually gets
implemented.

Important distinction

And in the end, the main point I take away from Hans Boehm’s article is
that C compilers will do extremely weird things to your code. Which will
work as expected in single-threaded code because any temporary weird
state will be cleaned up before it is observed. But in multi-threaded
code, you really need explicit compiler-level memory barriers around all
access to shared-memory data.

An aside, I seriously dislike most every threading package I have ever
encountered because of just one thing. They do not implement threads that
work as people expect them to work. They implement them the ways the OS
scheduler works. Not at all the same thing. This confuses people fiercely.

I’m interested. In which ways do you think these thread libraries work
contrary to people’s expectations?

eirik

Bob_Pendleton · March 5, 2015, 7:40pm

An aside, I seriously dislike most every threading package I have ever
encountered because of just one thing. They do not implement threads that
work as people expect them to work. They implement them the ways the OS
scheduler works. Not at all the same thing. This confuses people fiercely.

I’m interested. In which ways do you think these thread libraries work
contrary to people’s expectations?

Ok, that is really the subject for a long blog post.

I’ll try to be fairly quick here. People expect threads to run in parallel.
They expect that if they have 20 threads all 20 threads will be running at
once. People think of threads as being like workers in a factory, each
doing their jobs in parallel with all the other workers in the factory. In
reality you have N cores. That means you have at most N workers running
around doing the jobs of all the workers. These workers do not just
automatically stop one job and switch to another job. They only switch when
they can not keep doing the job they are doing.

The key thing is that no matter how many threads you have only a few of
them will be running at one time. But, people expect all threads to be
running all the time. Even people who know better tend to expect that if
they have N cores they should have N active threads in their code. When in
fact they may well have zero cores active or N - (any number <= N).
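
A program can at least ask how large N is. The sketch below assumes a Unix-like system and uses `sysconf(_SC_NPROCESSORS_ONLN)`, a widespread extension rather than something POSIX requires (SDL exposes the same information as `SDL_GetCPUCount()`); the helper name is hypothetical. The result is only an upper bound on how many of your threads can be running at one instant, since other processes compete for the same cores:

```c
#include <unistd.h>

/* Number of processors currently online, or 1 if the system cannot say.
   _SC_NPROCESSORS_ONLN is a common extension, not required by POSIX. */
long online_cores(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}
```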

When you get to situations with multiple machines connected together it
gets even harder for people to understand. I once spent hours… really
days, trying to explain to an EE why when our systems were connected by a
high speed parallel bus the complete system ran slower than when connected
by a low speed serial line. The difference was that the bus was polled and
the serial line was interrupt driven. He never did understand why fast was
slow and slow was fast… but he finally gave me an interrupt on the
parallel bus. The wrong Interrupt, but an interrupt which let it run almost
as fast as the serial line. He never did understand that the interrupt let
me queue data so that both machines ran nearly full speed all the time
while polling forced the machines to run in lockstep.
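
A rough way to see why queueing beats lockstep is a toy timing model in C. It says nothing about the actual hardware in the story. The function names and the per-item costs are assumptions for illustration only. In the lockstep version the sender cannot start the next item until the receiver has accepted the current one. In the queued version the sender just keeps going.

```c
#include <stddef.h>

static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }

/* Lockstep: each hand-off needs both sides ready at the same moment. */
unsigned lockstep_finish_time(const unsigned produce[],
                              const unsigned consume[], size_t n)
{
    unsigned sender_free = 0;   /* when the sender may start the next item */
    unsigned receiver_free = 0; /* when the receiver is done with its item */

    for (size_t i = 0; i < n; i++) {
        unsigned ready = sender_free + produce[i];
        unsigned handoff = max_u(ready, receiver_free);
        sender_free = handoff;            /* sender waited for the receiver */
        receiver_free = handoff + consume[i];
    }
    return receiver_free;
}

/* Queued: the sender never waits; items sit in an unbounded queue. */
unsigned queued_finish_time(const unsigned produce[],
                            const unsigned consume[], size_t n)
{
    unsigned produced_at = 0;   /* when item i lands in the queue */
    unsigned receiver_free = 0;

    for (size_t i = 0; i < n; i++) {
        produced_at += produce[i];
        unsigned start = max_u(produced_at, receiver_free);
        receiver_free = start + consume[i];
    }
    return receiver_free;
}
```

With production costs {1, 1, 5, 5} and consumption costs {5, 5, 1, 1}, the lockstep model finishes at tick 17 and the queued model at tick 13. The sender gets its slow items under way while the receiver is still busy with the early ones.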

Oh well, documentation and education do not make enough of a distinction
between software threads, the things that thread packages deal with, and
hardware threads, the real things that do the work. The lack of a
one-to-one correspondence between them is very surprising to people.

The most intuitive thread package I ever used was one I wrote under DOS on
a 286 lo these many years ago. It switched threads whenever a thread
blocked, but it also used a timer interrupt to force switching after about
a thousand instructions had been run. That kind of fine-grained scheduling
"wastes" a lot of CPU time but it made it look like every thread was always
running. I did have to lock out task switching around all I/O calls
though…
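
The timer-driven part of that idea can be sketched as a plain round-robin loop. This is a simulation in C with made-up names, not the DOS package itself. It models one core and leaves out the blocking case and the I/O lock-out entirely.

```c
#include <stddef.h>

/* Toy round-robin on a single core: the "timer" forces a switch to the
   next thread every `quantum` ticks. On return, ran[i] holds how many
   ticks thread i got out of the simulated `ticks`. */
void round_robin(size_t nthreads, unsigned quantum, unsigned ticks,
                 unsigned ran[])
{
    size_t current = 0; /* thread that currently owns the core */
    unsigned used = 0;  /* ticks used in the current time slice */

    for (size_t i = 0; i < nthreads; i++)
        ran[i] = 0;
    if (nthreads == 0 || quantum == 0)
        return;

    for (unsigned t = 0; t < ticks; t++) {
        ran[current]++;
        if (++used == quantum) {          /* timer fires: forced switch */
            used = 0;
            current = (current + 1) % nthreads;
        }
    }
}
```

In this model, any window of at least `nthreads * quantum` ticks gives every thread some time. With a small quantum, that is what makes it look as if all of them are always running.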

I base my observations on my own learning curve, my experience trying to
teach the subject, and on decades of helping people on mailing lists.

Bob Pendleton

On Thu, Mar 5, 2015 at 9:55 AM, Eirik Byrkjeflot Anonsen wrote:

Bob Pendleton <@Bob_Pendleton> writes:

Ah, yes, the standards do not provide the mechanism. That is very true.
The standards can not provide the mechanism because the mechanism is
always machine and sometimes OS specific. The standards define the basic
language that you are supposed to be able to count on from system to
system and machine to machine. The standards can not define things that
must be done differently for each cpu architecture or OS.

True when these differences are visible to the code. However, for memory
coherence, all the code cares about is that it gets guarantees about
some specific level of memory coherence at some specific points of
source-level execution. So the compiler can hide away the details of
exactly how that is accomplished.

And in fact, some standards do provide such mechanisms. C++11 being the
most relevant example I know of (I don’t know whether C99 or C11 does).
I think the reason C89 and C++03 did not provide such mechanisms was
that they weren’t considered important at the time. And they thought
that it could be done as libraries.

The standard does not
even specify how many bits are in an int, it only specifies the minimum
number of bits in an int.

C++03 actually specifies a char to be 1 byte. Not that it helps any, as
it goes on to say that it is unspecified how many bits are in a byte

[…]

Glad we got that straightened out. There is a huge difference between the
language specified in the standard and the language that actually gets
implemented.

Important distinction

And in the end, the main point I take away from Hans Boehm’s article is
that C compilers will do extremely weird things to your code. Which will
work as expected in single-threaded code because any temporary weird
state will be cleaned up before it is observed. But in multi-threaded
code, you really need explicit compiler-level memory barriers around all
access to shared-memory data.
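
For reference, C11 did add such a mechanism in `<stdatomic.h>`. The sketch below is one possible way to write the flag-plus-data pattern with it. It assumes a compiler that supports C11 atomics, and `publish` and `try_consume` are hypothetical names. The release store and the acquire load are what stop the compiler and the CPU from moving the `shared_data` accesses across the flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int shared_data;        /* plain data, guarded by the flag below */
static atomic_bool data_ready; /* static storage: starts out false */

/* Writer side: fill in the data first, then raise the flag. */
void publish(int value)
{
    shared_data = value;
    atomic_store_explicit(&data_ready, true, memory_order_release);
}

/* Reader side: test the flag first, and only then touch the data. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&data_ready, memory_order_acquire))
        return false;          /* nothing published yet */
    *out = shared_data;
    return true;
}
```

Note that the reader tests the flag before it reads `shared_data`, never the other way round.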

An aside, I seriously dislike most every threading package I have ever
encountered because of just one thing. They do not implement threads that
work as people expect them to work. They implement them the ways the OS
scheduler works. Not at all the same thing. This confuses people
fiercely.

I’m interested. In which ways do you think these thread libraries work
contrary to people’s expectations?

eirik

--
+-----------------------------------------------------------

Bob Pendleton: writer and programmer
email: Bob at Pendleton.com
blog: www.TheGrumpyProgrammer.com

Sik_the_hedgehog · March 6, 2015, 1:14am

2015-03-05 12:55 GMT-03:00, Eirik Byrkjeflot Anonsen :

C++03 actually specifies a char to be 1 byte. Not that it helps any, as
it goes on to say that it is unspecified how many bits are in a byte

I believe C99 explicitly states char to be exactly 8 bits (the rest of
the sizes is still up to the implementation though, minimum size
aside).
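
Rather than going from memory, the numbers can be checked on a given implementation. As I understand it, the C standard only sets a floor here: `CHAR_BIT` in `<limits.h>` must be at least 8, and it is POSIX that requires exactly 8. The small C sketch below uses a made-up helper name, `bits_in`, and assumes a C11 compiler for `_Static_assert`.

```c
#include <limits.h>
#include <stddef.h>

/* Guarantees the C standard does make, checked at compile time. */
_Static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
_Static_assert(sizeof(char) == 1, "char is one byte by definition");
_Static_assert(INT_MAX >= 32767, "int covers at least 16 bits of range");

/* Storage width in bits of an object of the given size, on this
   implementation (padding bits, if any, are included). */
size_t bits_in(size_t size_in_bytes)
{
    return size_in_bytes * CHAR_BIT;
}
```

On a typical desktop platform `bits_in(sizeof(int))` comes out as 32, but the standard itself only promises the minimums asserted above.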