SDL atomic operations part #2, and many thanks to the QNX maintainer

OK, I’ve rounded up information about the atomic operations provided by GCC,
QNX, Mac OS X, and Windows. I’ve posted it at:
http://thegrumpyprogrammer.com/node/15.

Part #3 will be an attempt at a redesign based on what I have learned. I
want to reduce, as much as possible, the number of emulated ops.

I have a couple of major questions that I would like folks to get opinionated
about. The main question is about atomic operations on pointers. I
originally wrote the atomic ops so that they only worked on unsigned ints. I
figured people would deal with pointers by casting them to an unsigned type
of the right size. Mac OS X and Windows provide atomic ops on signed ints and
also provide operations for pointers. On Windows there is very little support
for atomic ops on pointers, but there is a little.

Having looked at what the other guys do I rather like providing separate ops
for ints and pointers. BUT, that would mean casting pointers to signed ints
on some platforms. Casting pointers to signed ints makes me very very
nervous. I spent too much of my youth working on machines where int and
(void *) were not the same size.

The other thing is also a casting question. How do you feel about casting
signed to unsigned, doing addition or subtraction, and then casting it
back? Or the reverse? We have the problem that some of the platforms
provide ops only for signed ints and some only for unsigned ints.

Bob Pendleton

Bob Pendleton writes:

[…]
Having looked at what the other guys do I rather like providing separate ops
for ints and pointers. BUT, that would mean casting pointers to signed ints
on some platforms. Casting pointers to signed ints makes me very very
nervous.

That’s good :-)

I spent too much of my youth working on machines where int and (void
*) were not the same size.

Be prepared to experience a déjà vu then ;-):

printf("%zu %zu\n", sizeof(int), sizeof(void*));
prints:
4 8
on x86_64 (aka amd64) / linux

It seems C99 <stdint.h> optionally provides intptr_t and uintptr_t to
get an integer large enough to hold a pointer. But there are platforms
without a large enough integer type.
Related thread on boost mailing list:
http://lists.boost.org/Archives/boost/2005/09/94253.php
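
As a rough illustration of the intptr_t/uintptr_t point, here is a minimal sketch that round-trips a pointer through uintptr_t. It assumes a C99 compiler whose <stdint.h> actually provides the optional uintptr_t; the helper names are made up for this example.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if a pointer survives a trip through
 * uintptr_t. Assumes the optional uintptr_t type exists on this platform. */
static int pointer_round_trips(void *p)
{
    uintptr_t bits = (uintptr_t)p;   /* pointer -> integer */
    void *back = (void *)bits;       /* integer -> pointer */
    return back == p;
}

/* Prints the sizes that matter for the casting question. %zu is the
 * printf conversion for size_t values such as sizeof results. */
static void print_sizes(void)
{
    printf("int=%zu long=%zu void*=%zu uintptr_t=%zu\n",
           sizeof(int), sizeof(long), sizeof(void *), sizeof(uintptr_t));
}
```

Calling print_sizes() on a given target shows at a glance whether int or long is wide enough there; the round-trip helper only works where uintptr_t is defined at all.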

jens

Hello, Bob!

I think 8- and 16-bit atomic operations are useless for most programmers, since on most CPUs atomic access is available at the native register size. Nowadays we have only 32- and 64-bit CPUs, so in my opinion only the 32- and 64-bit operations will be useful to developers.

The atomic access emulation code also has a problem: when the pointer to the atomic variable is not aligned and the variable crosses a page boundary, the access cannot be done as a single 32-bit or 64-bit memory write, at least on all Intel multi-core CPUs.
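
A small sketch of the kind of check this implies: a helper that tests whether an address is naturally aligned for a 4- or 8-byte access. The function name is hypothetical, and the check assumes the usual flat mapping between pointers and uintptr_t values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p is a multiple of `size` bytes,
 * i.e. naturally aligned for an object of that size. `size` must be
 * a power of two (4 or 8 for the cases discussed here). */
static int is_naturally_aligned(const volatile void *p, size_t size)
{
    return ((uintptr_t)p & (uintptr_t)(size - 1)) == 0;
}
```

An emulation layer could assert this on entry in debug builds, so that misaligned atomic variables are caught early instead of silently losing atomicity.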


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org

very very nervous. I spent too much of my youth working on machines
where int and (void *) were not the same size.

In our old age, sizeof (int) != sizeof (void *) on Win64 or on
Linux/amd64, either. On Linux/amd64, int is 32-bit, long is 64-bit. On
Win64, int and long are 32-bit, and you use an intrinsic type __int64
for 64-bit. Too much Windows code would have broken if they had changed
“long”, I guess.

You’re going to have to #ifdef something to do the right thing with
pointers everywhere, I assume.

–ryan.

I don’t think a major rewrite is all that necessary. Maybe slim it down
to some more basic operations that most platforms support, and
definitely add pointer operations.

As for pointer operations, and casting pointer to an integer type, isn’t
size_t the appropriate type for that?

And although I wouldn’t be affected if the 8- and 16-bit ops were removed,
I’m sure that someone somewhere would want to use them. I’m sure they can be
emulated by use of local 32-bit variables and the 32-bit version
of the function, anyway.

If a specific operation isn’t supported by the target system, try to
emulate it using ones that are, or maybe use an additional compile-time
definition for the processor architecture (e.g., SDL_ARCH_x86 or
SDL_ARCH_IA64; maybe something like this already exists, as I know
little about SDL internals) and write them in assembly for known
architectures and compilers.

As for casting between signed and unsigned, what’s the difference
really? The value will be the same in memory either way, won’t it?
Let the programmer cast it to the other type if that’s how (s)he wants
it. I think signed should be used for the functions, though, but only
because that’s what it seems like most systems use (Windows and Mac OS X
do, anyway; although I notice QNX wants unsigned).
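
To make the signed/unsigned question concrete, here is a minimal sketch of a signed addition done on unsigned values and cast back. The function name is made up. The unsigned arithmetic is well defined in C, but the final cast is implementation-defined when the value does not fit; the sketch assumes a two's complement compiler such as GCC.

```c
#include <stdint.h>

/* Hypothetical wrapper: adds two signed 32-bit values by doing the
 * arithmetic on unsigned values, where wraparound is well defined. */
static int32_t add_via_unsigned(int32_t a, int32_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;  /* modulo 2^32 */
    /* Converting an out-of-range value back to int32_t is
     * implementation-defined in C; two's complement compilers such as
     * GCC simply reinterpret the bits. */
    return (int32_t)sum;
}
```

So the value in memory is indeed the same either way on such platforms; the hazard is only in code that must also run where the conversion behaves differently.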

Hi Bob,

I think there is a race condition in the mutex-based version of privateWaitLock
that is used by the emulation code in the dummy module (and copies thereof).
This is the code currently in SVN (/trunk/SDL/src/atomic/dummy/SDL_atomic.c,
Rev. 4611):

static inline void
privateWaitLock()
{
   if(NULL == lock)
   {
      lock = SDL_CreateMutex();
      if (NULL == lock)
      {
         SDL_SetError("SDL_atomic.c: can't create a mutex");
         return;
      }
   }

   if (-1 == SDL_LockMutex(lock))
   {
      SDL_SetError("SDL_atomic.c: can't lock mutex");
   }
}

The problem is that the mutex is initialized on demand, i. e. on the first call
of the function. Since the purpose of atomic operations is to be used by several
threads concurrently, it is perfectly possible to have several simultaneous
"first calls" of privateWaitLock (through any of the emulated operations).

A scenario showing the race condition in detail, with the worst possible outcome:

 1. No thread has called an emulated atomic operation yet, so the lock
    variable still is NULL.
 2. Threads A and B have each just reached their first call to, for example,
    SDL_AtomicIncrementThenFetch8.
 3. A enters privateWaitLock.
 4. A tests if(NULL == lock), which is true, so it enters the respective block.
    – The following actions of B either happen in parallel to the actions of A
    (on a multi-core system) or after A has been preempted (on any type of
    system) right here. –
 5. B enters privateWaitLock.
 6. B tests if(NULL == lock), which still is true, so it enters the same block.
 7. B executes SDL_CreateMutex.
 8. B assigns the result to the lock pointer variable.
    – If B is preempted here, we can get away with just a resource leak. –
 9. B executes SDL_LockMutex, which succeeds immediately (B is not blocked).
10. B returns from privateWaitLock and now reaches the statement (*ptr)+= 1;
    in SDL_AtomicIncrementThenFetch8.
    – Meanwhile, on a different core, or after preemption… –
11. A executes SDL_CreateMutex. Another mutex object is created.
12. A assigns the result to the lock pointer variable.
    – The mutex object that B created is now garbage. –
13. A executes SDL_LockMutex, which succeeds immediately (A is not blocked),
    because the mutex locked is a different one than that locked by B.
14. A returns from privateWaitLock and now reaches the statement (*ptr)+= 1;.
15. A and B could now mess up (*ptr). Both could get the same result from the
    addition, because as we all know, += is not atomic, which is why we need
    the mutex.
16. A executes privateUnlock, which unlocks the mutex pointed to by lock, that
    A had created and locked before.
17. A returns from SDL_AtomicIncrementThenFetch8.
    – Time for B again… –
18. B also executes privateUnlock, which attempts to unlock the mutex pointed
    to by lock. This is, however, the mutex created by A, which B had never
    locked. The operation will give an error - or maybe worse.

Possible solutions:

  • Bad workaround: Require the user to call each atomic operation he intends to
    use at least once before the first call to SDL_CreateThread or whatever else he
    uses to create additional threads.
  • Initialize the mutex on the first call of SDL_CreateThread. With atomic
    operations (ha ha), you might even be able to manage a reference count for it,
    to destroy it when all additional threads have ended. But that could be overkill.
  • Include the mutex initialization in SDL_Init. This is what I’d prefer - if you
    stick with the mutex -, because it allows using the atomic ops even with non-SDL
    threads. You never know.
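
A minimal sketch of the "initialize up front" idea, with the race removed because the lock exists before any thread can call the emulated operation. To stay self-contained it uses a statically initialized POSIX mutex as a stand-in for creating the SDL mutex inside SDL_Init; the function and variable names are hypothetical and this is not the code in SVN.

```c
#include <pthread.h>
#include <stdint.h>

/* Created before any thread can call the emulated operations, so there
 * is no "first call" left to race on. PTHREAD_MUTEX_INITIALIZER is a
 * stand-in here for creating the SDL mutex inside SDL_Init. */
static pthread_mutex_t emu_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical emulated operation, modelled on the increment-then-fetch
 * example used in the scenario. */
static uint8_t emulated_increment_then_fetch8(volatile uint8_t *ptr)
{
    uint8_t result;

    pthread_mutex_lock(&emu_lock);    /* every caller locks the same mutex */
    *ptr = (uint8_t)(*ptr + 1);       /* wraps from 255 to 0 */
    result = *ptr;
    pthread_mutex_unlock(&emu_lock);
    return result;
}
```

Because every caller sees the same, already-constructed mutex, no second mutex object can ever be created, and the unlock always matches the lock taken by the same thread.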

Now for my long-awaited opinion. ;-)

  • Use the mutex emulation only as a very last resort. I’m quite sure that
    spinlocks (like in the current linux version of SDL_atomic.c) will perform much
    better for the use cases of SDL_atomic. Of course, the mutex may also be
    implemented to spin some time before actually blocking.
    The fact that the mutex operations might fail is also a problem. The SDL_atomic
    functions don’t return error codes, and they aren’t really needed either.
    SDL_SetError alone won’t be “heard”. Abort?

  • Make it possible to test at compile time whether a particular operation is
    supported natively by the platform. This would allow picking an atomic-optimized
    algorithm on platforms where it will really boost performance. Other platforms
    may be better off with an algorithm that is tailored directly to using a mutex
    or whatever else.

  • Pointers: You should definitely have distinct functions for pointers and ints.
    You might still cast internally, if possible, but at least the user won’t have to.

  • Signedness: There are certainly many more use cases for atomic ops than I
    know, but so far I guess that signedness will rarely matter.
    You could have your own set of typedefs for SDL_atomic (which could include the
    volatile modifier for convenience, BTW):
    Type      Minimal range
    atomic7   0…127 (0x7f)
    atomic15  0…32767 (0x7fff)
    atomic31  0…2147483647 (0x7fffffff)
    atomic63  0…9223372036854775807 (0x7fffffffffffffff)
    Whether those types have more (i. e. negative) values is unspecified, and the
    user must not rely on it.
    For the cases where signedness does matter, you could again make it possible to
    check support at compile time.
    Other idea: Make the interface signed, but internally cast or transform the
    value where the implementation is unsigned. A transformation would require that
    all simple reads and writes (including initialization) are done through
    SDL_atomic functions as well.
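
One way the "distinct functions for pointers, cast internally" idea might look, as a sketch only. It stores the pointer as a uintptr_t so the user never casts, and it uses the GCC __sync builtins for the actual operation. The type and function names are hypothetical, and uintptr_t is assumed to be available.

```c
#include <stdint.h>

/* Hypothetical pointer cell: the value is stored as uintptr_t so the
 * casts stay inside the library. Assumes uintptr_t is available. */
typedef struct {
    volatile uintptr_t bits;
} atomic_ptr_cell;

/* Atomically replaces the stored pointer with `desired` if it currently
 * equals `expected`. Returns 1 on success, 0 otherwise. Uses the GCC
 * __sync_bool_compare_and_swap builtin. */
static int atomic_ptr_cas(atomic_ptr_cell *cell, void *expected, void *desired)
{
    return __sync_bool_compare_and_swap(&cell->bits,
                                        (uintptr_t)expected,
                                        (uintptr_t)desired);
}

/* Reads the stored pointer. A compare-and-swap of 0 for 0 never changes
 * the value, so it serves as an atomic load. */
static void *atomic_ptr_get(atomic_ptr_cell *cell)
{
    return (void *)__sync_val_compare_and_swap(&cell->bits,
                                               (uintptr_t)0, (uintptr_t)0);
}
```

On a platform without these builtins the same interface could sit on top of whatever the system offers, which is the point of keeping the cast out of user code.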

Martin


Hi Bob,

I think there is a race condition in the mutex-based version of
privateWaitLock that is used by the emulation code in the dummy module (and
copies thereof). This is the code currently in SVN
(/trunk/SDL/src/atomic/dummy/SDL_atomic.c, Rev. 4611):

static inline void
privateWaitLock()
{
   if(NULL == lock)
   {
      lock = SDL_CreateMutex();
      if (NULL == lock)
      {
         SDL_SetError("SDL_atomic.c: can't create a mutex");
         return;
      }
   }

   if (-1 == SDL_LockMutex(lock))
   {
      SDL_SetError("SDL_atomic.c: can't lock mutex");
   }
}

The problem is that the mutex is initialized on demand, i. e. on the first
call of the function. Since the purpose of atomic operations is to be used
by several threads concurrently, it is perfectly possible to have several
simultaneous “first calls” of privateWaitLock (through any of the emulated
operations).

You are absolutely correct. Thank you for pointing this out!


Now for my long-awaited opinion. ;-)

  • Use the mutex emulation only as a very last resort.

I agree completely. I am currently working on getting rid of emulation
completely. Not sure that I can do it, but I am trying.

I’m quite sure that
spinlocks (like in the current linux version of SDL_atomic.c) will perform
much better for the use cases of SDL_atomic.

Here’s the problem: spin locks are not currently part of SDL. IIRC
there is a spin lock in pthreads, so it might be possible to include
them as a regular part of SDL. But, I’m not convinced I should add
them as part of SDL_atomic.

Of course, the mutex may also
be implemented to spin some time before actually blocking.
The fact that the mutex operations might fail is also a problem. The
SDL_atomic functions don’t return error codes, and they aren’t really needed
either. SDL_SetError alone won’t be “heard”. Abort?

Tough call, I’d like to hear from more people on that one.

  • Make it possible to test at compile time whether a particular operation
    is supported natively by the platform. This would allow picking an
    atomic-optimized algorithm on platforms where it will really boost
    performance. Other platforms may be better off with an algorithm that is
    tailored directly to using a mutex or whatever else.

Yes, I agree. Not simple to accomplish. It feeds into another problem.
On some platforms the atomic ops are provided as compiler intrinsics
and you would like to have those translate into #defines so that you
get around the need for a function call to invoke an intrinsic. The
trouble is that you might find yourself having to do something
horrible like doing the equivalent of including windows.h at the top
level of people’s code to do what you want. I would love to see more
input on that problem.

On Sun, Aug 23, 2009 at 7:30 PM, Martin <name.changed.by.editors at online.de> wrote:

  • Pointers: You should definitely have distinct functions for pointers and
    ints. You might still cast internally, if possible, but at least the user
    won’t have to.

I just went through four major use cases (reference counting, spin
locks, fixed-length queues, and the readers/writers problem) and did
not find a single instance where pointer arithmetic was worth having.
I was sure that I would find a need for atomic ops on pointers for
implementing queues, but no, I did not. The trouble is that you have
to wrap the indexes around at the end of the queue. So, unless someone
can come up with a use case that benefits from them, I’m not going to
put atomic ops on pointers in the library.

  • Signedness: There are certainly many more use cases for atomic ops than I
    know, but so far I guess that signedness will rarely matter.
    You could have your own set of typedefs for SDL_atomic (which could include
    the volatile modifier for convenience, BTW):
    Type      Minimal range
    atomic7   0…127 (0x7f)
    atomic15  0…32767 (0x7fff)
    atomic31  0…2147483647 (0x7fffffff)
    atomic63  0…9223372036854775807 (0x7fffffffffffffff)
    Whether those types have more (i. e. negative) values is unspecified, and
    the user must not rely on it.
    For the cases where signedness does matter, you could again make it
    possible to check support at compile time.
    Other idea: Make the interface signed, but internally cast or transform the
    value where the implementation is unsigned. A transformation would require
    that all simple reads and writes (including initialization) are done
    through SDL_atomic functions as well.

After looking in detail at the atomic operations provided by GCC,
Windows, Mac OS X, and QNX, I am dropping support for anything but 32
and 64 bit unsigned values. The majority of platforms only support 32
and 64 bit unsigned values. If you support 8 and 16 bit operations you
wind up with half the library being emulated on the most common
platforms.

Hey, thanks for the input! If you, or anyone else, has anything to
add, please do.


Bob Pendleton wrote:

Martin wrote:

  • Use the mutex emulation only as a very last resort.

I agree completely. I am currently working on getting rid of emulation
completely. Not sure that I can do it, but I am trying.

I’m quite sure that
spinlocks (like in the current linux version of SDL_atomic.c) will perform
much better for the use cases of SDL_atomic.

Here’s the problem, spin locks are not currently part of SDL. IIRC
there is a spin lock in pthreads, so it might be possible to include
them as a regular part of SDL. But, I’m not convinced I should add
them as part of SDL_atomic.

Oh, I see that my above sentence was ambiguous… No, I didn’t want to say that
SDL_atomic could be replaced with a spin lock API. I was instead referring to
this code in trunk/SDL/src/atomic/linux/SDL_atomic.c, Rev. 4611:
#define privateWaitLock() \
   while (nativeTestThenSet32(&lock)) \
   { \
   };
This is a spin lock, AFAIK. I was thinking that something similar should be
possible on most platforms, since I expect each to support at least a small
subset of SDL_atomic natively.
The only drawback I have thought of since then is the risk of starving a thread,
if another one repeatedly calls SDL_atomic functions within very short intervals.
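
For readers without the SVN tree at hand, a spin lock of this kind can be sketched as below. This is not the SDL code: it uses the GCC __sync_lock_test_and_set and __sync_lock_release builtins as a stand-in for nativeTestThenSet32, and the type and function names are made up.

```c
/* Hypothetical spin lock built on the GCC __sync builtins, standing in
 * for the nativeTestThenSet32 call in the macro quoted above. */
typedef volatile int spin_lock_t;   /* 0 = free, 1 = held */

/* Tries once to take the lock. Returns 1 if acquired, 0 if it was held. */
static int spin_try_lock(spin_lock_t *lock)
{
    /* Atomically writes 1 and returns the previous value. */
    return __sync_lock_test_and_set(lock, 1) == 0;
}

/* Spins until the lock is acquired. */
static void spin_lock(spin_lock_t *lock)
{
    while (!spin_try_lock(lock)) {
        /* busy-wait; a real implementation might yield or pause here */
    }
}

/* Releases the lock by atomically writing 0. */
static void spin_unlock(spin_lock_t *lock)
{
    __sync_lock_release(lock);
}
```

Nothing in this loop enforces fairness, which is exactly the starvation risk mentioned: a thread that re-acquires the lock quickly can keep another one spinning indefinitely.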

The fact that the mutex operations might fail is also a problem. The
SDL_atomic functions don’t return error codes, and they aren’t really needed
either. SDL_SetError alone won’t be “heard”. Abort?

Tough call, I’d like to hear from more people on that one.

I guess errors aren’t likely, and even SDL itself uses mutexes without
checking the return codes (see SDL_GetErrorBuf in SDL_thread.c). But there must
be a reason to have them. I found the following reasons why SDL_mutexP/V could
fail:

  • Passing a NULL pointer for the mutex.
  • Attempting to release a mutex in a thread that didn’t acquire it.
  • Too many waiting threads (if the system has some limit).
  • Corrupt/invalid data in the mutex itself.
  • Anything else?

I don’t expect any of these to occur in practice for well-tested code. So using
abort() probably won’t hurt much.

  • Make it possible to test at compile time whether a particular operation
    is supported natively by the platform. This would allow picking an
    atomic-optimized algorithm on platforms where it will really boost
    performance. Other platforms may be better off with an algorithm that is
    tailored directly to using a mutex or whatever else.

Yes, I agree. Not simple to accomplish. It feeds into another problem.
On some platforms the atomic ops are provided as compiler intrinsics
and you would like to have those translate into #defines so that you
get around the need for a function call to invoke an intrinsic. The
trouble is that you might find yourself having to do something
horrible like doing the equivalent of including windows.h at the top
level of people’s code to do what you want. I would love to see more
input on that problem.

I would only use a “public” #define where the operation really is built into the
compiler, so it works without any header. If there is a header involved, you’re
probably already just calling a function, so ultra-high performance is out of
reach anyway.
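
A sketch of what such a "public" #define plus a compile-time capability test might look like. All names are hypothetical; the only real API used is the GCC __sync_fetch_and_add builtin, which needs no header. On other compilers the sketch only declares a fallback function that would have to be provided elsewhere.

```c
#include <stdint.h>

/* All names below are hypothetical. The macro is defined only when the
 * operation maps straight onto a compiler builtin. */
#if defined(__GNUC__)
#define EX_HAVE_NATIVE_FETCH_ADD32 1
/* Expands directly to the GCC builtin: no header, no function call. */
#define ex_fetch_add32(ptr, value) __sync_fetch_and_add((ptr), (value))
#else
/* Fallback declaration; an emulated version would be defined elsewhere. */
uint32_t ex_fetch_add32(volatile uint32_t *ptr, uint32_t value);
#endif

/* Callers can pick an algorithm at compile time. */
static const char *ex_counter_strategy(void)
{
#ifdef EX_HAVE_NATIVE_FETCH_ADD32
    return "native";
#else
    return "mutex";
#endif
}
```

User code can then write #ifdef EX_HAVE_NATIVE_FETCH_ADD32 around an atomic-optimized algorithm and fall back to one designed around a mutex otherwise.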

  • Signedness: There are certainly many more use cases for atomic ops than I
    know, but so far I guess that signedness will rarely matter.
    You could have your own set of typedefs for SDL_atomic (which could include
    the volatile modifier for convenience, BTW):
    Type Minimal range
    atomic7 0…127 (0x7f)
    […]

After looking in detail at the atomic operations provided by GCC,
Windows, Mac OS X, and QNX, I am dropping support for anything but 32
and 64 bit unsigned values. The majority of platforms only support 32
and 64 bit unsigned values. If you support 8 and 16 bit operations you
wind up with half the library being emulated on the most common
platforms.

Yes, I think a smaller interface with good support is the best solution.
Otherwise, performance results would be very unreliable across platforms.

Martin