Atomic Ops Semantics

I’ve been looking over the atomic operations library in SDL and I was
trying to grasp the exact semantics of SDL_AtomicSet() and SDL_AtomicGet().

On almost every system other than x86, a word-sized (32-bit) load or
store to an unaligned address triggers a SIGBUS, and generally means a
bug in the program or poor implementation of some byte fiddling. If the
load/store is aligned, then it is atomic, that is, if 2+ CPUs try to
read/write to the same address, there will never be a half old / half
new mix of bytes read from the same address. Sure, if two CPUs write
0xAAAAAAAA & 0xBBBBBBBB to the same location and then immediately read,
you don’t know if you get all A’s or all B’s, but you never will get
0xAABBBBAA or some other permutation.

Given that, it looks like SDL_AtomicSet() and SDL_AtomicGet() are misnomers.

SDL_AtomicSet() actually performs an atomic exchange, which guarantees
that the old value will be read, replaced with the new one, and the old
one returned, as one single interruptable action. This is much
different than atomically storing a value, which is actually as simple
as “value = 5;”
(When that change is visible to the other CPUs, eh, depends CPU
architecture or, more deterministically, the programmer to use a memory
barrier). An atomic exchange though differs from an atomic set in that
you assume the programmer cares about the return value (which can be
slower on other architectures). If they didn’t, there wouldn’t be a need
to return the original value, you could just clobber it, which is fast
enough. Maybe AtomicSet() should be a store + memory barrier rather than
an atomic exchange, and just have a function like SDL_AtomicExchange()
to actually perform an exchange.

SDL_AtomicGet()… not really sure what its purpose is. Reading from an
aligned memory location is an atomic operation, but that doesn’t look
like what it is trying to do. Most notably, it uses a function called
SDL_CompilerBarrier() which (on MSVC) tells the compiler not to move
reads/writes across that barrier, but out of order CPUs can reorder the
reads in any way they see fit, so any kind of subtly based on that is
likely going to fail on anything other than x86 (x86 has reads ordered
wrt to other reads, writes wrt writes)

int flag = AtomicGet(&sharedFlag);
if(flag)
return a+b;
else
return 0;

Produces this psuedo assembly

load sharedFlag, r3 //Compiler must generate load sharedFlag first
load a, r1
load b, r2
add r1, r2
compare r3, 0
jump if not equal "done"
set r2, 0
done:
return

If we depend on ‘sharedFlag’ being non-zero to mean “a and b can be read
from safely”, then it is possible that the hardware will reorder this so
that ‘a’ and ‘b’ are read before ‘sharedFlag’.

If another thread performed:
a = 10; b= 20;
AtomicSet(&sharedFlag, 1)

Then we might have already read stale values for ‘a’ and ‘b’ by the time
the changes to ‘sharedFlag’ are visible to other processors, in effect
computing the wrong value for (a+b). What is needed is a load fence that
guarantees that ‘sharedFlag’ will be read absolutely first before any
more loads will be processed.

My whole point is that it is misleading to say it atomically fetches
something from memory when in fact, all aligned reads are atomic, and it
doesn’t really guarantee anything about how the hardware actually reads it.

Any comments and/or flames?

Patrick

[…]
SDL_AtomicSet() actually performs an atomic exchange, which guarantees
that the old value will be read, replaced with the new one, and the old
one returned,
[…]
An atomic exchange though differs from an atomic set in that
you assume the programmer cares about the return value (which can be
slower on other architectures).

Well, atomic exchange is very handy to implement spinlocks to protect
resources when the resource is locked only for a very short time.

SDL_CompilerBarrier() which (on MSVC) tells the compiler not to move
reads/writes across that barrier, but out of order CPUs can reorder the
reads in any way they see fit, so any kind of subtly based on that is
likely going to fail on anything other than x86 (x86 has reads ordered
wrt to other reads, writes wrt writes)

I think most out-of-order CPUs also provide a memory barrier instruction
of some sort, with which you can guarantee that all pending accesses are
finished before the instruction following the barrier. I do know that the
ARM, PowerPC and MIPS do have it, I seem to remember that the Motorola
88xxx had it, I’d assume that all of them implement it. In which case the
SDL_CompilerBarrier() might as well issue (as an inline asm insn) the
actual barrier instruction on a non-x86 target.

All of the above, however, does not invalidate your complaint about the
SDL functions being called AtomicSet() instead of AtomicSwap() and
AtomicGet() instead of, well, something else.

ZoltanOn Thu, 17 Feb 2011, Patrick Baggett wrote:

You’re right that the intended semantics are ordered reads wrt reads and
writes wrt writes and relying on the compiler barrier to make sure that the
data flow isn’t reordered at the instruction level.

Maybe it would make more sense to call them OrderedGet and OrderedSet and
add a separate Exchange operation?

We would also need a sync instruction of some kind on platforms that need
them to ensure the ordering semantics (e.g. PowerPC)

Thanks for the feedback!On Thu, Feb 17, 2011 at 7:53 AM, Patrick Baggett <baggett.patrick at gmail.com>wrote:

I’ve been looking over the atomic operations library in SDL and I was
trying to grasp the exact semantics of SDL_AtomicSet() and SDL_AtomicGet().

On almost every system other than x86, a word-sized (32-bit) load or store
to an unaligned address triggers a SIGBUS, and generally means a bug in the
program or poor implementation of some byte fiddling. If the load/store is
aligned, then it is atomic, that is, if 2+ CPUs try to read/write to the
same address, there will never be a half old / half new mix of bytes read
from the same address. Sure, if two CPUs write 0xAAAAAAAA & 0xBBBBBBBB to
the same location and then immediately read, you don’t know if you get all
A’s or all B’s, but you never will get 0xAABBBBAA or some other permutation.

Given that, it looks like SDL_AtomicSet() and SDL_AtomicGet() are
misnomers.

SDL_AtomicSet() actually performs an atomic exchange, which guarantees that
the old value will be read, replaced with the new one, and the old one
returned, as one single interruptable action. This is much different than
atomically storing a value, which is actually as simple as “value = 5;”
(When that change is visible to the other CPUs, eh, depends CPU
architecture or, more deterministically, the programmer to use a memory
barrier). An atomic exchange though differs from an atomic set in that you
assume the programmer cares about the return value (which can be slower on
other architectures). If they didn’t, there wouldn’t be a need to return the
original value, you could just clobber it, which is fast enough. Maybe
AtomicSet() should be a store + memory barrier rather than an atomic
exchange, and just have a function like SDL_AtomicExchange() to actually
perform an exchange.

SDL_AtomicGet()… not really sure what its purpose is. Reading from an
aligned memory location is an atomic operation, but that doesn’t look like
what it is trying to do. Most notably, it uses a function called
SDL_CompilerBarrier() which (on MSVC) tells the compiler not to move
reads/writes across that barrier, but out of order CPUs can reorder the
reads in any way they see fit, so any kind of subtly based on that is likely
going to fail on anything other than x86 (x86 has reads ordered wrt to other
reads, writes wrt writes)

int flag = AtomicGet(&sharedFlag);
if(flag)
return a+b;
else
return 0;

Produces this psuedo assembly

load sharedFlag, r3 //Compiler must generate load sharedFlag first
load a, r1
load b, r2
add r1, r2
compare r3, 0
jump if not equal "done"
set r2, 0
done:
return

If we depend on ‘sharedFlag’ being non-zero to mean “a and b can be read
from safely”, then it is possible that the hardware will reorder this so
that ‘a’ and ‘b’ are read before ‘sharedFlag’.

If another thread performed:
a = 10; b= 20;
AtomicSet(&sharedFlag, 1)

Then we might have already read stale values for ‘a’ and ‘b’ by the time
the changes to ‘sharedFlag’ are visible to other processors, in effect
computing the wrong value for (a+b). What is needed is a load fence that
guarantees that ‘sharedFlag’ will be read absolutely first before any more
loads will be processed.

My whole point is that it is misleading to say it atomically fetches
something from memory when in fact, all aligned reads are atomic, and it
doesn’t really guarantee anything about how the hardware actually reads it.

Any comments and/or flames?

Patrick


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


-Sam Lantinga, Founder and CEO, Galaxy Gameworks

Or maybe we simply need to expose a read/write/full memory+compiler barrier?On Thu, Feb 17, 2011 at 2:16 PM, Sam Lantinga <@slouken> wrote:

You’re right that the intended semantics are ordered reads wrt reads and
writes wrt writes and relying on the compiler barrier to make sure that the
data flow isn’t reordered at the instruction level.

Maybe it would make more sense to call them OrderedGet and OrderedSet and
add a separate Exchange operation?

We would also need a sync instruction of some kind on platforms that need
them to ensure the ordering semantics (e.g. PowerPC)

Thanks for the feedback!

On Thu, Feb 17, 2011 at 7:53 AM, Patrick Baggett < baggett.patrick at gmail.com> wrote:

I’ve been looking over the atomic operations library in SDL and I was
trying to grasp the exact semantics of SDL_AtomicSet() and SDL_AtomicGet().

On almost every system other than x86, a word-sized (32-bit) load or store
to an unaligned address triggers a SIGBUS, and generally means a bug in the
program or poor implementation of some byte fiddling. If the load/store is
aligned, then it is atomic, that is, if 2+ CPUs try to read/write to the
same address, there will never be a half old / half new mix of bytes read
from the same address. Sure, if two CPUs write 0xAAAAAAAA & 0xBBBBBBBB to
the same location and then immediately read, you don’t know if you get all
A’s or all B’s, but you never will get 0xAABBBBAA or some other permutation.

Given that, it looks like SDL_AtomicSet() and SDL_AtomicGet() are
misnomers.

SDL_AtomicSet() actually performs an atomic exchange, which guarantees
that the old value will be read, replaced with the new one, and the old one
returned, as one single interruptable action. This is much different than
atomically storing a value, which is actually as simple as “value = 5;”
(When that change is visible to the other CPUs, eh, depends CPU
architecture or, more deterministically, the programmer to use a memory
barrier). An atomic exchange though differs from an atomic set in that you
assume the programmer cares about the return value (which can be slower on
other architectures). If they didn’t, there wouldn’t be a need to return the
original value, you could just clobber it, which is fast enough. Maybe
AtomicSet() should be a store + memory barrier rather than an atomic
exchange, and just have a function like SDL_AtomicExchange() to actually
perform an exchange.

SDL_AtomicGet()… not really sure what its purpose is. Reading from an
aligned memory location is an atomic operation, but that doesn’t look like
what it is trying to do. Most notably, it uses a function called
SDL_CompilerBarrier() which (on MSVC) tells the compiler not to move
reads/writes across that barrier, but out of order CPUs can reorder the
reads in any way they see fit, so any kind of subtly based on that is likely
going to fail on anything other than x86 (x86 has reads ordered wrt to other
reads, writes wrt writes)

int flag = AtomicGet(&sharedFlag);
if(flag)
return a+b;
else
return 0;

Produces this psuedo assembly

load sharedFlag, r3 //Compiler must generate load sharedFlag first
load a, r1
load b, r2
add r1, r2
compare r3, 0
jump if not equal "done"
set r2, 0
done:
return

If we depend on ‘sharedFlag’ being non-zero to mean “a and b can be read
from safely”, then it is possible that the hardware will reorder this so
that ‘a’ and ‘b’ are read before ‘sharedFlag’.

If another thread performed:
a = 10; b= 20;
AtomicSet(&sharedFlag, 1)

Then we might have already read stale values for ‘a’ and ‘b’ by the time
the changes to ‘sharedFlag’ are visible to other processors, in effect
computing the wrong value for (a+b). What is needed is a load fence that
guarantees that ‘sharedFlag’ will be read absolutely first before any more
loads will be processed.

My whole point is that it is misleading to say it atomically fetches
something from memory when in fact, all aligned reads are atomic, and it
doesn’t really guarantee anything about how the hardware actually reads it.

Any comments and/or flames?

Patrick


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


-Sam Lantinga, Founder and CEO, Galaxy Gameworks


-Sam Lantinga, Founder and CEO, Galaxy Gameworks

Or maybe we simply need to expose a read/write/full memory+compiler
barrier?

That seems to be the best idea. It looks like atomic operations perform full
barriers implicitly, so that sort of takes care of itself. This is quite low
level API to expose to developers, is this intended to be used as a sort of
performance-critical thing?

PatrickOn Thu, Feb 17, 2011 at 4:21 PM, Sam Lantinga wrote:

Yes, it is.On Thu, Feb 17, 2011 at 6:57 PM, Patrick Baggett <baggett.patrick at gmail.com>wrote:

On Thu, Feb 17, 2011 at 4:21 PM, Sam Lantinga <@slouken> wrote:

Or maybe we simply need to expose a read/write/full memory+compiler
barrier?

That seems to be the best idea. It looks like atomic operations perform
full barriers implicitly, so that sort of takes care of itself. This is
quite low level API to expose to developers, is this intended to be used as
a sort of performance-critical thing?

Patrick


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


-Sam Lantinga, Founder and CEO, Galaxy Gameworks