Atomic API proposal

A true story which may or may not be significant, but at least is worth
thinking about (maybe):

Once, decades ago, when I was a young software engineer, I wrote some code
that, unbeknownst to me at the time, contained a subtle race
condition that could result in Bad Things happening, but it was “extremely
unlikely” that they would. It happened on average only about once a day.
Fine, if it were for a game that someone might play for an hour or less.
But this application had to run 24/7.

It took some statistical analysis to figure out what was going on, and the
problem was fixed within an hour of it being discovered in testing. I don’t
even want to think about what might have happened had that code been shipped
to customers.

The point being that something that is “extremely unlikely” to happen can bite
you in the butt.

Jeff

On Saturday 15 January 2011 10:40, Alistair Lynn wrote:

It isn’t blocking. At worst, it’s a busy wait - however, it’s only a
busy wait if something continually overwrites the atomic between the
read and the CAS. This is extremely unlikely.
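
For reference, the read-then-CAS loop being described looks roughly like
this (a sketch against the SDL_AtomicCAS/SDL_AtomicGet signatures proposed
in this thread, not actual SDL source):

/* Atomic add built from a read plus a compare-and-swap. The loop only
   repeats if another thread changes the value between our read and
   our CAS; that is the "busy wait" case described above. */
int AtomicAddViaCAS(SDL_atomic_t *a, int delta)
{
    int old;
    do {
        old = SDL_AtomicGet(a);                    /* read current value */
    } while (!SDL_AtomicCAS(a, old, old + delta)); /* retry on conflict  */
    return old;                                    /* previous value     */
}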

I just committed inline definitions for the SDL atomic operations for
Windows, Mac OS X, and gcc compiler intrinsics.

I ran into something weird while testing this with test/testatomic.c
… the SDL version of SDL_AtomicAdd() built with a spinlock using
__sync_lock_test_and_set() was faster than the native and intrinsic
versions. How can that be?

Here are the timings:
Single thread, no atomic operations: 0.000000 sec
Two threads, OSAtomicCompareAndSwap32Barrier() to get old value: 1.001000 sec
Two threads, __sync_fetch_and_add(): 0.518000 sec
Two threads, SDL spinlock implementation: 0.381000 sec <-- ???
Just for grins I tried it with a full mutex: 31.694000 sec
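
(For reference, the SDL spinlock fallback being timed is built on roughly
this pattern; a sketch, not the actual SDL source:)

/* Test-and-set spinlock using the gcc intrinsic. The lock word is 0
   when free; __sync_lock_test_and_set() atomically stores 1 and
   returns the previous value, so a return of 0 means we own the lock. */
static volatile int lock = 0;

static int spinlock_add(int *value, int delta)
{
    int old;
    while (__sync_lock_test_and_set(&lock, 1)) {
        /* spin until the holder releases the lock */
    }
    old = *value;
    *value += delta;
    __sync_lock_release(&lock);  /* stores 0 with release semantics */
    return old;
}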

Can anybody review the code and see what I’m doing wrong?

-Sam Lantinga, Founder and President, Galaxy Gameworks LLC

I can’t find the source at the moment, but I remember an interesting
article which had as its central thesis “when you’re running a server
that is up 24/7, 365 days a year, statistically irrelevant situations
happen on average once a day”.

On 16 January 2011 02:39, Jeff Post <j_post at pacbell.net> wrote:

The point being that something that is “extremely unlikely” to happen can bite
you in the butt.

I read something similar about the possibility that cosmic rays could
flip a bit in memory and cause problems. I’m not sure if it was that
cosmic ray article or another one on the error probability of most RAM
units under normal usage, but with a probability as low as something
like 10^-15 whichever-unit/whichever-unit, something that might happen
once in a lifetime for most users can happen every hour, or once a day
as you said, in a use case such as Google’s, in which many, many
computers run 24/7. The same goes for boring hardware problems such as
hard drive failures…

In those cases, the question is not “will the problem occur”, but “what
do we do about it so that we keep running when it does (inevitably) occur”.

Anyway, my point is that those “extremely unlikely” situations SHOULD
NOT be ignored. Instead, it’s a GREAT thing that they were identified,
and a test case reproducing EXACTLY that unlikely situation should be
implemented so that whichever implementation is chosen will be
robust to this potentially problematic situation. That way, when one of
the billions of billions of SDL users inevitably hits that unlikely
situation, he will be happy to be using such a robust library :).

Just my humble opinion,

Mart

On 1/16/2011 6:35 PM, Brian Barrett wrote:

I can’t find the source at the moment, but I remember an interesting
article which had as its central thesis “when you’re running a server
that is up 24/7, 365 days a year, statistically irrelevant situations
happen on average once a day”.

On 16 January 2011 02:39, Jeff Post <j_post at pacbell.net> wrote:

The point being that something that is “extremely unlikely” to happen can bite
you in the butt.



Humble perhaps, but I’d say a darned good opinion ;)

Jeff

On Sunday 16 January 2011 21:18, Martin Bisson wrote:

Anyway, my point is that those “extremely unlikely” situations SHOULD
NOT be ignored. Instead, it’s a GREAT thing that they were identified,
and a test case reproducing EXACTLY that unlikely situation should be
implemented so that whichever implementation is chosen will be
robust to this potentially problematic situation.

Just my humble opinion,

An event that happens once a day on a single server doesn’t qualify as
"statistically irrelevant" at all, IMO. An event that has a one in a
quintillion chance of happening within the lifetime of this universe, on
the other hand, probably really isn’t anything to worry about. Some
events really are so unlikely as to be irrelevant, but you need to do
the actual math to determine if your particular event is one of them.

On 1/16/2011 16:35, Brian Barrett wrote:

I can’t find the source at the moment, but I remember an interesting
article which had as its central thesis “when you’re running a server
that is up 24/7, 365 days a year, statistically irrelevant situations
happen on average once a day”.


Rainer Deyke - rainerd at eldwood.com

You misunderstand what I meant (or I miscommunicated). Humans look at
probability and say “this happens less than once in a <big number>”
and perceive this to mean “it never happens”. The point here is that
a well-loaded, always-running process will do so much work
that these things that almost never happen can actually occur
regularly.

It is the classic lottery case: you are statistically unlikely to ever
win, even if you play at every opportunity. And yet
there are lottery winners every couple of weeks. People forget this
counterintuitive situation and write off a bug that will bite
them on the basis that it is “extremely unlikely”; when you’re running
your code often enough, “practically impossible” can become a yearly,
quarterly, monthly, weekly or daily problem.

Maybe “statistically irrelevant” was too strong a wording, I would
concede that point. As I said, I wish I had the original quote; making up
your own isn’t always as catchy and correct =]

I’m fighting against a bug like that right now, I can’t for the life
of me figure out how it is happening, and it is happening so
infrequently that I’m unlikely to get a lovely reproducible example.
The problem in this case is that perhaps this code isn’t running
often enough to make it easier to catch.

Sigh

Hopefully that makes a little more sense.

– Brian

On 17 January 2011 12:07, Rainer Deyke wrote:

An event that happens once a day on a single server doesn’t qualify as
"statistically irrelevant" at all, IMO. An event that has a one in a
quintillion chance of happening within the lifetime of this universe, on
the other hand, probably really isn’t anything to worry about. Some
events really are so unlikely as to be irrelevant, but you need to do
the actual math to determine if your particular event is one of them.

It always deserves looking at.

And if I could ever see an SDL-based program being run for a week straight or as part of a high-security system, I’d agree with you wholeheartedly.
In the case of any system code, hardware drivers, boost, STL, and other similar libraries that WILL be used by such programs, I do agree with you wholeheartedly.
And in any case, a bug is certainly worth investigating. But just like there’s premature optimization (wasting development time on optimizing code that isn’t really a problem with its current performance), there’s premature debugging (spending time solving a seldom-occurring and non-fatal bug that isn’t really a problem).

------------------------
EM3 Nathaniel Fries, U.S. Navy

http://natefries.net/

With the Visual Studio C++ compiler you can use #include <intrin.h> to get
the InterlockedXYZ API. No need to include <windows.h>. And this
<intrin.h> header doesn’t include <windows.h>!
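
A minimal sketch (untested, for illustration only; the intrinsic takes
(destination, exchange, comparand) and returns the previous value):

#include <intrin.h>
#pragma intrinsic(_InterlockedCompareExchange)

/* Compare-and-swap without <windows.h>: returns nonzero if the swap
   happened, i.e. if *a still held oldval when we got there. */
static int cas(volatile long *a, long oldval, long newval)
{
    return _InterlockedCompareExchange(a, newval, oldval) == oldval;
}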

http://msdn.microsoft.com/en-us/library/51s265a6.aspx

--
Mārtiņš Možeiko

Hi,

Sam Lantinga wrote:

/* Increment an atomic variable used as a reference count */
void SDL_AtomicIncRef(SDL_atomic_t *a);

/* Decrement an atomic variable used as a reference count,
returning SDL_TRUE if the variable has reached zero after decrementing
*/
SDL_bool SDL_AtomicDecRef(SDL_atomic_t *a);

/* Set an atomic variable to a new value if it is currently an old value,
returning SDL_TRUE if the value was set.

If you don’t know what this function is for, you shouldn’t use it!
*/
SDL_bool SDL_AtomicCAS(SDL_atomic_t *a, int oldval, int newval);
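
(As an illustration of the reference-counting use these functions suggest,
here is a hedged sketch; MyObject and its functions are hypothetical, not
part of the proposal:)

#include <stdlib.h>

/* A hypothetical reference-counted object built on the proposed API. */
typedef struct {
    SDL_atomic_t refcount;
    /* ... payload ... */
} MyObject;

static void MyObject_AddRef(MyObject *obj)
{
    SDL_AtomicIncRef(&obj->refcount);
}

static void MyObject_Release(MyObject *obj)
{
    /* SDL_AtomicDecRef() returns SDL_TRUE only when the count reaches
       zero, so exactly one thread ends up freeing the object. */
    if (SDL_AtomicDecRef(&obj->refcount)) {
        free(obj);
    }
}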

Apologies for the somewhat off-topic nature of my post, but I
was wondering by which API or mechanism these functions would
be implemented on Linux?

My Google-fu must be weak, as I’ve been searching for atomic
compare & swap on Linux, and haven’t found much, yet.

Thanks,

Bill

It uses the gcc intrinsics. Look for HAVE_GCC_ATOMICS in
http://hg.libsdl.org/SDL/file/tip/include/SDL_atomic.h
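
(Roughly, the HAVE_GCC_ATOMICS path maps the API onto the gcc __sync
builtins along these lines; a from-memory sketch, see SDL_atomic.h for
the real definitions:)

SDL_bool SDL_AtomicCAS(SDL_atomic_t *a, int oldval, int newval)
{
    /* __sync_bool_compare_and_swap() returns true if the swap happened. */
    return __sync_bool_compare_and_swap(&a->value, oldval, newval)
               ? SDL_TRUE : SDL_FALSE;
}

int SDL_AtomicAdd(SDL_atomic_t *a, int value)
{
    /* Full-barrier fetch-and-add; returns the previous value. */
    return __sync_fetch_and_add(&a->value, value);
}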

See ya!

On Mon, Jan 17, 2011 at 12:03 PM, Bill Kelly wrote:

Hi,

Sam Lantinga wrote:

[proposed atomic API snipped; quoted in full above]

Apologies for the somewhat off-topic nature of my post, but I
was wondering by which API or mechanism these functions would
be implemented on Linux?

My Google-fu must be weak, as I’ve been searching for atomic
compare & swap on Linux, and haven’t found much, yet.

Thanks,

Bill




-Sam Lantinga, Founder and President, Galaxy Gameworks LLC

Sam Lantinga wrote:

It uses the gcc intrinsics. Look for HAVE_GCC_ATOMICS in
http://hg.libsdl.org/SDL/file/tip/include/SDL_atomic.h

Thanks!

Bill

You’re welcome! :)

On Mon, Jan 17, 2011 at 12:54 PM, Bill Kelly wrote:

Sam Lantinga wrote:

It uses the gcc intrinsics. Look for HAVE_GCC_ATOMICS in
http://hg.libsdl.org/SDL/file/tip/include/SDL_atomic.h

Thanks!

Bill




-Sam Lantinga, Founder and President, Galaxy Gameworks LLC

I’ve done a lot of research, looked over Bob’s excellent analysis, and

Hey, thanks!

looked at some practical examples of lockless programming (scary stuff,
don’t do it!), and I think I’ve come up with an API that more closely
matches the use cases.

I’d love to get feedback on this proposal…

Well I’ve been kind of watching to see where this would go, but I think
I need to make a few comments.

(All these functions are full memory barriers)

/* A struct so people don’t accidentally use normal operations on it */
typedef struct { int value; } SDL_atomic_t;

Since you are going to use this for a reference count value I believe
that the size of the counter should have the same number of bits as
the size of the address. That means that SDL_atomic_t should be 64
bits on 64 bit architectures. Right now it is possible to build a
machine with enough memory that you can have more than 2^32 references
to an item in the heap. Unlikely… sure. But, the choice is to have
as many bits in the reference count as there are in an address or to
test for reference count overflow and return an error flag or throw an
exception when the reference count overflows.

Because int isn’t necessarily 64 bits long on 64-bit machines, the
return types of your functions can’t be int. Also, since SDL_atomic_t
is used for reference counts it would make me happier if it were
unsigned. It doesn’t have to be unsigned because I suspect the
smallest item that can be allocated on the heap is more than 2 bytes
long, so it is OK to lose half the reference count range to the
negative numbers. It might really have to be unsigned on
word-addressable machines with 2^64 words of memory. :)

Ok, this may seem absurd but it looks like we will hit the point where
2^64 transistors will fit on a “chip” (more likely a cube by then) in
the next 40 years or so (could be 70 years, could be 20). Before you
start laughing about that let me mention the time my boss scolded me
for “wasting” 2 bytes in the year field of some code I was
writing… that was 33 years ago and… ya’know, that seemed reasonable
on a machine with 24KB of RAM.

Hmmm… Ya’know what? I think there should be an
SDL_AtomicReferenceCount_t, an SDL_AtomicInt_t, and an SDL_AtomicPtr_t
with operators that are reasonable for all of them. Doing that gets
rid of all the size differences. A ptr has sizeof(void *), an int has
sizeof(int), and a reference count has a size that makes sense for the
architecture.
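
As a hedged sketch, the three types being suggested might look something
like this (hypothetical declarations, just to make the idea concrete):

#include <stddef.h>    /* size_t */

/* One atomic type per use case, each sized naturally for the machine. */
typedef struct { size_t value; } SDL_AtomicReferenceCount_t; /* address-sized, unsigned */
typedef struct { int    value; } SDL_AtomicInt_t;            /* sizeof(int)    */
typedef struct { void  *value; } SDL_AtomicPtr_t;            /* sizeof(void *) */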

Nice to see this problem being revisited.

Bob Pendleton

On Fri, Jan 14, 2011 at 5:10 PM, Sam Lantinga wrote:

/* Set an atomic variable to a value, returning the previous value */
int SDL_AtomicSet(SDL_atomic_t *a, int value);

/* Get the value of an atomic variable */
int SDL_AtomicGet(SDL_atomic_t *a);

/* Add to an atomic variable, returning the previous value */
int SDL_AtomicAdd(SDL_atomic_t *a, int value);

/* Increment an atomic variable used as a reference count */
void SDL_AtomicIncRef(SDL_atomic_t *a);

/* Decrement an atomic variable used as a reference count,
   returning SDL_TRUE if the variable has reached zero after decrementing
 */
SDL_bool SDL_AtomicDecRef(SDL_atomic_t *a);

/* Set an atomic variable to a new value if it is currently an old value,
   returning SDL_TRUE if the value was set.

   If you don’t know what this function is for, you shouldn’t use it!
*/
SDL_bool SDL_AtomicCAS(SDL_atomic_t *a, int oldval, int newval);

/* Set a pointer to a new value if it is currently an old value,
   returning SDL_TRUE if the value was set.

   If you don’t know what this function is for, you shouldn’t use it!
*/
SDL_bool SDL_AtomicCASPtr(void **a, void *oldval, void *newval);
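
(To show what SDL_AtomicCASPtr() is for, a hedged sketch of the classic
lock-free stack push; Node and Push are made up for illustration, and this
is exactly the kind of "scary" lockless code mentioned above:)

typedef struct Node {
    struct Node *next;
    /* ... payload ... */
} Node;

/* Push a node onto a lock-free stack: re-read the head and retry the
   CAS until no other thread moves it between our read and our swap. */
static void Push(Node **head, Node *node)
{
    void *old;
    do {
        old = *head;
        node->next = (Node *)old;
    } while (!SDL_AtomicCASPtr((void **)head, old, node));
}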


-Sam Lantinga, Founder and President, Galaxy Gameworks LLC



No, I think you misunderstand my point. I’m saying that it is in fact
safe to treat “this happens less than once in a <big number>” as
equivalent to “it never happens” if the big number is big enough. The
problem isn’t with the line of reasoning, the problem is that people
think that a number as small as a billion qualifies as a “big number”.

On 1/17/2011 05:39, Brian Barrett wrote:

You misunderstand what I meant (or I miscommunicated). Humans look at
probability and say “this happens less than once in a <big number>”
and perceive this to mean “it never happens”. The point here is that
a well-loaded, always-running process will do so much work
that these things that almost never happen can actually occur
regularly.


Rainer Deyke - rainerd at eldwood.com

Depends on the situation. If it’s dollars, a billion is a big number. If it’s
the number of cells in your body that died last week, it’s a small number.

If you have an interrupt that occurs every 10 ms, and the chance of a race
condition causing the interrupt to fail is 1 in a billion, then it will fail
on average every 2,778 hours (10^9 interrupts × 10 ms = 10^7 seconds), or
about 116 days. On Average. It could fail in the first three minutes.

So let’s say a “big” number is a gazillion, and the thing fails on average
only every 47 years, and you don’t expect your software to still be in use 47
years from now. But it could fail next week, bringing your finely crafted
financial system down, resulting in the loss of a “small” amount of your
clients’ financial data, say about a billion dollars worth. (Visions of angry
investors marching up the hill carrying torches and sharply pointed stakes.)

How about we just write the stuff so it can’t fail, barring hardware
failures over which we lowly software guys have no (or minimal) control?

Jeff

On Monday 17 January 2011 20:01, Rainer Deyke wrote:

No, I think you misunderstand my point. I’m saying that it is in fact
safe to treat “this happens less than once in a <big number>” as
equivalent to “it never happens” if the big number is big enough. The
problem isn’t with the line of reasoning, the problem is that people
think that a number as small as a billion qualifies as a “big number”.

So let’s say a “big” number is a gazillion, and the thing fails on average
only every 47 years, and you don’t expect your software to still be in use 47
years from now.

That’s still not big enough for my taste. If the program runs for 5
years, you have a one in ten chance that the software fails. Those are
pretty bad odds, actually.

But it could fail next week, bringing your finely crafted
financial system down, resulting in the loss of a “small” amount of your
clients’ financial data, say about a billion dollars worth. (Visions of angry
investors marching up the hill carrying torches and sharply pointed stakes.)

Yeah, that could happen, even if the chance of failure was really small.
But you can never totally eliminate the possibility of failure.
There’s a small but non-zero chance that the computer will be struck by
a meteor.

How about we just write the stuff so it can’t fail, barring hardware
failures over which we lowly software guys have no (or minimal) control?

That’s generally the preferred approach, but it’s not always practicable
or even possible. For example, GUIDs can only be guaranteed to be truly
unique when issued from a central server, but the probability of the
central server failing is much greater than the probability of an
accidental collision. Another example: every password and encryption
system in the world can be broken almost instantly if the attacker just
happens to guess the correct password or encryption key.

On 1/17/2011 21:53, Jeff Post wrote:


Rainer Deyke - rainerd at eldwood.com

But you can never totally eliminate the possibility of failure.

Donald Knuth would disagree. Maybe he’s right, maybe not, but there’s no doubt
that it’s a goal to aim for.

There’s a small but non-zero chance that the computer will be struck by
a meteor.

That’s a hardware failure that we can’t control ;)

Another example: every password and encryption
system in the world can be broken almost instantly if the attacker just
happens to guess the correct password or encryption key.

OT - The only lock that can’t be picked is the one that can’t be opened at
all.

Jeff

On Monday 17 January 2011 21:46, Rainer Deyke wrote:

But you can never totally eliminate the possibility of failure.

Donald Knuth would disagree. Maybe he’s right, maybe not, but there’s no doubt
that it’s a goal to aim for.

It is possible to write software that produces correct results for every
input when run on an ideal computer. That’s not the same thing as
eliminating the possibility of failure when running on real computers.

There’s a small but non-zero chance that the computer will be struck by
a meteor.

That’s a hardware failure that we can’t control ;)

It is possible to reduce the danger of hardware failure by using
redundant hardware, ideally in different data centers spread across the
world. It’s expensive and it can’t eliminate the possibility of failure
altogether, so you have to decide how much safety you really need and
how much money you are willing to spend on it.

On 1/17/2011 23:21, Jeff Post wrote:

On Monday 17 January 2011 21:46, Rainer Deyke wrote:


Rainer Deyke - rainerd at eldwood.com

This is true. I think this is what I was trying to say, I was just
emphasising it from the other side. People are notoriously bad at
dealing with big numbers (or small ones), myself included, I suspect.

On 18 January 2011 04:01, Rainer Deyke wrote:

No, I think you misunderstand my point. I’m saying that it is in fact
safe to treat “this happens less than once in a <big number>” as
equivalent to “it never happens” if the big number is big enough. The
problem isn’t with the line of reasoning, the problem is that people
think that a number as small as a billion qualifies as a “big number”.