Discussion:
[cfe-commits] patch: libcxxabi cxa_*_virtual and cxa_guard_* methods
Nick Lewycky
2011-05-20 06:11:35 UTC
Permalink
I've started dipping my toes in the libcxxabi project by implementing a few
simple methods. Patch attached, please review!

This isn't heavily tested, I'm mostly just trying to make sure I have the
style down. Please let me know if we should be using different file names
(hard to follow a pattern where there isn't one), of if the private methods
in cxa_guard actually belong in an anonymous namespace, etc.

Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.uiuc.edu/pipermail/cfe-commits/attachments/20110519/7eb409b0/attachment-0001.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: libcxxabi-virtual-and-guard.patch
Type: text/x-patch
Size: 4722 bytes
Desc: not available
Url : http://lists.cs.uiuc.edu/pipermail/cfe-commits/attachments/20110519/7eb409b0/attachment-0001.bin
John McCall
2011-05-22 00:05:27 UTC
Permalink
I've started dipping my toes in the libcxxabi project by implementing a few simple methods. Patch attached, please review!
This isn't heavily tested, I'm mostly just trying to make sure I have the style down. Please let me know if we should be using different file names (hard to follow a pattern where there isn't one), of if the private methods in cxa_guard actually belong in an anonymous namespace, etc.
Guard variables need to support recursive initialization: the initialization
of one variable must be able to kick off arbitrary code that might perform
its own initialization. In C++11, there's a guarantee that this won't deadlock
even if the first initialization blocks on a different thread.

Probably the most reasonable way to handle this is to have a global lock
which is *not* held during initialization, and a global condition variable
that threads wait on if they're waiting for somebody else to perform an
initialization.

John.
Nick Lewycky
2011-05-22 07:20:11 UTC
Permalink
Post by Nick Lewycky
I've started dipping my toes in the libcxxabi project by implementing a
few simple methods. Patch attached, please review!
Post by Nick Lewycky
This isn't heavily tested, I'm mostly just trying to make sure I have the
style down. Please let me know if we should be using different file names
(hard to follow a pattern where there isn't one), of if the private methods
in cxa_guard actually belong in an anonymous namespace, etc.
Guard variables need to support recursive initialization: the
initialization
of one variable must be able to kick off arbitrary code that might perform
its own initialization.
Okay, I can handle that. I wrote a test for it:

namespace test3 {
int helper1(int i) {
if (i == 0)
return 0;
static int j = 0 helper1(i - 1);
return j + 1;
}
void test() {
static int x = helper1(10);
}
}

and discovered that libstdc++ doesn't support it. They detect the recursive
initialization and throw an exception.

In C++11, there's a guarantee that this won't deadlock
even if the first initialization blocks on a different thread.
Hold up, what does this actually mean? Given:

int test_a(int x) {
if (x == 0)
return 0;
static int A = test_b(x - 1);
return A + 1;
}
int test_b(int x) {
if (x == 0)
return 0;
static int B = test_a(x - 1);
return B + 1;
}

suppose that thread 1 enters test_a(10) and grabs the lock on A at the same
time that thread 2 enters test_b(10) and grabs the lock on B. The deadlock
is avoided how?

Probably the most reasonable way to handle this is to have a global lock
which is *not* held during initialization, and a global condition variable
that threads wait on if they're waiting for somebody else to perform an
initialization.
It's not clear to me what you're proposing, but I strongly want to avoid
anything that blocks the other threads from running their own initializers
in parallel for as long as they're independent. I think I'll understand your
proposal once I understand the constraints you mentioned above.

Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.uiuc.edu/pipermail/cfe-commits/attachments/20110522/a238c2ec/attachment.html
John McCall
2011-05-22 18:33:08 UTC
Permalink
Post by Nick Lewycky
Post by John McCall
Post by Nick Lewycky
This isn't heavily tested, I'm mostly just trying to make sure I
have the style down. Please let me know if we should be using
different file names (hard to follow a pattern where there isn't one),
of if the private methods in cxa_guard actually belong in an
anonymous namespace, etc.
Guard variables need to support recursive initialization: the
initialization of one variable must be able to kick off arbitrary code
that might perform its own initialization.
namespace test3 {
int helper1(int i) {
if (i == 0)
return 0;
static int j = 0 helper1(i - 1);
return j + 1;
}
void test() {
static int x = helper1(10);
}
}
and discovered that libstdc++ doesn't support it. They detect the
recursive initialization and throw an exception.
This isn't what I meant; actual recursive dependencies like this
are undefined behavior even in C++03. I meant that the initialization
of one variable must be able to kick off arbitrary code that might
perform a *different* initialization. Your patch doesn't support this
because your home-grown mutex is non-recursive, and therefore
this code will deadlock:

int foo() { return 5; }
int bar() { static int x = foo(); }
int baz() { static int x = bar(); }

As an enhancement request for later, it would be a substantial
debugging aid to continue to detect this recursive initialization
in the single-threaded case.
Post by Nick Lewycky
Post by John McCall
In C++11, there's a guarantee that this won't deadlock
even if the first initialization blocks on a different thread.
int test_a(int x) {
if (x == 0)
return 0;
static int A = test_b(x - 1);
return A + 1;
}
int test_b(int x) {
if (x == 0)
return 0;
static int B = test_a(x - 1);
return B + 1;
}
suppose that thread 1 enters test_a(10) and grabs the lock on A at
the same time that thread 2 enters test_b(10) and grabs the lock
on B. The deadlock is avoided how?
Sorry, I was unclear. This deadlock is clearly not completely avoidable
except by serializing all initializations, which I think isn't the intent of
the standard. What I meant is that initializations should be able to
block on a separate thread and make progress as long as there's no
cycle in the initializers.

The new wording in C++11 is somewhat helpful:

[stmt.dcl]p3:
If control enters the declaration concurrently while the variable
is being initialized, the concurrent execution shall wait for completion
of the initialization.[footnote: The implementation must not introduce
any deadlock around execution of the initializer.] If control re-enters
the declaration recursively while the variable is being initialized, the
behavior is undefined.

The footnote could be clearer, but I think it's saying that no deadlocks
should be introduced *other than those possible due to the mandated
waiting between threads*. So holding a global recursive lock during
initialization, as libstdcxx does, introduces deadlocks that aren't
inherent in the code.

It might also be useful to consider the original paper; kudos to Howard
for the research here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2660.htm
(Note that the example implementation in this paper is unnecessarily
complicated by the use of epochs; three sentinel values (uninitialized,
currently initializing, completely initialized) would be sufficient, unless
I am missing some unexplained subtlety.)
Post by Nick Lewycky
It's not clear to me what you're proposing, but I strongly want to avoid
anything that blocks the other threads from running their own
initializers in parallel for as long as they're independent.
Yes, I agree.

John.
Howard Hinnant
2011-05-22 19:23:03 UTC
Permalink
I've started dipping my toes in the libcxxabi project by implementing a few simple methods. Patch attached, please review!
This isn't heavily tested, I'm mostly just trying to make sure I have the style down. Please let me know if we should be using different file names (hard to follow a pattern where there isn't one), of if the private methods in cxa_guard actually belong in an anonymous namespace, etc.
Nick
<libcxxabi-virtual-and-guard.patch>
I disagree with John that your implementation doesn't support the recursive calls for *different* static objects. I believe it does. And I like that your design allows for parallel initialization of different static objects.

What I'm nervous about though is the use of the spin lock. An arbitrary amount of time can go by while one thread is initializing and the other thread is spinning, waiting for that initialization to complete.

I translated your code into using pthread_mutex_t and pthread_cond_t, and came up with the following. I tried to keep your notation so as to make the two implementations more easily comparable:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

namespace __cxxabiv1
{

namespace __libcxxabi
{

void abort(const char* s) {printf("%s\n", s); ::abort();}

} // __libcxxabi

static pthread_mutex_t guard_mut = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t guard_cv = PTHREAD_COND_INITIALIZER;

extern "C"
{

int __cxa_guard_acquire(int64_t* guard_object)
{
uint8_t* initialized = (uint8_t*)guard_object;
uint8_t* lock = (uint8_t*)guard_object + 1;
int r = pthread_mutex_lock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_acquire failed to acquire mutex");
if (*initialized == 1)
{
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_acquire failed to release mutex");
return 0;
}
while (*lock == 1)
pthread_cond_wait(&guard_cv, &guard_mut);
int result = *initialized;
if (result == 0)
*lock = 1;
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_acquire failed to release mutex");
return result == 0;
}

void __cxa_guard_release(int64_t* guard_object)
{
uint8_t* initialized = (uint8_t*)guard_object;
uint8_t* lock = (uint8_t*)guard_object + 1;
int r = pthread_mutex_lock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_release failed to acquire mutex");
*initialized = 1;
*lock = 0;
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_release failed to release mutex");
r = pthread_cond_broadcast(&guard_cv);
if (r != 0)
__libcxxabi::abort("__cxa_guard_release failed to broadcast condition variable");
}

void __cxa_guard_abort(int64_t* guard_object)
{
uint8_t* lock = (uint8_t*)guard_object + 1;
int r = pthread_mutex_lock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_abort failed to acquire mutex");
*lock = 0;
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_abort failed to release mutex");
r = pthread_cond_broadcast(&guard_cv);
if (r != 0)
__libcxxabi::abort("__cxa_guard_abort failed to broadcast condition variable");
}

} // extern "C"

} // __cxxabiv1

The structure of our implementations is nearly identical.

One of the things I don't like about my implementation (or yours) is that it does not detect recursive entry of the same static object. I'm thinking of something that makes use of thread_local data get that part, but I'm not there yet. But I thought I'd share what I have so far.

Thanks for helping us with this library!

Howard
Nick Lewycky
2011-05-22 20:45:04 UTC
Permalink
Post by Nick Lewycky
I've started dipping my toes in the libcxxabi project by implementing a
few simple methods. Patch attached, please review!
Post by Nick Lewycky
This isn't heavily tested, I'm mostly just trying to make sure I have the
style down. Please let me know if we should be using different file names
(hard to follow a pattern where there isn't one), of if the private methods
in cxa_guard actually belong in an anonymous namespace, etc.
Post by Nick Lewycky
Nick
<libcxxabi-virtual-and-guard.patch>
I disagree with John that your implementation doesn't support the recursive
calls for *different* static objects. I believe it does.
Thanks. I was just about to write up an email to the same effect. It does
support recursive calls for different static objects because the lock it
uses is the second byte of the guard variable, not a global (or per-thread)
lock.

It's a great test to add to test_guard.cpp though!

And I like that your design allows for parallel initialization of different
static objects.
Thanks! I'm very motivated to preserve this property.

What I'm nervous about though is the use of the spin lock. An arbitrary
amount of time can go by while one thread is initializing and the other
thread is spinning, waiting for that initialization to complete.
Sure, but what are the alternatives? sched_yield()?

I translated your code into using pthread_mutex_t and pthread_cond_t, and
came up with the following. I tried to keep your notation so as to make the
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
namespace __cxxabiv1
{
namespace __libcxxabi
{
void abort(const char* s) {printf("%s\n", s); ::abort();}
} // __libcxxabi
static pthread_mutex_t guard_mut = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t guard_cv = PTHREAD_COND_INITIALIZER;
extern "C"
{
int __cxa_guard_acquire(int64_t* guard_object)
{
uint8_t* initialized = (uint8_t*)guard_object;
uint8_t* lock = (uint8_t*)guard_object + 1;
int r = pthread_mutex_lock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_acquire failed to acquire mutex");
if (*initialized == 1)
{
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_acquire failed to release mutex");
return 0;
}
while (*lock == 1)
pthread_cond_wait(&guard_cv, &guard_mut);
int result = *initialized;
if (result == 0)
*lock = 1;
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_acquire failed to release mutex");
return result == 0;
}
void __cxa_guard_release(int64_t* guard_object)
{
uint8_t* initialized = (uint8_t*)guard_object;
uint8_t* lock = (uint8_t*)guard_object + 1;
int r = pthread_mutex_lock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_release failed to acquire mutex");
*initialized = 1;
*lock = 0;
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_release failed to release mutex");
r = pthread_cond_broadcast(&guard_cv);
if (r != 0)
__libcxxabi::abort("__cxa_guard_release failed to broadcast condition variable");
}
void __cxa_guard_abort(int64_t* guard_object)
{
uint8_t* lock = (uint8_t*)guard_object + 1;
int r = pthread_mutex_lock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_abort failed to acquire mutex");
*lock = 0;
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_abort failed to release mutex");
r = pthread_cond_broadcast(&guard_cv);
if (r != 0)
__libcxxabi::abort("__cxa_guard_abort failed to broadcast condition variable");
}
} // extern "C"
} // __cxxabiv1
The structure of our implementations is nearly identical.
If I understand correctly, yours does detect recursive initializers.
pthread_mutex_lock will return EDEADLK, and your code will abort.

I'll need to spend more time reviewing your implementation before I make
further comments about it. I'm not sufficiently familiar with the pthreads
API.

One of the things I don't like about my implementation (or yours) is that it
does not detect recursive entry of the same static object. I'm thinking of
something that makes use of thread_local data get that part, but I'm not
there yet. But I thought I'd share what I have so far.
Back when I thought that we needed to support self-recursive initialization,
I came up with a way to do this. The fundamental problem with pthread_self()
is that it returns an 8-byte opaque object (ie., they aren't guaranteed to
be dense), and C++11's thread library is worse. (Linux supports "gettid()"
which returns a 4-byte ID which would fit neatly in the guard variable, but
let's leave that aside for the moment.) We can construct our own thread ID
by creating a thread_local variable, and we can make it use only 7 bytes by
giving it 256 byte alignment. The recursion depth counter would then be kept
on the initialization byte (supporting 254 levels of recursion depth with
values 0 and 1 reserved for an unlocked guard variable).

The nasty part is that we don't have a 7-byte CAS operation, but I'm pretty
sure we don't need it. On entry, do CAS of all-zeros to (&thread_id | 0x2)
and if you get it, you're the first to lock the variable. Otherwise, you can
do an unordered memory read to check the thread id. If it matches, you're
the thread which already holds the lock so you can conclude that your
previous memory read was safe since no other thread is touching that memory.
You can then increment the recursion count (no atomicity needed, for the
same reason). If the thread_id doesn't match, spin until you see an
initialized value in the guard variable, or successfully lock it (implying
that the previous attempt ended in __cxa_guard_abort).

Thanks for helping us with this library!
Thanks for releasing it!

Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.uiuc.edu/pipermail/cfe-commits/attachments/20110522/5a38850a/attachment-0001.html
John McCall
2011-05-22 20:52:14 UTC
Permalink
Post by Howard Hinnant
I've started dipping my toes in the libcxxabi project by implementing a few simple methods. Patch attached, please review!
This isn't heavily tested, I'm mostly just trying to make sure I have the style down. Please let me know if we should be using different file names (hard to follow a pattern where there isn't one), of if the private methods in cxa_guard actually belong in an anonymous namespace, etc.
Nick
<libcxxabi-virtual-and-guard.patch>
I disagree with John that your implementation doesn't support the recursive calls for *different* static objects. I believe it does.
Thanks. I was just about to write up an email to the same effect. It does support recursive calls for different static objects because the lock it uses is the second byte of the guard variable, not a global (or per-thread) lock.
Yeah, sorry about that.
Post by Howard Hinnant
Back when I thought that we needed to support self-recursive initialization, I came up with a way to do this. The fundamental problem with pthread_self() is that it returns an 8-byte opaque object (ie., they aren't guaranteed to be dense), and C++11's thread library is worse. (Linux supports "gettid()" which returns a 4-byte ID which would fit neatly in the guard variable, but let's leave that aside for the moment.) We can construct our own thread ID by creating a thread_local variable, and we can make it use only 7 bytes by giving it 256 byte alignment. The recursion depth counter would then be kept on the initialization byte (supporting 254 levels of recursion depth with values 0 and 1 reserved for an unlocked guard variable).
This won't work; the initialization byte has to stay zero until initialization is complete.

John.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.uiuc.edu/pipermail/cfe-commits/attachments/20110522/978f50bf/attachment.html
Nick Lewycky
2011-05-22 20:55:39 UTC
Permalink
Post by Nick Lewycky
Post by Nick Lewycky
I've started dipping my toes in the libcxxabi project by implementing a
few simple methods. Patch attached, please review!
Post by Nick Lewycky
This isn't heavily tested, I'm mostly just trying to make sure I have
the style down. Please let me know if we should be using different file
names (hard to follow a pattern where there isn't one), of if the private
methods in cxa_guard actually belong in an anonymous namespace, etc.
Post by Nick Lewycky
Nick
<libcxxabi-virtual-and-guard.patch>
I disagree with John that your implementation doesn't support the
recursive calls for *different* static objects. I believe it does.
Thanks. I was just about to write up an email to the same effect. It does
support recursive calls for different static objects because the lock it
uses is the second byte of the guard variable, not a global (or per-thread)
lock.
Yeah, sorry about that.
Back when I thought that we needed to support self-recursive
initialization, I came up with a way to do this. The fundamental problem
with pthread_self() is that it returns an 8-byte opaque object (ie., they
aren't guaranteed to be dense), and C++11's thread library is worse. (Linux
supports "gettid()" which returns a 4-byte ID which would fit neatly in the
guard variable, but let's leave that aside for the moment.) We can construct
our own thread ID by creating a thread_local variable, and we can make it
use only 7 bytes by giving it 256 byte alignment. The recursion depth
counter would then be kept on the initialization byte (supporting 254 levels
of recursion depth with values 0 and 1 reserved for an unlocked guard
variable).
This won't work; the initialization byte has to stay zero until
initialization is complete.
My mistake, of course the ABI document states that the first byte must not
be modified by __cxa_guard_acquire.

(By the way, unless I missed it, the ABI document never says that the guard
variable is initialized to zero. It seems to me that it has to be though...)

NIck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.uiuc.edu/pipermail/cfe-commits/attachments/20110522/15371235/attachment.html
John McCall
2011-05-22 21:00:59 UTC
Permalink
Post by John McCall
Post by Howard Hinnant
Back when I thought that we needed to support self-recursive initialization, I came up with a way to do this. The fundamental problem with pthread_self() is that it returns an 8-byte opaque object (ie., they aren't guaranteed to be dense), and C++11's thread library is worse. (Linux supports "gettid()" which returns a 4-byte ID which would fit neatly in the guard variable, but let's leave that aside for the moment.) We can construct our own thread ID by creating a thread_local variable, and we can make it use only 7 bytes by giving it 256 byte alignment. The recursion depth counter would then be kept on the initialization byte (supporting 254 levels of recursion depth with values 0 and 1 reserved for an unlocked guard variable).
This won't work; the initialization byte has to stay zero until initialization is complete.
My mistake, of course the ABI document states that the first byte must not be modified by __cxa_guard_acquire.
(By the way, unless I missed it, the ABI document never says that the guard variable is initialized to zero. It seems to me that it has to be though...)
I think you can read in a requirement that the first byte be statically initialized to zero, but yeah, I agree that the compiler really must zero-initialize the whole thing.

John.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.uiuc.edu/pipermail/cfe-commits/attachments/20110522/c71975bf/attachment-0001.html
Howard Hinnant
2011-05-22 23:28:17 UTC
Permalink
Question for clang developers:

Will these namespace-scope definitions ever generate a call to __cxa_guard_acquire? I know this is forward looking concerning thread_local:

#include <stddef.h>

namespace
{

size_t next_id;
thread_local const size_t id = __sync_fetch_and_add(&next_id, 1) + 1;

}

My motivation is to generate a 56-bit non-zero thread id for use in the __cxa_guard_acquire implementation. Thus if these call __cxa_guard_acquire, then I can't use them.

Thanks,
Howard
John McCall
2011-05-22 23:42:06 UTC
Permalink
The only variable definitions that require _cxa_guard_acquire are static locals and static data members of templated classes.

John.
Howard Hinnant
2011-05-23 02:10:56 UTC
Permalink
Post by John McCall
The only variable definitions that require _cxa_guard_acquire are static locals and static data members of templated classes.
Ok, thanks John.

I've got a prototype here. It is kind of messy. But it is basically derived from Nick's code which I think is a good fundamental design. Though I've rejected spin locks. It would work best with thread_local data. But for obvious reasons I haven't tested that branch and am emulating thread_local with pthread_key_t. This implementation is also sensitive to endian because the Itanium spec is in terms of "first byte" instead of "high byte".

In a nutshell, I develop a thread id that can be stored in size_t. An assumption is that the number of threads in the lifetime of an application will be less than 2^32 on a 32 bit architecture, and less than 2^56 on a 64 bit architecture. This thread id is always non-zero, and is nothing more than a sequential count of threads. Note that it is not a count of currently active threads, but a count of the number of times a thread is created. pthread_self() is specifically not used for this purpose because it often takes up more storage space than the 56 bits we have available.

The thread id is stored in the guard_variable while that thread is initializing the guarded variable, and otherwise 0 is stored in those 56 bits of the guard variable. This is used both to state that the variable is in the process of initialization (Nick's lock to block other threads), and to detect recursive initialization, in which case we abort instead of hang.

An assumption (and clang developers will have to tell me if it is correct or not) is that the compiler will do a proper double-checked locking dance prior to calling __cxa_guard_acquire. I.e. __cxa_guard_acquire immediately locks a mutex. If this assumption is wrong, then __cxa_guard_acquire could do the double checked locking. But the correct way to do that is platform dependent and can't be done portably without C++'s <atomic> header which is in turn platform dependent in its implementation. So I don't see a big win in lowering the double checked locking into __cxa_guard_acquire.

Oh, another major assumption is that pthreads is there. If that assumption is wrong (Windows?) then it might be best to just put a giant #if around the whole thing. And if we're targeting a non-multithreaded platform, that could be a very easily implementable 3rd branch in the outer #if.

Lightly tested.

Comments, suggestions welcome.

Howard

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

namespace __cxxabiv1
{

namespace __libcxxabi
{

void abort(const char* s) {printf("%s\n", s); ::abort();}

} // __libcxxabi

namespace
{

pthread_mutex_t guard_mut = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t guard_cv = PTHREAD_COND_INITIALIZER;

size_t next_id;

#if __LITTLE_ENDIAN__
# define ID_INCR 256
#else
# define ID_INCR 1
#endif

#if __has_feature(cxx_thread_local)

thread_local const size_t id = __sync_fetch_and_add(&next_id, ID_INCR) + ID_INCR;

#else // __has_feature(cxx_thread_local)

pthread_key_t id_key;

int create_id = pthread_key_create(&id_key, 0);

#endif // __has_feature(cxx_thread_local)

#if __LITTLE_ENDIAN__

inline
int64_t
get_lock(int64_t x)
{
return x & 0xFFFFFFFFFFFFFF00;
}

inline
void
set_lock(int64_t& x, int64_t y)
{
x &= 0x00000000000000FF;
x |= (y & 0xFFFFFFFFFFFFFF00);
}

#else // __LITTLE_ENDIAN__

inline
int64_t
get_lock(int64_t x)
{
return x & 0x00FFFFFFFFFFFFFF;
}

inline
void
set_lock(int64_t& x, int64_t y)
{
x &= 0x00FFFFFFFFFFFFFF;
x |= (y & 0x00FFFFFFFFFFFFFFFF);
}

#endif // __LITTLE_ENDIAN__

} // unnamed namespace

extern "C"
{

int __cxa_guard_acquire(int64_t* guard_object)
{
int8_t* initialized = (int8_t*)guard_object;
if (pthread_mutex_lock(&guard_mut))
__libcxxabi::abort("__cxa_guard_acquire failed to acquire mutex");
#if !__has_feature(cxx_thread_local)
size_t id = (size_t)pthread_getspecific(id_key);
if (id == 0)
{
next_id += ID_INCR;
id = next_id;
pthread_setspecific(id_key, (const void*)id);
}
#endif
int64_t lock = get_lock(*guard_object);
if (lock)
{
// if this thread set the lock for this same guard_object, abort
if (lock == id)
__libcxxabi::abort("__cxa_guard_acquire detected deadlock");
do
{
if (pthread_cond_wait(&guard_cv, &guard_mut))
__libcxxabi::abort("__cxa_guard_acquire condition variable wait failed");
lock = get_lock(*guard_object);
} while (lock);
}
int result = *initialized == 0;
if (result)
set_lock(*guard_object, id);
if (pthread_mutex_unlock(&guard_mut))
__libcxxabi::abort("__cxa_guard_acquire failed to release mutex");
return result;
}

void __cxa_guard_release(int64_t* guard_object)
{
uint8_t* initialized = (uint8_t*)guard_object;
if (pthread_mutex_lock(&guard_mut))
__libcxxabi::abort("__cxa_guard_release failed to acquire mutex");
*initialized = 1;
set_lock(*guard_object, 0);
if (pthread_mutex_unlock(&guard_mut))
__libcxxabi::abort("__cxa_guard_release failed to release mutex");
if (pthread_cond_broadcast(&guard_cv))
__libcxxabi::abort("__cxa_guard_release failed to broadcast condition variable");
}

void __cxa_guard_abort(int64_t* guard_object)
{
if (pthread_mutex_lock(&guard_mut))
__libcxxabi::abort("__cxa_guard_abort failed to acquire mutex");
set_lock(*guard_object, 0);
if (pthread_mutex_unlock(&guard_mut))
__libcxxabi::abort("__cxa_guard_abort failed to release mutex");
if (pthread_cond_broadcast(&guard_cv))
__libcxxabi::abort("__cxa_guard_abort failed to broadcast condition variable");
}

} // extern "C"

} // __cxxabiv1
John McCall
2011-05-23 07:22:38 UTC
Permalink
Post by Howard Hinnant
Post by John McCall
The only variable definitions that require _cxa_guard_acquire are static locals and static data members of templated classes.
Ok, thanks John.
I've got a prototype here. It is kind of messy. But it is basically derived from Nick's code which I think is a good fundamental design. Though I've rejected spin locks. It would work best with thread_local data. But for obvious reasons I haven't tested that branch and am emulating thread_local with pthread_key_t. This implementation is also sensitive to endian because the Itanium spec is in terms of "first byte" instead of "high byte".
In a nutshell, I develop a thread id that can be stored in size_t. An assumption is that the number of threads in the lifetime of an application will be less than 2^32 on a 32 bit architecture, and less than 2^56 on a 64 bit architecture. This thread id is always non-zero, and is nothing more than a sequential count of threads. Note that it is not a count of currently active threads, but a count of the number of times a thread is created. pthread_self() is specifically not used for this purpose because it often takes up more storage space than the 56 bits we have available.
A thread-local with a mandatory initializer seems like a lot of overhead for a QoI thing like aborting on a recursive initialization ? I dunno, maybe I'm over-thinking things. I do think that it would be good to allow platforms to opt out, or at least to provide a different implementation if the platform happens to support a range-restricted TID.

In particular, while Darwin's pthread_t is a pointer, Darwin does provide mach_thread_self(), which returns a mach_port_t, which is (ultimately) an unsigned int. This is the sort of thing that most OSes probably have an equivalent of. For example, on Linux (maybe only certain distributions?), you can use gettid(), which is a signed int.

Just to make your life more complicated, guard variables on ARM are only 32 bits, but they give the implementation 31 bits to play with: the compiler's fast-path check only considers whether the low *bit* is non-zero.

I don't think you need to worry about platforms that use the Itanium ABI but that don't have pthreads. :) Undoubtedly someone will correct me, though.
Post by Howard Hinnant
An assumption (and clang developers will have to tell me if it is correct or not) is that the compiler will do a proper double-checked locking dance prior to calling __cxa_guard_acquire.
I'm not sure what code you're imagining that the compiler might emit here; if the compiler emitted a full and proper double-checked locking dance, it would not need to call __cxa_guard_acquire. The compiler has the option of emitting an unprotected check that the first byte is non-zero, subject only to the restriction that accesses to the protected global cannot be re-ordered around that check. It's not required to do any such thing, though.

In short, for a multitude of reasons, __cxa_guard_acquire must be prepared for the possibility that the variable has already been initialized by the time it grabs the lock.

That said, it's probably best to assume that the compiler did a preliminary check and that it's not worth performing a second such check before trying to grab the lock.
Post by Howard Hinnant
Comments, suggestions welcome.
Pedantic point of the week: 'initialized' should be specifically a char*, not an int8_t*. int8_t may be 'signed char', which in C++ does not have the omnipotent aliasing power of 'char' or 'unsigned char'.

Also, I don't see a reason to clear the "lock" on cxa_guard_release, and on every other path, set_lock can safely assume that the first byte of the guard is zero.

But this looks like a great start!

John.
Howard Hinnant
2011-05-23 14:13:03 UTC
Permalink
Post by John McCall
Post by Howard Hinnant
Post by John McCall
The only variable definitions that require _cxa_guard_acquire are static locals and static data members of templated classes.
Ok, thanks John.
I've got a prototype here. It is kind of messy. But it is basically derived from Nick's code which I think is a good fundamental design. Though I've rejected spin locks. It would work best with thread_local data. But for obvious reasons I haven't tested that branch and am emulating thread_local with pthread_key_t. This implementation is also sensitive to endian because the Itanium spec is in terms of "first byte" instead of "high byte".
In a nutshell, I develop a thread id that can be stored in size_t. An assumption is that the number of threads in the lifetime of an application will be less than 2^32 on a 32 bit architecture, and less than 2^56 on a 64 bit architecture. This thread id is always non-zero, and is nothing more than a sequential count of threads. Note that it is not a count of currently active threads, but a count of the number of times a thread is created. pthread_self() is specifically not used for this purpose because it often takes up more storage space than the 56 bits we have available.
A thread-local with a mandatory initializer seems like a lot of overhead for a QoI thing like aborting on a recursive initialization ? I dunno, maybe I'm over-thinking things. I do think that it would be good to allow platforms to opt out, or at least to provide a different implementation if the platform happens to support a range-restricted TID.
In particular, while Darwin's pthread_t is a pointer, Darwin does provide mach_thread_self(), which returns a mach_port_t, which is (ultimately) an unsigned int.
I didn't know this, thanks! I will look into taking advantage of it.
Post by John McCall
This is the sort of thing that most OSes probably have an equivalent of. For example, on Linux (maybe only certain distributions?), you can use gettid(), which is a signed int.
Just to make your life more complicated, guard variables on ARM are only 32 bits, but they give the implementation 31 bits to play with: the compiler's fast-path check only considers whether the low *bit* is non-zero.
I don't think you need to worry about platforms that use the Itanium ABI but that don't have pthreads. :) Undoubtedly someone will correct me, though.
Post by Howard Hinnant
An assumption (and clang developers will have to tell me if it is correct or not) is that the compiler will do a proper double-checked locking dance prior to calling __cxa_guard_acquire.
I'm not sure what code you're imagining that the compiler might emit here; if the compiler emitted a full and proper double-checked locking dance, it would not need to call __cxa_guard_acquire. The compiler has the option of emitting an unprotected check that the first byte is non-zero, subject only to the restriction that accesses to the protected global cannot be re-ordered around that check. It's not required to do any such thing, though.
Yes, that was what I was thinking, but also on some platforms (almost certainly not intel) an atomic read may be necessary. I'm weak in this area though.
Post by John McCall
In short, for a multitude of reasons, __cxa_guard_acquire must be prepared for the possibility that the variable has already been initialized by the time it grabs the lock.
Oh, yes, absolutely. The danger I'm worried about is the compiler getting a false '1' out of the first byte (using a non-atomic read) before the initializing thread is quite done. I'm always confused as to how that might happen, but have been repeatedly warned by those more knowledgeable than me in this area, that it is possible on platforms with sufficiently weak memory ordering.

http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html
Post by John McCall
That said, it's probably best to assume that the compiler did a preliminary check and that it's not worth performing a second such check before trying to grab the lock.
Yes, that is the question I was asking, and the answer I was assuming, thanks.
Post by John McCall
Post by Howard Hinnant
Comments, suggestions welcome.
Pedantic point of the week: 'initialized' should be specifically a char*, not an int8_t*. int8_t may be 'signed char', which in C++ does not have the omnipotent aliasing power of 'char' or 'unsigned char'.
Ok, thanks.
Post by John McCall
Also, I don't see a reason to clear the "lock" on cxa_guard_release, and on every other path, set_lock can safely assume that the first byte of the guard is zero.
Ok, thanks.
Post by John McCall
But this looks like a great start!
John.
Howard
John McCall
2011-05-23 23:11:28 UTC
Permalink
Post by Howard Hinnant
Post by John McCall
Post by Howard Hinnant
An assumption (and clang developers will have to tell me if it is correct or not) is that the compiler will do a proper double-checked locking dance prior to calling __cxa_guard_acquire.
I'm not sure what code you're imagining that the compiler might emit here; if the compiler emitted a full and proper double-checked locking dance, it would not need to call __cxa_guard_acquire. The compiler has the option of emitting an unprotected check that the first byte is non-zero, subject only to the restriction that accesses to the protected global cannot be re-ordered around that check. It's not required to do any such thing, though.
Yes, that was what I was thinking, but also on some platforms (almost certainly not intel) an atomic read may be necessary. I'm weak in this area though.
Post by John McCall
In short, for a multitude of reasons, __cxa_guard_acquire must be prepared for the possibility that the variable has already been initialized by the time it grabs the lock.
Oh, yes, absolutely. The danger I'm worried about is the compiler getting a false '1' out of the first byte (using a non-atomic read) before the initializing thread is quite done. I'm always confused as to how that might happen, but have been repeatedly warned by those more knowledgeable than me in this area, that it is possible on platforms with sufficiently weak memory ordering.
Yes. Specifically, a processor has to be able to see writes made by a different processor out of order (writes to non-conflicting objects, obviously). I'm going to break this down because it turns out to matter.

Imagine that Alice and Bob are racing to initialize a variable, and Bob happens to get there first.

Just in terms of writes, Bob is doing this:
store x to _variable
store 1 to _guard
Alice is doing this:
tmp = load _guard
goto skip if tmp != 0
try to acquire guard, initialize, etc.
y = load _variable
and we want to guarantee that x == y. The way we prove this is to set things up such that if [store _guard < load _guard] (because the load read the value written by the store) then the processor's memory model guarantees that [store _variable < load _variable].

The reason that x86 doesn't require anything special here is that x86 promises not to "reorder" stores after stores or loads after loads; that is, it guarantees that sequenced stores will become visible to other processors in the same order, and it guarantees that sequenced loads will see stores in an order consistent with the order they became visible. So we automatically get that [store _variable < store _guard] and [load _guard < load _variable], and we don't need any fences.

However, the implications for hardware are that, e.g., a processor that needs to replace a dirty cache line must also publish every write it's made up to that point, and a processor that has a cache read miss has to make sure that the rest of its cache is up-to-date. Those are pretty strong guarantees, so some architectures use weaker rules. That means that, without barriers, Bob might publish his store to _guard before he publishes his store to _variable, or Alice might just use a cached version of _variable which doesn't include Bob's write. So to make this work, Bob has to do a write barrier (effectively, publishing all his stores), and Alice has to do a read barrier (effectively, updating or invalidating her cache). That gives us that both [store _variable < write barrier < store _guard] and [load _guard < read barrier < load _variable], which is what we need.

So what the compiler has to do is to make sure that it emits a read barrier along the fast path, or else we might get a stale read; but __cxa_guard_release also needs to perform a write barrier. There are barriers happening as part of the mutex logic, but conventionally, acquiring a mutex only performs a read barrier and releasing it performs a write barrier, and the mutex release happens *after* the write to guard. So I think you either need an extra write barrier before the store to guard, or you need to move guard out of the mutex (with all the intendant complexity).

John.
Howard Hinnant
2011-05-23 23:34:31 UTC
Permalink
Post by John McCall
Post by Howard Hinnant
Post by John McCall
Post by Howard Hinnant
An assumption (and clang developers will have to tell me if it is correct or not) is that the compiler will do a proper double-checked locking dance prior to calling __cxa_guard_acquire.
I'm not sure what code you're imagining that the compiler might emit here; if the compiler emitted a full and proper double-checked locking dance, it would not need to call __cxa_guard_acquire. The compiler has the option of emitting an unprotected check that the first byte is non-zero, subject only to the restriction that accesses to the protected global cannot be re-ordered around that check. It's not required to do any such thing, though.
Yes, that was what I was thinking, but also on some platforms (almost certainly not intel) an atomic read may be necessary. I'm weak in this area though.
Post by John McCall
In short, for a multitude of reasons, __cxa_guard_acquire must be prepared for the possibility that the variable has already been initialized by the time it grabs the lock.
Oh, yes, absolutely. The danger I'm worried about is the compiler getting a false '1' out of the first byte (using a non-atomic read) before the initializing thread is quite done. I'm always confused as to how that might happen, but have been repeatedly warned by those more knowledgeable than me in this area, that it is possible on platforms with sufficiently weak memory ordering.
Yes. Specifically, a processor has to be able to see writes made by a different processor out of order (writes to non-conflicting objects, obviously). I'm going to break this down because it turns out to matter.
Imagine that Alice and Bob are racing to initialize a variable, and Bob happens to get there first.
store x to _variable
store 1 to _guard
tmp = load _guard
goto skip if tmp != 0
try to acquire guard, initialize, etc.
y = load _variable
and we want to guarantee that x == y. The way we prove this is to set things up such that if [store _guard < load _guard] (because the load read the value written by the store) then the processor's memory model guarantees that [store _variable < load _variable].
The reason that x86 doesn't require anything special here is that x86 promises not to "reorder" stores after stores or loads after loads; that is, it guarantees that sequenced stores will become visible to other processors in the same order, and it guarantees that sequenced loads will see stores in an order consistent with the order they became visible. So we automatically get that [store _variable < store _guard] and [load _guard < load _variable], and we don't need any fences.
However, the implications for hardware are that, e.g., a processor that needs to replace a dirty cache line must also publish every write it's made up to that point, and a processor that has a cache read miss has to make sure that the rest of its cache is up-to-date. Those are pretty strong guarantees, so some architectures use weaker rules. That means that, without barriers, Bob might publish his store to _guard before he publishes his store to _variable, or Alice might just use a cached version of _variable which doesn't include Bob's write. So to make this work, Bob has to do a write barrier (effectively, publishing all his stores), and Alice has to do a read barrier (effectively, updating or invalidating her cache). That gives us that both [store _variable < write barrier < store _guard] and [load _guard < read barrier < load _variable], which is what we need.
So what the compiler has to do is to make sure that it emits a read barrier along the fast path, or else we might get a stale read; but __cxa_guard_release also needs to perform a write barrier. There are barriers happening as part of the mutex logic, but conventionally, acquiring a mutex only performs a read barrier and releasing it performs a write barrier, and the mutex release happens *after* the write to guard. So I think you either need an extra write barrier before the store to guard, or you need to move guard out of the mutex (with all the intendant complexity).
I follow all of this right up until the very last sentence.

My conclusion would have been that the mutex unlock issued after the write to the guard is exactly the write barrier required, and that (on some platforms) the compiler would need to issue a read barrier just before checking the guard (i.e. an atomic read with memory_order_acquire in C++11-speak).

Howard
John McCall
2011-05-24 00:53:43 UTC
Permalink
Post by Howard Hinnant
My conclusion would have been that the mutex unlock issued after the write to the guard is exactly the write barrier required, and that (on some platforms) the compiler would need to issue a read barrier just before checking the guard (i.e. an atomic read with memory_order_acquire in C++11-speak).
A write barrier *after* the write to _guard isn't good enough. A
read barrier / write barrier promises that reads/writes (respectively)
executed after the barrier will not be visible out of order with
reads/writes executed before the barrier. They only actually give
useful ordering guarantees in tandem.

As it stands, there's an unlikely race:

Thread A writes to _variable.
Thread A writes to _guard
Thread B reads _guard, seeing Thread A's write.
Thread B performs a read barrier.
Thread B reads _variable, not seeing Thread A's write because
of the architecture's relatively weak memory ordering.
Thread A performs a write barrier.

If Thread A now stored something to _foo, and Thread C read from _foo,
saw Thread A's update, and did a read barrier, then Thread C would
be guaranteed to see both of Thread A's earlier stores. But there's no
analogous promise about Thread B because they haven't "synchronized"
on anything.

John.
Howard Hinnant
2011-05-24 22:13:00 UTC
Permalink
Committed based on Nick's patch revision 132009. I ended up doing an __APPLE__ branch for the guard variable because of the desire to check for deadlock. The non-Apple branch is pretty generic, and will hang on deadlock. Other branches can be made of course (e.g. linux).

The problem with checking for deadlock is you need a cheap thread id that will fit in 56 bits, and that isn't portable. Somewhere in this thread, someone mentioned how to do it, and that would be a welcome patch.

We've had an Apple expert on memory visibility look over the Apple branch and he's signed off on it. Though if anyone sees any problems with either branch, please speak up. I've tested both branches fairly heavily (in addition to the tests that Nick wrote and are committed). I didn't commit these extra tests because I wrote them using C++11 and libc++ (where it is so easy to create and manipulate threads).

Arm is not addressed herein, and patches that do so are of course welcome.

Finally, thanks Nick for getting this started!

Howard
Howard Hinnant
2011-05-24 23:09:21 UTC
Permalink
Post by Howard Hinnant
Committed based on Nick's patch revision 132009. I ended up doing an __APPLE__ branch for the guard variable because of the desire to check for deadlock. The non-Apple branch is pretty generic, and will hang on deadlock. Other branches can be made of course (e.g. linux).
The problem with checking for deadlock is you need a cheap thread id that will fit in 56 bits, and that isn't portable. Somewhere in this thread, someone mentioned how to do it, and that would be a welcome patch.
We've had an Apple expert on memory visibility look over the Apple branch and he's signed off on it. Though if anyone sees any problems with either branch, please speak up. I've tested both branches fairly heavily (in addition to the tests that Nick wrote and are committed). I didn't commit these extra tests because I wrote them using C++11 and libc++ (where it is so easy to create and manipulate threads).
Arm is not addressed herein, and patches that do so are of course welcome.
Finally, thanks Nick for getting this started!
I forgot to thank Dave Zarzycki for his generous and tremendously expert help on this. Thanks Dave!

Howard
Nick Lewycky
2011-05-23 21:59:52 UTC
Permalink
Post by Howard Hinnant
Post by Howard Hinnant
Post by John McCall
Post by Howard Hinnant
Will these namespace-scope definitions ever generate a call to
__cxa_guard_acquire? I know this is forward looking concerning
Post by Howard Hinnant
Post by John McCall
The only variable definitions that require _cxa_guard_acquire are static
locals and static data members of templated classes.
Post by Howard Hinnant
Ok, thanks John.
I've got a prototype here. It is kind of messy. But it is basically
derived from Nick's code which I think is a good fundamental design. Though
I've rejected spin locks. It would work best with thread_local data. But
for obvious reasons I haven't tested that branch and am emulating
thread_local with pthread_key_t. This implementation is also sensitive to
endian because the Itanium spec is in terms of "first byte" instead of "high
byte".
Post by Howard Hinnant
In a nutshell, I develop a thread id that can be stored in size_t. An
assumption is that the number of threads in the lifetime of an application
will be less than 2^32 on a 32 bit architecture, and less than 2^56 on a 64
bit architecture. This thread id is always non-zero, and is nothing more
than a sequential count of threads. Note that it is not a count of
currently active threads, but a count of the number of times a thread is
created. pthread_self() is specifically not used for this purpose because
it often takes up more storage space than the 56 bits we have available.
A thread-local with a mandatory initializer seems like a lot of overhead
for a QoI thing like aborting on a recursive initialization ? I dunno, maybe
I'm over-thinking things. I do think that it would be good to allow
platforms to opt out, or at least to provide a different implementation if
the platform happens to support a range-restricted TID.
In particular, while Darwin's pthread_t is a pointer, Darwin does provide
mach_thread_self(), which returns a mach_port_t, which is (ultimately) an
unsigned int. This is the sort of thing that most OSes probably have an
equivalent of. For example, on Linux (maybe only certain distributions?),
you can use gettid(), which is a signed int.
Better than gettid(), Linux's pthreads implementation is based on futex()
which provides a way to implement a condition variable on an int*. I'd be
willing to try my hand at implementing guards with futex(), but won't get
around to it soon.

Just to make your life more complicated, guard variables on ARM are only 32
Post by Howard Hinnant
bits, but they give the implementation 31 bits to play with: the compiler's
fast-path check only considers whether the low *bit* is non-zero.
Yup. This. I wanted to bring this up, but thought I should have some code
contributed before stirring it up. :)

Is libcxxabi interested in providing a home for ARM ABI as well as Itanium
ABI? And if so, what should we do to share code? On ARM we need to provide
many/all of the Itanium ABI functions but compilers are expected to call the
ARM ABI ones instead. Their ABI manual suggests implementing the Itanium
functions in terms of the ARM ones instead of the other way around. How
would we go about doing that?

Nick

I don't think you need to worry about platforms that use the Itanium ABI but
Post by Howard Hinnant
that don't have pthreads. :) Undoubtedly someone will correct me, though.
Post by Howard Hinnant
An assumption (and clang developers will have to tell me if it is correct
or not) is that the compiler will do a proper double-checked locking dance
prior to calling __cxa_guard_acquire.
I'm not sure what code you're imagining that the compiler might emit here;
if the compiler emitted a full and proper double-checked locking dance, it
would not need to call __cxa_guard_acquire. The compiler has the option of
emitting an unprotected check that the first byte is non-zero, subject only
to the restriction that accesses to the protected global cannot be
re-ordered around that check. It's not required to do any such thing,
though.
In short, for a multitude of reasons, __cxa_guard_acquire must be prepared
for the possibility that the variable has already been initialized by the
time it grabs the lock.
That said, it's probably best to assume that the compiler did a preliminary
check and that it's not worth performing a second such check before trying
to grab the lock.
Post by Howard Hinnant
Comments, suggestions welcome.
Pedantic point of the week: 'initialized' should be specifically a char*,
not an int8_t*. int8_t may be 'signed char', which in C++ does not have the
omnipotent aliasing power of 'char' or 'unsigned char'.
Also, I don't see a reason to clear the "lock" on cxa_guard_release, and on
every other path, set_lock can safely assume that the first byte of the
guard is zero.
But this looks like a great start!
John.
_______________________________________________
cfe-commits mailing list
cfe-commits at cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.uiuc.edu/pipermail/cfe-commits/attachments/20110523/1fde56bf/attachment-0001.html
Renato Golin
2011-05-25 01:53:02 UTC
Permalink
Post by Nick Lewycky
Is libcxxabi interested in providing a home for ARM ABI as well as Itanium
ABI? And if so, what should we do to share code?
Hi Nick,

I'd very much like that to happen... ;)
Post by Nick Lewycky
On ARM we need to provide
many/all of the Itanium ABI functions but compilers are expected to call the
ARM ABI ones instead. Their ABI manual suggests implementing the Itanium
functions in terms of the ARM ones instead of the other way around. How
would we go about doing that?
That's pretty much it. ARM C++ ABI is not that different from Itanium.

The way we're doing now in LLVM's ARM back-end for library selection
(calling ARM names instead of GNU ones) is enough on the majority of
cases. Two problems we found (memset and divrem) can be easily
implemented with special lowering. It should be as simple as ARM
functions calling the default ones on most cases.

I'm not sure how that would work in the libraries, but CodeSourcery's
worked that out already in GCC. I don't think it should be too
difficult, though I'm not an expert in writing libraries.

I'll have a look on libcxxabi later on to see if I can spot any simple
way of doing this, but from what you've said, you're probably far
ahead. ;)

cheers,
--renato

Howard Hinnant
2011-05-22 20:55:42 UTC
Permalink
Post by Howard Hinnant
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
namespace __cxxabiv1
{
namespace __libcxxabi
{
void abort(const char* s) {printf("%s\n", s); ::abort();}
} // __libcxxabi
static pthread_mutex_t guard_mut = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t guard_cv = PTHREAD_COND_INITIALIZER;
extern "C"
{
int __cxa_guard_acquire(int64_t* guard_object)
{
uint8_t* initialized = (uint8_t*)guard_object;
uint8_t* lock = (uint8_t*)guard_object + 1;
int r = pthread_mutex_lock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_acquire failed to acquire mutex");
if (*initialized == 1)
{
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_acquire failed to release mutex");
return 0;
}
while (*lock == 1)
pthread_cond_wait(&guard_cv, &guard_mut);
int result = *initialized;
if (result == 0)
*lock = 1;
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_acquire failed to release mutex");
return result == 0;
}
void __cxa_guard_release(int64_t* guard_object)
{
uint8_t* initialized = (uint8_t*)guard_object;
uint8_t* lock = (uint8_t*)guard_object + 1;
int r = pthread_mutex_lock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_release failed to acquire mutex");
*initialized = 1;
*lock = 0;
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_release failed to release mutex");
r = pthread_cond_broadcast(&guard_cv);
if (r != 0)
__libcxxabi::abort("__cxa_guard_release failed to broadcast condition variable");
}
void __cxa_guard_abort(int64_t* guard_object)
{
uint8_t* lock = (uint8_t*)guard_object + 1;
int r = pthread_mutex_lock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_abort failed to acquire mutex");
*lock = 0;
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_abort failed to release mutex");
r = pthread_cond_broadcast(&guard_cv);
if (r != 0)
__libcxxabi::abort("__cxa_guard_abort failed to broadcast condition variable");
}
} // extern "C"
} // __cxxabiv1
The structure of our implementations is nearly identical.
If I understand correctly, yours does detect recursive initializers. pthread_mutex_lock will return EDEADLK, and your code will abort.
Unfortunately no. I'm not holding the pthread_mutex_t during initialization, only while setting/inspecting guard_object. I get a hang caused by deadlock.

Fwiw, there's some harmless but worthless code in my __cxa_guard_acquire. Here it is again with it removed:

int __cxa_guard_acquire(int64_t* guard_object)
{
uint8_t* initialized = (uint8_t*)guard_object;
uint8_t* lock = (uint8_t*)guard_object + 1;
int r = pthread_mutex_lock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_acquire failed to acquire mutex");
while (*lock == 1)
pthread_cond_wait(&guard_cv, &guard_mut);
int result = *initialized;
if (result == 0)
*lock = 1;
r = pthread_mutex_unlock(&guard_mut);
if (r != 0)
__libcxxabi::abort("__cxa_guard_acquire failed to release mutex");
return result == 0;
}

Howard
John McCall
2011-05-22 20:45:44 UTC
Permalink
Post by Howard Hinnant
I've started dipping my toes in the libcxxabi project by implementing a few simple methods. Patch attached, please review!
This isn't heavily tested, I'm mostly just trying to make sure I have the style down. Please let me know if we should be using different file names (hard to follow a pattern where there isn't one), of if the private methods in cxa_guard actually belong in an anonymous namespace, etc.
Nick
<libcxxabi-virtual-and-guard.patch>
I disagree with John that your implementation doesn't support the recursive calls for *different* static objects. I believe it does. And I like that your design allows for parallel initialization of different static objects.
Oh, yes, sorry, I did misread the code as using a global lock instead of a variable-specific lock. In retrospect, I'm not sure how. :)
Post by Howard Hinnant
What I'm nervous about though is the use of the spin lock. An arbitrary amount of time can go by while one thread is initializing and the other thread is spinning, waiting for that initialization to complete.
Yeah; while there are times and places for spin locks, I don't think this is one, and this is actually a very naive spin lock.

That said, while I shouldn't be concerned about using pthread_mutex_t, I know that certain platforms have particularly terrible implementations. :)

John.
Loading...