Speed/Timing of SDL_SemWait and SDL_SemPost

For my application I decided upon an implementation which involves a
particular function having numerous exit/entry points - places where the
function should be able to return to a calling function, and later
resume from that same position.

In the absence of any option for forking call stacks, this implied
separate call stacks, and therefore in the context of C/C++/SDL,
multiple threads. Having spawned my particular function as a separate
thread, I have a wrapper function which does something like this:

SDL_SemPost(GoSemaphore);
SDL_SemWait(StopSemaphore);

The function with exit/entry points does the same thing the opposite
way around at each of those points, i.e.:

SDL_SemPost(StopSemaphore);
SDL_SemWait(GoSemaphore);
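
A minimal, self-contained sketch of this handoff follows. It is
illustrative only: POSIX semaphores and pthreads stand in for the SDL
calls (sem_post for SDL_SemPost, sem_wait for SDL_SemWait), and the
names yield_to_caller, resume_worker, run_handoff and the step counter
are hypothetical, not taken from the real application.

```c
#include <pthread.h>
#include <semaphore.h>

/* Assumption: POSIX semaphores stand in for the SDL ones.
   sem_post ~ SDL_SemPost, sem_wait ~ SDL_SemWait. */
static sem_t go_sem;        /* plays the role of GoSemaphore   */
static sem_t stop_sem;      /* plays the role of StopSemaphore */
static int steps_done;      /* work completed by the worker    */
static int worker_finished; /* set once the worker has no more steps */

/* An exit/entry point: hand control back, then block until resumed. */
static void yield_to_caller(void)
{
    sem_post(&stop_sem);
    sem_wait(&go_sem);
}

/* The function with exit/entry points, running on its own stack. */
static void *worker(void *arg)
{
    int total = *(int *)arg;

    sem_wait(&go_sem);              /* do not start until first resumed */
    for (int i = 0; i < total; i++) {
        steps_done++;               /* hypothetical unit of work */
        if (i + 1 < total)
            yield_to_caller();
    }
    worker_finished = 1;
    sem_post(&stop_sem);            /* final hand-back, no further wait */
    return NULL;
}

/* The wrapper: resume the worker and block until it yields again. */
static void resume_worker(void)
{
    sem_post(&go_sem);
    sem_wait(&stop_sem);
}

/* Runs the worker to completion and returns how many resumes it took. */
int run_handoff(int total_steps)
{
    pthread_t thread;
    int resumes = 0;

    steps_done = 0;
    worker_finished = 0;
    sem_init(&go_sem, 0, 0);
    sem_init(&stop_sem, 0, 0);
    pthread_create(&thread, NULL, worker, &total_steps);

    while (!worker_finished) {
        resume_worker();
        resumes++;
    }

    pthread_join(thread, NULL);
    sem_destroy(&go_sem);
    sem_destroy(&stop_sem);
    return resumes;
}
```

Only one of the two threads is ever runnable at a time, so the shared
counters need no extra locking. Each resume costs one post and one wait
on each side, which is the set of four calls discussed below. On a
POSIX system this would typically be built with -pthread.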

This all works perfectly, but execution time is a consideration. There
are other possible implementations that avoid threads altogether,
though they are much more complicated in source terms. Therefore I
profiled my code.

I found that SemPost was taking up ~14% of my execution time, SemWait
more like 18%, the wrapper function ~10% on its own, ~42% including
children, and the function running in a separate thread was not listed.
This is in an unrealistic test scenario which has the whole
SemWait/SemPost loop attempting to run over 15,000 times a second,
whereas I will only need 50 to 100 times a second in the end.

This test was performed using MSVC 6 under Windows 2000. I now have some
related questions:

Are SemPost and SemWait really that expensive, or am I paying for the
scheduler? E.g. does SemPost only end up unlocking things at the next
scheduler tick? If so, can I be reasonably sure that for a given
operating system, timings for these operations will tend to be unrelated
to processor speed?

If there are scheduler limits on how often SemPost/SemWait can occur, is
looking towards 100 sets of the four calls every second too optimistic?
Will I see large variations across different operating systems?
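
For scale, 15,000 round trips a second leaves at most about 67
microseconds per set of four calls, while 100 a second leaves 10
milliseconds. One rough way to check the cost independently of the
profiler is to time the round trip directly. The sketch below does this
under stated assumptions: POSIX semaphores and clock_gettime stand in
for the SDL calls and an SDL timer, and the function name
measure_round_trip_us is hypothetical. The figure it returns will vary
with the operating system and scheduler, so it is a measurement aid and
not a prediction.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime */
#endif
#include <pthread.h>
#include <semaphore.h>
#include <time.h>

/* Assumption: POSIX semaphores stand in for the SDL ones. */
static sem_t bench_go;        /* plays the role of GoSemaphore   */
static sem_t bench_stop;      /* plays the role of StopSemaphore */
static int bench_round_trips; /* round trips seen by the worker  */

/* Worker side: wait to be resumed, then hand straight back. */
static void *bench_worker(void *arg)
{
    int iterations = *(int *)arg;

    for (int i = 0; i < iterations; i++) {
        sem_wait(&bench_go);
        bench_round_trips++;
        sem_post(&bench_stop);
    }
    return NULL;
}

/* Returns the average wall-clock time, in microseconds, of one set of
   four calls (post + wait on each side). Expects iterations > 0. */
double measure_round_trip_us(int iterations)
{
    pthread_t thread;
    struct timespec start, end;

    bench_round_trips = 0;
    sem_init(&bench_go, 0, 0);
    sem_init(&bench_stop, 0, 0);
    pthread_create(&thread, NULL, bench_worker, &iterations);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        sem_post(&bench_go);   /* wrapper side: resume the worker */
        sem_wait(&bench_stop); /* ...and block until it hands back */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    pthread_join(thread, NULL);
    sem_destroy(&bench_go);
    sem_destroy(&bench_stop);

    double elapsed_us = (end.tv_sec - start.tv_sec) * 1e6
                      + (end.tv_nsec - start.tv_nsec) / 1e3;
    return elapsed_us / iterations;
}
```

Calling measure_round_trip_us(100000) and comparing the result against
the 10 millisecond budget would show how much headroom there is on a
given machine. If the result stays roughly constant across processors
of different speeds, that would suggest the scheduler dominates rather
than the calls themselves.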

Also, a question about the MSVC profiler rather than SDL: my wrapper
function makes only one call to SemPost and one to SemWait, yet it
seems to take up ~10% on its own. Should I take that as the actual time
spent in the unlisted, separately threaded function, or is it more
likely that SemWait is soaking up that execution time?

Finally, I suppose I should ask - is there a better way of achieving
what I am trying? I thought the semaphore solution was rather neat, but
I’m new to this thread stuff…

-Thomas