SDL Thread Safety

Dear SDL samurais! I have a noob question about thread safety.

I’m using SDL2 only to put the final frame on the screen, with my own blitting routines and surface format, because the ones SDL provides are too inefficient and lack important features such as textured triangle rendering. SDL is also a bit unstable, with newer versions changing everything, so depending on it less is always good. I have the following code to upload the frame:

SDL_LockSurface(surface);
memcpy(surface->pixels, framebuffer, w*h*4);
SDL_UnlockSurface(surface);
SDL_UpdateWindowSurface(window);

Now, that memcpy takes a lot of time that I could spend updating the game world for the next frame. If only it ran in its own thread, I would gain several more FPS. Yet I’m unsure whether functions like SDL_LockSurface, SDL_UnlockSurface and SDL_UpdateWindowSurface are thread safe, or whether I should make all SDL calls from a single thread that takes requests from the rest of the code.
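
A side note on the upload code above: a single memcpy of w*h*4 bytes is only correct when the surface's pitch (bytes per row, surface->pitch in SDL2) equals w*4, which is not guaranteed in general. A minimal sketch of a row-by-row copy follows. copy_rows and its parameters are hypothetical names, not SDL API; dst and dst_pitch are assumed to stand in for surface->pixels and surface->pitch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy h rows of row_bytes each between buffers whose rows may be padded.
 * Hypothetical helper: in the SDL2 case dst would be surface->pixels and
 * dst_pitch would be surface->pitch. */
static void copy_rows(void *dst, size_t dst_pitch,
                      const void *src, size_t src_pitch,
                      size_t row_bytes, int h) {
  uint8_t *d = dst;
  const uint8_t *s = src;
  if (h <= 0) return;
  if (dst_pitch == row_bytes && src_pitch == row_bytes) {
    memcpy(d, s, row_bytes * (size_t)h);  /* tightly packed: one big copy */
    return;
  }
  for (int y = 0; y < h; y++) {
    /* padded rows: copy only the visible part of each row */
    memcpy(d + (size_t)y * dst_pitch, s + (size_t)y * src_pitch, row_bytes);
  }
}
```

When both pitches equal the row size this degenerates to the original single memcpy, so it should cost nothing extra in the common case.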

Note that I can’t use surface->pixels as the framebuffer itself, because my surface format has the pixels embedded in the structure to eliminate an indirection and make set_pixel several times faster. But maybe surface->pixels can be user-allocated, so I could prefix the pixels with my own surface header data? Then there would be no need to copy anything.
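
The idea of prefixing the pixels with a custom header can be sketched with a C99 flexible array member. The type my_surface, the constructor my_surface_new and this particular set_pixel body are hypothetical, not the poster's actual code. SDL2 does provide SDL_CreateRGBSurfaceFrom, which wraps caller-owned pixel memory without copying it, but the surface returned by SDL_GetWindowSurface is owned by SDL, so a copy into the window surface would most likely still be needed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical surface type: header and pixels live in one allocation. */
typedef struct {
  int w, h;
  uint32_t pixels[];  /* flexible array member, w*h entries */
} my_surface;

static my_surface *my_surface_new(int w, int h) {
  if (w <= 0 || h <= 0) return NULL;
  size_t count = (size_t)w * (size_t)h;
  my_surface *s = calloc(1, sizeof *s + count * sizeof s->pixels[0]);
  if (!s) return NULL;
  s->w = w;
  s->h = h;
  return s;
}

static inline void set_pixel(my_surface *s, int x, int y, uint32_t c) {
  /* pixels sit at a fixed offset from s, so no separate pointer is loaded */
  s->pixels[(size_t)y * (size_t)s->w + (size_t)x] = c;
}

/* With SDL2 available, the embedded pixels could be wrapped without a copy:
 *   SDL_CreateRGBSurfaceFrom(s->pixels, s->w, s->h, 32, s->w * 4,
 *                            rmask, gmask, bmask, amask);
 * where the masks describe the pixel layout. The wrapper surface would then
 * still have to be blitted to the window surface. */
```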

Thanks in advance!

The folklore answer is ‘only call display functions in the main thread’, but I’m not sure if that’s still true (for all platforms). I know that things like this do work on some platforms (on some hardware/driver combos).

There isn’t great documentation on the thread safety of SDL functions in general. I’d like to know, but I’m not sure anyone actually knows for certain.
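
One way to respect the main-thread-only rule and still overlap the copy with game logic is to keep every SDL call on the main thread and move the world update to a worker thread instead. Below is a minimal sketch of a one-frame handoff using POSIX threads. frame_slot, slot_init, slot_publish and slot_present are hypothetical names, and the sketch assumes the game thread alternates between two framebuffers so it never writes to the one being presented. In real use the memcpy would sit between SDL_LockSurface and SDL_UnlockSurface, followed by SDL_UpdateWindowSurface, all on the main thread.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t cond;
  const uint32_t *frame;  /* finished frame waiting to be shown, or NULL */
} frame_slot;

static void slot_init(frame_slot *s) {
  pthread_mutex_init(&s->lock, NULL);
  pthread_cond_init(&s->cond, NULL);
  s->frame = NULL;
}

/* Game thread: hand over a finished frame. Blocks while the previous
 * frame is still being presented. */
static void slot_publish(frame_slot *s, const uint32_t *frame) {
  pthread_mutex_lock(&s->lock);
  while (s->frame != NULL) pthread_cond_wait(&s->cond, &s->lock);
  s->frame = frame;
  pthread_cond_broadcast(&s->cond);
  pthread_mutex_unlock(&s->lock);
}

/* Main thread: wait for a frame, copy count pixels into dst, then free
 * the slot so the game thread can publish the next frame. */
static void slot_present(frame_slot *s, uint32_t *dst, size_t count) {
  pthread_mutex_lock(&s->lock);
  while (s->frame == NULL) pthread_cond_wait(&s->cond, &s->lock);
  const uint32_t *frame = s->frame;
  pthread_mutex_unlock(&s->lock);

  /* The copy runs outside the lock; dst stands in for surface->pixels. */
  memcpy(dst, frame, count * sizeof *dst);

  pthread_mutex_lock(&s->lock);
  s->frame = NULL;
  pthread_cond_broadcast(&s->cond);
  pthread_mutex_unlock(&s->lock);
}
```

The game thread would call slot_publish after finishing each frame and immediately start drawing into its other buffer, while the main thread loops on slot_present. Whether this gains anything depends on how long the update takes relative to the copy.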

I tried running the display functions in a different thread on OS X, and it worked (it hasn’t crashed). But I looked at the SDL source code, and there is no trace of any thread safety on the several platforms I checked, so my guess is that one should never do that. Currently I’m using a manually unrolled memcpy, which produced enough of a speedup (over the library memcpy) that I stopped caring about it for now. And anyone who wants a properly optimized native app won’t be using abstraction layers like SDL anyway.

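#include <stdint.h>

/* Copy len 32-bit pixels from s to p; the buffers must not overlap. */
static void memcpy4(uint32_t *restrict p, const uint32_t *restrict s, int len) {
  uint32_t *end = p + (len & ~0x1f);  /* end of the part copied in blocks of 32 */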
  while (p != end) {
    p[ 0] = s[0];
    p[ 1] = s[1];
    p[ 2] = s[2];
    p[ 3] = s[3];
    p[ 4] = s[4];
    p[ 5] = s[5];
    p[ 6] = s[6];
    p[ 7] = s[7];
    p[ 8] = s[8];
    p[ 9] = s[9];
    p[10] = s[10];
    p[11] = s[11];
    p[12] = s[12];
    p[13] = s[13];
    p[14] = s[14];
    p[15] = s[15];
    p[16] = s[16];
    p[17] = s[17];
    p[18] = s[18];
    p[19] = s[19];
    p[20] = s[20];
    p[21] = s[21];
    p[22] = s[22];
    p[23] = s[23];
    p[24] = s[24];
    p[25] = s[25];
    p[26] = s[26];
    p[27] = s[27];
    p[28] = s[28];
    p[29] = s[29];
    p[30] = s[30];
    p[31] = s[31];
    p += 32;
    s += 32;
  }
  end += len & 0x1f;                  /* the remaining 0..31 pixels */
  while (p != end) *p++ = *s++;
}
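
An unrolled copy like memcpy4 is easy to get wrong at the block boundary, so it may be worth checking it against the library memcpy for lengths on both sides of a multiple of 32. The following is a minimal sketch. copy_fn, copy_simple and copy_matches_memcpy are hypothetical names, and copy_simple is only a stand-in so the block compiles by itself. The copy_fn typedef assumes a const source parameter and may need adjusting to match the function under test.

```c
#include <stdint.h>
#include <string.h>

typedef void (*copy_fn)(uint32_t *restrict dst, const uint32_t *restrict src,
                        int len);

/* Stand-in copy so this block is self-contained; pass memcpy4 instead. */
static void copy_simple(uint32_t *restrict dst, const uint32_t *restrict src,
                        int len) {
  for (int i = 0; i < len; i++) dst[i] = src[i];
}

/* Returns 1 if fn copies exactly len words for every len in 0..max_len and
 * leaves everything past the end untouched, 0 otherwise (or if max_len is
 * outside 0..128). */
static int copy_matches_memcpy(copy_fn fn, int max_len) {
  enum { CAP = 128 };
  uint32_t src[CAP], got[CAP + 1], want[CAP + 1];
  if (max_len < 0 || max_len > CAP) return 0;
  for (int i = 0; i < CAP; i++) src[i] = 0x01010101u * (uint32_t)(i + 1);
  for (int len = 0; len <= max_len; len++) {
    memset(got, 0xAA, sizeof got);    /* sentinel pattern to catch overruns */
    memset(want, 0xAA, sizeof want);
    fn(got, src, len);
    memcpy(want, src, (size_t)len * sizeof src[0]);
    if (memcmp(got, want, sizeof got) != 0) return 0;
  }
  return 1;
}
```

Calling copy_matches_memcpy(memcpy4, 100) would exercise lengths 0 through 100, which covers the tail loop alone, exact multiples of 32, and mixed cases.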