Rendering to 4 windows with maximum efficiency

Hello community,

this is a follow-up to the “Inexplainable race condition in SDL Renderer” discussion. I have now restructured my application (which did not fix the issue, but improved portability) and have run into efficiency issues.

This is what my application needs to do:
Render contents of 4 buffers with pixel data to 4 windows. Buffers are provided by a secondary thread while rendering is done on the main thread. Keyboard and mouse events are processed in the main thread. In more detail:

  • Main thread: Get and process events at a frequency of about 200 Hz. At the same frequency, check whether rendering is required and, if so, render the contents of the pixel buffers to 1, 2, 3 or all 4 of the windows.
  • Secondary thread: Run the primary task of my application and tell the main thread when data is available for rendering. To do so, it fills the buffers with pixel data (thread-safe) and sets an atomic variable to 1 to signal the main thread that rendering is requested. The data is provided at an average frequency of 68 Hz. The timing variability introduced by the main thread's 200 Hz polling is not an issue.

Code simplified for single window:

Shared variables:

static Uint8        buffer[4*1024*1024];
static SDL_SpinLock bufferLock;
static SDL_atomic_t bufferUpdated;

Main thread:

Uint32 r, g, b, a, format;
int bpp; // SDL_PixelFormatEnumToMasks expects an int* for bits per pixel
SDL_Event event;

SDL_Window*   sdlWindow   = SDL_CreateWindow(PROG_NAME, SDL_WINDOWPOS_UNDEFINED, SDL_WINDOWPOS_UNDEFINED, 1120, 832, 0);
SDL_Renderer* sdlRenderer = SDL_CreateRenderer(sdlWindow, -1, SDL_RENDERER_ACCELERATED);
SDL_Texture*  sdlTexture  = SDL_CreateTexture(sdlRenderer, SDL_PIXELFORMAT_UNKNOWN, SDL_TEXTUREACCESS_STREAMING, 1120, 832);

SDL_RenderSetLogicalSize(sdlRenderer, 1120, 832);
// Only the format is needed; the other outputs may be NULL.
SDL_QueryTexture(sdlTexture, &format, NULL, NULL, NULL);
SDL_PixelFormatEnumToMasks(format, &bpp, &r, &g, &b, &a);

SDL_Surface*  sdlSurface  = SDL_CreateRGBSurface(SDL_SWSURFACE, 1120, 832, 32, r, g, b, a);

while (1) {
    // Drain all pending events; event is only valid when SDL_PollEvent returns 1.
    while (SDL_PollEvent(&event)) {
        // some event handler function
    }
    SDL_Delay(5);
    // SDL_AtomicSet returns the previous value, so this tests and clears the flag.
    if (SDL_AtomicSet(&bufferUpdated, 0)) {
        SDL_AtomicLock(&bufferLock);
        SDL_UpdateTexture(sdlTexture, NULL, buffer, sdlSurface->pitch);
        SDL_AtomicUnlock(&bufferLock);
        SDL_RenderClear(sdlRenderer);
        SDL_RenderCopy(sdlRenderer, sdlTexture, NULL, NULL);
        SDL_RenderPresent(sdlRenderer);
    }
}

Secondary thread:

void bufferCopy(Uint8* src) {
    SDL_AtomicLock(&bufferLock);
    memcpy(buffer, src, sizeof(buffer));
    SDL_AtomicSet(&bufferUpdated, 1);
    SDL_AtomicUnlock(&bufferLock);
}

The problem is that my current implementation is very inefficient: it uses far too much CPU and GPU power. The original implementation rendered in four secondary threads at VSYNC while the main thread provided the data. That was much more efficient but not portable.

Any ideas how to efficiently implement this are welcome!

By the way, using SDL_LockTexture()/SDL_UnlockTexture() instead of SDL_UpdateTexture() has no noticeable effect.
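For readers comparing the two paths: SDL_LockTexture() hands back a pixel pointer and a pitch for the locked texture, and that pitch is not guaranteed to equal the stride of the source buffer. A minimal sketch of the row-by-row copy that would sit between the lock and unlock calls is shown below. The helper name copy_rows is hypothetical and not part of SDL; the SDL calls themselves are left out so the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `rows` rows of `row_bytes` bytes each from src to dst.
   Each buffer has its own pitch (bytes from the start of one row to the next),
   so any padding at the end of a row is skipped rather than copied. */
void copy_rows(unsigned char *dst, int dst_pitch,
               const unsigned char *src, int src_pitch,
               int row_bytes, int rows)
{
    assert(row_bytes <= dst_pitch && row_bytes <= src_pitch);
    for (int y = 0; y < rows; ++y) {
        memcpy(dst + (size_t)y * (size_t)dst_pitch,
               src + (size_t)y * (size_t)src_pitch,
               (size_t)row_bytes);
    }
}
```

In the application, dst and dst_pitch would come from SDL_LockTexture(), src would be the shared buffer, and row_bytes would be 1120 * 4 for a 32-bit format. If both pitches equal row_bytes, this does the same work as a single memcpy.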

Define “too much CPU and GPU power”

Remember that you’re uploading 4 (apparently large-ish) textures per frame to the GPU.
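To put a rough number on that, here is a back-of-the-envelope estimate. It assumes 32-bit pixels and the 1120x832 texture size from the posted code, and it assumes all four windows really update at the stated average of 68 Hz. The helper name upload_bytes_per_second is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes uploaded per second when `windows` textures of width x height pixels
   are each refreshed `hz` times per second. */
uint64_t upload_bytes_per_second(uint32_t width, uint32_t height,
                                 uint32_t bytes_per_pixel,
                                 uint32_t windows, uint32_t hz)
{
    assert(bytes_per_pixel > 0);
    uint64_t frame_bytes = (uint64_t)width * height * bytes_per_pixel;
    return frame_bytes * windows * hz;
}
```

With those inputs one frame is 1120 * 832 * 4 = 3,727,360 bytes, and four windows at 68 Hz come to 1,013,841,920 bytes per second, so roughly 1 GB/s of texture uploads.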

I still don’t know why you’re using SDL_PIXELFORMAT_UNKNOWN when creating the texture.

Have you tried using a mutex instead of a spinlock? I don’t know what your pattern of contention is, but mutexes are almost always the right answer and have much better behavior if you have any kind of contention.
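A minimal sketch of what the mutex version of the handshake could look like. It uses a POSIX pthread_mutex_t as a stand-in so it compiles without SDL; in the application an SDL_mutex with SDL_CreateMutex(), SDL_LockMutex() and SDL_UnlockMutex() would take its place. The names frame_publish and frame_take are hypothetical, and keeping the "updated" flag under the same lock instead of in a separate atomic is an assumption of this sketch, not something taken from the posted code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define FRAME_BYTES (1120 * 832 * 4)   /* one 32-bit frame, size from the post */

static unsigned char   frame[FRAME_BYTES];
static int             frame_updated;  /* guarded by frame_lock */
static pthread_mutex_t frame_lock = PTHREAD_MUTEX_INITIALIZER;

/* Secondary thread: copy a finished frame in and flag it as pending. */
void frame_publish(const unsigned char *src, size_t len)
{
    assert(len <= sizeof frame);
    pthread_mutex_lock(&frame_lock);
    memcpy(frame, src, len);
    frame_updated = 1;
    pthread_mutex_unlock(&frame_lock);
}

/* Main thread: copy the frame out if a new one is pending.
   Returns 1 if dst was filled, 0 if nothing new has arrived. */
int frame_take(unsigned char *dst, size_t len)
{
    int had_update;
    assert(len <= sizeof frame);
    pthread_mutex_lock(&frame_lock);
    had_update = frame_updated;
    if (had_update) {
        memcpy(dst, frame, len);
        frame_updated = 0;
    }
    pthread_mutex_unlock(&frame_lock);
    return had_update;
}
```

In the real main loop the memcpy in frame_take would be replaced by the SDL_UpdateTexture() call, so the lock is held for exactly the duration of the upload, as in the spinlock version. A waiting thread then sleeps instead of spinning.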

“Too much” in this case means way more than my previous implementation (every renderer in a separate thread with VSYNC). Compared to the previous implementation I get slightly higher GPU usage with only a single window and about three times the GPU usage with all 4 windows active. CPU usage behaves similarly. So the difference is massive. SDL_PIXELFORMAT_UNKNOWN seems to have no effect on efficiency.

My system has 8 CPU cores so I do not expect too much blocking of other threads through the spinlock. But I’ll try and report.

I just tried it and also compared it to my old variant again. My app’s name is Previous and I checked all involved processes with high CPU load. Here are the results of CPU core usage:

Threaded rendering:

Previous:     125 %
kernel_task:   32 %
WindowServer:  19 %

Main thread rendering with mutex:

Previous:     145 %
kernel_task:   63 %
WindowServer:  46 %

Main thread rendering with spinlock:

Previous:     146 %
kernel_task:   63 %
WindowServer:  47 %

As you can see, the results with mutex and spinlock are almost identical. The threaded variant uses less CPU power, especially for kernel_task and WindowServer.

GPU usage can be seen in the screenshots:

Threaded: [screenshot not included]

Main thread with mutex: [screenshot not included]

Main thread with spinlock: [screenshot not included]