Low-end GPU Performance


I’m experimenting with SDL2 on SBCs (single-board computers). I’m trying to figure out the performance limits and the best way to use SDL.

When copying large textures to the screen with SDL_RenderCopy, I’ve noticed that drawing starts to lag at a certain point while CPU usage stays low. I think the GPU is reaching its limit in this case. I want to run a test and see how well that matches the theoretical Mpix/s spec.

Using small textures (and many SDL_RenderCopy operations), CPU usage goes up quickly. It can easily max out a CPU core (at maybe around 1000-1200 calls). Texture size seems to have little effect on CPU usage.

There must be a way to improve the second scenario and stop using so much CPU. I would expect that most of the work could be done by the GPU.

Maybe I need to check the render target pixel format against the source textures. Could a mismatch add overhead on the CPU?

Suggestions? Ideas? Anything I should test?

Each draw call has a certain amount of CPU overhead. If you’re drawing lots of objects, look into using texture atlases (if you aren’t already). SDL will be able to batch consecutive calls to SDL_RenderCopy() or SDL_RenderCopyEx() that use the same source texture into one draw call, potentially saving a lot of CPU overhead.

Thanks. The source is a typical tile-set, so it’s drawing from one source texture but with different clipping regions.

Is there a way I can improve that process?

Would it be better to break the source image into lots of separate textures?

Why are you using clipping?
SDL batches draws unless you change the texture, color, or blend mode. As long as those stay the same, all requests to draw the same texture to different places on the screen are executed together.

I mean that, since I am using an “atlas” (i.e. a tile set), the calls to SDL_RenderCopy use different portions of the source texture.

I’m trying a test now with a single, small texture and running into a similar performance problem. I wonder whether the texture format matches the screen’s. Maybe that has an impact.

Can you provide more information about the OS and the renderer you’re using? I ask because the OpenGL renderer is very old and uses immediate mode. On Linux I prefer to use EGL with the OpenGL ES2 renderer.

I’m using Armbian (Debian) Linux with the LXQt desktop environment. The available renderers are OpenGL, OpenGLES2 and software. I didn’t select one explicitly, but it probably defaults to OpenGL.

Try setting this hint before calling SDL_CreateRenderer:

    SDL_SetHint(SDL_HINT_RENDER_DRIVER, "opengles2");

Then compare performance.

To confirm which renderer was actually created, add the following code:

    SDL_RendererInfo renderer_info;
    if (SDL_GetRendererInfo(renderer, &renderer_info) < 0) {
        printf("Could not get renderer info [ %s ]\n", SDL_GetError());
    } else {
        /* name is e.g. "opengl", "opengles2" or "software" */
        printf("renderer name = %s max_width = %d max_height = %d flags = %u\n",
            renderer_info.name, renderer_info.max_texture_width,
            renderer_info.max_texture_height, renderer_info.flags);
    }


However, I’m not sure that SDL_HINT_RENDER_BATCHING is having an effect. Is there actually batching in SDL2?

SDL has supported automatic batching of draws that use the same texture since version 2.0.10, provided that nothing that changes the render state (such as the texture, color modulation, or blend mode) is called between SDL_RenderCopy calls.


That isn’t a hard rule, though: if those functions are called but don’t actually change anything, automatic batching may still work. It also depends on the render backend.

Hi @rmg.nik and thanks for the reply.

I don’t think I’m using any of those calls. I wrote a simple test program that calls SDL_RenderCopy in a loop. Can SDL_RenderCopyEx calls be batched too (if that’s the only call I use)?

I think the render backend has to support batching as well.

I don’t know whether it makes any difference, but try not rendering directly to the default target. Instead, create a texture with the same dimensions as the screen (with SDL_TEXTUREACCESS_TARGET), make it the render target with SDL_SetRenderTarget, and RenderCopy everything onto it. Then, at the end of the frame, RenderCopy that texture to the default target.
Take that with a grain of salt, though, as it could well make performance worse.


I tried that and didn’t see any performance difference. That was actually my original approach: I render lots of small textures to a large one and then render that to the screen.

I did some benchmarking on another SBC with similar specs and got similar results. From what I can tell, the GPU starts to slow down at the expected point (in this case, above 450 Mpix/s), but I’m still having trouble with the CPU overhead of making lots of calls.

Well, then I guess you’ve either hit the SBC’s performance limits, or the abstraction the SDL renderer provides is costing performance. Even without knowing SDL’s internals, I doubt it’s the latter, as it’s known to be “thin”. I’m not an OpenGL “user”, but many people ditch the SDL renderer and do everything with raw OpenGL calls. Maybe you could squeeze out some more performance that way.
I don’t own an SBC, but maybe you could set up a minimal skeleton app so that people who own one can test it and provide feedback. IMO this would be the best way to determine whether you’ve actually hit the limits or some piece of code is at fault.


Yes, I’m intentionally trying with some very limited SBCs. I don’t expect them to be very powerful.

However, I tried the “SDL_gpu” library and saw a significant improvement: I was able to render 7x as many objects with similar CPU usage. Maybe SDL isn’t batching the draws properly. The issue might be limited to this platform, but it was interesting to see.


Can you post a link to your test? Ryan and I can look and see if there’s some CPU slowdown in that pattern.
