Low-end GPU Performance

hyphex · October 27, 2020, 3:22am

Hi,

I’m experimenting with SDL2 on SBCs (single board computers). I’m trying to figure out the performance limits and the best way to use SDL.

When copying large textures to the screen with SDL_RenderCopy, I’ve noticed that drawing begins to lag at a certain point while CPU usage remains low. I think the GPU is reaching its limit in this case. I want to run a test and see how well the result matches the theoretical Mpix/s spec.
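For comparing a measurement against the Mpix/s spec, a minimal sketch of the arithmetic might look like this. The function name `fill_rate_mpix` and the example numbers are hypothetical; it only counts destination pixels written per second and ignores texture sampling cost, blending, and the final present.

```c
#include <assert.h>

/* Estimated fill rate in megapixels per second for a test that draws
 * copies_per_frame rectangles of dst_w x dst_h pixels at fps frames/s.
 * Every copy is counted in full, so overdraw is included. */
double fill_rate_mpix(int dst_w, int dst_h, int copies_per_frame, double fps)
{
    assert(dst_w >= 0 && dst_h >= 0 && copies_per_frame >= 0 && fps >= 0.0);
    double pixels_per_frame =
        (double)dst_w * (double)dst_h * (double)copies_per_frame;
    return pixels_per_frame * fps / 1e6;
}
```

For example, eight full-screen 1280x720 copies per frame at 60 frames per second works out to roughly 442 Mpix/s.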

With small textures (and many SDL_RenderCopy operations), CPU usage goes up quickly. It can easily max out a CPU core (at maybe around 1000-1200 calls). The texture size seems to have little effect on the CPU usage.
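To put a rough number on the per-call overhead, one could divide the frame time by the number of calls at which the core saturates. This is only a sketch: the helper name `per_call_budget_us` is hypothetical, and the 60 frames per second used in the example is an assumption, since the actual refresh rate is not stated here.

```c
#include <assert.h>

/* Average CPU time available per draw call, in microseconds, if one core
 * is fully used by calls_per_frame calls at fps frames per second. */
double per_call_budget_us(double fps, int calls_per_frame)
{
    assert(fps > 0.0 && calls_per_frame > 0);
    double frame_us = 1e6 / fps;            /* length of one frame in microseconds */
    return frame_us / (double)calls_per_frame;
}
```

Under that 60 fps assumption, 1000 to 1200 calls per frame corresponds to roughly 14 to 17 microseconds of CPU time per call.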

There must be a way to improve the second scenario and stop using so much CPU. I would expect that most of the work could be done by the GPU.

Maybe I need to check the render target pixel format vs. the source textures. Could that have some overhead for the CPU?

Suggestions? Ideas? Anything I should test?

sjr · October 27, 2020, 4:53am

Each draw call has a certain amount of CPU overhead. If you’re drawing lots of objects, look into using texture atlases (if you aren’t already). SDL will be able to batch consecutive calls to SDL_RenderCopy() or SDL_RenderCopyEx() that use the same source texture into one draw call, potentially saving a lot of CPU overhead.

hyphex · October 27, 2020, 6:16am

Thanks. The source is a typical tile-set, so it’s drawing from one source texture but with different clipping regions.

Is there a way I can improve that process?

Would it be better to break the source image into lots of separate textures?

rmg.nik · October 27, 2020, 8:19am

Hi
Why are you using clipping?
SDL batches draws unless you change the texture, color, or blend mode. As long as those stay the same, all requests to draw the same texture at different places on the screen are submitted together.

hyphex · October 27, 2020, 8:22am

I mean that, since I am using an “atlas” (i.e. a tile set), the calls to SDL_RenderCopy use different portions of the source texture.
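As an illustration of "different portions of the source texture", here is a minimal sketch of computing the source rectangle for a tile in a grid-shaped tile set. `TileRect` is a hypothetical stand-in with the same four int fields as SDL_Rect so that the sketch compiles without SDL, and `tile_src_rect` is likewise a made-up helper name. It assumes tiles are packed row by row with no spacing or margin.

```c
#include <assert.h>

/* Stand-in for SDL_Rect (x, y, w, h as ints) so this compiles without SDL. */
typedef struct { int x, y, w, h; } TileRect;

/* Source rectangle of tile number tile_index in an atlas whose tiles are
 * laid out left to right, top to bottom, tiles_per_row tiles per row. */
TileRect tile_src_rect(int tile_index, int tiles_per_row, int tile_w, int tile_h)
{
    assert(tile_index >= 0 && tiles_per_row > 0 && tile_w > 0 && tile_h > 0);
    TileRect r;
    r.x = (tile_index % tiles_per_row) * tile_w;  /* column within the row */
    r.y = (tile_index / tiles_per_row) * tile_h;  /* row within the atlas  */
    r.w = tile_w;
    r.h = tile_h;
    return r;
}
```

In a real program the four values would be copied into an SDL_Rect and passed as the source rectangle argument of SDL_RenderCopy, with the atlas as the texture for every call.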

I’m trying a test now with a single, small texture and running into a similar problem with performance. I wonder if the texture format matches the screen. Maybe that has an impact.

rmg.nik · October 27, 2020, 8:36am

Can you provide more information about the OS and the renderer used? The OpenGL renderer is very old and uses immediate mode. On Linux I prefer to use EGL and the OpenGL ES 2 renderer.

hyphex · October 27, 2020, 9:41am

I’m using Armbian (Debian) Linux with LXQT desktop environment. For renderers, I have OpenGL, OpenGLES2 and software. I didn’t select one of them explicitly but it probably defaults to OpenGL.

rmg.nik · October 27, 2020, 10:02am

Try setting these hints before creating the renderer:

    SDL_SetHint(SDL_HINT_RENDER_DRIVER, "opengles2");
    SDL_SetHint(SDL_HINT_RENDER_BATCHING, "1");

and compare performance.

To check which renderer was created, add the following code:

    /* Query the renderer that was actually created and print its name and limits. */
    SDL_RendererInfo renderer_info;
    if (SDL_GetRendererInfo(renderer, &renderer_info) < 0)
    {
        printf("Could not get renderer info [ %s ]\n", SDL_GetError());
    }
    else
    {
        printf("renderer info: name = %s max_width = %d max_height = %d flags = %u\n",
            renderer_info.name, renderer_info.max_texture_width, renderer_info.max_texture_height, renderer_info.flags);
    }

hyphex · October 27, 2020, 10:39am

Thanks. I confirmed it is using OpenGLES2 with the flags I enabled (SDL_RENDERER_ACCELERATED | SDL_RENDERER_PRESENTVSYNC | SDL_RENDERER_TARGETTEXTURE).

However, I’m not sure that SDL_HINT_RENDER_BATCHING is having an effect. Is there actually batching in SDL2?

rmg.nik · October 27, 2020, 10:49am

SDL has supported automatic batching of draws that use the same texture since version 2.0.10, provided that none of the functions listed below is called between SDL_RenderCopy calls:

SDL_SetTextureAlphaMod 
SDL_SetTextureBlendMode
SDL_SetTextureColorMod
SDL_RenderSetClipRect
SDL_RenderCopyEx

But that’s not always the case. If one of these functions is called but nothing actually changes, automatic batching may still work. It also depends on the render backend.
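To reason about how many draw calls a frame might end up with, a simplified model can help. The sketch below is not SDL’s actual batching code; it is a hypothetical model (the `CopyState` struct and `count_batches` function are invented for illustration) that assumes consecutive copies can be merged only while the texture, color modulation, and blend mode all stay the same.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-copy render state used by this model. */
typedef struct {
    int texture_id;       /* which texture the copy reads from */
    unsigned color_mod;   /* packed RGBA color/alpha modulation */
    int blend_mode;       /* blend mode in effect for the copy  */
} CopyState;

/* Count how many batches a sequence of copies would form if a new batch
 * starts whenever any field differs from the previous copy. */
size_t count_batches(const CopyState *cmds, size_t n)
{
    assert(cmds != NULL || n == 0);
    if (n == 0) {
        return 0;
    }
    size_t batches = 1;
    for (size_t i = 1; i < n; i++) {
        if (cmds[i].texture_id != cmds[i - 1].texture_id ||
            cmds[i].color_mod  != cmds[i - 1].color_mod  ||
            cmds[i].blend_mode != cmds[i - 1].blend_mode) {
            batches++;  /* state changed, so the run is broken */
        }
    }
    return batches;
}
```

In this model, a thousand copies from one atlas with unchanged state count as a single batch, while alternating between two textures on every copy yields a thousand batches.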

hyphex · October 29, 2020, 4:00am

Hi @rmg.nik and thanks for the reply.

I don’t think I’m using any of those calls. I wrote a simple test program that loops a bunch of SDL_RenderCopy calls. Is SDL_RenderCopyEx unable to batch (if I only use that call)?

I think that the render backend should support batching.

immortalx · October 29, 2020, 4:33am

I don’t know if it will make any difference, but try not rendering directly to the default target. Instead, create a texture with the same dimensions as the screen, set it as the render target, and RenderCopy everything there. Then at the end of the frame, RenderCopy this texture to the default target.
Take that with a grain of salt though, as it could well make the performance worse.

hyphex · October 29, 2020, 8:46am

@immortalx

I tried that and I don’t think I see any performance difference. That was actually my original approach. I render lots of small textures to a large one and then render that to the screen.

I did some benchmarking on another SBC with similar specs and results. From what I can tell, the GPU starts to slow down at the expected point - in this case, Mpix/s > 450 - but I’m still having trouble with the CPU overhead for lots of calls.

immortalx · October 29, 2020, 9:24am

Well then I guess you either hit the SBC’s performance limits, or the abstraction that the SDL renderer provides is affecting performance. Even without knowing the internals of SDL, I doubt it’s the second as it is known to be “thin”. I’m not an OpenGL “user” but many people ditch the SDL renderer and do everything with raw OpenGL calls. Maybe you could squeeze some more juice this way.
I don’t own an SBC, but maybe you could set up a minimal skeleton app so that people who own one can test it and provide feedback. IMO this would be the best way to determine whether you have actually hit the limits or some piece of code is at fault.

hyphex · October 29, 2020, 10:19am

@immortalx

Yes, I’m intentionally trying with some very limited SBCs. I don’t expect them to be very powerful.

However, I tried with the “SDL_gpu” library and there was a significant improvement. I was able to render 7x the number of objects with similar CPU usage. Maybe SDL is not properly batching the draws. The issue might be limited to this platform but it was interesting to see.

slouken · November 2, 2020, 11:37pm

Can you post a link to your test? Ryan and I can look and see if there’s some CPU slowdown in that pattern.