Renderer vs SDL_gpu for text rendering?

hjalfi · June 26, 2021, 4:23pm

I’ve got a word processor (http://cowlark.com/wordgrinder), and I’ve got fed up with having to write multiple pieces of rendering code for X11, Windows, OSX etc and want to use SDL instead.

Doing the driver was remarkably straightforward. I’m aware that SDL_ttf doesn’t cache glyphs so I’m using grimfang4’s SDL_fontcache layer instead. It actually works pretty well.

Unfortunately, while it works well on a desktop PC, on a low-end device such as a Raspberry Pi with Mali chipset it’s simply too slow — with a full-screen window it can’t keep up when scrolling. The old Xlib code could. So, I need to speed it up somehow. I’ve got a surprising number of Raspberry Pi users (and my preferred writing laptop is an ARM Mali device too).

The current rendering code is pretty naive; whenever the screen needs to be updated, I clear and redraw every character on the screen. I gather this is the preferred way to do it these days. See wordgrinder/dpy.c at sdl · davidgiven/wordgrinder · GitHub. For an average full-screen window that’s going to be, say, 150x50 characters = 7500 individual calls to SDL_RenderCopyEx(). Is that a lot? I really can’t tell any more…

What can I do? Is it worth switching to something like SDL_gpu (or pure OpenGL, but I’d rather use something which abstracts away between GL and GLES)? My impression was that the SDL renderer is supposed to be natively hardware accelerated, but I’ve seen references here that SDL_gpu is way faster. Is there any way to verify that hardware acceleration is actually being used?

rtrussell · June 26, 2021, 9:16pm

This probably doesn’t help you immediately, but the version of SDL available from the RPi repository is 2.0.9, which is the last version not to support ‘render batching’. When 2.0.10 or later becomes available for the Raspberry Pi (you could build it now from source, but that’s a pain) there’s every likelihood that it will significantly speed up what you are doing.

I’ve no personal experience of SDL_gpu but I don’t think it should be “way faster” for 2D rendering, especially with SDL’s render batching enabled.

Which model of Raspberry Pi are you using? If it’s an RPi 4 then hardware acceleration should be enabled so long as you’ve specified SDL_RENDERER_ACCELERATED in your SDL_CreateRenderer() call. But if it’s an RPi 3 then it may be using mesa (software emulated OpenGL) by default, and enabling the ‘experimental’ GL driver could make a big difference:

 sudo raspi-config
 Advanced Options... GL Driver... GL (Full KMS)... Ok... Finish

hjalfi · June 26, 2021, 9:29pm

Actually, I’m testing now on an ARM laptop with the same Mali chipset (this one’s a T760). It’s running Debian, but it’s got SDL 2.0.14, so I’d expect it to have batching. glxinfo -B tells me I’m using the panfrost driver, and I am using SDL_RENDERER_ACCELERATED, so it doesn’t sound like there are any quick fixes.

What goes into a batch? My render code is using a mixture of SDL_RenderFillRect() (to clear the backgrounds), SDL_RenderDrawLine() (to draw line characters which the font doesn’t have) and SDL_RenderCopyEx() (to actually draw the characters). Can these not be batched together? It occurs to me it might be worthwhile drawing the screen in three passes: first the backgrounds, then the lines, then the text…

hjalfi · June 26, 2021, 9:42pm

…I tried splitting the backgrounds from the glyphs, leaving the line drawing as part of the glyph pass, and it’s much faster.

So that’s obviously the trick to it. Where can I find out about optimal batching strategies?
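As a rough illustration of why splitting the passes helps, here is a toy model in plain C. It assumes the cost comes from switching between kinds of operation, and it only counts those switches; it does not call SDL. The names OpKind, emit_interleaved, emit_two_pass and count_kind_switches are hypothetical and are not taken from wordgrinder or SDL.

```c
#include <stddef.h>

/* Hypothetical operation kinds standing in for SDL_RenderFillRect()
   and SDL_RenderCopyEx(); this model never calls SDL. */
typedef enum { OP_FILL_RECT, OP_COPY_GLYPH } OpKind;

/* Count how often two consecutive operations differ in kind. */
static size_t count_kind_switches(const OpKind *ops, size_t n)
{
    size_t switches = 0;
    for (size_t i = 1; i < n; i++)
        if (ops[i] != ops[i - 1])
            switches++;
    return switches;
}

/* Naive order: for each cell, background first, then its glyph.
   Writes 2 * cells entries into ops and returns that count. */
static size_t emit_interleaved(OpKind *ops, size_t cells)
{
    size_t n = 0;
    for (size_t c = 0; c < cells; c++) {
        ops[n++] = OP_FILL_RECT;
        ops[n++] = OP_COPY_GLYPH;
    }
    return n;
}

/* Two passes: every background first, then every glyph.
   Writes 2 * cells entries into ops and returns that count. */
static size_t emit_two_pass(OpKind *ops, size_t cells)
{
    size_t n = 0;
    for (size_t c = 0; c < cells; c++)
        ops[n++] = OP_FILL_RECT;
    for (size_t c = 0; c < cells; c++)
        ops[n++] = OP_COPY_GLYPH;
    return n;
}
```

For the 7500-cell screen mentioned earlier, the interleaved order switches kind between every pair of operations, while the two-pass order switches once. The model ignores background colour changes, which may also count as state changes.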

sjr · June 27, 2021, 6:06am

Any time you change state or change what kind of thing is being drawn, SDL can no longer batch it into one draw call and has to submit whatever has been queued up for rendering and then start a new batch. That includes using a different texture with RenderCopy(), changing the draw color (maybe), drawing lines, drawing rects, etc.

rtrussell · June 27, 2021, 9:21am

Just to clarify, I presume what you mean is “different destination texture”. It must be able to batch multiple source textures because that’s often what’s needed for rendering text glyphs, sprites etc.

JonnyD · June 27, 2021, 2:46pm

Simple batching in OpenGL (though I can’t confirm what SDL is doing) does not batch multiple source textures. The source texture is a separate part of the driver state in OpenGL and so multiple cannot be sent in a single draw call to execute the batch. For fonts and sprites, this is why we try to use packed font texture and sprite atlases. The source texture can remain the same through multiple sprites and just the source rect is specified differently per sprite, no state change required.
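To make the "same texture, different source rect" idea concrete, here is a minimal sketch. It assumes the atlas is a fixed grid of equally sized cells in row-major order; real font caches may pack glyphs differently. Rect and atlas_cell_rect are hypothetical names. Rect only mirrors the x/y/w/h layout of SDL_Rect so the example compiles without SDL.

```c
/* Hypothetical rectangle type with the same fields as SDL_Rect. */
typedef struct { int x, y, w, h; } Rect;

/* Source rectangle of cell `index` in an atlas laid out as a grid with
   `cols` cells per row, each cell_w x cell_h pixels, in row-major order. */
static Rect atlas_cell_rect(int index, int cols, int cell_w, int cell_h)
{
    Rect r;
    r.x = (index % cols) * cell_w;  /* column within the row */
    r.y = (index / cols) * cell_h;  /* row from the top */
    r.w = cell_w;
    r.h = cell_h;
    return r;
}
```

Every glyph drawn this way reads from the one atlas texture; only the rectangle passed as the source changes from call to call.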

Changing the texture to none or changing the shader (e.g. rendering lines, then sprites, then lines) requires state changes. To optimize batching this for OpenGL, you’d probably want to unify the shader for both and skip texture changes when possible or try to use texture arrays.

hjalfi · June 27, 2021, 3:55pm

Luckily SDL_fontcache uses texture atlases. It uses textures 12 glyphs wide by 12 high, so there’s room for 144 glyphs per texture; but each of the four fonts I’m using (normal, italic, bold, bold-italic) is cached separately, so switching between them will require state changes. It sounds like I most likely want to draw the text in four passes, one for each font variant.

Text in different colours is done using SDL_SetTextureColorMod() to change the colour modulation on the source texture (the atlas). I’m guessing that’s also a state change…
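One way to get per-font (and per-colour) passes without writing four separate loops is to collect the glyph draws for a frame and sort them by the state they need. The sketch below assumes that glyphs in different cells do not overlap, so reordering them does not change the picture, and that font variant and colour are the only state that differs between glyph draws. GlyphDraw, sort_by_state and count_state_changes are hypothetical names; nothing here calls SDL or SDL_fontcache.

```c
#include <stdlib.h>

/* One pending glyph draw. All fields are illustrative. */
typedef struct {
    int font;        /* 0..3: normal, italic, bold, bold-italic */
    unsigned color;  /* packed RGB used for colour modulation */
    int cell;        /* index of the screen cell being drawn */
} GlyphDraw;

/* Order by font, then colour, then cell, so that draws needing the
   same state end up next to each other. */
static int compare_by_state(const void *a, const void *b)
{
    const GlyphDraw *ga = a;
    const GlyphDraw *gb = b;
    if (ga->font != gb->font)
        return ga->font < gb->font ? -1 : 1;
    if (ga->color != gb->color)
        return ga->color < gb->color ? -1 : 1;
    return (ga->cell > gb->cell) - (ga->cell < gb->cell);
}

static void sort_by_state(GlyphDraw *draws, size_t n)
{
    qsort(draws, n, sizeof draws[0], compare_by_state);
}

/* Count how often font or colour differs between consecutive draws. */
static size_t count_state_changes(const GlyphDraw *draws, size_t n)
{
    size_t changes = 0;
    for (size_t i = 1; i < n; i++)
        if (draws[i].font != draws[i - 1].font ||
            draws[i].color != draws[i - 1].color)
            changes++;
    return changes;
}
```

After sorting, a line that alternates between normal and bold text needs one font switch instead of one per word. Whether the sort pays for itself on a given device would need measuring.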

I’m kinda tempted to switch to pure OpenGL, for maximum overkill for an app which started life as a terminal-based word processor, but OpenGL/OpenGLES compatibility is a bind.

sjr · June 28, 2021, 3:52am

No, I mean source textures. It’s a limitation imposed by the underlying graphics APIs, and is why using texture atlases is important.

The only ways for SDL to use multiple source textures in one draw call (which is what batching does; try to cram as much into one draw call as possible to minimize draw call overhead) would be to either use an array texture, which isn’t available everywhere and has the pesky limitation of all textures inside having to be the same size and pixel format, or use bindless texture arrays, which is really only possible on newer OpenGL versions (and has limitations in Metal). In either case it would then have to somehow pass the index of each texture along with the texture vertices.

rtrussell · June 29, 2021, 1:49am

I bow to your superior knowledge, but it’s not what icculus said in this post. There the question was asked:

What I have heard is that the flushing happens when you render a new texture, so if you render 2 sprites multiple times it is way faster to render A A B B rather than A B A B, since it only has to flush 2 times rather than 4, is this correct?

to which the reply was:

This will not flush at all in 2.0.10, assuming those SDL_Textures haven’t changed in some way between SDL render calls.

He does go on to say:

That being said, when flushing does happen, it benefits from using the same texture twice in a row, as SDL is now smart enough to only bind the texture once (in GL, Direct3D, etc) and draw from it multiple times, so you’ll see performance increases in this scenario beyond the higher-level “flushing” that SDL does.

So I wonder if we are talking about two different kinds of batching. I was referring to SDL’s internal batching, first enabled in SDL 2.0.10, which is independent of the backend and seemingly can batch multiple source textures. Whereas you may be talking about a lower-level batching within OpenGL.

sjr · June 29, 2021, 7:32am

I was talking about SDL’s batching. It doesn’t batch different source textures into one draw call, at least as far as I know, given that it has to be implemented on top of GPU APIs that don’t support doing that*

I’m assuming @icculus was talking about using texture atlases.

The last paragraph you’re quoting seems to be talking about how, even without batching (turned off or just not present), SDL is smart enough that if you call SDL_RenderCopy() twice with the same source texture, it won’t waste time (and incur a state change) by binding the same texture again for the second RenderCopy call (it’ll still do two draw calls though).

*with the exception of something like instanced rendering with array textures or bindless texture arrays

rtrussell · June 29, 2021, 8:55am

That’s not how I read his post: he referred to SDL_Textures in the plural. We should perhaps wait for him to comment.

icculus · June 29, 2021, 7:44pm

Yes, it won’t bind the same texture twice, because now it caches this state and knows not to bind it again, which wasn’t true in earlier versions of SDL.

When I talked about the “SDL_Textures changing,” I meant that if you try to update a texture and there are batched draws waiting that need the current contents of the texture, it will force a flush so you get correct rendering before the texture contents change. Me using a plural here was just a stylistic choice…or an oversight, whichever applies.

But @sjr is right, at the current time, this will still be two separate draw calls from the same texture; the render API isn’t (currently) smart enough to notice that a string of SDL_RenderCopy() calls all use the same texture and collect them all into a single draw.

In theory, this can be done with some reworking of SDL’s internals and without a new API, but no one has tried to implement it yet. My attitude is that the worst fire of the render API’s performance is put out by the batching code, but that doesn’t mean there isn’t still a lot of low-hanging fruit out there.
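The behaviour described in this post can be sketched as a toy model: copies are queued, a flush issues one draw call per queued copy but binds a texture only when it differs from the one already bound, and updating a texture forces a flush if queued draws still refer to it. This is not SDL's implementation. ToyRenderer and the toy_* functions are hypothetical, and textures are represented by plain integer ids.

```c
#include <stddef.h>

#define MAX_PENDING 64

/* Toy renderer: a queue of pending copies plus a cached bound texture. */
typedef struct {
    int pending[MAX_PENDING];  /* texture id of each queued copy */
    size_t pending_count;
    int bound_texture;         /* -1 means nothing is bound yet */
    size_t flushes;            /* non-empty flushes performed */
    size_t bind_calls;         /* texture binds actually issued */
    size_t draw_calls;         /* draw calls actually issued */
} ToyRenderer;

static void toy_init(ToyRenderer *r)
{
    r->pending_count = 0;
    r->bound_texture = -1;
    r->flushes = 0;
    r->bind_calls = 0;
    r->draw_calls = 0;
}

/* Submit everything queued so far: one draw call per queued copy, but
   a bind only when the texture differs from the one already bound. */
static void toy_flush(ToyRenderer *r)
{
    if (r->pending_count == 0)
        return;
    for (size_t i = 0; i < r->pending_count; i++) {
        if (r->pending[i] != r->bound_texture) {
            r->bound_texture = r->pending[i];
            r->bind_calls++;
        }
        r->draw_calls++;
    }
    r->pending_count = 0;
    r->flushes++;
}

/* Queue a copy from `texture`, flushing first if the queue is full. */
static void toy_copy(ToyRenderer *r, int texture)
{
    if (r->pending_count == MAX_PENDING)
        toy_flush(r);
    r->pending[r->pending_count++] = texture;
}

/* Changing a texture's contents forces a flush when queued copies
   still need the old contents; otherwise nothing is submitted. */
static void toy_update_texture(ToyRenderer *r, int texture)
{
    for (size_t i = 0; i < r->pending_count; i++) {
        if (r->pending[i] == texture) {
            toy_flush(r);
            return;
        }
    }
}
```

In this model, queuing A A B B and flushing gives 2 binds and 4 draw calls, while A B A B gives 4 binds and 4 draw calls. Collecting same-texture copies into a single draw is the step the post says has not been implemented yet, so the model does not do it either.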

rtrussell · June 29, 2021, 9:04pm

Thanks for the explanation. I’m still not entirely clear, though, whether SDL’s batching offers any benefit if you are doing many SDL_RenderCopy() calls with different source textures, or whether it doesn’t help at all.

In my app I use SDL2_gfx to draw text, and it creates (and caches) a separate texture for every glyph; it doesn’t use a font atlas. I explicitly enable batching (in my case it would be disabled by default because of hints) thinking it would help, but perhaps it doesn’t.

icculus · July 1, 2021, 2:30pm

“it creates (and caches) a separate texture for every glyph”

In terms of performance, this is often considered a bad idea regardless of the API used to render it, fwiw.

In this case, though, SDL’s batching still gets you some wins, as there is a lot of other state we can cache, but moving to a texture atlas will be a speedup in any case, as there are less texture binds, and maybe some day significantly less draw calls.

(but for a small amount of text, maybe this isn’t a big deal in practice.)

rtrussell · July 1, 2021, 4:34pm

SDL2_gfx has been around for several years so it’s evidently not been considered something which needs to be addressed urgently. Certainly in my application the text rendering performance has never been an issue, even on relatively slow platforms like the Raspberry Pi.

Thanks for the confirmation.

sjr · July 2, 2021, 2:49pm

“but moving to a texture atlas will be a speedup in any case, as there are less texture binds, and maybe some day significantly less draw calls.”

Wait, so does SDL’s batching not combine multiple calls to RenderCopy() with the same source texture into one draw call?

icculus · July 8, 2021, 3:08pm

“Wait, so does SDL’s batching not combine multiple calls to RenderCopy() with the same source texture into one draw call?”

Not at the moment, but that’s planned future development (it won’t require any API changes, just improvements to the rendering backends).

sjr · July 9, 2021, 8:03pm

I thought that was the whole point of the batching system, to reduce draw call overhead. Huh.

rtrussell · July 10, 2021, 12:06am

In this article Ryan describes in some detail how the batching works. It says “Everything that looks like a rendering operation (not just draws but setting the viewport, the cliprect, etc) goes into a linked list” (which AIUI is stored in a Vertex Buffer Object).