Performance using Wayland window

MarcelHB · August 14, 2024, 9:34pm

Hello everyone,

There is something that keeps my mind a little busy: I’m running a Raspberry Pi 400 with the newest RPi OS, which is configured to use Wayland, and SDL2 v2.26.5.

I have some project that isn’t exactly challenging visually, mostly blitting sprites, but somewhat accelerated using OpenGL, making use of fragment shaders for some pixel ops.

My display is full HD and my experience with Wayland isn’t very deep. The project is configured to use a 1024x768 window since it renders old games that, by design, scale poorly. Some interesting observations:

Case 1: I start a WM (sway) and launch the game. The performance is bad, at max 30 FPS, and mouse lags extremely to an unplayable extent.
Case 2: I start the project from the initial shell, no problem with Wayland. Nice, performance is now back at 60 FPS at fullscreen but is stretched to full HD, looking bad.
Case 3: I modify the source code to open a window at full HD but keep the drawing at 1024x768. Now, also from the shell, it’s correctly scaled to the maximum possible extent (with black bars on the left and right), but the performance is bad again (30 FPS), although at least the mouse doesn’t lag.
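
As a rough illustration of the scaling in case 3, the destination rectangle for a 1024x768 image on a full HD window can be computed as below. This is only a sketch: `Rect` and `letterbox` are hypothetical names, and `Rect` is a plain stand-in for something like SDL_Rect so the snippet has no SDL dependency.

```cpp
#include <cassert>

// Plain rectangle; a stand-in for SDL_Rect (assumption, to keep the sketch SDL-free).
struct Rect {
    int x, y, w, h;
};

// Largest centred rectangle with the source aspect ratio that fits in the window.
Rect letterbox(int src_w, int src_h, int win_w, int win_h) {
    assert(src_w > 0 && src_h > 0 && win_w > 0 && win_h > 0);
    // Compare aspect ratios by cross-multiplication to stay in integer arithmetic.
    const long long lhs = static_cast<long long>(win_w) * src_h;
    const long long rhs = static_cast<long long>(win_h) * src_w;
    if (lhs > rhs) {
        // Window is wider than the source: black bars on the left and right.
        const int w = static_cast<int>(rhs / src_h);
        return {(win_w - w) / 2, 0, w, win_h};
    }
    // Window is taller than (or matches) the source: bars on top and bottom.
    const int h = static_cast<int>(lhs / src_w);
    return {0, (win_h - h) / 2, win_w, h};
}
```

For 1024x768 on 1920x1080 this gives a 1440x1080 area starting at x = 240, so the two side bars together cover a quarter of the window.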

OK, it’s probably little surprise that a window requested at full HD will lead to more memory demand and bigger regions to update. Yet I was wondering whether we can get something good out of this, i.e. a correct display ratio and better FPS using SDL2. My knowledge of the Wayland API is currently non-existent, and I also suspect other flaws, e.g. when running sway, which I’m willing to ignore here:

Could the performance bottleneck maybe be resolved through some kind of scissoring on the window’s surface, esp. for Wayland? And if so, is there a way to tell SDL to do this?
What’s the actual matter with scaling here? I guess that an up-scaled version as seen in case 3 does impact the extent of copying framebuffer/surface data at a cost, but case 2 covered all pixels of the screen at no penalty, so something can do something like this at little cost? Wayland, the driver, …?
Maybe I am looking at the completely wrong end right now and there is a straightforward solution to this? Or something is bad around Wayland and the RPi, and SDL2 is out of the picture? I know the RPi 4 performance isn’t amazing, but then at least I’d be interested in the underlying problem.

Thanks

MarcelHB · August 15, 2024, 5:32pm

More experiments that leave me even more baffled:

sway: fullscreen is bad (30 FPS). Increasing the number of tiles, i.e. shrinking the visible surface, raises the FPS up to 60. Nice, except you can’t read anything anymore.
weston: fullscreen is better (45-50 FPS), whereas a windowed window is bad (30 FPS).

So probably there are moving parts in play we cannot address by SDL.

I only know that we somehow can render 1024x768 screen buffers at acceptable FPS on RPi but no idea how to get there portably …

MarcelHB · August 15, 2024, 6:05pm

At least I found something: When running an SDL2 application from tty, SDL_RENDERER_PRESENTVSYNC on RPi/EGL/Wayland? apparently locks the FPS to 30. That’s at least something I can work around a bit by disabling VSync in the settings, and then we are at 45FPS+ again.

rtrussell · August 16, 2024, 9:07am

That wouldn’t be surprising if your rendering (main loop) is taking more than 1/60 second, because in that case SDL_RENDERER_PRESENTVSYNC will inevitably result in alternate frames being ‘skipped’.

In fact most of the symptoms you describe could be explained by your main loop only just managing to complete in 1/60 second. Then only a small performance hit, which wouldn’t normally be noticeable, will result in a halving of the frame rate!

If that is the cause I wouldn’t advise disabling VSync, because although the average frame rate will increase it will be at a cost of motion artefacts (juddery movement of sprites etc.).
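
The halving described above can be put into numbers with a simplified model: with VSync on, every frame occupies a whole number of refresh intervals. The sketch below assumes plain double buffering with no triple buffering or adaptive sync, and `vsync_fps` is a hypothetical helper, not an SDL function.

```cpp
#include <cassert>
#include <cmath>

// Frame rate seen with VSync on, assuming each frame is held until the
// next refresh boundary after the main loop finishes (simplified model).
double vsync_fps(double frame_ms, double refresh_hz) {
    assert(frame_ms > 0.0 && refresh_hz > 0.0);
    const double interval_ms = 1000.0 / refresh_hz;
    // Number of whole refresh intervals one frame ends up occupying.
    const double intervals = std::ceil(frame_ms / interval_ms);
    return refresh_hz / intervals;
}
```

On a 60 Hz display, a 16 ms main loop still yields 60 FPS in this model, while 17 ms or 20 ms both drop to 30 FPS.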

MarcelHB · August 16, 2024, 1:48pm

Thanks, yeah sounds like we have to look for performance improvements here.

Yet ideas or comments regarding the window size concerns are still welcome.

anon914446 · August 16, 2024, 2:28pm

From this description of the program I would have expected something around 200 to 400 FPS once the VSYNC was turned off.
The Pi 4 is a pretty capable machine with a 1.8 GHz quad-core CPU.

I’m curious what your RAM usage is, perhaps your program is putting you into swap memory and that’s why the frame rate is so slow? The task manager should be lxtask which you can launch from terminal if it’s not in your menu. You might also check CPU usage while you’re there.

If you are not maxing out your RAM, you might also try increasing your VRAM (the GPU’s video RAM) using the raspi-config program. Go with a sane value (maybe double the default to start with); this will only improve performance if the GPU actually needs the extra storage space.

MarcelHB · August 16, 2024, 3:30pm

RAM is totally fine; CPU usage looks a little higher than I’d expect and I will check this. Yet the Passmark values are in line with what others have reported, so I guess I can rule out anything related to the hardware, like throttling due to a bad power supply.

Anyway, the whole system feels way slower since RPi OS switched from X11 to Wayland: the mouse lags in WMs and Firefox has short freezes when scrolling down a page.

I’ve tried setting VRAM to 128M, although raspi-config no longer supports this officially on this model, but this doesn’t make much of a difference. On some older OS, I recall this being helpful though.

anon914446 · August 16, 2024, 4:31pm

When you say you’re doing pixel ops, what does that mean exactly? Are you making a single call to modify a cluster of pixels, or is it one call per pixel?

Sending and retrieving between the CPU and the GPU is a pretty hefty bottleneck, so it would be useful if you could reduce the number of calls it takes to make those changes.

You might also reduce the number of times those functions are called by creating a sentry variable that only runs those functions when the image needs to be updated.
Or even better, if you could run any of those GPU calls before the game loop.
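
A minimal sketch of such a sentry (dirty flag), assuming the expensive work can be wrapped in a callback. `CachedLayer` is a hypothetical name for illustration; in an SDL program the callback would typically re-render into a cached texture.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Re-runs an expensive redraw only when something marked the layer as changed.
class CachedLayer {
public:
    explicit CachedLayer(std::function<void()> redraw) : redraw_(std::move(redraw)) {
        assert(static_cast<bool>(redraw_));  // a callable is required
    }

    // Call whenever the underlying image needs to be rebuilt.
    void invalidate() { dirty_ = true; }

    // Call once per frame; returns true if the redraw actually ran.
    bool update() {
        if (!dirty_) {
            return false;
        }
        redraw_();
        dirty_ = false;
        return true;
    }

private:
    std::function<void()> redraw_;
    bool dirty_ = true;  // the first frame always draws
};
```

The game loop calls update() every frame and invalidate() only from the code paths that change the image, so unchanged frames skip the GPU calls entirely.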

MarcelHB · August 16, 2024, 7:49pm

When you say you’re doing pixel ops, what does that mean exactly? Are you making a single call to modify a cluster of pixels, or is it one call per pixel?

That’s straightforward. For every draw call of a sprite:

SDL_RenderFlush (since we will do custom gl...)
we set some buffer render target
we request our custom shader pair, set some uniforms, sometimes we also bind an additional texture,
then SDL_RenderCopy will do some multi-texture ops, tinting, gamma, per-pixel masking, nothing special, we also didn’t know how to do some of this with plain SDL_* calls.

Before SDL_RenderPresent, everything is blitted onto the window target by more SDL_RenderCopy.

We are also careful with updating textures. In what I’m looking at right now, there are no new uploads or updates of textures for the ongoing scene. We have only one shader and the vertices issued by SDL should be very limited (a handful per call, I guess). At least I’m not aware of a bandwidth problem from RAM to GPU under these circumstances. Also, on mid-range desktop computers, we can get thousands of FPS in such a scene.

So we have: about 70 draw calls (a bunch of textures + some very little text rendered as such) and a frame time of 20ms/50 FPS. 4ms go to our display drawing code of the main menu, 16ms go to waiting for SDL_RenderPresent, with VSync disabled (that’s the gap, I just didn’t take a screenshot with it).
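
A split like the 4 ms / 16 ms above can be gathered by wrapping each section of the loop in a small timer. This is a sketch using std::chrono; `SectionTimer` is a hypothetical helper and not part of SDL or the test program. It only measures CPU-side time spent inside the wrapped call, so GPU work that the driver defers will show up wherever the driver eventually blocks.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Accumulates wall-clock time spent in one section of the frame loop.
struct SectionTimer {
    using clock = std::chrono::steady_clock;

    double total_ms = 0.0;
    int samples = 0;

    // Runs the section and adds its duration to the running total.
    template <typename F>
    void measure(F&& section) {
        const auto start = clock::now();
        std::forward<F>(section)();
        const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
        total_ms += elapsed.count();
        ++samples;
    }

    // Mean duration per measured call, in milliseconds.
    double average_ms() const {
        assert(samples > 0);
        return total_ms / samples;
    }
};
```

One timer per section (for example one around the drawing code and one around the present call) gives the per-frame averages after a few hundred frames.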

MarcelHB · August 17, 2024, 11:18am

By the way, I’d be grateful for ideas on how to get more insight into this on the RPi, other than poking around with calls and inputs and measuring the impact.

Tracy’s OpenGL status collection relies on OpenGL v3.2 and we only have OpenGL v3.1 here.

MarcelHB · August 18, 2024, 3:41pm

From this description of the program I would have expected something around 200 to 400 FPS once the VSYNC was turned off.
The Pi 4 is a pretty capable machine with a 1.8 GHz quad-core CPU.

Just as a quick test in the meantime, still in the TTY environment: a very primitive loop of SDL_RenderClear and SDL_RenderPresent at full HD gives me about 100 FPS (with -O3 and opengl/opengles), and the software renderer is already at 50 FPS.

On 1024x768 (window and buffer), the accelerated path is around 300FPS, and 210FPS when also blitting a bunch of rectangles (85FPS at full HD with draw buffer at 1024x768).

So far, the significant factor still looks like the number of pixels on the screen to refresh (looks somewhat linear), not the ones being drawn by me, even if the buffer/target region is way smaller than the screen/window.
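
As a rough check of the "somewhat linear" impression, the sketch below scales a measured frame rate by the ratio of presented pixels. It assumes frame time is purely proportional to the window's pixel count, which is a simplification, and `scale_fps_by_pixels` is a hypothetical helper.

```cpp
#include <cassert>

// Predicted FPS at a new window size if frame time scaled linearly with
// the number of pixels presented (simplifying assumption).
double scale_fps_by_pixels(double fps, int from_w, int from_h, int to_w, int to_h) {
    assert(fps > 0.0 && from_w > 0 && from_h > 0 && to_w > 0 && to_h > 0);
    const double from_pixels = static_cast<double>(from_w) * from_h;
    const double to_pixels = static_cast<double>(to_w) * to_h;
    return fps * from_pixels / to_pixels;
}
```

Going from 1024x768 to 1920x1080 multiplies the pixel count by roughly 2.64. Under this model the 300 FPS clear-and-present loop would land at about 114 FPS (measured: about 100), and the 210 FPS case with rectangles at about 80 FPS (measured: 85), which is at least in the same ballpark.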

Let’s say I’d specify SDL_RenderSetLogicalSize: Shouldn’t we then know how to subtract regions of the window that don’t need render attention? From what I read, on Wayland a window and a surface are coupled by dimensions, and then one can declare opaque regions over that surface to reduce the amount of drawing, e. g. the black bars to the left and right, if logical resolution < screen/window resolution.

anon914446 · August 18, 2024, 4:06pm

We need some kind of access to the code to do any real testing. Otherwise it’s all guess and check which is very inefficient.

MarcelHB · August 18, 2024, 4:17pm

Sure, here you go: sdl-rpi-performance-test/main.cpp at main · MarcelHB/sdl-rpi-performance-test · GitHub

To actually reproduce this fully, use an RPi 4/400 with the latest Raspbian OS & Wayland, compile this and start it without a WM, but with GALLIUM_HUD=fps set.

By the way, that’s not my initial motivation/problem, which is working on GemRB. This source is a minimal test example that roughly represents the setup and reproduces the issue.

anon914446 · August 20, 2024, 5:12am

My apologies, I guess I had some blurry nostalgia glasses on about the raspberry pi (Back in the day, a Pi 2B+ got me through two semesters of college). I tinkered a lot with your code, but none of the optimizations I tried for my x86_64 laptop translated over for the Pi in any meaningful sense.
I’m not getting much better results with some minimal code that I wrote either.

You are correct: the key limiting factor looks to be the size of the rendering area. This does explain why many of the games available on the Pi are set up with small windows or lower resolutions by default.
Another option could be a heavily event-driven environment, where the screen only needs to update when the user does something like click a button.

MarcelHB · August 20, 2024, 4:48pm

Thanks for validating this. Well, it appears we have to live with it for now.

We are already quite good at reducing drawing in general, but I was wondering how much potential is left.

I’ll probably have another look at Wayland/EGL in terms of what can be done by cutting off surface regions for drawing.