How can I reduce SDL_DYNAPI_entry calls in my program?

Hi Folks,

As an academic exercise, I am building a simple 3D engine to better appreciate how games in the 90’s (before everyone had OpenGL) had to go about doing perspective projection, triangle rasterization, and texture mapping. I used SDL_RENDERER_SOFTWARE to render the 3D scenes, pushing most of the 3D computational load onto the CPU and using the GPU for back buffering, pixel blitting, and texture storage. For the most part the software 3D rendering pipeline is working, but it’s pretty slow. Something caught my eye when profiling my debug build that I wanted to ask this group about. If you look at the flamegraph of my pipeline below, you’ll see that a fair number of profile samples are spent in SDL_DYNAPI_entry outside the main() stack, and I can’t tell what they are doing.

Given the way I want to use SDL for this project, is this normal? Is there any way I can reduce the time spent in SDL_DYNAPI_entry? I found this post about SDL_DYNAPI and tried to read up on it, enough that I think I understand its purpose is to assist with overriding statically linked older versions of SDL so that newer versions can be loaded at runtime, but I don’t understand much more than that, and I don’t know what it is doing that requires so many samples at runtime when profiling. As far as I know, I am NOT statically linking anything with clang++ on Linux (where I develop and debug) at my linker stage.

Here is a link to my project source code if anyone would like to take a look at the source in context with the flamegraph: GitHub - DaveGuenther/3D_Engine_SDL: Attempt to build a basic 3D engine from scratch based on YouTube tutorial series from javidx9

Thanks,
Dave

The reason is that you’re making multiple calls to SDL for each and every pixel, which is going to be slow no matter what.

It’d be faster instead to set the pixels yourself in a texture, then upload the texture every frame (a rough sketch of the steps follows the list):

  1. At game start, create a texture with the SDL_TEXTUREACCESS_STREAMING flag set
  2. At the start of the frame, call SDL_LockTexture() on it
  3. Render the frame into the memory SDL_LockTexture() gives you, manually setting the pixels
  4. When done rendering, call SDL_UnlockTexture() on your texture to update its contents
  5. Call SDL_RenderCopy() to put that texture on the screen
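For reference, a rough sketch of those five steps in SDL2 might look like this (assuming an already-created SDL_Renderer* named renderer, a 640x480 output, and with error checking and event handling elided):

// Sketch of the streaming-texture approach; error checking and event handling omitted.
SDL_Texture* frame = SDL_CreateTexture(renderer, SDL_PIXELFORMAT_RGBA8888,
                                       SDL_TEXTUREACCESS_STREAMING, 640, 480);

bool running = true;
while (running) {
    void* pixels = nullptr;
    int pitch = 0;                                  // bytes per row, not pixels
    SDL_LockTexture(frame, NULL, &pixels, &pitch);

    Uint32* fb = static_cast<Uint32*>(pixels);
    int pitchInPixels = pitch / (int)sizeof(Uint32);
    // ... software-render the scene here, e.g.:
    // fb[y * pitchInPixels + x] = 0xFF0000FF;      // opaque red in RGBA8888

    SDL_UnlockTexture(frame);                       // updates the texture's contents
    SDL_RenderCopy(renderer, frame, NULL, NULL);
    SDL_RenderPresent(renderer);
    // ... poll events and set running = false on quit
}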

Thank you @sjr. It took me a while to wrap my head around how to use SDL_Textures for low-level pixel operations, but I think I got it working (after a bunch of segfaults). I’ve managed to get a 20 FPS boost out of it, and the SDL_DYNAPI_entry calls have been reduced drastically:

With your suggestion, I made a class out of the steps. It could probably be improved a lot, but as it stands it should handle the RGBA8888 pixel format without issue. The class manages a single SDL_Texture, sized to the width and height of the screen/window, to treat like a frame buffer. It handles lock and unlock and provides a simple blit function for pixel plotting. Posting here in case it might help someone else trying to do the same thing:

SDL_Texture_Blit.h

#ifndef SDL_TEXTURE_BLIT_H
#define SDL_TEXTURE_BLIT_H

#include <SDL2/SDL.h>

/*
How to use:
- Instantiate from a scope that has an SDL_Renderer* defined, with the dimensions of the pixel buffer, then run these commands in sequence during the game loop:
  - lock() the texture so it can be edited
  - blit() pixels to the texture as much as you want (use RGBA8888 format for now)
  - unlock() the texture so that it's ready for the render
  - RenderCopy() to apply the texture to the renderer.

- Outside this class, present the renderer using SDL_RenderPresent(renderer); to show the pixel buffer
The instance of this class should be destroyed when the calling object goes out of scope (just before it destroys the SDL_Renderer object).
*/

class SDL_Texture_Blit{
    public:
        SDL_Texture_Blit(SDL_Renderer* renderer, int SCREEN_W, int SCREEN_H);
        ~SDL_Texture_Blit();
        void lock(); // Do this before blitting
        void blit(uint x, uint y, uint8_t r, uint8_t g, uint8_t b, uint8_t a);
        void unlock(); // do this when all the pixel blits are done for the frame
        void RenderCopy(); // Copies the unlocked texture to the renderer
        SDL_Texture* getFrameBuffer(); // use this returned pointer to call SDL_RenderCopy(renderer, <frameBuffer>, NULL, NULL); outside this class

    private:
        SDL_Renderer *renderer=NULL;
        SDL_PixelFormat *pixelFormat=NULL;
        SDL_Texture *texture=NULL;
        int tex_w, tex_h;
        uint8_t *framebufferpixels=NULL;
        Uint32 *tex_head=NULL;
        Uint32 textureFormat;
        int pitch=0; // size of one row in bytes
        int adjusted_pitch=0; // this is the pitch of a single row in pixels (not bytes)
        Uint32 *p=NULL; // this will be the pixel pointer to a specific pixel in the buffer
        
        bool inPixelRange(const int &x, const int &y);

};


#endif

SDL_Texture_Blit.cpp

#include "SDLTextureBlit.h"
#include <SDL2/SDL.h>
#include <iostream>


SDL_Texture_Blit::SDL_Texture_Blit(SDL_Renderer* renderer, int SCREEN_W, int SCREEN_H){
    
    this->renderer = renderer;
    this->pixelFormat = SDL_AllocFormat(SDL_PIXELFORMAT_RGBA8888);
    this->texture = SDL_CreateTexture(renderer, SDL_PIXELFORMAT_RGBA8888,SDL_TEXTUREACCESS_STREAMING, SCREEN_W, SCREEN_H);
    SDL_QueryTexture(this->texture, &this->textureFormat, NULL, &this->tex_w, &this->tex_h);
    
}

void SDL_Texture_Blit::lock(){
    if(SDL_LockTexture(this->texture, NULL, (void **)&this->framebufferpixels, &this->pitch))
    {
        // if return status is non-zero we have an error and want to show it here
        std::cout << "Error Locking Texture: " << SDL_GetError() << std::endl;
    }
    this->adjusted_pitch=this->pitch/this->pixelFormat->BytesPerPixel;
    this->p = (Uint32 *)(this->framebufferpixels); // cast so pointer arithmetic steps by 4 bytes (one RGBA pixel)
    this->tex_head=this->p;
    
}

void SDL_Texture_Blit::blit(uint x, uint y, uint8_t r, uint8_t g, uint8_t b, uint8_t a){
    
    if (this->inPixelRange(x, y)){
        this->p = this->tex_head+(this->adjusted_pitch*y)+x;
        *p = SDL_MapRGBA(this->pixelFormat, r, g, b, a);
    }
    
}

void SDL_Texture_Blit::unlock(){
    SDL_UnlockTexture(this->texture);
}

void SDL_Texture_Blit::RenderCopy(){
    SDL_RenderCopy(this->renderer, this->texture, NULL, NULL);
}

SDL_Texture* SDL_Texture_Blit::getFrameBuffer(){
    return this->texture;
}

SDL_Texture_Blit::~SDL_Texture_Blit(){
    
    SDL_DestroyTexture(this->texture); // frees the texture's pixel storage; framebufferpixels pointed into it while locked
    SDL_FreeFormat(this->pixelFormat); // allocated with SDL_AllocFormat, so release with SDL_FreeFormat (not free())
    p=NULL;
    tex_head=NULL;
    framebufferpixels=NULL;
    //SDL_Renderer* renderer gets destroyed by parent object outside this class
    
}

bool SDL_Texture_Blit::inPixelRange(const int &x, const int &y){
    
    if (y>=this->tex_h) {return false;}
    if (x>=this->tex_w) {return false;}
    if (y<0) {return false;}
    if (x<0) {return false;}
    return true;
   
}

Glad you got it working. However, the way you’re doing this here is still gonna be really slow compared to just setting the pixels directly. Making multiple function calls per pixel (3 in the worst case) is bad, especially since one of them, inPixelRange(), is also doing 4 conditionals!

Even at a fairly low resolution like 640x480, that’s still 307200 pixels. Meaning you’re doing 921600 function calls and 1228800 comparison/branches per frame!

How can this be fixed?

First, you don’t need to check if every single pixel is in range. Since software 3D polygon renderers for games typically clip the polygons to the screen and then break those down into horizontal spans for each row of pixels the polygon will occupy, that should be more than enough to make sure you’re staying within the framebuffer. Even if inPixelRange() gets inlined by the compiler so there’s one less function call being made per pixel, the 4 conditionals are really gonna have a negative impact on performance.
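For example, a hypothetical span drawer can clamp each scanline’s endpoints once, before the inner loop, instead of testing every pixel (names here are illustrative, not from the project):

// Hypothetical helper: clamp a solid-colored span to the screen once per
// scanline, so the inner loop has no per-pixel bounds checks.
void drawSpan(Uint32* fb, int pitchInPixels, int screenW, int screenH,
              int y, int xStart, int xEnd, Uint32 color)
{
    if (y < 0 || y >= screenH) return;   // whole scanline is off screen
    if (xStart < 0) xStart = 0;          // clamp the endpoints once
    if (xEnd > screenW) xEnd = screenW;

    Uint32* p = fb + y * pitchInPixels + xStart;
    for (int x = xStart; x < xEnd; ++x)
        *p++ = color;
}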

Second, calling SDL_MapRGBA() for every pixel is gonna slow things down too. If you’re doing texture mapping, convert the textures to the same format as the framebuffer (either offline or at load time), then just copy the pixels into the framebuffer as uint32’s (so your texture is stored in memory as an array of uint32’s, you treat the framebuffer as an array of uint32’s, and just copy the pixels). If you’re filling the polygons with solid colors, pack the color channels into a uint32 yourself ahead of time (don’t do it for every pixel!) and then set it directly.
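One way to do the format conversion at load time is SDL_ConvertSurfaceFormat(); in the sketch below the file name and color are just placeholders:

// Convert a loaded surface to the framebuffer's format once, at load time,
// so per-pixel copies later need no conversion. "wall.bmp" is a placeholder.
SDL_Surface* loaded = SDL_LoadBMP("wall.bmp");
SDL_Surface* converted = SDL_ConvertSurfaceFormat(loaded, SDL_PIXELFORMAT_RGBA8888, 0);
SDL_FreeSurface(loaded);
// converted->pixels can now be treated as an array of Uint32s.

// For solid-colored polygons, pack the color once, outside the rasterizer:
SDL_PixelFormat* fmt = SDL_AllocFormat(SDL_PIXELFORMAT_RGBA8888);
Uint32 packedColor = SDL_MapRGBA(fmt, 255, 128, 0, 255); // done once per polygon, not per pixel
SDL_FreeFormat(fmt);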

Third, the blit() function shouldn’t really exist. Have your lock() function return the pointer and pitch that it gets from SDL_LockTexture() to the caller, and then let the caller set the pixels directly.

Ideally, you’d just be directly setting pixels in a tight inner loop (“framebufferData[y * pitchInPixels + x] = color;” or even “framebufferData[offset++] = color;” since you’re drawing in horizontal spans; note that SDL gives you the pitch in bytes, so divide it by 4 when indexing an array of Uint32s), with any conditionals, function calls, etc., lifted out of it and put somewhere where they won’t be called for every pixel.
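Putting that together, the inner loop for one textured horizontal span might look roughly like this (a sketch with illustrative names, not code from the project; 16.16 fixed-point UVs are assumed, and Uint32 comes from SDL2/SDL.h):

#include <SDL2/SDL.h>

// Illustrative inner loop for one textured horizontal span.
// Clipping, locking, and the per-span UV stepping values are all computed
// before this function is called, so the loop body is just a 32-bit copy.
void drawTexturedSpan(Uint32* framebufferData, int pitchInPixels,
                      const Uint32* texturePixels, int texWidth,
                      int y, int xStart, int xEnd,
                      Uint32 u, Uint32 v, Uint32 uStep, Uint32 vStep)
{
    Uint32* dst = framebufferData + y * pitchInPixels + xStart;
    for (int x = xStart; x < xEnd; ++x) {
        *dst++ = texturePixels[(v >> 16) * texWidth + (u >> 16)]; // texel already in RGBA8888
        u += uStep;
        v += vStep;
    }
}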

But if you want to keep it this way:

Definitely fix inPixelRange() if you aren’t going to eliminate it.

For starters, blit() takes uints for the X and Y coordinates, so the coordinates it passes to inPixelRange() will never be less than zero. Change inPixelRange() to take uints as well, and then you can eliminate the last two if statements.

Next, change it to just pass the values of X and Y instead of references. This extra level of indirection is going to hurt performance if the compiler isn’t smart enough to optimize the references away when inlining.
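Applied to the class above, the bounds check could shrink to something like this (the header declaration would need the same signature change):

// Unsigned parameters passed by value: the two "less than zero" tests and the
// reference indirection both go away.
bool SDL_Texture_Blit::inPixelRange(unsigned int x, unsigned int y){
    return x < (unsigned int)this->tex_w && y < (unsigned int)this->tex_h;
}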


Thanks for all the feedback. You are correct, I am clipping polygons at the camera frustum edges in camera space (x, y, z), so ideally by the time they get to screen space they shouldn’t ever actually be outside the x,y screen limits. I have noticed some unrelated funny business right now between my FOV and aspect ratio that causes some clipping to occur such that pixels end up off screen when I move my FOV away from 90. I appear to have fixed this, but I’m not sure I trust the fix enough to remove the inPixelRange() safeguard just yet. Also, I did make some adjustments to the blit function to handle horizontal scanlines. You are correct that the engine is set up to rasterize triangles one row at a time, so I can just identify the x,y starting value for each pixel line in a poly and then use p++ to increment each pixel for that line.
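For anyone following along, the scanline version can be a small addition to the class posted earlier; blitSpan() below is a sketch, not code from the repository, and it assumes lock() has been called and the color is already packed as RGBA8888:

// Hypothetical addition to SDL_Texture_Blit: write one horizontal run of pixels.
void SDL_Texture_Blit::blitSpan(int x, int y, int length, Uint32 color){
    if (y < 0 || y >= this->tex_h) return;                    // scanline off screen
    if (x < 0) { length += x; x = 0; }                        // clamp left edge
    if (x + length > this->tex_w) length = this->tex_w - x;   // clamp right edge
    Uint32 *p = this->tex_head + this->adjusted_pitch*y + x;
    for (int i = 0; i < length; i++) *p++ = color;            // p++ walks one pixel at a time
}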

Should I still use Uint32 for my pixel pointer p if I move away from using alpha and instead define the texture to be RGB888? The pixel format of my window, which I discovered this morning using getWindowMode(), is RGB888. When working with the SDL_Textures I am not using the alpha channel at all (every pixel uses 0xFF for alpha), so I agree that it seems like wasted space. But in that case, would I still use Uint32 for the frame buffer pixel pointer (p), or do I need to be dealing with 24-bit sizes? (Might be a dumb question, because I didn’t even see uint24_t show up in my IntelliSense in VS Code.) Can you explain a little more about how to move away from using an alpha channel so that it would improve performance?

Noted on the removal of the blit() routine. I have a flat-top triangle rasterization function and a flat-bottom triangle rasterization function that work through the UV mapping slightly differently, but both make calls to the blit() function. I can try to instead pull down a pixel pointer and perform the blit directly in those two places.

If the units in the buffer are 24 bits wide, then you need a data type that is 24 bits wide. You could declare such a type as a packed structure with a 24-bit integer bitfield, but since you need to modify the individual color channels, it’s more convenient to declare it as a packed structure containing three one-byte fields R, G, and B (in the correct order). Or as a union, if you want both per-channel fields and a view that treats the entire 24 bits as a single value.
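A minimal sketch of such a type, assuming the buffer stores the channels in R, G, B byte order (adjust the field order to match the actual layout):

#include <cstdint>

// Packed 3-byte pixel for a genuinely 24-bit buffer; the pragma prevents padding.
#pragma pack(push, 1)
struct Pixel24 {
    uint8_t r, g, b;          // byte order must match the buffer's actual layout
};
union Pixel24U {
    Pixel24 channels;         // per-channel access
    uint8_t bytes[3];         // raw access to the whole 24-bit pixel
};
#pragma pack(pop)
static_assert(sizeof(Pixel24) == 3, "Pixel24 must be exactly 3 bytes");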

Remember, however, that this way you lose 32-bit memory alignment, so processing such a buffer may be slower than processing data aligned to 32 bits (with one byte unused). Also take into account that even if you use 24-bit pixels, the backend may still use 32-bit pixels internally. So you need to check whether 24-bit formats are supported at all, and if so, whether processing them is actually faster. I bet it won’t be: you will reduce memory usage but increase buffer processing time.

Thanks. I think Uint32 worked here with RGB888 (SDL’s RGB888 is actually a padded 32-bit format with the top byte unused, so each pixel is still 4 bytes).

When it comes to the GPU, any RGB888 texture you upload will get converted to RGBA anyway. (There are exceptions to this, such as packed formats, but you only need to worry about them if you’re accessing the GPU API directly.)

It’s gonna be way faster to keep everything (including textures) as RGBA32, even if there’s no transparency. That way copying pixels from the texture to the framebuffer is just a simple assignment/copy, it’ll be 32-bit aligned (more speed), and copying your framebuffer texture to VRAM won’t make the driver have to interleave an alpha channel.

Also, doing your blitting in spans (and RGBA32) instead of a pixel at a time opens up the possibility of using SIMD instructions to operate on multiple pixels at a time when you go to add lighting, etc., later on.
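As a small taste of that, filling a span with a solid RGBA32 color four pixels at a time with SSE2 might look roughly like this (illustrative only, and only worthwhile once the scalar version is working):

#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Fill `count` RGBA32 pixels starting at `dst` with `color`, four pixels per store.
void fillSpanSSE2(uint32_t* dst, int count, uint32_t color)
{
    __m128i c4 = _mm_set1_epi32(static_cast<int>(color)); // broadcast the color to 4 lanes
    int i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), c4); // unaligned 16-byte store
    for (; i < count; ++i)
        dst[i] = color;                                            // leftover tail pixels
}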