Use SDL to accelerate raw pixel operations?

Greetings! This is a bit of a weird inquiry, but I reckon it can’t hurt to ask. I’m currently writing a custom renderer that does rasterizing and everything else through raw RGBA5551 row-major arrays/matrices (from the scaling to the line drawing to text; it bypasses OpenGL completely), and then simply uses SDL2 to poll input and write to the final window, abstracted away. However, the main issue I’ve faced is that any operation involving window-sized matrices gets really slow really fast in software. I’d appreciate it if anyone knows of a way to use SDL2 to noticeably software- or hardware-accelerate the following, without actually using any of the SDL formats for the input and output, so it can be abstracted away as a more performant version of the existing functions (any of these would be a huge help individually, which is why I list them separately):

  • Set a matrix of pixels to a single color.
  • Copy a matrix of pixels to another of the same size.
  • Copy a matrix of pixels to a larger one, at a specific location within the output matrix.
  • Scale a matrix of pixels to another one that is an integer multiple of its size in both dimensions.

Thanks.

When making software renderers, it is important to note the difference between bloatware and native software. An excessive number of branches can make it slow, and divisions and floating-point operations are even slower.
Here is an example of the difference between a bloatware copy and a native copy:

#include <stdint.h>
//bloatware for copying image from input with stride iF, width iW, height iH to data with stride F, width W, height H on position x, y

void bloatwarecopy(const uint16_t* input, int iF, int iW, int iH, uint16_t* data, int F, int W, int H, int x, int y){
	for(int k=0; k<iH; k++){
		if(y+k>=0 && y+k<H){
			for(int l=0; l<iW; l++){
				if(x+l>=0 && x+l<W){
					if(input[k*iF+l] != 0xFFFF)
					data[(y+k)*F+(x+l)] = input[k*iF+l];
				}
			}
		}
	}
}

//native software for copying image from input with stride iF, width iW, height iH to data with stride F, width W, height H on position x, y

void nativecopy(const uint16_t* input, int iF, int iW, int iH, uint16_t* data, int F, int W, int H, int x, int y){
	int k1=0;if(k1<-y)k1=-y;int k2=iH;if(k2>H-y)k2=H-y;
	int l1=0;if(l1<-x)l1=-x;int l2=iW;if(l2>W-x)l2=W-x; if(l2<=l1)return;
	for(int k=k1; k<k2; k++){
		uint16_t* o=data+((y+k)*F+x); const uint16_t* b=input+(k*iF);
		for(int l=l1; l<l2; l++) o[l] = b[l]==0xFFFF?o[l]:b[l];
	}
}

What makes the former copying code bloatware is the excessive number of branches at multiple levels of the loop. The native copying code, on the other hand, intersects the two bounding boxes up front to compute the loop bounds, and it uses a conditional move in place of the transparency branch. In this case, the native copy is faster than the bloatware copy. Even then, my example might not be fully native, and it might be possible to make it even faster than that. In case transparency is not used, memcpy may be used to instantly copy a scanline of the image: memcpy(o+l1, b+l1, (l2-l1)*sizeof(uint16_t)).
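As a sketch, here is what that opaque variant (no colour-key test, one memcpy per clipped scanline) could look like; the function name is mine, not from the post, and the clipping is the same as in nativecopy:

```c
#include <stdint.h>
#include <string.h>

// Opaque copy: same bounding-box clipping as nativecopy, but each
// clipped scanline is copied with one memcpy call instead of a
// per-pixel loop, since no transparency test is needed.
void nativecopy_opaque(const uint16_t* input, int iF, int iW, int iH,
                       uint16_t* data, int F, int W, int H, int x, int y){
	int k1=0; if(k1<-y)k1=-y; int k2=iH; if(k2>H-y)k2=H-y;
	int l1=0; if(l1<-x)l1=-x; int l2=iW; if(l2>W-x)l2=W-x; if(l2<=l1)return;
	for(int k=k1; k<k2; k++){
		uint16_t* o = data + ((y+k)*F + x);       // target row start
		const uint16_t* b = input + (k*iF);       // source row start
		memcpy(o+l1, b+l1, (size_t)(l2-l1)*sizeof(uint16_t));
	}
}
```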

@nomagno: first, don’t use arrays and access individual pixels based on indices, because that can be slow. Instead, use pointers to the first and last pixels of the row, then iterate until the pointers line up. It doesn’t matter where in the target image you want to copy the source image: calculate the address for the source and target rows once, then iterate over them.

If you need to copy a whole block of data, use e.g. memcpy, or write a loop that copies the data using the largest-capacity (64-bit) registers. A handwritten, optimized loop should be faster, because memcpy can do additional checks before the actual copy loop.

To sum up, limit the number of calculations to a minimum, extract what you can before loops, modify data based on pointers, avoid branching. If you need even higher performance, use vectorization.
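A minimal sketch of the pointer-based row iteration described above (the function names are mine, not part of the thread): row addresses are computed once per row, and the inner loop only advances pointers, with no per-pixel index arithmetic.

```c
#include <stdint.h>

// Copy one row of `w` pixels by pointer iteration: advance both
// pointers until the source pointer reaches the end of the row.
static void copy_row(const uint16_t* src, uint16_t* dst, int w){
	const uint16_t* end = src + w;
	while(src != end) *dst++ = *src++;
}

// Copy a w*h block between images with different strides: the start
// address of each row is calculated once, outside the inner loop.
void copy_block(const uint16_t* src, int srcStride,
                uint16_t* dst, int dstStride, int w, int h){
	for(int k=0; k<h; k++)
		copy_row(src + k*srcStride, dst + k*dstStride, w);
}
```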

Not necessarily: memcpy() can be really fast, and it’ll likely use SSE or AVX registers if it makes sense (SSE has 128-bit == 16-byte registers, AVX has 256-bit == 32-byte, or even 512-bit == 64-byte if AVX-512 is used).

Furthermore, the “additional checks” shouldn’t matter much, because memory access is a lot more “expensive” than the few cycles of calculation in whatever checks memcpy() might do. Probably the main overhead of memcpy() is the function call, which (if you copy enough data at a time) should be more than compensated for by the very optimized implementation of memcpy() on most systems.


You are right; it seems that the fastest way to copy a larger memory block is to use memcpy.

As for copying images with integer-factor scaling, it is bloatware to use divisions, floating point, or excessive branches. By converting the constant division into a multiplicative integer factor, the scaled copy can be made native.
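The idea can be sketched in isolation before the full copy routines: a precomputed fixed-point reciprocal of s turns every l/s in the inner loop into one multiply and one shift. This is only a sketch of the technique (the helper names are mine); the identity (l*r)>>32 == l/s holds for the small values of l that occur as pixel coordinates here, because r overshoots 2^32/s by less than one part in 2^32/(s-1).

```c
#include <stdint.h>

// Precompute the fixed-point reciprocal of s once, outside the loops:
// r = ceil-ish approximation of 2^32/s, so r*s is in [2^32, 2^32+s).
int64_t reciprocal(int s){ return 0xFFFFFFFFLL/s + 1; }

// Replace the division l/s by a 64-bit multiply and a shift.
// Valid for l < 2^32/(s-1), which easily covers pixel coordinates.
uint32_t scaled_index(uint32_t l, int64_t r){
	return (uint32_t)((l * r) >> 32);
}
```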

//bloatware for copying image from input with stride iF, width iW, height iH to data with stride F, width W, height H on position x, y with scale s

void bloatwarescaledcopy1(const uint16_t* input, int iF, int iW, int iH, uint16_t* data, int F, int W, int H, int x, int y, int s){
	for(int k=0; k<iH*s; k++){
		if(y+k>=0 && y+k<H){
			for(int l=0; l<iW*s; l++){
				if(x+l>=0 && x+l<W){
					if(input[k/s*iF+l/s] != 0xFFFF)
					data[(y+k)*F+(x+l)] = input[k/s*iF+l/s];
				}
			}
		}
	}
}

//another bloatware version, which scales images with excessive branches

void bloatwarescaledcopy2(const uint16_t* input, int iF, int iW, int iH, uint16_t* data, int F, int W, int H, int x, int y, int s){
	for(int k=0; k<iH; k++){for(int k1=0;k1<s;k1++){
		if(y+k*s+k1>=0 && y+k*s+k1<H){
			for(int l=0; l<iW; l++){for(int l1=0;l1<s;l1++){
				if(x+l*s+l1>=0 && x+l*s+l1<W){
					if(input[k*iF+l] != 0xFFFF)
					data[(y+k*s+k1)*F+(x+l*s+l1)] = input[k*iF+l];
				}
			}}
		}
	}}
}

//floating point is even more bloated and may be affected by rounding errors

void bloatwarescaledcopy3(const uint16_t* input, int iF, int iW, int iH, uint16_t* data, int F, int W, int H, int x, int y, int s){
	double r=1.0/s;
	for(int k=0; k<iH*s; k++){
		if(y+k>=0 && y+k<H){
			for(int l=0; l<iW*s; l++){
				if(x+l>=0 && x+l<W){
					if(input[(int)(k*r)*iF+(int)(l*r)] != 0xFFFF)
					data[(y+k)*F+(x+l)] = input[(int)(k*r)*iF+(int)(l*r)];
				}
			}
		}
	}
}

//still bloatware, since division is involved

void bloatwarescaledcopy4(const uint16_t* input, int iF, int iW, int iH, uint16_t* data, int F, int W, int H, int x, int y, int s){
	int k1=0;if(k1<-y)k1=-y;int k2=iH*s;if(k2>H-y)k2=H-y;
	int l1=0;if(l1<-x)l1=-x;int l2=iW*s;if(l2>W-x)l2=W-x; if(l2<=l1)return;
	for(int k=k1; k<k2; k++){
		uint16_t* o=data+((y+k)*F+x); const uint16_t* b=input+(k/s*iF);
		for(int l=l1; l<l2; l++) o[l] = b[l/s]==0xFFFF?o[l]:b[l/s];
	}
}

//transforming to multiplicative integer factor makes it native

void nativescaledcopy(const uint16_t* input, int iF, int iW, int iH, uint16_t* data, int F, int W, int H, int x, int y, int s){
	int k1=0;if(k1<-y)k1=-y;int k2=iH*s;if(k2>H-y)k2=H-y;
	int l1=0;if(l1<-x)l1=-x;int l2=iW*s;if(l2>W-x)l2=W-x; if(l2<=l1)return;
	int64_t r=0xFFFFFFFFLL/s+1;
	for(int k=k1; k<k2; k++){
		uint16_t* o=data+((y+k)*F+x); const uint16_t* b=input+((k*r>>32)*iF);
		for(int l=l1; l<l2; l++) o[l] = b[l*r>>32]==0xFFFF?o[l]:b[l*r>>32];
	}
}

//depending on the constraints of width and height and scaling factor, if width and height are less than 32768 even after scaling, a 32-bit multiplication can be used instead of 64-bit multiplication which debloats it further

void nativescaledcopy32(const uint16_t* input, int iF, int iW, int iH, uint16_t* data, int F, int W, int H, int x, int y, int s){
	int k1=0;if(k1<-y)k1=-y;int k2=iH*s;if(k2>H-y)k2=H-y;
	int l1=0;if(l1<-x)l1=-x;int l2=iW*s;if(l2>W-x)l2=W-x; if(l2<=l1)return;
	int r=0x7FFFFFFF/s; int f=31; if(r>0x7FFFFF){r>>=8;f-=8;} if(r>0x7FFFF){r>>=4;f-=4;} if(r>0x1FFFF){r>>=2;f-=2;} if(r>0xFFFF){r>>=1;f-=1;} r++;
	for(int k=k1; k<k2; k++){
		uint16_t* o=data+((y+k)*F+x); const uint16_t* b=input+((k*r>>f)*iF);
		for(int l=l1; l<l2; l++) o[l] = b[l*r>>f]==0xFFFF?o[l]:b[l*r>>f];
	}
}

memcpy is not the fastest on all platforms, or at least it wasn’t when I checked with MSVC on Windows.