Lack of batching for RenderCopy/RenderCopyEx

Hey guys

I wanted to create yet another rendering engine for my game, one that uses SDL2’s internal renderer, alongside pure Direct3D and OpenGL. The reason is that SDL2 has a great software renderer which might come in handy sooner or later. However, since those two functions offer no batch drawing, I can already see the performance hit when rendering the glyphs of a text and a few other minor things where batching is all but mandatory.

Perhaps it’s time to create some functions that batch several rectangles into one primitive draw call? It would be pretty much the same as the already available SDL_RenderDrawPoints, SDL_RenderDrawRects, and SDL_RenderFillRects. Or perhaps I’m missing something and there is a specific reason for not implementing it?

I mean I know that batching won’t change anything within a software renderer, but it would be nice to have it nonetheless - after all, why not use available Direct3D and OpenGL back-ends when possible?

2015-08-01 13:09 GMT-03:00, .3lite :

Perhaps it’s time to create some functions to batch few rectangles within
one primitive draw call? It’s pretty much the same as with already available
functions of SDL_RenderDrawPoints, SDL_RenderDrawRects, and
SDL_RenderFillRects. Perhaps I’m missing something and there is a specific
reason behind not implementing it?

Mostly that people attempted to automatically batch the calls to the
existing functions and somehow expect that to improve all programs
(despite said programs constantly changing textures most likely).

A new function would be probably the best and easiest option (actually
an equivalent for the scaling/rotation variant would be neat too). It
can even be made to just fall back to non-batched functions in
backends where batching wasn’t implemented yet (worst case it just
takes about the same amount of time, best case it improves by a lot).

Odds are it’d look a lot like the XNA SpriteBatch when using
SpriteSortMode.Deferred:

https://msdn.microsoft.com/en-us/library/microsoft.xna.framework.graphics.spritebatch.aspx

https://msdn.microsoft.com/en-us/library/microsoft.xna.framework.graphics.spritesortmode.aspx

https://github.com/flibitijibibo/FNA/blob/master/src/Graphics/SpriteBatch.cs

You could either require that all copies in a single batch use one
texture, like RenderCopyBatched(SDL_Texture*, SDL_Rect**), or do
RenderBatchBegin/RenderBatchEnd hints that try to generate batches of
RenderCopy calls on SDL’s end at the cost of having things split up more
than you might expect.

-Ethan

To be honest guys, I’m expecting something much simpler. Take the following functions as an example:

Code:
SDL_RenderCopies(SDL_Renderer* renderer,
                 SDL_Texture* texture,
                 const SDL_Rect** srcrect,
                 const SDL_Rect** dstrect)

Code:
SDL_RenderCopiesEx(SDL_Renderer* renderer,
                   SDL_Texture* texture,
                   const SDL_Rect** srcrect,
                   const SDL_Rect** dstrect,
                   const double angle,
                   const SDL_Point* center,
                   const SDL_RendererFlip flip)

The only difference between those functions and their SDL_RenderCopy and SDL_RenderCopyEx equivalents is that they take an array of source and destination rectangles. The rest can be taken care of by the programmer. That way the implementation of these functions should be straightforward and should take no more than 10 minutes for both OpenGL and Direct3D. I can make it myself, yes, but I would like to stay up to date with SDL2 itself and I do not like modifying external libraries I’m relying on.

D3D_RenderCopy uses Direct3D’s DrawPrimitiveUP, which is almost ready for batch drawing: just add the vertices of the remaining rectangles. Desktop OpenGL uses the old immediate mode (draw arrays would be much better), but it’s easy to implement there as well. All available renderers within SDL2 are pretty much ready for batching copies of the same texture; they only require minor changes to SDL_RenderCopy and SDL_RenderCopyEx to make new functions out of them.
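
As a rough illustration of "just add more vertices", here is a minimal sketch of packing several rectangles into one triangle-list vertex array that a single draw call could consume. Rect, Vertex, emit_quad, and build_batch are hypothetical stand-ins invented for this example, not SDL or Direct3D types, and the vertex layout is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for SDL_Rect and a textured vertex; not SDL types. */
typedef struct { int x, y, w, h; } Rect;
typedef struct { float x, y, u, v; } Vertex;

/* Writes the two triangles (6 vertices) that make up one textured quad.
   tex_w/tex_h turn source pixels into 0..1 texture coordinates. */
static void emit_quad(Vertex *out, const Rect *src, const Rect *dst,
                      int tex_w, int tex_h)
{
    float x0 = (float)dst->x, y0 = (float)dst->y;
    float x1 = (float)(dst->x + dst->w), y1 = (float)(dst->y + dst->h);
    float u0 = (float)src->x / (float)tex_w;
    float v0 = (float)src->y / (float)tex_h;
    float u1 = (float)(src->x + src->w) / (float)tex_w;
    float v1 = (float)(src->y + src->h) / (float)tex_h;

    Vertex tl = { x0, y0, u0, v0 }, tr = { x1, y0, u1, v0 };
    Vertex bl = { x0, y1, u0, v1 }, br = { x1, y1, u1, v1 };

    out[0] = tl; out[1] = tr; out[2] = bl;   /* first triangle  */
    out[3] = tr; out[4] = br; out[5] = bl;   /* second triangle */
}

/* Fills `out` (capacity counted in vertices) with all quads of one texture.
   Returns the number of vertices written, or 0 if the buffer is too small. */
size_t build_batch(Vertex *out, size_t capacity,
                   const Rect *src, const Rect *dst, size_t count,
                   int tex_w, int tex_h)
{
    assert(out && src && dst && tex_w > 0 && tex_h > 0);
    if (count * 6 > capacity)
        return 0;
    for (size_t i = 0; i < count; i++)
        emit_quad(out + i * 6, &src[i], &dst[i], tex_w, tex_h);
    return count * 6;
}
```

The returned vertex count is what a backend would hand to one draw call instead of issuing one call per rectangle.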

I agree that it should be simpler than XNA and I personally like this
line of thinking.

I think the goal of the batching API should be stated explicitly so
everybody is on the same page. In my opinion, the goal should be to
allow performance optimizations and that’s it. (XNA conflates multiple
things…performance plus read-my-mind-do-everything-I-want which
ultimately makes things more complicated.) Convenience wrappers can
always be written on the outside, but you can’t wrap around
API/performance bottlenecks and expect them to go faster.

Additionally, I suspect that a good SIMD backend could make the
software renderer go a lot faster too. (Watch Handmade Hero for a
great demonstration of how he made a chunky software renderer go to
60fps at 1080p using SSE2.)

I would suggest constraining the API as much as possible for speed. To
make SIMD or any vectorization go fast, you generally want
predictable data layouts and no branches.

So for example, with

SDL_RenderCopiesEx(SDL_Renderer* renderer,
                   SDL_Texture* texture,
                   const SDL_Rect** srcrect,
                   const SDL_Rect** dstrect,
                   const double angle,
                   const SDL_Point* center,
                   const SDL_RendererFlip flip)

I might suggest that individual array elements for src/dst rect must
always have values and can’t be NULL, that way the code that is trying
to shuffle things into registers isn’t needing to check for NULL all
the time.

Along that line of thought, we may want explicit array sizes for
srcrect/dstrect as additional parameters. Algorithms may want to
compute up front how they are going to deal with odd number cases where
the number of objects doesn’t perfectly divide evenly into the wide
registers. This may have an additional convenience for when the user
has a large array of rects already, but only needs a subset,
preventing the need to make a new copy.

srcrect or dstrect arrays themselves being NULL/empty probably could
be handled efficiently by separating into specialized versions early
before entering into the inner loops.

I’m a little ambivalent about flip. It seems like for performance, the
user should have pre-oriented the texture. On the other hand, since it
is already in the core SDL API, consistency is nice, and I don’t think
this needs to incur a noticeable cost as it can also be separated out
into specialized versions early.

-Eric
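
The up-front decisions described above (pick a specialized path once for NULL arrays, and split the explicit count into full wide groups plus leftovers) could be sketched roughly as follows. All names here (Rect, CopyPath, BatchPlan, plan_batch) are hypothetical, and treating a NULL array as "the whole texture" or "the whole target" is an assumption borrowed from how SDL_RenderCopy treats a NULL rect.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int x, y, w, h; } Rect;   /* stand-in for SDL_Rect */

/* One specialized inner loop per combination, chosen once per batch. */
typedef enum {
    PATH_WHOLE_TO_WHOLE,   /* both arrays NULL       */
    PATH_WHOLE_TO_RECTS,   /* only the source is NULL */
    PATH_RECTS_TO_WHOLE,   /* only the dest is NULL   */
    PATH_RECTS_TO_RECTS    /* both arrays provided    */
} CopyPath;

typedef struct {
    CopyPath path;
    size_t wide_groups;    /* iterations that fill every SIMD lane */
    size_t leftovers;      /* elements handled by a scalar tail    */
} BatchPlan;

/* Decides everything that would otherwise need a branch per element.
   `lanes` is how many rects one wide iteration processes. */
BatchPlan plan_batch(const Rect *src, const Rect *dst,
                     size_t count, size_t lanes)
{
    BatchPlan plan;
    assert(lanes > 0);

    if (src && dst)   plan.path = PATH_RECTS_TO_RECTS;
    else if (src)     plan.path = PATH_RECTS_TO_WHOLE;
    else if (dst)     plan.path = PATH_WHOLE_TO_RECTS;
    else              plan.path = PATH_WHOLE_TO_WHOLE;

    plan.wide_groups = count / lanes;
    plan.leftovers   = count % lanes;
    return plan;
}
```

Because individual elements are never NULL under this contract, the inner loops selected by `path` need no per-element checks.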

Eric Wing wrote:

I might suggest that individual array elements for src/dst rect must
always have values and can’t be NULL, that way the code that is trying
to shuffle things into registers isn’t needing to check for NULL all
the time.

I agree. In fact I made a small mistake in my example. I meant an array of objects, not an array of pointers, plus the number of elements in the array.

That is:

Code:
SDL_RenderCopies(SDL_Renderer* renderer,
                 SDL_Texture* texture,
                 const SDL_Rect* srcrect,
                 const SDL_Rect* dstrect,
                 int count)

Code:
SDL_RenderCopiesEx(SDL_Renderer* renderer,
                   SDL_Texture* texture,
                   const SDL_Rect* srcrect,
                   const SDL_Rect* dstrect,
                   const double angle,
                   const SDL_Point* center,
                   const SDL_RendererFlip flip,
                   int count)

That way, instead of providing an array of pointers, you provide the address of an array of objects.

Code:
SDL_Rect srcRects[100];
SDL_Rect dstRects[100];
// fill them
SDL_RenderCopies(renderer, texture, srcRects, dstRects, 100);

I believe that these two small functions will take SDL2’s renderer to a new level. Who knows how many games were forced to abandon SDL2 in favor of a pure Direct3D or OpenGL implementation due to performance issues from the lack of batching?
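
To connect the count-based signature with the fallback idea raised earlier in the thread (a batched entry point that simply loops over the non-batched call where a backend has no batch support), here is a minimal sketch. Backend, render_copies, and the two toy functions are hypothetical and only count what a real backend would submit; nothing here is SDL's actual internal structure.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int x, y, w, h; } Rect;   /* stand-in for SDL_Rect */

/* Hypothetical backend table: copy_many is optional. */
typedef struct Backend {
    int (*copy_one)(struct Backend *b, const Rect *src, const Rect *dst);
    int (*copy_many)(struct Backend *b, const Rect *src, const Rect *dst,
                     int count);            /* NULL if not implemented */
    int draw_calls;                         /* bookkeeping for the demo */
    int rects_drawn;
} Backend;

/* Batched entry point: uses the backend's batch path when it has one,
   otherwise falls back to one call per rectangle.
   Returns 0 on success, a negative value on failure. */
int render_copies(Backend *b, const Rect *src, const Rect *dst, int count)
{
    assert(b && b->copy_one && src && dst);
    if (count <= 0)
        return 0;
    if (b->copy_many)
        return b->copy_many(b, src, dst, count);
    for (int i = 0; i < count; i++) {
        if (b->copy_one(b, &src[i], &dst[i]) < 0)
            return -1;
    }
    return 0;
}

/* Toy implementations that only count submissions. */
int toy_copy_one(Backend *b, const Rect *src, const Rect *dst)
{
    (void)src; (void)dst;
    b->draw_calls++;
    b->rects_drawn++;
    return 0;
}

int toy_copy_many(Backend *b, const Rect *src, const Rect *dst, int count)
{
    (void)src; (void)dst;
    b->draw_calls++;
    b->rects_drawn += count;
    return 0;
}
```

With 100 rectangles, the backend without copy_many records 100 submissions and the one with it records a single submission, which is the worst case and best case described above.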

Was center meant to point to an array, or just a single SDL_Point? C
has supported pass-by-value of structures for long enough that I
haven’t easily been able to find out when: I think it was during the
K&R/C89 switch-over. I wouldn’t use pass-by-value for *_Renderer or
*_Texture, but mostly because they might have hidden data.

Eric Wing wrote:

I might suggest that individual array elements for src/dst rect must
always have values and can’t be NULL, that way the code that is trying
to shuffle things into registers isn’t needing to check for NULL all
the time.

I don’t think these would be too bad, since for both the source and
destination you’re looking at a mandatory memory read, mandatory
comparison, optional single-instruction jump, and optional constant
write to the same destination as the mandatory memory read. As long as
you prep your src/dest null-replacement SDL_Rect instances to the
texture & renderer beforehand, you should be fine (after all, srcrect
and dstrect are likely to point to non-contiguous SDL_Rect, since
gather/store can potentially cut down on the total data transfers, and
the cache thrashing was likely to happen during copy operations as
well).

Along that line of thought, we may want explicit array sizes for
srcrect/dstrect as additional parameters.

Agreed, especially since the only other way to find the end of the
array is to go hunting for a null-pointer.

srcrect or dstrect arrays themselves being NULL/empty probably could
be handled efficiently by separating into specialized versions early
before entering into the inner loops.

Yeah, when I write a function with pointer args it usually starts off like this:

int func( type *arg )
{
    if( arg )
    {
    }

    return( -1 );
}

A quick optimization for even the least optimizing compilers is easy
to figure out.
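
For the pointer-array variant discussed in this reply, the per-element NULL handling could look roughly like the sketch below: one read, one comparison, and a substitution with a rectangle prepared beforehand. Rect and total_area are hypothetical names, and summing areas merely stands in for the real copy work so the effect is observable.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int x, y, w, h; } Rect;   /* stand-in for SDL_Rect */

/* Walks an array of rect pointers; a NULL element is replaced by
   `fallback` (e.g. a rect covering the whole texture or target that the
   caller prepared once). Returns the summed area, or -1 when the array
   itself or the fallback is missing. */
long total_area(const Rect *const *rects, size_t count, const Rect *fallback)
{
    long area = 0;

    if (!rects || !fallback)
        return -1;                      /* early-out, as in the skeleton above */

    for (size_t i = 0; i < count; i++) {
        const Rect *r = rects[i] ? rects[i] : fallback;
        area += (long)r->w * r->h;
    }
    return area;
}
```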

Jared Maddox wrote:

Was center meant to point to an array, or just a single SDL_Point? C
has supported pass-by-value of structures for long enough that I
haven’t easily been able to find out when: I think it was during the
K&R/C89 switch-over. I wouldn’t use pass-by-value for *_Renderer or
*_Texture, but mostly because they might have hidden data.

Actually it is meant to be a single SDL_Point. All rendering engines within SDL2 use transformation matrices, and changing the transformation matrix always requires a state change, which flushes the current batch.

The batch is meant to be simple, efficient, and it should be up to the programmer to implement any kind of batching he likes.

I know this is an old thread, but seeing that we still don’t have batching support I went ahead and rolled my own. This is not the best solution, since I will need to link statically if I want it on all platforms, but I was planning to use it where it hurts most (phones), and DLLs are not used on iOS and Android (for Android I’m not 100% sure).

Now to the point. In my first iteration (a patch that was supposed to be used just by me) I changed SDL_RenderCopy to batch automatically when the same texture is passed in consecutive calls, and to render the batch once the texture changes or SDL_RenderPresent is called. The result is seamless batching that just works, with no work needed outside. This is all nice, but it is a little hacky: anybody who expects SDL_RenderCopy to render to the target instantly and calls SDL_RenderReadPixels right after will be disappointed to find this no longer works. Also, the buffers where the batch is stored need to be set up beforehand (in my patch this is just a define).
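
The automatic scheme described above can be modelled in a few lines. This is only a sketch under assumptions: AutoBatcher, batch_copy, batch_flush, and batch_present are invented names, the fixed BATCH_CAPACITY mirrors the "just a define" remark, and a counter stands in for real GPU submissions.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_CAPACITY 4   /* tiny on purpose; a real buffer would be larger */

typedef struct {
    const void *texture;   /* texture of the pending batch, NULL when empty */
    int pending;           /* copies queued but not yet drawn */
    int draw_calls;        /* how many batches were actually submitted */
} AutoBatcher;

/* Submits whatever is queued as one draw call. */
void batch_flush(AutoBatcher *b)
{
    assert(b != NULL);
    if (b->pending > 0) {
        b->draw_calls++;
        b->pending = 0;
    }
    b->texture = NULL;
}

/* Queues one copy; flushes first if the texture changed or the buffer is full. */
void batch_copy(AutoBatcher *b, const void *texture)
{
    assert(b != NULL && texture != NULL);
    if (b->pending > 0 &&
        (b->texture != texture || b->pending == BATCH_CAPACITY))
        batch_flush(b);
    b->texture = texture;
    b->pending++;
}

/* Presenting must drain the queue; so must anything that reads the target. */
void batch_present(AutoBatcher *b)
{
    batch_flush(b);
}
```

A non-zero `pending` right after batch_copy is exactly the read-back hazard mentioned above: the copy has been accepted but nothing has reached the render target yet.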

After this I started working on integrating this in a less hacky way by defining int SDL_RenderCopyN(SDL_Renderer * renderer, SDL_Texture * texture, const SDL_Rect * srcrect, const SDL_Rect * dstrect, const size_t srcCount, const size_t dstCount). However, when I checked how SDL_DrawPoints is actually implemented, I found that it makes 2 SDL_stack_alloc() calls. This is very neat and results in readable, clean code, but it is much slower than having the renderer pre-allocate a buffer for all its needs (a function could be used to grow the buffer), in which every operation that needs work space would operate. There will be many draw operations per frame, and allocating memory each time really eats into performance. Does anybody else have input on this?
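
The pre-allocated work buffer suggested above could be sketched like this: a scratch area owned by the renderer that grows only when a request exceeds its current size. Scratch, scratch_reserve, and scratch_free are hypothetical names, and the doubling growth policy and 256-byte starting size are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-renderer scratch space, reused by every draw call. */
typedef struct {
    void  *data;
    size_t capacity;       /* bytes currently available */
    int    allocations;    /* how often the buffer had to grow */
} Scratch;

/* Returns a buffer of at least `bytes` bytes, growing only when needed.
   Returns NULL if growing fails; the old buffer then stays valid. */
void *scratch_reserve(Scratch *s, size_t bytes)
{
    assert(s != NULL);
    if (bytes > s->capacity) {
        size_t new_cap = s->capacity ? s->capacity : 256;
        while (new_cap < bytes) {
            if (new_cap > SIZE_MAX / 2) {   /* avoid overflow when doubling */
                new_cap = bytes;
                break;
            }
            new_cap *= 2;
        }
        void *p = realloc(s->data, new_cap);
        if (!p)
            return NULL;
        s->data = p;
        s->capacity = new_cap;
        s->allocations++;
    }
    return s->data;
}

/* Releases the buffer when the renderer is destroyed. */
void scratch_free(Scratch *s)
{
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->capacity = 0;
}
```

After the first few frames the buffer reaches its working size, and later draw calls reuse it without touching the allocator.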

BTW, what state is SDL_RenderGeometry in? Is it still up for being pushed? Would there be interest in batching support if I were to contribute code for this? I can most likely implement most renderers except the very exotic ones (like PSP).

Oh cool! Very interested in your code, and batching APIs :)

I ended up with a different approach for the first working change. Instead of SDL_RenderCopyN() I defined int SDL_RenderCopyStartBatch(SDL_Renderer * renderer, const int count) and int SDL_RenderCopyEndBatch(SDL_Renderer * renderer).

Usage example would be:
SDL_RenderCopyStartBatch();
SDL_RenderCopy();
SDL_RenderCopy();

SDL_RenderCopyEx();

SDL_RenderCopyEndBatch();

This way the original functions can be left unchanged, for the software renderer for example, and they will work the same way as before. SDL_RenderCopyStartBatch() is in charge of setting up buffers and telling the renderer we want the next calls batched (only calls using the same texture will get batched; the rest are rendered on the spot). SDL_RenderCopyEndBatch() renders the batch and invalidates the batcher until start is called again.
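
A toy model of the start/end semantics described above, to make the control flow concrete. Renderer, start_batch, render_copy, and end_batch are hypothetical stand-ins rather than the functions from the patch, and two details are assumptions of this sketch: the batch texture is taken from the first copy after start, and queued copies are submitted before a different texture is drawn on the spot so that draw order is preserved.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int active;            /* non-zero between start_batch and end_batch */
    int capacity;          /* buffer size requested at start_batch */
    int queued;            /* copies waiting in the batch */
    const void *texture;   /* texture the batch is bound to */
    int draw_calls;        /* submissions, for demonstration */
} Renderer;

static void submit_queued(Renderer *r)
{
    if (r->queued > 0) {
        r->draw_calls++;
        r->queued = 0;
    }
}

int start_batch(Renderer *r, int count)
{
    assert(r != NULL);
    if (r->active || count <= 0)
        return -1;
    r->active = 1;
    r->capacity = count;
    r->queued = 0;
    r->texture = NULL;
    return 0;
}

int render_copy(Renderer *r, const void *texture)
{
    assert(r != NULL && texture != NULL);
    if (!r->active) {              /* outside a batch: unchanged behaviour */
        r->draw_calls++;
        return 0;
    }
    if (r->texture == NULL)
        r->texture = texture;      /* first copy picks the batch texture */
    if (texture != r->texture) {   /* other textures render on the spot */
        submit_queued(r);          /* assumption: keep draw order intact */
        r->draw_calls++;
        return 0;
    }
    if (r->queued == r->capacity)
        submit_queued(r);          /* buffer full: submit and keep going */
    r->queued++;
    return 0;
}

int end_batch(Renderer *r)
{
    assert(r != NULL);
    if (!r->active)
        return -1;
    submit_queued(r);
    r->active = 0;
    return 0;
}
```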

As a proof of concept I implemented it for SDL_render_gles2.c (https://www.dropbox.com/s/s224cj2rrmkgghk/SDL2_batchedES2.7z?dl=0, code on top of SDL 2.0.7 with the testsprite2 example edited to use it). I still need to test it on a device; for now I have only run it on Windows.

It’s not optimized yet: I am using GL_TRIANGLES and need 6 vertices per rectangle, since only glDrawArrays() seems to be available in the SDL ES2 renderer. Does anybody know why glDrawElements() isn’t exposed? It should normally be there for ES2, and it would cut down on the data transfers between the CPU and GPU needed for batching.
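
To put numbers on the glDrawArrays versus glDrawElements trade-off, here is a small sketch that generates the index list for a batch of quads (4 unique corners and 6 indices each) and compares buffer sizes. The function names, the corner order (top-left, top-right, bottom-left, bottom-right), and the 16-bit index type are assumptions of this example; no GL calls are made.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes 6 indices per quad into `out` (capacity counted in indices).
   Returns the number of indices written, or 0 if they do not fit in the
   buffer or in 16-bit indices. */
size_t build_quad_indices(uint16_t *out, size_t capacity, size_t quads)
{
    assert(out != NULL);
    if (quads * 6 > capacity || quads * 4 > 65536)
        return 0;
    for (size_t q = 0; q < quads; q++) {
        uint16_t base = (uint16_t)(q * 4);
        uint16_t *i = out + q * 6;
        i[0] = base;                 /* TL */
        i[1] = (uint16_t)(base + 1); /* TR */
        i[2] = (uint16_t)(base + 2); /* BL */
        i[3] = (uint16_t)(base + 1); /* TR */
        i[4] = (uint16_t)(base + 3); /* BR */
        i[5] = (uint16_t)(base + 2); /* BL */
    }
    return quads * 6;
}

/* Bytes of vertex data when every triangle corner is its own vertex. */
size_t bytes_unindexed(size_t quads, size_t vertex_size)
{
    return quads * 6 * vertex_size;
}

/* Bytes of vertex plus index data when corners are shared via indices. */
size_t bytes_indexed(size_t quads, size_t vertex_size)
{
    return quads * (4 * vertex_size + 6 * sizeof(uint16_t));
}
```

The index pattern is the same for every quad, so it could be generated once and reused across frames, leaving only the 4 vertices per quad to update.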

You can also look at SDL_gpu’s automatic batching. SDL might have
redundant state changes, too, that may be worth looking at if you’re in the
optimizing mood.

Does anyone know if something like SDL_RenderCopies was ever created as an add-on? I’m looking specifically for GLES 2.0, for a single glDrawArrays call with a packed VBO holding a mesh of quads that share the same texture.

Change the 4 to the vertex count of the mesh (glDrawArrays takes a number of vertices, not bytes), referencing a VBO each for position, normal, and color:
data->glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);