Upgrading the TTF library from 16-bit UCS-2 to full Unicode

Hello!

While playing (and experimenting with) a game (not my own) based on SDL 2 with ttf, I was first happy to see it could display UTF-8 text (a previous version of that game couldn’t), then perplexed by its inability to show anything beyond the 16-bit BMP subset of Unicode, even though I knew the font in use had glyphs beyond that.

So I downloaded the source to SDL_ttf v2.0.14, kind of expecting to see some really old UCS-2 code under the hood, I’ll admit, but was delighted to see that, on the contrary, UTF-8 is used as the base for everything internally.

Still, there are (or were) two problems, both fairly easily fixed.

First, Unicode code points were, for some reason, crammed into 16-bit variables (type Uint16) everywhere, the one exception being the function UTF8_getch, which reads a character from a UTF-8 string and properly returned it as a 32-bit value (Uint32). (Only for that value to be cut down to smithereens afterwards.)
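
For illustration, here is a minimal sketch of what I mean; the surrounding function is hypothetical (made up for this post), and I am quoting the UTF8_getch signature from memory:

Code:
#include <SDL_stdinc.h>  /* Uint16 / Uint32 */
#include <stddef.h>      /* size_t */

/* Internal to SDL_ttf: reads one code point from a UTF-8 string,
   advancing the pointer. It already returns all 32 bits. */
extern Uint32 UTF8_getch(const char **src, size_t *srclen);

void render_string(const char *text, size_t len)
{
    while (len > 0) {
        /* The bug: storing the result in a Uint16 silently chops off
           everything above U+FFFF:
               Uint16 c = (Uint16)UTF8_getch(&text, &len);
           The fix is simply to keep the full 32-bit value: */
        Uint32 c = UTF8_getch(&text, &len);
        /* ... look up and render the glyph for code point c ... */
        (void)c;
    }
}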

Second, when a character map for the loaded font was to be selected, there was no attempt to find a full Unicode (UCS-4) map before looking for (and settling on) a mere 16-bit Unicode subset (UCS-2) map.
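
In FreeType terms, the idea looks something like the following sketch (the approach rather than my exact patch; platform ID 3 is Microsoft, where encoding ID 10 means UCS-4 and 1 means UCS-2):

Code:
#include <ft2build.h>
#include FT_FREETYPE_H

static void select_best_charmap(FT_Face face)
{
    FT_CharMap found = NULL;
    int i;

    /* First pass: look for a full Unicode (UCS-4) character map. */
    for (i = 0; i < face->num_charmaps; i++) {
        FT_CharMap cm = face->charmaps[i];
        if (cm->platform_id == 3 && cm->encoding_id == 10) {
            found = cm;
            break;
        }
    }
    /* Second pass: only now settle for a BMP-only (UCS-2) map. */
    if (!found) {
        for (i = 0; i < face->num_charmaps; i++) {
            FT_CharMap cm = face->charmaps[i];
            if (cm->platform_id == 3 && cm->encoding_id == 1) {
                found = cm;
                break;
            }
        }
    }
    if (found) {
        FT_Set_Charmap(face, found);
    }
}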

With those two things fixed, I did indeed get the mentioned game to show Unicode beyond the 16-bit BMP.

So… I am wondering: might there be a way to get these fixes applied to the official SDL_ttf library as well?

I should say that my fixed version is not terribly well tested. I haven’t even written any SDL 2 application of my own to test it with yet. Everything seems to work without a problem in the mentioned game that isn’t mine (and is closed-source), but I don’t think it makes that heavy use of the library.

In particular, the existing official library has a few functions that take Unicode code points as 16-bit parameters (ugh), which I really think should be upgraded to 32-bit. When I made my changes I did just that as well (though I don’t think they are used by the mentioned game). However, although I know very little about how program libraries are made, I suspect this can cause ABI compatibility problems. In which case I assume the old 16-bit versions have to remain (even if deprecated) while the 32-bit versions are given some new names; I don’t know what those new names should be, though. (See the sketch after this list.) The functions in question are these:
TTF_GlyphIsProvided

TTF_GlyphMetrics

TTF_RenderGlyph_Solid

TTF_RenderGlyph_Shaded

TTF_RenderGlyph_Blended

TTF_GetFontKerningSizeGlyphs
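
To make the ABI concern concrete, here is a hedged sketch of the deprecation pattern I have in mind (signatures simplified, without the export and calling-convention decorations, and TTF_GlyphIsProvided32 is only a placeholder name):

Code:
#include <SDL_stdinc.h>            /* Uint16 / Uint32 */

typedef struct _TTF_Font TTF_Font; /* opaque, as in SDL_ttf.h */

/* Hypothetical new 32-bit version under some new name: */
extern int TTF_GlyphIsProvided32(TTF_Font *font, Uint32 ch);

/* The old 16-bit entry point remains (even if deprecated), so that
   existing binaries keep working; it simply forwards to the new
   function and can, of course, only ever reach the BMP. */
int TTF_GlyphIsProvided(TTF_Font *font, Uint16 ch)
{
    return TTF_GlyphIsProvided32(font, (Uint32)ch);
}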

With these changes, the only remaining 16-bit parameters would be those for the code units of UCS-2 or UTF-16 strings. Which is as it should be: no other 16-bit variables should really be let anywhere near anything Unicode, I think.


By the way, I did find an earlier discussion of these matters (http://forums.libsdl.org/viewtopic.php?t=9228), but nothing final seems to have come out of it, and it also seemed to focus largely on the addition of a function to let the user select character maps for a font. Not saying such a function couldn’t have a use, perhaps, in a font viewer or font editor application. But I would imagine the normally desired behavior of the library to be to just automatically pick any full Unicode (UCS-4) character map it can find first, and a 16-bit UCS-2 map only otherwise.

Marlin wrote:

Regarding new names for 32-bit versions of the functions taking Unicode characters (/code points) as 16-bit parameters, I’ve come up with the following: […] In particular, the ‘RenderGlyph’ part, in the old functions for rendering UCS-2 characters, seems less than ideal to me. I mean, the glyph kind of is the rendering, isn’t it? And what’s really rendered is the given character (or code point). The functions for rendering strings of characters (in various encodings) don’t have names with ‘RenderGlyphs’, do they?

Well, ‘RenderGlyph’ is actually… like… find a glyph (either bitmap or vector) and draw it on a surface (then return the surface). It’s two operations in one function :). I think ‘render glyph’ makes some sense, though, as it renders a vector graphic into a surface.

Minor update: I have now made my own very first SDL2 application. (Yay!) Veeery simple, of course, but it at least allows me to do some testing of every SDL2_ttf library function that I’ve modified. (Maybe it can even one day be expanded into a full-blown game, who knows. :) ) So far I haven’t found any problems, and, again, the changes are pretty small, really, so there shouldn’t be any. Still, there’s always the possibility that I might have overlooked something, of course.

Regarding new names for 32-bit versions of the functions taking Unicode characters (/code points) as 16-bit parameters, I’ve come up with the following:

Code:
Old UCS-2 function              New full Unicode function
TTF_GlyphIsProvided           → TTF_CharacterHasGlyph
TTF_GlyphMetrics              → TTF_GetGlyphMetrics
TTF_RenderGlyph_Solid         → TTF_RenderCharacter_Solid
TTF_RenderGlyph_Shaded        → TTF_RenderCharacter_Shaded
TTF_RenderGlyph_Blended       → TTF_RenderCharacter_Blended
TTF_GetFontKerningSizeGlyphs  → TTF_GetKerningValue

…and, although I’m obviously biased, I even kind of like these new names better than the existing ones, mostly. In particular, the ‘RenderGlyph’ part, in the old functions for rendering UCS-2 characters, seems less than ideal to me. I mean, the glyph kind of is the rendering, isn’t it? And what’s really rendered is the given character (or code point). The functions for rendering strings of characters (in various encodings) don’t have names with ‘RenderGlyphs’, do they?

Obviously, someone else may have better ideas. I won’t be offended if these particular names aren’t used. :)

By the way, I kind of wish that the functions for handling strings of 16-bit code units had also required replacements with new names. That’s the functions with ‘UNICODE’, horribly misleading, in their names. (UTF-16, of course, isn’t any more ‘Unicode’ than UTF-8 is, and UCS-2 is less.) But, unfortunately, it was fairly easy to upgrade the handling of UCS-2 strings to full UTF-16 too, without any changes to the function declarations.
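
In case anyone is curious, the essence of that upgrade is just surrogate-pair decoding; here is a sketch of the idea (not my exact code, and with no error handling for unpaired surrogates):

Code:
#include <SDL_stdinc.h>  /* Uint16 / Uint32 */

/* Read one code point from a NUL-terminated UTF-16 string and
   advance the pointer. Note that the Uint16* parameter type is
   exactly what the existing “UNICODE” functions already take. */
static Uint32 UTF16_getch(const Uint16 **src)
{
    Uint32 c = *(*src)++;
    if (c >= 0xD800 && c <= 0xDBFF) {        /* high surrogate */
        Uint32 lo = **src;
        if (lo >= 0xDC00 && lo <= 0xDFFF) {  /* low surrogate */
            (*src)++;
            c = 0x10000 + (((c - 0xD800) << 10) | (lo - 0xDC00));
        }
    }
    return c;
}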

I realize, of course, that Unicode beyond the 16-bit BMP isn’t a high priority for most people, or the library would have been upgraded a long time ago. Still, apart from me, there’s also the OP of that other discussion I found. So we seem to be at least two people in the world with some interest in this.

On a more serious note, a lot of stuff has entered those higher regions of the Unicode standard only in the last few years, and more is bound to follow. It will be some time before font makers fully support these new characters, and perhaps also before application designers and users discover them. But, presumably, interest in displaying things that take more than 16 bits will only increase over time.

Marlin wrote:

By the way, I kind of wish that the functions for handling strings of 16-bit code units had also required replacements with new names. That’s the functions with ‘UNICODE’, horribly misleading, in their names.

Just realized: not only can compatibility after a name change, in this case, easily be retained through preprocessor macros; that precise trick has already been used, in SDL_ttf.h, for previous name changes.

Which promptly had me replace every “UNICODE” in the function names with the much more accurate “UTF16” instead. (It was only UCS-2 before, but with my update it is UTF-16.)
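
For instance, something along these lines in SDL_ttf.h keeps old source code compiling unchanged (the UTF16 names being my renames, of course, not the official API, and this is only a few of the affected functions):

Code:
/* Old, misleadingly named entry points kept as plain macro aliases
   for source compatibility, in the same spirit as the existing
   compatibility macros in SDL_ttf.h: */
#define TTF_SizeUNICODE            TTF_SizeUTF16
#define TTF_RenderUNICODE_Solid    TTF_RenderUTF16_Solid
#define TTF_RenderUNICODE_Shaded   TTF_RenderUTF16_Shaded
#define TTF_RenderUNICODE_Blended  TTF_RenderUTF16_Blended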

Much nicer! :)

Ultimately, they do render glyphs, of course, not really denying that, but so do all the string rendering functions too (with no ‘glyphs’ in their names). I’m not saying the names for the existing single-character/glyph rendering functions are horrible… but… since versions with new names are in any case required, if they are to handle Unicode beyond 16 bits, why not go with the pattern of the string functions and name the new versions after the type of user input? (Which is a Unicode character/code point.)

Though, like I said, I can live with other name ideas for the new 32-bit versions of these functions. :)

Glyph is actually more accurate than character, especially when you’re concerned with fonts and/or Unicode compatibility.

A “character” can mean many things, depending on the way you think about things. If you’re strictly a programmer, you think of it as an integer representing text data (char, wchar_t, char16, char32, etc.), or possibly a code point. If you’re a linguist, you think of a grapheme, the smallest unit of written text. If you make fonts, you think of a glyph, the image of a grapheme or meaningful set of graphemes. Additionally, the appropriate glyph of a grapheme may change contextually (based on surrounding graphemes), in case you didn’t know that.

So Glyph is the correct term. It prevents ambiguity. Although I question the utility of the single-glyph functions as implemented, just because of that last reason.

While you’re talking about updating SDL_TTF, though, I’d also like to point out that SDL could technically use RenderTargets to render directly to a texture, and could even make these extremely fast for the hardware renderers by processing FT_Outline with FT_Outline_Decompose for each glyph and taking advantage of vertex shaders.
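
For what it’s worth, a rough sketch of the FT_Outline_Decompose part of that idea (callback bodies omitted; a hardware renderer would accumulate the reported segments into vertex data for the shaders):

Code:
#include <ft2build.h>
#include FT_FREETYPE_H
#include FT_OUTLINE_H

/* Each callback receives one outline segment of the glyph. */
static int move_to(const FT_Vector *to, void *user) { return 0; }
static int line_to(const FT_Vector *to, void *user) { return 0; }
static int conic_to(const FT_Vector *ctl, const FT_Vector *to,
                    void *user) { return 0; }
static int cubic_to(const FT_Vector *c1, const FT_Vector *c2,
                    const FT_Vector *to, void *user) { return 0; }

static void decompose_current_glyph(FT_Face face)
{
    FT_Outline_Funcs funcs = { move_to, line_to, conic_to, cubic_to, 0, 0 };
    /* Assumes the glyph was loaded with FT_LOAD_NO_BITMAP so that
       face->glyph->outline actually holds vector data. */
    FT_Outline_Decompose(&face->glyph->outline, &funcs, NULL);
}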

Nathaniel J Fries wrote:

Glyph is actually more accurate than character, especially when you’re concerned with fonts and/or Unicode compatibility.

Really? Even though these functions do not take any glyphs as input at all?

Nathaniel J Fries wrote:

A “character” can mean many things, depending on the way you think about things. If you’re strictly a programmer, …

Why the ‘if’? When calling an SDL library function, you are a programmer, at least to some extent, aren’t you?

Nathaniel J Fries wrote:

…you think of it as an integer representing text data (char, wchar_t, char16, char32, etc), …

But these are merely integer data types for holding characters or code points. I doubt many programmers will lose track of that distinction. (Besides, speaking for myself, I more often think of a character as a sequence of UTF-8 code units anyway.)

Nathaniel J Fries wrote:

…or possibly a code point.

Now we’re talking. Except that I would scratch that overcautious ‘possibly’. I feel fairly confident that what’s ultimately in the mind of basically any programmer thinking of characters is precisely the concept of code points, nothing else (whether or not he/she has even heard that term for it).

Nathaniel J Fries wrote:

If you’re a linguist, you think of a grapheme, the smallest unit of written text. If you make fonts, you think of a glyph, the image of a grapheme or meaningful set of graphemes. Additionally, the appropriate glyph of a grapheme may change contextually (based on surrounding graphemes), in case you didn’t know that.

Thanks for the lecture, but, first, I doubt any terminology in this area will confuse someone who is both a linguist and a programmer making use of SDL. Second, I fail to see what your point could possibly have been in bringing up contextual forms (such as those found in Arabic script). The SDL_ttf rendering functions don’t support even basic right-to-left writing, much less contextual forms. If they did, though, it would, as far as I can see, only serve to stress the lack of correspondence between glyphs and characters (/code points), further undermining the argument you are trying to make.

Nathaniel J Fries wrote:

So Glyph is the correct term. It prevents ambiguity.

Huh? And how, again, does anything of what you just said lead up to that conclusion?

The strictly correct term is code point, nothing else, if you want to nitpick. That’s the numeric value passed as the input parameter to the SDL_ttf functions we are discussing.

A code point can, however, be assigned to an (abstract) character (mapped to the code point), the resulting association being called an encoded character. The Unicode standard even states that “informally [an encoded character] can be thought of as an abstract character taken together with its assigned code point”. In other words, the character can refer to its assigned code point or vice versa. Which, again, I’m pretty sure is precisely the habit of most programmers who are at all dealing with textual data.

A “character” is incidentally also what the input parameter to the discussed functions is already said to be, in the existing function documentation! So the SDL people too, like everybody else involved in programming, are unsurprisingly referring to code points as characters.

Glyphs, by contrast, do not generally correspond to code points. So glyph is incorrect, no ambiguity there.

Some Unicode terminology here (http://www.unicode.org/glossary/).

Also, semantics aside, you leave the question hanging as to what other new names you’d like to see for the 32-bit versions of these functions.

Nathaniel J Fries wrote:

While you’re talking about updating SDL_TTF, though, I’d also like to point out that SDL could technically use RenderTargets to render directly to a texture, and could even make these extremely fast for the hardware renderers by processing FT_Outline with FT_Outline_Decompose for each glyph and taking advantage of vertex shaders.

Thanks for the tip. I’ll look into it. Except that there doesn’t seem to be much interest in updating the library in the first place. (Which, again, is kind of understandable, I guess.)