SDL 1.3: UTF-8 vs UTF-16 vs UTF-32?

Bill_Kendrick · February 4, 2006, 1:55am

I think part of the problem is that a certain combination of key presses
can construct a single character, in some locales.

See: Complex text layout - Wikipedia
And: Desktop Technologies -- CTL

This is why it’s been relatively simple for Tux Paint to get translated
to numerous languages (since the translators are in control of the final
strings that I send to SDL_ttf for rendering), but it will be very
difficult to support text entry in those languages (using Tux Paint’s
“Text” tool).

(I just discovered this… sounds quite interesting:
Home | Graphite )

On Fri, Feb 03, 2006 at 08:20:45PM -0500, Simon Roby wrote:

On 2/3/06, Daniel K. O. <danielko.listas at gmail.com> wrote:

The problem (if I understand Lantinga’s original message) is that some
IMEs may send more than a single char at once.

I think keyboard events that send more than one character should
simply emit more than one event.

–
-bill! Tux Paint 2006 wall calendar,
bill at newbreedsoftware.com CDROM, bumper sticker & apparel
http://www.newbreedsoftware.com/ http://www.cafepress.com/newbreedsw

slouken · February 4, 2006, 3:45am

I think keyboard events that send more than one character should
simply emit more than one event.

I think part of the problem is that a certain combination of key presses
can construct a single character, in some locales.

It’s more than that - certain combinations of key presses can produce a single
phrase which can be displayed as a string to the user, and then further input
can modify that phrase before completing it.

-Sam Lantinga, Senior Software Engineer, Blizzard Entertainment

Daniel_K_O · February 4, 2006, 4:50am

Sam Lantinga wrote:

It’s more than that - certain combinations of key presses can produce a single
phrase which can be displayed as a string to the user, and then further input
can modify that phrase before completing it.

Shouldn’t text input be handled on a higher level? (higher than key events)
Maybe “SDL_IME” or “SDL_TextInput” (working on top of SDL, together with
SDL_ttf)?

Or will SDL 1.3 provide some basic user interface functionality? If not,
I think those special events should just be forwarded to another library
that will collect/manage the input information to produce text input -
not affecting the usual “key press event”, just creating a special event
that could be captured somewhere else to build the string the user is
trying to input.

PS: BTW, now that I mentioned GTK+, I think SDL (or SDL_ttf) could use
Pango to do text layout. Those guys already made a lot of effort to make
the text look right in all languages.

PS2: sorry, I’m new to this list, is there a roadmap for SDL 1.3 besides
the TODO in the CVS?

—
Daniel K. O.

Johannes_Schmidt · February 4, 2006, 3:35pm

UTF-8/UTF-16/UTF-32 is not Unicode. It’s a memory representation encoding.

When returning a single char, just return the associated codepoint. A
codepoint is just a simple number after all. Encode it in a 32bit value
since you need that to represent all the Unicode codepoints. After that,
the user will have to convert the result to a string format (UTF-8 char *,
UTF-16LE wchar_t * for those using it, etc.) himself as needed.
That way, you avoid all endianness issues as long as the user doesn’t just
memory copy the result into a string (which is bad anyway).

I strongly agree with Christophe. Unicode Translation Formats are
meant for encoding strings, not single characters. The returned
character should be a simple 32-bit integer, not encoded in any way.

Sorry, just to clarify:
What do you mean by “not encoded”?
Always using the system encoding (as in 1.2)?

If he needs to, the developer can easily translate it himself (it’s
easy, really) to whatever UTF (or UCS) encoding he requires (or if
he’s lazy, he can simply drop the upper 24-bits and use it as if it
were latin1).
[…]
As long as the system encoding is something like ISO8859-1(5) …

Regards,
Johannes

On Saturday 04 February 2006 00:22, Simon Roby wrote:

On 2/3/06, Christophe Cavalaria <chris.cavalaria at free.fr> wrote:

Christophe_Cavalaria · February 4, 2006, 4:05pm

Johannes Schmidt wrote:

Sorry, just to clarify:
What do you mean by “not encoded”?
Always using the system encoding (as in 1.2)?

Unicode defines 2 sets of standards. First, it defines glyph codepoints.
It’s that part of the standard that says that 97 is ‘a’ and that 233 is ‘é’.

Then, it defines the various encodings like UTF-8, UTF-16 etc… That part
of the standard is purely a storage definition. All it says is how you
should write in memory a specific array of numbers. For example, when you
want the string ‘aé’ in UTF-8 format, you must have in memory:
[0x61,0xc3,0xa9]

So, you have :

Codepoint definitions : single char
Storage definitions : strings

When a function must return a single char, you should return a codepoint,
not a length 1 string.
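
To make the codepoint/storage split concrete, here is a minimal C sketch that turns one codepoint into its UTF-8 bytes. The function name codepoint_to_utf8 is made up for illustration and is not part of SDL; it assumes the caller provides a 4-byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an SDL function: writes the UTF-8 form of one
 * Unicode codepoint into out and returns the number of bytes written,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t codepoint_to_utf8(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                       /* surrogates are not characters */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this, codepoint 97 stays the single byte 0x61 and codepoint 233 becomes the two bytes 0xc3 0xa9, which is the byte sequence given for the example string.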

If he needs to, the developer can easily translate it himself (it’s
easy, really) to whatever UTF (or UCS) encoding he requires (or if
he’s lazy, he can simply drop the upper 24-bits and use it as if it
were latin1).
[…]
As long as the system encoding is something like ISO8859-1(5) …

The standard unicode codepoints map directly to latin-1 for numbers < 256.
When a function sends you the codepoint 233, you can easily convert it to
latin-1 by writing the 8bit char 233.
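
One caveat on the "drop the upper 24 bits" shortcut: it only gives the right answer when the codepoint is already below 256. A hedged sketch in C (the helper name is hypothetical, not an SDL function) that refuses anything outside Latin-1 instead of silently truncating:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: stores the Latin-1 byte for cp in *out and returns 1,
 * or returns 0 (leaving *out untouched) if cp has no Latin-1 equivalent. */
int codepoint_to_latin1(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp > 0xFF) {
        return 0;   /* truncating here would yield an unrelated character */
    }
    *out = (unsigned char)cp;   /* codepoints 0..255 equal their Latin-1 bytes */
    return 1;
}
```

For example, blindly truncating the euro sign (codepoint 0x20AC) would give byte 0xAC, which is a different Latin-1 character, so the caller is better off substituting something like '?' when the function returns 0.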

Johannes_Schmidt · February 4, 2006, 5:04pm

[…]

Ahh, many thanks
If we know that we only get a single char, code points seem to be reasonable.

Another question:
What do we do with SDL methods which expect strings, e.g.
SDL_WM_SetCaption?

Are paths always ASCII (I think of SDL_GL_LoadLibrary)?

Regards,
Johannes

On Saturday 04 February 2006 17:05, Christophe Cavalaria wrote:

Johannes Schmidt wrote:

Sorry, just to clarify:
What do you mean by “not encoded”?
Always using the system encoding (as in 1.2)?

Unicode defines 2 sets of standards. First, it defines glyph codepoints.
It’s that part of the standard that says that 97 is ‘a’ and that 233 is ‘é’.

Then, it defines the various encodings like UTF-8, UTF-16 etc… That part
of the standard is purely a storage definition.

slouken · February 4, 2006, 6:26pm

Another question:
What do we do with SDL methods which expect strings, e.g.
SDL_WM_SetCaption?

Right now, on systems that support it, we use UTF-8.
I think we can formalize this, so all strings in 1.3 will be UTF-8,
and we can provide conversion routines to/from UTF-16/32 for internal
and external use.
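
As an illustration of what such a conversion routine might look like, here is a rough UTF-8 to UTF-32 decoder in C. The name utf8_to_utf32 and its error convention are assumptions made for this sketch; nothing here is an actual SDL 1.3 function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routine: decodes the NUL-terminated UTF-8 string src into
 * dst, which has room for cap codepoints. Returns the number of codepoints
 * written, or (size_t)-1 on malformed input or if dst is too small. */
size_t utf8_to_utf32(const char *src, uint32_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*s != 0) {
        uint32_t cp, min;
        int extra;

        /* The lead byte tells us how many continuation bytes follow. */
        if (*s < 0x80)                { cp = *s;        extra = 0; min = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; min = 0x80; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; min = 0x800; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; min = 0x10000; }
        else return (size_t)-1;       /* stray continuation or invalid byte */
        s++;

        while (extra-- > 0) {
            /* Each continuation byte must look like 10xxxxxx; a premature
             * NUL terminator fails this check as well. */
            if ((*s & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*s & 0x3F);
            s++;
        }

        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return (size_t)-1;
        }
        if (n == cap) return (size_t)-1;
        dst[n++] = cp;
    }
    return n;
}
```

A caller holding a UTF-8 window caption could run it through this to get an array of plain codepoints, one 32-bit value per character.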

-Sam Lantinga, Senior Software Engineer, Blizzard Entertainment

Christian_Walther · February 4, 2006, 7:18pm

Gerry JJ wrote:

It would be nice if there was position-based keysyms as well.

I strongly agree with Gerry’s arguments. Here’s what I wrote about it
previously: http://article.gmane.org/gmane.comp.lib.sdl/24697

Sam Lantinga wrote:

Aside from scancode, which is keyboard vendor specific, I don’t think
there’s a portable way to do this. Does anyone know differently?

Just make up SDL’s own set of codes, and have each backend
implementation map from its OS or hardware dependent scancodes to that
set? If that’s not possible on some OSes, they can still fall back to
mapping-by-character.

It’s easy to do on Mac OS X. It’s already done on DirectX. On X11, it’s
a bit more complex - it seems that X-server- and/or hardware-specific
look-up tables would be needed, but this could still be done for the
most common cases (XFree86/Xorg on PC and Mac hardware, at least), with
a graceful fallback to mapping-by-character on other systems.
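
A rough sketch of that idea in C. Every identifier here (PositionCode, ScancodeEntry, translate_key) is hypothetical, and the native codes in the sample table are only illustrative values of the kind a PC backend might see; a real backend would supply its own table, or none at all.

```c
#include <assert.h>
#include <stddef.h>

/* Library-defined position codes, named after the US layout legend. */
typedef enum {
    POS_UNKNOWN = 0,
    POS_US_W, POS_US_A, POS_US_S, POS_US_D
} PositionCode;

/* One row of a backend's native-scancode lookup table. */
typedef struct {
    unsigned native;
    PositionCode pos;
} ScancodeEntry;

/* Illustrative table only; each backend would ship its own. */
static const ScancodeEntry sample_table[] = {
    { 0x11, POS_US_W }, { 0x1E, POS_US_A },
    { 0x1F, POS_US_S }, { 0x20, POS_US_D }
};
static const size_t sample_count = sizeof sample_table / sizeof sample_table[0];

/* Fallback: guess the position from the character the key produced. */
static PositionCode by_character(unsigned ch)
{
    switch (ch) {
    case 'w': return POS_US_W;
    case 'a': return POS_US_A;
    case 's': return POS_US_S;
    case 'd': return POS_US_D;
    default:  return POS_UNKNOWN;
    }
}

/* Use the backend table when there is one; otherwise fall back to
 * mapping-by-character. Pass table == NULL and count == 0 for "no table". */
PositionCode translate_key(const ScancodeEntry *table, size_t count,
                           unsigned native, unsigned ch)
{
    size_t i;

    assert(table != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (table[i].native == native) {
            return table[i].pos;
        }
    }
    return by_character(ch);
}
```

With the table present, a key that produces 'o' under a Dvorak keymap still reports the position of the US 'S' key; without a table, the same key press cannot be placed, and only keys whose characters happen to match are recognised.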

Brian Raiter wrote:

I can at any time replace my qwerty keyboard with a dvorak keyboard,
or even a chording keyboard with a total of seven keys, without having
to reconfigure or even notify the operating system.

On what OS and hardware is that? I have never seen a system where the OS
does not need to be told what’s printed on your keycaps. Or does your
Dvorak keyboard actually move the keys around compared to the Qwerty
keyboard, instead of just relabeling them? If so, I don’t think that’s
the norm. My US keyboard has the keys in the same places as my Swiss
German one (mostly), they’re just labeled differently.

-Christian

David_Olofson · February 4, 2006, 10:58pm

[…]

Brian Raiter wrote:

I can at any time replace my qwerty keyboard with a dvorak
keyboard, or even a chording keyboard with a total of seven keys,
without having to reconfigure or even notify the operating
system.

On what OS and hardware is that? I have never seen a system where
the OS does not need to be told what’s printed on your keycaps. Or
does your Dvorak keyboard actually move the keys around compared to
the Qwerty keyboard, instead of just relabeling them? If so, I don’t
think that’s the norm.

There are hardwired Dvorak keyboards (to avoid relying on third party
proprietary tools on some platforms), but the normal solution is to
just use a different keymap. My custom Dvorak keyboards are just
standard swedish Qwertys with the keycaps moved around, and my layout
is close enough to the swedish “standard” Dvorak that I can use their
DLLs for my occasional Windows sessions. (I’ve just flipped the shift
state of the number keys.)

Anyway, my Dvorak layouts still break 90% of the games that rely on
“WSAD”, regardless of API, at least on X/Linux. Many Windows games
also seem to prefer going by the keymaps instead of raw codes.

However, I actually prefer it that way, as long as the game can be
reconfigured. Games that go by the unmapped scancodes usually say “S”
when I press the “O” key, which isn’t very helpful at all. No big
deal, though …until the game tries to be smart and hints “Press the
S key to …”, and I have to remember where the S key is located on a
Qwerty keyboard.

So, “physical position” key codes would be a very nice thing to have
in some situations, but it should be noted that applications that use
them should never assume anything about what might be printed on the
keycaps. No talking about pressing this or that key, without actually
checking what’s supposed to be printed on it, please.

//David Olofson - Programmer, Composer, Open Source Advocate

.------- http://olofson.net - Games, SDL examples -------.
| http://zeespace.net - 2.5D rendering engine |
| http://audiality.org - Music/audio engine |
| http://eel.olofson.net - Real time scripting |
'-- http://www.reologica.se - Rheology instrumentation --'

On Saturday 04 February 2006 20:18, Christian Walther wrote:

Christian_Walther · February 5, 2006, 1:42pm

David Olofson wrote:

So, “physical position” key codes would be a very nice thing to have
in some situations, but it should be noted that applications that use
them should never assume anything about what might be printed on the
keycaps. No talking about pressing this or that key, without actually
checking what’s supposed to be printed on it, please.

Of course. We’d absolutely need a function “const char
*SDL_KeyDescription(SDL_scancode s)” or so that takes into account the
OS keymap. If a game uses “physical position” codes but translates them
to human-readable descriptions using a hard-coded table, that’s just a
bug, IMHO.

I hope such a function can be implemented on most OSes. If not, we’d
have to resort to populating a table on-the-fly using the data from
incoming key events, which would be a bit of a hack…
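
For what it's worth, the on-the-fly fallback could be sketched like this in C. All names are invented for the example (this is not the proposed SDL_KeyDescription), and it assumes that scancodes fit in a small array and that each key event carries the codepoint it produced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_SCANCODES 256

/* Last codepoint seen for each scancode; 0 means "not seen yet". */
static uint32_t learned[MAX_SCANCODES];

/* Call this for every incoming key event that produced a character. */
void learn_key(unsigned scancode, uint32_t codepoint)
{
    assert(scancode < MAX_SCANCODES);
    if (codepoint != 0) {
        learned[scancode] = codepoint;
    }
}

/* Writes a human-readable description of the key into buf and returns buf.
 * Keys that have not been pressed yet can only be named by number. */
const char *describe_key(unsigned scancode, char *buf, size_t len)
{
    uint32_t cp;

    assert(scancode < MAX_SCANCODES && buf != NULL && len > 0);
    cp = learned[scancode];
    if (cp >= 0x21 && cp <= 0x7E) {
        snprintf(buf, len, "%c", (char)cp);              /* printable ASCII */
    } else if (cp != 0) {
        snprintf(buf, len, "U+%04lX", (unsigned long)cp);
    } else {
        snprintf(buf, len, "key #%u", scancode);
    }
    return buf;
}
```

The weakness is visible in the last branch: until the user has actually pressed a key, the table has nothing to say about it, which is why an OS-backed lookup would be preferable where one exists.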

-Christian

Torsten_Giebl · February 5, 2006, 1:57pm

Hello !

I hope such a function can be implemented on most OSes. If not, we’d
have to resort to populating a table on-the-fly using the data from
incoming key events, which would be a bit of a hack…

For text apps, you normally don't need to know
exactly which key on the keyboard is which;
you just get the Unicode character and do things with it.

For games, it is best to have a settings menu
where you say Fire1 is, for example, Joypad Button 5 or
Unicode character 40. This is standard in today's games
and it is not hard to implement.
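
A minimal sketch of such a settings table in C, assuming a game with two actions. The types and function names are made up for illustration and do not come from SDL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Where an input came from. */
typedef enum { SRC_NONE = 0, SRC_KEY_CHAR, SRC_JOY_BUTTON } InputSource;

/* Game actions; ACT_COUNT doubles as "no action". */
typedef enum { ACT_FIRE1 = 0, ACT_FIRE2, ACT_COUNT } Action;

typedef struct {
    InputSource source;
    uint32_t code;      /* Unicode character or joypad button number */
} Binding;

/* One binding per action; zero-initialised, so everything starts unbound. */
static Binding bindings[ACT_COUNT];

/* Called from the settings menu when the user picks an input for an action. */
void bind_action(Action a, InputSource src, uint32_t code)
{
    assert((int)a >= 0 && a < ACT_COUNT);
    bindings[a].source = src;
    bindings[a].code = code;
}

/* Called from the event loop: returns the action bound to this input,
 * or ACT_COUNT if nothing is bound to it. */
Action lookup_action(InputSource src, uint32_t code)
{
    int i;

    if (src == SRC_NONE) {
        return ACT_COUNT;
    }
    for (i = 0; i < ACT_COUNT; i++) {
        if (bindings[i].source == src && bindings[i].code == code) {
            return (Action)i;
        }
    }
    return ACT_COUNT;
}
```

Saving the bindings array to a file is then all it takes to keep the configuration between sessions.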

In an emulator, I would take a photo of the original keyboard (for example,
an AMIGA keyboard), let the user click each key in the picture,
and then have the user press the key on his own keyboard that
he wants to use for it. Okay, it is a little bit of work,
but this data can be saved, and then other German users can
use it too.

Or do keyboards within one country differ? Do all German
keyboards send the same number for the same key?

German was just an example here; it could just as well be American, and so on.

CU

Christophe_Cavalaria · February 5, 2006, 9:48pm

Torsten Giebl wrote:

Hello !

I hope such a function can be implemented on most OSes. If not, we’d
have to resort to populating a table on-the-fly using the data from
incoming key events, which would be a bit of a hack…

For text apps, you normally don't need to know
exactly which key on the keyboard is which;
you just get the Unicode character and do things with it.

For games, it is best to have a settings menu
where you say Fire1 is, for example, Joypad Button 5 or
Unicode character 40. This is standard in today's games
and it is not hard to implement.

In an emulator, I would take a photo of the original keyboard (for example,
an AMIGA keyboard), let the user click each key in the picture,
and then have the user press the key on his own keyboard that
he wants to use for it. Okay, it is a little bit of work,
but this data can be saved, and then other German users can
use it too.

Or do keyboards within one country differ? Do all German
keyboards send the same number for the same key?

German was just an example here; it could just as well be American, and so on.

CU

There’s another use for such a table. On a key configuration screen, you
should remember the entered scancode and display the corresponding key name
to the user. That function would allow that.