SDL 1.3: UTF-8 vs UTF-16 vs UTF-32?

In SDL 1.3 we’ll be doing a few things to improve internationalized input:
(all subject to discussion and real-world testing)

  • The SDLK_* keysym value will be defined as the unmodified (i.e., without shift, etc.) Unicode value for printable keys, and as constants in a special range for non-printable keys.
  • The keysym state array will no longer be exposed; instead you'll have to call a new API function to explicitly query a key state.
  • The constant SDLK_LAST will no longer exist.
  • The 'unicode' field of the SDL keysym will no longer exist.
  • Text input will be handled through a set of new messages: SDL_CHAR, SDL_PRECOMPOSED_CHAR (plus other IME messages).

The question is, for Unicode input messages, should the data be UTF-8,
UTF-16, or UTF-32? It's very possible for a single message to contain a
string of multiple characters, because of the way text composition works.
In the same way, should the call to retrieve the name of a key return
UTF-8 text or UTF-16/32 characters?

-Sam Lantinga, Senior Software Engineer, Blizzard Entertainment
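
To make the trade-off concrete: if the text messages carry UTF-8, an application that wants individual codepoints has to walk the byte sequence itself. Here is a minimal, illustrative decoder (plain C, no SDL dependency; the sample string is arbitrary):

```c
#include <stdio.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s, store the codepoint in *cp and
   return the number of bytes consumed (0 on a malformed lead byte).
   Validation is kept minimal for brevity. */
static int utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {                  /* 1 byte:  U+0000..U+007F    */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) { /* 2 bytes: U+0080..U+07FF    */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    } else if ((s[0] & 0xF0) == 0xE0) { /* 3 bytes: U+0800..U+FFFF    */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    } else if ((s[0] & 0xF8) == 0xF0) { /* 4 bytes: U+10000..U+10FFFF */
        *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12) |
              ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;
}

int main(void)
{
    /* "héllo" plus U+4E2D, as several characters might arrive in one message */
    const unsigned char msg[] = "h\xC3\xA9llo \xE4\xB8\xAD";
    const unsigned char *p = msg;
    uint32_t cp;
    int n;

    while (*p && (n = utf8_decode(p, &cp)) > 0) {
        printf("U+%04X\n", (unsigned)cp);
        p += n;
    }
    return 0;
}
```

The equivalent walk over UTF-16 needs surrogate-pair handling, and UTF-32 needs none, which is essentially what the encoding question comes down to on the application side.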

While UTF-8 can handle all of Unicode and gives you on-the-fly compression, why use multi-byte types?


And for UTF-16/32: Should they be little endian, big endian or
platform specific?
AFAIK, Windows and Java use UTF-16LE.

Do we get new dependencies when using wide strings (i.e. which version of
stdlib already supports those wide string functions)?

Will there be support for converting into locale encoding
(very handy for most platforms)?

What encoding will be used for the SDLK_* keysyms
(so far used to get the key name)? An enum type is more suitable for
UTF-16/32 … how would that be compatible with UTF-8?
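
For what it's worth, a keysym that is just a 32-bit Unicode codepoint (an ordinary enum-sized integer) converts to UTF-8 with a few shifts and masks, and the result is a byte sequence with no endianness to worry about. A small illustration:

```c
#include <stdio.h>
#include <stdint.h>

/* Encode one Unicode codepoint (e.g. a keysym defined as the key's Unicode
   value) as UTF-8.  Returns the number of bytes written to out[]. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0x00E4, buf);    /* U+00E4, 'a' with diaeresis */

    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);
    printf("\n");                           /* prints: C3 A4 */
    return 0;
}
```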

Johannes

< http://libufo.sourceforge.net > The OpenGL GUI Toolkit

Olivier Delannoy wrote:

While UTF-8 can handle all of Unicode and gives you on-the-fly compression, why use multi-byte types?

Yes, plus you can use the usual char* instead of (very platform-specific)
wchar_t* types.

--
Alexander Ellwein

It’s also very easy to convert between ISO-8859-1(5?) and UTF-8.
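
Indeed, ISO-8859-1 maps byte-for-byte onto the first 256 Unicode codepoints, so the conversion is only a couple of lines (ISO-8859-15 additionally needs a small lookup for the handful of positions it redefines, such as the euro sign). A sketch:

```c
#include <stddef.h>

/* Convert ISO-8859-1 text to UTF-8.  Each Latin-1 byte is the Unicode
   codepoint of the same value, so bytes below 0x80 pass through unchanged
   and the rest become a two-byte sequence.  'out' needs room for
   2 * len + 1 bytes in the worst case. */
static void latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            *out++ = in[i];
        } else {
            *out++ = (unsigned char)(0xC0 | (in[i] >> 6));
            *out++ = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    *out = '\0';
}
```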

Which normalization form will be used?
NFC (precomposed) or NFD (decomposed) characters?

Johannes

< http://libufo.sourceforge.net > The OpenGL GUI Toolkit

In what way would it be more difficult to convert to UTF-16/32?

Johannes

< http://libufo.sourceforge.net > The OpenGL GUI Toolkit

On Friday, 3 February 2006 17:54, Mikael Eriksson wrote:

It's also very easy to convert between ISO-8859-1(5?) and UTF-8.

On Fri, 2006-02-03 at 08:20 -0800, Sam Lantinga wrote:

The question is, for Unicode input messages, should the data be UTF-8,
UTF-16, or UTF-32? It's very possible for a single message to contain a
string of multiple characters, because of the way text composition works.
In the same way, should the call to retrieve the name of a key return
UTF-8 text or UTF-16/32 characters?

I would go with whatever wchar_t is on the system you are running on.
That way the existing wide-character support functions will work with
SDL. That will be most convenient for C/C++ programmers, and is likely to
work well with the other languages and utilities on a system, because it
will usually match the system locale.

OTOH, if we must ignore language and system support for i18n, then I
would go with UTF-32 for all purposes. It is simply the easiest to work
with and will cause the fewest surprises for programmers.
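
The catch others have pointed out is that wchar_t itself is not one thing: it holds UTF-16 code units on Windows but is typically a 32-bit type on Unix-like systems, so "whatever wchar_t is" changes the payload type per platform. Easy to check:

```c
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* 2 on Windows (UTF-16 code units), typically 4 on Linux/macOS (UTF-32),
       so a wchar_t-based API carries a different type on each platform. */
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    return 0;
}
```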

Bob Pendleton

Which normalization form will be used?
NFC (precomposed) or NFD (decomposed) characters?

I'm not sure. Probably precomposed; I'd have to check what IMEs
typically provide.
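
The difference matters to applications that compare or hash input text, because the two forms are distinct byte sequences for the same visible character. For example, with 'é':

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The same visible character in the two canonical forms:
       NFC: U+00E9 (precomposed e-acute)        -> UTF-8 bytes C3 A9
       NFD: U+0065 U+0301 (e + combining acute) -> UTF-8 bytes 65 CC 81 */
    const char *nfc = "\xC3\xA9";
    const char *nfd = "\x65\xCC\x81";

    printf("NFC is %zu bytes, NFD is %zu bytes, byte-equal: %s\n",
           strlen(nfc), strlen(nfd), strcmp(nfc, nfd) == 0 ? "yes" : "no");
    return 0;
}
```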

-Sam Lantinga, Senior Software Engineer, Blizzard Entertainment

It would be nice if there were position-based keysyms as well. Emulators
for older computers especially would benefit from this, where the original
keyboard layout is often different from modern PC keyboards. Preserving the
original keyboard layout of the emulated machine can be important for maximum
compatibility, and also for giving the right "feel" =). Without support for
this in SDL, emulators have to make their own keyboard mapping files for all
keyboard layouts they want to support (which usually means that people from
smaller countries have to make their own mapping files).

Regular games could benefit from this too, for having a sane default
position-based config. For example, an FPS could have a W-S-A-D-type default
setting, always positioned where it would be on a QWERTY keyboard. A game
supporting multiple players on one keyboard could have a couple of settings
designed to avoid ghost-key/key-clash issues.

Coupled with a way to get the name of a key given a position (for telling
the player which keys do what in the default config, etc.), I think this
could be good.

(As for the UTF question, my vote is for UTF-8.)

- Gerry

On Fri, 03 Feb 2006 17:20:58 +0100, Sam Lantinga wrote:
  • The SDLK_* keysym value will be defined as the unmodified (shift, etc.)
    Unicode value of printable keys, and constants in a special range for
    non-printable keys.
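
As a rough illustration of the position-based idea (every name below is hypothetical; nothing like this exists in SDL today): a game would bind actions to physical key positions and ask separately for the layout-dependent label to show the player.

```c
#include <stdio.h>

/* Purely hypothetical sketch: none of these names exist in SDL.  A position
   code identifies the physical key (named here after its QWERTY cap), and a
   separate, layout-aware lookup would supply the label shown to the player. */
typedef enum {
    POS_W, POS_A, POS_S, POS_D      /* physical positions, layout-independent */
} KeyPosition;

typedef struct {
    KeyPosition pos;
    const char *action;
} Binding;

/* Default FPS bindings by position: on an AZERTY keyboard these same physical
   keys carry the caps Z, Q, S, D, but the hand position stays the same. */
static const Binding default_bindings[] = {
    { POS_W, "move forward" },
    { POS_S, "move back" },
    { POS_A, "strafe left" },
    { POS_D, "strafe right" },
};

int main(void)
{
    size_t i;

    for (i = 0; i < sizeof(default_bindings) / sizeof(default_bindings[0]); i++)
        printf("position %d -> %s\n",
               (int)default_bindings[i].pos, default_bindings[i].action);
    return 0;
}
```

A hypothetical key-name-for-position lookup would then return "W" on QWERTY and "Z" on AZERTY for the config screen.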

It would be nice if there were position-based keysyms as well.

Aside from scancode, which is keyboard vendor specific, I don’t think
there’s a portable way to do this. Does anyone know differently?

-Sam Lantinga, Senior Software Engineer, Blizzard Entertainment

Sam Lantinga wrote:

The question is, for Unicode input messages, should the data be UTF-8,
UTF-16, or UTF-32? It's very possible for a single message to contain a
string of multiple characters, because of the way text composition works.
In the same way, should the call to retrieve the name of a key return
UTF-8 text or UTF-16/32 characters?

-Sam Lantinga, Senior Software Engineer, Blizzard Entertainment

UTF-8/UTF-16/UTF-32 is not Unicode; it's a memory-representation encoding :)

When returning a single character, just return the associated codepoint. A
codepoint is just a simple number, after all. Encode it in a 32-bit value,
since that is enough to represent all Unicode codepoints. After that, the
user can convert the result to a string format (UTF-8 char *, UTF-16LE
wchar_t * for those using it, etc.) himself as needed. That way you avoid
all endianness issues, as long as the user doesn't just memory-copy the
result into a string (which is bad anyway).
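
Going from such a 32-bit codepoint to UTF-16 code units is indeed only a few lines; endianness only enters the picture if the result is then serialized to bytes. A small illustration:

```c
#include <stdio.h>
#include <stdint.h>

/* Expand one 32-bit codepoint into UTF-16 code units; codepoints above
   U+FFFF become a surrogate pair.  Returns the number of code units.
   The units are plain integers here, so byte order only matters if the
   caller later serializes them. */
static int to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate  */
    return 2;
}

int main(void)
{
    uint16_t u[2];
    int i, n = to_utf16(0x1D11E, u);    /* U+1D11E, musical G clef */

    for (i = 0; i < n; i++)
        printf("%04X ", u[i]);
    printf("\n");                       /* prints: D834 DD1E */
    return 0;
}
```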

Aside from scancode, which is keyboard vendor specific, I don’t
think there’s a portable way to do this. Does anyone know
differently?

I can at any time replace my qwerty keyboard with a dvorak keyboard,
or even a chording keyboard with a total of seven keys, without having
to reconfigure or even notify the operating system. So, in the most
general case: No.

b

Yeah, the only thing I can think of is that in the game you could have the
person choose which keyboard type they have from a list of a few common
keyboard types.


Hello !


It should be UTF-8. FLTK, for example, is a GUI toolkit that uses UTF-8,
and they have no problems with it.

CU

Hello !

It would be nice if there were position-based keysyms as well.

Something like this would be cool, but I think it would be oversized for a
lib called "Simple DirectMedia Layer".

Once input is working right in SDL, a coder can write a config menu that
highlights a button and then lets you press whatever key that button should
be on your keyboard. So if only one German/English/Chinese user does this,
every other German/… user can just reuse the saved data.

CU

I strongly agree with Christophe. Unicode Transformation Formats are
meant for encoding strings, not single characters. The returned
character should be a simple 32-bit integer, not encoded in any way.
If he needs to, the developer can easily translate it himself (it's
easy, really) to whatever UTF (or UCS) encoding he requires (or, if
he's lazy, he can simply drop the upper 24 bits and use it as if it
were Latin-1). Everything else is too high-level for SDL.

On 2/3/06, Christophe Cavalaria <chris.cavalaria at free.fr> wrote:

UTF-8/UTF-16/UTF-32 is not Unicode; it's a memory-representation encoding :)

- SR

So, uh, would this affect an SDL app's ability to handle, say, Arabic?

(Total newbie, BTW :) )

On Fri, Feb 03, 2006 at 06:22:13PM -0500, Simon Roby wrote:

I strongly agree with Christophe. Unicode Translation Formats are
meant for encoding strings, not single characters.


-bill!
bill at newbreedsoftware.com
http://www.newbreedsoftware.com/
Tux Paint 2006 wall calendar, CDROM, bumper sticker & apparel:
http://www.cafepress.com/newbreedsw

Simon Roby wrote:

I strongly agree with Christophe. Unicode Transformation Formats are
meant for encoding strings, not single characters. The returned
character should be a simple 32-bit integer, not encoded in any way.

The problem (if I understand Lantinga's original message) is that some
IMEs may send more than a single character at once.

Take a look at toolkits with i18n support, like GTK+:
http://developer.gnome.org/doc/API/2.0/gdk/gdk-Event-Structures.html#GdkEventKey
http://developer.gnome.org/doc/API/2.0/gdk/gdk-Keyboard-Handling.html

Looks like they used the "string" member sometime in the past, but it's
flagged as deprecated. The docs suggest that actual text input should be
handled at a higher level (GtkIMContext) than the key-press events. Maybe
the GTK+ developers could give some suggestions on the new i18n support
for SDL.

--
Daniel K. O.

I think keyboard events that send more than one character should
simply emit more than one event.

On 2/3/06, Daniel K. O. <danielko.listas at gmail.com> wrote:

The problem (if I understand Lantiga’s original message) is that some
IMEs may send more than a single char at once.

- SR