SDL_iconv UCS-4 endianness

Christian_Walther · September 25, 2007, 5:36pm

I just discovered an inconsistency with SDL_iconv: SDL’s internal iconv
implementation (used by SDL_iconv on Mac OS X) interprets “UCS-4” as
native-endian, while GNU libc iconv (used by SDL_iconv on Linux)
interprets it as big-endian (same as “UCS-4BE”) (and seems to have no
option for native-endian UCS-4). Any idea who’s right (if there even is
a commonly accepted definition of “right”)?

-Christian

Andreas_K_Foerster · October 7, 2007, 6:13pm

I just discovered an inconsistency with SDL_iconv: SDL’s internal iconv
implementation (used by SDL_iconv on Mac OS X) interprets “UCS-4” as
native-endian, while GNU libc iconv (used by SDL_iconv on Linux)
interprets it as big-endian (same as “UCS-4BE”) (and seems to have no
option for native-endian UCS-4). Any idea who’s right (if there even is
a commonly accepted definition of “right”)?

This is undefined. So there is no right or wrong in this case.

If you want more predictable behaviour, use UTF-32BE or UTF-32LE.
These should give the same result in SDL_iconv and in glibc.

(In reply to Christian Walther's message of Tuesday, 25 Sep 2007.)

–
AKFoerster
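
A minimal sketch of the explicit-byte-order approach, using plain POSIX iconv(3) rather than SDL_iconv (per this thread, SDL_iconv uses the GNU libc iconv on Linux). The helper name utf8_to_utf32 is made up for illustration. The sketch assumes the platform's iconv accepts the names "UTF-8", "UTF-32BE" and "UTF-32LE", and that iconv() is declared with a char ** input argument, as in glibc.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to UTF-32
 * with an explicit byte order. `to` is expected to be "UTF-32BE" or
 * "UTF-32LE", so the result does not depend on the host CPU.
 * Returns the number of bytes written to `out`, or (size_t)-1 on failure. */
size_t utf8_to_utf32(const char *to, const char *utf8,
                     unsigned char *out, size_t out_size)
{
    assert(to != NULL && utf8 != NULL && out != NULL);

    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;           /* encoding name not supported here */

    char *in_ptr = (char *)utf8;     /* iconv() takes a non-const pointer */
    size_t in_left = strlen(utf8);
    char *out_ptr = (char *)out;
    size_t out_left = out_size;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;           /* invalid input or buffer too small */
    return out_size - out_left;      /* bytes actually produced */
}
```

With "UTF-32BE" the letter A should come out as the bytes 00 00 00 41, and with "UTF-32LE" as 41 00 00 00, whatever the host byte order is.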

Andreas_K_Foerster · October 7, 2007, 7:16pm

I just discovered an inconsistency with SDL_iconv: SDL’s internal iconv
implementation (used by SDL_iconv on Mac OS X) interprets “UCS-4” as
native-endian, while GNU libc iconv (used by SDL_iconv on Linux)
interprets it as big-endian (same as “UCS-4BE”) (and seems to have no
option for native-endian UCS-4). Any idea who’s right (if there even is
a commonly accepted definition of “right”)?

If you want the “native” encoding for GNU libc,
use the encoding name “WCHAR_T”.
(I still think it would be a good idea to define “WCHAR_T” also in SDL)

(In reply to Christian Walther's message of Tuesday, 25 Sep 2007.)

–
AKFoerster
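
For code that wants native-endian UCS-4 without depending on how a given iconv interprets "UCS-4" or "WCHAR_T", one option is to pick the explicit encoding name that matches the host byte order at run time. A minimal sketch in plain C; native_utf32_name and load_native_u32 are hypothetical helpers, not part of SDL or glibc.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the explicit UTF-32 encoding name whose
 * byte order matches the host, suitable for passing to an iconv_open-style
 * function in place of an ambiguous "UCS-4". */
const char *native_utf32_name(void)
{
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);       /* lowest-addressed byte of the integer */
    return first == 1 ? "UTF-32LE" : "UTF-32BE";
}

/* Hypothetical helper: read one code point from 4 bytes that are already
 * in host byte order, for example the output of a conversion to the
 * encoding named by native_utf32_name(). */
uint32_t load_native_u32(const unsigned char *bytes)
{
    uint32_t value;
    assert(bytes != NULL);
    memcpy(&value, bytes, sizeof value);  /* avoids alignment problems */
    return value;
}
```

The converted buffer can then be read as an array of 32-bit code points, since its byte order is by construction the host's own.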

Christian_Walther · October 8, 2007, 9:17am

list at akfoerster.de wrote:

This is undefined. So there is no right or wrong in this case.

OK, good to know. Thanks.

If you want the “native” encoding for GNU libc,
use the encoding name “WCHAR_T”.
(I still think it would be a good idea to define “WCHAR_T” also in SDL)

Is that guaranteed to be unicode? I seem to remember something about
"locale-dependent encoding" together with wchar_t, but I may be wrong.
Also, isn’t wchar_t 16 bits on Windows (UTF-16 recently, UCS-2 earlier)?
That would make “WCHAR_T” unsuitable for conversion to/from "raw"
unicode code points, which is what I need. (Actually, all I need is to
encode/decode single characters to/from UTF-8, and my current solution
is to just rip the respective parts out of SDL_iconv. Simpler than doing
SDL_iconv_open() etc. every time anyway.)

-Christian
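
For the single-character case described above, a conversion descriptor is not strictly needed. The following is a rough sketch of stand-alone UTF-8 encoding and decoding of one code point. It is not the code from SDL_iconv, and the function names utf8_encode and utf8_decode are made up for illustration. It rejects surrogates, overlong forms and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                /* surrogates are not code points to encode */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                        /* beyond the Unicode range */
}

/* Decode one code point from at most len bytes at s.
 * Returns the number of bytes consumed, or 0 on malformed or truncated input. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs n bytes (index = n). */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t n, i;
    uint32_t v;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (s[0] < 0x80)                { n = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; v = s[0] & 0x07; }
    else return 0;                   /* stray continuation or invalid lead byte */
    if (len < n)
        return 0;
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (v < min_cp[n] || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;                    /* overlong, out of range, or surrogate */
    *cp = v;
    return n;
}
```

For example, U+20AC (the euro sign) encodes to the three bytes E2 82 AC and decodes back to 0x20AC.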

Andreas_K_Foerster · October 9, 2007, 3:14pm

If you want the “native” encoding for GNU libc,
use the encoding name “WCHAR_T”.
(I still think it would be a good idea to define “WCHAR_T” also in SDL)

Is that guaranteed to be unicode? I seem to remember something about
"locale-dependent encoding" together with wchar_t, but I may be wrong.

Okay, I had to look it up again.
In general it is not guaranteed. But for new glibc implementations it
is.

From the glibc documentation:
| But for GNU systems wchar_t is always 32 bits wide and, therefore,
| capable of representing all UCS-4 values and, therefore, covering all
| of ISO 10646. Some Unix systems define wchar_t as a 16-bit type and
| thereby follow Unicode very strictly. This definition is perfectly
| fine with the standard, but it also means that to represent all
| characters from Unicode and ISO 10646 one has to use UTF-16 surrogate
| characters, which is in fact a multi-wide-character encoding. But
| resorting to multi-wide-character encoding contradicts the purpose of
| the wchar_t type.
[…]
| We have said above that the natural choice is using Unicode or ISO
| 10646. This is not required, but at least encouraged, by the ISO C
| standard. The standard defines at least a macro __STDC_ISO_10646__
| that is only defined on systems where the wchar_t type encodes ISO 10646
| characters. If this symbol is not defined one should avoid making
| assumptions about the wide character representation. If the programmer
| uses only the functions provided by the C library to handle wide
| character strings there should be no compatibility problems with other
| systems.
http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html#Extended-Char-Intro
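
The check the manual describes can be sketched as follows. The function names wchar_holds_code_points and wchar_to_code_point are hypothetical; the only standard pieces are the __STDC_ISO_10646__ macro and wchar_t itself. The size test is there because, as the quote notes, some systems use a 16-bit wchar_t.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

/* Returns 1 if each wchar_t can be treated as one complete ISO 10646
 * code point on this system, 0 if no such assumption should be made. */
int wchar_holds_code_points(void)
{
#if defined(__STDC_ISO_10646__)
    return sizeof(wchar_t) >= 4;     /* wide enough for all of UCS-4 */
#else
    return 0;                        /* wide character encoding is unspecified */
#endif
}

/* Convert a wide character to a code point. Only meaningful where the
 * check above succeeds, so the assumption is asserted rather than guessed. */
uint32_t wchar_to_code_point(wchar_t wc)
{
    assert(wchar_holds_code_points());
    return (uint32_t)wc;
}
```

Where the check fails, going through an explicitly named encoding such as UTF-32BE or UTF-32LE is the safer route.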

Also, isn’t wchar_t 16 bits on Windows (UTF-16 recently, UCS-2 earlier)?

As far as I know, yes.

(In reply to Christian Walther's message of Monday, 8 Oct 2007.)

–
AKFoerster