Iconv type WCHAR_T

Hello,

One of my programs depends heavily on an iconv implementation.
Most implementations support the WCHAR_T encoding, and I need that.

Attached is a suggestion for how it could be implemented in SDL.

Attention: the code is untested!
It depends on stddef.h being included. Can I rely on that?

AKFoerster
-------------- next part --------------
A non-text attachment was scrubbed…
Name: SDL_iconv.patch
Type: text/x-diff
Size: 1166 bytes
Desc: not available
URL: http://lists.libsdl.org/pipermail/sdl-libsdl.org/attachments/20070807/5dcdcb03/attachment.patch

one of my programs heavily depends on a iconv implementation.
Most of them support the type WCHAR_T and I need that.

Here is a suggestion of how it could be implemented in SDL as
attachment.

It seems like a reasonable patch, although you should probably use
UCS2 and UCS4 instead of UTF16 and UTF32, which are encodings of UCS
characters.

Attention: the code is untested!

Can you test it and post a followup?

Thanks!
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

one of my programs heavily depends on a iconv implementation. Most of them support the
type WCHAR_T and I need that.

Here is a suggestion of how it could be implemented in SDL as attachment.

It seems like a reasonable patch, although you should probably use UCS2 and UCS4 instead
of UTF16 and UTF32, which are encodings of UCS characters.

I chose ENCODING_UTF32NATIVE and ENCODING_UTF16NATIVE because they are defined for the
native endianess. Is that also guaranteed for UCS-2 / UCS-4?

When I use “UCS-4” with my system’s native iconv implementation, I get the wrong byte order.
I have to specify “UCS-4LE” to get it working. That is why I would prefer to use “WCHAR_T”,
because it is more portable, but it does not work with SDL’s iconv implementation… So it is
really a mess right now. :frowning:
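
A minimal sketch of the idea in C, assuming wchar_t holds Unicode code units in native byte order (true on Linux and Windows, but not guaranteed by the C standard). The helper names `is_little_endian` and `wchar_encoding_name` are illustrative only and are not part of SDL or of the attached patch; the explicit LE/BE encoding names are ones that common iconv implementations accept.

```c
#include <stddef.h>
#include <wchar.h>

/* Detect the byte order of the running machine at run time. */
static int is_little_endian(void)
{
    const unsigned short probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Hypothetical helper: pick an explicit encoding name that matches
 * wchar_t, ASSUMING wchar_t stores Unicode in native byte order.
 * Returns NULL when the width of wchar_t is neither 2 nor 4 bytes. */
static const char *wchar_encoding_name(void)
{
    const int le = is_little_endian();

    switch (sizeof(wchar_t)) {
    case 2:
        return le ? "UTF-16LE" : "UTF-16BE";
    case 4:
        return le ? "UTF-32LE" : "UTF-32BE";
    default:
        return NULL;
    }
}
```

Naming the byte order explicitly avoids the ambiguity of a bare “UCS-4”, at the cost of hard-coding the assumption that wchar_t is Unicode at all.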

BTW. if you are curious for which project I need it:
http://akfoerster.de/akfavatar/
(The address will most probably change in the future)

Attention: the code is untested!
Can you test it and post a followup?

Better not wait for it. ;-)

AKFoerster
On Tuesday, 07 Aug 2007, Sam Lantinga wrote:

Is this related to NLS and gettext? Those do not work on small
systems with uClibc.

matt
On Tue, 7 Aug 2007 20:06:12 +0200 “Andreas K. Foerster” wrote:

On Tuesday, 07 Aug 2007, Sam Lantinga wrote:

[iconv]

is this related to nls and gettext ?

It is not related to gettext.
It is for character set conversion,
for example converting Latin-1 or UTF-8 to UCS-4.

The iconv API is required by the Single UNIX Specification (SUS), so
most modern systems have a native implementation.
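
For readers who have not used the API, here is a small sketch of such a conversion with the POSIX calls `iconv_open`, `iconv`, and `iconv_close`. The function name `latin1_to_utf8` is made up for illustration, and the sketch assumes the system's iconv knows the names "ISO-8859-1" and "UTF-8". On some systems you have to link with `-liconv`.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated Latin-1 string to UTF-8.
 * Returns the number of bytes written (without the trailing NUL),
 * or (size_t)-1 on any error, including a too-small output buffer. */
static size_t latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    iconv_t cd;
    char *src = (char *)in;        /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t in_left = strlen(in);
    size_t out_left;
    size_t rc;

    if (out_size == 0)
        return (size_t)-1;
    out_left = out_size - 1;       /* keep one byte for the NUL */

    cd = iconv_open("UTF-8", "ISO-8859-1");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    *dst = '\0';
    return (size_t)(dst - out);
}
```

SDL's replacement offers the same three steps under the names `SDL_iconv_open`, `SDL_iconv`, and `SDL_iconv_close`.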

those do not work on small systems with uclibc.

That is why SDL has a small replacement implementation for iconv.
That is what we are talking about.

BTW, Windows does not have the iconv API either.

AKFoerster
On Tuesday, 07 Aug 2007, matt wrote:

Uh, either me or you is missing something: as far as I know, UCS4 is Unicode
encoded with 4 bytes/octets per codepoint, i.e. in 32-bit integers. UTF-32
would be exactly the same.

Looking at the code, I’d like to add that wchar_t maps to UCS4 on e.g. Linux
(probably many POSIX systems, too) and to UTF-16 (yes, not UCS2!) on e.g.
win32 since at least NT5 (including CE). On earlier win32 systems it used to
be UCS2 (ISO 10646 or some such), but that is valid UTF-16, too, so using
UTF-16 never hurts.

Other than that, I seem to remember one Unix system where the encoding of
wchar_t was locale-dependent, so it could be switched to some Far East
charset…

Uli
On Tuesday 07 August 2007 16:24:55 Sam Lantinga wrote:

one of my programs heavily depends on a iconv implementation.
Most of them support the type WCHAR_T and I need that.

Here is a suggestion of how it could be implemented in SDL as
attachment.

It seems like a reasonable patch, although you should probably use
UCS2 and UCS4 instead of UTF16 and UTF32, which are encodings of UCS
characters.

Uh, either me or you is missing something: as far as I know, UCS4 is Unicode
encoded with 4 bytes/octets per codepoint, i.e. in 32-bit integers. UTF-32
would be exactly the same.

Yes, with the current Unicode code space UCS4 and UTF-32 are equivalent,
although I believe UTF-32 has provision for multiple 4 byte units encoding
a single codepoint.

Looking at the code, I’d like to add that wchar_t maps to UCS4 on e.g. Linux
(probably many POSIX-systems, too) and to UTF-16 (yes, not UCS2!) on e.g.
win32 since at least NT5 (including CE). On earlier win32 systems is used to
be UCS2 (ISO10646 or somesuch), but that is valid UTF-16, too, so using
UTF-16 never hurts.

Yes, although UTF-16 can’t represent all of the Unicode space, which it
seems you have a pretty good understanding of.

Other than that, I seem to remember at one Unix system where the encoding of
wchar_t was locale-dependant, so it could be switched to some far east
charset…

Hmm, as far as I know it’s always UCS4.

See ya,
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

Uh, either me or you is missing something: as far as I know, UCS4 is
Unicode encoded with 4 bytes/octets per codepoint, i.e. in 32-bit
integers. UTF-32 would be exactly the same.

Yes, with the current Unicode code space UCS4 and UTF-32 are equivalent,
although I believe UTF-32 has provision for multiple 4 byte units encoding
a single codepoint.

The Unicode Consortium even restricted UTF-8 to at most four bytes
carrying at most 21 bits of payload, even though the scheme could theoretically
encode 31 bits using 6 bytes. Still, those limits are currently far from being
reached; after all, one more bit already doubles the number of available codepoints.
I’m expecting time_t to go 64 bits before those barriers are broken. :wink:
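
To make the byte counts concrete, here is a rough sketch of a UTF-8 encoder limited to the current Unicode range (U+0000 to U+10FFFF). The four-byte form carries 3 + 6 + 6 + 6 = 21 payload bits. The function name `utf8_encode` is illustrative and not taken from SDL.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or lies above U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 7 payload bits */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 5 + 6 = 11 bits */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 4 + 6 + 6 = 16 bits */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 3 + 6 + 6 + 6 = 21 bits */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```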

Looking at the code, I’d like to add that wchar_t maps to UCS4 on e.g.
Linux (probably many POSIX-systems, too) and to UTF-16 (yes, not UCS2!)
on e.g. win32 since at least NT5 (including CE). On earlier win32 systems
is used to be UCS2 (ISO10646 or somesuch), but that is valid UTF-16, too,
so using UTF-16 never hurts.

Yes, although UTF-16 can’t represent all of the Unicode space, which it
seems you have a pretty good understanding of.

No, it’s vice versa: UCS2 can only encode codepoints up to 65535 (the Basic
Multilingual Plane, I think it is called). UTF-16 can encode the whole
Unicode range using ‘surrogate pairs’, i.e. encoding a single codepoint in
two wchar_t, needed for IIRC Thai scripts.
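
A short sketch of the surrogate-pair scheme may help here. Code points up to U+FFFF fit in one 16-bit unit; anything above is split into a high and a low surrogate. Note that Thai itself (U+0E00 to U+0E7F) lies inside the Basic Multilingual Plane, so the example uses U+1D11E (MUSICAL SYMBOL G CLEF) instead. The function name `utf16_encode` is illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * itself a surrogate or lies above U+10FFFF. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                          /* reserved for surrogates */
    if (cp > 0x10FFFF)
        return 0;                          /* outside the Unicode range */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;             /* BMP: same as UCS2 */
        return 1;
    }
    cp -= 0x10000;                         /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

With this, U+0E01 (a Thai letter) comes out as the single unit 0x0E01, while U+1D11E becomes the pair 0xD834 0xDD1E.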

cheers and thumbs up

Uli
On Saturday 11 August 2007 04:41:43 Sam Lantinga wrote:

No, it’s vice versa: UCS2 can only encode codepoints up to 65535 (the Basic
Multilingual Plane, I think it is called). UTF-16 can encode the whole
Unicode range using ‘surrogate pairs’, i.e. encoding a single codepoint in
two wchar_t, needed for IIRC Thai scripts.

Yes, that’s what I meant. :slight_smile:
My understanding is that the 16-bit WCHAR is UCS2 because it doesn’t support
the surrogate pair encoding of UTF-16.

Cheers!
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

Wait: which WCHAR are you talking about? If it is the one used in the win32
API, it really depends on the particular system or rather the code working
with the WCHAR, because since it is 16 bits it can encode both UCS2 and
UTF-16, of course. Now, modern variants of MS Windows do really use UTF-16,
though I can’t tell from which point they started doing so.

However, if you mean a general wchar_t that has 16 bits, the same applies.
Whether these encode UTF-16 or UCS2 or perhaps something totally different is
up to the system and how it uses them. However, I’m not aware of any
system other than win32 that uses a 16-bit wchar_t.

Just for the record: the ranges of 16 bit words that UTF-16 uses to encode
surrogate pairs are not used by UCS2, so valid UCS2 is always valid UTF-16.
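
That observation can be turned into a simple check, sketched below: a buffer of 16-bit units is plain UCS2 exactly when none of its units falls into the surrogate range 0xD800 to 0xDFFF. The function names are hypothetical and not part of SDL.

```c
#include <stddef.h>
#include <stdint.h>

/* True if the 16-bit unit lies in the range UTF-16 reserves for surrogates. */
static int is_surrogate(uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDFFF;
}

/* Returns 1 if the buffer contains no surrogates, i.e. it reads the
 * same whether interpreted as UCS2 or as UTF-16; returns 0 otherwise. */
static int utf16_is_plain_ucs2(const uint16_t *units, size_t count)
{
    size_t i;

    for (i = 0; i < count; ++i) {
        if (is_surrogate(units[i]))
            return 0;
    }
    return 1;
}
```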

regards

Uli
On Saturday 11 August 2007 16:43:46 Sam Lantinga wrote:

No, it’s vice versa: UCS2 can only encode codepoints up to 65535 (the
Basic Multilingual Plane, I think it is called). UTF-16 can encode the
whole Unicode range using ‘surrogate pairs’, i.e. encoding a single
codepoint in two wchar_t, needed for IIRC Thai scripts.

Yes, that’s what I meant. :slight_smile:
My understanding is that the 16-bit WCHAR is UCS2 because it doesn’t
support the surrogate pair encoding of UTF-16.

Wait: which WCHAR are you talking about? If it is the one used in the win32
API

Yes, that’s the one I’m talking about - the one that the *W API functions take
as a parameter.

In any case, it looks like UCS2, WCHAR, and UTF-16 are all equivalent for
all languages other than Thai.

I’ve forgotten the original point of all this, so I’ll stop now. :slight_smile:

See ya,
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment

Attention: the code is untested!
Can you test it and post a followup?

Better don’t wait for it. :wink:

I’ll wait for it. Given the amount of confusion and discussion I’d rather
wait for somebody to actually test it. :slight_smile:

Thanks!
-Sam Lantinga, Lead Software Engineer, Blizzard Entertainment