Bug: iconv: UTF-8 decoder

Hello,

as some of you might know I work on a graphical library for
text based applications.

UTF-8 allows to have international text.
So I wrote a little test program for it, that just prints
"??pa?c??y? ??p" (“Hello World” in russian).
It worked as expected on most systems, that have a native iconv
implementation, but it broke on Windows.
So I think that bug is in SDL’s iconv implementation, to be more
precise in the UTF-8 decoder. (SDL-1.2.14)

Well, I know you Amerikans had some problems with Russia in the
past. But let me remind you, that the cold war is over already.
So I think that bug should be fixed for the sake of the diplomatic
relationships. ;-)

And by the way, the same problem appears with Thai characters.
But that’s not so urgent. I don’t speak Thai, and they don’t have
no nuclear weapons. ;-)

AKFoerster

Well, what is the problem? You said only that it "broke on Windows."
A little more information might help. Maybe some example program, too.

http://codebad.com/

On Thursday, 3 Dec 2009, Donny Viszneki wrote:

Well, what is the problem? You said only that it "broke on Windows."
A little more information might help. Maybe some example program, too.

Okay.
It’s hard to come up with an example, because the code was highly integated
with my library and is endian dependent.

But here is a short example (for little endian machines):

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include “SDL.h”

int
main (int argc, char *argv[])
{
const char *text="??pa?c??y? ??p";
wchar_t *wt, *p;

setlocale (LC_ALL, “”);

if (sizeof(wchar_t) == 4)
wt = (wchar_t *) SDL_iconv_string (“UTF-32LE”, “UTF-8”, text,
SDL_strlen(text) + 1);
else
wt = (wchar_t *) SDL_iconv_string (“UTF-16LE”, “UTF-8”, text,
SDL_strlen(text) + 1);

printf ("%ls\n", wt);

p = wt;
while (*p)
printf("U+%04lx, ", *p++);

printf("\n");

return 0;
}

On GNU/Linux, which has a native iconv implementation, it prints:
??pa?c??y? ??p
U+0417, U+0434, U+0070, U+0061, U+0432, U+0063, U+0442, U+0432, U+0079, U+0439, U+0020, U+043c, U+0438, U+0070,
That is how it should be.

On Windows you have to link it with the option “-mconsole” at the end.
The output is then:
??pa?c??y? ??p
U+fffd, U+fffd, U+0070, U+0061, U+fffd, U+0063, U+fffd, U+fffd, U+0079, U+fffd, U+0020, U+fffd, U+fffd, U+0070,
Note: U+fffd stands for “unknown character”.

AKFoerster

I know that the Windows text console is rather… limited and cannot show the
Russian text. That’s why I also displayed the values.

Also note that other languages like Greek and even Hebrew are fine,
also under Windows.

AKFoerster

I forget the details, but in Tux Paint ( http://www.tuxpaint.org/ ),
the fellow who ports it to Windows (John Popplewell) applies[*] some
patches to libiconv, libintl and gettext. I believe they were
originally patched by GIMP developers.

See: ftp://ftp.tuxpaint.org/unix/x/tuxpaint/source/libs/win32-patches/

(There’s also some other stuff here that may or may not be useful to
others: ftp://ftp.tuxpaint.org/unix/x/tuxpaint/source/libs/win32/ )

I don’t run Windows, let alone do any development directly on or for
that platform, so I have no idea what any of this is or why.
I think John’s on this list, though, so he might chime in.
(In any case, he’s easily reached by email.)

Good luck!

[*] Or at least, at some point in the past, applied (past-tense).

-bill!
Sent from my computer

Hi,

yes, applied is the right word. Last time I built libiconv (1.13.1), the
patch I used just adds a resource file to the build, containing version
information, and gettext (0.17) just built as-is. This was using
MinGW/MSYS on Windows XP.

I’ve not been following changes in UTF-8 support in SDL recently, but
the OP seems to be saying that SDL on Windows isn’t using iconv, but a
built-in ‘fallback’ function. I had a quick look and it seems to be in
‘src/stdlib/SDL_iconv.c’.

A quick scan of the code suggests that it only supports ASCII and LATIN1
encodings, apart from the usual UTF and UCS conversions.

cheers,
John.

On Thu, Dec 03, 2009 at 12:24:20PM -0800, Bill Kendrick wrote:

[*] Or at least, at some point in the past, applied (past-tense).

On Friday, 4 Dec 2009, John Popplewell wrote:

I’ve not been following changes in UTF-8 support in SDL recently, but
the OP seems to be saying that SDL on Windows isn’t using iconv, but a
built-in ‘fallback’ function. Had a quick look and it seems to be in
‘src/stdlib/SDL_iconv.c’.

Yes, the definition is in the header file ‘include/SDL_stdinc.h’ at the end.

If a system has a native iconv implementation then SDL_iconv_open and the
other functions are just macros for those. But on systems without an
iconv implementation, like Windows (anything else???) there is a fallback
solution in SDL.

A quick scan of the code suggests that it only supports ASCII and LATIN1
encodings, apart from the usual UTF and UCS conversions,

That’s right. But that is enough for me.

Well, it is not that important to me. But I stumbled across an obvious bug
and so I thought it should be reported… Sorry for the jokes!

AKFoerster

And by the way, the same problem appears with Thai characters.

Well, what is the problem? You said only that it "broke on Windows."
A little more information might help. Maybe some example program, too.

Okay.
It’s hard to come up with an example, because the code was highly integated
with my library and is endian dependent.

But here is a short example (for little endian machines):

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include “SDL.h”

int
main (int argc, char *argv[])
{
?const char *text="??pa?c??y? ??p";

Weird, that really threw me when I first looked at it. Are you
expecting this to create a UTF-8 literal? Have you verified that it
does on both compilers? Have you verified that your Windows locale is
UTF-8 and not one of the Windows-specific code pages?

I’m kind of immersed in Unicode right now. I started to write a
parser that handles Unicode input, which led me to the problem of
writing isDigit() for Unicode… Do you know how many versions of the
decimal digits there are in Unicode? Looked at parsing Chinese
numbers? And what are you supposed to do with all those different
special characters that encode common fractions? It is enough to make
a programmer weep.

Bob Pendleton


Hi,

not directly relevant to your problem, but this got me wondering about
SDL linking to libiconv on Windows. I use MinGW/MSYS to build the
libraries for Tux Paint and was curious about your problem.

I tried building the latest SVN version of SDL-1.2 and discovered that
it wasn’t making use of a perfectly good installation of libiconv :-)

The configure script makes various checks for iconv.h availability and
usability, and these checks pass OK, but the ‘checking for iconv’ test
(part of a series of checks for things like snprintf, strncasecmp etc.)
does not pass.

I came up with a patch that shouldn’t break anything and adds a test
for libiconv that does pass. I’m not an auto* expert, so there’s
probably a better/correct way of doing it, but I thought I’d stick it on
here anyway.

Apply patch like this:

cd SDL-1.2
patch -p1 < ../SDL-win32-iconv-fix.patch

Then:

./autogen.sh
./configure

and rebuild SDL, which will then depend on, in my case, libiconv-2.dll
(I have libiconv-1.13.1 installed).

Back to your test program. I now get the correct output from the while
loop printing out 16-bit numbers, but I’ve never seen anything output at
all from the printf() that displays the whole string (in the MSYS
console or a ‘DOS box’). I’m using an English (United States) locale,
though.

cheers,
John.

-------------- next part --------------
*** configure.in Fri Dec 4 22:20:39 2009
--- configure.in.new Fri Dec 4 22:19:19 2009
***************
*** 183,188 ****
--- 183,196 ----
      )
      AC_CHECK_FUNCS(malloc calloc realloc free getenv putenv unsetenv qsort abs bcopy memset memcpy memmove strlen strlcpy strlcat strdup _strrev _strupr _strlwr strchr strrchr strstr itoa _ltoa _uitoa _ultoa strtol strtoul _i64toa _ui64toa strtoll strtoull atoi atof strcmp strncmp _stricmp strcasecmp _strnicmp strncasecmp sscanf snprintf vsnprintf iconv sigaction setjmp nanosleep)
  
+     if test x$ac_cv_func_iconv != xyes; then
+         AC_CHECK_LIB(iconv, libiconv, have_libiconv=yes)
+         if test x$have_libiconv = xyes; then
+             AC_DEFINE(HAVE_ICONV)
+             EXTRA_LDFLAGS="$EXTRA_LDFLAGS -liconv"
+         fi
+     fi
+ 
      AC_CHECK_LIB(iconv, libiconv_open, [EXTRA_LDFLAGS="$EXTRA_LDFLAGS -liconv"])
      AC_CHECK_LIB(m, pow, [EXTRA_LDFLAGS="$EXTRA_LDFLAGS -lm"])
  fi

I wonder how you expect this to work at all.

And I am surprised to learn that it works on Linux, actually.

First, the C and C++ standards define that by default each string character
is char-sized. If you plan to use Unicode or similar, you need to use wide
character strings by prefixing them with L (L""). But you don’t do that in
your example.

Paulo


?const char *text="??pa?c??y? ??p";

Weird, that really threw me when I first looked at it. Are you
expecting this to create a UTF-8 literal?

Err… yes!

Have you verified that it does on both compilers?

It works that way with any other language. Only Russian and Thai are
broken, and only with the iconv implementation in SDL.

Well, English is not my native language, so I’m used to using UTF-8
in my programs all the time. So I’m very surprised that you are
surprised…

Note I’m writing my own graphical backend with SDL. It uses wchar_t
internally, but has functions which accept multibyte charsets like
ISO-8859-X or UTF-8. They are then converted with iconv…

Have you verified that your Windows locale is
UTF-8 and not one of the Windows-specific code pages?

The iconv API isn’t supposed to use the locale.

In the example program I just set the locale for the printf command,
which uses wcstombs internally and so it needs the locale setting.

I’m kind of immersed in Unicode right now. I started to write a
parser that handles Unicode input, which led me to the problem of
writing isDigit() for Unicode… Do you know how many versions of the
decimal digits there are in Unicode? Looked at parsing Chinese
numbers? And what are you supposed to do with all those different
special characters that encode common fractions? It is enough to make
a programmer weep.

Well, categorizing characters is really hard for Unicode.
And even if you can handle it on a character basis, numbers are often
given in letters. Think of Roman numerals or, in modern times,
hexadecimal…

[After looking it up…]
Well, the characters are already categorized. Take the following
table and search for “DIGIT”…
http://www.unicode.org/Public/5.2.0/ucd/NamesList.txt

For more information about UTF-8 in C:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.unicode.org/faq/programming.html

P.S.: And sorry again for the jokes!

AKFoerster

On Saturday, 5 Dec 2009, John Popplewell wrote:

Back to your test program. I now get the correct output from the while
loop printing out 16-bit numbers, but I’ve never seen anything output at
all from the printf() that displays the whole string (in the MSYS
console or a ‘DOS box’). I’m using an English (United States) locale,
though.

Well, the numbers are the crucial thing.

For printf() the locale setting is actually used.
In my real software project I don’t use the console, but wrote my own
text-output facilities…

AKFoerster

Hi,

yes, it does seem to be a bug. I took a look at the code and found that
it was deciding that the sequence was ‘overlong’ here:

} else if ( p[0] >= 0xC0 ) {
    if ( (p[0] & 0xE0) != 0xC0 ) {
        /* Skip illegal sequences
        return SDL_ICONV_EILSEQ;
        */
        ch = UNKNOWN_UNICODE;
    } else {
        if ( (p[0] & 0xCE) == 0xC0 ) {  <<<<<< looks suspicious!
            overlong = SDL_TRUE;
        }
        ch = (Uint32)(p[0] & 0x1F);
        left = 1;
    }
}

The line I’ve marked above does look suspicious, based on looking at the
code for handling similar cases above it, e.g. (edited!):

} else if ( p[0] >= 0xF0 ) {         <<<<<< see the 0xF0
    if ( (p[0] & 0xF8) != 0xF0 ) {
        ch = UNKNOWN_UNICODE;
    } else {
        if ( p[0] == 0xF0 ) {        <<<<<< see the 0xF0
            overlong = SDL_TRUE;
        }
        ch = (Uint32)(p[0] & 0x07);
        left = 3;
    }
} else if ( p[0] >= 0xE0 ) {         <<<<<< see the 0xE0
    if ( (p[0] & 0xF0) != 0xE0 ) {
        ch = UNKNOWN_UNICODE;
    } else {
        if ( p[0] == 0xE0 ) {        <<<<<< see the 0xE0
            overlong = SDL_TRUE;
        }
        ch = (Uint32)(p[0] & 0x0F);
        left = 2;
    }

If I change it to the following:

        if ( p[0] == 0xC0 ) {
            overlong = SDL_TRUE;
        }

it generates the same output as SDL when using libiconv.

Of course, I’ve no idea if there is a good reason for the test being the
way it is. I looked at other code for converting UTF-8 to UCS-4 (which is
what the above code is doing, I believe) and I couldn’t see anything
resembling the above test. I may have missed something though :-)

cheers,
John.

On Sat, Dec 5, 2009 at 2:22 AM, Paulo Pinto wrote:

I wonder how you expect this to work at all.

And I am surprised to learn that it works on Linux, actually.

First, the C and C++ standards define that by default each string character
is char-sized. If you plan to use Unicode or similar, you need to use wide
character strings by prefixing them with L (L""). But you don’t do that in
your example.

That was my initial reaction, but if the editor is outputting UTF-8
and the compiler accepts UTF-8, then you can store full Unicode in a
char[], because all parts of the UTF-8 encoding fit in 8-bit char
values, even though each encoded character spans one or more chars.

Bob Pendleton



?const char *text="??pa?c??y? ??p";

Weird, that really threw me when I first looked at it. Are you
expecting this to create a UTF-8 literal?

Err… yes!

Have you verified that it does on both compilers?

It works that way with any other language. Only Russian and Thai are
broken, and only with the iconv implementation in SDL.

Well, English is not my native language, so I’m used to using UTF-8
in my programs all the time. So I’m very surprised that you are
surprised…

LOL that’s fair! (My native language is the western dialect of
American English, though nowadays I mostly speak the version of the
southern dialect of American English as it is spoken in central Texas.
Since moving to Texas I’ve learned to pronounce gruene as “green”, to
roll my ‘r’s and to pronounce ‘ll’ as y :-) )

So, while I “knew” that it should work the way you expect, I’ve never
actually seen that in C code before.

Note I’m writing my own graphical backend with SDL. It uses wchar_t
internally, but has functions which accept multibyte charsets like
ISO-8859-X or UTF-8. They are then converted with iconv…

Yep, I’m working on a similar project. I’ve decided not to use wchar_t
because it is not large enough to handle Unicode on Windows. I’m using
C++, so my current plan is to build my own (or borrow the GNU) 32-bit
char_traits and use that to instantiate true 32-bit UCS string and IO
classes on GNU and Windows based systems. I am, of course, questioning
the wisdom of doing that, and I wonder why you don’t do something
similar?

Have you verified that your Windows locale is
UTF-8 and not one of the Windows-specific code pages?

The iconv API isn’t supposed to use the locale.

I know that. I was asking to make sure the program text was actually
in the character set you think it is. If it was using a Windows
specific code page then the char[] would not be in utf-8 and you would
get the failure you see.

In the example program I just set the locale for the printf command,
which uses wcstombs internally and so it needs the locale setting.

I’m kind of immersed in Unicode right now. I started to write a
parser that handles Unicode input, which led me to the problem of
writing isDigit() for Unicode… Do you know how many versions of the
decimal digits there are in Unicode? Looked at parsing Chinese
numbers? And what are you supposed to do with all those different
special characters that encode common fractions? It is enough to make
a programmer weep.

Well, categorizing characters is really hard for Unicode.
And even if you can handle it on a character basis, numbers are often
given in letters. Think of Roman numerals or, in modern times,
hexadecimal…

[After looking it up…]
Well, the characters are already categorized. Take the following
table and search for “DIGIT”…
http://www.unicode.org/Public/5.2.0/ucd/NamesList.txt

Yeah, I found that too. It doesn’t really address the complete
problem. You can’t just blindly parse that file and look for
everything with the word DIGIT and the name of a digit. Take a look at:

0C7E TELUGU FRACTION DIGIT THREE FOR EVEN POWERS OF FOUR

for example.

Oh well, I’m in that part of the project where you try to cover the
general case. The phase just before you throw up your hands and just
try to do what is reasonable.

Bob Pendleton


Hi,

I discovered the testiconv.c program that’s part of the SDL test-suite.
It explains what the above code is trying to do. There is a bug though,
the test should be:

  } else if ( p[0] >= 0xC0 ) {
    if ( (p[0] & 0xE0) != 0xC0 ) {
      /* Skip illegal sequences
        return SDL_ICONV_EILSEQ;
      */
      ch = UNKNOWN_UNICODE;
    } else {
      if ( (p[0] & 0xDE) == 0xC0 ) {    <<<<<<<< here
        overlong = SDL_TRUE;
      }
      ch = (Uint32)(p[0] & 0x1F);
      left = 1;
    }
  } else {

so I filed a bug #896 with more details
http://bugzilla.libsdl.org/show_bug.cgi?id=896

best regards,
John Popplewell


If you don’t have time to write a patch, I’ll incorporate your
changes, but if you do, that will help guarantee I don’t mess it up.
:-)

Thanks John!



-Sam Lantinga, Founder and President, Galaxy Gameworks LLC


Hi, I’ve been almost exclusively working on internationalisation on
Linux for the past 12 months, here are some things I’ve learnt that may
be of use:

First, iconv is one of the most badly-designed APIs I’ve ever come
across; it has one and ONLY one redeeming feature; it is part of POSIX.

As if the stateful nature, which has some extremely annoying nuances
and is very difficult to use correctly, wasn’t bad enough, the main
interface:

iconv_open(const char *tocode, const char *fromcode);

relies on modifying the ‘tocode’ variable in order to change behaviour,
such as enabling transliteration. Anyone with half a brain knows that
sort of thing should be passed in as a flag…

Secondly, iconv is exclusively I/O focused; iconv does not provide a
standard way of using and manipulating (Unicode) string data, which, as
Bob Pendleton pointed out, is not very straightforward…

The ‘way forward’ is to use libICU; it is cross-platform, has multiple
language bindings and solves the additional problem of providing a
unified way of handling strings across multiple languages and
implementations.

I guess keeping iconv as a fall-back isn’t so bad, especially as it can
be implemented without introducing additional dependencies, but really,
as soon as one scratches beneath the surface of the problem, libICU
becomes the obvious choice.

Eddy