Revision 7645 error?

Nathaniel_J_Fries · August 26, 2013, 1:43am

I haven’t checked on SDL in a while and decided to go through the changesets, when I found this: http://hg.libsdl.org/SDL/rev/cc775832d501

Windows wide characters are supposed to be UTF-16[1], which is not binary compatible with UTF-32[2]. Unless the MSDN documentation failed to mention that this function behaves differently from every other Windows Unicode-related function, it is an error to treat the buffer from ToUnicode as UTF-32. Also, the documentation on ToUnicode[3] implies that two or more characters may be written into the buffer, such as in some uncommon cases where multiple code points are required to represent a single character in text (tbh, Windows probably doesn’t even support those cases, but they do exist).

1: http://msdn.microsoft.com/en-us/library/windows/desktop/ff381407(v=vs.85).aspx
2: http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF
3: http://msdn.microsoft.com/en-us/library/windows/desktop/ms646320(v=vs.85).aspx (see under return value)

Nate Fries
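
To illustrate why a UTF-16 buffer cannot simply be read as UTF-32, here is a minimal sketch in plain C of decoding UTF-16 code units into a code point. The function name `utf16_decode` and its signature are hypothetical and are not part of SDL or the Windows API; the sketch only assumes the buffer holds UTF-16 code units, as the documentation cited above suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High (leading) surrogates occupy 0xD800..0xDBFF. */
static int is_high_surrogate(uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDBFF;
}

/* Low (trailing) surrogates occupy 0xDC00..0xDFFF. */
static int is_low_surrogate(uint16_t unit)
{
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

/*
 * Hypothetical helper: decode one code point from buf[0..len).
 * Stores the code point in *out and returns the number of UTF-16
 * code units consumed (1 or 2), or 0 if the input is empty or
 * contains an unpaired surrogate.
 */
static size_t utf16_decode(const uint16_t *buf, size_t len, uint32_t *out)
{
    assert(buf != NULL && out != NULL);

    if (len == 0) {
        return 0;
    }
    if (is_high_surrogate(buf[0])) {
        if (len < 2 || !is_low_surrogate(buf[1])) {
            return 0; /* high surrogate without its partner */
        }
        /* Ten payload bits from each half, offset by 0x10000. */
        *out = 0x10000u
             + ((uint32_t)(buf[0] - 0xD800u) << 10)
             + (uint32_t)(buf[1] - 0xDC00u);
        return 2;
    }
    if (is_low_surrogate(buf[0])) {
        return 0; /* low surrogate with nothing before it */
    }
    *out = buf[0]; /* BMP character: the code unit is the code point */
    return 1;
}
```

For code points below U+10000 (outside the surrogate range) the code unit and the code point have the same value, which is why treating the buffer as UTF-32 appears to work for most keyboard input. For example, U+1F600 is stored as the two units 0xD83D 0xDE00, and neither unit on its own is a valid code point.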

icculus · August 26, 2013, 10:19pm

You’re right, this patch appears to be incorrect.

–ryan.

Nathaniel_J_Fries · August 29, 2013, 11:40am

I see a similar error in the handling of WM_CHAR:

Code:
case WM_CHAR:
    {
        char text[5];

        WIN_ConvertUTF32toUTF8(wParam, text);
        SDL_SendKeyboardText(text);
    }
    returnCode = 0;
    break;

According to MSDN, the wParam for WM_CHAR is a UTF-16 code unit for Unicode windows, and a character in a locale-dependent ANSI encoding for ANSI windows.

Nate Fries

ginkgobitter · February 21, 2014, 7:51am

Thanks, Nathaniel, for pointing this out. That post would have been even better in the related Bugzilla entry (1876), since then it wouldn’t have taken me half a year to notice. The ToUnicode() documentation didn’t specifically state a character set, so I naively assumed the result would be a Unicode code point, in which case I thought UTF-32 would do just fine (yes, I realize it’s not the same).

Using some of the emojis present on the Windows 8 on-screen keyboard (US English layout), I could produce UTF-16 high surrogates, so it definitely is UTF-16. Oddly enough, those emojis come in as two separate events: one carries the high surrogate, the other the second half of the code. So far, I cannot quite make sense of this behavior.
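
The two-event behavior is what one would expect if each WM_CHAR message carries a single UTF-16 code unit, so that a character outside the Basic Multilingual Plane arrives as a high surrogate followed by a low surrogate. Below is a rough sketch in plain C of how the two halves could be joined before producing UTF-8. The `CharAssembler` type and the `assembler_feed` and `encode_utf8` functions are made up for illustration and are not SDL or Windows APIs; the sketch assumes the caller passes in the low 16 bits of wParam.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-window state: a high surrogate awaiting its partner. */
typedef struct {
    uint16_t pending_high; /* 0 when nothing is pending */
} CharAssembler;

/* Encode a code point (<= 0x10FFFF) as NUL-terminated UTF-8.
 * Returns the number of bytes written, not counting the NUL. */
static size_t encode_utf8(uint32_t cp, char out[5])
{
    size_t n;

    if (cp < 0x80) {
        out[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    }
    out[n] = '\0';
    return n;
}

/* Feed one UTF-16 code unit. Returns the number of UTF-8 bytes written
 * to text, or 0 if more input is needed or the unit was a stray low
 * surrogate (which this sketch simply drops). */
static size_t assembler_feed(CharAssembler *state, uint16_t unit, char text[5])
{
    assert(state != NULL && text != NULL);

    if (unit >= 0xD800 && unit <= 0xDBFF) {
        state->pending_high = unit; /* remember it, wait for the low half */
        return 0;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
        uint32_t cp;

        if (state->pending_high == 0) {
            return 0; /* low surrogate with no preceding high surrogate */
        }
        cp = 0x10000u
           + ((uint32_t)(state->pending_high - 0xD800u) << 10)
           + (uint32_t)(unit - 0xDC00u);
        state->pending_high = 0;
        return encode_utf8(cp, text);
    }
    state->pending_high = 0; /* discard any dangling high surrogate */
    return encode_utf8(unit, text);
}
```

With this approach the first event of an emoji produces no text, and the second produces the full four-byte UTF-8 sequence. Dropping unpaired surrogates is a design choice made for this sketch; substituting U+FFFD would be another reasonable option.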

I am currently trying to figure out an encoding issue with WM_CHAR: certain higher-value UTF-16 codes (e.g. 0x1750) aren’t received correctly after the event has passed through the Windows message queue; however, lower ones that still exceed 0xff are fine.

If you have any suggestions on improving the code I would appreciate it (refer to bugzilla #2406 for this).

ginkgobitter · February 22, 2014, 10:08pm

I submitted a patch for this problem in bugzilla #2406.

The whole problem arose from my not being aware of the different behaviors WM_CHAR messages can have, which led me to try to force ANSI applications to receive Unicode characters. My apologies. On the bright side, it ended up enhancing support for WM_CHAR events and adding support for WM_UNICHAR events in SDL, so it’s not all bad.

If someone could add the remark on the wiki as suggested in the bugzilla entry, that would be great.