Unicode filenames

CodeCat · March 1, 2010, 2:26pm

My app uses UTF-8 internally, as it seems that SDL 1.3 has made the move to UTF-8 as its internal string representation as well. The only exception, however, is SDL_RWops. It currently doesn’t seem to support filenames encoded in UTF-8 on Windows, as it uses the non-Unicode system call CreateFileA rather than the Unicode call CreateFileW. Will this be added into SDL 1.3 anytime soon? And is there a way to get UTF-8 support in filenames until then?

Nathaniel_J_Fries · March 1, 2010, 4:14pm

And is there a way to get UTF-8 support in filenames until then?

I think modern Unix-like systems consider 8-bit character strings to be UTF-8 strings already. So just use fopen?

icculus · March 1, 2010, 5:17pm

I think modern Unix-like systems consider 8-bit character strings to be
UTF-8 strings already. So just use fopen?

That’s the convention, but it’s not enforced, I think. A Linux
filesystem can have any byte in the filename except null characters and
‘/’ … if you give it data that’s not valid UTF-8, it’ll still work.

…it’ll probably confuse all sorts of software, but it’ll “work”.
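To make the "valid UTF-8" distinction concrete, here is a minimal sketch of a well-formedness check that could be run on a name coming back from the filesystem. The function name `is_valid_utf8` is made up for illustration; it is not part of SDL or any system API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the first len bytes of s are
 * well-formed UTF-8, 0 otherwise. Rejects overlong forms, UTF-16
 * surrogates (U+D800..U+DFFF) and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;       /* continuation bytes expected after the lead byte */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal for this length */
        size_t k;

        if (c < 0x80) { i++; continue; }                  /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < extra) return 0;                /* truncated sequence */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) return 0;            /* not a continuation */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min || cp > 0x10FFFFUL || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;       /* overlong, out of range, or surrogate */

        i += extra + 1;
    }
    return 1;
}
```

A name like "café.txt" stored as Latin-1 bytes fails this check, even though a Linux filesystem will happily store and open it.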

I think Mac OS X explicitly says it wants UTF-8 when using the POSIX
APIs (they are not stored in UTF-8 in the filesystem, it just converts
for you). Non-POSIX APIs on the Mac want CFStrings or NSStrings or what
not, which handle unicode details behind the scenes (you can create
CFStrings from UTF-8 data).

Windows never wants UTF-8. Since Windows 98, they’ve provided an API
to convert to/from UTF-8 (and full Unicode support in all NT-based
OSes), but they want UTF-16. If you try to pass UTF-8 bytes to the APIs
that take a char * instead of a WCHAR *, it thinks you’re using the
current “codepage” and it will not work, ever. These have to be manually
converted (SDL does this internally for you).

SDL wants UTF-8 whenever it asks for filenames, and converts as
appropriate for the system.
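For anyone who has to do the conversion by hand in the meantime, the following is a rough sketch of a UTF-8 to UTF-16 transcoder in portable C. It is not SDL's internal code; `utf8_to_utf16` and its return convention are assumptions made for this example, and for brevity it does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts NUL-terminated UTF-8 into NUL-terminated
 * UTF-16 code units. out_cap is the capacity of out in code units,
 * including room for the terminator. Returns the number of units written
 * (excluding the terminator), or (size_t)-1 on malformed input or if the
 * output buffer is too small. */
size_t utf8_to_utf16(const char *in, uint16_t *out, size_t out_cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (*s) {
        uint32_t cp;
        int extra;

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;       /* invalid lead byte */
        s++;

        while (extra-- > 0) {
            /* A NUL here also fails the test, so truncation is caught. */
            if ((*s & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
        }

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;        /* out of range or lone surrogate */

        if (cp >= 0x10000) {
            /* Outside the BMP: encode as a surrogate pair. */
            if (n + 2 >= out_cap) return (size_t)-1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 >= out_cap) return (size_t)-1;
            out[n++] = (uint16_t)cp;
        }
    }

    if (n >= out_cap) return (size_t)-1;
    out[n] = 0;
    return n;
}
```

On Windows, where WCHAR is 16 bits wide, the resulting buffer could then be handed to CreateFileW. The Win32 call MultiByteToWideChar with CP_UTF8 does the same job and is probably the better choice in Windows-only code.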

–ryan.

Gregory_Smith · March 1, 2010, 6:21pm

Windows never wants UTF-8. Since Windows 98, they’ve provided an API to
convert to/from UTF-8 (and full Unicode support in all NT-based OSes), but
they want UTF-16. If you try to pass UTF-8 bytes to the APIs that take a char *
instead of a WCHAR *, it thinks you’re using the current “codepage” and it
will not work, ever. These have to be manually converted (SDL does this
internally for you).

SDL wants UTF-8 whenever it asks for filenames, and converts as appropriate
for the system.

Which, I just want to point out, is a real pain for those of us using
mingw32, since it used to be the case that you could pass strings you got
from the POSIX file APIs (like readdir) into SDL_RWFromFile without care
about the code page (as long as it was 1-byte)–but now any path that
isn’t ASCII breaks. Sure, trying to do anything more sophisticated (like
displaying the path) was broken, but it was “the user sees weird
characters” broken rather than “the application can’t find its data files
and exits” broken.

We can, and should, I guess, write entirely separate file handling code
for Windows that uses the WCHAR* functions from the Windows API.
Amusingly, SDL in other cases reduces the amount of platform specific
code you have to write. Rather than doing the opposite here (and changing
13 versions of previous behavior), one wonders whether it would have been
too expensive to add an SDL_RWFromFileUTF8() instead.

Gregory

icculus · March 1, 2010, 9:34pm

We can, and should, I guess, write entirely separate file handling code
for Windows that uses the WCHAR* functions from the Windows API.
Amusingly, SDL in other cases reduces the amount of platform specific
code you have to write. Rather than doing the opposite here (and changing
13 versions of previous behavior), one wonders whether it would have been
too expensive to add an SDL_RWFromFileUTF8() instead.

Use SDL_iconv. If you’re getting data in an arbitrary encoding, you’re
going to have to anyhow.
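As an illustration of the kind of transcoding step meant here, below is a dependency-free sketch that converts ISO-8859-1 (Latin-1) bytes to UTF-8. It is a stand-in rather than SDL_iconv itself, and the name `latin1_to_utf8` is invented for the example. Real Windows code pages such as 1252 differ from Latin-1 in the 0x80..0x9F range, so a proper converter (SDL_iconv or the Win32 APIs) is still needed for those.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: transcodes a NUL-terminated ISO-8859-1 string to
 * NUL-terminated UTF-8. out_cap is the size of out in bytes, including
 * room for the terminator. Returns the number of bytes written (excluding
 * the terminator), or (size_t)-1 if the output buffer is too small. */
size_t latin1_to_utf8(const char *in, char *out, size_t out_cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    for (; *s; s++) {
        if (*s < 0x80) {
            /* ASCII is identical in both encodings. */
            if (n + 1 >= out_cap) return (size_t)-1;
            out[n++] = (char)*s;
        } else {
            /* Latin-1 bytes 0x80..0xFF map to U+0080..U+00FF, which
             * always take exactly two bytes in UTF-8. */
            if (n + 2 >= out_cap) return (size_t)-1;
            out[n++] = (char)(0xC0 | (*s >> 6));
            out[n++] = (char)(0x80 | (*s & 0x3F));
        }
    }

    if (n >= out_cap) return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

The output of a function like this is what would then be passed to SDL_RWFromFile, assuming the name from readdir() really was in Latin-1 to begin with.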

If mingw32 is still giving you results in the current codepage instead
of UTF-8, I’d argue that’s a bug they should fix. What do they give you
for characters that can’t be represented in the current codepage? What
do they give you for languages that can’t be represented in 8-bits at all?

–ryan.

Gregory_Smith · March 1, 2010, 10:10pm

Use SDL_iconv.

Hmm, this comment in SDL_RWops.c threw me:

/* Use UCS2: no UTF-16 support here. Try again in SDL 1.3. */

But, looking at SDL_iconv.c I can see UTF-16 support is in there. Still
need to use the WCHAR* APIs–weird code page support is non-existent.

I don’t really expect SDL to fix this mess–I don’t even think it can. I
was just a little annoyed at the surprise API change. For the time being I
just patched that back out of our static-linked SDL. Things weren’t
working 100% before so it’s a good excuse for us to start doing the right
thing anyway.

If mingw32 is still giving you results in the current codepage instead of
UTF-8, I’d argue that’s a bug they should fix.

mingw32’s fopen() is designed to be compatible with Microsoft’s, so it
takes the 8-bit code page. So I guess they decided readdir() should work
the same way. At least you can pass the results of one into the other that
way.

What do they give you for characters that can’t be represented in the
current codepage? What do they give you for languages that can’t be
represented in 8-bits at all?

::shrug::

Gregory

icculus · March 1, 2010, 11:57pm

/* Use UCS2: no UTF-16 support here. Try again in SDL 1.3. */
But, looking at SDL_iconv.c I can see UTF-16 support is in there.

That branch only runs on Win9x machines, and there isn’t UTF-16 support
in any of those (“wide chars” changed from UCS-2 to UTF-16 in Windows XP
or Win2000 or so).

What that code does is convert from UTF-8 to UCS-2 for Win9x, then uses
a Win32 API to attempt to convert to the current system’s codepage.

The goal was to make RWops work with UTF-8 strings on all platforms,
instead of just some of them.

–ryan.

CodeCat · March 1, 2010, 7:53pm

I would definitely advise against writing a wchar-based interface. Wchar is pretty much the definition of unportability: it’s not even the same size on two systems (32 bits on Linux, 16 bits on Windows), let alone the same encoding. Some applications switch their internal representation depending on platform (including the infamous UNICODE defines on Windows and wxWidgets), but that brings all sorts of nightmares with it, like not even knowing what encoding your application’s strings are in. That makes it very hard to write string transcoding functions or even just do text processing.
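A small sketch of the size problem: the same source gives different answers depending on where it is compiled. Both function names are made up for this example, and the second one assumes the usual convention that a 16-bit wchar_t holds UTF-16 and a wider one holds UTF-32.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Width of wchar_t in bits on whatever platform compiles this. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: number of wchar_t elements needed to store one
 * Unicode code point. Assumes UTF-16 when wchar_t is too narrow to hold
 * every code point (21 bits), and UTF-32 otherwise. */
unsigned wchar_units_for(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);

    if (wchar_bits() >= 21)
        return 1;                       /* one element per code point */
    return cp >= 0x10000UL ? 2 : 1;     /* surrogate pair above the BMP */
}
```

So a string containing a character outside the BMP has a different length in wchar_t elements on Linux than on Windows, which is one reason a fixed UTF-8 interface is easier to reason about.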

I personally like the idea of being able to count on a single encoding (UTF-8) being always available for all SDL functions. It makes building your app around that encoding a lot easier. The fact that UTF-8 support in Windows is broken isn’t a big deal for SDL, since SDL can just transcode to UTF-16 behind the scenes while presenting a single interface to the programmer.

John_K_Luebs · March 2, 2010, 8:19am

Ryan C. Gordon wrote:

We can, and should, I guess, write entirely separate file handling
code for Windows that uses the WCHAR* functions from the Windows API.
Amusingly, SDL in other cases reduces the amount of platform
specific code you have to write. Rather than doing the opposite here
(and changing 13 versions of previous behavior), one wonders whether
it would have been too expensive to add an SDL_RWFromFileUTF8()
instead.

Use SDL_iconv. If you’re getting data in an arbitrary encoding, you’re
going to have to anyhow.

If mingw32 is still giving you results in the current codepage instead
of UTF-8, I’d argue that’s a bug they should fix.

If we’re talking about using mingw, that’s usually just writing against
the native APIs, and the core POSIX-like APIs that one uses with mingw
are not actually implemented by mingw, they’re implemented by the native
Microsoft CRT DLL.

I don’t think it changes your basic point about the lack of robustness
in using these APIs though.

–jkl