UTF-16 clipboard

PiotrGrochowski · August 29, 2022, 9:34am

How do I get UTF-16 text in and out of the clipboard without incurring the horrendous bloat of UTF-8?

PiotrGrochowski · November 5, 2022, 5:09pm

This is getting ridiculous. In SDL audio there are options to adjust the audio format. Why not make the text encoding adjustable as well in the SDL text functions? This could be implemented as an SDL function that switches the encoding between UTF-8, UTF-16, and the system 8-bit encoding. The functions that currently take a pointer to UTF-8 text would then take a void pointer and treat it as text in the currently selected encoding.

PiotrGrochowski · November 6, 2022, 5:59pm

What are you talking about? UTF-16 is used internally in my code. UTF-16 is used externally in Windows. So the clipboard should be direct UTF-16 to clipboard in this case. There is nothing ridiculous about supporting this type of clipboard operation.
Of course, some systems that SDL supports use UTF-8, but not all of them. If anything, it’s ridiculous for an intermediate UTF-8 representation to be necessary when neither the internal nor the external side actually uses it. UTF-8 is complex enough to involve a lot of branches, especially on the decode side, and for very large strings it requires excessive memory allocation to store the intermediate result.

PiotrGrochowski · November 6, 2022, 9:48pm

No, it’s definitely not the right thing. There is no reason to impose a certain format on the user if the system doesn’t rely on it. It is also completely ignorant of many embedded use cases where Unicode is not used at all. Standardising on UTF-8 is analogous to standardising on white race or standardising on GNU software; it is discriminating against the other possibilities. It only adds unnecessary bloat to software relying on UTF-16 strings, ISO 8859-1 strings, etc. and I mainly write code to manipulate UTF-16 strings.

icculus · November 7, 2022, 3:37am

SDL_iconv() can be used to conveniently move from UTF-8 to whatever encoding you like. I’m not unsympathetic to the concern about unbounded data lengths, but it’s also probably reasonable to assume most clipboards’ contents are measured in bytes, not megabytes.
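
To show what that conversion involves, here is a minimal sketch in plain C, with no SDL dependency, of decoding UTF-8 into UTF-16 code units. The name `utf8_to_utf16` is hypothetical and not part of SDL; in an SDL2 program, a call such as `SDL_iconv_string("UTF-16LE", "UTF-8", text, SDL_strlen(text) + 1)` should do the same job.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an SDL function: decode UTF-8 into UTF-16.
 * Returns the number of 16-bit code units written, or (size_t)-1 if the
 * input is malformed or out_cap is too small. No terminator is written. */
size_t utf8_to_utf16(const unsigned char *in, size_t in_len,
                     uint16_t *out, size_t out_cap)
{
    /* Smallest code point that legitimately needs 0..3 continuation bytes. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    size_t i = 0, n = 0;

    while (i < in_len) {
        unsigned char b = in[i++];
        uint32_t cp;
        size_t extra;   /* continuation bytes expected after the lead byte */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;      /* stray continuation or invalid lead */

        if (in_len - i < extra) return (size_t)-1;   /* truncated sequence */
        for (size_t k = 0; k < extra; k++) {
            unsigned char c = in[i++];
            if ((c & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong forms, encoded surrogates, and values > U+10FFFF. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            if (n + 1 > out_cap) return (size_t)-1;
            out[n++] = (uint16_t)cp;
        } else {
            /* Non-BMP code points become a surrogate pair. */
            if (n + 2 > out_cap) return (size_t)-1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 + (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        }
    }
    return n;
}
```

Each UTF-8 sequence produces no more UTF-16 code units than it has bytes, so an output buffer of `in_len` code units is always large enough, and the temporary allocation is bounded by the size of the clipboard text.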

Peter87 · November 7, 2022, 12:22pm

It doesn’t do it like that for the graphics. Instead it supports many different graphic formats so that you can use whatever is most efficient on your platform.

That said, I wouldn’t have thought the clipboard was so performance critical that it would matter if you had to convert to and from UTF-8, but maybe if you have huge strings it matters, I don’t know…

I think what SDL does is probably fine for most applications. If you want to optimize the clipboard handling for Windows then perhaps it’s not too bad if you have to use Windows-specific functions outside of SDL, is it?

PiotrGrochowski · November 7, 2022, 1:32pm

Indeed, SDL video and SDL audio support multiple formats, so it doesn’t make sense for SDL text not to support multiple formats as well. I can output 8-bit audio and I can output 16-bit audio as well.
“The whole purpose of an abstraction layer like SDL is to hide the differences between platforms by providing a uniform interface, so the same application will run, unmodified, on all those platforms. That necessarily implies that the same text encoding will be used, so there has got to be a conversion somewhere.” However, when the user can choose the encoding, the library becomes even more of an abstraction, since it can take input and produce output in the user’s format.

sjr · November 8, 2022, 1:52am

Are you really asking the SDL developers to (essentially) write multiple versions of every function that takes a string?

Peter87 · November 8, 2022, 10:13am

SDL_ttf does something like that. For every function that takes a string you can choose between three different versions.

TTF_RenderText_Solid (Latin1)
TTF_RenderUTF8_Solid (UTF-8)
TTF_RenderUNICODE_Solid (UCS-2)

But looking at the implementation, it seems like the “Text” and “UNICODE” versions convert to UTF-8 internally, so this is more about convenience than performance.

Note that it doesn’t support full UTF-16 or “system 8-bit encoding” as suggested by Piotr.

PiotrGrochowski · November 8, 2022, 12:58pm

All UTF-8 string functions interfacing non-UTF-8 platforms must have some sort of conversion built into them. It is possible that either there is separate conversion code for every single text function, or it is shared in a common function. The way to handle multiple encodings would be to make the conversion code a branch depending on the current global text encoding. (UTF-8 could be default to preserve backwards compatibility, but system 8-bit encoding and UTF-16 would be available as well)
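
As a rough illustration of the proposal, and not of anything SDL actually provides, the sketch below shows a global encoding setting and one function that branches on it. Every name here (`TextEncoding`, `SetTextEncoding`, `TextByteLength`) is hypothetical. It covers only the first thing a void-pointer text API would need to do, which is find the end of the string according to the selected encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical API sketch; none of these names exist in SDL. */
typedef enum {
    TEXT_ENCODING_UTF8,     /* default, to preserve backwards compatibility */
    TEXT_ENCODING_UTF16,
    TEXT_ENCODING_SYSTEM8   /* system 8-bit encoding */
} TextEncoding;

static TextEncoding g_text_encoding = TEXT_ENCODING_UTF8;

/* Select the encoding that later text calls will assume. */
void SetTextEncoding(TextEncoding enc)
{
    g_text_encoding = enc;
}

/* Length in bytes of a terminated string, excluding the terminator,
 * interpreted according to the currently selected encoding. */
size_t TextByteLength(const void *text)
{
    switch (g_text_encoding) {
    case TEXT_ENCODING_UTF16: {
        /* Terminator is a 16-bit zero; caller must pass aligned data. */
        const uint16_t *s = text;
        size_t n = 0;
        while (s[n] != 0) n++;
        return n * sizeof(uint16_t);
    }
    case TEXT_ENCODING_UTF8:
    case TEXT_ENCODING_SYSTEM8:
    default: {
        /* Terminator is a single zero byte. */
        const unsigned char *s = text;
        size_t n = 0;
        while (s[n] != 0) n++;
        return n;
    }
    }
}
```

One trade-off of this design is that the compiler can no longer check the pointer type: passing a UTF-16 buffer while UTF-8 is still selected would compile cleanly and can stop at the first zero byte inside a code unit.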

“Every other platform supported by SDL 2.0, as far as I know, uses UTF-8 (Linux, MacOS, Android and iOS at least).” Doesn’t Android have a UTF-16 clipboard?

“Text: A CharSequence.”
“A CharSequence is a readable sequence of char values. This interface provides uniform, read-only access to many different kinds of char sequences. A char value represents a character in the Basic Multilingual Plane (BMP) or a surrogate. Refer to Unicode Character Representation for details.”

Peter87 · November 8, 2022, 2:56pm

It looks that way based on the interface although I don’t know how it’s stored internally.
It’s probably because Android is based on Java which uses UTF-16 a lot.

The Android clipboard also seems to support some additional functionality with URIs and intents, not just plain text.

Looks like SDL_SetClipboardText and SDL_GetClipboardText on Android use the JNI functions NewStringUTF and GetStringUTFChars to convert to and from Java strings.

PiotrGrochowski · November 8, 2022, 3:29pm

“Looks like SDL_SetClipboardText and SDL_GetClipboardText on Android use the JNI functions NewStringUTF and GetStringUTFChars to convert to and from Java strings.” And the specification for these functions mentions modified UTF-8, not actual UTF-8. Modified UTF-8 basically means storing null characters as a two-byte sequence (C0 80) and storing each non-BMP character as two separately encoded surrogates. Indeed, I can confirm that when I have non-BMP characters in the clipboard and I use SDL_GetClipboardText on Android, they come back as two encoded surrogates rather than in their actual UTF-8 encoding, and that is invalid in UTF-8. The UTF-16 sequence D83E DF00 D83E DF01 D83E DF02 D83E DF03 (U+1FB00 U+1FB01 U+1FB02 U+1FB03) becomes ED A0 BE ED BC 80 ED A0 BE ED BC 81 ED A0 BE ED BC 82 ED A0 BE ED BC 83 (invalid UTF-8). So not only does SDL not support system 8-bit encoding and UTF-16 APIs, but the UTF-8 that it was intended to support isn’t even correct on all platforms.
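
Until that is fixed, an application could repair the returned text itself. The following is a minimal sketch in plain C, assuming the behaviour described above; `surrogate_at` and `fix_surrogate_pairs` are hypothetical helper names, not SDL or JNI functions. It rewrites each six-byte encoded surrogate pair as the proper four-byte UTF-8 sequence, in place, and leaves everything else (including any C0 80 sequence or unpaired surrogate) untouched.

```c
#include <stddef.h>
#include <stdint.h>

/* If p points at a 3-byte encoded surrogate (ED A0..BF 80..BF), return
 * its 16-bit value (0xD800..0xDFFF); otherwise return 0. */
static uint16_t surrogate_at(const unsigned char *p, size_t avail)
{
    if (avail < 3 || p[0] != 0xED) return 0;
    if ((p[1] & 0xE0) != 0xA0 || (p[2] & 0xC0) != 0x80) return 0;
    return (uint16_t)(0xD000 | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F));
}

/* Hypothetical helper: convert encoded surrogate pairs in buf to standard
 * 4-byte UTF-8, in place. Returns the new length, which is never larger
 * than len because each pair shrinks from 6 bytes to 4. */
size_t fix_surrogate_pairs(unsigned char *buf, size_t len)
{
    size_t r = 0, w = 0;   /* read and write positions, with w <= r */

    while (r < len) {
        uint16_t hi = surrogate_at(buf + r, len - r);
        uint16_t lo = (hi >= 0xD800 && hi <= 0xDBFF)
                          ? surrogate_at(buf + r + 3, len - r - 3)
                          : 0;

        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            /* Combine the pair into one code point and re-encode it. */
            uint32_t cp = 0x10000 + (((uint32_t)hi - 0xD800) << 10)
                                  + ((uint32_t)lo - 0xDC00);
            buf[w++] = (unsigned char)(0xF0 | (cp >> 18));
            buf[w++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[w++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[w++] = (unsigned char)(0x80 | (cp & 0x3F));
            r += 6;
        } else {
            buf[w++] = buf[r++];   /* copy any other byte unchanged */
        }
    }
    return w;
}
```

Applied to the bytes quoted above, ED A0 BE ED BC 80 should come out as F0 9F AC 80, the standard UTF-8 encoding of U+1FB00.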

Peter87 · November 8, 2022, 4:08pm

I was wondering about that when I read the code. What you describe is obviously a bug. I think you should report it.

PiotrGrochowski · November 8, 2022, 4:25pm

I am a gitphobe so I’m not going to submit any GitHub issues.

Peter87 · November 8, 2022, 4:28pm

I understand. I also don’t use GitHub. I’m going to flag your post and hope someone else takes care of it.

ROSY · November 8, 2022, 7:04pm

Does this clipboard convert? I thought it was only copying bytes …

Peter87 · November 8, 2022, 8:55pm

If it was just copying bytes you would have problems when copy-pasting between programs that use different text encodings. The SDL clipboard functions use UTF-8. That means that on platforms where the underlying clipboard API uses some other encoding there has to be some conversion happening.

Mason_Wheeler · November 10, 2022, 12:16am

That’s the precise opposite of how a cross-platform abstraction layer works. Yes, it hides the differences between platforms by providing a uniform interface to the developer, but we’re not talking about a developer interface here. We’re talking about the OS side of the interface, and the entire purpose of the abstraction layer is to translate between the developer interface and the OS’s preferred, native way of doing things.

This means that if the OS is using UTF-16, you use UTF-16, even if the developer interface is UTF-8. If you’re not doing that consistently across the various supported platforms, your claim to have a cross-platform abstraction layer goes right out the window.

rtrussell · November 10, 2022, 9:15am

If that’s the case I completely misunderstood what the OP was asking for, sorry. I thought he was wanting SDL2 to provide UTF-16 as an alternative encoding at the developer interface, specifically at the SDL_GetClipboardText() interface. I’ve deleted my comment.

PiotrGrochowski · November 10, 2022, 11:58am

SDL audio supports multiple formats (8-bit, 16-bit, 32-bit, float) so does that mean it cannot be considered a cross-platform abstraction layer?

What I mean is that I process UTF-16 internally in my code, and I would like the SDL string functions to be able to handle UTF-16 as well. I would specify the UTF-16 encoding beforehand, like how I select the audio format before starting audio, and then pass a pointer to a UTF-16 string to the functions. Conversions to and from UTF-8 would occur only when the direct external interface relies on UTF-8.