The keyboard in SDL

Hi,

I have been testing SDL's keyboard handling by looking into the
SDL_keysym structure received from the event queue. I found the following
behaviour unexpected according to the API docs:

  • When Unicode is enabled by SDL_EnableUNICODE() the value of the unicode
    field is the same as that of the sym field for unshifted keys. E.g. when
    pressing the key with the Swedish letter ‘ö’ (the key with a ‘;’ on a US
    keyboard mapping) the content of both sym and unicode is 246 (which is a
    correct representation of ‘ö’ in ISO-8859-1 encoding). The actual Unicode
    value of ‘ö’ is however a two-byte number (in UTF-8).

  • When Unicode is enabled by SDL_EnableUNICODE() the value of the unicode
    field (which is obviously an ISO-8859-1 code) is correctly translated to
    the corresponding uppercase ISO-8859-1 code (i.e. 214, ‘Ö’) in case the
    shift key is held down.

  • When pressing a modifier key only, the mod field will be 0 at the same
    time as SDL_GetModState() returns the KMOD_* flag. Not until a key to be
    modified is pressed does the mod field get the same value as
    SDL_GetModState() returns.

I am running an Ubuntu machine using SDL 1.2.11.

Is this the intended behaviour, with things just being unclear in the docs,
or are some of these bugs?
Or is it a configuration problem on my system? But I can find no special
requirements, and Unicode is available in my other apps; e.g. it is the
default in the console.
Christer

Christer Sandberg wrote:

E.g. when pressing the key with the Swedish letter ‘ö’ (the key with
a ‘;’ on a US keyboard mapping) the content of both sym and unicode
is 246 (which is a correct representation of ‘ö’ in ISO-8859
encoding). The actual unicode value of ‘ö’ is however a two byte
number (in UTF-8).

The unicode field contains the Unicode code point of the generated
character (which is a number), not its UTF-8 encoded form (which is a
sequence of bytes). The code point of ‘ö’ is indeed 246. In fact code
points 0 to 255 of Unicode coincide with ISO-Latin-1, so to test whether
SDL really produces Unicode and not Latin-1 values, you need to use
characters that are not in Latin-1.

-Christian

Christer Sandberg wrote:

E.g. when pressing the key with the Swedish letter ‘ö’ (the key with
a ‘;’ on a US keyboard mapping) the content of both sym and unicode
is 246 (which is a correct representation of ‘ö’ in ISO-8859
encoding). The actual unicode value of ‘ö’ is however a two byte
number (in UTF-8).

The unicode field contains the Unicode code point of the generated
character (which is a number), not its UTF-8 encoded form (which is a
sequence of bytes). The code point of ‘ö’ is indeed 246. In fact code
points 0 to 255 of Unicode coincide with ISO-Latin-1, so to test whether
SDL really produces Unicode and not Latin-1 values, you need to use
characters that are not in Latin-1.
Thanks for your answer. I wasn't aware there existed a “code point”
representation (however, the SDL docs do not mention the “code point”).
I have spent some time trying to understand what relation there is
between a Unicode encoding like UTF-8 and a “code point”, but I have had
little or no success. So I'm asking for someone to work a miracle.
From Wikipedia:
"In text processing, Unicode takes the role of providing a unique code point
(a number, not a glyph) for each character. In other words, Unicode
represents a character in an abstract way and leaves the visual rendering
(size, shape, font or style) to other software, such as a web browser or word
processor."
But as far as I know this description also matches UTF-8, or? And on
www.unicode.org there is essentially the same text:
"Characters are represented by code points that reside only in a memory
representation, as strings in memory, or on disk. The Unicode Standard deals
only with character codes. Glyphs represent the shapes that characters can
have when they are rendered or displayed."
And also:
"In the Unicode character encoding model, precisely defined encoding forms
specify how each integer (code point) for a Unicode character is to be
expressed as a sequence of one or more code units. The Unicode Standard
provides three distinct encoding forms for Unicode characters, using 8-bit,
16-bit, and 32-bit units. These are correspondingly named UTF-8, …"
What I get from this is that the number representing e.g. ‘ö’ (246) is the
“code point” and UTF-8 is how to organize the bits (or bytes) of that number.
But in this case the number is not even the right one (as far as I can see);
it is mapped to some other number (and combined with a control byte).

What I want to do is to use the Unicode code point received from SDL's
keyboard event with a graphical text rendering engine that expects UTF-8,
so that the result is the same letter as the one on the key cap of the
user's keyboard. For example just using printf to print it in a console
that is configured for UTF-8 (I have bigger plans for the future).
Technically this must be solvable, since e.g. scanf manages the trick.

Is there a function in SDL that performs the translation needed for this, or
does anyone know about some lib providing it, or some location where I can
find a translation table?
Thanks in advance
Christer

On Wednesday 31 January 2007 21:52, Christian Walther wrote:

You should not confuse Unicode and UTF-8.

Unicode is NOT an encoding, it’s a standard CHARACTER SET.

The encodings are UTF-8, UTF-16 (little- and big-endian) and UTF-32
(likewise available in both endiannesses).

In SDL, the unicode value is UCS-4. (I think so, not 100% sure about
SDL; it's quite strange because UCS-4 is 32 bits and SDL returns a
16-bit value, but I think it's UCS-4 and they just drop the value if
it's bigger than 65535. Sam, Ryan?)

Let me explain a bit:

UTF-32 is the UCS-4 value encoded on 32 bits, simply a number, stored
in little- or big-endian order.

For example (I took a rare kanji because it has a huge value:) )

0x2F9F4 is the UCS-4 value, and can be encoded as-is on 32 bits.

Now, how can I encode this on 16 or 8 bits? It's impossible!
The answer is surrogates. This barbaric word means a “prefix” used to
inform the parser that our char is encoded on two units. (A unit is 16
or 8 bits depending on the encoding; it can be up to 4 units with
UTF-8.)

I will not enter into the details, but for example, the above char is:

0xD87E 0xDDF4 in UTF-16 and
0xF0 0xAF 0xA7 0xB4 in UTF-8.

In short (if you want to understand fully, read the doc): in the UTF-16
case, 0xD87E means this char is on two units and the second unit is from
code table xxx. Same logic for UTF-8.

So, you can store any UCS4 data in any UTF encoding.

Is there a function in SDL that performs a translation needed for this,
or does anyone know about some lib providing it, or some location where
I can find a translation table.

Now about this:

your answer is “man 3 iconv”.

or:

iconv_t converter;
converter = iconv_open("UTF-8", "UTF-32"); // (or UTF-16 for SDL, again
not sure about SDL behaviour; you can also append BE or LE after the
encoding name for endianness)

char * myUtf16Buff = …;
char * myUtf8Buff = …;

iconv(converter, myUtf16Buff, lengthInByteOfMyUtf16Buff, myUtf8Buff,
lengthInByteOfMyUtf8Buff);

iconv_close(converter);

Should do the job.

Recommended reading:

Best of luck

On Feb 1, 2007, at 10:05 PM, Christer Sandberg wrote:


Kuon
Programmer and sysadmin.

“Computers should not stop working when the users’ brain does.”

iconv(converter, myUtf16Buff, lengthInByteOfMyUtf16Buff, myUtf8Buff,
lengthInByteOfMyUtf8Buff);

Little correction: the size_t arguments are pointers; anyway you should read the doc :)

Good luck

On Feb 2, 2007, at 10:10 AM, Kuon - Nicolas Goy - ??? (Goyman.com SA) wrote:


Kuon
Programmer and sysadmin.

“Computers should not stop working when the users’ brain does.”

Hi.

As far as I know UNICODE is a character representation designed for
use in memory.
It is a comfortable format because you can handle an array of UNICODE
characters much like you handle a normal char array (characters use a
fixed number of bytes, a multiple of 2, and the representation of each
one is independent of the other characters around it).

UTF-8 is one of many formats suitable for text storage or streaming.
That is, you try to optimize the space needed to store the text, or the
time you need to send it over a connection, and that makes the
representation a little more complicated and not so good for working
with the individual characters of the text in memory.

I imagine that browsers, for example, receive web pages in a format of
the storage/streaming family, and then translate them to UNICODE in
order to put them on the screen. And to put them on the screen they may
use some table relating Unicode codes to glyphs.

Hope this helps. It was not an easy piece of information for me to find
either.
I apologize for not giving any references, but I do not remember where I
finally found this.

Regards

Kuon - Nicolas Goy - ??? (Goyman.com SA) wrote:

On Feb 1, 2007, at 10:05 PM, Christer Sandberg wrote:

Is there a function in SDL that performs a translation needed for
this

your answer is “man 3 iconv”.

In fact, SDL brings its own version of that (SDL_iconv) that wraps the
system version or uses its own implementation on systems that don’t have
it. I’m not sure if it’s documented anywhere (the wiki seems to be
unreachable at the moment), but have a look at the bottom of
http://www.libsdl.org/cgi/viewvc.cgi/branches/SDL-1.2/include/SDL_stdinc.h?view=markup,
or
http://www.libsdl.org/cgi/viewvc.cgi/branches/SDL-1.2/src/stdlib/SDL_iconv.c?view=markup
for the implementation.

-Christian

Thanks for all the answers, they were all valuable - I think I get it now.
Great that SDL provides multi-platform conversion functions (maybe more
people would start using them if they were included in the docs?). It seems
to work fine if I pass “UCS-2” for the source format, so hopefully the
truncated UCS-4 (if that is correct?) is compatible with UCS-2 as long as
only zeros are truncated.
Extra credits for SDL_iconv_string()!

Now a new question arises: I need a function similar to toupper() that
operates on Unicode, and I can find no such function in SDL. I have looked
around out there but can't find anything for C.
Does someone have any clue where to start digging? (I realize that it is
likely possible to find out how to do it by reading the complete Unicode
docs, but those are rather extensive.)

Thanks,
Christer

toupper

I’d say: towupper

and anything in #include <wctype.h>

Regards

On Feb 3, 2007, at 10:35 PM, Christer Sandberg wrote:

Kuon
Programmer and sysadmin.

“Computers should not stop working when the users’ brain does.”

Now a new question arises: I need a function similar to toupper() that
operates on Unicode, and I can find no such function in SDL. I have looked
around out there but can't find anything for C.
Does someone have any clue where to start digging? (I realize that it is
likely possible to find out how to do it by reading the complete Unicode
docs, but those are rather extensive.)

My digging led me to believe that toupper()/tolower()/sorting is
intentionally left out since each locale has its own rules. IIRC the
example given is ‘ä’, which in German is sorted together with ‘a’ but in
Swedish is sorted as its own character.

So the only solution is to use “wchar.h”, and note that wchar_t is
UTF-16 on Windows and UTF-32 on most Unices.

On Sat, 3 Feb 2007, Christer Sandberg wrote:

Hello !

As far as I know UNICODE is a character representation designed for
use in memory. It is a comfortable format because you can handle an array
of UNICODE characters much like you handle a normal char array (characters
use a fixed number of bytes, a multiple of 2, and the representation of
each one is independent of the other characters around it).

Which also makes it easy to calculate how many real characters a
string has.

CU

As far as I know UNICODE is a character representation designed for
use in memory. It is a comfortable format because you can handle an array
of UNICODE characters much like you handle a normal char array (characters
use a fixed number of bytes, a multiple of 2, and the representation of
each one is independent of the other characters around it).

Of course, neither of those things is completely true.

Unicode values are 21 bits in size. So you usually need a full 4
bytes for each character (i.e. UTF-32) in order to treat your string
as a “normal” array of characters. Most people prefer to use one of
the encodings that keep strings from getting too large (e.g. UTF-8,
UTF-16, or something higher-level like SCSU).

Also, Unicode contains a number of combining modifier characters (e.g.
accent marks). Their representations are affected by the character(s)
they modify (depending on what exactly you mean by “representation”).

b