mIRC 7 Charset Detection

Posted By: bwuser

mIRC 7 Charset Detection - 01/08/10 08:38 PM

Hey,

I'm wondering to what extent mIRC is performing charset detection on incoming messages.

Examples:
- mIRC 6.x user sends "" (German umlauts) in ISO-8859-1 or ISO-8859-15: mIRC 7 displays them just fine.
- French user has trouble getting é (e-acute) to display correctly.

This means that mIRC 7 must be performing some kind of charset detection, although in a limited fashion. Can anybody (Khaled?) enlighten me about the specifics? Thanks!
Posted By: Wims

Re: mIRC 7 Charset Detection - 01/08/10 08:43 PM

I'm French and have no problem. The real problem is that such users do not understand that mIRC is now Unicode; there's no charset anymore.
Quote:
- mIRC 6.x user sends "" (German umlauts) in ISO-8859-1 or ISO-8859-15: mIRC 7 displays them just fine.
This is because of utf-8, the first 255 characters remain the same

Did you see this?
Posted By: bwuser

Re: mIRC 7 Charset Detection - 02/08/10 05:51 AM

Originally Posted By: Wims
This is because of utf-8, the first 255 characters remain the same

That's wrong. Only the first 128 characters (0-127) remain the same, hence my question. ;)
Posted By: Collective

Re: mIRC 7 Charset Detection - 02/08/10 06:08 AM

The French user you linked to had no trouble displaying e-acute, hence the screenshots containing it. His problem is that mIRC is encoding a channel name containing é even though the original channel has an unencoded name.

There is no charset detection. mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.
Posted By: Wims

Re: mIRC 7 Charset Detection - 02/08/10 06:58 AM

Oh yeah, my bad. Then it might be because mIRC recognizes an invalid UTF-8 sequence and then doesn't decode it.
Posted By: bwuser

Re: mIRC 7 Charset Detection - 03/08/10 05:52 PM

Originally Posted By: Collective
There is no charset detection. mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.

No! Still wrong!
Originally Posted By: http://en.wikipedia.org/wiki/UTF-8
The first 128 characters of the Unicode character set (which correspond directly to the ASCII) use a single octet with the same binary value as in ASCII.


Originally Posted By: Wims
Oh yeah, my bad. Then it might be because mIRC recognizes an invalid UTF-8 sequence and then doesn't decode it.

Why would it decode UTF-8? It probably turns it into UTF-16 for internal use, but I wouldn't call that decoding.
Also, more importantly, ISO-8859-1 encoded umlauts are NOT valid UTF-8. If interpreted as UTF-8 without any conversion (which is what you suggested), the characters turn into gibberish, not the correct visual representation.
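
To make the point about invalid UTF-8 concrete, here is a minimal Python sketch. The word "für" is just an example chosen for illustration (it is not from the thread), and this shows generic codec behaviour, not what mIRC does internally. It checks that ISO-8859-1 bytes for an umlaut are rejected by a strict UTF-8 decoder, and shows the gibberish you get in the opposite direction, when UTF-8 bytes are read as ISO-8859-1.

```python
# Example word chosen for this sketch; any text with an umlaut would do.
word = "für"

# ISO-8859-1 stores the umlaut as the single byte 0xFC.
latin1_bytes = word.encode("latin-1")
print(latin1_bytes)  # b'f\xfcr'

# A strict UTF-8 decoder rejects that byte sequence outright.
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False

# The reverse mistake: UTF-8 bytes read as ISO-8859-1 turn into gibberish.
mojibake = word.encode("utf-8").decode("latin-1")
print(mojibake)  # fÃ¼r
```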

So anybody? Khaled? :(

To clarify: All I want to know is what mIRC does when it encounters an input sequence that is not valid UTF-8.
Posted By: Collective

Re: mIRC 7 Charset Detection - 03/08/10 06:15 PM

Originally Posted By: bwuser
Originally Posted By: Collective
There is no charset detection. mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.

No! Still wrong!
Originally Posted By: http://en.wikipedia.org/wiki/UTF-8
The first 128 characters of the Unicode character set (which correspond directly to the ASCII) use a single octet with the same binary value as in ASCII.

I'm not sure what point you're trying to make by quoting that. I never said that characters between 128-255 were sent unencoded. I said that when they were received (UTF-8-encoded or otherwise) they'd be displayed as Unicode, and that mIRC does not seek to override their display based on some codepage detection mechanism.

Quote:
Why would it decode UTF-8? It probably turns it into UTF-16 for internal use, but I wouldn't call that decoding.

"Decode" seems a perfectly reasonable term here. Don't make me get out a dictionary.

Quote:
Also, more importantly, ISO-8859-1 encoded umlauts are NOT valid UTF-8. If interpreted as UTF-8 without any conversion (which is what you suggested)

That isn't what he suggested. He suggested invalid UTF-8 sequences would not be decoded (or "converted", as you put it). This is not entirely true (and hence neither is my assertion above) due to this bug; however, it is true for invalid sequences that the bug doesn't affect. Put another way: bytes that are not part of a valid UTF-8 sequence are treated as characters.
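
The "invalid bytes are treated as characters" rule can be sketched in Python as a custom decode error handler. This is only an illustration of the idea, not mIRC's actual code. The handler name `bytes_as_chars`, the function `decode_line`, and the choice of mapping each stray byte to the Unicode code point with the same number (i.e. ISO-8859-1) are all assumptions made for this sketch.

```python
import codecs


def _bytes_as_chars(exc):
    """Error handler: keep each undecodable byte as the character with the same value."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Assumption: stray bytes map 1:1 to code points 0-255 (ISO-8859-1).
    return bad.decode("latin-1"), exc.end


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("bytes_as_chars", _bytes_as_chars)


def decode_line(raw: bytes) -> str:
    """Decode valid UTF-8 sequences; pass every other byte through as a character."""
    return raw.decode("utf-8", errors="bytes_as_chars")


print(decode_line("für".encode("utf-8")))  # für  (valid UTF-8 is decoded)
print(decode_line(b"f\xfcr"))              # für  (stray 0xFC kept as U+00FC)
print(decode_line(b"\xc3\xbc \xfc"))       # ü ü  (both forms on one line)
```

With this rule a line that mixes both encodings still displays correctly, which matches the observation that ISO-8859-1 umlauts from mIRC 6.x users show up fine.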
Posted By: uberRegenbogen

Re: mIRC 7 Charset Detection - 17/08/10 12:01 PM

Originally Posted By: bwuser
Originally Posted By: Collective
...mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.

No! Still wrong!
Originally Posted By: http://en.wikipedia.org/wiki/UTF-8
The first 128 characters of the Unicode character set (which correspond directly to the ASCII) use a single octet with the same binary value as in ASCII.


That is UTF-8—which is not quite the same thing as Unicode. UTF-8 is a particular method of encoding Unicode for 8-bit transports (e.g. IRC). In Unicode the first 256 codepoints (0-255) correspond to ISO-8859-1. UTF-8 uses byte-values in the 128-255 range for encoded characters; therefore bytes in this range will never correspond to the values of the encoded characters—destroying the codepoint/byte-value relationship for the range.
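
The distinction between code points and encoded bytes can be checked with a short Python sketch (plain standard-library codecs, nothing mIRC-specific). It confirms that all 256 ISO-8859-1 byte values map to the Unicode code point with the same number, while UTF-8 keeps the one-byte form only for code points 0-127.

```python
# Every ISO-8859-1 byte decodes to the code point with the same number.
for b in range(256):
    assert bytes([b]).decode("latin-1") == chr(b)

# UTF-8 uses one byte for code points 0-127 and two bytes for 128-255.
one_byte = [cp for cp in range(256) if len(chr(cp).encode("utf-8")) == 1]
two_byte = [cp for cp in range(256) if len(chr(cp).encode("utf-8")) == 2]
print(len(one_byte), len(two_byte))  # 128 128

# Example: e-acute is code point U+00E9 in Unicode and byte 0xE9 in ISO-8859-1,
# but its UTF-8 form is two bytes, neither of which is 0xE9.
e_acute = "\u00e9"
print(e_acute.encode("latin-1"))  # b'\xe9'
print(e_acute.encode("utf-8"))    # b'\xc3\xa9'
```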

Originally Posted By: bwuser
Originally Posted By: Wims
...it might be because mIRC recognizes an invalid UTF-8 sequence and then doesn't decode it.

Why would it decode UTF-8? It probably turns it into UTF-16 for internal use, but I wouldn't call that decoding.

I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()).

As for decoding, my experience is that mIRC 6 (in UTF-8-decode mode) handles UTF-8 decoding errors by rendering the entire line as 8-bit text in the system's default codepage—Not a bad fallback for IRC, where there is still a lot of non-UTF-8 floating about.

mIRC 7.08 [yeah, I need to update to the release] seems to favour UTF-8, only falling back to 8-bit if there is nothing on the line that looks like UTF-8. This causes non-UTF-8 text on a line that mIRC thinks is UTF-8 to get mangled (as you both correctly surmised). I feel that the 6.35 behaviour makes more sense for the current IRC landscape.
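
The whole-line fallback described above for mIRC 6 can be sketched roughly as follows in Python. This is a guess at the general strategy, not the real implementation: the function name `decode_whole_line` is made up, and `cp1252` stands in for "the system's default codepage", which would differ from machine to machine.

```python
def decode_whole_line(raw: bytes, fallback: str = "cp1252") -> str:
    """Try strict UTF-8 first; on any error, treat the entire line as 8-bit text."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: cp1252 plays the role of the system's default codepage.
        # errors="replace" covers the few byte values cp1252 leaves undefined.
        return raw.decode(fallback, errors="replace")


print(decode_whole_line("café".encode("utf-8")))  # café  (valid UTF-8 line)
print(decode_whole_line(b"caf\xe9"))              # café  (8-bit line, fallback used)
print(decode_whole_line(b"\xc3\xa9 \xe9"))        # Ã© é  (mixed line: UTF-8 part mangled)
```

The last call shows the trade-off: with a per-line fallback, a line that mixes both encodings has to be read one way or the other, so one half of it is always mangled.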

If both users are using mIRC 7, there should be no problem with messages. But messages from other users, or channel-names (or anything else) that contain 8-bit (or 16-bit) non-UTF-8 text, but include something that looks like a UTF-8 encoded character, will get mangled. :\

Not exactly a bug; but inferior to the mIRC 6 behaviour, IMHO.
Posted By: qwerty

Re: mIRC 7 Charset Detection - 17/08/10 12:16 PM

Quote:
I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()).
mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in.
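
For reference, the arithmetic a scripted $chr/$asc equivalent would need is the standard UTF-16 surrogate-pair conversion. The sketch below is in Python rather than mIRC script, and the function names `to_surrogates` and `from_surrogates` are made up for illustration.

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000                  # 20-bit value
    high = 0xD800 + (v >> 10)         # top 10 bits
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a high/low surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low))              # 0xd834 0xdd1e
print(hex(from_surrogates(high, low)))  # 0x1d11e
```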
Posted By: Knoeki

Re: mIRC 7 Charset Detection - 17/08/10 01:03 PM

Originally Posted By: qwerty
Quote:
I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()).
mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in.


Agreed. I'm also pretty sure it's been suggested before.
Posted By: jaytea

Re: mIRC 7 Charset Detection - 17/08/10 05:35 PM

Originally Posted By: Knoeki
Originally Posted By: qwerty
Quote:
I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()).
mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in.


Agreed. I'm also pretty sure it's been suggested before.


for mIRC to support the full Unicode code space internally, each member being a separate individual character in our scripts for all intents and purposes, would probably require it to switch from the current UTF-16 encoding form to UTF-32 (otherwise every string related function would need to parse the line for surrogate pairs as is currently done for displaying text). this would double the amount of memory required to store Unicode internally which is likely an unreasonable compromise given how rare it is to encounter characters in the supplementary Unicode planes
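
The memory trade-off can be illustrated with a quick Python sketch. It only measures encoded sizes with the standard codecs and says nothing about how mIRC actually allocates strings. The helper name `storage_bytes` and the sample strings are made up for this example.

```python
def storage_bytes(text: str):
    """Return (UTF-16 size, UTF-32 size) in bytes, without a byte-order mark."""
    return len(text.encode("utf-16-le")), len(text.encode("utf-32-le"))


# Text inside the BMP: UTF-32 needs twice the space of UTF-16.
print(storage_bytes("mIRC 7"))      # (12, 24)

# A supplementary-plane character costs 4 bytes in both forms.
print(storage_bytes("\U0001D11E"))  # (4, 4)

# With UTF-16, counting code units differs from counting characters,
# which is why string functions would have to scan for surrogate pairs.
sample = "a\U0001D11Eb"
utf16_units = len(sample.encode("utf-16-le")) // 2
print(utf16_units, len(sample))     # 4 3
```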
Posted By: MeStinkBAD

Re: mIRC 7 Charset Detection - 17/08/10 10:08 PM

Here's the source of a Windows equivalent of the unix "iconv" command line tool used for converting between different charsets...
Posted By: Voglea

Re: mIRC 7 Charset Detection - 18/08/10 03:16 AM

http://www.hawkee.com/scripts/13334373/ like iconv too, but as native mirc script :}
Posted By: jaytea

Re: mIRC 7 Charset Detection - 18/08/10 06:47 AM

mIRC can already perform code page lookups and conversions between them quite simply, though it only directly supports the ANSI code pages as well as a few other less common ones. the full list and rough method is provided in the 'Unicode Related Functions' section of my article on Unicode support
Posted By: uberRegenbogen

Re: mIRC 7 Charset Detection - 23/08/10 03:03 PM

Originally Posted By: qwerty
mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in.

Unfortunately it seems not to work for me. I tried jaytea's UTF-16 demonstrations, and mIRC displayed the constituent surrogate pair (as two characters) instead of a single character. I'm guessing that it's a Wine (1.0.1) issue.