Originally Posted By: bwuser
Originally Posted By: Collective
...mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.

No! Still wrong!
Originally Posted By: http://en.wikipedia.org/wiki/UTF-8
The first 128 characters of the Unicode character set (which correspond directly to the ASCII) use a single octet with the same binary value as in ASCII.


That is UTF-8—which is not quite the same thing as Unicode. UTF-8 is a particular method of encoding Unicode for 8-bit transports (e.g. IRC). In Unicode the first 256 codepoints (0-255) correspond to ISO-8859-1. UTF-8 uses byte-values in the 128-255 range for encoded characters; therefore bytes in this range will never correspond to the values of the encoded characters—destroying the codepoint/byte-value relationship for the range.

Originally Posted By: bwuser
Originally Posted By: Wims
...it might be because mirc recognize an invalid sequence of utf-8 and then don't decode it.

Why would it decode UTF-8? It probably turns it into UTF-16 for internal use, but I wouldn't call that decoding.

I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()).

As for decoding, my experience is that mIRC 6 (in UTF-8-decode mode) handles UTF-8 decoding errors by rendering the entire line as 8-bit text in the system's default codepage—Not a bad fallback for IRC, where there is still a lot of non-UTF-8 floating about.

mIRC 7.08 [yeah i need to up to the release] seems to favour UTF-8—only falling back to 8-bit if there is nothing on the line that looks like UTF-8. This causes non-UTF-8 text on a line that mIRC thinks is UTF-8 to get mangled (as you both correctly surmised). I feel that the 6.35 behaviour makes more sense for the current IRC landscape.

If both users are using mIRC 7, there should be no problem with messages. But messages from other users, or channel-names (or anything else) that contain 8-bit (or 16-bit) non-UTF-8 text, but include something that looks like a UTF-8 encoded character, will get mangled. :\

Not exactly a bug; but inferior to the mIRC 6 behaviour, IMHO.


Irgendwo über dem Regenbogen