Re: mIRC 7 Charset Detection #224747 17/08/10 12:01 PM
uberRegenbogen (Pikka bird) | Joined: Aug 2010 | Posts: 19 | USA
Originally Posted By: bwuser
    Originally Posted By: Collective
        ...mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.
    No! Still wrong!
    Originally Posted By: http://en.wikipedia.org/wiki/UTF-8
        The first 128 characters of the Unicode character set (which correspond directly to the ASCII) use a single octet with the same binary value as in ASCII.

That is UTF-8, which is not quite the same thing as Unicode. UTF-8 is a particular method of encoding Unicode for 8-bit transports (e.g. IRC). In Unicode, the first 256 code points (0-255) correspond to ISO-8859-1. UTF-8 uses byte values in the 128-255 range as the lead and continuation bytes of multi-byte sequences; a byte in that range therefore never stands for the code point of the same value, which destroys the code point/byte-value relationship for that range.

Originally Posted By: bwuser
    Originally Posted By: Wims
        ...it might be because mirc recognize an invalid sequence of utf-8 and then don't decode it.
    Why would it decode UTF-8? It probably turns it into UTF-16 for internal use, but I wouldn't call that decoding.

I would, but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling characters beyond the 16-bit range (code points above U+FFFF, which UTF-16 encodes as surrogate pairs), which mIRC doesn't yet seem to support; though it's possible that I'm wrong, and it is UCS-4 internally and just doesn't support it fully in scripts (e.g. $chr(), $asc()).

As for decoding, my experience is that mIRC 6 (in UTF-8-decode mode) handles UTF-8 decoding errors by rendering the entire line as 8-bit text in the system's default codepage. Not a bad fallback for IRC, where there is still a lot of non-UTF-8 floating about. mIRC 7.08 [yeah, I need to upgrade to the release] seems to favour UTF-8, only falling back to 8-bit if there is nothing on the line that looks like UTF-8. This causes non-UTF-8 text on a line that mIRC thinks is UTF-8 to get mangled (as you both correctly surmised).
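The two fallback behaviours described above can be modelled roughly as follows. This is only a sketch of the behaviour as observed from outside, not mIRC's actual code: the function names, the use of cp1252 as a stand-in for "the system's default codepage", and the choice to show undecodable bytes as U+FFFD are all assumptions made for illustration.

```python
# Hypothetical models of the two line-decoding policies described in the post.
# Nothing here is taken from mIRC itself.

FALLBACK_CODEPAGE = "cp1252"  # assumption: stand-in for the system default codepage


def decode_strict_line(raw: bytes) -> str:
    """mIRC 6-style, as described: one UTF-8 error makes the whole line 8-bit text."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(FALLBACK_CODEPAGE, errors="replace")


def decode_prefer_utf8(raw: bytes) -> str:
    """mIRC 7.08-style, as described: fall back only if nothing looks like UTF-8."""
    text = raw.decode("utf-8", errors="replace")
    # "Looks like UTF-8" is modelled as: at least one valid multi-byte
    # sequence decoded to a non-ASCII character other than U+FFFD.
    has_valid_multibyte = any(ch > "\x7f" and ch != "\ufffd" for ch in text)
    if has_valid_multibyte:
        return text  # stray 8-bit bytes stay mangled as U+FFFD
    return raw.decode(FALLBACK_CODEPAGE, errors="replace")


# UTF-8 "café " followed by Latin-1 "naïve" on the same line.
MIXED_LINE = b"caf\xc3\xa9 na\xefve"

strict_result = decode_strict_line(MIXED_LINE)   # 'cafÃ© naïve'
lenient_result = decode_prefer_utf8(MIXED_LINE)  # 'café na\ufffdve'
```

Under the first policy the UTF-8 part of a mixed line shows up as two 8-bit characters but the 8-bit part survives; under the second the UTF-8 part is right and the 8-bit part is lost. A line that is purely 8-bit text comes out the same under both.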
I feel that the 6.35 behaviour makes more sense for the current IRC landscape. If both users are using mIRC 7, there should be no problem with messages. But messages from other users, or channel names (or anything else), that contain 8-bit (or 16-bit) non-UTF-8 text but include something that looks like a UTF-8 encoded character will get mangled. :\ Not exactly a bug, but inferior to the mIRC 6 behaviour, IMHO.
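For reference, the encoding facts this post leans on (code points 0-255 matching ISO-8859-1, UTF-8 keeping single-byte identity only for 0-127, and UTF-16 differing from UCS-2 only above U+FFFF) can be checked with a few lines of Python. The helper name `utf16_code_units` is made up for this illustration, and the snippet says nothing about what mIRC does internally.

```python
import struct


def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit code units of text when encoded as big-endian UTF-16."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))


# Code points 0-255 map one-to-one onto ISO-8859-1 byte values.
latin1_e_acute = "\u00e9".encode("latin-1")  # b'\xe9'

# In UTF-8 the same code point becomes two bytes, neither of which is 0xE9.
utf8_e_acute = "\u00e9".encode("utf-8")      # b'\xc3\xa9'

# ASCII (0-127) is the only range where UTF-8 bytes equal the code point.
utf8_ascii = "A".encode("utf-8")             # b'A'

# A character inside the 16-bit range is one code unit in UCS-2 and UTF-16 alike.
bmp_units = utf16_code_units("\u20ac")       # [0x20AC]

# A code point above U+FFFF needs a surrogate pair, which plain UCS-2 lacks.
clef_units = utf16_code_units("\U0001D11E")  # [0xD834, 0xDD1E]
```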

