


mIRC 7 Charset Detection #223796 01/08/10 08:38 PM
bwuser B	Hey, I'm wondering to what extent mIRC is performing charset detection on incoming messages. Examples: - mIRC 6.x user sends "äöü" (German umlauts) in ISO-8859-1 or ISO-8859-15: mIRC 7 displays them just fine. - French user has trouble getting é to display correctly. This means that mIRC 7 must be performing some kind of charset detection, although in a limited fashion. Can anybody (Khaled?) enlighten me about the specifics? Thanks!

Re: mIRC 7 Charset Detection #223798 01/08/10 08:43 PM
Wims Hoopy frood W Joined: Jul 2006 Posts: 4,062 France	I'm French and have no problem. The real problem is that such users do not understand that mIRC is now Unicode; there's no charset anymore.
Quote: - mIRC 6.x user sends "äöü" (German umlauts) in ISO-8859-1 or ISO-8859-15: mIRC 7 displays them just fine.
This is because of UTF-8: the first 255 characters remain the same. Did you see this?
Last edited by Wims; 01/08/10 08:46 PM.
#mircscripting @ irc.swiftirc.net == the best mIRC help channel

Re: mIRC 7 Charset Detection Wims #223814 02/08/10 05:51 AM
bwuser B	Originally Posted By: Wims This is because of utf-8, the first 255 characters remain the same That's wrong. Only the first 127 characters remain the same, hence my question.

Re: mIRC 7 Charset Detection #223816 02/08/10 06:08 AM
Collective Hoopy frood C Joined: Dec 2002 Posts: 3,015 London, UK	The French user you linked to had no trouble displaying e-acute, hence the screenshots containing it. His problem is that mIRC is encoding a channel name containing é even though the original channel has an unencoded name. There is no charset detection. mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.

Re: mIRC 7 Charset Detection #223821 02/08/10 06:58 AM
Wims Hoopy frood W Joined: Jul 2006 Posts: 4,062 France	Oh yeah, my bad. Then it might be because mIRC recognizes an invalid UTF-8 sequence and doesn't decode it.
#mircscripting @ irc.swiftirc.net == the best mIRC help channel

Re: mIRC 7 Charset Detection Collective #223955 03/08/10 05:52 PM
bwuser B	Originally Posted By: Collective
There is no charset detection. mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.
No! Still wrong!
Originally Posted By: http://en.wikipedia.org/wiki/UTF-8
The first 128 characters of the Unicode character set (which correspond directly to the ASCII) use a single octet with the same binary value as in ASCII.
Originally Posted By: Wims
Oh yeah my bad, then it might be because mirc recognize an invalid sequence of utf-8 and then don't decode it.
Why would it decode UTF-8? It probably turns it into UTF-16 for internal use, but I wouldn't call that decoding. Also, more importantly, ISO-8859-1 encoded umlauts are NOT valid UTF-8. If interpreted as UTF-8 without any conversion (which is what you suggested), the characters turn into gibberish, not the correct visual representation.
So anybody? Khaled? To clarify: All I want to know is what mIRC does when it encounters an input sequence that is not valid UTF-8.
Last edited by bwuser; 03/08/10 05:54 PM.
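The byte-level claim in this post can be checked directly. The following is a minimal Python sketch, not anything taken from mIRC; the helper name `is_valid_utf8` is made up for illustration. It shows that "äöü" encoded as ISO-8859-1 is not a valid UTF-8 byte sequence, while the Unicode code points of those characters do coincide with their ISO-8859-1 byte values.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "äöü"
latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # two bytes per character here

# For U+0000..U+00FF the Unicode code point equals the ISO-8859-1 byte value.
codepoints_match_latin1 = [ord(c) for c in text] == list(latin1_bytes)

if __name__ == "__main__":
    print(latin1_bytes.hex(" "), "valid UTF-8:", is_valid_utf8(latin1_bytes))
    print(utf8_bytes.hex(" "), "valid UTF-8:", is_valid_utf8(utf8_bytes))
    print("code points match ISO-8859-1 bytes:", codepoints_match_latin1)
```

Only the ASCII range (bytes 0-127) is byte-for-byte identical between ISO-8859-1 and UTF-8; the 128-255 range agrees at the code point level but not at the byte level.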

Re: mIRC 7 Charset Detection #223958 03/08/10 06:15 PM
Collective Hoopy frood C Joined: Dec 2002 Posts: 3,015 London, UK	Originally Posted By: bwuser
Originally Posted By: Collective
There is no charset detection. mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.
No! Still wrong!
Originally Posted By: http://en.wikipedia.org/wiki/UTF-8
The first 128 characters of the Unicode character set (which correspond directly to the ASCII) use a single octet with the same binary value as in ASCII.
I'm not sure what point you're trying to make by quoting that. I never said that characters between 128-255 were sent unencoded. I said that when they were received (UTF-8-encoded or otherwise) they'd be displayed as Unicode, and that mIRC does not seek to override their display based on some codepage detection mechanism.
Quote: Why would it decode UTF-8? It probably turns it into UTF-16 for internal use, but I wouldn't call that decoding.
"Decode" seems a perfectly reasonable term here. Don't make me get out a dictionary.
Quote: Also, more importantly, ISO-8859-1 encoded umlauts are NOT valid UTF-8. If interpreted as UTF-8 without any conversion (which is what you suggested)
That isn't what he suggested. He suggested invalid UTF-8 sequences would not be decoded (or "converted", as you put it). This is not entirely true (and hence neither is my assertion above) due to this bug; however, it is true for invalid sequences that the bug doesn't affect.
Put another way: bytes that are not part of a valid UTF-8 sequence are treated as characters.
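One way to picture "bytes that are not part of a valid UTF-8 sequence are treated as characters" is a decoder that falls back byte by byte. The sketch below is an assumption about what such behaviour could look like, written in Python; it is not mIRC's code, and the names `_latin1_fallback` and `decode_lenient` are hypothetical. It uses the standard `codecs.register_error` hook so that every byte the UTF-8 decoder rejects is mapped to the code point with the same value.

```python
import codecs


def _latin1_fallback(err):
    """Error handler: turn each undecodable byte into the character with
    the same code point (i.e. interpret it as ISO-8859-1)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which to resume decoding.
    return bad.decode("latin-1"), err.end


# The handler name is arbitrary; "latin1_fallback" is chosen for this sketch.
codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_lenient(data: bytes) -> str:
    """Decode valid UTF-8 sequences normally; treat stray bytes as characters."""
    return data.decode("utf-8", errors="latin1_fallback")


if __name__ == "__main__":
    print(decode_lenient(b"\xe4\xf6\xfc"))       # pure ISO-8859-1 input
    print(decode_lenient(b"caf\xc3\xa9 \xe4"))   # UTF-8 and ISO-8859-1 mixed
```

With this approach a line that mixes both encodings still displays sensibly, which matches the observation that ISO-8859-1 umlauts from mIRC 6.x users show up correctly.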

Re: mIRC 7 Charset Detection #224747 17/08/10 12:01 PM
uberRegenbogen Pikka bird U Joined: Aug 2010 Posts: 19 USA	Originally Posted By: bwuser
Originally Posted By: Collective
...mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.
No! Still wrong!
Originally Posted By: http://en.wikipedia.org/wiki/UTF-8
The first 128 characters of the Unicode character set (which correspond directly to the ASCII) use a single octet with the same binary value as in ASCII.
That is UTF-8, which is not quite the same thing as Unicode. UTF-8 is a particular method of encoding Unicode for 8-bit transports (e.g. IRC). In Unicode, the first 256 codepoints (0-255) correspond to ISO-8859-1. UTF-8 uses byte values in the 128-255 range for encoded characters; therefore bytes in this range will never correspond to the values of the encoded characters, which destroys the codepoint/byte-value relationship for that range.
Originally Posted By: bwuser
Originally Posted By: Wims
...it might be because mirc recognize an invalid sequence of utf-8 and then don't decode it.
Why would it decode UTF-8? It probably turns it into UTF-16 for internal use, but I wouldn't call that decoding.
I would, but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that I'm wrong, and it is UCS-4 internally and just doesn't support it fully in scripts (e.g. $chr(), $asc()).
As for decoding, my experience is that mIRC 6 (in UTF-8-decode mode) handles UTF-8 decoding errors by rendering the entire line as 8-bit text in the system's default codepage. That is not a bad fallback for IRC, where there is still a lot of non-UTF-8 floating about. mIRC 7.08 [yeah, I need to update to the release] seems to favour UTF-8, only falling back to 8-bit if there is nothing on the line that looks like UTF-8. This causes non-UTF-8 text on a line that mIRC thinks is UTF-8 to get mangled (as you both correctly surmised).
I feel that the 6.35 behaviour makes more sense for the current IRC landscape. If both users are using mIRC 7, there should be no problem with messages. But messages from other users, or channel names (or anything else), that contain 8-bit (or 16-bit) non-UTF-8 text but include something that looks like a UTF-8 encoded character will get mangled. :\ Not exactly a bug, but inferior to the mIRC 6 behaviour, IMHO.
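The whole-line fallback described for mIRC 6 can be sketched as follows. This is a Python illustration of the behaviour as reported in the post, not mIRC's implementation; the function name `decode_line_whole_fallback` is hypothetical, and `cp1252` merely stands in for "the system's default codepage".

```python
def decode_line_whole_fallback(line: bytes, codepage: str = "cp1252") -> str:
    """Try UTF-8 for the whole line; on any decoding error, reinterpret
    the entire line as 8-bit text in the given codepage (assumed default)."""
    try:
        return line.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values cp1252 leaves undefined.
        return line.decode(codepage, errors="replace")


if __name__ == "__main__":
    print(decode_line_whole_fallback(b"caf\xc3\xa9"))        # valid UTF-8 line
    print(decode_line_whole_fallback(b"\xe4\xf6\xfc"))       # pure 8-bit line
    print(decode_line_whole_fallback(b"caf\xc3\xa9 \xe4"))   # mixed line
```

The mixed line shows the trade-off: one stray 8-bit byte makes the whole line fall back, so the valid UTF-8 "é" is rendered as two 8-bit characters. The behaviour reported for mIRC 7.08 makes the opposite choice and sacrifices the 8-bit text instead.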

Re: mIRC 7 Charset Detection uberRegenbogen #224748 17/08/10 12:16 PM
qwerty Hoopy frood Q Joined: Jan 2003 Posts: 2,125	Quote: I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()). mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in. Last edited by qwerty; 17/08/10 05:04 PM.
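For readers unfamiliar with surrogate pairs, the arithmetic behind a scripted $chr/$asc equivalent looks roughly like this. The sketch is in Python rather than mIRC script, and the function names `to_surrogates` and `from_surrogates` are made up for illustration; the formulas themselves are the standard UTF-16 ones.

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000                    # 20-bit value
    high = 0xD800 + (v >> 10)           # top 10 bits
    low = 0xDC00 + (v & 0x3FF)          # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a high/low surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x1F600)
    print(hex(high), hex(low), hex(from_surrogates(high, low)))
```

A script-level equivalent would output the two 16-bit units in sequence, which is what a UTF-16 based client then needs to render as a single character.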

Re: mIRC 7 Charset Detection qwerty #224749 17/08/10 01:03 PM
Knoeki K	Originally Posted By: qwerty
Quote: I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()).
mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in.
Agreed. I'm also pretty sure it's been suggested before.

Re: mIRC 7 Charset Detection #224756 17/08/10 05:35 PM
jaytea Fjord artisan J Joined: Feb 2006 Posts: 523	Originally Posted By: Knoeki
Originally Posted By: qwerty
Quote: I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()).
mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in.
Agreed. I'm also pretty sure it's been suggested before.
For mIRC to support the full Unicode code space internally, with each code point being a separate individual character in our scripts for all intents and purposes, it would probably have to switch from the current UTF-16 encoding form to UTF-32 (otherwise every string-related function would need to parse the line for surrogate pairs, as is currently done for displaying text). This would double the amount of memory required to store Unicode internally, which is likely an unreasonable compromise given how rare it is to encounter characters in the supplementary Unicode planes.
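The memory trade-off can be made concrete with a small Python sketch. It only measures encoded sizes of sample strings and makes no claim about how mIRC actually lays out its buffers; the function name `storage_cost` is hypothetical.

```python
def storage_cost(text: str) -> dict:
    """Compare the storage needed for a string as UTF-16 and as UTF-32,
    and count code points versus 16-bit code units."""
    utf16 = text.encode("utf-16-le")   # no byte-order mark
    utf32 = text.encode("utf-32-le")   # no byte-order mark
    return {
        "code_points": len(text),
        "utf16_units": len(utf16) // 2,
        "utf16_bytes": len(utf16),
        "utf32_bytes": len(utf32),
    }


if __name__ == "__main__":
    print(storage_cost("hello"))         # BMP-only text
    print(storage_cost("a\U0001F600"))   # one supplementary character
```

For BMP-only text UTF-32 takes exactly twice the space of UTF-16, while UTF-16 gives up the one-unit-per-character property as soon as a supplementary character appears, which is why string functions would have to scan for surrogate pairs.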

Re: mIRC 7 Charset Detection jaytea #224762 17/08/10 10:08 PM
MeStinkBAD M	Here's the source of a Windows equivalent of the unix "iconv" command line tool used for converting between different charsets...

Re: mIRC 7 Charset Detection #224769 18/08/10 03:16 AM
Voglea V	http://www.hawkee.com/scripts/13334373/ is like iconv too, but as a native mIRC script :}

Re: mIRC 7 Charset Detection #224772 18/08/10 06:47 AM
jaytea Fjord artisan J Joined: Feb 2006 Posts: 523	mIRC can already perform code page lookups and conversions between them quite simply, though it only directly supports the ANSI code pages as well as a few other less common ones. the full list and rough method is provided in the 'Unicode Related Functions' section of my article on Unicode support

Re: mIRC 7 Charset Detection qwerty #224992 23/08/10 03:03 PM
uberRegenbogen Pikka bird U Joined: Aug 2010 Posts: 19 USA	Originally Posted By: qwerty mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in. Unfortunately it seems not to work for me. I tried jaytea's UTF-16 demonstrations, and mIRC displayed the constituent surrogate pair (as two characters) instead of a single character. I'm guessing that it's a Wine (1.0.1) issue.
