#223796 01/08/10 08:38 PM
Joined: Jul 2006
Posts: 248
B
bwuser Offline OP
Fjord artisan
Hey,

I'm wondering to what extent mIRC is performing charset detection on incoming messages.

Examples:
- mIRC 6.x user sends "äöü" (German umlauts) in ISO-8859-1 or ISO-8859-15: mIRC 7 displays them just fine.
- French user has trouble getting é to display correctly.

This means that mIRC 7 must be performing some kind of charset detection, although in a limited fashion. Can anybody (Khaled?) enlighten me about the specifics? Thanks!

Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Offline
I'm French and have no problem. The real problem is that such users do not understand that mIRC is now Unicode; there's no charset anymore.
Quote:
- mIRC 6.x user sends "äöü" (German umlauts) in ISO-8859-1 or ISO-8859-15: mIRC 7 displays them just fine.
This is because of UTF-8: the first 255 characters remain the same.

Did you see this?

Last edited by Wims; 01/08/10 08:46 PM.

#mircscripting @ irc.swiftirc.net == the best mIRC help channel
Joined: Jul 2006
Posts: 248
B
bwuser Offline OP
Fjord artisan
Originally Posted By: Wims
This is because of UTF-8: the first 255 characters remain the same.

That's wrong. Only the first 128 characters (code points 0-127) remain the same, hence my question. ;)

Joined: Dec 2002
Posts: 3,138
C
Hoopy frood
Offline
The French user you linked to had no trouble displaying e-acute, hence the screenshots containing it. His problem is that mIRC is encoding a channel name containing é even though the original channel has an unencoded name.

There is no charset detection. mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.

Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Offline
Oh yeah, my bad. Then it might be because mIRC recognizes an invalid UTF-8 sequence and then doesn't decode it.


Joined: Jul 2006
Posts: 248
B
bwuser Offline OP
Fjord artisan
Originally Posted By: Collective
There is no charset detection. mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.

No! Still wrong!
Originally Posted By: http://en.wikipedia.org/wiki/UTF-8
The first 128 characters of the Unicode character set (which correspond directly to the ASCII) use a single octet with the same binary value as in ASCII.


Originally Posted By: Wims
Oh yeah, my bad. Then it might be because mIRC recognizes an invalid UTF-8 sequence and then doesn't decode it.

Why would it decode UTF-8? It probably turns it into UTF-16 for internal use, but I wouldn't call that decoding.
Also, more importantly, ISO-8859-1 encoded umlauts are NOT valid UTF-8. If interpreted as UTF-8 without any conversion (which is what you suggested), the characters turn into gibberish, not the correct visual representation.
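
To make the point above concrete, here is a minimal sketch in Python (not mIRC script, and not a description of mIRC's internals). The helper name `is_valid_utf8` is made up for illustration. It shows that the ISO-8859-1 bytes for "äöü" do not form valid UTF-8, and what a decoder that insists on UTF-8 would produce from them.

```python
# ISO-8859-1 encodes each umlaut as a single byte above 0x7F.
latin1_bytes = "äöü".encode("iso-8859-1")

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Forcing a UTF-8 interpretation replaces every offending byte
# with U+FFFD, so the umlauts are lost.
garbled = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes, is_valid_utf8(latin1_bytes), garbled)
```

So a client that displays such input correctly has to fall back to some 8-bit interpretation at some point; the sketch does not say how mIRC decides when to do that.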

So anybody? Khaled? :(

To clarify: All I want to know is what mIRC does when it encounters an input sequence that is not valid UTF-8.

Last edited by bwuser; 03/08/10 05:54 PM.
Joined: Dec 2002
Posts: 3,138
C
Hoopy frood
Offline
Originally Posted By: bwuser
Originally Posted By: Collective
There is no charset detection. mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.

No! Still wrong!
Originally Posted By: http://en.wikipedia.org/wiki/UTF-8
The first 128 characters of the Unicode character set (which correspond directly to the ASCII) use a single octet with the same binary value as in ASCII.

I'm not sure what point you're trying to make by quoting that. I never said that characters between 128-255 were sent unencoded. I said that when they were received (UTF-8-encoded or otherwise) they'd be displayed as Unicode, and that mIRC does not seek to override their display based on some codepage detection mechanism.

Quote:
Why would it decode UTF-8? It probably turns it into UTF-16 for internal use, but I wouldn't call that decoding.

"Decode" seems a perfectly reasonable term here. Don't make me get out a dictionary.

Quote:
Also, more importantly, ISO-8859-1 encoded umlauts are NOT valid UTF-8. If interpreted as UTF-8 without any conversion (which is what you suggested)

That isn't what he suggested. He suggested invalid UTF-8 sequences would not be decoded (or "converted", as you put it). This is not entirely true (and hence neither is my assertion above) due to this bug; however, it is true for invalid sequences that the bug doesn't affect. Put another way: bytes that are not part of a valid UTF-8 sequence are treated as characters.
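
A rough sketch of that last sentence, written in Python rather than mIRC script. It is only an illustration of the idea "bytes that are not part of a valid UTF-8 sequence are treated as characters", not mIRC's actual code. The handler name `latin1_fallback` and the function `decode_lenient` are made up for this example.

```python
import codecs

def latin1_fallback(err):
    """Error handler: turn each undecodable byte into the code point
    with the same numeric value (i.e. read it as ISO-8859-1)."""
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.end]
        return bad.decode("iso-8859-1"), err.end
    raise err

# Register the handler under a hypothetical name so decode() can use it.
codecs.register_error("latin1_fallback", latin1_fallback)

def decode_lenient(raw: bytes) -> str:
    """Decode valid UTF-8 sequences normally, stray bytes as characters."""
    return raw.decode("utf-8", errors="latin1_fallback")

print(decode_lenient(b"\xe4\xf6\xfc"))        # pure ISO-8859-1 input
print(decode_lenient(b"\xc3\xa9 and \xe9"))   # UTF-8 and ISO-8859-1 mixed
```

With this approach both the UTF-8 encoded é and the bare 0xE9 byte end up as the same character, which would explain why ISO-8859-1 umlauts can still display correctly without any charset detection.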

Joined: Aug 2010
Posts: 19
U
Pikka bird
Offline
Originally Posted By: bwuser
Originally Posted By: Collective
...mIRC uses Unicode, in which the first 255 characters are the same as those in ISO-8859-1.

No! Still wrong!
Originally Posted By: http://en.wikipedia.org/wiki/UTF-8
The first 128 characters of the Unicode character set (which correspond directly to the ASCII) use a single octet with the same binary value as in ASCII.


That is UTF-8—which is not quite the same thing as Unicode. UTF-8 is a particular method of encoding Unicode for 8-bit transports (e.g. IRC). In Unicode the first 256 codepoints (0-255) correspond to ISO-8859-1. UTF-8 uses byte-values in the 128-255 range for encoded characters; therefore bytes in this range will never correspond to the values of the encoded characters—destroying the codepoint/byte-value relationship for the range.
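
The distinction can be checked with a small Python sketch (the function names are made up for illustration). It confirms that every ISO-8859-1 byte value maps to the Unicode code point with the same number, while the UTF-8 encoding of code points 128-255 needs two bytes.

```python
def latin1_matches_unicode() -> bool:
    """True if byte value N in ISO-8859-1 is code point N for all 0-255."""
    return all(bytes([n]).decode("iso-8859-1") == chr(n) for n in range(256))

def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(chr(code_point).encode("utf-8"))

print(latin1_matches_unicode())
print(utf8_length(0x41), utf8_length(0xE9))   # 'A' versus 'é'
print(chr(0xE9).encode("utf-8"))              # the two bytes used for 'é'
```

So both statements in this thread are right about different things: 256 code points are shared with ISO-8859-1, but only the first 128 keep their single-byte form in UTF-8.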

Originally Posted By: bwuser
Originally Posted By: Wims
...it might be because mIRC recognizes an invalid UTF-8 sequence and then doesn't decode it.

Why would it decode UTF-8? It probably turns it into UTF-16 for internal use, but I wouldn't call that decoding.

I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()).

As for decoding, my experience is that mIRC 6 (in UTF-8-decode mode) handles UTF-8 decoding errors by rendering the entire line as 8-bit text in the system's default codepage. That is not a bad fallback for IRC, where there is still a lot of non-UTF-8 text floating about.

mIRC 7.08 [yes, I need to update to the release] seems to favour UTF-8, only falling back to 8-bit if there is nothing on the line that looks like UTF-8. This causes non-UTF-8 text on a line that mIRC thinks is UTF-8 to get mangled (as you both correctly surmised). I feel that the 6.35 behaviour makes more sense for the current IRC landscape.

If both users are using mIRC 7, there should be no problem with messages. But messages from other users, or channel-names (or anything else) that contain 8-bit (or 16-bit) non-UTF-8 text, but include something that looks like a UTF-8 encoded character, will get mangled. :\

Not exactly a bug; but inferior to the mIRC 6 behaviour, IMHO.
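
The two strategies described above can be sketched in Python for comparison. This is a guess at the observed behaviour, not mIRC's real logic. The function names, the regular expression used to decide what "looks like UTF-8", the use of ISO-8859-1 as a stand-in for the system codepage, and the use of U+FFFD for mangled bytes are all assumptions made for the example.

```python
import re

# Approximate pattern for a well-formed multi-byte UTF-8 sequence
# (it does not reject overlong forms or encoded surrogates).
_UTF8_MULTIBYTE = re.compile(
    rb"[\xc2-\xdf][\x80-\xbf]"
    rb"|[\xe0-\xef][\x80-\xbf]{2}"
    rb"|[\xf0-\xf4][\x80-\xbf]{3}"
)

def decode_like_6x(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Any UTF-8 error makes the whole line fall back to the 8-bit codepage."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

def decode_like_7x(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Prefer UTF-8 if anything on the line looks like UTF-8."""
    if _UTF8_MULTIBYTE.search(raw):
        # Stray 8-bit bytes are shown as U+FFFD here purely for illustration.
        return raw.decode("utf-8", errors="replace")
    return raw.decode(fallback)

mixed = "é".encode("utf-8") + b" \xe4"   # UTF-8 'é' plus ISO-8859-1 'ä'
print(decode_like_6x(mixed))
print(decode_like_7x(mixed))
```

Neither strategy gets the mixed line fully right: the first garbles the UTF-8 part, the second garbles the 8-bit part. Lines that are purely ISO-8859-1 or purely UTF-8 come out the same in both.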


Irgendwo über dem Regenbogen
Joined: Jan 2003
Posts: 2,523
Q
Hoopy frood
Offline
Quote:
I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()).
mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in.
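
For reference, the arithmetic such scripted $chr/$asc equivalents would need is the standard UTF-16 surrogate pair calculation. Below is a sketch in Python rather than mIRC script; the function names are made up for illustration.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1D11E)   # U+1D11E MUSICAL SYMBOL G CLEF
print(hex(high), hex(low))
print(hex(from_surrogates(high, low)))
```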

Last edited by qwerty; 17/08/10 05:04 PM.

/.timerQ 1 0 echo /.timerQ 1 0 $timer(Q).com
Joined: Jan 2009
Posts: 116
Vogon poet
Offline
Originally Posted By: qwerty
Quote:
I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()).
mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in.


Agreed. I'm also pretty sure it's been suggested before.


http://zowb.net

/server -m irc.p2p-network.net -j #zomgwtfbbq
(ssl on port 6697 and 7000)
Joined: Feb 2006
Posts: 546
J
Fjord artisan
Offline
Originally Posted By: Knoeki
Originally Posted By: qwerty
Quote:
I would—but not to UTF-anything. AFAIK, internally, mIRC 7 is straight UCS-2 (raw 16-bit text). UTF-16 would only be useful for handling 24 and 32 bit characters, which mIRC doesn't yet seem to support; though it's possible that i'm wrong, and it is UCS-4 internally, and just doesn't support it fully in scripts (e.g. $chr(),$asc()).
mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in.


Agreed. I'm also pretty sure it's been suggested before.


For mIRC to support the full Unicode code space internally, with each member being a separate individual character in our scripts for all intents and purposes, it would probably have to switch from the current UTF-16 encoding form to UTF-32 (otherwise every string-related function would need to parse the line for surrogate pairs, as is currently done for displaying text). This would double the amount of memory required to store Unicode text internally, which is likely an unreasonable compromise given how rare it is to encounter characters in the supplementary Unicode planes.
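
The trade-off can be illustrated with a short Python sketch (the helper `code_units` is made up for this example and says nothing about mIRC's own storage). It compares byte counts and code unit counts for the same text in UTF-16 and UTF-32.

```python
def code_units(text: str, encoding: str) -> int:
    """Number of fixed-width code units the text needs in the encoding."""
    width = {"utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // width

bmp_text = "äöü"          # every character is inside the BMP
clef = "\U0001D11E"       # one character from a supplementary plane

# BMP-only text: UTF-32 takes twice the bytes of UTF-16.
print(len(bmp_text.encode("utf-16-le")), len(bmp_text.encode("utf-32-le")))

# Supplementary character: two UTF-16 code units, one UTF-32 code unit.
print(code_units(clef, "utf-16-le"), code_units(clef, "utf-32-le"))
```

With UTF-16, a function that indexes by code unit sees the clef as two items unless it scans for surrogate pairs; with UTF-32 it sees one item, at the cost of doubling storage for ordinary text.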


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
Joined: Apr 2003
Posts: 342
M
Fjord artisan
Offline
Here's the source of a Windows equivalent of the unix "iconv" command line tool used for converting between different charsets...


Beware of MeStinkBAD! He knows more than he actually does!
Joined: Nov 2009
Posts: 81
V
Babel fish
Offline
http://www.hawkee.com/scripts/13334373/ is like iconv too, but as a native mIRC script :}

Joined: Feb 2006
Posts: 546
J
Fjord artisan
Offline
mIRC can already perform code page lookups and conversions between them quite simply, though it only directly supports the ANSI code pages as well as a few other less common ones. The full list and rough method are provided in the 'Unicode Related Functions' section of my article on Unicode support.
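
As a language-neutral illustration of what such a conversion does (this is Python, not the mIRC method from the article, and the function name `convert` is made up), an iconv-style conversion is simply a decode from the source code page followed by an encode to the target.

```python
def convert(raw: bytes, src: str, dst: str) -> bytes:
    """Re-encode raw bytes from code page src to code page dst."""
    return raw.decode(src).encode(dst)

# The euro sign sits at 0x80 in Windows-1252 and at 0xA4 in ISO-8859-15.
print(convert(b"\x80", "cp1252", "iso-8859-15"))

# ISO-8859-1 umlauts converted to UTF-8 become two bytes each.
print(convert(b"\xe4\xf6\xfc", "iso-8859-1", "utf-8"))
```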


Joined: Aug 2010
Posts: 19
U
Pikka bird
Offline
Originally Posted By: qwerty
mIRC does support characters beyond the BMP through UTF-16 surrogate pairs - it is just $chr/$asc that don't. $chr/$asc equivalents can be scripted although it would be nice if this functionality were built in.

Unfortunately it seems not to work for me. I tried jaytea's UTF-16 demonstrations, and mIRC displayed the constituent surrogate pair (as two characters) instead of a single character. I'm guessing that it's a Wine (1.0.1) issue.

