5618 (OP), Hoopy frood
Joined: Jun 2007
Posts: 933
I was observing some odd behaviour with some non-UTF-8 strings of text I was receiving as a result of clients connecting with exotic realnames/gecoses.
It took me a while to narrow it down, but I think this is an accurate description.
Paste characters from the following ranges one after another and send the resulting string to yourself on an IRC server, without UTF-8 encoding, via //.raw -n PRIVMSG $me :<string>

0128-0223
0128-0191
0192-0247
anychar
anychar
anychar

The anychars are optional, but the result is much more pronounced when you add a minimum of 3. E.g.
//.raw -n PRIVMSG $me : $chr(0221) $+ $chr(0181) $+ $chr(0240) $+ 111
vs
//.raw PRIVMSG $me : $chr(0221) $+ $chr(0181) $+ $chr(0240) $+ 111

This result, by the way, is the same as doing a $utfdecode() on the whole string. So basically mIRC seems to be UTF-8 decoding a string that was never UTF-8 encoded.
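
To see what "decoding an unencoded string" means at the byte level, the bytes can be inspected outside mIRC. The Python sketch below is only an illustration and does not reproduce mIRC's own decoder: it assumes the example arrives as the raw bytes 221, 181, 240 followed by "111", and interprets them once as one character per byte (Latin-1) and once as UTF-8, with invalid sequences replaced by U+FFFD.

```python
# Raw bytes of the example: $chr(0221) $chr(0181) $chr(0240) followed by "111".
raw = bytes([221, 181, 240]) + b"111"

# One character per byte (Latin-1), i.e. no UTF-8 decoding at all.
as_latin1 = raw.decode("latin-1")

# Interpreted as UTF-8; bytes that do not form a valid sequence become U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

print([hex(ord(c)) for c in as_latin1])
print([hex(ord(c)) for c in as_utf8])
```

In the UTF-8 interpretation the first two bytes collapse into a single character, while byte 240 announces a four-byte sequence that never arrives, so Python's decoder replaces it.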

Last edited by 5618; 03/08/10 06:15 PM.
Hoopy frood
Joined: Dec 2002
Posts: 5,412
Thanks, this should be fixed in the next version.

5618 (OP), Hoopy frood
Joined: Jun 2007
Posts: 933
The behaviour is now indeed different on v7.11 and the output looks better. However, the first two chars are still combined into one char as if a $utfdecode was performed on them.

Fjord artisan
Joined: Feb 2006
Posts: 546
This is just a complete shot in the dark, but might this decoding failure have something to do with $isutf()'s current inability to detect invalid UTF-8?

Code:
//echo -a $isutf($chr(0221) $+ $chr(0181) $+ $chr(0240) $+ 111)


This returns 2, whereas we would expect it to return 0.
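
For comparison, a strict validity check rejects this string. The sketch below uses Python's strict UTF-8 decoder; `is_valid_utf8` is a hypothetical helper name, and the sample assumes the example is the raw bytes 221, 181, 240 followed by "111". It says nothing about how $isutf() is implemented.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 from start to end."""
    try:
        data.decode("utf-8")  # strict by default: raises on any invalid sequence
        return True
    except UnicodeDecodeError:
        return False


sample = bytes([221, 181, 240]) + b"111"

# The whole string is invalid: 240 starts a 4-byte sequence that never completes.
print(is_valid_utf8(sample))

# The first two bytes on their own form a complete 2-byte sequence.
print(is_valid_utf8(sample[:2]))
```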


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
Hoopy frood
Joined: Dec 2002
Posts: 5,412
As far as I can tell, the first two characters are being correctly UTF-8 decoded into character 0775 (hexadecimal, i.e. U+0775). You can check the character-by-character decoding of your string here.
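
The arithmetic behind that value can be checked by hand. The following Python sketch applies the standard two-byte UTF-8 layout (110xxxxx 10yyyyyy) to bytes 221 and 181; the variable names are purely illustrative.

```python
lead, cont = 221, 181  # 0xDD and 0xB5

# A 2-byte UTF-8 sequence has the bit layout 110xxxxx 10yyyyyy.
assert lead & 0b11100000 == 0b11000000  # 221 is a 2-byte lead byte
assert cont & 0b11000000 == 0b10000000  # 181 is a continuation byte

# Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)

print(f"{code_point:04X}")  # 0775
```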

The reason mIRC leaves in the "eth" character (compared to the above decoder) is that it is trying to preserve as much of the string as possible, even if it is not valid in a UTF-8 context.
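
One way to picture this kind of lenient decoding is sketched below in Python. This is an assumption for illustration only, not mIRC's actual algorithm, which the thread does not describe: `lenient_decode` is a hypothetical helper that decodes every valid UTF-8 sequence and passes any other byte through as the character with the same code.

```python
def lenient_decode(data: bytes) -> str:
    """Decode valid UTF-8 sequences; keep every other byte as its Latin-1 character."""
    out = []
    i = 0
    while i < len(data):
        # Try the longest candidate first (UTF-8 sequences are 1 to 4 bytes long).
        for length in (4, 3, 2, 1):
            chunk = data[i:i + length]
            if len(chunk) < length:
                continue  # not enough bytes left for this length
            try:
                ch = chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue  # not a valid sequence of this length
            if len(ch) == 1:
                # The chunk is exactly one encoded character: accept it.
                out.append(ch)
                i += length
                break
        else:
            # No valid sequence starts here: preserve the byte as-is.
            out.append(chr(data[i]))
            i += 1
    return "".join(out)


sample = bytes([221, 181, 240]) + b"111"
print([hex(ord(c)) for c in lenient_decode(sample)])
```

With the example string, bytes 221 and 181 become the single character U+0775, and the stray byte 240 survives as the "eth" character (U+00F0) instead of being dropped or replaced.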

mIRC is lenient by design when it comes to invalid UTF-8 combinations, since making the handling strict results in other side-effects. The UTF-8 method is consistent throughout mIRC, e.g. when processing server messages, loading files, internal conversions, script processing, and so on.

