


v7.1 receiving a specific non-UTF-8 strain #223957 03/08/10 06:11 PM
5618 (OP), Hoopy frood. Joined: Jun 2007, Posts: 933

I was observing some odd behaviour with some non-UTF-8 strings of text I was receiving as a result of clients connecting with exotic realnames/gecoses. It took me a while to narrow it down, but I think this is an accurate description.

Paste characters from the following ranges one after another and send the result to yourself on an IRC server without UTF-8 encoding via //.raw -n PRIVMSG $me :<string>

0128-0223 0128-0191 0192-0247 anychar anychar anychar

The anychars are optional, but the result is much more pronounced when you add a minimum of 3. E.g.

Code:
//.raw -n PRIVMSG $me : $chr(0221) $+ $chr(0181) $+ $chr(0240) $+ 111

vs

Code:
//.raw PRIVMSG $me : $chr(0221) $+ $chr(0181) $+ $chr(0240) $+ 111

This result, by the way, is the same as doing a $utfdecode() on the whole string. So basically mIRC seems to be decoding an unencoded string.

Last edited by 5618; 03/08/10 06:15 PM.

Re: v7.1 receiving a specific non-UTF-8 strain 5618 #223990 04/08/10 01:36 AM
Khaled, Hoopy frood. Joined: Dec 2002, Posts: 5,424, London, UK

Thanks, this should be fixed in the next version.

Re: v7.1 receiving a specific non-UTF-8 strain Khaled #225871 12/09/10 07:29 AM
5618 (OP), Hoopy frood. Joined: Jun 2007, Posts: 933

The behaviour is indeed different now on v7.11 and the output looks better. However, the first two chars are still combined into one char, as if a $utfdecode() had been performed on them.

Re: v7.1 receiving a specific non-UTF-8 strain 5618 #225876 12/09/10 12:16 PM
jaytea, Fjord artisan. Joined: Feb 2006, Posts: 546

this is just a complete shot in the dark, but might this decoding failure have something to do with $isutf()'s current inability to detect invalid UTF-8:

Code:
//echo -a $isutf($chr(0221) $+ $chr(0181) $+ $chr(0240) $+ 111)

= 2, whereas we would expect it to = 0

"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
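For comparison, here is a rough sketch of what a strict check makes of the same bytes. It is written in Python purely for illustration and says nothing about how $isutf() works internally; `is_strict_utf8` is a hypothetical helper name, not an mIRC or library function.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if the whole byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # Python's built-in decoder is strict by default
        return True
    except UnicodeDecodeError:
        return False


# The bytes from the thread: $chr(0221) $+ $chr(0181) $+ $chr(0240) $+ 111
sample = bytes([221, 181, 240]) + b"111"

print(is_strict_utf8(sample))      # whole string: 240 is a lead byte with no continuation bytes
print(is_strict_utf8(sample[:2]))  # first two bytes alone form a valid two-byte sequence
```

Under a strict reading the string as a whole is rejected, even though its first two bytes are a valid sequence on their own.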

Re: v7.1 receiving a specific non-UTF-8 strain 5618 #225921 13/09/10 10:39 AM
Khaled, Hoopy frood. Joined: Dec 2002, Posts: 5,424, London, UK

As far as I can tell, the first two characters are being correctly UTF-8 decoded into character 0775. You can check the character-by-character decoding method of your string here.

The reason mIRC leaves in the "eth" character (compared to the above decoder) is that it is trying to preserve as much of the string as possible, even if it is not valid in a UTF-8 context. mIRC is lenient by design when it comes to invalid UTF-8 combinations, since making them strict results in other side-effects.

The UTF-8 method is consistent throughout mIRC, e.g. when processing server messages, loading files, internal conversions, script processing, and so on.
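The arithmetic behind character 0775 (hexadecimal) can be checked by hand: byte 221 is 110 11101 in binary and byte 181 is 10 110101, so the payload bits 11101 110101 give 0x775. The Python sketch below imitates the lenient behaviour described above, under the assumption that any byte which does not start a valid sequence is kept as the character with the same code. It is not mIRC's actual algorithm, and `lenient_decode` is a hypothetical name.

```python
def lenient_decode(data: bytes) -> str:
    """Decode valid UTF-8 sequences; keep any other byte as chr(byte)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # Expected sequence length, judged from the lead byte alone.
        if b < 0x80:
            n = 1
        elif 0xC2 <= b <= 0xDF:
            n = 2
        elif 0xE0 <= b <= 0xEF:
            n = 3
        elif 0xF0 <= b <= 0xF4:
            n = 4
        else:
            n = 0  # stray continuation byte or invalid lead byte
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("not a complete sequence")
            out.append(chunk.decode("utf-8"))  # strict check of this one sequence
            i += n
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            out.append(chr(b))  # preserve the byte instead of dropping it
            i += 1
    return "".join(out)


sample = bytes([221, 181, 240]) + b"111"
decoded = lenient_decode(sample)
print([hex(ord(c)) for c in decoded])
```

With the sample string, the first two bytes collapse into U+0775, while byte 240 has no continuation bytes after it and so survives as the "eth" character (U+00F0), followed by the three 1s.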
