Hi,
I wanted to write a script that alters a block of UTF-8 text given by the user. Since custom dialogs don't support Unicode, I had to use a custom window for this. (UTF-8 encoding must be enabled for the custom window.) The user pastes a block of text, and the script processes each line and /alines it.
This works perfectly for a few short lines of text, but it turns out that if you paste a big block of text, mIRC doesn't encode it in UTF-8. (It also beeps when this happens, perhaps to warn me that the text isn't being encoded in UTF-8? But why isn't it?)
Anyway, here's a simple code snippet to show my point. It simply displays the values of the last three bytes of the text you enter in @test's editbox.
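; Open the test window first: /window -e @test
on *:INPUT:@test:{
  ; Grab the last three bytes of the line entered in the editbox
  var %last = $right($1-,3)
  ; Show the numeric value of each of them
  aline -p @test Last three bytes: $asc($left(%last,1)) $&
    $asc($mid(%last,2,1)) $asc($mid(%last,3,1))
}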
Now enable UTF-8 encoding with "/font", and paste the following line in the window:
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
The result is:
32 194 169. "32" is the space before the copyright sign, and "194 169" is the UTF-8 representation of the copyright sign. Good.
Now try pasting this:
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
(The line length and the number of lines are important, because the problem only appears past a certain threshold.)
Even though this is the same line and UTF-8 is still enabled, you will get a different result from the one above (once for each line):
58 32 169. "58" is the colon, "32" is the space, and "169" is a single-byte, non-UTF-8 representation of the copyright sign. Bad.
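To make the difference between the two results concrete, here is a small sketch in Python (not mIRC script, purely an illustration) that computes the last three bytes of the test line under two encodings. The helper name last_three_bytes is made up for this example, and Windows-1252 is only my assumption about what the "bad" single-byte result is. I don't know what mIRC actually does internally.

```python
# Illustration only: this is Python, not mIRC script.
# Assumption: the "bad" result matches a single-byte code page such as
# Windows-1252, where the copyright sign is the single byte 169.

def last_three_bytes(line, encoding):
    """Return the values of the last three bytes of `line` in `encoding`."""
    return list(line.encode(encoding)[-3:])

line = "This is a line with a non-ASCII character which will be encoded in UTF-8: ©"

# UTF-8: the copyright sign takes two bytes (194 169), preceded by the space.
print(last_three_bytes(line, "utf-8"))   # [32, 194, 169]

# Windows-1252: the copyright sign is one byte (169), so the colon shows up too.
print(last_three_bytes(line, "cp1252"))  # [58, 32, 169]
```

The first output matches what I get for a single pasted line, and the second matches what I get for the big block, which is at least consistent with the text not being converted to UTF-8 in that case.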
Thanks for reading,
Rotem