No UTF-8 encoding when pasting large text blocks

Hi,

I wanted to write a script that alters a block of UTF-8 text given by the user. Since custom dialogs don't support Unicode, I had to use a custom window for this. (UTF-8 encoding for custom windows must be enabled.) The user pastes a block of text, and the script processes each line and /alines it.

This solution works perfectly for a few small lines of text, but it turns out that if you paste a big block of text, mIRC doesn't encode it in UTF-8. (It also beeps when this happens, perhaps to warn me that there's no UTF-8 encoding? But why not?)

Anyway, here's a simple code snippet to show my point. It simply displays the values of the last three bytes of the text you enter in @test's editbox.

Code:

; /window -e @test
on *:INPUT:@test:{
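   ; Take the last three bytes of the text that was entered
   var %last = $right($1-,3)
   ; Print the numeric value of each of those bytes
   aline -p @test Last three bytes: $asc($left(%last,1)) $&
     $asc($mid(%last,2,1)) $asc($mid(%last,3,1))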
}

Now enable UTF-8 encoding with "/font", and paste the following line in the window:

Code:

This is a line with a non-ASCII character which will be encoded in UTF-8: ©

The result is 32 194 169. "32" is the space before the copyright sign, and "194 169" is the UTF-8 representation of the copyright sign. Good.
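For anyone who wants to check those byte values outside mIRC, here is a minimal Python sketch. Python is used purely for illustration and is not part of the original script; `last_three_bytes` is a hypothetical helper name. It encodes the test line as UTF-8 and lists the values of its last three bytes, mirroring what the snippet above prints.

```python
# The same test line that is pasted into @test.
LINE = "This is a line with a non-ASCII character which will be encoded in UTF-8: ©"


def last_three_bytes(text, encoding="utf-8"):
    """Return the values of the last three bytes of text in the given encoding."""
    return list(text.encode(encoding)[-3:])


# Space, then the two-byte UTF-8 sequence for the copyright sign.
print(last_three_bytes(LINE))  # [32, 194, 169]
```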

Now try pasting this:

Code:

This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©
This is a line with a non-ASCII character which will be encoded in UTF-8: ©

(The line length and number of lines are important, because it only happens after a certain threshold.)

Even though this is the same line and UTF-8 is still enabled, you will get a different result than above (one for each line): 58 32 169. "58" is the colon, "32" is the space and "169" is a single-byte, non-UTF-8 representation of the copyright sign. Bad.
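To make the difference between the two results concrete, here is a small Python sketch (illustration only, not mIRC code; `tail_bytes` is a hypothetical helper). It assumes the un-encoded text arrives in a single-byte code page such as Latin-1, where the copyright sign is the single byte 169. The report doesn't say which code page mIRC actually uses, so treat that as an assumption.

```python
# The same test line that is pasted into @test.
LINE = "This is a line with a non-ASCII character which will be encoded in UTF-8: ©"


def tail_bytes(text, encoding, count=3):
    """Return the values of the last `count` bytes of text in `encoding`."""
    return list(text.encode(encoding)[-count:])


# UTF-8: the copyright sign takes two bytes, so the tail is space + 194 169.
encoded_tail = tail_bytes(LINE, "utf-8")

# Assumption: Latin-1 stands in for the code page used when encoding is skipped.
# The copyright sign is a single byte, so the colon moves into the last three.
raw_tail = tail_bytes(LINE, "latin-1")

print(encoded_tail)  # [32, 194, 169]
print(raw_tail)      # [58, 32, 169]
```

The one-byte shift is why the colon (58) shows up in the bad result: without UTF-8 encoding the line is one byte shorter.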

Thanks for reading,

Rotem

Oo, mIRC's inbuilt paste-flooding protection run amok!

Very thorough description and test scenarios!

What paste-flooding protection?

If you mean that warning window, it's unrelated...

Thanks, I was able to reproduce this issue. It is due to an internal limit on the length of the line that can be encoded. When pasting large amounts of text, the limit was being exceeded, causing mIRC to skip the encoding. This should be fixed in the next version. This issue appears to be related to (or the same as) the issue reported here.