#263162 11/06/18 01:19 AM
Joined: Jan 2004
Posts: 1,358
L
Hoopy frood
OP Offline
I've found that certain characters (surrogates) when passed through $regsubex do not combine as they should.

This should resolve the characters and place them next to each other to display a '10' inside a box. $replace works fine. $regsubex corrupts the string.
Code:
//echo -ag $json.unescape(\ud83d\udd1f)

//echo -ag $regsubex(aa,/(a)/gu,$chr($gettok(55357 56607,\n,32))) vs $replace(ab,a,$chr(55357),b,$chr(56607))


Code:
alias json.unescape {
  return $regsubex($1-,/\\(?:u(....)|(.))/gu,$escape.map(\t))
}
 
alias -l escape.map {
  if ($1 isalpha) return $chr(160)
  if ($1 !isalnum) return $1
  if ($base($1,16,10) > 32) return $chr($v1)
  return $chr(160)
}


Other examples, taken from https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt
Code:
0\uFE0F\u20E3 1\uFE0F\u20E3 2\uFE0F\u20E3 3\uFE0F\u20E3 4\uFE0F\u20E3 5\uFE0F\u20E3 6\uFE0F\u20E3 7\uFE0F\u20E3 8\uFE0F\u20E3 9\uFE0F\u20E3 \uD83D\uDD1F
\ud83c\uddfa\ud83c\uddf8\ud83c\uddf7\ud83c\uddfa\ud83c\uddf8 \ud83c\udde6\ud83c\uddeb\ud83c\udde6\ud83c\uddf2\ud83c\uddf8
\ud835\udce3\ud835\udcf1\ud835\udcee \ud835\udcfa\ud835\udcfe\ud835\udcf2\ud835\udcec\ud835\udcf4 \ud835\udceb\ud835\udcfb\ud835\udcf8\ud835\udd00\ud835\udcf7 \ud835\udcef\ud835\udcf8\ud835\udd01 \ud835\udcf3\ud835\udcfe\ud835\udcf6\ud835\udcf9\ud835\udcfc \ud835\udcf8\ud835\udcff\ud835\udcee\ud835\udcfb \ud835\udcfd\ud835\udcf1\ud835\udcee \ud835\udcf5\ud835\udcea\ud835\udd03\ud835\udd02 \ud835\udced\ud835\udcf8\ud835\udcf0


Last edited by Loki12583; 11/06/18 01:23 AM.
Joined: Feb 2003
Posts: 2,812
Hoopy frood
Offline
mIRC does not uniformly support SMP (Plane 1) and above Unicode characters. You're trying to output the symbol U+1F51F "KEYCAP TEN". mIRC wasn't built for handling non-Plane 0 characters.


Well. At least I won lunch.
Good philosophy, see good in bad, I like!
Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Offline
Assuming you are correct, how do you explain $replace (and any other areas of the scripting language where this applies) producing that character?


#mircscripting @ irc.swiftirc.net == the best mIRC help channel
Joined: Feb 2003
Posts: 2,812
Hoopy frood
Offline
Because it's not supported uniformly; it's all over the place.


Well. At least I won lunch.
Good philosophy, see good in bad, I like!
Joined: Apr 2004
Posts: 871
Sat Offline
Hoopy frood
Offline
Originally Posted By: Wims
Assuming you are correct, how do you explain $replace [..] producing [U+1F51F "KEYCAP TEN"]?

As $len on the output will show you: it does not. The resulting two surrogate halves just happen to render correctly as one character in the end, thanks to Windows rather than to mIRC. $regsubex just happens to do something special with such surrogate halves, which makes sense because the individual, isolated surrogate halves (as they are considered to be right now) are by definition not proper characters. As such, fixing $regsubex by no means makes mIRC's support for Unicode plane 1+ anywhere close to a reality. As Raccoon said, mIRC doesn't claim to support anything on that front either. So, right now, anything that does work is just a happy accident.

With that said: I for one am not at all against changing $regsubex's behavior in this regard. I just wouldn't consider it a bug.
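
To make the "two surrogate halves" point concrete, here is a minimal Python sketch of the standard UTF-16 arithmetic that turns the two values from the original post into one code point. The function name `combine_surrogates` is made up for illustration; nothing here reflects mIRC's internals.

```python
# Illustrative sketch only: standard UTF-16 surrogate arithmetic,
# not a description of how mIRC or Windows implement it.
HIGH, LOW = 55357, 56607  # 0xD83D and 0xDD1F, the values from the original post


def combine_surrogates(high: int, low: int) -> int:
    """Combine a high/low surrogate pair into a single Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(HIGH, LOW)
print(hex(code_point))  # 0x1f51f, i.e. U+1F51F KEYCAP TEN
```

The string itself still holds two 16-bit units; only the renderer treats them as one character.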


Saturn, QuakeNet staff
Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Offline
Right, the Unicode article by jaytea isn't so clear about it (and it's pinned, so that's kind of a claim by mIRC). There's nothing about this in the article: only /var and /echo are used to produce such characters, with no mention that this can't be used everywhere.

That being said, if it's only the rendering function that happens to show these chars (as opposed to $replace handling them differently), then they would also render with $regsubex, provided it replaced correctly just like $replace does.
Which makes it a bug imo: $regsubex should be fixed not because that would give mIRC support for Unicode plane 1+, but because it's at least inconsistent, if not just wrong, in the way it handles surrogates.


#mircscripting @ irc.swiftirc.net == the best mIRC help channel
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Offline
$regsubex() is far more complex than $replace(). Not only does it depend on PCRE processing, it also involves making multiple conversions back and forth between Unicode and UTF-8, and making internal calls to evaluate the identifier provided.

Quote:
$regsubex(aa,/(a)/gu,$chr($gettok(55357 56607,\n,32)))

Can you describe what you are expecting to happen in each part of this call?

Joined: Feb 2003
Posts: 2,812
Hoopy frood
Offline
I rather expected your answer (multiple conversions, UTF encoding, double-byte string type limitations).

I wonder if it's possible to identify an unpaired surrogate, wait for other surrogates, then process UTF on the sequence when it's completed. I think up to 8 byte characters are possible.
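
A rough Python sketch of that idea: hold on to a high surrogate until the next unit arrives, combine the two if they form a pair, and pass unpaired surrogates through unchanged. The helper name `merge_surrogates` is hypothetical and this is not how mIRC works internally. For reference, a single code point takes at most 4 bytes in either UTF-8 or UTF-16.

```python
# Illustrative sketch only: pair up surrogates in a stream of 16-bit units.
def merge_surrogates(units):
    """Convert 16-bit code units to code points, pairing surrogates where
    possible and passing unpaired surrogates through unchanged."""
    out = []
    pending_high = None  # a high surrogate waiting for its low half
    for u in units:
        if pending_high is not None:
            if 0xDC00 <= u <= 0xDFFF:
                # Completed pair: emit the combined code point.
                out.append(0x10000 + ((pending_high - 0xD800) << 10) + (u - 0xDC00))
                pending_high = None
                continue
            # No low half followed: emit the lone high surrogate as-is.
            out.append(pending_high)
            pending_high = None
        if 0xD800 <= u <= 0xDBFF:
            pending_high = u
        else:
            out.append(u)
    if pending_high is not None:
        out.append(pending_high)
    return out


print(merge_surrogates([97, 55357, 56607, 98]))  # [97, 128287, 98]
print(merge_surrogates([55357, 97]))             # [55357, 97]
```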


Well. At least I won lunch.
Good philosophy, see good in bad, I like!
Joined: Feb 2006
Posts: 546
J
Fjord artisan
Offline
this result isn't actually limited to $regsubex; it affects all functions in mIRC that implicitly decode UTF-8, eg:

Code:
//bset -t &a 1 $chr($base(D800, 16, 10)) | echo -a $len($bvar(&a, 1-).text)


= 3

the internal UTF-8 decoding function won't touch unpaired surrogates. would tweaking this be encroaching on violating the sanctity of unicode? clearly there is invalid UTF-8 being represented at some level, so perhaps having it decoded as well as possible isn't such a tall order?

btw, $regsubex() needs to encode (and later decode) the substitution parm in order to play nice with offset positions returned by PCRE (which only handles UTF-8 encoded strings). this seems necessary, and the observed bug is an unfortunate side effect.
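
As an illustration outside mIRC, Python's codecs behave in a comparable way: a lone surrogate can be forced into the generic 3-byte UTF-8 pattern, but a strict decoder refuses to turn those bytes back into a character. That is one plausible reading of the length of 3 reported above; it is an assumption, since the thread does not show mIRC's decoder.

```python
# Illustrative sketch only: how a strict UTF-8 codec treats a lone surrogate.
lone = "\ud800"  # unpaired high surrogate, same value as $base(D800, 16, 10)

# A strict encoder rejects lone surrogates outright.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The 3-byte pattern can still be produced on explicit request.
raw = lone.encode("utf-8", "surrogatepass")

# A strict decoder rejects those 3 bytes, so they cannot collapse back
# into one 16-bit unit.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(strict_encode_ok)  # False
print(list(raw))         # [237, 160, 128]
print(strict_decode_ok)  # False
```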


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Offline
We're expecting the first 'a' to be replaced by $chr(55357), the first surrogate, and the second 'a' to be replaced by $chr(56607), the second surrogate. With both returned from $regsubex, they would form the code point when rendered with /echo, just like with $replace.
Not sure how $replace works, but it looks like when $regsubex adds to the final string to be returned when substituting, it checks for surrogates, something $replace isn't doing (and I'm assuming most other functions will behave like $replace).
Which one is correct? Are they both correct?


#mircscripting @ irc.swiftirc.net == the best mIRC help channel
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Offline
They are both correct.

$replace() performs an in place exchange of characters.

$regsubex() performs a more complex processing of the text that involves conversion back and forth between UTF-8 and Unicode, which involves checks on the validity of the encoding.

A possible solution might be to change mIRC to use Unicode PCRE instead of ANSI PCRE which would mean that UTF-8 conversions would not be needed. However, it is not clear how this would affect existing scripts that pass UTF-8/Unicode in strings to $regsub/$regsubex/etc.

Joined: Feb 2003
Posts: 2,812
Hoopy frood
Offline
I would attempt to post some examples on here, but this forum destroys unicode characters.

I will note, however, that attempting this same trick with UTF-8 Plane 0 surrogate pairs behaves the same way, and requires an extra pass of $utfdecode() wrapped around the $regsubex().

We might be able to fix consistency by enabling $utfdecode() to support Plane 1,2,3...

I do use $regsubex() to dice up BYTES regardless of encoding, so that I can handle them as BYTES.


Well. At least I won lunch.
Good philosophy, see good in bad, I like!
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Offline
I have changed the regex routines to use Unicode PCRE calls. This will be in the next beta. I implemented this change as a number of #ifdefs to insert the corresponding Unicode calls/variables in various places, bypassing the need for UTF-8 conversions. So it can be reversed easily to the original ANSI PCRE.

This means that the OP's $regsubex() call now works as expected. The change also passes my current 100+ test calls to the regex identifiers, however it will need further testing to ensure backwards compatibility.

That said, this issue will actually be present throughout mIRC because back and forth UTF-8 conversions take place in many routines when switching between Unicode/ANSI. It just happens that it was not necessary with PCRE due to Unicode PCRE calls being available.

Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Offline
Note that you tried to use the utf16 lib of pcre recently and it was breaking scripts (namely, scripts which were using the (*UTF8) control verb, iirc), is this what you mean by Unicode PCRE calls?


#mircscripting @ irc.swiftirc.net == the best mIRC help channel
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Offline
We will have to see how it works out in the next beta.

Joined: Jan 2004
Posts: 1,358
L
Hoopy frood
OP Offline
Forgot this item in the beta notes

The following now works as expected in the original post:


Code:
//echo -ag $json.unescape(\ud83d\udd1f)

alias json.unescape {
  return $regsubex($1-,/\\(?:u(....)|(.))/Fig,$escape.map(\1,\2))
}

alias -l escape.map {
  if ($1) return $chr($base($1,16,10))
  if ($2 !isalnum) return $2
  return $chr(160)
}

Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Offline
And as expected the change is breaking backward compatibility, (*UTF8) is now not recognized by pcre. I believe this should be implemented with a $prop: $regsubex().utf16

$regsubex(é,/(.)/g,a) should be sending two bytes to pcre and since (*UTF8) is not used, two matches should happen and the output should be "aa". On the latest beta this is only "a"
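
The same difference can be reproduced outside mIRC. This Python sketch uses the `re` module as a stand-in for PCRE (an assumption made purely for illustration): matching over UTF-8 bytes sees two units, matching over characters sees one.

```python
# Illustrative sketch only: Python's re module standing in for PCRE.
import re

text = "\u00e9"                   # the accented e from the example
utf8_bytes = text.encode("utf-8")

print(list(utf8_bytes))           # [195, 169]

# Byte-oriented matching, like the 8-bit library without (*UTF8):
byte_mode = re.sub(rb"(.)", b"a", utf8_bytes)

# Character-oriented matching, like the 16-bit library:
char_mode = re.sub(r"(.)", "a", text)

print(byte_mode)                  # b'aa'
print(char_mode)                  # a
```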

Last edited by Wims; 19/06/18 01:22 PM.

#mircscripting @ irc.swiftirc.net == the best mIRC help channel
Joined: Dec 2008
Posts: 1,515
Hoopy frood
Offline
It is breaking backward compatibility; I agree with Wims that a property should be added for that purpose.

Code:
//echo -a $regsubex(é,/(.)/g,a)


Need Online mIRC help or an mIRC Scripting Freelancer? -> https://irc.chathub.org <-
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Offline
Quote:
$regsubex(é,/(.)/g,a) should be sending two bytes to pcre and since (*UTF8) is not used, two matches should happen and the output should be "aa". On the latest beta this is only "a"

Right, so basically there is no way to resolve this because scripters may have assumed that an accented e character will be seen as two characters in PCRE, i.e. as UTF-8, instead of one character, although I am not sure whether that is a reasonable assumption or how important it is. But for the sake of backwards compatibility I will be reversing this change in the next beta and noting the issue down in the code comments.

Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Offline
I do not agree.
On 6.x, mIRC is expected to use the 8-bit PCRE lib, so the byte 233 would be sent.
On 7.x, mIRC still uses (until this change) the 8-bit lib but with UTF-8, so the two bytes 195 169 are passed to PCRE.
This is what scripters rely on; there is no assumption as to how characters will be seen by PCRE.
Using the 16-bit lib is not a bad idea, as it avoids conversion (and therefore avoids losing lone surrogates?). Do you think adding a property to $regsubex is a bad idea?


#mircscripting @ irc.swiftirc.net == the best mIRC help channel
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Offline
Quote:
do you think adding a property to $regsubex is a bad idea?

Although I have done this before, I have never really liked using a property to change the behaviour of an identifier because it precludes the use/addition of other properties. It would be better to find an alternative method. In this case, custom regex modifiers have been added in the past, S and F, for other uses. So a better option would probably be to add a custom regex modifier.

Update: I have decided to defer support of this to a future version. Changing the current routines to support both ANSI and Unicode PCRE calls at the same time requires far more changes than supporting just one or the other. So the next beta will revert to ANSI PCRE for now.

Last edited by Khaled; 26/06/18 10:51 AM.
Joined: Feb 2003
Posts: 2,812
Hoopy frood
Offline
What if we just allow $chr/asc() and $utfxcode() identifiers to support characters beyond U+FFFF

That way, if you are doing something really goofy in $regsubex, that most people would never do, we'll have those proper tools to deal with it.


Well. At least I won lunch.
Good philosophy, see good in bad, I like!
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Offline
Quote:
What if we just allow $chr/asc() and $utfxcode() identifiers to support characters beyond U+FFFF

How would this be implemented? All of the Windows APIs support only 16bit characters. All of mIRC's features, storage methods, routines, commands, identifiers, etc. use 16bit characters and process them that way.

Joined: Feb 2003
Posts: 2,812
Hoopy frood
Offline
I tried doing a bit of research on this, but it's not a strong area of mine. I know there's a solution, since AutoHotkey_L appears to handle supplemental plane and surrogates just fine, and even specifically with PCRE. There's a lot of WideCharToMultiByte / MultiByteToWideChar function use.

https://github.com/Lexikos/AutoHotkey_L/search?q=WideCharToMultiByte

https://github.com/Lexikos/AutoHotkey_L/search?q=surrogate

https://github.com/Lexikos/AutoHotkey_L/search?q=0x10FFFF

...might be able to give you some ideas, anyway. AHK does a lot of what mSL attempts to do.

Example: https://github.com/Lexikos/AutoHotkey_L/..._ord2utf8.c#L41
/* This file contains a private PCRE function that converts an ordinal character value into a UTF8 string. */


Well. At least I won lunch.
Good philosophy, see good in bad, I like!
Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Offline
It's weird that you say that after claiming/knowing mirc doesn't support it.

AutoHotkey is not relevant. There is no denying that an application can correctly provide a way for users to use/generate these chars and provide, let's say, a string library, like mIRC does for the BMP. The issue is that, as mentioned, all functions in mIRC deal with UTF-16. Suppose $chr() is extended and nothing else is done: given that the character in this thread is code point 128287, $len($chr(128287)) would, just like it does right now when combining two surrogates, return the number of elements in the array of 16-bit units, i.e. 2 for the two surrogates, which is problematic.


The solution would be to use 32 bits, but there, Khaled already decided against it when converting mIRC to be unicode compatible by choosing the 16 bits design.
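
The gap between the two designs can be shown with a small Python sketch. Python strings are indexed by code point, which is roughly the "32 bits" view; the helper `utf16_len` is a made-up name that counts 16-bit units the way a UTF-16 array would.

```python
# Illustrative sketch only: code-point length versus UTF-16 unit length.
def utf16_len(s: str) -> int:
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


keycap_ten = chr(128287)      # U+1F51F, the character from this thread

print(len(keycap_ten))        # 1  (counted in code points)
print(utf16_len(keycap_ten))  # 2  (counted in 16-bit units)
print(utf16_len("abc"))       # 3  (BMP characters: one unit each)
```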

I'm also sad that we can't deal with characters from other planes in scripts.


#mircscripting @ irc.swiftirc.net == the best mIRC help channel
Joined: Feb 2003
Posts: 2,812
Hoopy frood
Offline
It's funny how Ouims says AutoHotkey is not relevant, when it's exactly the opposite of that; it's super relevant. lolz.

Yeah, let's ignore the fact that they're both written in the same language, compiled by the same tools, do the same things, interpret the same scripts, interact with the same functions and libraries. But, no, not relevant. *laughing-emoji*


Well. At least I won lunch.
Good philosophy, see good in bad, I like!
Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Offline
Language and compiler have nothing to do with it. mIRC internally has an array of 16-bit units, so we, in our scripts, can only deal with code points up to 65535. We have seen that this doesn't prevent the 16-bit APIs used by mIRC (and likely by AutoHotkey, if you think it's relevant) from handling surrogates themselves, but that is different from exposing the characters in a script: AutoHotkey probably has a 32-bit array, allowing you to 'control' 32 bits in your script, and when AutoHotkey is about to use the same 16-bit API/function as mIRC, it then just converts to UTF-16.


#mircscripting @ irc.swiftirc.net == the best mIRC help channel