#242694 21/08/13 05:56 AM
Joined: Feb 2003
Posts: 2,737
Raccoon Offline OP
Hoopy frood
I just found out, after much hassle, that mIRC's regular expression support (e.g. $regex) does not treat input as UTF-8 by default. This leads to scripts going awry [potentially dangerously so] whenever user input (e.g. chat text) is processed by a regular expression pattern that does not include the (*UTF8) sequence.
Code:
$regex($1-,/(.)/g)
versus
$regex($1-,/(*UTF8)(.)/g)

Without the (*UTF8) sequence, any multi-byte UTF-8 character is treated as two or more separate characters when processed by $regex, often leading to garbled output.
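
For readers who want to see the effect outside mIRC, here is a minimal sketch in Python (not mIRC script, and not mIRC's actual implementation). It assumes that a pattern without (*UTF8) is matched against the raw UTF-8 bytes, which is what the behaviour described above suggests; the variable names are purely illustrative.

```python
import re

# "é" is one character but two bytes in UTF-8 (0xC3 0xA9).
text = "é"

# Byte-oriented matching: an approximation of a pattern WITHOUT (*UTF8).
# The pattern (.) matches each byte separately.
byte_matches = re.findall(rb"(.)", text.encode("utf-8"))

# Code-point matching: an approximation of the same pattern WITH (*UTF8).
# The pattern (.) matches the whole character once.
char_matches = re.findall(r"(.)", text)

print(len(byte_matches))  # 2
print(len(char_matches))  # 1
```

The two byte-level captures are not valid characters on their own, which is why output built from them tends to come out garbled.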

If (*UTF8) cannot be included by default, I would at least ask that the flag /8 be created to simplify patterns.
Code:
$regex($1-,/(.)/g8)


- Raccoon


Well. At least I won lunch.
Good philosophy, see good in bad, I like!
Joined: Oct 2003
Posts: 3,641
A
Hoopy frood
I believe it's opt-in for backwards compatibility. As for a //8 switch, I believe (*UTF8) makes for a much better documentation story: (*UTF8) can be Googled, whereas //8 can't, and mIRC doesn't actually document any of the regex syntax, so there's that.

Also, having a 100% PCRE-compatible implementation is much more portable and makes it easier for users to learn and share expressions. Messing with the syntax sets a bad precedent: namely, it's no longer PCRE.

Joined: Feb 2003
Posts: 2,737
Raccoon Offline OP
Hoopy frood
If you want to get technical, //g is not PCRE. Nor is //S. What's worse, //S only works for RegEx patterns in some places in mIRC, but not others. I'd ask that //S be extended to all RegEx pattern usage, as well as //8.


Joined: Jul 2006
Posts: 4,022
W
Hoopy frood
You ignored or missed his points, which are that regex calls do not force UTF-8 decoding for backward compatibility (a script working on 6.35 without (*UTF8) should still work the same on 7.x) and that the (*UTF8) syntax is documented.
The extra 'S' modifier was added in combination with the '$' event prefix because you couldn't efficiently remove control codes from the event text when matching with a regex, whereas you can always use $strip with $regex/$regsubex. And 'S' is documented correctly under /help event prefixes.
Only the 'g' modifier is not documented (not even in versions.txt, to my surprise) and only that is a problem.
Now you asked about that '8' modifier saying it would simplify the pattern, but frankly, would it?

$regex(...,/(*UTF8)mypattern/)
vs
$regex(...,/mypattern/8)

That is not really simplifying anything to me, just making it a bit shorter; not worth the addition, in my opinion.

Note: I don't understand your last post, but this is definitely not a bug report, what are you reporting?


#mircscripting @ irc.swiftirc.net == the best mIRC help channel
Joined: Feb 2003
Posts: 2,737
Raccoon Offline OP
Hoopy frood
FYI, you can't use $strip in /filter or $hfind and a number of other locations that RegEx is used.

Yes, I'm aware that PCRE is ASCII and not UTF-8 by default. mIRC on the other hand is not. Nor is any of this documented.

As an mIRC user, and conceivably a rather decent coder, it took a hell of a lot of code breaking and bother to narrow down the problem, and the only people who had any idea why $regex wasn't working correctly were some kids in a Runescape channel who use mIRC. This is why I'm reporting it as a bug.

I'm sorry about your reading comprehension. Let me explain. I'm reporting that mIRC's implementation of PCRE is "bugged" because it does not conform to mIRC's universal UTF-8 standard. PCRE has a compile option to make it UTF-8 by default, which is the recommendation of this bug report.

The suggestion of a //8 flag is just a suggestion, not a bug.

The implementation of the //S flag is also bugged, as it is not consistently supported in places that it's needed. I'll refrain from drafting a new bug report on this matter.


Joined: Jul 2006
Posts: 4,022
W
Hoopy frood
Quote:
Yes, I'm aware that PCRE is ASCII and not UTF-8 by default. mIRC on the other hand is not. Nor is any of this documented.
mIRC is documented as being unicode only, using utf8, you can read up about it there
Quote:
As an mIRC user, and conceivably a rather decent coder, it took a hell of a lot of code breaking and bother to narrow down the problem, and the only people who had any idea why $regex wasn't working correctly were some kids in a Runescape channel who use mIRC. This is why I'm reporting it as a bug.
Next time, ask on swiftirc :)
Quote:
I'm sorry about your reading comprehension. Let me explain. I'm reporting that mIRC's implementation of PCRE is "bugged" because it does not conform to mIRC's universal UTF-8 standard. PCRE has a compile option to make it UTF-8 by default, which is the recommendation of this bug report.
Ok, I just agree with what has been said, it's not a good idea imo.

Note that adding '8' as a modifier becomes useless if regexes decode utf8 by default on newer versions of mIRC, unless it's a switch not to decode utf8 for this regex.

/filter has the -b switch to strip control codes but you are right, $hfind should support the 'S' modifier.


Joined: Feb 2003
Posts: 2,737
Raccoon Offline OP
Hoopy frood
Originally Posted By: Wims
Originally Posted By: Raccoon
Yes, I'm aware that PCRE is ASCII and not UTF-8 by default. mIRC on the other hand is not. Nor is any of this documented.
mIRC is documented as being unicode only, using utf8, you can read up about it there


And because mIRC is Unicode Only, PCRE needs to be as well (unless the user manually flags the expression as Non-UTF8).

