Regular expression bugs!!! #30454 17/06/03 08:26 PM
Yil Offline OP
Pikka bird
Joined: Jun 2003
Posts: 12
Here's something cut and pasted from a script I was working on...

RANT: Did I mention that mirc can't [bleep] parse regular expressions correctly!?!?!?!?!

Try - Result
\) - fails to match )'s...
(\)) - fails
[)] - fails
([)]) - fails as well
(\(\)) - works to find ()'s
(\() - fails to even be valid code at runtime!
(\)\() - works to find )('s

In fact any attempt to search for an unequal number of open and close parens either fails to compile(!) or fails to find anything!!!!!!! GRRRRR.... took a while to diagnose this one as it was a HUGE regex when I started... I'm guessing it's a parsing error in the actual script compiler/interpreter and not a regular expression error.



This is using 6.03. I'd also like to point out that while documentation of regular expressions is missing it WOULD have been nice to point to which model was being used as not all of them have the same features... Compare emacs to perl for example. A list of valid identifiers and post operators would eliminate a lot of trial and error...


OH... and while I'm at it...

mIRC in my opinion mishandles matched strings enclosed in parentheses for reference with the $regml function! I'm sure this is a side effect of the way mIRC handles arguments and again not a regex issue... but anyway it's wrong... here's why.

Go to Perl and do something like this:
($a, $b, $c) = "teststring" =~ /(test)st(d)?(ring)/;
printf("%s - %s - %s\n", $a, $b, $c);
Returns: test - - ring

Now do something similar in mIRC:
$regex(TEST, teststring, /(test)st(d)?(ring)/ )
/echo $regml(TEST, 1) - $regml(TEST, 2) - $regml(TEST, 3)
Returns: test - ring -

Thus any complex regular expression in which one wishes to test what fields match is almost USELESS because the matched positional information is completely lost in mirc... You would end up having to retest submatches to figure out what the heck really matched. Perl is far far more powerful...

-Yil

Re: Regular expression bugs!!! #30455 17/06/03 09:52 PM
qwerty Offline
Hoopy frood
Joined: Jan 2003
Posts: 2,523
It would be wiser if you were a bit more subtle, since all of your accusations are actually your fault.

1) The bracket-related problems you have probably occur because you use the literal characters ( and ) as parameters to $regex. mirc doesn't accept that though: the way mirc works makes it impossible to use the actual characters in the script. They need to be "escaped" in a way; either by setting a local %variable or using $chr(40) and $chr(41). Here's what I mean:
Code:
//var %a = ), %b = /[)]/ | echo -s $regex(%a,%b)
This works fine. However, this:
Code:
//echo -s $regex(),/[)]/)
doesn't, because mirc treats the first ) char as the closing bracket of $regex().
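A minimal sketch of the effect described above, written in Python rather than mIRC script. It assumes a naive scanner that finds the end of an identifier call purely by counting ( and ) characters; `split_identifier_call` is a hypothetical helper for illustration, not mIRC's actual parser, whose internals are not shown in this thread.

```python
def split_identifier_call(text, name="$regex"):
    """Return the raw argument text that a bracket-counting scanner would
    assign to name(...), or None if the call is never closed.

    Assumption: the scanner knows nothing about regex escapes, so a
    backslash in front of a bracket makes no difference to the count.
    """
    start = text.find(name + "(")
    if start == -1:
        return None
    first = start + len(name) + 1  # index just past the opening bracket
    depth = 1
    for pos in range(first, len(text)):
        if text[pos] == "(":
            depth += 1
        elif text[pos] == ")":
            depth -= 1
            if depth == 0:
                return text[first:pos]
    return None  # brackets never balanced


# Balanced brackets inside the pattern survive intact.
print(split_identifier_call(r"$regex(%a,/(\(\))/)"))
# An extra closing bracket ends the call early and truncates the pattern.
print(split_identifier_call(r"$regex(%a, /\)/)"))
# An extra opening bracket means the call is never closed.
print(split_identifier_call(r"$regex(%a,/(\()/)"))
```

Under this assumption the three calls behave like the "works", "fails" and "fails to even be valid code" rows reported in the first post, which is why moving the bracket into a %variable or $chr(41) sidesteps the problem.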



2) >> I'd also like to point out that while documentation of regular expressions is missing it WOULD have been nice to point to which model was being used as not all of them have the same features

/help acknowledgements
It mentions there that the regex library used is PCRE, and it's a matter of seconds to find the appropriate URL on google: http://www.pcre.org/man.txt



3) Regarding the $regml() problem, it's just the way PCRE works (and, imo, it's the correct way).

$regex(TEST, teststring, /(test)st(d)?(ring)/ )
//echo $regml(TEST, 1) - $regml(TEST, 2) - $regml(TEST, 3)

Since the "?" is outside the capturing subpattern, I would expect the results mirc gives. If you don't want to lose the positions the way you described you should simply change your pattern to this:
$regex(TEST, teststring, /(test)st(d?)(ring)/ )

I find it odd that Perl works the way you described, but in any case, the PCRE way feels both more powerful and more correct.
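For comparison, here is a small sketch of the two patterns in Python. Python's `re` module is neither PCRE nor mIRC, so this only illustrates the difference between putting the "?" outside and inside the capturing brackets in an engine that keeps group positions; it says nothing about what $regml() itself returns.

```python
import re

subject = "teststring"

# Quantifier outside the capturing group: group 2 takes no part in the match.
outside = re.match(r"(test)st(d)?(ring)", subject)

# Quantifier inside the capturing group: group 2 takes part and matches "".
inside = re.match(r"(test)st(d?)(ring)", subject)

print(outside.groups())  # ('test', None, 'ring')
print(inside.groups())   # ('test', '', 'ring')
```

In both cases "ring" stays in position 3; the only difference is whether position 2 is reported as unset (None) or as an empty string.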


/.timerQ 1 0 echo /.timerQ 1 0 $timer(Q).com
Re: Regular expression bugs!!! #30456 18/06/03 12:23 AM
Yil Offline OP
Pikka bird
Joined: Jun 2003
Posts: 12
>> It would be wiser if you were a bit more subtle, since all of your accusations are actually your fault.

In actuality it's you who is mistaken...

>> The bracket-related problems you have probably occur because you use the literal characters ( and ) as parameters to $regex.

And how exactly would that be my fault when regular expressions CLEARLY state how to escape the special ( and ) characters with a backslash inside regular expressions and they are usable there as a grouping operator already?!? As I said, it's a parsing error, but it's still a bug in mirc. The fact that it's possible to work around it doesn't diminish this fact.

If you would try an example without assuming I did something idiotic like placed the literals ( and ) OUTSIDE of a regex like you did...

//var %a = ) | echo -s $regex(%a, /\)/)

This fails, and in fact any combination I gave above fails and ALL are valid regular expression syntaxes...


The location of the actual library was a help though :)

Lifted directly from the library documentation:

Quote:
It is possible for an capturing subpattern number n+1 to
match some part of the subject when subpattern n has not
been used at all. For example, if the string "abc" is
matched against the pattern (a|(z))(bc) subpatterns 1 and 3
are matched, but 2 is not. When this happens, both offset
values corresponding to the unused subpattern are set to -1.

...
2. It sets up the subpattern as a capturing subpattern (as
defined above). When the whole pattern matches, that portion
of the subject string that matched the subpattern is
passed back to the caller via the ovector argument of
pcre_exec(). Opening parentheses are counted from left to
right (starting from 1) to obtain the numbers of the capturing
subpatterns.





Thus I stand by my BUG comment. This ALSO means that if you were to use the \DIGIT syntax (supported by the library) to refer to captured subpatterns you would immediately see the problem between the DEFINED positional syntax and the returned value of $regml() as I did...

-Yil

Re: Regular expression bugs!!! #30457 18/06/03 12:58 AM
qwerty Offline
Hoopy frood
Joined: Jan 2003
Posts: 2,523
*sigh* It seems that you didn't understand anything of what I tried to explain...

>> And how exactly would that be my fault when regular expressions CLEARLY state how to escape the special ( and ) characters with a backslash inside regular expressions and they are usable there as a grouping operator already?!? As I said, it's a parsing error, but it's still a bug in mirc.

You don't understand how mirc works. PCRE clearly states how to escape the special characters but that has nothing to do with the script parsing; it occurs on a totally different level. In a call to $regex(), mirc parses the parameters passed to it, evaluates their content (so any identifiers/vars are evaluated) and AFTER it has done that, it passes the regex string to the appropriate PCRE function. The problem with the ()'s occurs on the parsing level, not the PCRE level. If you still don't get it, I'm sorry, I really don't feel like writing a couple of pages to explain to you how mirc parses a script. The fact that YOU don't understand how mirc works doesn't mean it's a bug though.


>> If you would try an example without assuming I did something idiotic like placed the literals ( and ) OUTSIDE of a regex like you did...
>> //var %a = ) | echo -s $regex(%a, /\)/)
>> This fails, and in fact any combination I gave above fails and ALL are valid regular expression syntaxes...


Don't you see you just made the same mistake? Let me point out the parts to make it clearer:
//var %a = ) | echo -s $regex(%a, /\)/)
The opening bracket of $regex is the ( right after $regex and the closing bracket is the first ) that follows it, i.e. the one after the backslash. What mirc considers the regex pattern is the part between the comma and that bracket, /\ , which of course doesn't match the input. This is my last attempt to explain this, so I hope you got it.


>> This ALSO means that if you were to use the \DIGIT syntax (supported by the library) to refer to captured subpatterns you would immediately see the problem between the DEFINED positional syntax and the returned value of $regml() as I did...

There is NO such problem. Try these two commands:
Code:
//var %a | !.echo -q $regsub(teststring,/(test)st(d)?(ring)/,\2,%a) | echo -s $!regml(1) = $regml(test,1) , $!regml(2) = $regml(2) , $!regml(3) = $regml(3) , % $+ a = %a
//var %a | !.echo -q $regsub(teststring,/(test)st(d?)(ring)/,\2,%a) | echo -s $!regml(1) = $regml(test,1) , $!regml(2) = $regml(2) , $!regml(3) = $regml(3) , % $+ a = %a
Notice how $regml(2) and %a (which is filled with \2) are always the same.


Re: Regular expression bugs!!! #30458 18/06/03 04:23 AM
Yil Offline OP
Pikka bird
Joined: Jun 2003
Posts: 12
Gee... I'm pretty sure we are talking but not communicating. Hehe.. Try #3...

It's obvious (and I stated this originally) that the problem is in the way mIRC parses its input. I think you and I are agreeing on this. HOWEVER, what you appear to be missing is that the /expression/ syntax doesn't adhere to normal parsing rules already. Does the expression /%temp/ evaluate the %temp variable or match the string "%temp"? Does the [ ] define character sets or force evaluation? Does $+ concatenate? If all of these don't have their script meaning then why should I expect ()'s to be handled differently until the terminating / is encountered? It's a bug... deal with it.


And finally, you only see a simple answer to a trivial example that illustrates the problem. It's that forest and tree problem... Consider your proposed solution:

$regex(teststring,/(test)st(d?)(ring)/)

I think that actually worked (which surprised me) but you missed the point because it's NOT extendable to ANYTHING more complex than a single character... Try these...

//var %a teststring | echo -a $regex(TEST, %a, /(test)st(dd)?(ring)/) | echo a $regml(TEST, 2)
//var %a teststddring | echo -a $regex(TEST, %a, /(test)st(dd)?(ring)/) | echo a $regml(TEST, 2)

Notice how the fields change now?

Or for a PERFECT example... try this:
//var %a teststringring | echo %a $regex(TEST, %a, /(test)st(dd)?(ring)\3/) | echo a $regml(TEST, 3)

You'll note that \3 (which is the 3rd left-paren subexpression) correctly matches "ring" and thus it returns a successful match against its input, but attempts to access the 3rd match field return nothing. Examination of $regml(TEST, 2) shows it contains "ring" but this is incorrect. And thus I once again state that $regml returns the incorrect answer but that the regular expression library knows what it's doing...

What more do I need to say?

-Yil

Re: Regular expression bugs!!! #30459 18/06/03 10:26 AM
qwerty Offline
Hoopy frood
Joined: Jan 2003
Posts: 2,523
>> HOWEVER, what you appear to be missing is that the /expression/ syntax doesn't adhere to normal parsing rules already. Does the expression /%temp/ evaluate the %temp variable or match the string "%temp"? Does the [ ] define character sets or force evaluation? Does $+ concatenate? If all of these don't have their script meaning then why should I expect ()'s to be handled differently until the terminating / is encountered? It's a bug... deal with it.

Umm, what on earth are you talking about? Anything inside the pattern is evaluated and THEN passed to PCRE. $regex() parameters are treated in the same way as with every other identifier. Whatever rules define the evaluation of variables/identifiers present in parameters generally apply to $regex too.


>> I think that actually worked (which surprised me) but you missed the point because it's NOT extendable to ANYTHING more complex than a single character... Try these...

//var %a teststring | echo -a $regex(TEST, %a, /(test)st(dd)?(ring)/) | echo a $regml(TEST, 2)
//var %a teststddring | echo -a $regex(TEST, %a, /(test)st(dd)?(ring)/) | echo a $regml(TEST, 2)


Your point? You did the same thing as in your first post: the "?" is outside the capturing subpattern. This is NOT what I proposed as a solution. This is:
//var %a teststring | echo -a $regex(TEST, %a, /(test)st((?:dd)?)(ring)/) | echo a $regml(TEST, 2)
//var %a teststddring | echo -a $regex(TEST, %a, /(test)st((?:dd)?)(ring)/) | echo a $regml(TEST, 2)

The 'secret' is to use the quantifier INSIDE the capturing subpattern: for complex subpatterns, just enclose them in a non-capturing bracket pair (the (?:) element) and apply the quantifier to it.


>> Or for a PERFECT example... try this:
>> //var %a teststringring | echo %a $regex(TEST, %a, /(test)st(dd)?(ring)\3/) | echo a $regml(TEST, 3)
>> You'll note that \3 (which is the 3rd left-paren subexpression) correctly matches "ring" and thus it returns a successful match against its input, but attempts to access the 3rd match field return nothing.


You're correct on this one actually, there is indeed a difference between \3 and $regml(3). Apparently, \3 refers to the 3rd subpattern used, even if a previous subpattern matched zero times (because of a "?" quantifier). However, I think this difference between \3 and $regml(3) is intentional: mirc doesn't fill $regml(N) in some tricky way, it just receives pointers from PCRE. If you think this is a bug (I don't), blame PCRE.
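The difference can be sketched in Python, again only as an analogy: Python's `re` numbers backreferences by opening bracket, like the PCRE text quoted earlier in the thread. The `compacted` list is a hypothetical model of a match list that drops unset groups, chosen because it reproduces the output reported above; it is an assumption, not a description of mIRC's internals.

```python
import re

# \3 refers to the third opening bracket, whether or not group 2 took part.
match = re.match(r"(test)st(dd)?(ring)\3", "teststringring")

# Positional view: one slot per capturing group, None where a group is unset.
positional = list(match.groups())

# Hypothetical compacted view: unset groups are dropped, later ones shift up.
compacted = [text for text in positional if text is not None]

print(positional)  # ['test', None, 'ring']
print(compacted)   # ['test', 'ring']
```

In the positional view "ring" is entry 3, matching what \3 refers to; in the compacted view it has moved to entry 2 and there is no entry 3 at all.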


Re: Regular expression bugs!!! #30460 18/06/03 09:41 PM
Yil Offline OP
Pikka bird
Joined: Jun 2003
Posts: 12
Hmm... Well, first off you've convinced me that mIRC is parsing the regular expressions more like normal scripts than I had originally thought. It turns out that examining my code I don't use spaces anywhere in my regular expressions but instead use the whitespace identifiers and thus all my %'s, []'s, etc aren't evaluated because they aren't space separated so I never saw any of them do anything... The fact that ()'s don't need spaces around them is why I only saw them as a problem. I also foolishly thought /%temp/ and / %temp / would evaluate the variable if it were going to work when I quickly tested this before posting. It turns out the second was evaluated but kept the spaces and thus failed to match! Doh...

Since mIRC doesn't seem to be special-casing the / / syntax as I had originally supposed this might be categorized as an "undocumented feature" and not a bug. In this case it should be noted in the helpfile that the /expression/ is still evaluated instead of treated as a literal string as in everything else I've ever used like perl, emacs, awk, etc...

My solution will be to never inline anything ever again in the $regex call... I'll set a variable to the expression I want and then use that because at least that's consistent... Not sure what I'll do about postfix operators yet but I guess I'll figure it out with some simple $+'s hehe...

As far as the rest goes... It's NOT valid to propose I solve my subexpression issue by using a NON-GROUPING operator like (?:). I admire your persistence in changing the meaning of my regular expression time and time again to something that sidesteps the $regml bug but you are completely ignoring the fact that I wish to test whether a subexpression matched by examining $regml(). Your solution expressly forbids this! Perhaps I haven't been clear on this because I certainly know how to use non-grouping operators...

Quote:
However, I think this difference between \3 and $regml(3) is intentional: mirc doesn't fill $regml(N) in some tricky way, it just receives pointers from PCRE. If you think this is a bug (I don't), blame PCRE.


I can't blame PCRE because if you scroll back you will see that I cut and pasted the actual documentation from the PCRE man page where it indicates it returns empty results as placeholders for unmatched subexpressions because not all of them will match. mIRC is handling the returned result incorrectly by not preserving them. Period. The way it currently works even the use of a single operator like ? makes $regml absolutely meaningless for any later subexpression testing. It's a bug...

-Yil

Re: Regular expression bugs!!! #30461 19/06/03 11:03 AM
qwerty Offline
Hoopy frood
Joined: Jan 2003
Posts: 2,523
>> but you are completely ignoring the fact that I wish to test whether a subexpression matched by examining $regml(). Your solution expressly forbids this!

Again, I don't think you understood me, since my solution permits exactly this... Here's another example:
//echo -s $regex(ad,/(a)(bc)?(d)/) : $regml(0) - $regml(1) - $regml(2) - $regml(3)
You claim that you can't know if (bc) matched because $regml(0) returns 2 and $regml(2) returns "d", right?
My solution to this problem was this:
//echo -s $regex(ad,/(a)((?:bc)?)(d)/) : $regml(0) - $regml(1) - $regml(2) - $regml(3)
Take a good look at both lines. Notice any difference? The second way, $regml(0) returns 3, not 2, and $regml(2) is empty. Isn't that exactly what you wanted?


Regarding the last problem, I don't see how the paragraphs you pasted support your opinion but here's what I found in man.txt:
Quote:
When any of these functions encounter a substring that is
unset, which can happen when capturing subpattern number n+1
matches some part of the subject, but subpattern n has not
been used at all, they return an empty string. This can be
distinguished from a genuine zero-length substring by
inspecting the appropriate offset in ovector, which is negative
for unset substrings.

So, maybe mirc does something like this, i.e. deliberately fill $regml(N) only when the offset is NOT negative (a negative offset indicates an unset substring). If a non-negative offset is found, $regml(N) is filled, even if the matched string is zero-length. Anyway, we don't actually know what mirc does internally, but the whole thing certainly does not look like there's a bug. Of course, that's just me.
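The unset versus zero-length distinction from the quoted passage can be tried out in Python, whose `Match.span()` reports (-1, -1) for a group that took no part in the match. This is an analogy to the ovector offsets described in man.txt, not a call into PCRE, and `describe_groups` is a hypothetical helper written for this illustration.

```python
import re


def describe_groups(pattern, subject):
    """Classify every capturing group as 'unset', 'empty' or 'matched'."""
    match = re.match(pattern, subject)
    if match is None:
        return []
    report = []
    for number in range(1, match.re.groups + 1):
        start, end = match.span(number)
        if start == -1:      # negative offset: the group was not used at all
            state = "unset"
        elif start == end:   # valid offsets, zero-length capture
            state = "empty"
        else:
            state = "matched"
        report.append((number, state, match.group(number)))
    return report


# "?" outside the capturing brackets: group 2 is unset.
print(describe_groups(r"(a)(bc)?(d)", "ad"))
# "?" inside the capturing brackets: group 2 is set but empty.
print(describe_groups(r"(a)((?:bc)?)(d)", "ad"))
```

A match list built only from the groups that are not "unset" would have two entries for the first pattern and three for the second, which is consistent with the $regml(0) values of 2 and 3 reported above.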

