
Joined: Feb 2006
Posts: 546
J
jaytea Offline OP
Fjord artisan
there is a small 'off by 1' bug in mIRC's current implementation of //g that presents itself when \K is used at the end of certain patterns:

Code:
//echo -a $regsubex(abcd, /.\K/g, <>)


= a<>bc<>d

since \K effectively 'dumps' the portion of the subject consumed up until the point it's used, preventing it from being substituted, this should insert '<>' after every character.. but misses both 'b' and 'd'.

my guess is that, when //g is used, mIRC places successive calls to the pcre_exec() function with a startoffset value that is determined solely by looking at the return values associated with the last call to the function (namely, the start and end offsets ovector[0] and ovector[1] respectively).

if ovector[0] == ovector[1], implying that an empty substring was consumed, mIRC advances the start offset to 1 more than it normally would (ovector[1]) so as to avoid potentially endless calling of the function. however, it should only add 1 if ovector[1] == the original startoffset value that was passed to pcre_exec(). so the new offset should be 'ovector[1] + (oldoffset == ovector[1])' rather than 'ovector[1] + (ovector[0] == ovector[1])', so to speak :P
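
To make the difference between the two rules concrete, here is a minimal simulation in Python (not mIRC's actual code, which isn't public). `exec_dot_K` is a hypothetical hand-written stand-in for calling pcre_exec() with /.\K/, since Python's own re module has no \K, and `substitute_all` is only an assumed shape for the /g loop:

```python
def exec_dot_K(subject, start):
    # Hypothetical stand-in for pcre_exec() with /.\K/: consume one character
    # at `start`, then \K resets the match start, so the reported match is the
    # empty span just after that character. Returns (ovector[0], ovector[1]).
    # Assumes the subject has no newlines, so '.' matches every character.
    if start < len(subject):
        return (start + 1, start + 1)
    return None


def substitute_all(subject, replacement, next_offset):
    # Assumed shape of a /g substitution loop; `next_offset` is the rule under test.
    out, copied, offset = [], 0, 0
    while offset <= len(subject):
        m = exec_dot_K(subject, offset)
        if m is None:
            break
        s, e = m
        out.append(subject[copied:s] + replacement)
        copied = e
        offset = next_offset(offset, s, e)
    return "".join(out) + subject[copied:]


def current_rule(old, s, e):
    return e + (s == e)      # skip a character after every empty match


def proposed_rule(old, s, e):
    return e + (old == e)    # skip only if the match made no progress


print(substitute_all("abcd", "<>", current_rule))   # a<>bc<>d
print(substitute_all("abcd", "<>", proposed_rule))  # a<>b<>c<>d<>
```

The first line reproduces the output reported above; the second is the result the pattern is expected to give.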


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Thanks, this has been fixed for the next version.

Joined: Feb 2006
Posts: 546
J
jaytea Offline OP
Fjord artisan
there is an off by 1 error that appears to have come about through implementing that fix

Code:
//echo -a $regex(a a, /(?=a)./g) $regex(a a, /(?=a)/g)


returns 2 and 3, should be 2 and 2.
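
These counts can be reproduced with a small Python simulation, assuming (the thread does not confirm it) that the fix was implemented exactly as the 'ovector[1] + (oldoffset == ovector[1])' rule from the first post. Python's re module stands in for PCRE here, which is safe for these two lookahead patterns; `count_matches` and the two rule functions are hypothetical names:

```python
import re


def count_matches(pattern, subject, next_offset):
    # Assumed shape of a /g counting loop; `next_offset` is the rule under test.
    regex, offset, count = re.compile(pattern), 0, 0
    while offset <= len(subject):
        m = regex.search(subject, offset)
        if m is None:
            break
        count += 1
        offset = next_offset(offset, m.start(), m.end())
    return count


def original_rule(old, s, e):
    return e + (s == e)      # rule before the fix


def proposed_rule(old, s, e):
    return e + (old == e)    # rule suggested in the first post


for pattern in (r"(?=a).", r"(?=a)"):
    print(pattern,
          count_matches(pattern, "a a", original_rule),
          count_matches(pattern, "a a", proposed_rule))
# (?=a). 2 2
# (?=a) 2 3
```

With the proposed rule, /(?=a)/g finds the empty match at position 2 twice: once when searching from offset 1 and again from offset 2, because the first of those calls did make progress relative to its start offset.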


Joined: Dec 2002
Posts: 5,411
Hoopy frood
Thanks, so far I have only been able to resolve this issue by reverting the previous change. I have taken a look at several other PCRE implementations and unfortunately they vary quite a bit in the way they call PCRE, so it is difficult to know whether they can handle the above expressions correctly. I have also tried testing your expressions on several online regex testers (such as this one) and their results vary. Can you find an online regex tester that reproduces the desired behaviour? Are you sure that this should be the correct behaviour?

Joined: Feb 2006
Posts: 546
J
jaytea Offline OP
Fjord artisan
i'm fairly certain about the behaviour being correct, despite the fact that even the current version of PHP seems to get it wrong! what i shouldn't have been sure about, and i apologize if it misled you, was the solution i proposed at the end of my original post. in truth, the next position mIRC should try cannot be determined solely by the original start offset and ovector[0]/[1].

for example, given the string "aa" and an original offset value of 0:
  • the expression /a\K/g results in {1,1} returned in the ovector, but the correct move is to re-try the pattern at position 1.
  • the expression /(?<=a)/g also results in {1,1} returned, but the correct move in this case is to move to position 2, since it will match the same 'a' again if kept at position 1.

man.txt has some advice regarding this issue:

Originally Posted By: "pcre.org/man.txt"

Finding all the matches in a subject is tricky when the pattern can
match an empty string. It is possible to emulate Perl's /g behaviour by
first trying the match again at the same offset, with the
PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
fails, advancing the starting offset and trying an ordinary match
again. There is some code that demonstrates how to do this in the
pcredemo sample program. In the most general case, you have to check to see
if the newline convention recognizes CRLF as a newline, and if so, and
the current character is CR followed by LF, advance the starting offset
by two characters instead of one.

so when you try it again with those flags enabled, /(?<=a)/ causes a failure whereas /a\K/g succeeds but this time the vector returned is {2,2}. i haven't seen the demo code, but i assume the method is:
  • if previous call returned {N,N}, try the next match at offset N with both PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED enabled, and flip a switch.
  • if the match failed and the switch is on, try again at offset N+1
  • if the match succeeded, go back to step 1
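
The steps above can be sketched in Python as follows. This is only an illustration of the assumed method, not the pcredemo code and not mIRC's implementation. The two `exec_*` functions are hypothetical hand-written stand-ins for pcre_exec() with /a\K/ and /(?<=a)/, and the `notempty_atstart` and `anchored` parameters mimic the two PCRE options. The sketch advances by one character and ignores the CRLF case mentioned in the quote:

```python
def exec_a_K(subject, start, notempty_atstart=False, anchored=False):
    # Hypothetical stand-in for pcre_exec() with /a\K/. The reported match is
    # the empty span after the 'a', so it never starts at `start` itself and
    # notempty_atstart cannot reject it.
    if start > len(subject):
        return None
    candidates = [start] if anchored else range(start, len(subject))
    for i in candidates:
        if i < len(subject) and subject[i] == "a":
            return (i + 1, i + 1)
    return None


def exec_lookbehind_a(subject, start, notempty_atstart=False, anchored=False):
    # Hypothetical stand-in for pcre_exec() with /(?<=a)/: an empty match at
    # any position that is preceded by 'a'.
    if start > len(subject):
        return None
    candidates = [start] if anchored else range(start, len(subject) + 1)
    for i in candidates:
        if i > 0 and subject[i - 1] == "a":
            if notempty_atstart and i == start:
                continue  # an empty match at the start offset is not allowed
            return (i, i)
    return None


def find_all(pcre_exec, subject):
    # Returns the (offset, retry) calls that were made and the matches found.
    calls, found = [], []
    offset, retry = 0, False  # `retry` is the "switch" from the steps above
    while offset <= len(subject):
        calls.append((offset, retry))
        m = pcre_exec(subject, offset, notempty_atstart=retry, anchored=retry)
        if m is None:
            if not retry:
                break  # an ordinary match failed: no more matches
            offset, retry = offset + 1, False  # the retry failed: step forward
            continue
        found.append(m)
        offset, retry = m[1], m[0] == m[1]
    return calls, found


print(find_all(exec_a_K, "aa"))
# ([(0, False), (1, True), (2, True)], [(1, 1), (2, 2)])
print(find_all(exec_lookbehind_a, "aa"))
# ([(0, False), (1, True), (2, False), (2, True)], [(1, 1), (2, 2)])
```

Both patterns report {1,1} for the first call, but the list of calls shows the difference: for /a\K/ the retry at offset 1 succeeds, while for /(?<=a)/ it fails and the next ordinary match is attempted at offset 2.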


Joined: Dec 2002
Posts: 5,411
Hoopy frood
Thanks for looking into it further - for the time being I am going to revert the previous change, however I have added this to my to-do list. Making changes to the regex routine can have tricky side-effects, so I would like any new changes to be tested longer before making it into a final release.

Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
I'm not sure if the /g implementation is related or not but I've found something weird:

//echo -a $regex(test1...ctest1...,/test1(.*?)(?:c)?/g) $regml(0) -- $regml(1) - $regml(2) - $regml(3) - $regml(4)

This correctly finds two matches ($regex returns 2) and correctly finds that there are two backreferences ($regml(0) is 2), but returns empty values for them. Tested on 7.34.


#mircscripting @ irc.swiftirc.net == the best mIRC help channel
Joined: Aug 2013
Posts: 81
I
Babel fish
There are indeed two matches, but they are not what you might think they are.

The following pattern

Code:
/test1(.*?)(?:c)?/g


(Or in this case, more simply:

Code:
/test1(.*?)c?/g
)

Captures zero or more of any character, immediately preceded by test1, and immediately followed by either a c or nothing - all as lazily as possible. Because the group can capture zero characters followed by nothing, it does so, and so the match starts and stops immediately after the 1 of the test1s:

Code:
//echo -ga $regsubex(test1...ctest1...,/(test1)(.*?)c?/g,\1<\2>)


Returns test1<>...ctest1<>...

(Note that here I also capture the test1s so that they won't be, er, "thrown away" in the final substituted string...)
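
The same behaviour can be checked outside mIRC. Below is a small sketch using Python's re module, on the assumption that it treats this particular pattern the same way PCRE does. The last line is only a guess at what the original pattern was meant to do:

```python
import re

subject = "test1...ctest1..."

# Each match is just "test1": (.*?) takes nothing and the optional c is skipped.
print([m.group(0) for m in re.finditer(r"test1(.*?)(?:c)?", subject)])
# ['test1', 'test1']
print(re.findall(r"test1(.*?)(?:c)?", subject))
# ['', '']
print(re.sub(r"(test1)(.*?)c?", r"\1<\2>", subject))
# test1<>...ctest1<>...

# Requiring something after the lazy group forces it to expand:
print(re.findall(r"test1(.*?)(?:c|$)", subject))
# ['...', '...']
```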

Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
You are probably right. I considered this, but I'd been told it works in .NET; I didn't realize .NET doesn't use PCRE and thought it could be an issue. I guess I was expecting it to backtrack .*? to try to find a match with 'c' before treating c? as optional (in which case .*? can simply match nothing), probably because that's what I was looking for.


Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
I recently discovered #regex on freenode, where the author of regex101.com hangs out. I asked him about this issue since it's been bothering me (regex101.com handles the \K example and the (?=) lookahead example correctly).
What jaytea quoted from the manpage is exactly what the author is doing, and is what should be done.
With the recent addition of the custom /F modifier, which basically makes regexes work as they should for capturing groups that match zero times, I think it would be a good idea to implement the manpage recommendation for now only when /F is provided, and possibly support that implementation even without /F later on.


Joined: Dec 2002
Posts: 5,411
Hoopy frood
Thanks, this has been fixed for the next beta. It required some subtle changes to the regex code; however, the new code passes the above \K test, as well as the 50+ existing unit tests. Let's see how it goes in the next beta.

Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Very nice.


Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Code:
//echo -a $regex(éa,/(*UTF8)/g)
returns -11:
Originally Posted By: pcre.txt

PCRE_ERROR_BADUTF8_OFFSET (-11)

The UTF-8 byte sequence that was passed as a subject was checked and
found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
value of startoffset did not point to the beginning of a UTF-8
character or the end of the subject.

mIRC is probably advancing the offset to 1 instead of 2 in this case, since é is two bytes long in UTF-8.
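
As an illustration of the kind of advance that avoids this error, here is a minimal Python sketch that steps a byte offset past one whole UTF-8 character. `next_char_offset` is a hypothetical helper, not anything from mIRC or PCRE, and it assumes the subject has already been validated as UTF-8:

```python
def next_char_offset(data: bytes, offset: int) -> int:
    # Step past one whole UTF-8 character: move one byte, then skip any
    # continuation bytes (those of the form 0b10xxxxxx).
    offset += 1
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


subject = "éa".encode("utf-8")         # b'\xc3\xa9a': é takes two bytes
print(next_char_offset(subject, 0))    # 2
print(next_char_offset(subject, 2))    # 3
```

Advancing from offset 0 by a single byte would land on 0xA9, a continuation byte, which is exactly the situation PCRE_ERROR_BADUTF8_OFFSET describes.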

It seems like this bug goes back to this fix/thread. I don't have a version older than 7.22 at hand, so I only tested back to 7.22, which has the issue (6.35 behaves correctly); 7.21 is the version that has a fix for the originally reported issue:
Originally Posted By: 7.21
5. Fixed $regsubex() bug when handling empty substrings.


Joined: Feb 2003
Posts: 2,812
Hoopy frood
Khaled:

//echo -a $regsubex(abcd,//gu,/) == $regex(abcd,//gu) (vs.) $regsubex(abcÐ,//gu,/) == $regex(abcÐ,//gu)
Returns: /a/b/c/d/ == 5 (vs.) abcÐ == -11

vs

//echo -a $regsubex(abcd,/(.)|$/gu,/\1) == $regex(abcd,/(.)|$/gu) (vs.) $regsubex(abcÐ,/(.)|$/gu,/\1) == $regex(abcÐ,/(.)|$/gu)
Returns: /a/b/c/d/ == 5 (vs.) /a/b/c/Ð/ == 5

Last edited by Raccoon; 29/01/18 07:06 PM.

Well. At least I won lunch.
Good philosophy, see good in bad, I like!
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Thanks, this issue has been fixed for the next version.

Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Hello, I got the bad UTF-8 offset error again, this time using jaytea's alias to get the full match of a regex. I cannot repeat it enough: we badly need a way to get the full match, i.e. what has been matched by the regex engine, outside of captures.

Works without /g: $regex(aü, m/(?:[a-z]\E)(?(R)|\K)/u) is 1
Doesn't work with /g: $regex(aü, m/(?:[a-z]\E)(?(R)|\K)/gu) is -11


Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
Well, I was hoping for a quick fix and beta release because this issue is breaking my bot (it has been present since 7.55, iirc).

While the above issue is related to /g, I believe it's still a wrong increment that has nothing to do with the /g implementation as a whole.
But I found another issue regarding /g itself (thanks to the author of regex101.com): $regex(a,/a??/g) should result in 3 matches. The first should be an empty string, the second should be on "a", and the third should be an empty string.

In mIRC, the first two matches are empty, and the third match is on "a".

The reason for this result is that the second '?' makes the first quantifier '?' non-greedy: it will first attempt to match 'a' zero times, and since that succeeds, an empty string is returned.
What should be done in this case is to retry the match at the same position with the options PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED. Since that would be the exact same match, PCRE_NOTEMPTY_ATSTART invalidates it, so PCRE now tries to match 'a' once and succeeds, which is why the second match should be on 'a'. At this point mIRC should try the next position and find an empty match at the end of the string.
regex101.com and PHP behave correctly, and this is still in line with the recommendation from PCRE I pasted in previous posts in this thread.
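
For comparison, Python's re module (version 3.7 or later, which is an assumption of this sketch) also retries at the same position after an empty match while refusing another empty match there, so it yields the three matches described above. This uses Python's engine, not PCRE, and is shown only to illustrate the expected sequence:

```python
import re

# Requires Python 3.7 or later, where an empty match may be followed by a
# non-empty match starting at the same position.
spans = [m.span() for m in re.finditer(r"a??", "a")]
print(spans)  # [(0, 0), (0, 1), (1, 1)]
```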


Joined: Dec 2002
Posts: 5,411
Hoopy frood
Quote
Works without /g: $regex(aü, m/(?:[a-z]\E)(?(R)|\K)/u) is 1
Doesn't work with /g: $regex(aü, m/(?:[a-z]\E)(?(R)|\K)/gu) is -11

Thanks, this has been fixed for the next version.

Joined: Jul 2006
Posts: 4,145
W
Hoopy frood
maroon just pointed this out: mIRC is correct on /a??/g there. It's jaytea's alias that is broken and incorrectly reports the matches; adding a capture around a?? shows the correct content for mIRC.

