Posted By: jaytea //g and empty substring matches - 26/06/11 01:44 PM
there is a small 'off by 1' bug in mIRC's current implementation of //g that presents itself when \K is used at the end of certain patterns:

Code:
//echo -a $regsubex(abcd, /.\K/g, <>)


= a<>bc<>d

since \K effectively 'dumps' the portion of the subject consumed up until the point it's used, preventing it from being substituted, this should insert '<>' after every character.. but misses both 'b' and 'd'.

my guess is that, when //g is used, mIRC places successive calls to the pcre_exec() function with a startoffset value that is determined solely by looking at the return values associated with the last call to the function (namely, the start and end offsets ovector[0] and ovector[1] respectively).

if ovector[0] == ovector[1], implying that an empty substring was consumed, mIRC advances the start offset to 1 more than it normally would (ovector[1]) so as to avoid potentially endless calling of the function. however, it should only add 1 if ovector[1] == the original startoffset value that was passed to pcre_exec(). so the new offset should be 'ovector[1] + (oldoffset == ovector[1])' rather than 'ovector[1] + (ovector[0] == ovector[1])', so to speak :P
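
To make the two offset rules concrete, here is a rough simulation in Python. It is only a sketch: dot_then_k is a hand-written stand-in for what pcre_exec() would return for /.\K/ (it is not PCRE), and the names substitute_all, guessed_rule and proposed_rule are made up for illustration. Whether mIRC actually uses the first rule is the guess described above, not something confirmed.

```python
def dot_then_k(subject, start):
    # Stand-in for /.\K/: consume one character, then report an empty
    # match right after it (the pair plays the role of ovector[0], ovector[1]).
    if start < len(subject):
        return (start + 1, start + 1)
    return None

def substitute_all(subject, repl, next_offset):
    # Emulate a /g substitution; next_offset decides where the next call starts.
    out, last, offset = [], 0, 0
    while offset <= len(subject):
        match = dot_then_k(subject, offset)
        if match is None:
            break
        begin, end = match
        out.append(subject[last:begin] + repl)
        last = end
        offset = next_offset(offset, begin, end)
    return "".join(out) + subject[last:]

def guessed_rule(old, begin, end):
    # Assumed current behaviour: ovector[1] + (ovector[0] == ovector[1])
    return end + (begin == end)

def proposed_rule(old, begin, end):
    # Suggested behaviour: ovector[1] + (oldoffset == ovector[1])
    return end + (old == end)

print(substitute_all("abcd", "<>", guessed_rule))   # a<>bc<>d
print(substitute_all("abcd", "<>", proposed_rule))  # a<>b<>c<>d<>
```

With the guessed rule the simulation reproduces the reported output, and with the proposed rule '<>' lands after every character.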
Posted By: Khaled Re: //g and empty substring matches - 18/08/11 11:19 AM
Thanks this has been fixed for the next version.
Posted By: jaytea Re: //g and empty substring matches - 28/01/12 11:02 AM
there is an off by 1 error that appears to have come about through implementing that fix

Code:
//echo -a $regex(a a, /(?=a)./g) $regex(a a, /(?=a)/g)


returns 2 and 3, should be 2 and 2.
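
One possible explanation for the extra match, assuming the fix applied the rule 'ovector[1] + (oldoffset == ovector[1])' from the first post (this is not confirmed): the sketch below replays that rule in Python. Python's re module stands in for PCRE here, which is fine for these two lookahead patterns but is not the engine mIRC uses, and count_global_matches is a made-up name.

```python
import re

def count_global_matches(pattern, subject):
    # Count /g matches when the next start offset is computed as
    # ovector[1] + (oldoffset == ovector[1]).
    regex = re.compile(pattern)
    offset, count = 0, 0
    while offset <= len(subject):
        match = regex.search(subject, offset)
        if match is None:
            break
        count += 1
        # Only step past the match end if the match ended where the search began.
        offset = match.end() + (offset == match.end())
    return count

print(count_global_matches(r"(?=a).", "a a"))  # 2
print(count_global_matches(r"(?=a)", "a a"))   # 3
```

In the second case the search that starts at offset 1 finds the empty match at position 2, the rule keeps the offset at 2, and the same empty match is found a second time.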
Posted By: Khaled Re: //g and empty substring matches - 02/02/12 09:06 PM
Thanks, so far I have only been able to resolve this issue by reverting the previous change. I have taken a look at several other PCRE implementations and unfortunately they vary quite a bit in the way they call PCRE, so it is difficult to know whether they can handle the above expressions correctly. I have also tried testing your expressions on several regex online testers (such as this one) and their results vary. Can you find an online regex tester that reproduces the desired behaviour? Are you sure that this should be the correct behaviour?
Posted By: jaytea Re: //g and empty substring matches - 03/02/12 03:15 PM
i'm fairly certain about the behaviour being correct, despite the fact that even the current version of PHP seems to get it wrong! what i shouldn't have been sure about, and i apologize if it misled you, was the solution i proposed at the end of my original post. in truth, the next position mIRC should try cannot be determined solely by the original start offset and ovector[0]/[1].

for example, given the string "aa" and an original offset value of 0:
  • the expression /a\K/g results in {1,1} returned in the ovector, but the correct move is to re-try the pattern at position 1.
  • the expression /(?<=a)/g also results in {1,1} returned, but the correct move in this case is to move to position 2, since it will match the same 'a' again if kept at position 1.

man.txt has some advice regarding this issue:

Originally Posted By: "pcre.org/man.txt"

Finding all the matches in a subject is tricky when the pattern can
match an empty string. It is possible to emulate Perl's /g behaviour by
first trying the match again at the same offset, with the
PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
fails, advancing the starting offset and trying an ordinary match
again. There is some code that demonstrates how to do this in the
pcredemo sample program. In the most general case, you have to check to see
if the newline convention recognizes CRLF as a newline, and if so, and
the current character is CR followed by LF, advance the starting offset
by two characters instead of one.

so when you try it again with those flags enabled, /(?<=a)/ causes a failure whereas /a\K/g succeeds but this time the vector returned is {2,2}. i haven't seen the demo code, but i assume the method is:
  • if previous call returned {N,N}, try the next match at offset N with both PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED enabled, and flip a switch.
  • if the match failed and the switch is on, try again at offset N+1
  • if the match succeeded, go back to step 1
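
The steps above can be sketched roughly as follows. This is a Python simulation under loud assumptions, not the pcredemo code: a_then_k and after_a are hand-written stand-ins for what pcre_exec() would return for /a\K/ and /(?<=a)/, and the anchored / notempty_atstart keyword arguments only imitate PCRE_ANCHORED and PCRE_NOTEMPTY_ATSTART. The CRLF case from the man page is ignored.

```python
def a_then_k(subject, start, anchored=False, notempty_atstart=False):
    # Stand-in for /a\K/: consume one 'a', report an empty match after it.
    # The reported match never begins at `start`, so notempty_atstart allows it.
    stop = start + 1 if anchored else len(subject)
    for i in range(start, min(stop, len(subject))):
        if subject[i] == "a":
            return (i + 1, i + 1)
    return None

def after_a(subject, start, anchored=False, notempty_atstart=False):
    # Stand-in for /(?<=a)/: an empty match at any position preceded by 'a'.
    stop = start if anchored else len(subject)
    for i in range(start, stop + 1):
        if i > 0 and subject[i - 1] == "a":
            if notempty_atstart and i == start:
                continue  # empty match at the start offset is rejected
            return (i, i)
    return None

def global_matches(exec_fn, subject):
    matches, offset = [], 0
    retry = False  # the "switch": the previous match was empty
    while offset <= len(subject):
        m = exec_fn(subject, offset, anchored=retry, notempty_atstart=retry)
        if m is None:
            if not retry:
                break        # an ordinary match failed: nothing left to find
            offset += 1      # the anchored retry failed: move on one character
            retry = False
            continue
        matches.append(m)
        offset = m[1]
        retry = (m[0] == m[1])
    return matches

print(global_matches(a_then_k, "aa"))  # [(1, 1), (2, 2)]
print(global_matches(after_a, "aa"))   # [(1, 1), (2, 2)]
```

Both patterns end up with matches at positions 1 and 2, even though the retry at offset 1 succeeds for the first and fails for the second.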
Posted By: Khaled Re: //g and empty substring matches - 08/02/12 11:47 AM
Thanks for looking into it further - for the time being I am going to revert the previous change, however I have added this to my to-do list. Making changes to the regex routine can have tricky side-effects, so I would like any new changes to be tested longer before making it into a final release.
Posted By: Wims Re: //g and empty substring matches - 23/07/14 12:16 AM
I'm not sure if the /g implementation is related or not but I've found something weird:

//echo -a $regex(test1...ctest1...,/test1(.*?)(?:c)?/g) $regml(0) -- $regml(1) - $regml(2) - $regml(3) - $regml(4)

This correctly finds two matches ($regex returns 2) and correctly reports two backreferences ($regml(0) is 2), but returns an empty value for each of them. Tested on 7.34.
Posted By: Iire Re: //g and empty substring matches - 23/07/14 03:11 AM
There are indeed two matches, but they are not what you might think they are.

The following pattern

Code:
/test1(.*?)(?:c)?/g


(Or in this case, more simply:

Code:
/test1(.*?)c?/g
)

Captures zero or more of any character, immediately preceded by test1, and immediately followed by either a c or nothing - all as lazily as possible. Because the group can capture zero characters followed by nothing, it does so, and so the match starts and stops immediately after the 1 of the test1s:

Code:
//echo -ga $regsubex(test1...ctest1...,/(test1)(.*?)c?/g,\1<\2>)


Returns test1<>...ctest1<>...

(Note that here I also capture the test1s so that they won't be, er, "thrown away" in the final substituted string...)
Posted By: Wims Re: //g and empty substring matches - 23/07/14 03:26 AM
You are probably right. I considered this, but I've been told it works in .net; I didn't realize .net wasn't pcre and thought it could be an issue. I guess I was thinking it would backtrack .*? to try to find a match with 'c' before deciding that c? is optional, meaning .*? can simply match nothing, probably because that's what I was looking for..
Posted By: Wims Re: //g and empty substring matches - 28/06/16 11:11 AM
I recently discovered #regex on freenode, where the author of regex101.com hangs out. I asked him about this issue since it's been bothering me (regex101.com handles the \K example and the (?=) lookahead example correctly).
What jaytea quoted from the manpage is exactly what the author is doing, and is what should be done.
With the recent addition of the custom /F modifier, which basically makes regexes work as they should when a capturing group matches zero times, I think it would be a good idea to implement the manpage recommendation for now only when /F is provided, and possibly support that implementation without /F later on.
Posted By: Khaled Re: //g and empty substring matches - 01/07/16 12:32 PM
Thanks, this has been fixed for the next beta. It required some subtle changes to the regex code, however the new code passes the above \K test, as well as the 50+ existing unit tests. Let's see how it goes in the next beta.
Posted By: Wims Re: //g and empty substring matches - 01/07/16 05:41 PM
Very nice.
Posted By: Wims Re: //g and empty substring matches - 28/01/18 09:26 PM
Code:
//echo -a $regex(éa,/(*UTF8)/g)
returns -11:
Originally Posted By: pcre.txt

PCRE_ERROR_BADUTF8_OFFSET (-11)

The UTF-8 byte sequence that was passed as a subject was checked and
found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
value of startoffset did not point to the beginning of a UTF-8
character or the end of the subject.
mIRC is probably advancing the offset to 1 instead of 2 in this case
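
For illustration, here is a small Python sketch of what stepping forward by one whole UTF-8 character (rather than one byte) looks like. It only shows the idea and is not mIRC's code; next_char_offset is a made-up helper name.

```python
def next_char_offset(data: bytes, offset: int) -> int:
    # Move to the byte offset where the next UTF-8 character begins.
    offset += 1
    # Continuation bytes have the form 0b10xxxxxx; skip over them so the
    # offset never lands in the middle of a multi-byte character.
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset

subject = "éa".encode("utf-8")       # b'\xc3\xa9a', 3 bytes
print(next_char_offset(subject, 0))  # 2
print(next_char_offset(subject, 2))  # 3
```

Offset 1 points at the second byte of 'é', which is the kind of startoffset that PCRE rejects with -11.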

It seems like this bug goes back to this fix/thread. I don't have an older version than 7.22 at hand, so I only tried back as far as 7.22, which has the issue (6.35 behaves correctly), and 7.21 has a fix for the original reported issue
Originally Posted By: 7.21
5.Fixed $regsubex() bug when handling empty substrings.
Posted By: Raccoon Re: //g and empty substring matches - 29/01/18 06:52 PM
Khaled:

//echo -a $regsubex(abcd,//gu,/) == $regex(abcd,//gu) (vs.) $regsubex(abcÐ,//gu,/) == $regex(abcÐ,//gu)
Returns: /a/b/c/d/ == 5 (vs.) abcÐ == -11

vs

//echo -a $regsubex(abcd,/(.)|$/gu,/\1) == $regex(abcd,/(.)|$/gu) (vs.) $regsubex(abcÐ,/(.)|$/gu,/\1) == $regex(abcÐ,/(.)|$/gu)
Returns: /a/b/c/d/ == 5 (vs.) /a/b/c/Ð/ == 5
Posted By: Khaled Re: //g and empty substring matches - 29/01/18 07:41 PM
Thanks this issue has been fixed for the next version.
Posted By: Wims Re: //g and empty substring matches - 21/12/20 01:33 PM
Hello, I got the bad utf8 offset error again using jaytea's alias to get the fullmatch of a regex, I cannot repeat it enough but we badly need a way to get the full match, aka what has been matched by the regex engine, outside of captures.

Works without /g: $regex(aü, m/(?:[a-z]\E)(?(R)|\K)/u) is 1
Doesn't work with /g: $regex(aü, m/(?:[a-z]\E)(?(R)|\K)/gu) is -11
Posted By: Wims Re: //g and empty substring matches - 30/12/20 01:03 PM
Well, I was hoping for a quick fix and beta release because this issue is breaking my bot (it has been present since 7.55 iirc).

While the above issue is related to /g, I believe it's still a wrong increment that has nothing to do with the /g implementation as a whole.
But, I found another issue regarding /g itself (thanks to the author of regex101.com), $regex(a,/a??/g) should result in 3 matches, the first one should be an empty string, the second one should be on "a", and the third one should be an empty string.

In mIRC, the first two matches are empty, and the third match is on "a".

The reason for this result is that the second '?' makes the first quantifier '?' non-greedy: it will first attempt to match 'a' zero times, and since that succeeds, an empty string is returned.
What should be done in this case is to retry the match at the same position with the options PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED. Since that would be the exact same match, PCRE_NOTEMPTY_ATSTART invalidates it, meaning pcre now tries to match 'a' one time and succeeds, which is why the second match should be on 'a'. At this point mIRC should try the next position and find an empty match at the end of the string.
regex101.com and PHP behave correctly, and this is still according to the recommendation from pcre I pasted in previous posts in this thread.
Posted By: Khaled Re: //g and empty substring matches - 03/01/21 12:27 PM
Quote
Works without /g: $regex(aü, m/(?:[a-z]\E)(?(R)|\K)/u) is 1
Doesn't work with /g: $regex(aü, m/(?:[a-z]\E)(?(R)|\K)/gu) is -11

Thanks this has been fixed for the next version.
Posted By: Wims Re: //g and empty substring matches - 17/01/21 07:35 PM
maroon just pointed this out: mIRC is correct on /a??/g there. It's jaytea's alias that is broken and incorrectly reporting the matches; adding a capture around a?? shows the correct content for mIRC.
© mIRC Discussion Forums