mIRC Home    About    Download    Register    News    Help

Print Thread
Joined: Feb 2006
Posts: 546
J
jaytea Offline OP
Fjord artisan
OP Offline
Fjord artisan
J
Joined: Feb 2006
Posts: 546
I'm afraid the ol' \K nightmare has come back to haunt us! mIRC crashes when a 'g' modified regex is supplied that matches an empty string and includes (?=___\K) to reset the start of the match to a position later in the subject:

Code:
//noop $regex(a, /(?=.\K)/g)


*kaboom*


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Offline
Hoopy frood
Joined: Dec 2002
Posts: 5,411
Thanks for your bug report. As far as I can tell, this expression searches infinitely for a match. I have tested this on several online regex testers and they all report either a timeout or an error. I think this may have been discussed before: mIRC could halt expressions that continue for more than a few seconds, however this would break valid expressions that need time to work.

Joined: Feb 2006
Posts: 546
J
jaytea Offline OP
Fjord artisan
OP Offline
Fjord artisan
J
Joined: Feb 2006
Posts: 546
i think i found the problem, but it is mostly speculation since i don't have access to low-level PCRE implementations smile

in the PCRE doc and more plainly in the pcre demo code, the author checks for an empty match using:

Code:
if (ovector[0] == ovector[1])


this is not the correct way to check for an empty match, however, since ovector[0] (the reported starting position) can be completely distorted in both directions using \K.

the correct way, as far as i can see, would be to compare ovector[1] (the reported ending position) to the original value of start_offset, since this cannot be mutilated by the user. i'm going to file a report with PCRE's Philip Hazel and see what he has to say smile


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
Joined: Feb 2006
Posts: 546
J
jaytea Offline OP
Fjord artisan
OP Offline
Fjord artisan
J
Joined: Feb 2006
Posts: 546
so, this is indeed a bug that exists within the PCRE demo code - which mIRC presumably uses - and the author has adapted the code to handle these cases. you can see some of the discussion here: https://bugs.exim.org/show_bug.cgi?id=2211

we continued the discussion privately thereafter (since i have my own opinions on how the edge cases of \K in assertions should play out), but the long and short of it is that until the Perl community comes to a consensus on how \K should be handled in assertions, the updated demo code should be used to avoid these infinite loops. though the updates were given in the demo code for PCRE2, they are fully compatible with the version of PCRE that mIRC uses.


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Offline
Hoopy frood
Joined: Dec 2002
Posts: 5,411
Thanks for looking into this. Would you know the meaning of the following new code in the updated pcre2demo.3 file:

Code:
  /* If the previous match was not an empty string, there is one tricky case to
  consider. If a pattern contains \eK within a lookbehind assertion at the
  start, the end of the matched string can be at the offset where the match
  started. Without special action, this leads to a loop that keeps on matching
  the same substring. We must detect this case and arrange to move the start on
  by one character. The pcre2_get_startchar() function returns the starting
  offset that was passed to pcre2_match(). */

  else
    {
    PCRE2_SIZE startchar = pcre2_get_startchar(match_data);
    if (start_offset <= startchar)
      {
      ...
      }
    }

There is no pcre2_get_startchar() in PCRE1. It only exists in PCRE2. The description states that pcre2_get_startchar() returns the starting offset that was passed to pcre2_match(). But the starting offset appears to be "start_offset", so that doesn't make sense. Note that PCRE1 uses pcre_exec().

Would you know how I would reproduce this code in PCRE1?

Update: Right, I think pcre2_get_startchar() refers to ovector[1] but will need confirmation of that.

Last edited by Khaled; 13/01/18 12:07 PM.
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Offline
Hoopy frood
Joined: Dec 2002
Posts: 5,411
Okay, I have implemented the changes and they resolve the infinite loop issue. However, they also break one of the \K examples you provided some years ago, which is one of my test cases:

Code:
text = abc
pattern = /(?=.\K)/
replace = X

Expected "aXabc". But now returns "abc".

Unfortunately, I have nothing to compare these results against because none of the regular expression testers I have tried handle \K properly.

Here are the test cases I have for \K, which were provided by you or others in the past:

Code:
text = abcd
pattern = /.\K/g
replace = <>
expect = 

text = abc
pattern = /(?=.\K)/
replace = X
expect = 

text = abc
pattern = /(?<=\K.)/
replace = X
expect = 

text = abcd
pattern = /.\K/
replace = <>
expect = 

; previously resulted in infinite loop
text = abcd
pattern = /.\K/g
replace = <>
expect = 

; previously resulted in infinite loop
text = abc
pattern = /(?<=\K.).*/g
replace =
expect =

; previously resulted in infinite loop
text = abc
pattern = /(?=.\K)/g
replace = X
expect =

Can you fill in the "expect=" value for each item so that I can confirm whether mIRC is reporting the correct result? Or better yet, provide a link to a regex tester that returns the correct results?

Joined: Feb 2006
Posts: 546
J
jaytea Offline OP
Fjord artisan
OP Offline
Fjord artisan
J
Joined: Feb 2006
Posts: 546
i assume you found that get_startchar() refers to the old value of start_offset that was passed in order to yield the current results (otherwise the infinite loop issue would not be resolved)?

regarding the examples that have become broken.. this is (unfortunately?) how PCRE recommends they be handled until such a time that the Perl community is able to define exactly how \K should be handled in all cases.

think of the expression "/(?=.\K)|a/g" against "abc". in the first round, the reported match is [1,0]. PCRE demo code throws a fit. ideally, we would want it to re-try the match at position 0, but this time avoiding that same [1,0] match, favouring the [0,1] match instead ("|a"). however, there is no way to instruct the engine to do this.

you're quite right that no online testers have got this right, that's because most of them will be using the PCRE demo code or slight modifications of it. Perl itself hangs on expressions such as "/(?<=\K.)/g" and also throws an error on "(?=.\K)", so there is no real precedent for "correct" behaviour. i have my own opinions, which differ from PCRE's upgraded implementation in the following ways:
  • support for "/(?=.\K)/g" and similar
  • avoid duplicate matches in expressions such as "/(?<=\K.)/g"
  • match 'b' in expressions such as "/(?<=\Ka)|b/g" against "abc"

the problem is that to fully implement these results, a small change to the core match() routine needs to be made. the measures taken against matching additional empty strings need to be revised; the NOTEMPTY_ATSTART and ANCHORED option combo needs to be wiped in favour of a new option ("NONADV") that can be toggled to indicate a "non-advancing" state, ie. one in which the end of match did not previously advance. since ideal g-option propagation relies solely on the end of the match rather than its start, it is the end of the matches that should be studied in order to properly stave off infinite loops and avoid duplicate matches.

anyway, sorry, i'm rambling. the long and short of it is that these results you now see are to be expected.


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
Joined: Dec 2002
Posts: 5,411
Hoopy frood
Offline
Hoopy frood
Joined: Dec 2002
Posts: 5,411
Quote:
i assume you found that get_startchar() refers to the old value of start_offset that was passed in order to yield the current results (otherwise the infinite loop issue would not be resolved)?

Thanks, that helped!

Quote:
regarding the examples that have become broken.. this is (unfortunately?) how PCRE recommends they be handled until such a time that the Perl community is able to define exactly how \K should be handled in all cases.

Okay, I have implemented the changes and they do not appear to have affected my other regex test cases, only the \K test cases. The inifite loop issue should be resolved for the next beta.


Link Copied to Clipboard