Regex \k

Posted By: Chessnut

Regex \k - 28/08/08 12:26 AM

Ok, so I stumbled across something very weird today, and I was wondering if anyone could explain it. As far as I know, \k on its own in regex means absolutely nothing apart from an unnecessary escape of the literal "k".

It DOES have a special meaning when followed by a named group such as \k<NAME> or \k'name', and these work as expected.
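
For comparison, here is a minimal sketch of what a named backreference does when used as intended. It uses Python's re module rather than mIRC or PCRE, so the backreference is spelled (?P=name) where PCRE would accept \k<name>. The group name "ch" and the sample word are made up for illustration.

```python
import re

# Python's re writes a named backreference as (?P=name);
# PCRE accepts the same idea spelled \k<name> or \k'name'.
doubled = re.compile(r"(?P<ch>[a-z])(?P=ch)")

# Find the first letter that is immediately repeated.
m = doubled.search("hello")
print(m.group(0), m.group("ch"), m.start())
```

The backreference only succeeds where the named group's captured text repeats, which is why a \k with no name after it has nothing sensible to refer to.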

Yet mIRC does something strange with \k. For some reason, the ONLY thing \k on its own matches in the ASCII range 1-255 is ASCII 85: the letter "U". The following example makes little sense to me:

Code:
//echo -a $regsubex(ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,/(\k)/gi,$+($chr(3),4@,$chr(15))) - $regml(1) - $regml(1).pos


Which results in:

ABCDEFGHIJKLMNOPQRST@VWXYZabcdefghijklmnopqrstuvwxyz - -

It obviously captures the U since it replaces it, yet it does not echo the backreference as "U" or the position of it.

Why "U"? And why such weird behaviour? Any ideas anyone?

Edit: Another very odd example is:

Code:
//echo -ag $regsubex(aUa,/a(\k)a/gi,@)


Which echoes "@a"
Posted By: hixxy

Re: Regex \k - 28/08/08 12:38 AM

That's the weirdest thing I've ever seen.

AFAIK, though, all mIRC does is use the PCRE library to do the pattern matching, so it's probably a PCRE quirk.
Posted By: Chessnut

Re: Regex \k - 28/08/08 12:57 AM

Yeah, I had a feeling it might be a PCRE quirk as opposed to a mIRC one, but I was hoping the community here would be able to shed some light on it.

After some extensive testing, I've made another discovery. When \k is used, it works as a sort of broken lookahead. For example:

Code:
/a(\k)abc/


works EXACTLY the same way as:

Code:
/aU(?=.*abc)/
Posted By: Thrull

Re: Regex \k - 03/09/08 07:29 AM

I'm at a complete loss as to how you discovered this. Care to shed some light on what you were doing when you found such an odd hiccup?

I laughed out loud. That's just WEIRD.
Posted By: Mpdreamz

Re: Regex \k - 05/09/08 10:26 AM

This might very well be a PCRE bug. The \k behaviour doesn't exist in mIRC 6.3, which uses PCRE 7.2 rather than the 7.7 that the current version uses.

The lookahead behaviour seems a bit like \K behaviour:
Code:
//echo -a $regsubex(hello,/(h\Kello)/,-\1-) $regsubex(hello,/(hello)/,-\1-)


If this isn't a mIRC bug we should probably report it here as well.
Posted By: Chessnut

Re: Regex \k - 05/09/08 05:51 PM

I saw the link with \K myself, but it does the opposite - instead of disregarding everything BEFORE it, it disregards everything AFTER it. Had it not been for the way it matches U as well, I would have put it down to a forgotten/undocumented feature of PCRE.

Also, the following example seems to suggest that it doesn't just look ahead, it looks through the ENTIRE pattern again:

Code:
//echo -ag $regex(aU,/(\K)a/)


Thrull: I was looping through \a \b \c and so on to save every character each one matches. I noticed that \k on its own just matched U, which I thought was odd, so I experimented a bit ;)
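
The probing loop described here can be sketched roughly as follows. This is an illustration only: it uses Python's re module, which is not PCRE, so it will not reproduce the "U" match. The helper name probe_escapes is hypothetical.

```python
import re
import string


def probe_escapes(letters=string.ascii_letters):
    """Map each backslash-letter escape to the characters in 1..255
    that it matches entirely on its own, or to an error message."""
    subject_chars = [chr(i) for i in range(1, 256)]
    results = {}
    for letter in letters:
        pattern = "\\" + letter
        try:
            compiled = re.compile(pattern)
        except re.error as exc:
            # This engine rejects the escape at compile time.
            results[pattern] = "error: {}".format(exc)
            continue
        # Keep only characters that the escape consumes completely.
        results[pattern] = [c for c in subject_chars if compiled.fullmatch(c)]
    return results


if __name__ == "__main__":
    for pattern, matched in probe_escapes().items():
        print(pattern, matched)
```

In Python's re, the entry for \k is an error string, because a bare \k is rejected when the pattern is compiled. Running the same kind of loop against the PCRE build in mIRC is what exposed the lone "U".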

Is anyone able to test on another application also using the updated PCRE? If it does the same there, I'll file a bug report with the creators.
Posted By: Chessnut

Re: Regex \k - 05/04/11 10:30 PM

At the risk of getting shot down for resurrecting this topic, I was wondering if anyone can shed any light on it as the problem/quirk still exists after 2.5 years!

To shed some more light on exactly what the hell is going on, I've made the following discoveries, each of which just makes it more mysterious.

  • aUabc,/a(\k)abc/ matches but aUabc,/a(\k)abcd/ does not, so it clearly takes into account what follows the (\k)
  • aUabc,/a(\k)gabc/ matches but /a(\k)gabcd/ does not. It seems to match SOME of the expression after the (\k)! It would seem like it matches one or more characters from the end of the expression after (\k).
  • It gets even more unusual! aUooabcoo,/a(\k)gabc/ matches too! This would imply that one or more characters from the end of the expression after (\k) can match at any position after the U in the input.
  • Now even more trippy. aUooabcoo,/a(\k)ga(*F)bc/ also matches. If the PCRE engine had ever evaluated the a(*F)bc, it would fail (since (*F) automatically causes a failure).
  • And what about this? aUabc,/a(\k)ab[zz]/ DOES match, but aUabc,/a(\k)ab[z]/ DOESN'T!


The last bullet point in particular I just cannot explain. If anyone has any other observations or comments, please post.

I've tried this in PHP, which uses PCRE, but doesn't exhibit the same behaviour and just matches a 'k' as you'd expect.

I did wonder if this may have something to do with the pre-processing that mIRC must do before things are passed to the PCRE engine, for example stripping the colour codes with /S and handling the extra \co \cb escapes, which would mean it was a mIRC bug.
Posted By: Sherip

Re: Regex \k - 06/04/11 10:19 PM

I think you've found an anomaly in PCRE itself. I confirmed with a variety of other software running under Windows (including pcretest and PHP) that \k by itself is matching capital U. I've reported it and I'll let you know where it goes.
Posted By: Chessnut

Re: Regex \k - 07/04/11 01:55 PM

That's strange - when I tested it in PHP it just seemed to match a 'k' as you'd expect. Any idea what PCRE versions the things you tested it on were at?

And yes, please keep me informed with any response you get!

Cheers
Posted By: Sherip

Re: Regex \k - 07/04/11 08:06 PM

Most of the software I tested with used the latest or a fairly recent PCRE release. PHP version 5.3.6 has PCRE 8.11 aka 8.12 in it.

Just tested in PHP 5.2.5 which had PCRE 7.3 in it. In that version, \k by itself seems to match the empty string at the end of a subject!

Also just tried Perl. Perl 5.10 gives an error "Sequence \k... not terminated in regex". A much older version (5.8.8 from 2006) matches 'k'

Here's the reply to my inquiry:

On Wed, 6 Apr 2011, Sheri wrote:
> Does this make any sense? By itself \k is matching a capital U

In principle, what it should do is what Perl does (unless Perl does something stupid, and matching U seems to me to be stupid). I have noted the issue and will look at it in due course. If all goes well, I am expecting to start a round of work on PCRE after Easter. There are a number of issues to deal with, so it will be a while before the next release.

Philip
--
Philip Hazel

Posted By: Sherip

Re: Regex \k - 04/08/11 03:39 PM

Not sure how soon testing will be complete but the first Release Candidate for PCRE version 8.13 is out and the issue is fixed. If \k is used incorrectly in a pattern, the pattern compiler now gives the error "PCRE compilation failed at offset X: \k is not followed by a braced, angle-bracketed, or quoted name"
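
As a rough illustration of the rule behind that error message, the sketch below scans a pattern for a \k that is not followed by {, < or '. It is a hypothetical pre-check written in Python, not PCRE's actual implementation. It ignores character classes and \Q...\E quoting, does not verify that the name is properly terminated, and its offsets point at the backslash, which may differ from the offset PCRE reports.

```python
def bare_k_offsets(pattern):
    """Return offsets of each \\k that is not followed by '{', '<' or "'"."""
    offsets = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            if pattern[i + 1] == "k":
                # Character right after \k; empty string at end of pattern.
                follower = pattern[i + 2:i + 3]
                if follower not in ("{", "<", "'"):
                    offsets.append(i)
            # Skip the escaped character so an escaped backslash
            # followed by a literal k is not misread as \k.
            i += 2
        else:
            i += 1
    return offsets


print(bare_k_offsets(r"a(\k)abc"))      # pattern from this thread
print(bare_k_offsets(r"(?<n>a)\k<n>"))  # properly named backreference
```

A check like this could flag patterns such as /a(\k)abc/ before they reach an older PCRE build that still accepts them silently.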
Posted By: Khaled

Re: Regex \k - 18/08/11 11:27 AM

Thanks for the update, I'll make sure mIRC uses the latest PCRE release for the next version.
Posted By: Sherip

Re: Regex \k - 20/08/11 06:59 PM

Originally Posted By: Khaled
Thanks for the update, I'll make sure mIRC uses the latest PCRE release for the next version.


Hi Khaled, it was released but you might want to hold off for just a bit. Seems something got broken with 8.13 and another update is expected soon. See here: http://bugs.exim.org/show_bug.cgi?id=1136
Posted By: Sherip

Re: Regex \k - 02/11/11 06:33 PM

FYI, PCRE 8.20 was released on Oct 21 and is stable. It has a new optional feature for JIT-compiling selected regex patterns that will be matched repeatedly, at the expense of added lead time.
Posted By: Khaled

Re: Regex \k - 03/11/11 08:15 AM

Thanks, the new version of PCRE will be included in the next release.