#203811 28/08/08 12:26 AM
Joined: May 2007
Posts: 37
C
Ameglian cow
OP Offline
Ok, so I stumbled across something very weird today, and I was wondering if anyone could explain it. As far as I know, \k on its own in regex means absolutely nothing apart from an unnecessary escape of the literal "k".

It DOES have a special meaning when followed by a named group such as \k<NAME> or \k'name', and these work as expected.

Yet mIRC does something strange with \k. For some reason the ONLY thing \k on its own matches in the ASCII range 1-255 is ASCII 85: the letter "U". The following example makes little sense to me:

Code:
//echo -a $regsubex(ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,/(\k)/gi,$+($chr(3),4@,$chr(15))) - $regml(1) - $regml(1).pos


Which results in:

ABCDEFGHIJKLMNOPQRST@VWXYZabcdefghijklmnopqrstuvwxyz - -

It obviously captures the U since it replaces it, yet it does not echo the backreference as "U" or the position of it.

Why "U"? And why such weird behaviour? Any ideas anyone?

Edit: Another very odd example is:

Code:
//echo -ag $regsubex(aUa,/a(\k)a/gi,@)


Which echoes "@a"

Last edited by Chessnut; 28/08/08 12:34 AM.
Chessnut #203813 28/08/08 12:38 AM
Joined: Sep 2005
Posts: 2,881
H
Hoopy frood
Offline
That's the weirdest thing I've ever seen.

AFAIK, though, all mIRC does is use the PCRE library to do the pattern matching, so it's probably a PCRE quirk.

Chessnut #203815 28/08/08 12:57 AM
Joined: May 2007
Posts: 37
C
Ameglian cow
OP Offline
Yeah, I had a feeling it might be a PCRE quirk as opposed to a mIRC one, but I was hoping the community here would be able to shed some light on it.

After some extensive testing, I've made another discovery. When \k is used, it works as a sort of crude lookahead. For example:

Code:
/a(\k)abc/


works EXACTLY the same way as:

Code:
/aU(?=.*abc)/
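
For anyone who wants to see what the lookahead version does on its own, here is a minimal sketch. It assumes Python's standard re module as a stand-in (not mIRC and not PCRE), so it only demonstrates the ordinary /aU(?=.*abc)/ pattern and does not reproduce the \k quirk. The helper name matched_text and the test subjects are made up for illustration.

```python
import re

# The lookahead pattern from the post, written for Python's re module.
LOOKAHEAD = re.compile(r"aU(?=.*abc)")

def matched_text(subject):
    """Return the text consumed by the match, or None if there is no match."""
    m = LOOKAHEAD.search(subject)
    return m.group(0) if m else None

for subject in ("aUabc", "aUooabcoo", "aUab"):
    # The lookahead checks that "abc" appears somewhere later,
    # but only "aU" is consumed by the match itself.
    print(subject, "->", matched_text(subject))
# aUabc -> aU
# aUooabcoo -> aU
# aUab -> None
```

The point is that a lookahead only asserts that something follows without consuming it, so the reported match stops at the U.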

Chessnut #203980 03/09/08 07:29 AM
Joined: Aug 2006
Posts: 183
T
Vogon poet
Offline
I'm at a complete loss as to how you discovered this. Care to shed some light on what you were doing when you found such an odd hiccup?

I laughed out loud. That's just WEIRD.


Yar
Chessnut #204014 05/09/08 10:26 AM
Joined: Apr 2004
Posts: 759
M
Hoopy frood
Offline
This might very well be a PCRE bug. The \k behaviour doesn't exist in mIRC 6.3, which uses PCRE 7.2 rather than the 7.7 that the current version uses.

The lookahead behaviour seems a bit like \K behaviour:
Code:
//echo -a $regsubex(hello,/(h\Kello)/,-\1-) $regsubex(hello,/(hello)/,-\1-)


If this isn't a mIRC bug we should probably report it here as well.


$maybe
Mpdreamz #204018 05/09/08 05:51 PM
Joined: May 2007
Posts: 37
C
Ameglian cow
OP Offline
I saw the link with \K myself, but it does the opposite - instead of disregarding everything BEFORE it, it disregards everything AFTER it. Had it not been for the way it matches U as well, I would have put it down to a forgotten/undocumented feature of PCRE.

Also, the following example seems to suggest that it doesn't just look ahead, it looks through the ENTIRE pattern again:

Code:
//echo -ag $regex(aU,/(\K)a/)


Thrull: I was looping through \a \b \c etc. to save every character each one matches. I noticed that \k on its own just matched U, which I thought was odd, so I experimented a bit ;)
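
The probing loop described above could look roughly like this. It is a sketch only, assuming Python's standard re module rather than mIRC's PCRE, so it will not show \k matching U. The function names chars_matched_by and probe_escapes are hypothetical.

```python
import re
import string

def chars_matched_by(pattern, lo=1, hi=255):
    """Return every single character in [lo, hi] that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [chr(cp) for cp in range(lo, hi + 1) if compiled.fullmatch(chr(cp))]

def probe_escapes(letters):
    """Try backslash + letter for each letter and record what it matches."""
    results = {}
    for letter in letters:
        escape = "\\" + letter
        try:
            results[escape] = chars_matched_by(escape)
        except re.error as exc:
            # Python's re rejects unknown escapes outright instead of guessing.
            results[escape] = "rejected: %s" % exc
    return results

if __name__ == "__main__":
    for escape, outcome in probe_escapes(string.ascii_lowercase).items():
        print(escape, repr(outcome))
```

Zero-width escapes such as \b come back with an empty list here, because fullmatch requires the single character to be consumed.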

Is anyone able to test on another application also using the updated PCRE? If it does the same there, I'll file a bug report with the creators.

Last edited by Chessnut; 05/09/08 05:52 PM.
Chessnut #231154 05/04/11 10:30 PM
Joined: May 2007
Posts: 37
C
Ameglian cow
OP Offline
At the risk of getting shot down for resurrecting this topic, I was wondering if anyone can shed any light on it as the problem/quirk still exists after 2.5 years!

To shed some more light on exactly what the hell is going on, I've made the following discoveries, each of which just makes it more mysterious.

  • aUabc,/a(\k)abc/ matches but aUabc,/a(\k)abcd/ does not, so it clearly takes into account what follows the (\k).
  • aUabc,/a(\k)gabc/ matches but /a(\k)gabcd/ does not. It seems to match SOME of the expression after the (\k)! It would seem that it matches one or more characters from the end of the expression after (\k).
  • It gets even more unusual! aUooabcoo,/a(\k)gabc/ matches too! This would imply that one or more characters from the end of the expression after (\k) can match at any position after the U in the input.
  • Now even more trippy. aUooabcoo,/a(\k)ga(*F)bc/ also matches. If the PCRE engine had ever evaluated the a(*F)bc, it would fail (since (*F) automatically causes a failure).
  • And what about this? aUabc,/a(\k)ab[zz]/ DOES match, but aUabc,/a(\k)ab[z]/ DOESN'T!


The last bullet point in particular I just cannot explain. If anyone has any other observations or comments, please post.

I've tried this in PHP, which uses PCRE, but doesn't exhibit the same behaviour and just matches a 'k' as you'd expect.

I did wonder if this may be something to do with the pre-processing that mIRC must do before things are passed to the PCRE engine, for example stripping the colour codes with /S and the extra \co \cb escapes, which would mean it was a mIRC bug.

Last edited by Chessnut; 05/04/11 11:57 PM.
Chessnut #231168 06/04/11 10:19 PM
Joined: Mar 2011
Posts: 23
S
Ameglian cow
Offline
I think you've found an anomaly in PCRE itself. I confirmed with a variety of other software running under Windows (including pcretest and PHP) that \k by itself is matching capital U. I've reported it and I'll let you know where it goes.

Sherip #231186 07/04/11 01:55 PM
Joined: May 2007
Posts: 37
C
Ameglian cow
OP Offline
That's strange - when I tested it in PHP it just seemed to match a 'k' as you'd expect. Any idea what PCRE versions the things you tested it on were at?

And yes, please keep me informed with any response you get!

Cheers

Chessnut #231191 07/04/11 08:06 PM
Joined: Mar 2011
Posts: 23
S
Ameglian cow
Offline
Most of the software I tested with used the latest or a fairly recent PCRE release. PHP version 5.3.6 has PCRE 8.11 aka 8.12 in it.

Just tested in PHP 5.2.5 which had PCRE 7.3 in it. In that version, \k by itself seems to match the empty string at the end of a subject!

Also just tried Perl. Perl 5.10 gives an error "Sequence \k... not terminated in regex". A much older version (5.8.8 from 2006) matches 'k'

Here's the reply to my inquiry:

On Wed, 6 Apr 2011, Sheri wrote:
> Does this make any sense? By itself \k is matching a capital U

In principle, what it should do is what Perl does (unless Perl does something stupid, and matching U seems to me to be stupid). I have noted the issue and will look at it in due course. If all goes well, I am expecting to start a round of work on PCRE after Easter. There are a number of issues to deal with, so it will be a while before the next release.

Philip
--
Philip Hazel


Sherip #233330 04/08/11 03:39 PM
Joined: Mar 2011
Posts: 23
S
Ameglian cow
Offline
Not sure how soon testing will be complete but the first Release Candidate for PCRE version 8.13 is out and the issue is fixed. If \k is used incorrectly in a pattern, the pattern compiler now gives the error "PCRE compilation failed at offset X: \k is not followed by a braced, angle-bracketed, or quoted name"
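
As a rough illustration of this "reject at compile time" behaviour, here is a sketch using Python's standard re module as a stand-in. It is not PCRE 8.13, and the error text differs from the message quoted above. Note that Python spells a named backreference as (?P=name) rather than \k<name>. The helper compile_or_report is hypothetical.

```python
import re

def compile_or_report(pattern):
    """Compile a pattern and report whether the engine accepted it."""
    try:
        re.compile(pattern)
        return "ok"
    except re.error as exc:
        return "compilation failed: %s" % exc

# A bare \k is refused up front instead of silently matching something odd.
print(compile_or_report(r"\k"))

# A properly formed named backreference (Python syntax) compiles fine.
print(compile_or_report(r"(?P<word>ab)(?P=word)"))
```

Failing at compile time means a malformed \k can no longer produce a surprising match at run time.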

Sherip #233482 18/08/11 11:27 AM
Joined: Dec 2002
Posts: 5,493
Hoopy frood
Offline
Thanks for the update, I'll make sure mIRC uses the latest PCRE release for the next version.

Khaled #233509 20/08/11 06:59 PM
Joined: Mar 2011
Posts: 23
S
Ameglian cow
Offline
Originally Posted By: Khaled
Thanks for the update, I'll make sure mIRC uses the latest PCRE release for the next version.


Hi Khaled, it was released but you might want to hold off for just a bit. Seems something got broken with 8.13 and another update is expected soon. See here: http://bugs.exim.org/show_bug.cgi?id=1136

Sherip #234544 02/11/11 06:33 PM
Joined: Mar 2011
Posts: 23
S
Ameglian cow
Offline
FYI, PCRE 8.20 was released on Oct 21 and is stable. It has a new optional feature for JIT-compiling selected regex patterns that will be matched repeatedly, at the expense of added lead time.

Sherip #234555 03/11/11 08:15 AM
Joined: Dec 2002
Posts: 5,493
Hoopy frood
Offline
Thanks, the new version of PCRE will be included in the next release.

