Posted By: Horstl Regex question: \b\Qxx\E\b - 14/05/08 04:35 PM
Hi,
Now and then I use quoting (\QTheLiteralText\E) in regex, for example to /filter case-sensitively, and/or to /filter -g multiple expressions like /(?:\QexpressionA\E)|(?:\QexpressionB\E)|(?:\QexpressionC\E)/ (schematic only).

Using \Q\E has been handy, as I don't need to escape metachars separately, especially if the regex match text isn't static (user input, part of $fulladdresses etc.).
Now, this doesn't work if the escaped text (or what I expect to be escaped) contains metachars AND word boundaries come into play.

No word boundary in regex; all true:
Code:
//var %txt = test | echo -a $regex(%txt,/\Qtest\E/) 
//var %txt = xx test xx | echo -a $regex(%txt,/\Qtest\E/) 

//var %txt = \d^ | echo -a $regex(%txt,/\Q\d^\E/) 
//var %txt = xx [a] xx | echo -a $regex(%txt,/\Q[a]\E/)

Still true, word boundary in regex:
Code:
//var %txt = test | echo -a $regex(%txt,/\b\Qtest\E\b/) 
//var %txt = xx test xx | echo -a $regex(%txt,/\b\Qtest\E\b/)

FALSE, word boundary in regex:
Code:
//var %txt = \d^ | echo -a $regex(%txt,/\b\Q\d^\E\b/) 
//var %txt = xx [a] xx | echo -a $regex(%txt,/\b\Q[a]\E\b/)

....even more irritating: this one IS true:
Code:
//var %txt = xx [a] xx | echo -a $regex(%txt,/\B\Q[a]\E\B/)

I'd really appreciate an explanation of this behaviour (it looks like I misunderstand the way \Q\E works) and, furthermore, a workaround.
I'd like to keep using \Q\E instead of an ugly $replacex(string of all the possible metachars to escape).
Thanks!
Posted By: qwerty Re: Regex question: \b\Qxx\E\b - 14/05/08 05:04 PM
This is not related to \Q\E, but to the way \b works. \b matches the position between a \w character and a \W character, in either order (or between a \w character and the beginning/end of the string). So /\b@test\b/ would never match "hello @test hi" because the first \b is between two \W characters: the space and @. In contrast, /@\btest\b/ would match the previous string, because the first \b is between a \W (@) and a \w (t). Whether @test is inside \Q\E or not doesn't matter at all.
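
A quick way to see this, as a minimal sketch (the expected results are simply what the definition of \b above implies):
Code:
//echo -ag $regex(hello @test hi,/\b@test\b/)
//echo -ag $regex(hello @test hi,/@\btest\b/)

The first line should echo 0 and the second 1. The same reasoning explains the \B result in the first post: \B matches wherever \b does not, and the positions on either side of [a] in "xx [a] xx" each sit between two \W characters (a space and a bracket).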

What you really mean to match is not a position between \w and \W but between \s and \S. There is no shorthand assertion for this sort of match; you have to use the usual, lengthy assertion construct:

(?<=^|\s)STRINGHERE(?=$|\s)

Examples:

//echo -ag $regex(hello @test hi,/(?<=^|\s)@test(?=$|\s)/)

//echo -ag $regex(hello @testhi,/(?<=^|\s)@test(?=$|\s)/)

//echo -ag $regex(hello@test hi,/(?<=^|\s)@test(?=$|\s)/)
Posted By: Horstl Re: Regex question: \b\Qxx\E\b - 14/05/08 06:11 PM
Thanks, I just didn't realize that my "metachars" are non-word characters themselves, so there is no word boundary next to them.

Now, to get the "boundary effect" in some circumstances nonetheless (e.g. text bordering punctuation, brackets...), I'm tempted to use: /(?<=^|\W)\Qmy expression\E(?=$|\W)/
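
For instance, a rough sketch with a made-up test string in which [a] is followed by a full stop rather than a space:
Code:
//var %txt = see [a]. ok | echo -a $regex(%txt,/(?<=^|\s)\Q[a]\E(?=$|\s)/)
//var %txt = see [a]. ok | echo -a $regex(%txt,/(?<=^|\W)\Q[a]\E(?=$|\W)/)

The \s version should return 0 here and the \W version 1.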

That aside: you may have noticed that I used (?:...) to set "non-capturing" patterns. You're using (?<=...) and (?=...). I never figured out how to use these atomic things properly. Could you please explain why you used the one at the beginning and the other at the end?
Posted By: qwerty Re: Regex question: \b\Qxx\E\b - 14/05/08 07:11 PM
(?<=) and (?=) are called assertions ("atomic" elements are an entirely different matter) and they are essentially the same sort of thing as \b, in that they do not consume characters when matching: meaning that another part (earlier or later) of the pattern can match the same characters that were matched by the assertions.

Another way to look at assertions is to treat them as positions, i.e. assertions match a position in the input that is preceded or followed by whatever is inside the assertion. Sort of like the cursor in an editbox: a cursor can be put anywhere in the string, but it's always between characters. In contrast, "normal" regex patterns can be thought of as a selected area of text in an editbox. A selection always covers one or more characters.
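
The cursor/selection difference can be made visible with $regsubex (a minimal sketch; the input abc and the replacement - are arbitrary choices for illustration):
Code:
//echo -ag $regsubex(abc,/(?=b)/,-)
//echo -ag $regsubex(abc,/b/,-)

The first should echo a-bc: the assertion matches the position in front of b, so the - is inserted there and nothing is removed. The second should echo a-c, because the b itself is matched and replaced.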

Assertions can be a little hard to get at first (although they are straightforward in reality) and this is not the place for a full explanation, examples etc. This tutorial should be helpful.

The reason I used assertions instead of (?:) will become apparent if your expressions use the /g modifier. In such cases, the same part of the input can match an assertion in more than one /g round, for example once in a lookahead, in one round, and a second time in a lookbehind, in the next round. Not sure how much of this makes sense now, but I hope it will once you read the tutorial and play with some examples yourself.
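
A small sketch of the difference, using a made-up input in which the matches are separated by a single space:
Code:
//echo -ag $regex(test test test,/(?:^|\s)test(?:$|\s)/g)
//echo -ag $regex(test test test,/(?<=^|\s)test(?=$|\s)/g)

The first should return 2: the space after the first test is consumed by that match, so it is no longer available as the leading \s of the second test, which gets skipped. The second should return 3, because the assertions only check the spaces without consuming them.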
Posted By: Horstl Re: Regex question: \b\Qxx\E\b - 14/05/08 09:31 PM
Now I have a picture of what assertions can do (at least a vague one). Your "cursor" analogy was particularly helpful. Thanks again!

© mIRC Discussion Forums