Tomao (OP)
Hoopy frood
Joined: Jul 2007
Posts: 1,129
I was wondering if you guys could shed some light on using regex to capture a sentence like the one below. I've tried the multi-line switch, line breaks and all that, but unfortunately they don't work in mIRC for some reason:
Code:
     <title>
This is the sentence I want to capture.
  </title>
The tabs are also present in the input.

Hoopy frood
Joined: Oct 2004
Posts: 8,330
Not sure about regex, but I just set a variable when the first line is seen, then capture the second line only while the variable is set, and unset it when done.
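
The flag approach described above can be sketched roughly as follows. This is a Python illustration rather than mIRC script, and the function name `extract_title` and the sample list are made up for the example; it assumes the input arrives one line at a time, with the tags on their own lines.

```python
def extract_title(lines):
    """Return the first non-blank line that follows a <title> line, or None."""
    in_title = False  # the "variable" from the post: set once <title> is seen
    for raw in lines:
        line = raw.strip()  # drop tabs, spaces and any trailing CR/LF
        if not in_title:
            if line.lower() == "<title>":
                in_title = True
            continue
        if line.lower() == "</title>":
            return None  # closing tag reached without any text
        if line:
            return line  # returning here plays the role of "unset when done"
    return None


# Hypothetical sample input, modelled on the snippet in the first post.
sample = ["     <title>", "This is the sentence I want to capture.", "  </title>"]
print(extract_title(sample))
```

No regex is involved, so the linebreak and dot settings discussed elsewhere in the thread do not come into play.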


Invision Support
#Invision on irc.irchighway.net
Sherip
Ameglian cow
Joined: Mar 2011
Posts: 23
Try this pattern:

(*ANYCRLF)<title>\s*\K.+?(?=\s*<\/title>)

I'm no expert in mIRC's regex functions; can you retrieve the whole match? Normally that is preferable. If the result must be in a captured substring so that a mIRC identifier can retrieve it, you could use:

(*ANYCRLF)<title>\s*\K(.+?)(?=\s*<\/title>)

and the sentence should be in $regml(1). As written, the pattern should also work when the tags and content are on the same line.

No need here for the multiline option. The only time you need the multiline option is when you want to use ^ to match at the beginnings of lines and/or use $ to match at the ends of lines. What constitutes a "line" and what matches dot depends on the linebreak option. Since mIRC uses LF by default for its linebreak option, I would suggest including (*ANYCRLF) at the start of patterns that will work with multi-line subjects. That way carriage returns won't be treated like normal characters.
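
As a rough cross-check of the point about the multiline option, here is a small sketch in Python's `re` module. Note the assumptions: Python's `re` is not the PCRE engine behind mIRC's $regex(), it has no \K and no (*ANYCRLF), so the sketch uses a plain capture group instead, and the variable names are made up for the example.

```python
import re

# Same shape as the test data in this thread: tags and text on separate CRLF lines.
subject = "    <title>\r\nThis is the sentence.\r\n  </title>"

# MULTILINE only changes where ^ and $ can match.
starts_plain = len(re.findall(r"^", subject))                # start of subject only
starts_multi = len(re.findall(r"^", subject, re.MULTILINE))  # start of every line
print(starts_plain, starts_multi)

# The title pattern contains no ^ or $, so MULTILINE cannot change its result.
pattern = r"<title>\s*(.+?)(?=\s*</title>)"
plain = re.search(pattern, subject).group(1)
multi = re.search(pattern, subject, re.MULTILINE).group(1)
print(repr(plain), plain == multi)
```

The capture group stands in for \K here: everything before the group is still consumed by the match, but only the group is read back.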

Tomao (OP)
Hoopy frood
Joined: Jul 2007
Posts: 1,129
I appreciate the responses from both of you!

Sherip, I applied your regex pattern to my script using mIRC's $regex() identifier and returned the match via $regml(1), but it still doesn't give me the desired result. It's the same as before. I suppose I may need to, as you've suggested, retrieve the whole thing and then capture the sentence I'm after.

Sherip
Ameglian cow
Joined: Mar 2011
Posts: 23
Can you demonstrate the failure? This alias seems to work fine. The i makes it case insensitive, in case your tags are actually in upper case.

testaregex {
  var %testdata = $str($chr(32),4) $+ <title> $+ $crlf
  %testdata = %testdata $+ This is the sentence. $+ $crlf
  %testdata = %testdata $+ $str($chr(32),2) $+ </title>
  var %pat = /(*ANYCRLF)<title>\s*\K(.+?)(?=\s*<\/title>)/i
  echo count= $+ $regex(%testdata, %pat) $+ ;string= $+ $regml(1)
}

jaytea
Fjord artisan
Joined: Feb 2006
Posts: 546
Originally Posted By: Sherip

I'm no expert in mirc's regex functions, can you retrieve the whole match?


you mean the portion of the string consumed by the engine in a successful match, ie. the first two integers of the vector returned by PCRE? if so, then unfortunately it isn't possible. for a static expression, such as those that we are dealing with here, we can easily modify the expression to get the result we want as you did, by using a capturing group and $regml(). for a dynamic expression, it's a bit trickier, and i've had cases where an extra $regml() property, or $regml(-1) support, or new $regsubex() marker, would have been very convenient indeed!

Originally Posted By: Sherip
I would suggest to include (*ANYCRLF) at the start of patterns that will work with multi-line subjects. That way carriage returns won't be treated like normal characters.


the effect of using (*ANYCRLF) here, having CR treated as a new line character as well as the default LF, is to prevent the '.' in '.+?' from being able to match a lone CR. so, whereas "<title>text \r here</title>" would normally be matched and 'text \r here' captured as $regml(1), with (*ANYCRLF) this doesn't happen.

so all (*ANYCRLF) is serving to do in this case is restrict the set of strings which are matched by the rest of the expression. this is presumably not what Tomao wants; he wants the opposite, to broaden this set to include titles that may span multiple lines. for that we need to instruct PCRE to allow '.' to match a new line character by enabling the DOTALL option via the 's' modifier:

Code:
(?s)<title>\s*\K.+?(?=\s*<\/title>)


Tomao, this assumes you're sending the entire series of lines separated by CRLF or LF to $regex(). it isn't especially clear to me that this is what you're doing, though i suppose it is the most sensible interpretation of your problem
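
the effect of DOTALL can be checked with a short sketch outside mIRC. this one uses Python's `re` module, which is an assumption of convenience: it is not PCRE and has no \K, so a capture group is used instead, and the two-line title in `subject` is invented for the example.

```python
import re

# A title whose text itself spans two CRLF-separated lines.
subject = "<title>\r\nThis is the sentence\r\nI want to capture.\r\n</title>"
pattern = r"<title>\s*(.+?)(?=\s*</title>)"

# Without DOTALL, '.' cannot cross the \n inside the title, so nothing matches.
without_s = re.search(pattern, subject)

# With DOTALL (the 's' modifier), '.' matches newlines and the title is captured.
with_s = re.search(pattern, subject, re.DOTALL)

print(without_s)
print(repr(with_s.group(1)))
```

the surrounding \s* and the lookahead still trim the line breaks next to the tags; only the interior line break ends up in the captured text.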


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
jaytea
Fjord artisan
Joined: Feb 2006
Posts: 546
Originally Posted By: Sherip
Can you demonstrate the failure? This alias seems to work fine. The i makes it case insensitive, in case your tags are actually in upper case.

testaregex {
  var %testdata = $str($chr(32),4) $+ <title> $+ $crlf
  %testdata = %testdata $+ This is the sentence. $+ $crlf
  %testdata = %testdata $+ $str($chr(32),2) $+ </title>
  var %pat = /(*ANYCRLF)<title>\s*\K(.+?)(?=\s*<\/title>)/i
  echo count= $+ $regex(%testdata, %pat) $+ ;string= $+ $regml(1)
}


don't forget \s matches \r and \n :P \s* is matching the characters shown inside the parentheses:

Quote:

<title>(\r\n)This is the sentence.(\r\n )</title>


the problem therefore arises when 'This is the sentence.' contains a line break
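
a quick way to see which characters each \s* swallows is to capture them. this is a sketch in Python's `re` module rather than mIRC (an assumption of convenience, since `re` is not PCRE); the three groups are added purely for inspection.

```python
import re

subject = "<title>\r\nThis is the sentence.\r\n </title>"

# Group 1 and group 3 expose what the two \s* parts consume.
m = re.search(r"<title>(\s*)(.+?)(\s*)</title>", subject)

leading, sentence, trailing = m.group(1), m.group(2), m.group(3)
print(repr(leading), repr(sentence), repr(trailing))
```

the CR and LF on either side of the sentence land in the \s* groups, which is why the single-line test succeeds without any newline-related option.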


Sherip
Ameglian cow
Joined: Mar 2011
Posts: 23
Well, the \r, \n and other \s characters that precede and follow the "sentence" are being excluded from the whole match. If "This is the sentence." has interior line breaks, the only change necessary is to enable dotall by including an "s" in the regex options.

(*ANYCRLF) is not strictly required for this particular pattern; the same result is achieved without it. I personally would always include it when working with multiple lines, as that is the default I prefer. I very often use features where it does matter, e.g., using \R to match linebreaks in a subject regardless of whether they are CRLFs, CRs or LFs.

Parentheses are drawn around the whole match in the pattern so that mIRC can retrieve it as substring 1. It strikes me as odd that mIRC doesn't allow substring zero to be specified to retrieve the whole match.

Last edited by Sherip; 11/04/11 04:01 PM.
Tomao (OP)
Hoopy frood
Joined: Jul 2007
Posts: 1,129
Originally Posted By: Sherip
The i makes it case insensitive, in case your tags are actually in upper case.
Yes, I did include the (?i) modifier. It didn't work as expected.

jaytea's suggested regex pattern works a treat!

Again, thanks to both of you for taking the time to solve my puzzle.

