Using regex to capture a phrase between html tags

I was wondering if you guys could shed some light on using regex to capture the sentence shown below. I've tried the multi-line switch, matching the line breaks and all that... but unfortunately they won't work with mIRC for some reason:

Code:

     <title>
This is the sentence I want to capture.
  </title>

the tabs also exist.

Not sure about regex, but I just set a variable when the first line is seen, then only capture the second line when the variable is set and unset when done.
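
For anyone who wants to see the flag idea spelled out, here is a minimal sketch in Python rather than mIRC script, only because it is easy to run anywhere. The function name capture_title and the sample list are made up for illustration, and it assumes the opening and closing tags each sit on their own line, as in the original post.

```python
def capture_title(lines):
    """Return the first non-blank line seen between <title> and </title>."""
    inside = False  # plays the role of the variable that is set and unset
    for raw in lines:
        line = raw.strip()  # drop leading tabs/spaces and the trailing CR/LF
        if line.lower() == "<title>":
            inside = True   # first line seen: set the flag
        elif line.lower() == "</title>":
            inside = False  # done: unset the flag
        elif inside and line:
            return line     # only capture while the flag is set
    return None


# Hypothetical input: the three lines from the original post, CRLF-terminated.
sample = [
    "     <title>\r\n",
    "This is the sentence I want to capture.\r\n",
    "  </title>\r\n",
]
print(capture_title(sample))
```

This avoids regex entirely, but it would miss a title whose tags and text share one line.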

Try this pattern:

(*ANYCRLF)<title>\s*\K.+?(?=\s*<\/title>)

I'm no expert in mirc's regex functions, can you retrieve the whole match? Normally that is preferable. If it must be in a substring for retrieval with a mirc identifier, you could use:

(*ANYCRLF)<title>\s*\K(.+?)(?=\s*<\/title>)

and the sentence should be in $regml(1). As written, the pattern should also work when the tags and content are on the same line.

No need here for the multiline option. The only time you need the multiline option is when you want to use ^ to match at the beginnings of lines and/or use $ to match at ends of lines. What constitutes a "line" and what matches dot depends on the linebreak option. Since mirc is using LF by default for its linebreak option, I would suggest to include (*ANYCRLF) at the start of patterns that will work with multi-line subjects. That way carriage returns won't be treated like normal characters.
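
A small sketch of that distinction, written with Python's re module as a stand-in for PCRE (an assumption made for convenience; Python has no (*ANYCRLF) or \K, but its MULTILINE and DOTALL flags behave like the m and s options for this purpose). The subject string uses LF-only line breaks to keep the comparison simple.

```python
import re

subject = "    <title>\nThis is the sentence.\n  </title>"

# The multiline option only changes where ^ and $ are allowed to match.
caret_plain = re.search(r"^This", subject)        # ^ matches only at the subject start
caret_multi = re.search(r"^This", subject, re.M)  # ^ also matches after each line break

# The multiline option does nothing for dot; dotall is what lets dot cross a line break.
dot_plain = re.search(r"<title>.+</title>", subject)
dot_multi = re.search(r"<title>.+</title>", subject, re.M)
dot_all = re.search(r"<title>.+</title>", subject, re.S)

print(caret_plain is None, caret_multi is not None)
print(dot_plain is None, dot_multi is None, dot_all is not None)
```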

I appreciate the responses from you two!

Sherip, I applied your regex pattern in my script using mIRC's $regex() identifier and returned the match via $regml(1), but it still doesn't give me the desired result. It behaves the same as before. I suppose I may need to, as you've suggested, retrieve the whole thing and then capture the sentence I'm after.

Can you demonstrate the failure? This alias seems to work fine. The i makes it case insensitive, in case your tags are actually in upper case.

testaregex {
var %testdata = $str($chr(32),4) $+ <title> $+ $crlf
%testdata = %testdata $+ This is the sentence. $+ $crlf
%testdata = %testdata $+ $str($chr(32),2) $+ </title>
var %pat = /(*ANYCRLF)<title>\s*\K(.+?)(?=\s*<\/title>)/i
echo count= $+ $regex(%testdata, %pat) $+ ;string= $+ $regml(1)
}

Originally Posted By: Sherip

I'm no expert in mirc's regex functions, can you retrieve the whole match?

you mean the portion of the string consumed by the engine in a successful match, ie. the first two integers of the vector returned by PCRE? if so, then unfortunately it isn't possible. for a static expression, such as those that we are dealing with here, we can easily modify the expression to get the result we want as you did, by using a capturing group and $regml(). for a dynamic expression, it's a bit trickier, and i've had cases where an extra $regml() property, or $regml(-1) support, or new $regsubex() marker, would have been very convenient indeed!

Originally Posted By: Sherip

I would suggest to include (*ANYCRLF) at the start of patterns that will work with multi-line subjects. That way carriage returns won't be treated like normal characters.

the effect of using (*ANYCRLF) here, having CR treated as a new line character as well as the default LF, is to prevent the '.' in '.+?' from being able to match a lone CR. so, whereas "<title>text \r here</title>" would normally be matched and 'text \r here' captured as $regml(1), with (*ANYCRLF) this doesn't happen.

so all (*ANYCRLF) is serving to do in this case is restrict the set of strings which are matched by the rest of the expression. this is presumably not what Tomao wants; he wants to broaden this set to include titles that may span multiple lines. for that we need the opposite effect: instruct PCRE to allow '.' to match a new line character by enabling the DOTALL option via the 's' modifier:

Code:

(?s)<title>\s*\K.+?(?=\s*<\/title>)

Tomao, this assumes you're sending the entire series of lines separated by CRLF or LF to $regex(). it isn't especially clear to me that this is what you're doing, though i suppose it is the most sensible interpretation of your problem
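
A rough way to check the dotall behaviour outside mIRC is sketched below with Python's re module. This is an approximation of the pattern above, not the same engine: Python has no \K, so a capturing group takes its place, and the names TITLE_DOTALL, TITLE_PLAIN and title_text are invented for the example.

```python
import re

# Capturing group instead of \K, because Python's re does not support \K.
TITLE_DOTALL = re.compile(r"<title>\s*(.+?)(?=\s*</title>)", re.S | re.I)
TITLE_PLAIN = re.compile(r"<title>\s*(.+?)(?=\s*</title>)", re.I)


def title_text(html, pattern=TITLE_DOTALL):
    """Return the text between the title tags, or None if nothing matches."""
    m = pattern.search(html)
    return m.group(1) if m else None


# Hypothetical subjects: the whole block is passed in as one CRLF-separated string.
single = "    <title>\r\nThis is the sentence.\r\n  </title>"
spanning = "<title>\r\nFirst line\r\nsecond line\r\n</title>"

print(repr(title_text(single)))
print(repr(title_text(spanning)))
print(repr(title_text(spanning, TITLE_PLAIN)))
```

With dotall the title that spans two lines is captured whole; without it the same subject does not match, since '.' stops at the interior LF.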

Originally Posted By: Sherip

Can you demonstrate the failure? This alias seems to work fine. The i makes it case insensitive, in case your tags are actually in upper case.

testaregex {
var %testdata = $str($chr(32),4) $+ <title> $+ $crlf
%testdata = %testdata $+ This is the sentence. $+ $crlf
%testdata = %testdata $+ $str($chr(32),2) $+ </title>
var %pat = /(*ANYCRLF)<title>\s*\K(.+?)(?=\s*<\/title>)/i
echo count= $+ $regex(%testdata, %pat) $+ ;string= $+ $regml(1)
}

don't forget \s matches \r and \n :P \s* is matching the characters shown inside the parentheses:

Quote:

<title>(\r\n)This is the sentence.(\r\n )</title>

the problem therefore arises when 'This is the sentence.' contains a line break

Well, the \r, \n and other \s characters that precede and follow the "sentence" are being excluded from the whole match. If "This is the sentence." has interior line breaks, the only change necessary is to enable dotall by including an "s" in the regex options.

(*ANYCRLF) is not strictly required for this particular pattern; the same result is achieved without it. I personally would always include it when working with multiple lines, as that is the default I prefer. I very often use features where it does matter, e.g., using \R to match linebreaks in a subject regardless of whether they are CRLFs, CRs or LFs.

Parentheses are drawn around the whole match in the pattern so that mirc can retrieve it as substring 1. It strikes me as odd that mirc doesn't allow substring zero to be specified to retrieve the whole match.

Originally Posted By: Sherip

The i makes it case insensitive, in case your tags are actually in upper case.

Yes, I did include the (?i) modifier. It didn't work as expected.

jaytea's suggested regex pattern works a treat!

Again, thanks to both of you for taking the time to solve my puzzle.