Using regex to capture a phrase between html tags #231288 10/04/11 08:21 PM
Tomao (OP) | Hoopy frood | Joined: Jul 2007 | Posts: 1,129 | United States

I was wondering if you guys could shed some light on using regex to capture a sentence laid out like the one below. I've tried the multi-line switch, matching the line breaks and all that, but unfortunately none of it works in mIRC for some reason:

Code:
<title>
	This is the sentence I want to capture.
</title>

The tabs are also part of the text.

Re: Using regex to capture a phrase between html tags Tomao #231290 11/04/11 12:16 AM
Riamus2 | Hoopy frood | Joined: Oct 2004 | Posts: 8,330 | MA, USA

Not sure about regex, but I just set a variable when the first line is seen, then capture the second line only while the variable is set, and unset it when done.

Invision Support #Invision on irc.irchighway.net
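
A minimal sketch of the flag-variable approach described above, written in Python rather than mIRC script so it can be run anywhere. The function name capture_title and the sample lines are hypothetical, and the sketch assumes each tag sits on its own line, as in Tomao's example.

```python
def capture_title(lines):
    """Return the text found between a <title> line and a </title> line."""
    inside = False   # plays the role of the variable that is set and unset
    captured = []
    for line in lines:
        stripped = line.strip()            # drop tabs, spaces and line endings
        if stripped.lower() == "<title>":
            inside = True                  # opening tag seen: start capturing
        elif stripped.lower() == "</title>":
            inside = False                 # closing tag seen: stop capturing
        elif inside and stripped:
            captured.append(stripped)      # a content line between the tags
    return " ".join(captured)


# Hypothetical input, one entry per line received
sample = ["\t<title>\r\n", "\tThis is the sentence I want to capture.\r\n", "\t</title>\r\n"]
print(capture_title(sample))
```

Because the lines are handled one at a time, no regex ever has to match across a line break.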

Re: Using regex to capture a phrase between html tags Tomao #231292 11/04/11 02:36 AM
Sherip | Ameglian cow | Joined: Mar 2011 | Posts: 23

Try this pattern: (*ANYCRLF)<title>\s*\K.+?(?=\s*<\/title>)

I'm no expert in mIRC's regex functions; can you retrieve the whole match? Normally that is preferable. If it must be in a substring for retrieval with a mIRC identifier, you could use: (*ANYCRLF)<title>\s*\K(.+?)(?=\s*<\/title>) and the sentence should be in $regml(1).

As written, the pattern should also work when the tags and content are on the same line. No need here for the multiline option. The only time you need the multiline option is when you want to use ^ to match at the beginnings of lines and/or $ to match at the ends of lines. What constitutes a "line", and what the dot matches, depends on the linebreak option. Since mIRC uses LF by default for its linebreak option, I would suggest including (*ANYCRLF) at the start of patterns that will work with multi-line subjects. That way carriage returns won't be treated like normal characters.
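
To illustrate the point about the multiline option, here is a rough Python sketch. Python's re module is not PCRE (it has no (*ANYCRLF) or \K), so this only shows the general behaviour: the multiline flag changes where ^ can match, not what the dot matches. The sample text is made up.

```python
import re

TEXT = "alpha\nbeta"

# Without MULTILINE, ^ matches only at the very start of the subject.
starts_default = re.findall(r"^\w+", TEXT)

# With MULTILINE, ^ also matches immediately after each line feed.
starts_multiline = re.findall(r"^\w+", TEXT, re.MULTILINE)

# MULTILINE does not change the dot: it still refuses to match the line feed,
# so this search finds nothing.
dot_multiline = re.search(r"alpha.beta", TEXT, re.MULTILINE)

print(starts_default, starts_multiline, dot_multiline)
```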

Re: Using regex to capture a phrase between html tags Sherip #231296 11/04/11 05:31 AM
Tomao (OP) | Hoopy frood | Joined: Jul 2007 | Posts: 1,129 | United States

Thank you both for the responses! Sherip, I applied your regex pattern to my script using mIRC's $regex() identifier and returned the match via $regml(1), but it still doesn't give me the desired result. It's the same as before. I suppose I may need to, as you've suggested, retrieve the whole thing and then capture the sentence I'm after.

Re: Using regex to capture a phrase between html tags Tomao #231298 11/04/11 10:13 AM
Sherip | Ameglian cow | Joined: Mar 2011 | Posts: 23

Can you demonstrate the failure? This alias seems to work fine. The i makes it case insensitive, in case your tags are actually in upper case.

Code:
testaregex {
  var %testdata = $str($chr(32),4) $+ <title> $+ $crlf
  %testdata = %testdata $+ This is the sentence. $+ $crlf
  %testdata = %testdata $+ $str($chr(32),2) $+ </title>
  var %pat = /(*ANYCRLF)<title>\s*\K(.+?)(?=\s*<\/title>)/i
  echo count= $+ $regex(%testdata, %pat) $+ ;string= $+ $regml(1)
}

Re: Using regex to capture a phrase between html tags Sherip #231299 11/04/11 10:51 AM
jaytea | Fjord artisan | Joined: Feb 2006 | Posts: 546

Originally Posted By: Sherip
I'm no expert in mirc's regex functions, can you retrieve the whole match?

you mean the portion of the string consumed by the engine in a successful match, ie. the first two integers of the vector returned by PCRE? if so, then unfortunately it isn't possible. for a static expression, such as those that we are dealing with here, we can easily modify the expression to get the result we want, as you did, by using a capturing group and $regml(). for a dynamic expression it's a bit trickier, and i've had cases where an extra $regml() property, or $regml(-1) support, or a new $regsubex() marker, would have been very convenient indeed!

Originally Posted By: Sherip
I would suggest to include (*ANYCRLF) at the start of patterns that will work with multi-line subjects. That way carriage returns won't be treated like normal characters.

the effect of using (*ANYCRLF) here, having CR treated as a new line character as well as the default LF, is to prevent the '.' in '.+?' from being able to match a lone CR. so, whereas "<title>text \r here</title>" would normally be matched and 'text \r here' captured as $regml(1), with (*ANYCRLF) this doesn't happen. all (*ANYCRLF) is serving to do in this case is restrict the set of strings which are matched by the rest of the expression. this is presumably not what Tomao wants; he wants quite the opposite, to broaden this set to include titles that may span multiple lines. for that we instruct PCRE to allow '.' to match a new line character by enabling the DOTALL option via the 's' modifier:

Code:
(?s)<title>\s*\K.+?(?=\s*<\/title>)

Tomao, this assumes you're sending the entire series of lines, separated by CRLF or LF, to $regex(). it isn't especially clear to me that this is what you're doing, though i suppose it is the most sensible interpretation of your problem.

"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
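
A rough Python analogue of the DOTALL point above, for anyone who wants to try it outside mIRC. Python's re has no \K, so a capturing group stands in for the \K plus lookahead construction; the function name title_text and both sample strings are hypothetical.

```python
import re


def title_text(html, dotall):
    """Return the text between <title> and </title>, or None if nothing matches."""
    flags = re.IGNORECASE | (re.DOTALL if dotall else 0)
    # \s* soaks up the line breaks and tabs next to the tags; (.+?) is the sentence.
    match = re.search(r"<title>\s*(.+?)\s*</title>", html, flags)
    return match.group(1) if match else None


ONE_LINE = "<title>\r\n\tThis is the sentence.\r\n</title>"
TWO_LINES = "<title>\r\n\tFirst part\r\nsecond part\r\n</title>"

print(title_text(ONE_LINE, dotall=False))    # sentence on a single line
print(title_text(TWO_LINES, dotall=False))   # interior line break, dot cannot cross it
print(title_text(TWO_LINES, dotall=True))    # DOTALL lets the dot cross the break
```

The line breaks next to the tags are never a problem because \s matches them; only a break inside the sentence needs the dot to match a new line character.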

Re: Using regex to capture a phrase between html tags Sherip #231300 11/04/11 10:56 AM
jaytea | Fjord artisan | Joined: Feb 2006 | Posts: 546

Originally Posted By: Sherip
Can you demonstrate the failure? This alias seems to work fine. The i makes it case insensitive, in case your tags are actually in upper case.

Code:
testaregex {
  var %testdata = $str($chr(32),4) $+ <title> $+ $crlf
  %testdata = %testdata $+ This is the sentence. $+ $crlf
  %testdata = %testdata $+ $str($chr(32),2) $+ </title>
  var %pat = /(*ANYCRLF)<title>\s*\K(.+?)(?=\s*<\/title>)/i
  echo count= $+ $regex(%testdata, %pat) $+ ;string= $+ $regml(1)
}

don't forget \s matches \r and \n :P

\s is matching the characters inside the parentheses here:

Quote:
<title>(\r\n)This is the sentence.(\r\n  )</title>

the problem therefore arises when 'This is the sentence.' itself contains a line break.

Re: Using regex to capture a phrase between html tags jaytea #231302 11/04/11 03:59 PM
Sherip | Ameglian cow | Joined: Mar 2011 | Posts: 23

Well, the \r, \n and other \s characters that precede and follow the "sentence" are being excluded from the whole match. If "This is a sentence." has interior line breaks, the only change necessary is to enable dotall by including an "s" in the regex options.

(*ANYCRLF) is not strictly required for this particular pattern; the same result is achieved without it. I personally would always include it when working with multiple lines, as that is the default I prefer. I very often use features where it does matter, e.g. using \R to match linebreaks in a subject regardless of whether they are CRLFs, CRs or LFs.

Parentheses are drawn around the whole match in the pattern so that mIRC can retrieve it as substring 1. It strikes me as odd that mIRC doesn't allow substring zero to be specified to retrieve the whole match.

Last edited by Sherip; 11/04/11 04:01 PM.

Re: Using regex to capture a phrase between html tags Sherip #231305 11/04/11 05:47 PM
Tomao (OP) | Hoopy frood | Joined: Jul 2007 | Posts: 1,129 | United States

Originally Posted By: Sherip
The i makes it case insensitive, in case your tags are actually in upper case.

Yes, I did include the (?i) modifier. It didn't work as expected. jaytea's suggested regex pattern works a treat! Again, thanks to both of you for taking the time to solve my puzzle.
