$htmlfree update needed #209543 17/02/09 01:38 AM
Riamus2 (OP):

Ok, I have this identifier to remove HTML. It has always worked fine until now.

Code:
    alias htmlfree {
      var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
      return %x
    }

I have the following source data (I cut it down to just the part that is causing the problem):

Code:
    <sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>

Now, that should result in just [a] being returned. Instead, it leaves: a]'>[a] ... most likely due to the []'s in there.

Can someone help to update the identifier so it doesn't miss that a]'> part? Note that I'll also accept any other good method to automatically remove that extra data. I know I could just $remove(%var,]'>) from it, but the "a" in there can be any letter and I don't really want to list out all letters in a $remove line... that's just not very efficient. Also note that there may be multiple footnotes on a line, so I can't just use $gettok to get the data.

Last edited by Riamus2; 17/02/09 01:42 AM.
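For anyone who wants to reproduce the symptom outside mIRC, here is a minimal sketch in Python (not mIRC script; the function name `htmlfree_naive` is made up for illustration). It assumes the same pattern as the alias above and shows that the first `>` inside the quoted `value='...'` attribute ends the match early, which is what leaves `a]'>` behind.

```python
import re

# Python translation of the pattern used by the htmlfree alias.
# The helper name is hypothetical; only the regex comes from the thread.
TAG_PATTERN = re.compile(r"^[^<]*>|<[^>]*>|<[^>]*$")


def htmlfree_naive(text: str) -> str:
    """Remove anything that looks like a tag, then turn &nbsp; into a space."""
    return TAG_PATTERN.sub("", text).replace("&nbsp;", " ")


SAMPLE = (
    "<sup class='footnote' value='[<a href=\"#fen-NIV-26127a\" "
    "title=\"See footnote a\">a</a>]'>"
    "[<a href=\"#fen-NIV-26127a\" title=\"See footnote a\">a</a>]</sup>"
)

# "<[^>]*>" stops at the first ">", which sits inside the value='...' attribute,
# so the rest of that attribute is treated as plain text.
print(htmlfree_naive(SAMPLE))  # a]'>[a]
```

The brackets themselves are not the culprit here; the leftover comes from the unescaped `<a ...>` tag nested inside the attribute value.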

Re: $htmlfree update needed Riamus2 #209548 17/02/09 12:41 PM
Lpfix5:

For now:

Code:
    alias htmlfree {
      var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
      return $remove(%x,$wildtok(%x,*]'>*,1,91))
    }

I'm not at home to test the regex query.

Re: $htmlfree update needed Lpfix5 #209551 17/02/09 01:25 PM
Riamus2 (OP):

That definitely fixes that problem, but it leaves &nbsp; and perhaps others now. It also seems to remove more than it should. If there's text before that, it also gets removed.

Quote:
    <p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>

That leaves just [a] and nothing else. I still need the text before that part. It's only the "a]'>" part that should get removed. It should show: 16Here is normal text.[a]

Previously, it did, but also included the a]'> part. Thanks for helping.

Last edited by Riamus2; 17/02/09 01:31 PM.

Re: $htmlfree update needed Riamus2 #209572 18/02/09 12:43 AM
Lpfix5:

Here is a more efficient regsub:

Code:
    var %x, %i = $regsub($1-,/(&nbsp;|&deg;|]'>|^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))

It also deals with the &nbsp; and &deg; entities you might come across. This will properly strip all HTML tags within your site dump. The reason you'll see an extra a is that, if you look at the source without all the HTML tags, that a is present.

Remove the $remove + $wildtok stuff I gave you earlier and just return %x alone.

Re: $htmlfree update needed Lpfix5 #209573 18/02/09 02:55 AM
Riamus2 (OP):

Well, that works except the other a. It shouldn't be there. It's part of the HTML just like a link would be.

Quote:
    <sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>

Just to look at that part. The red, you agree, shouldn't be there. The "a" that you said should be there, if you look closely, is part of the green value in the HTML the same way that 'footnote' is. value='a' is basically what you're seeing there. The a doesn't belong.

Re: $htmlfree update needed Riamus2 #209574 18/02/09 03:43 AM
Lpfix5:

That's not the a I see :P I'll mark all HTML tags in red; green = what should remain :P Actual HTML tags are delimited by < and >, therefore the a is not inside an HTML tag, it's OUT.

Quote:
    <p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>

Re: $htmlfree update needed Lpfix5 #209575 18/02/09 03:54 AM
Lpfix5:

PS: maybe if you see it this way you'll know what I mean. After running the script, echo the first 10 $regml captures of the strip:

    //echo -a $regml(1) $regml(2) $regml(3) $regml(4) $regml(5) $regml(6) $regml(7) $regml(8) $regml(9) $regml(10)

You'll see exactly how the data is cut from the entire string.

Re: $htmlfree update needed Riamus2 #209576 18/02/09 04:03 AM
genius_at_work:

I think it should be like this:

    <sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>

Red = HTML tags
Blue = HTML tag parameters
Green = Plain text

Which would appear as this on your browser screen: [a]

-genius_at_work

Re: $htmlfree update needed Lpfix5 #209585 18/02/09 12:48 PM
Riamus2 (OP):

It actually is inside the <>'s. Look carefully. There are 2 a's. One isn't between <>'s (the last one). The first one is between them, as I showed. Yes, it's not between the <a href>'s <>'s. However, that is nested INSIDE the <sup>'s brackets. If you look, there is a < before the sup, and the closing > is after that first a. Look at genius_at_work's comment. He sees what I'm saying. If you looked at the code in a browser, you'd see [a], not a[a].

Note that I didn't correctly write what you'd see if you removed the <a href> in my previous post. By removing just the <a href> part, you'd see value='[a]' instead of value='a'. You're removing everything from that except the a...

Taking out the nested <a href> in:

Quote:
    <sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>

Results in:

Quote:
    <sup class='footnote' value='[a]'>

Now, that's definitely all HTML.

Re: $htmlfree update needed Riamus2 #209606 19/02/09 03:43 AM
hixxy:

Is this HTML being returned by an outside source, or is it something that you can edit? I'm pretty sure that the part inside of the value='' attribute should be escaped by using HTML entities, like this:

    value='[&lt;a href="#fen-NIV-26127a" title="See footnote a"&gt;a&lt;/a&gt;]'

Re: $htmlfree update needed hixxy #209642 20/02/09 01:02 AM
Riamus2 (OP):

No, I'm getting the source with sockets, so I have no real control over it. Granted, I could edit it while receiving it, but I'd rather find a good "$htmlfree" method of just removing all HTML aspects from the line.

Also, I have a feeling that just using something like $replace(%text,a[a],[a],b[b],[b],c[c],[c],...) or using $remove on the original returned text, $remove(%text,a]'>,b]'>,c]'>,...), would be more efficient than editing every line and then running it through $htmlfree, even though that's not very efficient itself. I know I could just do that and it would be "easy" to do... but I'd rather have something efficient.

Re: $htmlfree update needed Riamus2 #209648 20/02/09 02:22 AM
genius_at_work:

Try this:

Code:
    alias htmlfree {
      var %x = $regsubex($1-,/^[^<]*(?:\'?[^\']*\')*[^<]*>|<[^>]*(?:\'[^\']*\')*[^<]*>|(<[^>]*(\'[^\']*\'?)*[^<]*$)/g,)
      return %x
    }

-genius_at_work

Re: $htmlfree update needed genius_at_work #209649 20/02/09 02:36 AM
Riamus2 (OP):

Unfortunately, that removes all text before the [a] footnote and doesn't convert &nbsp; to a space.

Quote:
    <p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>

Trying it on that results in [a]... it doesn't show "Here is normal text."

Re: $htmlfree update needed Riamus2 #209671 20/02/09 11:57 AM
Mpdreamz:

This sort of problem always makes me think of this cartoon:

Even if we force recursion, [a] is all we'll get.

Code:
    alias htmlfree {
      var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
      if ($regex(%x,/<|>/)) {
        while (%x != $htmlfree.recurse(%x)) {
          echo -a %x
          %x = $v2
        }
      }
      return %x
    }
    alias htmlfree.recurse return $htmlfree($1)

    //echo -a $htmlfree(<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>)

I opted for the lazy recursion rather than cracking at PCRE's (?R).

first iteration => 16Here is normal text.a]'>[a]
second iteration => [a]

In these cases, cracking out your own parser is needed, and that happens elsewhere too. In mIRC this would be a huge performance hit, though, so it might be better to delegate the work:

Code:
    alias nohtml {
      if (!$1) return
      .comopen h htmlfile
      .comclose h $com(h,write,1,bstr,$1) $com(h,body,3,dispatch* b) $com(b,innertext,3)
      var %x = $com(b).result
      .comclose b
      return %x
    }

    //echo -a $nohtml(<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>)

returns 16Here is normal text.[a]

It connects with MSHTML.HTMLDocument, which should be available on Windows 95 and up.

EDIT: Oh, and someone name and shame the guy for nesting HTML tags within attribute values!

Last edited by Mpdreamz; 20/02/09 11:58 AM.
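The "delegate the work to a real parser" idea is not specific to COM. As a rough illustration only, the sketch below does the same thing with Python's standard-library `html.parser` instead of MSHTML (a swapped-in parser, not the approach posted above; the names `TextOnly` and `nohtml_sketch` are hypothetical). It relies on the parser accepting `<` and `>` inside a quoted attribute value, so the whole `<sup ... value='...'>` is consumed as one tag.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text nodes; tags and their attributes are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # also decodes entities such as &nbsp;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def nohtml_sketch(markup: str) -> str:
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


SAMPLE = (
    "<p> <p /> <sup id=\"en-NIV-26127\" class=\"vnum\" value='16'>16</sup>"
    "Here is normal text."
    "<sup class='footnote' value='[<a href=\"#fen-NIV-26127a\" "
    "title=\"See footnote a\">a</a>]'>"
    "[<a href=\"#fen-NIV-26127a\" title=\"See footnote a\">a</a>]</sup>"
)

print(nohtml_sketch(SAMPLE))  # 16Here is normal text.[a]
```

The design point is the same as with the COM call: the parser already knows the quoting rules for attributes, so nothing has to be re-derived with regular expressions.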

Re: $htmlfree update needed Mpdreamz #209683 21/02/09 12:41 AM
Riamus2 (OP):

Definitely not what I was hoping to hear. Sites are a real pain! Thanks for trying, though.

Would it be faster just having 2 regex identifiers? A normal $htmlfree (that leaves a]'>), and one that removes any letter followed by ]'>? I'm stuck without being able to just use $left/$right/$pos to remove it because there may be more than one. I mean, I could probably do something like that, but it wouldn't be all that efficient. Maybe a second regex that checks for that isn't very efficient either. I don't know. I just don't want to take a performance hit if I don't have to.
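The two-regex idea can be sketched as follows. This is Python rather than mIRC script and says nothing about speed inside mIRC; the names `htmlfree_two_pass` and `LEFTOVER_PATTERN` are hypothetical, and the second pattern assumes the leftover is always exactly one ASCII letter followed by `]'>`, as in the sample.

```python
import re

# First pass: the usual tag-stripping pattern from the htmlfree alias.
TAG_PATTERN = re.compile(r"^[^<]*>|<[^>]*>|<[^>]*$")
# Second pass (assumption): a single letter followed by the leftover "]'>".
LEFTOVER_PATTERN = re.compile(r"[A-Za-z]\]'>")


def htmlfree_two_pass(text: str) -> str:
    stripped = TAG_PATTERN.sub("", text).replace("&nbsp;", " ")
    # The /g-style global substitution handles several footnotes on one line.
    return LEFTOVER_PATTERN.sub("", stripped)


SAMPLE = (
    "<p> <p /> <sup id=\"en-NIV-26127\" class=\"vnum\" value='16'>16</sup>"
    "Here is normal text."
    "<sup class='footnote' value='[<a href=\"#fen-NIV-26127a\" "
    "title=\"See footnote a\">a</a>]'>"
    "[<a href=\"#fen-NIV-26127a\" title=\"See footnote a\">a</a>]</sup>"
)

print(htmlfree_two_pass(SAMPLE).strip())  # 16Here is normal text.[a]
```

This only patches the one known leftover shape; if the site changes how the footnote attribute is written, the second pattern would need to change with it.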

Re: $htmlfree update needed Riamus2 #209684 21/02/09 01:36 AM
genius_at_work:

As Mpdreamz suggested, the only real way to solve this problem is to break the fingers of whoever decided to put unescaped HTML tags within the parameters of another HTML tag.

The problem lies with the fact that the HTML on a site usually comes back as lines that are too long for mIRC to handle internally. If a set of ' ' quotes (as in your example) fell across a linebreak, any code would be rendered useless.

One possible solution would be to remove any text that is enclosed in ' ' quotes in one regex, and then deal with the remaining < > brackets in another regex. But then a potential problem pops up when the ' ' quotes show up outside the < > brackets (in plain text), or when a single ' quote shows up within a set of " " quotes (and another ' quote isn't there to match it).

It seems that this rather simple request has turned into a very complex coding problem. I will try to think about it further.

-genius_at_work

Re: $htmlfree update needed genius_at_work #209689 21/02/09 04:33 AM
Riamus2 (OP):

I actually have very little problem with line length with this particular site/script.

Re: $htmlfree update needed Riamus2 #209698 21/02/09 06:37 AM
genius_at_work:

Well, the ultimate goal of this thread would be to 'fix' the widely used $htmlfree alias so that it works better in more situations. We could make it work in this exact situation, but if the site you are using changed its format slightly, your code may not work anymore.

-genius_at_work

Re: $htmlfree update needed Riamus2 #209706 21/02/09 09:32 AM
Mpdreamz:

Is there a particular reason why my $nohtml falls short for you? I suggested it as a supplement rather than a complete replacement. You could still do a $regex check to determine whether the sockread data is something you'd want to $nohtml.

The cartoon jokes about situations almost identical to this one, where parsing nested parentheses is not something you'd do with a finite-state automaton (regex), or as a famous saying goes: "I know, I'll use regular expressions. Now I have two problems". The saying is applied too frequently, but in this situation it surely holds up.

Your best bet at parsing this will be the IHTMLDocument3 COM interface (also Windows 95 and up), which already deals with all kinds of invalid data. I could whip something up that queries all the spans or divs with the classname 'footer', calls innerText on them, and calls a callback alias you supply with the innerText as $1-. I'd like to hear your reasons for dismissing a COM-based approach first, though, before I spend my time on it.

This guarantees that the text you receive is as you see it in the browser; no regular expression will get you anywhere near the same coverage. You could, as genius_at_work suggested, of course make it work in this specific case with a decent amount of work, but it would still be a very quirky parser which is too easily broken. Even if speed is of importance, the COM approach shouldn't be any slower, if not faster.

Re: $htmlfree update needed Riamus2 #209709 21/02/09 01:26 PM
jaytea:

* jaytea ignores everything Mpdreamz just said and proposes a regex solution! ;D

this might be too basic for you Riamus, but the following is just a small expansion on $htmlfree which removes balanced tags nested inside other balanced tags (much like the type of regexes that handle nested parentheses)

Code:
    alias htmlfree return $regsubex($1,/^[^<]*>|(<(?:[^<]|(?1))*>)|<.*$/gU,)

use it, build upon it, or ignore it ;p

"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde
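The balanced-nesting idea can also be written without a recursive pattern. The sketch below is Python (whose built-in `re` module has no `(?1)` recursion), so it uses a plain depth counter instead; it is an analogous technique, not a translation of the regex above, and `strip_nested_tags` is a hypothetical name. One simplification to note: a stray `>` with no opening `<` is kept as text here, whereas the regex drops a leading partial tag.

```python
def strip_nested_tags(text: str) -> str:
    """Drop everything between balanced < and >, including tags nested in tags."""
    depth = 0   # how many unclosed "<" we are currently inside
    kept = []
    for ch in text:
        if ch == "<":
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)  # only text outside every tag survives
    return "".join(kept)


SAMPLE = (
    "<sup class='footnote' value='[<a href=\"#fen-NIV-26127a\" "
    "title=\"See footnote a\">a</a>]'>"
    "[<a href=\"#fen-NIV-26127a\" title=\"See footnote a\">a</a>]</sup>"
)

print(strip_nested_tags(SAMPLE))  # [a]
```

Like the regex, this counts brackets rather than understanding quotes, so a literal `<` or `>` in plain text would still throw it off.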
