$htmlfree update needed

Posted By: Riamus2

$htmlfree update needed - 17/02/09 01:38 AM

Ok, I have this identifier to remove html. It has always worked fine until now.

Code:
alias htmlfree {
  var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
  return %x
}


I have the following Source data (I cut it down to just the part that is causing the problem):
Code:
<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>


Now, that *should* result in just [a] being returned. Instead, it leaves: a]'>[a] ... most likely due to the []'s in there. Can someone help to update the identifier so it doesn't miss that a]'> part?

Note that I'll also accept any other good method to automatically remove that extra data. I know I could just $remove(%var,]'>) from it, but the "a" in there can be any letter and I don't really want to list out all letters in a $remove line... that's just not very efficient. Also note that there may be multiple footnotes on a line, so I can't just use $gettok to get the data.
Posted By: Lpfix5

Re: $htmlfree update needed - 17/02/09 12:41 PM

For now

Code:
alias htmlfree {
  var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
  return $remove(%x,$wildtok(%x,*]'>,1,91))
}


I'm not at home to test the regex query.
Posted By: Riamus2

Re: $htmlfree update needed - 17/02/09 01:25 PM

That definitely fixes that problem, but it now leaves &nbsp; and perhaps others. It also seems to remove more than it should: if there's text before that, it gets removed too.

Quote:
<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>


That leaves just [a] and nothing else. I still need the text before that part. It's only the "a]'>" part that should get removed. It should show: 16Here is normal text.[a]

Previously, it did, but also included the a]'> part.

Thanks for helping.
Posted By: Lpfix5

Re: $htmlfree update needed - 18/02/09 12:43 AM

Here is a more efficient regsub tag

Code:
var %x, %i = $regsub($1-,/(&nbsp;|&deg;|]'>|^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))


It also deals with the &nbsp; and &deg; entities you might come across. This will properly strip all HTML tags within your site dump. The reason you'll see an extra a is that, if you look at the string without all the HTML tags, that a is present.

Remove the $remove + $wildtok stuff I gave you earlier and just return %x alone.
Posted By: Riamus2

Re: $htmlfree update needed - 18/02/09 02:55 AM

Well, that works except the other a. It *shouldn't* be there. It's part of the html just like a link would be.

Quote:

<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>


Just to look at that part. The red, you agree shouldn't be there. The "a" that you said should be there, if you look closely, is part of the green value in the html the same way that 'footnote' is. value='a' is basically what you're seeing there. The a doesn't belong.
Posted By: Lpfix5

Re: $htmlfree update needed - 18/02/09 03:43 AM

That's not the a I see :P I'll make all HTML tags red and green what should remain :P Actual HTML tags are delimited by < and >, therefore the a is not inside an HTML tag; it's (OUT).

Quote:

<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>


Posted By: Lpfix5

Re: $htmlfree update needed - 18/02/09 03:54 AM

PS: maybe if you see it this way you'll know what I mean. After running the script, echo the 10 $regml matches of the strip:

//echo -a $regml(1) $regml(2) $regml(3) $regml(4) $regml(5) $regml(6) $regml(7) $regml(8) $regml(9) $regml(10)

You'll see exactly how the data is cut from the entire string.
Posted By: genius_at_work

Re: $htmlfree update needed - 18/02/09 04:03 AM

I think it should be like this:

<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>

Red = HTML tags
Blue = HTML tag parameters
Green = Plain text

Which would appear as this on your browser screen: [a]

-genius_at_work
Posted By: Riamus2

Re: $htmlfree update needed - 18/02/09 12:48 PM

It actually *is* inside the <>'s. Look carefully. There are 2 a's. One isn't between <>'s (the last one). The first one *is* between them as I showed. Yes, it's not between the <a href>'s <>'s. However, *that* is nested INSIDE the <sup>'s brackets. If you look, there is a < before the sup, and the closing > is after that first a.

Look at genius_at_work's comment. He sees what I'm saying. If you looked at the code in a browser, you'd see [a], not a[a].

Note that I didn't correctly write what you'd see if you removed the <a href> in my previous post. By removing just the <a href> part, you'd see value='[a]' instead of value='a' . You're removing everything from that except the a...

Taking out the nested <a href> in:
Quote:
<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>


Results in:
Quote:
<sup class='footnote' value='[a]'>


Now, that's definitely all html.
Posted By: hixxy

Re: $htmlfree update needed - 19/02/09 03:43 AM

Is this HTML being returned by an outside source, or is it something that you can edit?

I'm pretty sure that the part inside of the value='' tags should be escaped by using HTML entities like this:

value='[&lt;a href="#fen-NIV-26127a" title="See footnote a"&gt;a&lt;/a&gt;]'
Posted By: Riamus2

Re: $htmlfree update needed - 20/02/09 01:02 AM

No, I'm getting the source with sockets, so I have no real control over it. Granted, I could edit it while receiving it, but I'd rather find a good "$htmlfree" method of just removing all html aspects from the line. Also, I have a feeling that just using something like $replace(%text,a[a],[a],b[b],[b],c[c],[c],...) or using $remove on the original returned text, $remove(%text,a]'>,b]'>,c]'>,...), would be more efficient than editing every line and then running it through $htmlfree, even though that's not very efficient itself. I know I could just do that and it would be "easy" to do... but I'd rather have something efficient.
Posted By: genius_at_work

Re: $htmlfree update needed - 20/02/09 02:22 AM

Try this:

Code:

alias htmlfree {
  var %x = $regsubex($1-,/^[^<]*(?:\'?[^\']*\')*[^<]*>|<[^>]*(?:\'[^\']*\')*[^<]*>|(<[^>]*(\'[^\']*\'?)*[^<]*$)/g,)
  return %x
}



-genius_at_work
Posted By: Riamus2

Re: $htmlfree update needed - 20/02/09 02:36 AM

Unfortunately, that removes *all* text before the [a] footnote and doesn't convert &nbsp; to a space.

Quote:
<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>


Trying it on that results in [a]... it doesn't show Here is normal text.
Posted By: Mpdreamz

Re: $htmlfree update needed - 20/02/09 11:57 AM

This sort of problem always makes me think of this cartoon:


Even if we force recursion [a] is all we'll get.
Code:
alias htmlfree {
  var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
  if ($regex(%x,/<|>/)) {
    while (%x != $htmlfree.recurse(%x)) { 
      echo -a %x
      %x = $v2
    }
  }
  return %x
}
alias htmlfree.recurse return $htmlfree($1);

//echo -a $htmlfree(<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>)

I opted for the lazy recursion rather than cracking at PCRE's (?R)

first iteration => 16Here is normal text.a]'>[a]
second iteration => [a]


In cases like these, cracking out your own parser is needed, and that happens elsewhere too. In mIRC this would be a huge performance hit, though, so it might be better to delegate the work ;)

Code:
alias nohtml {
  if (!$1) return
  .comopen h htmlfile
  .comclose h $com(h,write,1,bstr,$1) $com(h,body,3,dispatch* b) $com(b,innertext,3)
  var %x = $com(b).result
  .comclose b
  return %x
}


//echo -a $nohtml(<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>)

returns 16Here is normal text.[a]

It connects with MSHTML.HTMLDocument, which should be available on Windows 95 and up.

EDIT: Oh and someone name and shame the guy for nesting HTML tags within attribute values!
Posted By: Riamus2

Re: $htmlfree update needed - 21/02/09 12:41 AM

Definitely not what I was hoping to hear. Sites are a real pain!

Thanks for trying, though.

Would it be faster just having 2 regex identifiers? A normal $htmlfree (that leaves a]'> ), and one that removes any letter followed by ]'> ? I'm stuck without being able to just use $left/$right/$pos to remove it because there may be more than one. I mean, I could probably do something like that, but it wouldn't be all that efficient. Maybe a second regex that checks for that isn't very efficient either. I don't know. I just don't want to have to take a performance hit if I don't have to.
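
A minimal sketch of that two-pass idea, assuming the leftover always has the shape "one letter followed by ]'>". The alias name htmlfree2 is made up for illustration. The first $regsub is the original tag-stripping pattern and the second removes the leftover fragments, however many there are on the line. It would also eat the same sequence if it ever showed up in real text, so treat it as a site-specific workaround rather than a general fix.

```mirc
; Hypothetical two-pass variant of $htmlfree (the name is illustrative only).
alias htmlfree2 {
  ; Pass 1: the original tag-stripping regex from the first post.
  var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x)
  ; Pass 2: drop any single letter followed by ]'> left behind by nested tags.
  %i = $regsub(%x,/[a-z]\]'>/gi,$null,%x)
  ; Convert non-breaking space entities last, as the original alias did.
  return $replace(%x,&nbsp;,$chr(32))
}
```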
Posted By: genius_at_work

Re: $htmlfree update needed - 21/02/09 01:36 AM

As MPdreamz suggested, the only real way to solve this problem is to break the fingers of whoever decided to put unescaped html tags within the parameters of another html tag. The problem lies with the fact that the html on a site usually comes back as lines that are too long for mIRC to handle internally. If a set of ' ' quotes (as in your example) fell across a linebreak, any code would be rendered useless.

One possible solution would be to remove any text that is enclosed in ' ' quotes in one regex, and then deal with the remaining < > brackets in another regex. But then the potential problem pops up when the ' ' quotes show up outside the < > brackets (in plain text), or when a single ' quote shows up within a set of " " quotes (and another ' quote isn't there to match it).

It seems that this rather simple request has turned into a very complex coding problem. I will try to think about it further.

-genius_at_work
Posted By: Riamus2

Re: $htmlfree update needed - 21/02/09 04:33 AM

I actually have very little problem with line length with this particular site/script.
Posted By: genius_at_work

Re: $htmlfree update needed - 21/02/09 06:37 AM

Well, the ultimate goal of this thread would be to 'fix' the widely used $htmlfree alias so that it works better in more situations. We could make it work in this exact situation, but if the site you are using changed its format slightly, your code may not work anymore.

-genius_at_work
Posted By: Mpdreamz

Re: $htmlfree update needed - 21/02/09 09:32 AM

Is there a particular reason why my $nohtml falls short for you?

I suggested it as a supplement rather than a complete replacement. You could still do a $regex check to determine whether the sockread data is something you'd want to $nohtml.
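
A rough sketch of such a check, under the assumption that "a < followed by another < before any >" is a good enough sign of a tag nested inside another tag. The alias name striphtml and the detection pattern are illustrative and not from any existing script. It relies on both $htmlfree and $nohtml being loaded.

```mirc
; Hypothetical wrapper: only pay for the COM-based $nohtml when the line
; appears to contain a tag opened inside another tag.
alias striphtml {
  ; A < that reaches another < without passing a > suggests nesting.
  if ($regex($1,/<[^<>]*</)) return $nohtml($1)
  ; Ordinary lines go through the cheaper regex-based alias.
  return $htmlfree($1)
}
```

Plain text containing two literal < characters would also take the COM path, which costs time but should not change the result.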

The cartoon jokes about situations almost identical to this one, where parsing nested parentheses is not something you'd do with a finite state automaton (regex), or as a famous saying goes: "I know, I'll use regular expressions. Now I have two problems". The saying is applied too frequently, but in this situation it surely holds up.

Your best bet at parsing this will be the IHTMLDocument3 COM interface (also Windows 95 and up), which already deals with all kinds of invalid data. I could whip something up that queries all the spans or divs with the classname 'footer', calls innerText on them, and calls a callback alias you supply with the innerText as $1-.

I'd like to hear your reasons for dismissing a COM-based approach first, though, before I spend my time on it.

This guarantees that the text you receive is as you see it in the browser; no regular expression will get you anywhere near the same coverage.

You could, as genius_at_work suggested, of course make it work in this specific case with a decent amount of work, but it would still be a very quirky parser which is too easily broken.

Even if speed is of importance, the COM approach shouldn't be any slower, and might even be faster.

Posted By: jaytea

Re: $htmlfree update needed - 21/02/09 01:26 PM

* jaytea ignores everything Mpdreamz just said and proposes a regex soltn!

;D

this might be too basic for you Riamus, but the following is just a small expansion on $htmlfree which removes balanced tags nested inside other balanced tags (much like the type of regexes that handle nested parentheses)

Code:
alias htmlfree return $regsubex($1,/^[^<]*>|(<(?:[^<]|(?1))*>)|<.*$/gU,)


use it, build upon it, or ignore it ;p
Posted By: Riamus2

Re: $htmlfree update needed - 21/02/09 04:44 PM

That works almost perfectly, jaytea. It doesn't remove &nbsp;, but it does make the line look right. I can more easily handle those non-breaking spaces than the other, even right within the identifier. It is all working great now. Thanks.
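
Handling it within the identifier could look something like this: a sketch that keeps jaytea's pattern unchanged and adds back the $replace step from the original alias. Other entities such as &deg; would need their own pairs in the $replace call.

```mirc
; Sketch: jaytea's nested-tag regex plus the &nbsp; handling from the original alias.
alias htmlfree {
  ; Strip a leading partial tag, balanced (possibly nested) tags, and a trailing partial tag.
  var %x = $regsubex($1,/^[^<]*>|(<(?:[^<]|(?1))*>)|<.*$/gU,)
  ; Turn non-breaking space entities into ordinary spaces.
  return $replace(%x,&nbsp;,$chr(32))
}
```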

Mpdreamz, as you said yourself, it affects performance to jump into COM for something so minor as HTML removal. I do use COM in other scripts, but it's used in cases where there really isn't a better option to get the information that it provides (such as system or Windows information). That said, a DLL would probably still be better in those cases, but I understand COM... I don't really want to try to figure out how to write DLLs right now.

Also, as genius_at_work said, this problem for me gives the option to try and fix the often-used $htmlfree/$nohtml identifiers so they work in more situations. It's much easier for someone to use a widely used HTML removal identifier than to try and find a COM script that does exactly what they need. I admit I wasn't trying to fix it for anyone else, but it's not a bad idea to do so.

So I wasn't ignoring your suggestion. I just felt that I could have less of a performance problem using other methods. I didn't benchmark yours versus others, so maybe it isn't slower than other methods... it just seems like it would be.
Posted By: Mpdreamz

Re: $htmlfree update needed - 21/02/09 06:44 PM

Hehe, I haven't done any benchmarking myself, but I doubt the end result will be slower than the equivalent regex call, which is not the one jaytea posted, since that needs balanced brackets. Like I mentioned in my post, PCRE recursion could get you there halfway. But then that's PCRE's awesomeness overcoming some Turing completeness issues with regex.

If someone gets it working for nested and unbalanced tags (which they won't), I'll take up the challenge with the COM solution.

Regex does not equal fast per se, definitely not complex ones like the one needed here, much like COM doesn't have to be slow per se. Replying here makes me very curious how fast/slow the COM approach will be, so I'll do some benchmarking tomorrow :D

Not to say that an updated $htmlfree which handles nested balanced tags, however wrong, is a bad thing :P (oxymoron?) (The alias should really be called $striphtml or something, come to think of it.)

Glad jaytea's solution works for you :)


Posted By: argv0

Re: $htmlfree update needed - 21/02/09 07:01 PM

http://htmlparsing.icenine.ca/doku.php

http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

http://wiki.tcl.tk/4164

http://oubliette.alpha-geek.com/2003/12/31/do_not_do_not_parse_html_with_regexs
