$htmlfree update needed - mIRC Discussion Forums

Ok, I have this identifier to remove html. It has always worked fine until now.

Code:

alias htmlfree {
  var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
  return %x
}

I have the following Source data (I cut it down to just the part that is causing the problem):

Code:

<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>

Now, that *should* result in just [a] being returned. Instead, it leaves: a]'>[a] ... most likely due to the []'s in there. Can someone help to update the identifier so it doesn't miss that a]'> part?

Note that I'll also accept any other good method to automatically remove that extra data. I know I could just $remove(%var,]'>) from it, but the "a" in there can be any letter and I don't really want to list out all letters in a $remove line... that's just not very efficient. Also note that there may be multiple footnotes on a line, so I can't just use $gettok to get the data.

For now

Code:

alias htmlfree {
  var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
  return $remove(%x,$wildtok(%x,*]'>,1,91))
}

I'm not at home to test the regex query.

That definitely fixes that problem, but it leaves &nbsp; and perhaps others now. It also seems to remove more than it should. If there's text before that, it also gets removed.

Quote:

16Here is normal text.a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]

That leaves just [a] and nothing else. I still need the text before that part. It's only the "a]'>" part that should get removed. It should show: 16Here is normal text. [a]

Previously, it did, but also included the a]'> part.

Thanks for helping.

Here is a more efficient regsub tag

Code:

var %x, %i = $regsub($1-,/(&nbsp;|&deg;|]'>|^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))

Also deals with

Code:

&nbsp; and &deg;

features you might come across. This will properly strip all HTML tags from your site dump. The reason you'll see an extra a is that, if you look at the line without all the HTML tags, that a is still present.

Remove the $remove + $wildtok stuff I gave you earlier and just return %x alone.

Well, that works except the other a. It *shouldn't* be there. It's part of the html just like a link would be.

Quote:

a</a>]'>

Just to look at that part. The red, you agree shouldn't be there. The "a" that you said should be there, if you look closely, is part of the green value in the html the same way that 'footnote' is. value='a' is basically what you're seeing there. The a doesn't belong.

That's not the a I see :P I'll make all HTML tags red; green = what should be left :P Actual HTML tags are delimited by < and >, therefore the a is not inside an HTML tag, it's (OUT).

Quote:

16Here is normal text.a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]

PS: maybe if you see it this way you'll know what I mean. After running the script, echo the 10 $regml's of the strip:

//echo -a $regml(1) $regml(2) $regml(3) $regml(4) $regml(5) $regml(6) $regml(7) $regml(8) $regml(9) $regml(10)

You'll see exactly how the data is cut from the entire string.

I think it should be like this:

a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]

Red = HTML tags
Blue = HTML tag parameters
Green = Plain text

Which would appear as this on your browser screen: [a]

-genius_at_work

It actually *is* inside the <>'s. Look carefully. There are 2 a's. One isn't between <>'s (the last one). The first one *is* between them as I showed. Look carefully at it. Yes, it's not between the <a href>'s <>'s. However, *that* is nested INSIDE the <sup>'s brackets. If you look, there is a < before the sup, and the closing > is after that first a.

Look at genius_at_work's comment. He sees what I'm saying. If you looked at the code in a browser, you'd see [a], not a[a].

Note that I didn't correctly write what you'd see if you removed the <a href> in my previous post. By removing just the <a href> part, you'd see value='[a]' instead of value='a' . You're removing everything from that except the a...

Taking out the nested <a href> in:

Quote:

a</a>]'>

Results in:

Quote:

Now, that's definitely all html.

Is this HTML being returned by an outside source, or is it something that you can edit?

I'm pretty sure that the part inside of the value='' tags should be escaped by using HTML entities like this:

value='[&lt;a href=&quot;#fen-NIV-26127a&quot; title=&quot;See footnote a&quot;&gt;a&lt;/a&gt;]'

No, I'm getting the source with sockets, so I have no real control over it. Granted, I could edit it while receiving it, but I'd rather find a good "$htmlfree" method of just removing all html aspects from the line. Also, I have a feeling that just using something like $replace(%text,a[a],[a],b[b],[b],c[c],[c],...) or using $remove on the original returned text $remove(%text,a]'>,b]'>,c]'>,...) would be more efficient than editing every line and then running it through $htmlfree, even though that's not very efficient itself. I know I could just do that and it would be "easy" to do... but I'd rather something efficient.

Try this:

Code:


alias htmlfree {
  var %x = $regsubex($1-,/^[^<]*(?:\'?[^\']*\')*[^<]*>|<[^>]*(?:\'[^\']*\')*[^<]*>|(<[^>]*(\'[^\']*\'?)*[^<]*$)/g,)
  return %x
}

-genius_at_work

Unfortunately, that removes *all* text before the [a] footnote and doesn't convert &nbsp; to a space.

Quote:

16Here is normal text.a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]

Trying it on that results in [a]... it doesn't show Here is normal text.

This sort of problem always makes me think of this cartoon:

Even if we force recursion [a] is all we'll get.

Code:

alias htmlfree {
  var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
  if ($regex(%x,/<|>/)) {
    while (%x != $htmlfree.recurse(%x)) { 
      echo -a %x
      %x = $v2
    }
  }
  return %x
}
alias htmlfree.recurse return $htmlfree($1);

//echo -a $htmlfree( 16Here is normal text.a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>])

I opted for the lazy recursion rather than cracking at PCRE's (?R)

first iteration => 16Here is normal text.a]'>[a]
second iteration => [a]

In cases like this, cracking out your own parser is needed, which happens elsewhere too. In mIRC this would be a huge performance hit, though, so it might be better to delegate the work:

Code:

alias nohtml {
  if (!$1) return
  .comopen h htmlfile
  .comclose h $com(h,write,1,bstr,$1) $com(h,body,3,dispatch* b) $com(b,innertext,3)
  var %x = $com(b).result
  .comclose b
  return %x
}

//echo -a $nohtml( 16Here is normal text.a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>])

returns 16Here is normal text.[a]

It connects with MSHTML.HTMLDocument, which should be available on Windows 95 and up.

EDIT: Oh and someone name and shame the guy for nesting HTML tags within attribute values!

Definitely not what I was hoping to hear. Sites are a real pain!

Thanks for trying, though.

Would it be faster just having 2 regex identifiers? A normal $htmlfree (that leave a]'> ), and one that removes any letter followed by ]'> ? I'm stuck without being able to just use $left/$right/$pos to remove it because there may be more than one. I mean, I could probably do something like that, but it wouldn't be all that efficient. Maybe a second regex that checks for that isn't very efficient either. I don't know. I just don't want to have to take a performance hit if I don't have to.
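A minimal sketch of that two-identifier idea, for illustration only. The alias name htmlfree.fn is hypothetical, and it assumes the leftover fragment is always a single letter followed by ]'> and that the original $htmlfree from the first post is loaded:

```mirc
; Hypothetical wrapper name; assumes the original $htmlfree alias is loaded.
alias htmlfree.fn {
  ; first pass: the usual tag stripper, which can leave fragments like a]'>
  var %x = $htmlfree($1-)
  ; second pass: drop any single letter that is directly followed by ]'>
  return $regsubex(%x,/[a-z]\]'>/gi,)
}
```

Because the second pattern is global, it should cope with several footnotes on one line, but it would also eat a legitimate letter that happens to sit directly before ]'> in plain text.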

As MPdreamz suggested, the only real way to solve this problem is to break the fingers of whoever decided to put unescaped html tags within the parameters of another html tag. The problem lies with the fact that the html on a site usually comes back as lines that are too long for mIRC to handle internally. If a set of ' ' quotes (as in your example) fell across a linebreak, any code would be rendered useless.

One possible solution would be to remove any text that is enclosed in ' ' quotes in one regex, and then deal with the remaining < > brackets in another regex. But then the potential problem pops up when the ' ' quotes show up outside the < > brackets (in plain text), or when a single ' quote shows up within a set of " " quotes (and another ' quote isn't there to match it).
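The two-pass idea could be sketched roughly as below. The alias name htmlfree.twopass is hypothetical, and the sketch inherits every weakness listed above: it removes single-quoted text anywhere in the line, including in plain text, and an unmatched ' will throw it off.

```mirc
; Hypothetical alias name; a sketch of the two-pass idea, not a general fix.
alias htmlfree.twopass {
  ; pass 1: drop everything enclosed in single quotes (also hits quoted plain text)
  var %x = $regsubex($1-,/'[^']*'/g,)
  ; pass 2: the usual tag patterns on whatever is left
  %x = $regsubex(%x,/^[^<]*>|<[^>]*>|<[^>]*$/g,)
  ; finally turn non-breaking space entities into ordinary spaces
  return $replace(%x,&nbsp;,$chr(32))
}
```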

It seems that this rather simple request has turned into a very complex coding problem. I will try to think about it further.

-genius_at_work

I actually have very little problem with line length with this particular site/script.

Well, the ultimate goal of this thread would be to 'fix' the widely used $htmlfree alias so that it works better in more situations. We could make it work in this exact situation, but if the site you are using changed its format slightly, your code may not work anymore.

-genius_at_work

Is there a particular reason why my $nohtml falls short for you?

I suggested it as a supplement rather than a complete replacement. You could still do a $regex check to determine whether the sockread data is something you'd want to $nohtml.

The cartoon jokes about situations almost identical to this one, where parsing nested parentheses is not something you'd do with a finite state automaton (regex), or as a famous saying goes: "I know, I'll use regular expressions. Now I have two problems". The saying is applied too frequently, but in this situation it surely holds up.

Your best bet at parsing this will be with the IHTMLDocument3 COM interface (also Windows 95 and up), which already deals with all kinds of invalid data. I could whip something up that queries all the spans or divs with the classname 'footer', calls innerText on them, and calls a callback alias you supply with the innerText as $1-.

I'd like to hear your reasons for dismissing a COM based approach first, though, before I spend my time on it.

This guarantees that the text you receive is as you see it in the browser; no regular expression will get you anywhere near the same coverage.

You could, as genius_at_work suggested, of course make it work in this specific case with a decent amount of work, but it would still be a very quirky parser which is too easily broken.

Even if speed is of importance, the COM approach shouldn't be any slower, if not faster.

* jaytea ignores everything Mpdreamz just said and proposes a regex soltn!

;D

this might be too basic for you Riamus, but the following is just a small expansion on $htmlfree which removes balanced tags nested inside other balanced tags (much like the type of regexes that handle nested parentheses)

Code:

alias htmlfree return $regsubex($1,/^[^<]*>|(<(?:[^<]|(?1))*>)|<.*$/gU,)

use it, build upon it, or ignore it ;p

That works almost perfectly, jaytea. It doesn't remove &nbsp;, but it does make the line look right. I can more easily handle those non-breaking spaces than the other, even right within the identifier. It is all working great now. Thanks.
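Handling the non-breaking spaces right within the identifier might look like the sketch below. It keeps jaytea's pattern as posted and only wraps the result in the same $replace the older versions used; using $1- instead of $1 is an assumption made here to match those older versions.

```mirc
alias htmlfree {
  ; jaytea's pattern: (?1) recurses into group 1, so balanced tags nested
  ; inside another tag are consumed together with the outer tag
  var %x = $regsubex($1-,/^[^<]*>|(<(?:[^<]|(?1))*>)|<.*$/gU,)
  ; turn non-breaking space entities into ordinary spaces
  return $replace(%x,&nbsp;,$chr(32))
}
```

Other entities such as &deg; could be added to the same $replace as further pairs if they turn up in the data.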

Mpdreamz, as you said yourself, it affects performance to jump into COM for something so minor as HTML removal. I do use COM in other scripts, but it's used in cases where there really isn't a better option to get the information that it provides (such as system or Windows information). That said, a DLL would probably still be better in those cases, but I understand COM... I don't really want to try to figure out how to write DLLs right now.

Also, as genius_at_work said, this problem for me gives the option to try and fix the often-used $htmlfree/$nohtml identifiers so they work in more situations. It's much easier for someone to use a widely used HTML removal identifier than to try and find a COM script that does exactly what they need. I admit I wasn't trying to fix it for anyone else, but it's not a bad idea to do so.

So I wasn't ignoring your suggestion. I just felt that I could have less of a performance problem using other methods. I didn't benchmark yours versus others, so maybe it isn't slower than other methods... it just seems like it would be.

Hehe, I haven't done any benchmarking myself, but I doubt the end result will be slower than the equivalent regex call. Which is not the one that jaytea posted, which needs balanced brackets. Like I mentioned in my post, PCRE recursion could get you there halfway. But then that's PCRE's awesomeness overcoming some Turing completeness issues with regex.

If someone gets it working for nested and unbalanced tags (which they won't), I'll take up the challenge with the COM solution.

Regex does not equal fast per se, definitely not complex ones like the one needed here, much like COM doesn't have to be slow per se. Replying here makes me very curious how fast/slow the COM approach will be, so I'll do some benchmarking tomorrow.

Not to say an updated $htmlfree which handles nested balanced tags, however wrong, is a bad thing :P (oxymoron?) (The alias should really be called $striphtml or something, come to think of it).

Glad jaytea's solution works for you

http://htmlparsing.icenine.ca/doku.php

http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

http://wiki.tcl.tk/4164

http://oubliette.alpha-geek.com/2003/12/31/do_not_do_not_parse_html_with_regexs