#209543 17/02/09 01:38 AM
Joined: Oct 2004
Posts: 8,330
Riamus2 Offline OP
Hoopy frood
Ok, I have this identifier to remove html. It has always worked fine until now.

Code:
alias htmlfree {
  var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
  return %x
}


I have the following Source data (I cut it down to just the part that is causing the problem):
Code:
<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>


Now, that *should* result in just [a] being returned. Instead, it leaves: a]'>[a] ... most likely due to the []'s in there. Can someone help to update the identifier so it doesn't miss that a]'> part?

Note that I'll also accept any other good method to automatically remove that extra data. I know I could just $remove(%var,]'>) from it, but the "a" in there can be any letter and I don't really want to list out all letters in a $remove line... that's just not very efficient. Also note that there may be multiple footnotes on a line, so I can't just use $gettok to get the data.

Last edited by Riamus2; 17/02/09 01:42 AM.

Invision Support
#Invision on irc.irchighway.net
Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
For now

Code:
alias htmlfree {
  var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
  return $remove(%x,$wildtok(%x,*]'>,1,91))
}


I'm not at home to test the regex query.


Code:
if $reality > $fiction { set %sanity Sane }
Else { echo -a *voices* }
Joined: Oct 2004
Posts: 8,330
Riamus2 Offline OP
Hoopy frood
That definitely fixes that problem, but it now leaves &nbsp; and perhaps other entities. It also seems to remove more than it should: if there's text before the footnote, that text gets removed as well.

Quote:
<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>


That leaves just [a] and nothing else. I still need the text before that part. It's only the "a]'>" part that should get removed. It should show: 16Here is normal text. [a]

Previously, it did, but also included the a]'> part.

Thanks for helping.

Last edited by Riamus2; 17/02/09 01:31 PM.

Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
Here is a more efficient regsub tag

Code:
var %x, %i = $regsub($1-,/(&nbsp;|&deg;|]'>|^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))


Also deals with
Code:
&nbsp; and &deg;
entities you might come across. This will properly strip all HTML tags from your site dump. The reason you'll see an extra a is that, once all the HTML tags are stripped, that a is still present.

Remove the $remove + $wildtok code I gave you earlier and just return %x on its own.


Joined: Oct 2004
Posts: 8,330
Riamus2 Offline OP
Hoopy frood
Well, that works except the other a. It *shouldn't* be there. It's part of the html just like a link would be.

Quote:

<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>


Just to look at that part. The red, you agree shouldn't be there. The "a" that you said should be there, if you look closely, is part of the green value in the html the same way that 'footnote' is. value='a' is basically what you're seeing there. The a doesn't belong.


Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
That's not the a I see :P I'll mark all HTML tags in red; green = what should be shown :P Actual HTML tags are delimited by < and >, therefore the a is not inside an HTML tag, it's (OUT).

Quote:

<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>


Originally Posted By: Riamus2
Well, that works except the other a. It *shouldn't* be there. It's part of the html just like a link would be.

Quote:

<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>


Just to look at that part. The red, you agree shouldn't be there. The "a" that you said should be there, if you look closely, is part of the green value in the html the same way that 'footnote' is. value='a' is basically what you're seeing there. The a doesn't belong.


Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
PS: maybe if you see it this way you'll know what I mean. After running the script, echo the first 10 $regml values of the strip:

//echo -a $regml(1) $regml(2) $regml(3) $regml(4) $regml(5) $regml(6) $regml(7) $regml(8) $regml(9) $regml(10)

You'll see exactly how the data is cut from the entire string.


Joined: Oct 2005
Posts: 1,741
G
Hoopy frood
I think it should be like this:

<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>

Red = HTML tags
Blue = HTML tag parameters
Green = Plain text

Which would appear as this on your browser screen: [a]

-genius_at_work

Joined: Oct 2004
Posts: 8,330
Riamus2 Offline OP
Hoopy frood
It actually *is* inside the <>'s. Look carefully. There are 2 a's. One isn't between <>'s (the last one). The first one *is* between them as I showed. Look carefully at it. Yes, it's not between the <a href>'s <>'s. However, *that* is nested INSIDE the <sup>'s brackets. If you look, there is a < before the sup, and the closing > is after that first a.

Look at genius_at_work's comment. He sees what I'm saying. If you looked at the code in a browser, you'd see [a], not a[a].

Note that I didn't correctly write what you'd see if you removed the <a href> in my previous post. By removing just the <a href> part, you'd see value='[a]' instead of value='a' . You're removing everything from that except the a...

Taking out the nested <a href> in:
Quote:
<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>


Results in:
Quote:
<sup class='footnote' value='[a]'>


Now, that's definitely all html.


Joined: Sep 2005
Posts: 2,881
H
Hoopy frood
Is this HTML being returned by an outside source, or is it something that you can edit?

I'm pretty sure that the part inside of the value='' attribute should be escaped by using HTML entities like this:

value='[&lt;a href="#fen-NIV-26127a" title="See footnote a"&gt;a&lt;/a&gt;]'

Joined: Oct 2004
Posts: 8,330
Riamus2 Offline OP
Hoopy frood
No, I'm getting the source with sockets, so I have no real control over it. Granted, I could edit it while receiving it, but I'd rather find a good "$htmlfree" method of just removing all html aspects from the line. Also, I have a feeling that just using something like $replace(%text,a[a],[a],b[b],[b],c[c],[c],...) or using $remove on the original returned text $remove(%text,a]'>,b]'>,c]'>,...) would be more efficient than editing every line and then running it through $htmlfree, even though that's not very efficient itself. I know I could just do that and it would be "easy" to do... but I'd rather have something efficient.


Joined: Oct 2005
Posts: 1,741
G
Hoopy frood
Try this:

Code:

alias htmlfree {
  var %x = $regsubex($1-,/^[^<]*(?:\'?[^\']*\')*[^<]*>|<[^>]*(?:\'[^\']*\')*[^<]*>|(<[^>]*(\'[^\']*\'?)*[^<]*$)/g,)
  return %x
}



-genius_at_work

Joined: Oct 2004
Posts: 8,330
Riamus2 Offline OP
Hoopy frood
Unfortunately, that removes *all* text before the [a] footnote and doesn't convert &nbsp; to a space.

Quote:
<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>


Trying it on that results in [a]... it doesn't show Here is normal text.


Joined: Apr 2004
Posts: 759
M
Hoopy frood
This sort of problem always makes me think of this cartoon:


Even if we force recursion [a] is all we'll get.
Code:
alias htmlfree {
  var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x), %x = $replace(%x,&nbsp;,$chr(32))
  if ($regex(%x,/<|>/)) {
    while (%x != $htmlfree.recurse(%x)) { 
      echo -a %x
      %x = $v2
    }
  }
  return %x
}
alias htmlfree.recurse return $htmlfree($1);

//echo -a $htmlfree(<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>)

I opted for the lazy recursion rather than cracking at PCRE's (?R).

first iteration => 16Here is normal text.a]'>[a]
second iteration => [a]


In cases like these you need to crack out your own parser, which happens elsewhere too. In mIRC this would be a huge performance hit, though, so it might be better to delegate the work ;)

Code:
alias nohtml {
  if (!$1) return
  .comopen h htmlfile
  .comclose h $com(h,write,1,bstr,$1) $com(h,body,3,dispatch* b) $com(b,innertext,3)
  var %x = $com(b).result
  .comclose b
  return %x
}


//echo -a $nohtml(<p> <p /> <sup id="en-NIV-26127" class="vnum" value='16'>16</sup>Here is normal text.<sup class='footnote' value='[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]'>[<a href="#fen-NIV-26127a" title="See footnote a">a</a>]</sup>)

returns 16Here is normal text.[a]

It connects with MSHTML.HTMLDocument which should be available since windows 95 and up.

EDIT: Oh and someone name and shame the guy for nesting HTML tags within attribute values!

Last edited by Mpdreamz; 20/02/09 11:58 AM.

$maybe
Joined: Oct 2004
Posts: 8,330
Riamus2 Offline OP
Hoopy frood
Definitely not what I was hoping to hear. Sites are a real pain!

Thanks for trying, though.

Would it be faster just having 2 regex identifiers? A normal $htmlfree (that leaves a]'> ), and one that removes any letter followed by ]'> ? I'm stuck without being able to just use $left/$right/$pos to remove it because there may be more than one. I mean, I could probably do something like that, but it wouldn't be all that efficient. Maybe a second regex that checks for that isn't very efficient either. I don't know. I just don't want to have to take a performance hit if I don't have to.
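
A minimal sketch of that two-pass idea, assuming the leftover always has the form of one letter followed by ]'> (the alias name htmlfree2 is made up for illustration). The first pass is the original expression from this thread; the second pass is a single extra $regsubex call instead of a long $remove list:

```mirc
alias htmlfree2 {
  ; Pass 1: the original tag-stripping expression from this thread.
  var %x, %i = $regsub($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,$null,%x)
  ; Pass 2: remove leftovers of the form a]'> (any single letter, then ]'>).
  %x = $regsubex(%x,/[a-z]\]'>/gi,)
  ; Convert non-breaking space entities to plain spaces.
  return $replace(%x,&nbsp;,$chr(32))
}
```

On the sample line this should leave 16Here is normal text.[a] (apart from stray spaces), and because of the /g flag it covers several footnotes on one line. It would also eat a genuine letter followed by ]'> in plain text, which seems unlikely but is possible.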


Joined: Oct 2005
Posts: 1,741
G
Hoopy frood
As MPdreamz suggested, the only real way to solve this problem is to break the fingers of whoever decided to put unescaped html tags within the parameters of another html tag. The problem lies with the fact that the html on a site usually comes back as lines that are too long for mIRC to handle internally. If a set of ' ' quotes (as in your example) fell across a linebreak, any code would be rendered useless.

One possible solution would be to remove any text that is enclosed in ' ' quotes in one regex, and then deal with the remaining < > brackets in another regex. But then the potential problem pops up when the ' ' quotes show up outside the < > brackets (in plain text), or when a single ' quote shows up within a set of " " quotes (and another ' quote isn't there to match it).
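
One way to sketch the quote-aware idea in a single expression is to treat a tag as a run of ordinary characters plus whole quoted strings, so that a > inside '...' or "..." does not end the tag. This is only a sketch under assumptions: the alias name htmlfreeq is hypothetical, it handles complete tags only (nothing split across lines), and it still breaks when a stray unmatched quote sits inside a tag.

```mirc
alias htmlfreeq {
  ; Match one whole tag: an opening <, then any mix of ordinary characters,
  ; 'single-quoted' values and "double-quoted" values, then the closing >.
  ; A > inside a quoted value therefore no longer ends the tag.
  var %x = $regsubex($1-,/<(?:[^>'"]|'[^']*'|"[^"]*")*>/g,)
  ; Convert non-breaking space entities to plain spaces.
  return $replace(%x,&nbsp;,$chr(32))
}
```

Because the pattern only starts matching at a <, quotes in plain text outside of tags are left alone. On the sample line the whole <sup class='footnote' value='...'> tag should be consumed as one match, leaving 16Here is normal text.[a].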

It seems that this rather simple request has turned into a very complex coding problem. I will try to think about it further.

-genius_at_work

Joined: Oct 2004
Posts: 8,330
Riamus2 Offline OP
Hoopy frood
I actually have very little problem with line length with this particular site/script.


Joined: Oct 2005
Posts: 1,741
G
Hoopy frood
Well, the ultimate goal of this thread would be to 'fix' the widely used $htmlfree alias so that it works better in more situations. We could make it work in this exact situation, but if the site you are using changed its format slightly, your code may not work anymore.

-genius_at_work

Joined: Apr 2004
Posts: 759
M
Hoopy frood
Is there a particular reason why my $nohtml falls short for you?

I suggested it as a supplement rather than a complete replacement. You could still do a $regex check to determine whether the sockread data is something you'd want to $nohtml.

The cartoon jokes about situations almost identical to this one, where parsing nested parentheses is not something you'd do with a finite state automaton (regex), or as a famous saying goes: "I know, I’ll use regular expressions. Now I have two problems". The saying is applied too frequently, but in this situation it surely holds up.

Your best bet at parsing this will be with the IHTMLDocument3 COM interface (also windows 95 and up) which already deals with all kinds of invalid data. I could whip something up that queries all the spans or divs with the classname 'footer' and call innerTEXT on them and call a callback alias you supply with the innerTEXT as $1-.

I'd like to hear your reasons for dismissing a COM based approach first, though, before I spend my time on it.

This guarantees that the text you receive is as you see it in the browser; no regular expression will get you anywhere near the same coverage.

You could, as genius_at_work suggested, of course make it work in this specific case with a decent amount of work, but it would still be a very quirky parser which is too easily broken.

Even if speed is important, the COM approach shouldn't be any slower, and may even be faster.



Joined: Feb 2006
Posts: 546
J
Fjord artisan
* jaytea ignores everything Mpdreamz just said and proposes a regex soltn!

;D

this might be too basic for you Riamus, but the following is just a small expansion on $htmlfree which removes balanced tags nested inside other balanced tags (much like the type of regexes that handle nested parentheses)

Code:
alias htmlfree return $regsubex($1,/^[^<]*>|(<(?:[^<]|(?1))*>)|<.*$/gU,)


use it, build upon it, or ignore it ;p


"The only excuse for making a useless script is that one admires it intensely" - Oscar Wilde