#269474 12/10/21 08:48 AM
Joined: Jan 2021
Posts: 31
Ameglian cow
OP Offline
Thanks to Talon for writing about $urlget() and for showing me another way to read link titles, even on YouTube.
I have copied Talon's code for reading all sorts of links and scripted it so that one link gets extracted when the text contains, for example: text link text.
I have also been reading maroon's guide on removing $cr, $lf and $crlf from the $urlget result, which should mean there is no way to exploit this.

Since Talon and Maroon wrote this, I don't want to take credit for this script.


The new code is:
Code
on *:text:*http*:#: {
  if ($regex($1-,((?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+))) {
    noop $urlget($regml(1),gb,& $+ $ticks,ScrapeWebsiteData) 
  }
} 
alias ScrapeWebsiteData { 
  var %b = $urlget($1).target
  breplace %b 10 32 13 32
  if ($bfind(%b,1,/<title>(.*)<\/title>/i,Title).regex) { var %title = $replace($regml(Title,1),$cr,$chr(32),$lf,$chr(32)) }
  if ($bfind(%b,1,/<meta name="description".*content="([^"]+)"(?:[^>]+)?>/i,Desc).regex) { var %desc = $replace($regml(Desc,1),$cr,$chr(32),$lf,$chr(32)) }

  echo -a 1: Title: %title
  echo -a 2: Description: %desc
}

Joined: Feb 2003
Posts: 2,812
Hoopy frood
Offline
Tip: whenever you see the quantifier .* in your regex pattern, you probably want .*? instead. At least 99% of the time.

.* -- Greedy: gobbles up the entire webpage from beginning to end, and then slowly backtracks one character at a time until it finds a match.
.*? -- Lazy: marches forward one character at a time until it finds a match, in the normal intuitive way you probably imagine it should.

Compare: .*?

//echo -a $regex(blah blah <title>The quick brown fox jumps over the lazy dog.</title>lol sucker!</title> blah blah, <title>(.*?)</title>) -> $regml(1)
1 -> The quick brown fox jumps over the lazy dog.

and: .* without ?

//echo -a $regex(blah blah <title>The quick brown fox jumps over the lazy dog.</title>lol sucker!</title> blah blah, <title>(.*)</title>) -> $regml(1)
1 -> The quick brown fox jumps over the lazy dog.</title>lol sucker!

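You can also add a sanity check that tells the match to give up if the text it would have to capture is unrealistically long. Say, 256 or 512 or 1024 characters. Up to you.

.{1,1024}? behaves like .*? but better: it stops looking after 1024 characters. (Strictly it is a capped .+?, since it needs at least one character; use .{0,1024}? if an empty match is acceptable.)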
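
A quick way to see the cap in action, as an untested sketch. The pattern is stored in a variable first (%re is just a name picked for this example), as a precaution so the comma inside {1,10} is not read as a parameter separator by $regex(). The cap is set to a deliberately tiny 10 characters, so the title here is longer than the cap and the match is meant to fail instead of scanning on.

```mirc
; capped lazy quantifier: at most 10 characters allowed between the tags
//var %re = /<title>(.{1,10}?)<\/title>/ | echo -a $regex(<title>The quick brown fox jumps over the lazy dog.</title>,%re)
```

Raising the cap to something realistic such as 256 should let the same title match again.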

To make the final output variables a sane length as well, you should use $left() to trim them down if they're too long.

var %title = $left(%title,256), %desc = $left(%desc,256)
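
Putting both tips together, the alias from the first post could look roughly like this. It is a sketch, not tested code: the variable names %reTitle and %reDesc and the caps of 1024 and 256 are assumptions chosen for illustration, and the patterns are kept in variables so that the commas in the quantifiers stay out of the $bfind() parameter list.

```mirc
alias ScrapeWebsiteData {
  var %b = $urlget($1).target
  ; turn LF (10) and CR (13) into spaces so the page is one long line
  breplace %b 10 32 13 32
  ; lazy, capped quantifiers instead of the greedy .*
  var %reTitle = /<title>(.{1,1024}?)<\/title>/i
  var %reDesc = /<meta name="description".{1,256}?content="([^"]+)"(?:[^>]+)?>/i
  ; $left() trims whatever was captured down to a sane length
  if ($bfind(%b,1,%reTitle,Title).regex) { var %title = $left($replace($regml(Title,1),$cr,$chr(32),$lf,$chr(32)),256) }
  if ($bfind(%b,1,%reDesc,Desc).regex) { var %desc = $left($replace($regml(Desc,1),$cr,$chr(32),$lf,$chr(32)),256) }

  echo -a 1: Title: %title
  echo -a 2: Description: %desc
}
```

The caps only limit how far the regex will look. The $left() calls are what guarantee the echoed text stays short.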


Well. At least I won lunch.
Good philosophy, see good in bad, I like!