Regex maybe? #129917 11/09/05 06:03 AM
HAMM3R (OP), Vogon poet
Joined: Jun 2005
Posts: 127
Hey. Through sockets I have acquired the source of a CNN webpage (http://cnn.com/WORLD, to be exact). The source is saved in %d. I've been trying to figure out how to extract the top headline and its link, so that I can message a channel the headline and the link and people can read the article. This is the markup that contains those items:

<div class="cnnT1Hd"><h2><a href="*">*</a></h2></div>

The first wildcard contains the link, the second is the headline text. For example, this is the one from when I began working on the script:

<div class="cnnT1Hd"><h2><a href="/2005/WORLD/meast/09/10/iraq.main/index.html">Tal Afar drive targets insurgents</a></h2></div>

So, how can I read those wildcards so I can msg them to a channel? I think this involves $regex, which I have no knowledge of. Can someone help me? TIA!

Thanks,
Austin


-- HAMM3R (aka: alhammer)
http://www.HAMM3R.net
Re: Regex maybe? #129918 11/09/05 07:11 AM
RusselB, Hoopy frood
Joined: Aug 2004
Posts: 7,252
Or use tokens:
Code:
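; Token 4, splitting on the double quote (ASCII 34), is the href value
set %site $gettok(%d,4,34)
; Token 5 starts with ">" and runs on past "</a>": cut at "<" (ASCII 60), then drop the leading ">"
set %headline $right($gettok($gettok(%d,5,34),1,60),-1)
.msg #Channel %site %headline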

With the example line you posted, those two will work.
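To see why those token numbers line up, here is a minimal sketch that runs the same calls on the sample line from the first post. The alias name cnn.tokdemo is made up for illustration, and it assumes the whole div sits on one line with exactly this quoting.

```mirc
; Hypothetical demo alias, run it with /cnn.tokdemo
alias cnn.tokdemo {
  ; Sample line copied from the first post (assumed to arrive as a single line)
  var %d = <div class="cnnT1Hd"><h2><a href="/2005/WORLD/meast/09/10/iraq.main/index.html">Tal Afar drive targets insurgents</a></h2></div>
  ; Split on the double quote (ASCII 34): token 2 is the class name, token 4 is the href value
  var %site = $gettok(%d,4,34)
  ; Token 5 is the text after the closing quote: keep what comes before the first "<" (ASCII 60), then drop the leading ">"
  var %headline = $right($gettok($gettok(%d,5,34),1,60),-1)
  echo -a Link: %site
  echo -a Headline: %headline
}
```

If the line ever carries an extra quoted attribute before the href, the token numbers shift, which is the main weakness of this approach.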

Re: Regex maybe? #129919 11/09/05 08:03 AM
Om3n, Fjord artisan
Joined: Jul 2003
Posts: 655
$regex can be used to match the string, but on its own it only returns the number of matches, not the matched text. $regsub can be used to strip out HTML tags ( var %re = <[^<>]*>, var %count = $regsub($1-,%re,,%out) ). But an a href is considered an HTML tag too and would be stripped out, so the result of the $regsub would be the headline only. This can of course be modified to return both URL and headline (see below), but tokens are then still needed on the output to separate the two.

If the format is always consistent, as above, using tokens is a good idea.

In any case... This will return the URL and headline. Syntax: $cnnsub($1-), where $1- is the HTML code like what you posted above.
Code:
alias cnnsub {
  var %text = $1-, %re = <[^/^<>]*>|</[^/^<>]*>|<a\shref\=\"|\"\>
  while ($regsub(%text, %re, ,%text) != 0) {
    %test = $regsub(%text, %re, $chr(32),%text)
  }
  return %text
}

; somewhere in your code.. where $1- in the cnnsub is the original html data.
var %urlheadline = $cnnsub($1-), %urlprefix = http://www.cnn.com
msg $target URL: %urlprefix $+ $gettok(%urlheadline, 1, 32) HEADLINE: $gettok(%urlheadline, 2-, 32)

As you can see, this is an unnecessarily complex way to do it. So use tokens where possible.

Last edited by Om3n; 11/09/05 08:13 AM.

"Allen is having a small problem and needs help adjusting his attitude" - Flutterby
Re: Regex maybe? #129920 11/09/05 12:01 PM
Kelder, Hoopy frood
Joined: Apr 2003
Posts: 701
I very much doubt %d contains the entire web page. I just checked, and a complete line might be more than 800 characters long or be split across different reads. But suppose you can somehow get the entire line you need into a single %var:

Code:
if ($regex(%d,/<div class="cnnT1Hd"><h2><a href="(.*?)">(.*?)<\/a><\/h2><\/div>/i)) {
  var %url = $regml(1)
  var %headline = $regml(2)
}

PS: I got this just now as the only match for cnnT1Hd:
<div class="cnnT1"> <div class="cnnT1Blurb"><div class="cnnT1HdLS"><h2><a href="/2005/US/09/10/katrina.impact/index.html">New Katrina chief sets goals</a></h2></div><p>
The extra LS in the class name means this line will not match the regex above...
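One way around the changing class name is to let the pattern accept any suffix after cnnT1Hd. The following is only a sketch: the alias name cnn.top is hypothetical, and it still assumes the whole div arrives in a single string.

```mirc
; Hypothetical helper: $cnn.top(<html line>) returns "<url> <headline>", or nothing if there is no match
alias cnn.top {
  ; [^"]* after cnnT1Hd accepts cnnT1Hd, cnnT1HdLS and any other suffix inside the quotes
  var %re = /<div class="cnnT1Hd[^"]*"><h2><a href="(.*?)">(.*?)<\/a><\/h2><\/div>/i
  if ($regex($1-,%re)) {
    ; $regml(1) is the href value, $regml(2) is the headline text
    return $regml(1) $regml(2)
  }
}
```

Calling it as var %r = $cnn.top(%d) and then reading $gettok(%r,1,32) for the link and $gettok(%r,2-,32) for the headline keeps the two parts separate, since the link contains no spaces.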

PS2: Ever heard of RSS? http://rss.cnn.com/rss/cnn_latest.rss for example?

Re: Regex maybe? #129921 11/09/05 01:28 PM
Om3n, Fjord artisan
Joined: Jul 2003
Posts: 655
My original code also used a single $regex with $regml, but for some reason it wasn't filling $regml, so it returned nothing. Maybe mIRC just handles back references a little differently from what I am used to in other languages.


Re: Regex maybe? #129922 11/09/05 02:05 PM
HAMM3R (OP), Vogon poet
Joined: Jun 2005
Posts: 127
Yeah, I originally wanted to try RSS. I downloaded an RSS reader here, but I couldn't find a way to stop it from showing all the stories. I only wanted it to show one headline, or ideally to have a -cN switch on the command to choose how many to display, so I couldn't figure out how to keep it from flooding the channel. By the way, the news is triggered by a !news command. Any idea how I can limit how many headlines are displayed from the RSS feed?

Thanks,
Austin


EDIT: I just ran into this, which seems to be just about exactly what I'm looking for. I'm going to play around with it for a bit and see what I can come up with.

Last edited by HAMM3R; 11/09/05 02:10 PM.

Re: Regex maybe? #129923 11/09/05 02:15 PM
Kelder, Hoopy frood
Joined: Apr 2003
Posts: 701
You probably have a while loop going over all the lines of the file, or you handle each line directly as you /sockread it.

Each time you find a news item to send to the channel, add this:
inc %cnnnews.counter
and add an extra check
if (%cnnnews.counter <= 3) {
; msg #channel News headline: %d
}
around the part that sends a message to the channel.

Make sure to /unset %cnnnews.counter before you open the socket, or you won't get any news the second time :)
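Put together, the counter idea might look roughly like the sketch below. The socket name cnn, the alias name cnnnews, the channel #channel and the cnnT1Hd isin test are all placeholders rather than parts of your script, and the on sockopen event that sends the HTTP request is left out because you already have one.

```mirc
; Hypothetical entry point, e.g. called from your !news trigger
alias cnnnews {
  ; Reset the counter first, otherwise nothing is shown the second time
  unset %cnnnews.counter
  sockopen cnn www.cnn.com 80
}

on *:sockread:cnn:{
  if ($sockerr) return
  var %d
  sockread %d
  ; $sockbr is 0 once no complete line is left to read
  while ($sockbr) {
    ; Placeholder check for "this line holds a headline"
    if (cnnT1Hd isin %d) {
      inc %cnnnews.counter
      ; Only the first three hits reach the channel
      if (%cnnnews.counter <= 3) msg #channel News headline: %d
    }
    sockread %d
  }
}
```

Changing the 3 to 1 gives the single-headline behaviour, and the same limit could be passed in as a parameter if you want something like a -cN switch.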

Re: Regex maybe? #129924 11/12/05 07:37 PM
FiberOPtics, Hoopy frood
Joined: Feb 2004
Posts: 2,019
You can use my Universal RSS Headline Retriever on most RSS News sites including CNN. It's just 1 alias, which fills the hash table that you specified with all the headlines. I've added an alias /listheadlines that allows you to output the Nth headline or the headlines within a range N-M.


Gone.