#188505 25/10/07 01:49 AM
Joined: Oct 2007
Posts: 10
M
Pikka bird
OP Offline
Hello, could anyone show me a SIMPLE web grabber that just gets text from a URL?

Joined: Oct 2004
Posts: 8,330
Hoopy frood
Offline
This is a very basic and general script to read data from a webpage.

Code:
alias Web {
  sockopen Web www.website.com 80
}

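on *:sockopen:Web: {
  ; Each sockwrite needs the socket name before the data to send.
  sockwrite -n $sockname GET /path/page.htm HTTP/1.0
  sockwrite -n $sockname Host: www.website.com
  ; The trailing $crlf $+ $crlf ends the headers with a blank line.
  sockwrite -n $sockname Accept: */* $+ $crlf $+ $crlf
}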

on *:sockread:Web: {
  if ($sockerr) {
    echo -a Error.
    halt
  }
  else {
    var %temptext
    sockread %temptext
    echo -a %temptext
  }
}


Invision Support
#Invision on irc.irchighway.net
Joined: Oct 2007
Posts: 10
M
Pikka bird
OP Offline
Thanks!

But is there a way to avoid needing a separate copy of that for each page I need to grab?

Joined: Oct 2004
Posts: 8,330
Hoopy frood
Offline
Huh? That doesn't make sense. Can you explain better what you are looking for? You wanted a basic script to grab a webpage and this is it. Obviously, you will probably want to parse the page for just the relevant data. If there's something specific, please be specific in what you ask and give an example or explanation to help us to understand what you want.

EDIT:
Okay, after re-reading that again, I think I understand what you're looking for. If you aren't going to parse the page's data and just want to get the entire page from multiple sites, you can just put the host and path/page into variables.
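
A minimal sketch of that idea, assuming a hypothetical /webget alias and a socket named webget (neither comes from the script above). The host and path are passed as parameters and stored on the socket with sockmark, so one copy of the events can serve any page. Treat it as a starting point rather than a finished script.

```mirc
; /webget <host> <path>   e.g. /webget www.website.com /path/page.htm
; "webget" is a hypothetical alias and socket name used for illustration.
alias webget {
  if ($2 == $null) { echo -a Usage: /webget <host> <path> | return }
  if ($sock(webget)) { sockclose webget }
  sockopen webget $1 80
  ; Store "host path" on the socket so the events can read it back later.
  sockmark webget $1 $2
}

on *:sockopen:webget:{
  if ($sockerr) { echo -a Could not connect. | return }
  var %host = $gettok($sock($sockname).mark,1,32)
  var %path = $gettok($sock($sockname).mark,2,32)
  sockwrite -n $sockname GET %path HTTP/1.0
  sockwrite -n $sockname Host: %host
  ; The trailing $crlf $+ $crlf ends the headers with a blank line.
  sockwrite -n $sockname Accept: */* $+ $crlf $+ $crlf
}

on *:sockread:webget:{
  if ($sockerr) { echo -a Socket error. | return }
  var %line
  sockread %line
  ; $sockbr is the number of bytes the last sockread returned.
  while ($sockbr) {
    if (%line != $null) { echo -a %line }
    sockread %line
  }
}
```

Because the mark is split on spaces, this sketch assumes the path itself contains no spaces.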

Last edited by Riamus2; 25/10/07 02:05 AM.

Joined: Dec 2002
Posts: 503
B
Fjord artisan
Offline
If you want to, you can have a look at the wwwget.mrc routine.

You just /getdata <url>, and it dumps it locally..

(Yeah, it's old, but the concepts are sound.. I should really rewrite it)

Joined: Oct 2007
Posts: 10
M
Pikka bird
OP Offline
How would I use this? (I'm a newb to coding.)

Like if someone were to type !tellme <something> it would grab website.com/page.php?a=<something> and say it.

Last edited by Moptop650; 25/10/07 07:44 PM.
Joined: Oct 2004
Posts: 8,330
Hoopy frood
Offline
You see, that's where you have to be specific. You asked for a basic script to get an entire webpage. You did not ask for how to get specific information from a specific website and that would be necessary for us to help you other than to say to parse the HTML that you get using things like $regex or $gettok.
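
To give an idea of what those two identifiers do, here is a small sketch run on a made-up line of HTML. The alias name parsetest and the sample text are assumptions for illustration only and are not taken from any script in this thread.

```mirc
; /parsetest - demonstrates $regex with $regml, and $gettok, on sample text
alias parsetest {
  var %line = <title>My Page</title>
  ; $regex returns the number of matches; $regml(1) is the first captured group.
  if ($regex(%line,/<title>(.*?)<\/title>/i)) { echo -a Title: $regml(1) }
  ; $gettok returns the Nth token of a string split on a character code (32 = space).
  echo -a Second word: $gettok(one two three,2,32)
}
```

In a real grabber the same checks would go inside the sockread event, applied to each line as it is read.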


Joined: Oct 2007
Posts: 10
M
Pikka bird
OP Offline
I am using PHP on the website I am grabbing, to limit the received data to JUST what I want. I am loads better at PHP than this, lol.

So anyway, I got it working, sort of. Currently it just outputs the HTTP headers. My code is:

Code:
on *:TEXT:m~test:#:{ 
  echo Connecting... 
  /sockopen home home.moptop.info 80 
} 
on *:sockopen:home:{ 
  echo Trying to communicate... 
  sockwrite -n home GET /index.php HTTP/1.1 
  sockwrite -n home Host: home.moptop.info 
  sockwrite -n home $crlf 
} 
on *:sockread:home:{ 
  echo Echoing Data... 
  sockread %temp 
  echo %temp 
} 


It spits back...

HTTP/1.1 200 OK
Date: (Date of the request)
Server: Apache/2.0.59 (Unix)

And that stuff.

How do I fix that?

Last edited by Moptop650; 26/10/07 01:48 AM.
Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
Offline
The following script will return any line containing the text you put in this section here: if ($regex(%x,/(thistext|orthistext etc... Change those 3 words in between for a wider search if you want, or just replace the whole if statement with if (myword isin %x). The code also removes HTML data (I know you said PHP); just go with it.

Code:
on *:TEXT:m~test:#:{ 
  echo Connecting... 
  /sockopen home home.moptop.info 80 
} 
on *:sockopen:home:{ 
  echo Trying to communicate... 
  sockwrite -n home GET /index.php HTTP/1.1 
  sockwrite -n home Host: home.moptop.info 
  sockwrite -n home $crlf 
} 
on 1:sockread:home:{
  if ($sockerr > 0) return 
  var %x | sockread %x
  if ($sockbr == 0) return 
  if (%x == $null) { return } 
  ; $regex returns the number of matches, so test for non-zero instead of == 1.
  if ($regex(%x,/(thistext|orthistext|oreventhistext)/)) { 
    echo -a $nhtml(%x)
  }
}

alias -l nhtml { return $remove($regsubex($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,),$chr(9)) }


This next code returns the WHOLE matching line without removing the HTML code:

Code:
on *:TEXT:m~test:#:{ 
  echo Connecting... 
  /sockopen home home.moptop.info 80 
} 
on *:sockopen:home:{ 
  echo Trying to communicate... 
  sockwrite -n home GET /index.php HTTP/1.1 
  sockwrite -n home Host: home.moptop.info 
  sockwrite -n home $crlf 
} 
on 1:sockread:home:{
  if ($sockerr > 0) return 
  var %x | sockread %x
  if ($sockbr == 0) return 
  if (%x == $null) { return } 
  ; $regex returns the number of matches, so test for non-zero instead of == 1.
  if ($regex(%x,/(thistext|orthistext|oreventhistext)/)) { 
    echo -a %x
  }
}


Code:
if $reality > $fiction { set %sanity Sane }
Else { echo -a *voices* }
Joined: Feb 2005
Posts: 342
R
Fjord artisan
Offline
Originally Posted By: Moptop650
How do I fix that?



Well, since I can't connect to this website, I can't really test anything. I've modified the code a bit as well.

Code:
alias home {
  if ($sock(home)) { sockclose home }
  echo -s *** Trying to connect to home.moptop.info
  sockopen home home.moptop.info 80
}

on *:sockopen:home:{ 
  if ($sockerr) { echo -s *** Can't connect. | return }
  var %% = sockwrite -n $sockname
  %% GET /index.php HTTP/1.0
  %% Host: home.moptop.info 
  %% 
} 
on *:sockread:home:{
  if ($sockerr) { echo -s *** Sock error. | return }
  var %s | sockread -fn %s
  while ($sockbr) {
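    ; $chr(15) is an invisible formatting-reset code; it keeps the text non-empty when %s is a blank line.
    echo -s $chr(15) $+ %s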
    sockread -fn %s
  }
}

Joined: Oct 2004
Posts: 8,330
Hoopy frood
Offline
Your problem with receiving just the headers is almost guaranteed to be because you used HTTP/1.1 instead of HTTP/1.0 like I showed you. 1.1 gets data in chunks and can make trying to do anything with it a challenge. 1.0 is nice and easy to work with.
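
For reference, a hedged sketch of the request side, reusing the home socket from earlier in the thread. HTTP/1.1 keeps the connection open by default and lets the server send the body with chunked transfer encoding, whereas a 1.0 request gets the body followed by a close. The Connection: close line is an addition for illustration and does not appear in the posts above.

```mirc
on *:sockopen:home:{
  if ($sockerr) { echo -s *** Can't connect. | return }
  ; HTTP/1.0 request: the server sends the page and then closes the connection.
  sockwrite -n $sockname GET /index.php HTTP/1.0
  sockwrite -n $sockname Host: home.moptop.info
  ; Ask explicitly for the connection to be closed after the response.
  sockwrite -n $sockname Connection: close
  ; A blank line ends the request headers.
  sockwrite -n $sockname $crlf
}
```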


Joined: Oct 2007
Posts: 10
M
Pikka bird
OP Offline
Originally Posted By: Rand
Well, since I can't connect to this website, I can't really test anything.


Yeah that URL is linked to my home computer.. Thanks for the code, I'll try that in a few.

Riamus2, I'll try that too. Edit: Eh, same thing =/

Edit: Rand, it works! It gets the content, but the headers are still there. How can I remove them?

Edit also: How can I get it to say its results to the channel it was called from?

Last edited by Moptop650; 26/10/07 08:16 PM.
Joined: Feb 2005
Posts: 342
R
Fjord artisan
Offline
Originally Posted By: Moptop650
Edit: Rand, It works! It gets the content, but the headers are still there, how can I remove them?

Edit also: How can I get it to say its results to the channel it was called from?


Well, I'll edit this in a bit if you don't figure out how to queue messages so that you don't get flooded off the channel. For now, I need to nap, so you'll have to deal with a partial edit. :)

Code:
alias home {
  ; /home <chan|nick>
  if (!$1) { echo -a *** Invalid parameters. /home <chan|nick> | return }
  if ($sock(home)) { sockclose home }
  unset %home.*

  echo -s *** Trying to connect to home.moptop.info

  sockopen home home.moptop.info 80
  sockmark home $1
}

on *:sockopen:home:{ 
  if ($sockerr) { echo -s *** Can't connect. | return }
  var %% = sockwrite -n $sockname
  %% GET /index.php HTTP/1.0
  %% Host: home.moptop.info 
  %% 
} 

on *:sockread:home:{
  if ($sockerr) { echo -s *** Sock error. | return }
  var %s | sockread -fn %s
  while ($sockbr) {
    if (%home.headers) {
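      ; $chr(15) is an invisible formatting-reset code; it keeps the target and the text separate.
      msg $sock($sockname).mark $chr(15) $+ %s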
    }
    elseif (%s == $null) { set %home.headers 1 }
    sockread -fn %s
  }
}
on *:sockclose:home:{ unset %home.* }

Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
Offline
Try this. It removes the headers while still using HTTP/1.1, and it displays only the text from the webpage with no HTML. If you want the HTML too, let me know; otherwise this parses the data as plain text only, with no HTML tags.

Code:
alias home { 
  echo -s *** Trying to connect to home.moptop.info 
  sockopen home home.moptop.info 80 
} 

on *:sockopen:home:{ 
  echo Trying to communicate... 
  sockwrite -n home GET / HTTP/1.1 
  sockwrite -n home Host: home.moptop.info
  sockwrite -n home $crlf 
} 

on 1:sockread:home:{
  if ($sockerr > 0) return 
  var %x | sockread -fn %x
  if ($right($gettok(%x,1,32),1) == :) || (http/1.1 == $gettok(%x,1,32)) { return }
  elseif ($nhtml(%x) == $null) { return }
  else {
    echo -a $nhtml(%x)
  }
}

alias -l nhtml { return $remove($regsubex($1-,/(&nbsp;|^[^<]*>|<[^>]*>|<[^>]*$)/g,),$chr(9)) }


Joined: Oct 2007
Posts: 10
M
Pikka bird
OP Offline
That doesn't seem to want to return the data.

Quote:
*** Trying to connect to home.moptop.info
Trying to communicate...


Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
Offline
Yeah, that's because you have 0 HTML and 1 word.

Try adding a few words and HTML tags, like this:

<HTML>
<TITLE>MYTEST</TITLE>
<FONT COLOR="RED">word</FONT>
</HTML>

It should then pick up the proper data.

If you want a better test, replace home.moptop.info with www.mirc.com and try the event; it will return the text only.


Joined: Oct 2007
Posts: 10
M
Pikka bird
OP Offline
Yay that works! ^_^

Thanks for the help everyone!

Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
Offline
:P mIRC is stubborn about fetching data if it is just 1 word or very little text, for some reason, maybe almost all the time when there are no HTML tags in it. Don't forget that a true website would contain HTML tags even if it's coded in PHP. Maybe an alternative route to fetching word data would be to have a .txt file online, like

home.moptop.info/mytest.txt :P


Joined: Oct 2007
Posts: 10
M
Pikka bird
OP Offline
Yes, I know. :) I plan to only use this to grab stuff from my site. (The scripts on my site are made for this; they get stuff from other sites.)

Joined: Oct 2004
Posts: 8,330
Hoopy frood
Offline
Originally Posted By: Lpfix5
:P mIRC is stubborn about fetching data if it is just 1 word or very little text, for some reason, maybe almost all the time when there are no HTML tags in it


I haven't had any issues where a txt file only has a version # as an update check without HTML or other text. Use HTTP/1.0 as mentioned and there should be no problem with that. HTTP/1.1 shouldn't really be used in most cases, as HTTP/1.0 usually works better for grabbing socket data.

