#188505 25/10/07 01:49 AM
Moptop650
Hello, could anyone show me a SIMPLE web grabber that just gets the text from a URL?

#188506 25/10/07 01:58 AM
Joined: Oct 2004
Posts: 8,061
R
Hoopy frood
This is a very basic and general script to read data from a webpage.

Code:
alias Web {
  sockopen Web www.website.com 80
}

on *:sockopen:Web: {
  sockwrite -n $sockname GET /path/page.htm HTTP/1.0
  sockwrite -n $sockname Host: www.website.com
  sockwrite -n $sockname Accept: */* $+ $crlf $+ $crlf
}

on *:sockread:Web: {
  if ($sockerr) {
    echo -a Error.
    halt
  }
  else {
    var %temptext
    sockread %temptext
    echo -a %temptext
  }
}

Moptop650
Thanks!

But is there a way that I don't need another copy of that per page I need to grab?

#188510 25/10/07 02:04 AM
Joined: Oct 2004
Posts: 8,061
R
Hoopy frood
Huh? That doesn't make sense. Can you explain better what you are looking for? You wanted a basic script to grab a webpage and this is it. Obviously, you will probably want to parse the page for just the relevant data. If there's something specific, please be specific in what you ask and give an example or explanation to help us to understand what you want.

EDIT:
Okay, after re-reading that, I think I understand what you're looking for. If you aren't going to parse the page's data and just want to get the entire page from multiple sites, you can just put the host and path/page into variables.
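
A minimal sketch of the "host and path in variables" idea, not a tested implementation. The alias name /grab, the grab.* socket names and the use of the socket mark to carry the host and path are all assumptions made for illustration. It also assumes two requests are never started in the same millisecond, since $ticks is used to make the socket name unique.

```mirc
; /grab <host> <path>   e.g. /grab www.website.com /path/page.htm
alias grab {
  if ($2 == $null) { echo -a *** Usage: /grab <host> <path> | return }
  ; hypothetical naming scheme: one socket per request
  var %name = grab. $+ $ticks
  sockopen %name $1 80
  ; remember the host and path on the socket itself
  sockmark %name $1 $2
}

on *:sockopen:grab.*:{
  if ($sockerr) { echo -a *** Could not connect. | return }
  var %host = $gettok($sock($sockname).mark,1,32)
  var %path = $gettok($sock($sockname).mark,2,32)
  sockwrite -n $sockname GET %path HTTP/1.0
  sockwrite -n $sockname Host: %host
  ; blank line that ends the request headers
  sockwrite -n $sockname $crlf
}

on *:sockread:grab.*:{
  if ($sockerr) { echo -a *** Socket error. | return }
  var %line
  ; -f also reads a last line that has no trailing CRLF
  sockread -f %line
  while ($sockbr) {
    if (%line != $null) { echo -a %line }
    sockread -f %line
  }
}
```

With this, one set of events serves any page: /grab www.website.com /path/page.htm and /grab www.mirc.com / both go through the same code.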

Last edited by Riamus2; 25/10/07 02:05 AM.
Joined: Dec 2002
Posts: 503
B
Fjord artisan
If you want to, you can have a look at the wwwget.mrc routine.

You just /getdata <url>, and it dumps it locally..

(Yeah, it's old, but the concepts are sound.. I should really rewrite it)

Moptop650
How would I use this? (I'm a newb to coding.)

Like if someone were to type !tellme <something> it would grab website.com/page.php?a=<something> and say it.

Last edited by Moptop650; 25/10/07 07:44 PM.
#188579 25/10/07 11:51 PM
Joined: Oct 2004
Posts: 8,061
R
Hoopy frood
You see, that's where you have to be specific. You asked for a basic script to get an entire webpage. You did not ask for how to get specific information from a specific website and that would be necessary for us to help you other than to say to parse the HTML that you get using things like $regex or $gettok.

Moptop650
I am using PHP on the website I am grabbing, to limit the received data to JUST what I want. I am loads better at PHP than this, lol.

So anyways, I got it working sorta. Currently it just outputs the http headers - my code is-

Code:
on *:TEXT:m~test:#:{ 
  echo Connecting... 
  /sockopen home home.moptop.info 80 
} 
on *:sockopen:home:{ 
  echo Trying to communicate... 
  sockwrite -n home GET /index.php HTTP/1.1 
  sockwrite -n home Host: home.moptop.info 
  sockwrite -n home $crlf 
} 
on *:sockread:home:{ 
  echo Echoing Data... 
  sockread %temp 
  echo %temp 
} 


It spits back...

HTTP/1.1 200 OK
Date: (Date of the request)
Server: Apache/2.0.59 (Unix)

And that stuff.

How do I fix that?

Last edited by Moptop650; 26/10/07 01:48 AM.
#188588 26/10/07 03:31 AM
Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
The following script echoes only the lines that match the pattern in the line that starts with if ($regex(%x,/(thistext|orthistext|... Change those three words to whatever you want to find, adding more alternatives in between for a wider search, or just replace the whole if statement with if (myword isin %x). The code also strips the HTML tags from each line (I know you said PHP, just go with it).

Code:
on *:TEXT:m~test:#:{ 
  echo Connecting... 
  /sockopen home home.moptop.info 80 
} 
on *:sockopen:home:{ 
  echo Trying to communicate... 
  sockwrite -n home GET /index.php HTTP/1.1 
  sockwrite -n home Host: home.moptop.info 
  sockwrite -n home $crlf 
} 
on 1:sockread:home:{
  if ($sockerr > 0) return 
  var %x | sockread %x
  if ($sockbr == 0) return 
  if (%x == $null) { return } 
  ; without /g and == 1, a line with several matches still passes
  if ($regex(%x,/(thistext|orthistext|oreventhistext)/)) { 
    echo -a $nhtml(%x)
  }
}

alias -l nhtml { return $remove($regsubex($1-,/(^[^<]*>|<[^>]*>|<[^>]*$)/g,),$chr(9)) }


This next code returns the WHOLE line of text that matches your search, without removing the HTML code...

Code:
on *:TEXT:m~test:#:{ 
  echo Connecting... 
  /sockopen home home.moptop.info 80 
} 
on *:sockopen:home:{ 
  echo Trying to communicate... 
  sockwrite -n home GET /index.php HTTP/1.1 
  sockwrite -n home Host: home.moptop.info 
  sockwrite -n home $crlf 
} 
on 1:sockread:home:{
  if ($sockerr > 0) return 
  var %x | sockread %x
  if ($sockbr == 0) return 
  if (%x == $null) { return } 
  ; without /g and == 1, a line with several matches still passes
  if ($regex(%x,/(thistext|orthistext|oreventhistext)/)) { 
    echo -a %x
  }
}

#188591 26/10/07 06:00 AM
Joined: Feb 2005
Posts: 342
R
Fjord artisan
Well, since I can't connect to this website, I can't really test anything. I've modified the code a bit as well.

Code:
alias home {
  if ($sock(home)) { sockclose home }
  echo -s *** Trying to connect to home.moptop.info
  sockopen home home.moptop.info 80
}

on *:sockopen:home:{ 
  if ($sockerr) { echo -s *** Can't connect. | return }
  var %% = sockwrite -n $sockname
  %% GET /index.php HTTP/1.0
  %% Host: home.moptop.info 
  %% 
} 
on *:sockread:home:{
  if ($sockerr) { echo -s *** Sock error. | return }
  var %s | sockread -fn %s
  while ($sockbr) {
    ; $chr(15) (Ctrl+O) gives /echo something to print when the line is empty
    echo -s $+($chr(15),%s)
    sockread -fn %s
  }
}

#188595 26/10/07 10:11 AM
Joined: Oct 2004
Posts: 8,061
R
Hoopy frood
Your problem with receiving just the headers is almost guaranteed to be because you used HTTP/1.1 instead of HTTP/1.0 like I showed you. With 1.1 the server can send the body in chunks (chunked transfer encoding), which can make trying to do anything with it a challenge. 1.0 is nice and easy to work with.

Moptop650
Originally Posted By: Rand
Well, since I can't connect to this website, I can't really test anything.


Yeah that URL is linked to my home computer.. Thanks for the code, I'll try that in a few.

Riamus2, I'll try that too. Edit: Eh, same thing =/

Edit: Rand, It works! It gets the content, but the headers are still there, how can I remove them?

Edit also: How can I get it to say its results to the channel it was called from?

Last edited by Moptop650; 26/10/07 08:16 PM.
#188620 26/10/07 09:32 PM
Joined: Feb 2005
Posts: 342
R
Fjord artisan
Well.. I'll edit this in a bit if you don't figure out how to queue messages so that you don't flood yourself off the channel. For now, I need to nap, so you'll have to deal with a partial edit. :)

Code:
alias home {
  ; /home <chan|nick>
  if (!$1) { echo -a *** Invalid parameters. /home <chan|nick> | return }
  if ($sock(home)) { sockclose home }
  unset %home.*

  echo -s *** Trying to connect to home.moptop.info

  sockopen home home.moptop.info 80
  sockmark home $1
}

on *:sockopen:home:{ 
  if ($sockerr) { echo -s *** Can't connect. | return }
  var %% = sockwrite -n $sockname
  %% GET /index.php HTTP/1.0
  %% Host: home.moptop.info 
  %% 
} 

on *:sockread:home:{
  if ($sockerr) { echo -s *** Sock error. | return }
  var %s | sockread -fn %s
  while ($sockbr) {
    if (%home.headers) {
      ; $chr(15) (Ctrl+O) gives /msg something to send when the line is empty
      msg $sock($sockname).mark $+($chr(15),%s)
    }
    elseif (%s == $null) { set %home.headers 1 }
    sockread -fn %s
  }
}
on *:sockclose:home:{ unset %home.* }
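
For the !tellme <something> trigger described earlier in the thread, the same approach could be wired to a channel event roughly as follows. This is only a sketch under assumptions: the socket name tellme, the use of the socket mark to carry the channel and the search term, and the "body" flag appended to the mark are all made up for illustration. website.com/page.php?a= is the placeholder URL from the question, the term is sent without any URL encoding, and nothing here queues the output, so a long page could still flood the channel.

```mirc
on *:TEXT:!tellme *:#:{
  if ($sock(tellme)) { sockclose tellme }
  sockopen tellme website.com 80
  ; mark holds: <channel to answer in> <search term>
  sockmark tellme $chan $2
}

on *:sockopen:tellme:{
  if ($sockerr) { return }
  var %term = $gettok($sock($sockname).mark,2,32)
  sockwrite -n $sockname GET /page.php?a= $+ %term HTTP/1.0
  sockwrite -n $sockname Host: website.com
  ; blank line that ends the request headers
  sockwrite -n $sockname $crlf
}

on *:sockread:tellme:{
  if ($sockerr) { return }
  var %chan = $gettok($sock($sockname).mark,1,32)
  var %s
  sockread -f %s
  while ($sockbr) {
    if ($gettok($sock($sockname).mark,3,32) == body) {
      ; past the headers: relay non-empty lines to the channel
      if (%s != $null) { msg %chan %s }
    }
    ; the first empty line marks the end of the headers
    elseif (%s == $null) { sockmark $sockname $sock($sockname).mark body }
    sockread -f %s
  }
}
```

Keeping the channel in the socket mark rather than in a global variable means the reply target travels with the socket and is cleaned up when the socket closes.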

#188643 27/10/07 02:08 AM
Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
Try this. It removes the headers even when using HTTP/1.1, and it displays only the text from the webpage, with no HTML. If you want the HTML as well, let me know; otherwise this parses the data as plain text only, with no HTML tags.

Code:
alias home { 
  echo -s *** Trying to connect to home.moptop.info 
  sockopen home home.moptop.info 80 
} 

on *:sockopen:home:{ 
  echo Trying to communicate... 
  sockwrite -n home GET / HTTP/1.1 
  sockwrite -n home Host: home.moptop.info
  sockwrite -n home $crlf 
} 

on 1:sockread:home:{
  if ($sockerr > 0) return 
  var %x | sockread -fn %x
  if ($right($gettok(%x,1,32),1) == :) || (http/1.1 == $gettok(%x,1,32)) { return }
  elseif ($nhtml(%x) == $null) { return }
  else {
    echo -a $nhtml(%x)
  }
}

alias -l nhtml { return $remove($regsubex($1-,/(&nbsp;|^[^<]*>|<[^>]*>|<[^>]*$)/g,),$chr(9)) }

Moptop650
That doesn't seem to want to return the data.

Quote:
*** Trying to connect to home.moptop.info
Trying to communicate...


#188650 27/10/07 02:25 AM
Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
Yeah, that's because your page has no HTML and only one word.

try adding a few words and html tags like this

<HTML>
<TITLE>MYTEST</TITLE>
<FONT COLOR="RED">word</FONT>
</HTML>

it should pick up proper data

If you want a better test, replace home.moptop.info with www.mirc.com and try the event; it will return the text only.

Moptop650
Yay that works! ^_^

Thanks for the help everyone!

#188653 27/10/07 02:40 AM
Joined: Aug 2005
Posts: 1,052
L
Hoopy frood
:P mIRC is stubborn about fetching data when the page is just one word or very little text, for some reason, and maybe nearly always when there are no HTML tags in it. Don't forget that a real website would contain HTML tags even if it's coded in PHP. An alternative route for fetching plain words would be to put a .txt file online, like

home.moptop.info/mytest.txt :P

Moptop650
Yes, I know :) I plan to only use this to grab stuff from my site. (The scripts on my site are made for this; they get stuff from other sites.)

Joined: Oct 2004
Posts: 8,061
R
Hoopy frood
Originally Posted By: Lpfix5
:P mIRC is stubborn about fetching data when the page is just one word or very little text, for some reason, and maybe nearly always when there are no HTML tags in it


I haven't had any issues where a txt file contains only a version number for an update check, with no HTML or other text. Use HTTP/1.0 as mentioned and there should be no problem with that. HTTP/1.1 shouldn't really be used in most cases, as HTTP/1.0 usually works better for grabbing socket data.
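
A rough sketch of such an update check, assuming a hypothetical /version.txt on www.website.com that contains nothing but a version number. The alias name /vercheck and the variable %vercheck.last are made up for illustration. The -f switch on sockread matters here: it makes mIRC read whatever is in the receive buffer even when the data does not end in a CRLF, which is typically the case for a one-word file or page.

```mirc
alias vercheck {
  if ($sock(vercheck)) { sockclose vercheck }
  unset %vercheck.last
  sockopen vercheck www.website.com 80
}

on *:sockopen:vercheck:{
  if ($sockerr) { echo -a *** Could not connect. | return }
  sockwrite -n $sockname GET /version.txt HTTP/1.0
  sockwrite -n $sockname Host: www.website.com
  ; blank line that ends the request headers
  sockwrite -n $sockname $crlf
}

on *:sockread:vercheck:{
  if ($sockerr) { echo -a *** Socket error. | return }
  var %s
  ; -f forces the read even without a trailing CRLF
  sockread -f %s
  while ($sockbr) {
    ; keep the most recent non-empty line; for a one-line file that is the body
    if (%s != $null) { set %vercheck.last %s }
    sockread -f %s
  }
}

on *:sockclose:vercheck:{
  ; an HTTP/1.0 server closes the connection once the file has been sent
  echo -a *** Latest version: %vercheck.last
  unset %vercheck.last
}
```

Waiting for the sockclose event before using the value relies on the server closing the connection after the response, which is the usual HTTP/1.0 behaviour.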
