Regex question #208398 20/01/09 01:43 AM
Joined: Oct 2004 Posts: 8,061 MA, USA R Riamus2 OP Hoopy frood
Hi guys... been a while. I ran into a problem fixing a regex issue in a script for someone. The problem is that I know very little about regex (I can sometimes understand the basic meaning if I see it, but I can't write it and don't know anything complex).

Anyhow, a friend has a script that searches quotes at imdb. It works fine except for one small thing: some change at the site broke the actual search part.

Example search results page: http://www.imdb.com/find?q=Family%20Guy;s=tt

Current Regex:

Code:

%regexp = /<a href="\x2ftitle\x2ftt(\d+)\x2f[^"]*"(?: onclick="set_args\x28'tt\1'\x2c1,1\x29")?>([^<]+)</a>/i

One section of a long line that is stored in a variable:

Code:

<a href="/title/tt0847163/" onclick="(new Image()).src='/rg/find-title-1/title_exact/images/b.gif?link=/title/tt0847163/';">Family Guy</a> (2006) (VG) </td></tr></table> </p> <p>

It used to create a match (multiple depending on the results), but no longer matches. From my limited understanding, I'm assuming it's the part in the second half of the regex that's failing to match.

The final results are stored using:

Code:

$nohtml($regml(imdb,1) $regml(imdb,2))

... where $nohtml is a standard html stripping regex identifier.

Any help figuring out why it's not matching would be appreciated. Thanks.

Re: Regex question Riamus2 #208403 20/01/09 05:05 AM
Joined: Oct 2003 Posts: 3,641 Montreal, QC, Canada A argv0 Hoopy frood
A simple examination of the regex and the example shows that they're nothing alike in the "onclick" portion of the match.

Re: Regex question argv0 #208441 20/01/09 09:04 PM
Joined: Oct 2004 Posts: 8,061 MA, USA R Riamus2 OP Hoopy frood
Yeah, I figured it was the second part. I just don't know quite how to edit it to match that correctly.
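For readers following along, one way to relax the old pattern is to accept any onclick value instead of the specific set_args(...) call. The sketch below uses Python's re module, whose syntax is close to the PCRE syntax used by $regex for these constructs; the variable names and the relaxed pattern are illustrative assumptions, not part of the original script.

```python
import re

# Sample anchor taken from the first post.
line = ('<a href="/title/tt0847163/" onclick="(new Image()).src='
        "'/rg/find-title-1/title_exact/images/b.gif?link=/title/tt0847163/';\">"
        'Family Guy</a> (2006) (VG) </td></tr></table> </p> <p>')

# Old pattern from the script: the optional group only tolerates
# onclick="set_args('tt<id>',1,1)", so the new onclick value blocks the match.
old = re.compile(
    r'''<a href="\x2ftitle\x2ftt(\d+)\x2f[^"]*"(?: onclick="set_args\x28'tt\1'\x2c1,1\x29")?>([^<]+)</a>''',
    re.I)

# Relaxed pattern (an assumption): accept any onclick value that contains
# no double quote, then capture the link text as before.
new = re.compile(
    r'<a href="/title/tt(\d+)/[^"]*"(?: onclick="[^"]*")?>([^<]+)</a>',
    re.I)

old_match = old.search(line)
new_match = new.search(line)

print(old_match)            # the old pattern finds nothing
print(new_match.groups())   # title number and link text
```

The two captures keep their positions, so code that reads capture 1 as the tt number and capture 2 as the title would not need to change.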

Re: Regex question Riamus2 #208718 27/01/09 03:15 PM
Joined: Oct 2004 Posts: 8,061 MA, USA R Riamus2 OP Hoopy frood
So, anyone have any suggestions for this? I know there used to be a lot here who knew regex... where'd they go?

Re: Regex question Riamus2 #208722 27/01/09 04:41 PM
Joined: Nov 2006 Posts: 1,552 Germany Horstl Hoopy frood

Hm, to capture the highlighted parts of:
<a href="/title/tt0847163/" onclick="(new Image()).src='/rg/find-title-1/title_exact/images/b.gif?link=/title/tt0847163/';">Family Guy</a> (2006) (VG) </td></tr></table> </p> <p>

I played with:

Code:

  var %line = <a href="/title/tt0847163/" onclick="(new Image()).src='/rg/find-title-1/title_exact/images/b.gif?link=/title/tt0847163/';">Family Guy</a> (2006) (VG) </td></tr></table> </p> <p>
  var %reg = /<a href="\/title\/tt(\d+)\/" onclick=".+\/title_exact.+\?link=\/title\/tt(\d+)\/';">([^<]+)</a>/i
  if ($regex(%line,%reg)) { echo -a 1st linkN: $regml(1) 2nd linkN: $regml(2) Title: $regml(3) }

Result: 1st linkN: 0847163 2nd linkN: 0847163 Title: Family Guy

The second capture may or may not be redundant...
I don't know what you mean by "multiple" matches though. All the "exact matches"? And if so, with or without the "(2006) (VG)" part in the Family Guy example?

____ EDIT:

Managed to get the individual "popular" and "exact matches" with this snippet, but it's pretty ugly (especially the whiles):

Code:

; /socktest <search string>

alias socktest {
  if ($sock(imdb)) sockclose imdb
  sockopen imdb imdb.com 80
  sockmark imdb $replace($$1-,$chr(32),$(%20,0))
  .timer 1 10 sockclose imdb
}

on *:sockopen:imdb: {
  sockwrite -n $sockname GET $+(/find?q=,$sock($sockname).mark,;s=tt) HTTP/1.0
  sockwrite -n $sockname Host: www.imdb.com
  sockwrite $sockname $crlf
}

on *:sockread:imdb: {
  if ($sockerr) { echo -a error | return }
  var %read
  sockread %read
  if ($gettok(%read,1-2,32) == <p><b>Popular Titles</b>) { 

    var %reg = /\/find-title-(\d+)\/title_popular\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)(?=<br>|<\/td>)/gi
    noop $regex(imdb,%read,%reg)
    if ($regml(imdb,1)) ECHO -a POPULAR MATCHES
    var %n = 1
    while ($regml(imdb,%n)) {
      ECHO -a No.: $regml(imdb,%n) LinkNo: $regml(imdb,$calc(%n +1)) $&
        Text: $regsubex( $regsubex($regml(imdb,$calc(%n +2)),/<.+?>/g,$null) ,/&#(\d+);/g,$chr(\1))
      inc %n 3
    }

    var %reg = /\/find-title-(\d+)\/title_exact\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)(?=<br>|<\/td>)/gi
    noop $regex(imdb,%read,%reg)
    if ($regml(imdb,1)) ECHO -a EXACT MATCHES
    var %n = 1
    while ($regml(imdb,%n)) {
      ECHO -a No.: $regml(imdb,%n) LinkNo: $regml(imdb,$calc(%n +1)) $&
        Text: $regsubex( $regsubex($regml(imdb,$calc(%n +2)),/<.+?>/g,$null) ,/&#(\d+);/g,$chr(\1))
      inc %n 3
    }
    sockclose imdb
  }
}

Hope it's a step in the right direction

I'd display the exact matches (if any), followed by the popular matches (if any)...

Note that this snippet doesn't use binvars, but you'll need them, and I assume the script already uses 'em. Otherwise some of the results will be cut off.

(The two $regsubex calls strip the html tags from the "text" part and convert numeric character codes like &#NN; to plain characters.)
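As an illustration of that clean-up step, here is a minimal Python sketch of the same two substitutions: first drop anything that looks like a tag, then turn numeric character references into characters. The function name and the sample fragment are made up for the example.

```python
import re

def clean_title(fragment: str) -> str:
    """Strip HTML tags, then decode numeric references such as &#38;."""
    # Same idea as the inner $regsubex: remove <...> non-greedily.
    no_tags = re.sub(r'<.+?>', '', fragment)
    # Same idea as the outer $regsubex: &#NN; becomes the character NN.
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), no_tags)

# Hypothetical fragment, only to exercise both substitutions.
cleaned = clean_title('<b>Family Guy</b> (2006) &#38; more')
print(cleaned)
```

The order matters a little: decoding first could create new "<" characters from references like &#60;, which the tag stripper would then eat.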

Re: Regex question Horstl #208742 27/01/09 08:05 PM
Joined: Oct 2004 Posts: 8,061 MA, USA Riamus2 OP Hoopy frood

Thanks for the help. It definitely is getting me on the right track, but I'm not quite there.

Let's look at a Garfield search. Your second code does show both exact matches and the popular. I don't need popular, but need exact and partial (not approximate either). Partial would use title_substring in the regex. I made that change and it shows only the first 3 matches. That is probably due to not using binary variables.

Below is the current sockread event. The second section (the ELSE) is where the search part is handled. Any chance you can work your code into there so it will list all results for exact and partial titles?

Code:

on *:SOCKREAD:imdb.*:{
  if ($sockerr) {
    !echo -ecs info * Could not read from socket ' $+ $sockname $+ ' ( $+ $gettok($sock($sockname).wsmsg,2-,32) $+ )
    hfree -w $sockname
    return
  }
  var %data
  if (*.fetchquotes.* iswm $sockname) {
    sockread -n &data
    while ($sockbr) {
      %data = $bvar(&data,1-930).text
      var %id = $hget($sockname,id)
      if (<a name="qt*"></a> iswm %data) {
        var %quoteid = $gettok(%data,2,34)
        hadd -m $sockname quoteid %quoteid
        if (!$isdir(imdbquotes)) mkdir imdbquotes
        if (!$isdir(imdbquotes\ $+ %id)) mkdir imdbquotes\ $+ %id
        var %file = imdbquotes\ $+ %id $+ \ $+ %quoteid $+ .dat
        if ($isfile(%file)) write -c %file
      }
      elseif (%data == <hr width="30%">) || (<div align="center"> * iswm %data) || (*Related Links* iswm %data) {
        if $hget($sockname,quoteid) {
          hinc -m $sockname count
          hdel -w $sockname quoteid
        }
      }
      elseif ($hget($sockname,quoteid)) {
        var %quoteid = $ifmatch, %file = imdbquotes\ $+ %id $+ \ $+ %quoteid $+ .dat
        if (%data != $null) {
          var %string = $nohtml(%data)
          if (%string != $null) {
            bset -t &info 1 %string
            bset &info $calc($bvar(&info,0) + 1) $iif($right(%data,1) == :,32,13 10)
            bwrite %file -1 -1 &info
            bunset &data &info
          }
        }
      }
      sockread -n &data
    }
  }
  else {
    sockread %data
    while ($sockbr) {
      var %regexp = /<a href="\x2ftitle\x2ftt(\d+)\x2f[^"]*"(?: onclick="set_args\x28'tt\1'\x2c1,1\x29")?>([^<]+)</a>/i
      if ($regex(imdb,%data,%regexp)) {
        hadd -m $sockname $calc($hget($sockname,0).item - 1) $nohtml($regml(imdb,1) $regml(imdb,2))
      }
      elseif (!$dialog(imdbquotes) && %data == <h2>Other Results</h2>) {
        hadd -m $sockname $calc($hget($sockname,0).item - 1) -
      }
      sockread %data
    }
  }
}

Like I said, only titles (exact and partial) are needed, not names or characters. Approximate matches aren't required either. And both the name and tt number are needed.

Thanks again for helping.

**NOTE: The part about other results in there that adds a - is probably related to separating exact and partial matches from one another. It doesn't have to remain, but if it can still throw in a separator between the two, that won't hurt.

Last edited by Riamus2; 27/01/09 08:07 PM.

Re: Regex question Riamus2 #208752 28/01/09 12:15 AM
Joined: Nov 2006 Posts: 1,552 Germany H Horstl Hoopy frood
Sorry, I don't get these [f*] binWars working

How I think it should be done:

- The line in question is the only line that starts with [SPACE]<p><b>
---> The line is too long for a greedy regex match; it has to go into a binvar instead.

- On this line, all chunks to parse are pieces starting with: [SPACE]onclick="(new Image()).src='/rg/find-title- .... and ending with a </td> if they shall include the "year/aka" part, or an </a> if they shall not include the year and aka respectively. As far as I can see, there's no need to go for the initial <a href="/title/ttLINKDIGITS/".
---> Loop all $bfind occurrences of this starting piece, pick a chunk of sufficient length (short enough to be regex-checked). Each successive $bfind starts at $calc(last found location + $len(search term)).
---> In the loop, pick linkNr. and title of that chunk: $regml(1) and $regml(2).

Regexes:

with aka and year:
/\/find-title-\d+\/title_exact\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)<\/td>/i
and
/\/find-title-\d+\/title_substring\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)<\/td>/i

or main title only:
/\/find-title-\d+\/title_exact\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)<\/a>/i
and
/\/find-title-\d+\/title_substring\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)<\/a>/i

---> To convert possible "special chars" in the title, e.g.: $regsubex( $regml(2) ,/&#(\d+);/g,$chr(\1))
---> To clean the html-tags of the title (only needed if aka/year are included), your $nohtml alias or e.g. another: $regsubex( char-cleared-title ,/<.+?>/g,$null)

____

I'd be glad if someone can give a nice, concrete example how to do a loop ($bfind chunks) of a lengthy, sockreaded &binvar
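The chunking idea described above can be sketched in Python, with str.find standing in for $bfind and a slice standing in for $bvar(...).text. This is only an outline of the approach under stated assumptions: the function name is invented, the second entry in the sample data is made up, and the single regex with a (\w+) type capture is a simplification of the separate exact/substring patterns.

```python
import re

# Start marker of every chunk, as described in the post.
MARKER = '''onclick="(new Image()).src='/rg/find-title-'''

# One pattern for all match types; captures type, tt number and main title.
CHUNK_RE = re.compile(
    r'''/find-title-\d+/title_(\w+)/images/b\.gif\?link=/title/tt(\d+)/';">(.+?)</a>''')

def parse_results(page: str):
    """Return (type, tt number, title) for each chunk found in the page."""
    results = []
    pos = page.find(MARKER)
    while pos != -1:
        # A chunk runs from the marker to the next </td> (or end of data).
        end = page.find('</td>', pos)
        if end == -1:
            end = len(page)
        m = CHUNK_RE.search(page[pos:end])
        if m:
            results.append(m.groups())
        # Next search starts just past the marker that was found.
        pos = page.find(MARKER, pos + len(MARKER))
    return results

# First row is from the thread; the second row is invented test data.
sample = (
    ''' <p><b>Titles (Exact Matches)</b> <table><tr><td>'''
    '''<a href="/title/tt0847163/" onclick="(new Image()).src='/rg/find-title-1/title_exact/images/b.gif?link=/title/tt0847163/';">Family Guy</a> (2006) (VG) </td></tr>'''
    '''<tr><td><a href="/title/tt0000001/" onclick="(new Image()).src='/rg/find-title-2/title_substring/images/b.gif?link=/title/tt0000001/';">Some Other Title</a> (1999) </td></tr></table>'''
)

for kind, number, title in parse_results(sample):
    print(kind, number, title)
```

Because each regex only ever sees one short chunk, the overall length of the line stops being a problem, which is the point of the $bfind approach.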

Re: Regex question Horstl #208823 29/01/09 12:57 PM
Joined: Nov 2006 Posts: 1,552 Germany Horstl Hoopy frood

Just an addition to show the $bfind loop I have in mind. Dumping all the data first, like:

Code:

on *:sockread:imdb: {
  :read
  sockread -fn &imdbtemp
  if ($sockbr) {
    bwrite temp.txt -1 -1 &imdbtemp
    goto read
  }
}

Now parsing that data in the sockclose event:

Code:

 
on *:sockclose:imdb:{
  if ($isfile(temp.txt)) { 
    ; read whole page into binvar
    bread temp.txt 0 $file(temp.txt).size &imdb

    ;    ECHO -a binvar has $bvar(&imdb,0) bytes

    ; relevant line starts with:       [SPACE]<p><b>
    ; set pointer at the start of the relevant line. set end of that line
    var %pos = $bfind(&imdb,1,$chr(32) $+ <p><b>).text, %end = $bfind(&imdb,%pos,0)

    ;    ECHO -a linestart +300: $bvar(&imdb,%pos,300).text
    ;    ECHO -a lineend -300: $bvar(&imdb,$calc(%end -300),300).text

    ; set indicating "start text" for all the chunks-to-get:       onclick="(new Image()).src='/rg/find-title-
    var %find = onclick="(new Image()).src='/rg/find-title-

    ; set regex for matching and capturing
    var %reg = /\/find-title-\d+\/title_(\w+)\/images\/b\.gif\?link=\/title\/tt(\d+)\/';">(.+)<\/a>/

    ; this reg is just an alternative to include "aka, year" etc.
    var %reg2 = /\/find-title-\d+\/title_(\w+)\/images\/b\.gif\?link=\/title\/tt(\d+)\/';">(.+)/

    ; loop all occurrences of %find, up to the end of the relevant line
    while ((%pos < %end) && ($bfind(&imdb,%pos,%find).text)) {

      ; set pointer to the "end marker of this chunk":       </td> 
      var %pos = $bfind(&imdb,$v1,</td>).text

      ; current chunk
      var %chunk = $bvar(&imdb,$v1,$calc(%pos - $v1)).text

      ;    ECHO -a current chunk: %chunk

      if ($regex(%chunk,%reg)) { ECHO -a Type: $regml(1) Nr: $regml(2) Text: $regml(3) }

      ;    else ECHO -a skipped a chunk
    }

    ; remove tempfile
    .remove temp.txt
  }
}

...Lacks error handling etc

You have all the matches with $regml(1) = "type" (exact/popular/substring/approx), $regml(2) = "ttNr." and $regml(3) = "Title".

Note that if there's only one exact match, imdb will forward to that title. You can capture this ttNr. in the sockread event:
/^Set-Cookie: fd=tt(\d+)\// where $regml(1) is the ttNr. And if there's no match at all, you'll have a "<b>No Matches.</b>" somewhere in the sockread.
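A small Python sketch of those two checks, applied to one received line at a time. The function name, the return labels, and the sample lines are assumptions for illustration; only the cookie pattern and the "No Matches." text come from the post.

```python
import re
from typing import Optional

# Pattern from the post: a cookie carrying the tt number of the
# title the site forwards to.
COOKIE_RE = re.compile(r'^Set-Cookie: fd=tt(\d+)/')

def classify_line(line: str) -> Optional[str]:
    """Label a single received line, or return None if it is neither case."""
    m = COOKIE_RE.match(line)
    if m:
        return 'redirect:' + m.group(1)
    if '<b>No Matches.</b>' in line:
        return 'no-matches'
    return None

# Hypothetical lines, shaped only to fit the pattern above.
print(classify_line('Set-Cookie: fd=tt0847163/; path=/'))
print(classify_line('<p><b>No Matches.</b></p>'))
print(classify_line('Content-Type: text/html'))
```

Since this depends on how the site happens to signal a forward, it is the part most likely to break when the site changes.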

In the code above, the "find line's start and end" stuff is what I'd like to avoid, yet I haven't managed to put only that single, lengthy line directly into a binvar or /bwrite it (for an early sockclose and less data to parse).
The biggest downside regarding performance is /bwriting the complete sockread into a tempfile first. Would throwing "the line" out into a separate binvar speed it up? I doubt it...

Re: Regex question Horstl #208852 30/01/09 02:59 AM
Joined: Oct 2004 Posts: 8,061 MA, USA R Riamus2 OP Hoopy frood
Yeah, this is why I hate sites with such poor formatting. Though perhaps they do that purposely so people don't do something similar. I normally avoid such sites and find others that are better formatted, but there's really no alternative to imdb.

Oh well, I'll take what you've done and play with it and see if I can make it work. If you or anyone else wants to take a stab (or another in your case), feel free. This won't be the easiest thing I've ever worked on.

As far as performance, that's actually a minor consideration here. This particular part of the script is only used to find the right tt number for whatever you're looking for. Once found, the rest of the script does the rest and does it quickly. The find is rarely used... just to update things or add new information, so speed and performance don't matter a whole lot. It just shouldn't take a minute or two to do.

Btw, I really appreciate the help.

Re: Regex question Horstl #208928 31/01/09 05:21 PM
Joined: Oct 2004 Posts: 8,061 MA, USA R Riamus2 OP Hoopy frood
I haven't tried yet, but I'm wondering if that last code you posted could be set up to only write the section that contains the matches rather than the entire page? I'll look into that and see if I can find something that is just before the matches that can be checked in the binvar and if it matches, it enables the /bwrite and then it keeps checking for something right after the matches that can disable the /bwrite. That should speed it up considerably, I'd think. Again, I haven't really had a chance to work with this yet, but that seems like it would work.

Re: Regex question Riamus2 #208931 31/01/09 09:54 PM
Joined: Nov 2006 Posts: 1,552 Germany Horstl Hoopy frood

Now I got only that line written (rough, but worked for me). Still using a tempfile, but it's smaller now at least

Code:

on *:SOCKREAD:YOURSOCKNAME: {

  ; - your error handling etc here -

  :read
  sockread -fn &YOURBINVAR
  if ($sockbr) {

    ; start writing at     [linestart][space]<p><b>     (skip ~5k-15k+ bytes at the start)
    if ($bvar(&YOURBINVAR,1,7).text == $chr(32) $+ <p><b>) { set -e %imdb.line $true }

    ; close socket at     [linestart]<b>Suggestions For Improving Your Results</b>     (skip ~7k bytes at the end)
    elseif ($bvar(&YOURBINVAR,1,45).text == <b>Suggestions For Improving Your Results</b>) {
      imdb.sockclose $sockname
      sockclose $sockname
      return
    }

    ; write read to tempfile
    if (%imdb.line == $true) { bwrite YOURTEMPFILE -1 -1 &YOURBINVAR }

    goto read
  }
}

The on SOCKCLOSE event only triggers when the remote end closes the connection. As the socket now isn't (always) closed remotely, move the "on sockclose" code to a custom alias. I kept the sockclose event only to play it safe.

Code:

on *:SOCKCLOSE:YOURSOCKNAME:{ imdb.sockclose $sockname }

alias -l imdb.sockclose {
  unset %imdb.line
  ; - the code you had in the sockclose event here -
  ; - $sockname is now $1; you could pass sockmarks as well -
}

EDIT: the (%pos < %end) comparison in the bfind loop is obsolete now. Just set %pos to 1.
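The start/stop gating used in the SOCKREAD handler can be shown in isolation. The Python sketch below keeps lines from the one that starts with " <p><b>" up to (not including) the "Suggestions For Improving Your Results" line; the function name and the sample lines are invented for the example, and a list stands in for the tempfile.

```python
from typing import Iterable, List

START = ' <p><b>'
STOP = '<b>Suggestions For Improving Your Results</b>'

def capture_results(lines: Iterable[str]) -> List[str]:
    """Keep only the lines between the start marker and the stop marker."""
    kept: List[str] = []
    capturing = False
    for line in lines:
        if line.startswith(STOP):
            break              # corresponds to the early sockclose
        if line.startswith(START):
            capturing = True   # corresponds to setting %imdb.line
        if capturing:
            kept.append(line)  # corresponds to the /bwrite
    return kept

# Made-up page lines, only to show which ones survive.
page = [
    '<html><head>...</head>',
    ' <p><b>Popular Titles</b> ...results...',
    'more results',
    '<b>Suggestions For Improving Your Results</b>',
    'footer',
]
print(capture_results(page))
```

Stopping at the end marker is what lets the handler close the socket early instead of waiting for the rest of the page.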

Re: Regex question Horstl #209547 17/02/09 04:01 AM
Joined: Oct 2004 Posts: 8,061 MA, USA R Riamus2 OP Hoopy frood
Ok, I finally had some time to get this worked on. It's working really well. Took a little work to get it just right, but your code really helped out a lot. Thanks!

The only thing I couldn't figure out how to handle without running a separate socket is getting the title of a show that takes you to the show's page instead of the search results page. I know you gave me a regex for finding the tt number in those cases (I had to change that slightly because they must have changed the site since you made it as it wasn't stored in a cookie as far as I could tell), but there doesn't appear to be any way of getting the title of the show that appears on the new page without doing a new socket to get that information. That probably is the only method, unfortunately.

Well, I went ahead and made it pull up the new page to get the title. You can see a slight pause as it does it, if you put in something to display at each step, but it's under a second and you can't normally see it without the test display, so it's all good now. Thanks again!

Last edited by Riamus2; 17/02/09 04:28 AM.
