Hi guys... been a while. I ran into a problem fixing a regex issue in a script for someone. The problem is that I know very little about regex (I can sometimes understand the basic meaning if I see it, but I can't write it and don't know anything complex).

Anyhow, a friend has a script that searches quotes at imdb. It works fine except for one small thing: some change at the site broke the actual search part.

Example search results page: http://www.imdb.com/find?q=Family%20Guy;s=tt

Current regex:
%regexp = /<a href="\x2ftitle\x2ftt(\d+)\x2f[^"]*"(?: onclick="set_args\x28'tt\1'\x2c1,1\x29")?>([^<]+)</a>/i

One section of a long line that is stored in a variable:
<a href="/title/tt0847163/" onclick="(new Image()).src='/rg/find-title-1/title_exact/images/b.gif?link=/title/tt0847163/';">Family Guy</a> (2006) (VG) </td></tr></table> </p> <p>

It used to create a match (multiple, depending on the results), but it no longer matches. From my limited understanding, I'm assuming it's the second half of the regex that's failing to match. The final results are stored using:
$nohtml($regml(imdb,1) $regml(imdb,2))
... where $nohtml is a standard HTML-stripping regex identifier.

Any help figuring out why it's not matching would be appreciated. Thanks.
A simple examination of the regex and the example shows that they're nothing alike in the "onclick" portion of the match.
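To make the mismatch concrete, here is a quick check outside mIRC, in Python (whose `re` module accepts the same syntax for this pattern). The old pattern and the sample anchor are copied from the question. The "relaxed" pattern is only an illustrative assumption (it accepts any onclick value) and is not a fix proposed anywhere in this thread.

```python
import re

# Sample anchor copied from the search results line quoted in the question.
SAMPLE = (
    '<a href="/title/tt0847163/" onclick="(new Image()).src='
    "'/rg/find-title-1/title_exact/images/b.gif?link=/title/tt0847163/';\">"
    "Family Guy</a> (2006) (VG)"
)

# The original pattern: the optional group only allows onclick="set_args('tt<id>',1,1)".
OLD = re.compile(
    r'''<a href="\x2ftitle\x2ftt(\d+)\x2f[^"]*"(?: onclick="set_args\x28'tt\1'\x2c1,1\x29")?>([^<]+)</a>''',
    re.I,
)

# Illustrative assumption: accept any onclick value that contains no double quote.
RELAXED = re.compile(
    r'<a href="/title/tt(\d+)/[^"]*"(?: onclick="[^"]*")?>([^<]+)</a>',
    re.I,
)

old_match = OLD.search(SAMPLE)          # the optional group is skipped, then ">" is required
relaxed_match = RELAXED.search(SAMPLE)  # the new onclick text is consumed by [^"]*

print("old pattern matches:", old_match is not None)
print("relaxed captures:", relaxed_match.groups() if relaxed_match else None)
```

The old pattern fails because, once the optional set_args group does not apply, the next character has to be ">" but is a space followed by the new onclick attribute.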
Yeah, I figured it was the second part. I just don't know quite how to edit it to match that correctly.
So, anyone have any suggestions for this? I know there used to be a lot here who knew regex... where'd they go? 
Hm, to capture the two link numbers and the title from:
<a href="/title/tt0847163/" onclick="(new Image()).src='/rg/find-title-1/title_exact/images/b.gif?link=/title/tt0847163/';">Family Guy</a> (2006) (VG) </td></tr></table> </p> <p>

I played with:
var %line = <a href="/title/tt0847163/" onclick="(new Image()).src='/rg/find-title-1/title_exact/images/b.gif?link=/title/tt0847163/';">Family Guy</a> (2006) (VG) </td></tr></table> </p> <p>
var %reg = /<a href="\/title\/tt(\d+)\/" onclick=".+\/title_exact.+\?link=\/title\/tt(\d+)\/';">([^<]+)</a>/i
if ($regex(%line,%reg)) { echo -a 1st linkN: $regml(1) 2nd linkN: $regml(2) Title: $regml(3) }

Result:
1st linkN: 0847163 2nd linkN: 0847163 Title: Family Guy

The second capture may or may not be redundant...
I don't know what you mean by "multiple" matches though: all the "exact matches"? And if so, with or without the "(2006) (VG)" part in the Family Guy example?
____
EDIT: Managed to get the individual "popular" and "exact matches" with this snippet, but it's pretty ugly (especially the whiles):
; /socktest <search string>
alias socktest {
  if ($sock(imdb)) sockclose imdb
  sockopen imdb imdb.com 80
  ; store the search string on the socket, spaces encoded as %20
  sockmark imdb $replace($$1-,$chr(32),$(%20,0))
  ; safety net: close the socket after 10 seconds
  .timer 1 10 sockclose imdb
}
on *:sockopen:imdb: {
  sockwrite -n $sockname GET $+(/find?q=,$sock($sockname).mark,;s=tt) HTTP/1.0
  sockwrite -n $sockname Host: www.imdb.com
  sockwrite $sockname $crlf
}
on *:sockread:imdb: {
  if ($sockerr) { echo -a error | return }
  var %read
  sockread %read
  if ($gettok(%read,1-2,32) == <p><b>Popular Titles</b>) {
    ; popular matches: captures come in groups of three (No., link number, text)
    var %reg = /\/find-title-(\d+)\/title_popular\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)(?=<br>|<\/td>)/gi
    noop $regex(imdb,%read,%reg)
    if ($regml(imdb,1)) ECHO -a POPULAR MATCHES
    var %n = 1
    while ($regml(imdb,%n)) {
      ECHO -a No.: $regml(imdb,%n) LinkNo: $regml(imdb,$calc(%n +1)) $&
        Text: $regsubex( $regsubex($regml(imdb,$calc(%n +2)),/<.+?>/g,$null) ,/&#(\d+);/g,$chr(\1))
      inc %n 3
    }
    ; exact matches: same pattern with title_exact
    var %reg = /\/find-title-(\d+)\/title_exact\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)(?=<br>|<\/td>)/gi
    noop $regex(imdb,%read,%reg)
    if ($regml(imdb,1)) ECHO -a EXACT MATCHES
    var %n = 1
    while ($regml(imdb,%n)) {
      ECHO -a No.: $regml(imdb,%n) LinkNo: $regml(imdb,$calc(%n +1)) $&
        Text: $regsubex( $regsubex($regml(imdb,$calc(%n +2)),/<.+?>/g,$null) ,/&#(\d+);/g,$chr(\1))
      inc %n 3
    }
    sockclose imdb
  }
}

Hope it's a step in the right direction. I'd display the exact matches (if any), followed by the popular matches (if any)...
Note that this snippet doesn't use binvars, but you'll need them, and I assume the script already uses them. Otherwise some results will be cut off.
(The nested $regsubex calls strip the HTML tags from the "text" part and convert numeric entities of the form &#NNN; to characters.)
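For anyone who wants to test the clean-up step outside mIRC, the two nested $regsubex calls can be sketched in Python like this. The function name `clean_title` is made up for illustration; the two patterns are the ones from the snippet above.

```python
import re

def clean_title(fragment: str) -> str:
    """Strip HTML tags, then turn numeric entities (&#NNN;) into characters."""
    # Inner step: /<.+?>/g replaced with nothing (non-greedy, one tag at a time).
    no_tags = re.sub(r"<.+?>", "", fragment)
    # Outer step: /&#(\d+);/g replaced with the character for that code point.
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), no_tags)

if __name__ == "__main__":
    print(clean_title("Family Guy</a> (2006) (VG)"))
```

Named entities such as &amp; are not covered by this pattern, in the mIRC version or in this sketch.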
Thanks for the help. It definitely is getting me on the right track, but I'm not quite there. Let's look at a Garfield search. Your second code does show both exact matches and the popular. I don't need popular, but need exact and partial (not approximate either). Partial would use title_substring in the regex. I made that change and it shows only the first 3 matches. That is probably due to not using binary variables. Below is the current sockread event. The second section (the ELSE) is where the search part is handled. Any chance you can work your code into there so it will list all results for exact and partial titles?
on *:SOCKREAD:imdb.*:{
  if ($sockerr) {
    !echo -ecs info * Could not read from socket ' $+ $sockname $+ ' ( $+ $gettok($sock($sockname).wsmsg,2-,32) $+ )
    hfree -w $sockname
  }
  var %data
  if (*.fetchquotes.* iswm $sockname) {
    sockread -n &data
    while ($sockbr) {
      %data = $bvar(&data,1-930).text
      var %id = $hget($sockname,id)
      if (<a name="qt*"></a> iswm %data) {
        var %quoteid = $gettok(%data,2,34)
        hadd -m $sockname quoteid %quoteid
        if (!$isdir(imdbquotes)) mkdir imdbquotes
        if (!$isdir(%id)) mkdir imdbquotes\ $+ %id
        var %file = imdbquotes\ $+ %id $+ \ $+ %quoteid $+ .dat
        if ($isfile(%file)) write -c %file
      }
      elseif (%data == <hr width="30%">) || (<div align="center"> * iswm %data) || (*Related Links* iswm %data) {
        if $hget($sockname,quoteid) {
          hinc -m $sockname count
          hdel -w $sockname quoteid
        }
      }
      elseif ($hget($sockname,quoteid)) {
        var %quoteid = $ifmatch, %file = imdbquotes\ $+ %id $+ \ $+ %quoteid $+ .dat
        if (%data != $null) {
          var %string = $nohtml(%data)
          if (%string != $null) {
            bset -t &info 1 %string
            bset &info $calc($bvar(&info,0) + 1) $iif($right(%data,1) == :,32,13 10)
            bwrite %file -1 -1 &info
          }
        }
      }
      bunset &data &info
      sockread -n &data
    }
  }
  else {
    sockread %data
    while ($sockbr) {
      var %regexp = /<a href="\x2ftitle\x2ftt(\d+)\x2f[^"]*"(?: onclick="set_args\x28'tt\1'\x2c1,1\x29")?>([^<]+)</a>/i
      if ($regex(imdb,%data,%regexp)) {
        hadd -m $sockname $calc($hget($sockname,0).item - 1) $nohtml($regml(imdb,1) $regml(imdb,2))
      }
      elseif (!$dialog(imdbquotes) && %data == <h2>Other Results</h2>) {
        hadd -m $sockname $calc($hget($sockname,0).item - 1) -
      }
      sockread %data
    }
  }
}
Like I said, only titles (exact and partial) are needed, not names or characters. Approximate matches aren't required either. And both the name and tt number are needed. Thanks again for helping. **NOTE: The part about other results in there that adds a - is probably related to separating exact and partial matches from one another. It doesn't have to remain, but if it can still throw in a separator between the two, that won't hurt.
Sorry, I don't get these [f*] bin Wars working. How I think it should be done:

- The line in question is the only line that starts with [SPACE]<p><b>
---> The line is too long for a greedy regex match; it has to go into a binvar instead.

- On this line, all chunks to parse are pieces starting with:
[SPACE]onclick="(new Image()).src='/rg/find-title-
... and ending with a </td> if they should include the "year/aka" part, or with an </a> if they should not. As far as I can see, there's no need to go for the initial <a href="/title/tt LINKDIGITS/".
---> Loop over all $bfind occurrences of this starting piece and pick a chunk of sufficient length (short enough to be regex-checked). Each successive $bfind starts at $calc(last found location + $len(search term)).
---> In the loop, pick the link number and title of that chunk: $regml(1) and $regml(2).

Regexes, with aka and year:
/\/find-title-\d+\/title_exact\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)<\/td>/i
and
/\/find-title-\d+\/title_substring\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)<\/td>/i
or main title only:
/\/find-title-\d+\/title_exact\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)<\/a>/i
and
/\/find-title-\d+\/title_substring\/images\/b.gif\?link=\/title\/tt(\d+)\/';">(.+?)<\/a>/i

---> To convert possible "special chars" in the title, e.g.:
$regsubex( $regml(2) ,/&#(\d+);/g,$chr(\1))
---> To clean the HTML tags out of the title (only needed if aka/year are included), use your $nohtml alias or e.g.:
$regsubex( char-cleared-title ,/<.+?>/g,$null)
____
I'd be glad if someone could give a nice, concrete example of how to loop ($bfind chunks) over a lengthy &binvar filled by sockread.
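As a language-neutral illustration of the chunk loop described above (not an mIRC answer to the binvar question), the same idea can be sketched in Python: find each start marker, cut a chunk up to the next </td>, and run a short regex on that chunk only. The function name `parse_results` is made up; the marker and the pattern follow the ones in this post, with the match type captured as an extra group.

```python
import re

# Start marker of every chunk, as described in the post.
START = "onclick=\"(new Image()).src='/rg/find-title-"
# End marker of a chunk (includes the year/aka part).
END = "</td>"

# Captures: match type (exact, substring, ...), tt number, main title.
CHUNK_RE = re.compile(
    r'''/find-title-\d+/title_(\w+)/images/b\.gif\?link=/title/tt(\d+)/';">(.+?)</a>'''
)

def parse_results(line: str) -> list:
    """Return (type, tt_number, title) for every chunk found in the long line."""
    results = []
    pos = 0
    while True:
        start = line.find(START, pos)
        if start == -1:
            break                      # no more chunks
        end = line.find(END, start)
        if end == -1:
            end = len(line)            # last chunk without a closing </td>
        match = CHUNK_RE.search(line[start:end])
        if match:
            results.append(match.groups())
        # Continue after the marker just handled, as in the $bfind description.
        pos = start + len(START)
    return results
```

Because the regex only ever sees one short chunk, the length of the whole line no longer matters.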
Just an addition to show the $bfind loop I have in mind. Dumping all the data first, like:
on *:sockread:imdb: {
  ; the label that "goto read" jumps back to
  :read
  sockread -fn &imdbtemp
  if ($sockbr) {
    bwrite temp.txt -1 -1 &imdbtemp
    goto read
  }
}

Now parsing that data in the sockclose event:
on *:sockclose:imdb:{
  if ($isfile(temp.txt)) {
    ; read whole page into binvar
    bread temp.txt 0 $file(temp.txt).size &imdb
    ; ECHO -a binvar has $bvar(&imdb,0) bytes
    ; relevant line starts with: [SPACE]<p><b>
    ; set pointer at the start of the relevant line. set end of that line
    var %pos = $bfind(&imdb,1,$chr(32) $+ <p><b>).text, %end = $bfind(&imdb,%pos,0)
    ; ECHO -a linestart +300: $bvar(&imdb,%pos,300).text
    ; ECHO -a lineend -300: $bvar(&imdb,$calc(%end -300),300).text
    ; set indicating "start text" for all the chunks-to-get: onclick="(new Image()).src='/rg/find-title-
    var %find = onclick="(new Image()).src='/rg/find-title-
    ; set regex for matching and capturing
    var %reg = /\/find-title-\d+\/title_(\w+)\/images\/b\.gif\?link=\/title\/tt(\d+)\/';">(.+)<\/a>/
    ; this reg is just an alternative to include "aka, year" etc.
    var %reg2 = /\/find-title-\d+\/title_(\w+)\/images\/b\.gif\?link=\/title\/tt(\d+)\/';">(.+)/
    ; loop all occurrences of %find, up to the end of the relevant line
    while ((%pos < %end) && ($bfind(&imdb,%pos,%find).text)) {
      ; set pointer to the "end marker of this chunk": </td>
      var %pos = $bfind(&imdb,$v1,</td>).text
      ; current chunk
      var %chunk = $bvar(&imdb,$v1,$calc(%pos - $v1)).text
      ; ECHO -a current chunk: %chunk
      if ($regex(%chunk,%reg)) { ECHO -a Type: $regml(1) Nr: $regml(2) Text: $regml(3) }
      ; else ECHO -a skipped a chunk
    }
    ; remove tempfile
    .remove temp.txt
  }
}

...Lacks error handling etc.

You have all the matches with $regml(1) = "type" (exact/popular/substring/approx), $regml(2) = "ttNr." and $regml(3) = "Title".

Note that if there's only one exact match, imdb will forward to that title. You can capture this ttNr. in the sockread event with
/^Set-Cookie: fd=tt(\d+)\//
where $regml(1) is the ttNr. And if there's no match at all, you'll have a "<b>No Matches.</b>" somewhere in the sockread data.

In the code above, the "find the line's start and end" stuff is what I'd like to avoid, yet I haven't managed to put only that single, lengthy line directly into a binvar and /bwrite it (for an early sockclose and less data to parse). The biggest downside regarding performance is /bwriting the complete sockread into a tempfile first. Would throwing "the line" out into a separate binvar speed it up? I doubt it...
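The two special cases mentioned above (a forward to a single title, and no match at all) can be checked per received line. A small Python sketch of that check follows. The function name and the returned labels are made up for illustration, and whether the site still sends this cookie is not something the sketch can confirm.

```python
import re

# Pattern from the post: the tt number of the title the search was forwarded to.
REDIRECT_RE = re.compile(r"^Set-Cookie: fd=tt(\d+)/")

def classify_line(line: str) -> tuple:
    """Label one received line as a forward, a 'no matches' notice, or neither."""
    match = REDIRECT_RE.match(line)
    if match:
        return ("redirect", match.group(1))
    if "<b>No Matches.</b>" in line:
        return ("no_matches", None)
    return ("other", None)
```

Running this check on every line before any chunk parsing lets the script skip the results parsing entirely in both special cases.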
Yeah, this is why I hate sites with such poor formatting. Though perhaps they do that purposely so people don't do something similar. I normally avoid such sites and find others that are better formatted, but there's really no alternative to imdb. Oh well, I'll take what you've done and play with it and see if I can make it work. If you or anyone else wants to take a stab (or another in your case), feel free. This won't be the easiest thing I've ever worked on. As far as performance, that's actually a minor consideration here. This particular part of the script is only used to find the right tt number for whatever you're looking for. Once found, the rest of the script does the rest and does it quickly. The find is rarely used... just to update things or add new information, so speed and performance don't matter a whole lot. It just shouldn't take a minute or two to do.  Btw, I really appreciate the help.
I haven't tried yet, but I'm wondering if that last code you posted could be set up to only write the section that contains the matches rather than the entire page? I'll look into that and see if I can find something that is just before the matches that can be checked in the binvar and if it matches, it enables the /bwrite and then it keeps checking for something right after the matches that can disable the /bwrite. That should speed it up considerably, I'd think. Again, I haven't really had a chance to work with this yet, but that seems like it would work.
Now I got only that line written (rough, but worked for me). Still using a tempfile, but it's smaller now at least.
on *:SOCKREAD:YOURSOCKNAME: {
  ; - your error handling etc here -
  ; the label that "goto read" jumps back to
  :read
  sockread -fn &YOURBINVAR
  if ($sockbr) {
    ; start writing at [linestart][space]<p><b> (skip ~5k-15k+ bytes at the start)
    if ($bvar(&YOURBINVAR,1,7).text == $chr(32) $+ <p><b>) { set -e %imdb.line $true }
    ; close socket at [linestart]<b>Suggestions For Improving Your Results</b> (skip ~7k bytes at the end)
    elseif ($bvar(&YOURBINVAR,1,45).text == <b>Suggestions For Improving Your Results</b>) {
      imdb.sockclose $sockname
      sockclose $sockname
      ; the socket is closed now, so stop instead of looping back to sockread
      return
    }
    ; write read to tempfile
    if (%imdb.line == $true) { bwrite YOURTEMPFILE -1 -1 &YOURBINVAR }
    goto read
  }
}

The on SOCKCLOSE event only triggers for a remote close. As the socket now isn't (always) closed remotely, move the "on sockclose" code to a custom alias. I kept the sockclose event only to play it safe.
on *:SOCKCLOSE:YOURSOCKNAME:{ imdb.sockclose $sockname }
alias -l imdb.sockclose {
  unset %imdb.line
  ; - the code you had in the sockclose event here -
  ; - $sockname is now $1; you could pass sockmarks as well -
}

EDIT: the (%pos < %end) comparison in the $bfind loop is obsolete now. Just set %pos to 1.
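The start/stop gating used in the sockread event above can be sketched independently of sockets. Here it is in Python over a list of already-received lines; the function name `keep_results_section` is made up, and the two marker strings are the ones from the snippet.

```python
# Line prefixes taken from the snippet: where to start keeping, where to stop.
START_PREFIX = " <p><b>"
STOP_PREFIX = "<b>Suggestions For Improving Your Results</b>"

def keep_results_section(lines):
    """Keep lines from the first start marker up to, not including, the stop marker."""
    kept = []
    writing = False
    for line in lines:
        if line.startswith(STOP_PREFIX):
            break                  # corresponds to closing the socket early
        if line.startswith(START_PREFIX):
            writing = True         # corresponds to setting %imdb.line
        if writing:
            kept.append(line)      # corresponds to /bwrite to the tempfile
    return kept
```

Everything before the start marker and everything from the stop marker onward is dropped, which is what keeps the tempfile small.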
Ok, I finally had some time to get this worked on. It's working really well. Took a little work to get it just right, but your code really helped out a lot. Thanks! The only thing I couldn't figure out how to handle without running a separate socket is getting the title of a show that takes you to the show's page instead of the search results page. I know you gave me a regex for finding the tt number in those cases (I had to change that slightly because they must have changed the site since you made it as it wasn't stored in a cookie as far as I could tell), but there doesn't appear to be any way of getting the title of the show that appears on the new page without doing a new socket to get that information. That probably is the only method, unfortunately. Well, I went ahead and made it pull up the new page to get the title. You can see a slight pause as it does it, if you put in something to display at each step, but it's under a second and you can't normally see it without the test display, so it's all good now.  Thanks again!
