|
Joined: Jun 2004
Posts: 55
Babel fish
|
OP
I ran into the following problem: while developing an RSS parser, I wrote a function that replaces HTML entities, Unicode text and a few other things. It looks like this:
alias -l feedtextconvert {
  var %tempvar,%feedconverted = $1-
  %feedconverted = $replace(%feedconverted,$chr(9),$chr(32))
  %tempvar = $regsub(%feedconverted,/&quot;/g,",%feedconverted)
  %tempvar = $regsub(%feedconverted,/&lt;/g,<,%feedconverted)
  %tempvar = $regsub(%feedconverted,/&gt;/g,>,%feedconverted)
  %tempvar = $regsub(%feedconverted,/&nbsp;/g,$chr(32),%feedconverted)
  %tempvar = $regsub(%feedconverted,/&auml;/g,ä,%feedconverted)
  %tempvar = $regsub(%feedconverted,/&uuml;/g,ü,%feedconverted)
  %tempvar = $regsub(%feedconverted,/&ouml;/g,ö,%feedconverted)
  %tempvar = $regsub(%feedconverted,/&Auml;/g,Ä,%feedconverted)
  %tempvar = $regsub(%feedconverted,/&Uuml;/g,Ü,%feedconverted)
  %tempvar = $regsub(%feedconverted,/&Ouml;/g,Ö,%feedconverted)
  %tempvar = $regsub(%feedconverted,/&szlig;/g,ß,%feedconverted)
  %tempvar = $regsub(%feedconverted,/&(\S*);/g,,%feedconverted)
  if ($isutf(%feedconverted)) { %feedconverted = $utfdecode(%feedconverted,0) }
  %tempvar = $regsub(%feedconverted,/<br( \/)?>/g,\\n,%feedconverted)
  %feedconverted = $regsubex(%feedconverted,/<a (\s|\S|=|\")*>(\s|\S|=|\")*<\/a>/g,\2)
  %feedconverted = $regsubex(%feedconverted,/<img(\s|\S|=|\")* alt=\"(\s|\S)\"(\s|\S|=|\")*( \/)?>/g,\2)
  %tempvar = $regsub(%feedconverted,/<(\/?)(\S*)>/g,,%feedconverted)
  return %feedconverted
}
Now, when I call the function with %something = $feedtextconvert(%x) in a socket (where %x contains the <title> line of a feed), mIRC crashes reproducibly. When I comment out the line replacing &lt;, or replace that line with:
%feedconverted = $replace(%feedconverted,&lt;,$+(<,$chr(32)))
it stops crashing. (Strangely enough, any additional character after the < works here! Replacing < with $chr(60), however, does not help.) The &gt; line works fine either way. I presume mIRC has some difficulty with the less-than symbol here. It may be worth a look.
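For readers who want to experiment with the entity handling outside mIRC, here is a minimal Python sketch (not mIRC script) of the same idea done in a single pass. The names ENTITIES and replace_entities are hypothetical, and the entity table is an assumption modelled on the replacements in the alias above. Known entities are replaced and any other entity reference is dropped, which is roughly what the alias does with its final `&(\S*);` rule.

```python
import re

# Assumed entity table, modelled on the replacements in the alias above.
ENTITIES = {
    "quot": '"', "lt": "<", "gt": ">", "nbsp": " ",
    "auml": "ä", "uuml": "ü", "ouml": "ö",
    "Auml": "Ä", "Uuml": "Ü", "Ouml": "Ö", "szlig": "ß",
}

def replace_entities(text):
    """Replace known named entities; drop any other &name; or &#123; reference."""
    # #?\w+ cannot contain ';', so each match ends at the first ';' it meets.
    # A greedy \S* could instead run on to a later ';' in the same word.
    return re.sub(r"&(#?\w+);", lambda m: ENTITIES.get(m.group(1), ""), text)
```

A call such as replace_entities("Gr&uuml;&szlig;e &lt;b&gt;") decodes the four entities in one scan instead of one substitution per entity.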
Gamers.IRC - The best way to use mIRC
|
Joined: Dec 2002
Posts: 5,482
Hoopy frood
|
Thanks for the report. Unfortunately, I can't seem to reproduce this issue here. Can you provide the smallest value for %x that results in a crash for you?
|
Joined: Jun 2004
Posts: 55
Babel fish
|
OP
We have found out a bit more in the meantime. First, the line has to be extremely long: mIRC processes lines of more than 1k characters without any problem, but the crash occurs with a line of 3,400+ characters. The whole crashing line we want to process (an excerpt from a German weblog) can be found here: test.txt (Too long to paste here directly, sorry.)
One more thing we noticed: if we just //echo the replaced line without using a %var, it works. The crash only happens when we try to assign the result to a variable. (Using /tokenize instead of a variable works, too, so apparently the variable assignment is what breaks here.) This thing is getting stranger every day. :-)
To make it easier to test, I put a modified and extended version of the code from above into a new script file: test.mrc
Last edited by Chrisi; 20/01/10 01:28 AM.
|
Joined: Dec 2002
Posts: 5,482
Hoopy frood
|
Thanks, I was able to reproduce the issue. The cause appears to be a stack overflow in PCRE. I notice that if the following line is commented out the issue does not occur:
;%feedconverted = $regsubex(%feedconverted,/<img(\s|\S|=|\")* alt=\"(\s|\S)\"(\s|\S|=|\")*( \/)?>/g,\2)
Is it possible that the issue is due to a problem with the above expression?
|
Joined: Dec 2005
Posts: 28
Ameglian cow
|
That expression causes problems (in terms of crashing mIRC) here, too, after we experimented with /tokenize instead of variable assignment. The regular expression is incomplete anyway (it does not match an <img> tag), so I tried to fix it. After some more testing, the following line:
//echo -ag $regsubex(<img src="test.png" alt="test" />,/<img([^>]*)alt="([^"]+)"((\s|\S|=|")*)(\s)?(\/)?>/g,\2)
echoes "test" correctly, so at least the "fixed" expression matches. Using it in the corresponding %feedconverted line still crashes mIRC, so apparently it does not even matter whether the expression matches or not. (I wonder why the stack overflow only occurs with some of the rules and goes away when some of them are commented out.)
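The remark that the original expression does not match an ordinary <img> tag can be checked outside mIRC. This is a minimal Python sketch, assuming Python's re module treats this pattern the same way PCRE does. It is a different engine, so it says nothing about the stack overflow itself, and alt_text is a hypothetical helper name. The point it illustrates: the group `(\s|\S)` after `alt=\"` accepts exactly one character, so only a one-character alt text can match.

```python
import re

# The original <img> expression from the alias, unchanged.
ORIGINAL = re.compile(r'<img(\s|\S|=|\")* alt=\"(\s|\S)\"(\s|\S|=|\")*( \/)?>')

def alt_text(tag):
    """Return the captured alt text (group 2), or None if the tag does not match."""
    m = ORIGINAL.search(tag)
    return m.group(2) if m else None

# Inputs are kept short on purpose: the overlapping alternatives in
# (\s|\S|=|\")* give a backtracking engine several ways to consume the
# same character, so a failed match on long input can get very slow.
multi_char = alt_text('<img src="test.png" alt="test" />')
single_char = alt_text('<img src="a.png" alt="x" />')
```

Here multi_char ends up as None (no match), while single_char holds the one-character alt text.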
Gamers.IRC team - gamersirc.net #Gamers.IRC on QuakeNet (sometimes we're there).
|
Joined: Dec 2002
Posts: 5,482
Hoopy frood
|
The PCRE library supports recursion using two different methods: one uses normal stack allocation and the other allocates memory on the fly as needed. Your script crashes the PCRE library because heavy recursion results in a stack allocation error. This is most likely due to an inefficient regular expression. I tested your script using memory-based recursion in PCRE and it completed without crashing; however, the result from $feedtextconvert() was not correct.
In a previous version of mIRC, I changed PCRE to use memory-based recursion because users were reporting the same issue as you. The change solved the heavy-recursion crash, but it also made expression parsing far slower, which some users reported. So I changed it back to stack allocation in a subsequent version of mIRC.
Update: I have made a change to PCRE that limits the recursion depth and causes the parsing to stop at that point. Since it is stack-based, a recursion depth of about 2000 seems to be a safe maximum and prevents mIRC from crashing.
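To make the idea of a recursion depth limit concrete, here is a toy model in Python. It is not PCRE's actual code: match_star_then and DepthLimitExceeded are hypothetical names, the matcher only handles "any characters followed by a literal", and spending one recursion level per consumed character is an assumption standing in for how a repeated group can cost one stack frame per repetition.

```python
import sys

# Python's own default recursion limit is below 2000, so raise it for the demo.
sys.setrecursionlimit(10000)

class DepthLimitExceeded(Exception):
    """Raised when the toy matcher recurses deeper than the configured limit."""

def match_star_then(text, pos, tail, depth=0, limit=2000):
    """Match any run of characters followed by the literal `tail`, from `pos`.

    Returns the end index of the match, or -1 if there is none.
    """
    if depth > limit:
        # Stop parsing here instead of letting the stack overflow.
        raise DepthLimitExceeded(f"gave up at depth {depth}")
    # Greedy: first try to consume one more character, one level deeper.
    if pos < len(text):
        end = match_star_then(text, pos + 1, tail, depth + 1, limit)
        if end != -1:
            return end
    # Then try to match the literal tail at the current position.
    if text.startswith(tail, pos):
        return pos + len(tail)
    return -1

short_line = "x" * 1000 + ">"   # stays below the limit
long_line = "x" * 3400 + ">"    # about the length reported in this thread
```

With these inputs, match_star_then(short_line, 0, ">") completes normally, while the same call on long_line raises DepthLimitExceeded once it passes depth 2000, mirroring the "halt instead of crash" behaviour described above.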
Last edited by Khaled; 21/01/10 10:35 PM.
|
Joined: Dec 2005
Posts: 28
Ameglian cow
|
Sounds fine, thank you. Now what does this change mean for our script (except that it won't cause a crash anymore)? Do we need to change something to make it compliant with upcoming changes?
|
Joined: Dec 2002
Posts: 5,482
Hoopy frood
|
The first "img" expression results in an endless recursion loop, so something is wrong with the expression, and it will be halted once it reaches the 2000 recurse limit.
The second "img" expression also requires recursion that goes much deeper than the 2000 recurse limit. So this expression will be halted before it finishes as well.
The second expression does not work correctly, even when I allow it to go as deep as it wants. It finishes fairly quickly but the result is not correct, so something is still wrong with it.
Since I rarely receive reports about regular expressions crashing mIRC, it seems that very few users create regular expressions that require heavy recursion (at least not intentionally), so the 2000 recurse limit seems to be sufficient.
Have you tried testing your expression in another regular expression parser to see what happens?
|
Joined: Mar 2004
Posts: 54
Babel fish
|
Instead of (\s|\S|=|\")*
try this: [^>]*
|
Joined: Dec 2005
Posts: 28
Ameglian cow
|
"Have you tried testing your expression in another regular expression parser to see what happens?"
I tried it using Expresso, which basically returns the correct values. I don't know whether a more advanced expression parser is available anywhere; so far this one has been enough for me. (Any ideas?)
@Zed: ... that works! Thanks! Did that one change really make so much of a difference?
Working regex: /<img([^>]*)alt="([^"]+)"([^>]*)(\s)?(\/)?>/g
EDIT: I was told it still crashes. :-(
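The "working regex" above can also be tried outside mIRC. This is a minimal Python sketch, assuming Python's re module as a stand-in for PCRE. It is a different engine, so it can neither confirm nor rule out the crash mentioned in the edit, and strip_img is a hypothetical helper name. Because `[^>]*` and `[^"]+` each allow only one way to consume a given character, the engine has far fewer alternatives to retry than with `(\s|\S|=|\")*`.

```python
import re

# The "working regex" from the post above; group 2 is the alt text.
FIXED = re.compile(r'<img([^>]*)alt="([^"]+)"([^>]*)(\s)?(\/)?>')

def strip_img(text):
    """Replace every matching <img> tag with its alt text."""
    return FIXED.sub(r"\2", text)

result = strip_img('before <img src="test.png" alt="test" /> after')
```

For this input, result is the string "before test after"; text without an <img> tag is returned unchanged.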
Last edited by Tuxman; 22/01/10 10:01 PM.