I am in a bit of a hurry right now, so I can't really look at this, but why not just check all the modes used and base it on that?
Example:
mode +kmo-v key nick1 nick2
Check $1, one character at a time:
Character 1 is +, so set a variable (or whatever you like) noting that every mode after it is being set, until you reach a -, of course.
Character 2 is k.  As such, you know that $2 will be the key value.
Character 3 is m.  Assuming m is moderated on your network, you can just ignore this mode and continue to the next.
Character 4 is o.  So, you know that you will op someone.  Since you already know $2 is the key, then you know you are on $3 for the nick being opped.
Character 5 is -.  Change your variable, or whatever you're using, to say that everything is now a -.
Character 6 is v.  So, you know that you're devoicing someone.  Since $3 was the last used token, $4 is the one being devoiced.
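To make the walk-through above concrete, here is a rough sketch of the same idea. It is written in Python rather than mIRC script only to keep it short and easy to test; the function name parse_modes and the PARAM_MODES set are made up for illustration, and which modes actually take a parameter depends on your network, so treat that set as an assumption.

```python
# Modes assumed to take a parameter. This is an assumption: adjust it for
# your network (for example, some networks add h, q, a, or l).
PARAM_MODES = set("kovb")

def parse_modes(tokens):
    """tokens[0] is the mode string ($1); the rest are its parameters ($2, $3, ...)."""
    sign = "+"       # current direction, updated whenever a + or - is seen
    next_param = 1   # index of the next unused parameter token
    changes = []
    for ch in tokens[0]:
        if ch in "+-":
            sign = ch
        elif ch in PARAM_MODES:
            # This mode consumes the next unused token, if there is one.
            param = tokens[next_param] if next_param < len(tokens) else None
            changes.append((sign, ch, param))
            next_param += 1
        else:
            # Modes such as m (moderated) take no parameter.
            changes.append((sign, ch, None))
    return changes

for sign, mode, param in parse_modes("+kmo-v key nick1 nick2".split()):
    print(sign + mode, param)
```

The same loop should translate fairly directly to a script that steps through $1 one character at a time and keeps a counter for the next token.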
Now, I am sure that there is probably another method of doing this that is faster than checking every character in $1.  Still, I can't imagine this taking too long to do, because only a limited number of modes can be set in one mode line.
Another method that comes to mind is to backtrack.  Check $1 for the specific character you care about (such as o).  Then take a copy of $1 with the + and - removed, along with any modes that don't need another token (such as +m [moderated]), find the $pos of your character in that copy, and add 1 (because $1 is the mode string itself, not a parameter).  The token at $pos + 1 is the one being opped/deopped.  Then, just backtrack through the original $1 from that character to see what the last used +/- is.
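A rough sketch of the backtracking idea, again in Python rather than mIRC script. The name find_target and the PARAM_MODES set are hypothetical, and it only looks at the first occurrence of the mode character, so something like +oo would need extra handling.

```python
# Modes assumed to take a parameter (an assumption; adjust for your network).
PARAM_MODES = set("kovb")

def find_target(mode_string, params, mode_char):
    """Return (sign, param) for the first occurrence of mode_char, or None.

    mode_string is $1; params is the list of tokens after it ($2, $3, ...).
    """
    idx = mode_string.find(mode_char)
    if idx == -1 or mode_char not in PARAM_MODES:
        return None
    # Drop +, - and parameterless modes so positions line up with params.
    stripped = "".join(c for c in mode_string if c in PARAM_MODES)
    pos = stripped.index(mode_char)  # 0-based here; the "$pos + 1" step in
                                     # 1-based $N terms is just params[pos]
    if pos >= len(params):
        return None
    # Backtrack from the mode character to the nearest + or - before it.
    sign = "+"
    for c in reversed(mode_string[:idx]):
        if c in "+-":
            sign = c
            break
    return sign, params[pos]

print(find_target("+kmo-v", ["key", "nick1", "nick2"], "o"))
print(find_target("+kmo-v", ["key", "nick1", "nick2"], "v"))
```

This avoids keeping any state while scanning, at the cost of one extra pass to find the sign.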
I actually think RusselB was doing this sort of thing, though I haven't looked closely, nor have I tested it.  I just don't have time right now to really go through it all.