[7.64] UTF related issue #268480 21/02/21 05:33 PM
Joined: Nov 2004
Posts: 806
Jigsy Offline OP
Hoopy frood
Again, another issue that's been more noticeable since upgrading to 7.64.

For a while now, I've been puzzled as to why searching for certain Japanese words (/g, /jisho, etc.) would point me to a page with gibberish.

[Linked Image from i.imgur.com]

I believe there's a weird inconsistency in isalpha and isalnum when it comes to certain UTF characters.

For example: 語 is considered by mIRC to be an alpha character, yet ご is not.

Code
jisho { url -a $+(http://jisho.org/,$iif($1-,$+(search/,$htmlhex($v1)))) }
htmlhex {
  if ($1-) {
    var %i = 1, %x
    while (%i <= $len($1-)) {
      if ($mid($1-,%i,1) isalnum) { var %x = %x $+ $v1 }
      else { var %x = %x $+ $chr(37) $+ $base($asc($mid($1-,%i,1)),10,16,2) }
      inc %i
    }
    return %x
  }
}

; $htmlhex(日本語!) -> 日本語%21
; $htmlhex(にほんご) -> %306B%307B%3093%3054 (however this pointed me to the above image)


What do you do at the end of the world? Are you busy? Will you save us?
Re: [7.64] UTF related issue [Re: Jigsy] #268481 21/02/21 05:57 PM
Joined: Jan 2004
Posts: 1,510
maroon Offline
Hoopy frood
This looks like it does what you want, and using $regsubex is probably faster than going through a scripted loop. I assume the definition of alnum you need is the case-insensitive base36 alphabet (0-9, A-Z, a-z). This replicates your "good" example, but I'm not sure either version handles a codepoint in the range 256-4095, which is a 3-digit hex number, or a non-alnum character in the 33-126 range. Instead of this simplistic substitution pattern, it may need to call an alias such as $myalias($asc(\t)) to handle different styles in different ranges. If it needs to encode each UTF-8 character separately, remove the /u flag.

Code
//tokenize 32 にほんご | var %i 1  | echo -a $regsubex(foo,$1-,/([^0-9A-Za-z])/gu,$chr(37) $+ $base($asc(\t),10,16,2))
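
If the site turns out to expect standard percent-encoding of the UTF-8 bytes (for にほんご that would be %E3%81%AB%E3%81%BB%E3%82%93%E3%81%94) rather than the hex of each codepoint, a byte-wise alias might look roughly like the sketch below. The alias name utf8hex and the &bytes variable are made up for illustration, and it assumes /bset -t stores the text UTF-8 encoded in the binary variable, so treat it as a starting point rather than a tested drop-in.

```mirc
; Hypothetical alias: $utf8hex(text) percent-encodes every byte that is not 0-9, A-Z or a-z.
utf8hex {
  if ($1- == $null) { return }
  ; Clear any leftover data in case &bytes was already used earlier in the same script run.
  bunset &bytes
  ; Assumption: -t copies the text into the binary variable as UTF-8 bytes.
  bset -t &bytes 1 $1-
  var %i = 1, %n = $bvar(&bytes,0), %x
  while (%i <= %n) {
    var %b = $bvar(&bytes,%i)
    ; ASCII digits and letters are appended unchanged; every other byte becomes %XX.
    if ((%b isnum 48-57) || (%b isnum 65-90) || (%b isnum 97-122)) { var %x = %x $+ $chr(%b) }
    else { var %x = %x $+ $chr(37) $+ $base(%b,10,16,2) }
    inc %i
  }
  return %x
}
```

Because it works on bytes instead of characters, it does not depend on how isalnum classifies 語 or ご, and every escape is exactly two hex digits, which sidesteps the 3-digit and 4-digit codepoint question.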