
#263013 - 13/05/18 07:17 PM New encoding method for $encode() and $unsafe()
maroon Offline
Hoopy frood

Registered: 12/01/04
Posts: 869
$unsafe uses $encode(string,m), which encodes every 3 bytes as 4 text characters. Saving a &binvar into a %variable is therefore limited to less than 3/4 of the max line length, because the encoded string is 4/3 as long. Each Unicode codepoint above 2047 (up to 65535) is UTF-8 encoded as 3 bytes, so a MIME-encoded string can be up to 4x as long as the $len of the original text string:

Code:
//bset &v 30 0 | noop $encode(&v,bm) | var %var $bvar(&v,1-).text | echo -a $len(%var) %var
//bset -t &v 1 $str($chr(10004),30) | noop $encode(&v,bm) | var %var $bvar(&v,1-).text | echo -a $len(%var) %var
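
For anyone who wants to check those lengths outside mIRC, here is a rough Python sketch of the same arithmetic. It assumes $encode's 'm' switch behaves like standard Base64 on the UTF-8 bytes; `mime_len` is just a helper name made up for this example.

```python
import base64

def mime_len(text: str) -> int:
    """Length of the Base64 text for the UTF-8 bytes of `text` (hypothetical helper)."""
    return len(base64.b64encode(text.encode("utf-8")))

# 30 zero bytes, like the first command: 30 bytes -> 40 characters
zeros = len(base64.b64encode(bytes(30)))

# 30 copies of codepoint 10004: 90 UTF-8 bytes -> 120 characters
checks = mime_len(chr(10004) * 30)

print(zeros, checks)
```

The second case is the 4x one: 30 characters of text become 120 characters of encoded text.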



Base 85 is the lowest base that can translate 4 bytes into 5 text characters, because 85^5 is greater than 256^4 while 84^5 is not. So if a variation of Base85 were enabled as a new encoding method, the data being unsafe'ed or encoded could be close to 4/5 of the max line length instead of 3/4.
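
The "lowest base" claim is easy to verify with a brute-force search. This is only an illustrative Python sketch, and `smallest_base` is a name invented here:

```python
def smallest_base(n_bytes: int, n_chars: int) -> int:
    """Smallest base whose n_chars-digit strings can cover every n_bytes value."""
    base = 2
    while base ** n_chars < 256 ** n_bytes:
        base += 1
    return base

print(smallest_base(4, 5))  # 4 bytes into 5 characters
print(smallest_base(3, 4))  # 3 bytes into 4 characters (the MIME case)
print(84 ** 5, 256 ** 4, 85 ** 5)
```

The search lands on 85 for the 4-into-5 case and on 64 for the 3-into-4 case.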

From the Wikipedia page on Base85, it seems there's no single standard to expect when text is described as being Base85 encoded, as several flavors have been created for the needs of different groups. mIRC could choose whichever of the Base85 variants is most likely to be used for specific external purposes, or use a different alphabet that strikes the 9 of the 94 characters in ASCII 33-126 which are most likely to cause problems by being interpreted as special symbols.

Two variants listed on the Base85 Wikipedia page are ZeroMQ (Z85) and RFC-1924. Both variants use all 62 alphanumeric characters (52 letters plus 10 digits), and differ only in which 23 non-alphanumeric characters they use. In this list, I changed a character to 'x' if it's not used by that variant.

Code:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 7bit printable non-alphanumeric
!x#$%&x()*+x-xxx;<=>?@xxx^_`{|}~ RFC-1924
!x#$%&x()*+x-./:x<=>?@[x]^xx{x}x ZeroMQ
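
The masks above can be regenerated from the two alphabets. A small Python sketch follows; the alphabet strings are my transcription of RFC 1924 and of the Z85 alphabet published by the ZeroMQ project, and `mask` is a made-up helper name.

```python
import string

# Alphabets as transcribed from RFC 1924 and the ZeroMQ Z85 spec
RFC1924 = ("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           "abcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~")
Z85 = ("0123456789abcdefghijklmnopqrstuvwxyz"
       "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#")

def mask(alphabet: str) -> str:
    """Show each of the 32 ASCII punctuation characters, or 'x' if unused."""
    return "".join(c if c in alphabet else "x" for c in string.punctuation)

print(string.punctuation)
print(mask(RFC1924))
print(mask(Z85))

# Symbols that both variants use
shared = sorted(set(RFC1924) & set(Z85) & set(string.punctuation))
print(len(shared), "".join(shared))
```

Both alphabets come out at 85 distinct characters, 23 of them non-alphanumeric.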


To prevent evaluation inside $unsafe(string).undo, there could be a throw-away character at the front, to prevent the first character being $ or %. Having the discarded 1st character be a symbol that's not in the Base64 mime, such as "^", makes it easy to prevent a Base85 string from ever being the same as a Base64 string.

There's still a problem when the $unsafe().undo string contains an unequal number of open/close parentheses, so perhaps use the RFC-1924 alphabet but with () replaced by []. If someone needs to translate back to RFC-1924, they could use $mid($replace(string,[,$chr(40),],$chr(41)),2)
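
To make the idea concrete, here is a minimal Python sketch of the scheme described in the last two paragraphs: RFC-1924 alphabet, () swapped for [], and a throw-away "^" in front. It is only a model of the proposal, not how mIRC does or would implement it, and `safe_encode` / `safe_decode` are hypothetical names. Python's base64.b85encode happens to use the RFC-1924 alphabet.

```python
import base64

PREFIX = "^"                            # throw-away first character, not in Base64
TO_SAFE = str.maketrans("()", "[]")     # avoid unbalanced parentheses
FROM_SAFE = str.maketrans("[]", "()")   # same as the $replace() in the post

def safe_encode(data: bytes) -> str:
    """Encode bytes with the proposed bracket-for-parenthesis Base85 variant."""
    body = base64.b85encode(data).decode("ascii")
    return PREFIX + body.translate(TO_SAFE)

def safe_decode(text: str) -> bytes:
    """Undo safe_encode: drop the prefix, restore (), decode as RFC-1924."""
    if not text.startswith(PREFIX):
        raise ValueError("missing '^' prefix")
    body = text[1:].translate(FROM_SAFE)
    return base64.b85decode(body.encode("ascii"))

sample = "$me says (hello | %world".encode("utf-8")
encoded = safe_encode(sample)
print(encoded)
print(safe_decode(encoded) == sample)
```

Since [ and ] are not part of the RFC-1924 alphabet, the swap is reversible without ambiguity.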

Base85's design looks like it does not pad to create full groups of 5 text characters. A final group of 2/3/4/5 characters is the encoding of 1/2/3/4 binary bytes respectively. Padding would require either an 86th character in the alphabet, or a scheme like the one used by the $encode encryption switches 'pnz', where strings must always be padded by 1-5 characters.
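
The short-final-group behaviour can be seen in Python's own Base85 encoder. This sketch only shows what that particular implementation does with its default (unpadded) output, which may or may not match what mIRC would choose:

```python
import base64

# Encoded length for inputs of 1..8 bytes, using Python's RFC-1924-alphabet encoder
tail_lengths = {n: len(base64.b85encode(bytes(range(n)))) for n in range(1, 9)}

for n_bytes, n_chars in tail_lengths.items():
    print(n_bytes, "bytes ->", n_chars, "characters")
```

A final group of n bytes comes out as n+1 characters, so the decoder can infer the byte count from the text length without any padding symbol.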

Adding a switch parameter to $unsafe would retain backwards compatibility, though I don't see why that's needed.

For $encode/$decode, a lot of the switch letters are taken, and I assume '8' would cause problems for a potential future switch needing a numeric modifier. The untaken switch letters in $encode are dfghjkoqvwxy, and I don't see any obvious mnemonic choices, unless you want to permit a case-sensitive capital-U, which is $chr(85). It appears $encode currently treats switch letters as case-insensitive, so currently $encode(test,U) uses the 'u' Uuencode switch. This case-insensitivity is not documented, and seems to be the exception compared to other identifiers.

#263018 - 14/05/18 01:07 AM Re: New encoding method for $encode() and $unsafe() [Re: maroon]
Raccoon Offline
Hoopy frood

Registered: 18/02/03
Posts: 2435
There's an identifier somewhere for converting illegal file characters into underscores. Seems related.
_________________________
doin' things a particle can

#263019 - 14/05/18 01:26 AM Re: New encoding method for $encode() and $unsafe() [Re: Raccoon]
maroon Offline
Hoopy frood

Registered: 12/01/04
Posts: 869
I'm not seeing how that relates to what $encode and $unsafe do. The closest thing to such an identifier that I'm aware of is how /drawsave chooses the filename when you give it one containing characters it doesn't like. Plus, changing characters to underscores causes too many inputs to have identical outputs. Since upper/lower case letters are treated as equivalent for filename purposes, there aren't even 64 characters in the printable 33-126 range that can be used as filename characters.

#263020 - 14/05/18 01:32 AM Re: New encoding method for $encode() and $unsafe() [Re: maroon]
Raccoon Offline
Hoopy frood

Registered: 18/02/03
Posts: 2435
Honestly, I'm not fully comprehending the nature of your post. Could you give me the tweet-length summary?

> Each Unicode codepoint above 2047 is UTF-8 encoded as 3 bytes, so encoding such strings can potentially have a MIME'ed string be 4x as long as the $len of the original text string

Seems copacetic to me.
_________________________
doin' things a particle can

#263021 - 14/05/18 02:11 AM Re: New encoding method for $encode() and $unsafe() [Re: Raccoon]
maroon Offline
Hoopy frood

Registered: 12/01/04
Posts: 869
My code example showed that a string of 30 copies of codepoint 10004 needs a mime string of 120 characters, so the mime text has a $len of 120 for an original string with a $len of 30, which is 4x as long.

Base-85 lets you translate 4 binary bytes into 5 text, the way mime translates 3 binary bytes into 4 text.
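
As a rough illustration of what that buys, here is a Python sketch comparing how many bytes fit in a line of a given length. The 1000-character limit is a made-up round number, not mIRC's actual max line length, and `capacity` is a hypothetical helper that only counts whole groups.

```python
def capacity(line_len: int, bytes_per_group: int, chars_per_group: int) -> int:
    """Bytes that fit in line_len characters, counting only complete groups."""
    return (line_len // chars_per_group) * bytes_per_group

LINE_LEN = 1000  # assumed limit for illustration only

mime_bytes = capacity(LINE_LEN, 3, 4)     # mime: 3 bytes per 4 characters
base85_bytes = capacity(LINE_LEN, 4, 5)   # Base85: 4 bytes per 5 characters
print(mime_bytes, base85_bytes)
```

With that assumed limit, the same line carries 750 bytes as mime and 800 bytes as Base85.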

#263022 - 14/05/18 02:42 AM Re: New encoding method for $encode() and $unsafe() [Re: maroon]
Raccoon Offline
Hoopy frood

Registered: 18/02/03
Posts: 2435
I see. I guess I got confused when you started going into converting certain ascii characters into 'x'. Yeah, I'm a fan of novel encoding schemes (steganography), so I would just push for Base94 and be done with it. All the printable characters besides space.
_________________________
doin' things a particle can

#263023 - 14/05/18 03:11 AM Re: New encoding method for $encode() and $unsafe() [Re: Raccoon]
maroon Offline
Hoopy frood

Registered: 12/01/04
Posts: 869
Base94 as something added to $base() would be nice, but it would trash output whose decimal equivalent is greater than 2^53, like what happens when you try to translate an entire hash to/from base36.

//echo -a $base($base($sha1(abc),16,36),36,16)
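
The same kind of loss can be reproduced outside mIRC. This Python sketch assumes the 2^53 limit comes from storing the number as a double-precision float, which is my guess at the cause rather than something confirmed about $base's internals:

```python
import hashlib

# The SHA-1 of "abc" as one 160-bit integer
n = int(hashlib.sha1(b"abc").hexdigest(), 16)

# Passing through a double keeps only about 53 significant bits
round_trip = int(float(n))

print(n.bit_length())
print(n == round_trip)

# The boundary itself: 2^53 survives, 2^53 + 1 does not
print(int(float(2 ** 53)) == 2 ** 53)
print(int(float(2 ** 53 + 1)) == 2 ** 53 + 1)
```

The hash needs 160 bits, so the low-order digits cannot survive the trip through a double.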

Base85 lets you translate groups of 4 binary bytes into groups of 5 text characters, the way mime translates 3 binary bytes into 4 text characters. Base85 differs from mime in that some 5-character strings are invalid: in mime 64^4 == 256^3, but in Base85 85^5 != 256^4, so there are more 5-character groups than 4-byte values.
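
A quick Python sketch of the invalid-group point, using Python's RFC-1924-alphabet decoder as a stand-in (mIRC has no Base85 decoder to test against, and `is_valid_group` is a name invented here):

```python
import base64

# How many 5-character groups have no 4-byte value to map to
invalid_groups = 85 ** 5 - 256 ** 4

def is_valid_group(group: bytes) -> bool:
    """True if Python's decoder accepts this 5-character Base85 group."""
    try:
        base64.b85decode(group)
    except ValueError:
        return False
    return True

highest_valid = base64.b85encode(b"\xff\xff\xff\xff")  # encoding of 2^32 - 1
print(invalid_groups)
print(highest_valid, is_valid_group(highest_valid))
print(is_valid_group(b"~~~~~"))  # '~' is the last alphabet character, so this is 85^5 - 1
```

Any decoder has to decide what to do with those out-of-range groups, which is something mime never has to worry about.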
