Z85 Nth chunks and tail padding

#271502 03/04/23 04:52 PM

Joined: Jan 2004

Posts: 2,127

maroon

Hoopy frood

Observations on the Z85 encoder in the new beta

#1. The N switch isn't working right with it.
#2. The padded string is unnecessarily long.

#1: The N switch

As I understand it, the N switch lets you retrieve the encoded output of a really long string piece by piece when the whole thing can't fit within $maxlenl, and it should be possible to decode each chunk individually without crossing any byte boundaries. For example, with UUencode each line has a length character relative only to that line's content.

However, for 'v' encoding it is chopping the encoded string into 60-digit chunks. Normally that would be fine, because the 4-as-5 encoding makes any multiple of 5 a good chunk size. But the leading padding digit now means that the N=1 chunk begins with a digit telling you about the last row, and every row except the 1st begins with the 5th digit of a group-of-5, whose first 4 digits are orphaned at the tail end of the prior row (in the output below each group is q=uQ[ and rows 2 and 3 start with the [ digit). Assuming the goal of this padding scheme is to ensure that no special character can begin a Z85 string, that's not guaranteed to be true for these N chunks.

//var %i 1 | while (%i isnum 1-3) { echo -a $encode($str(S,123),v,%i) | inc %i }

3q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ
[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ
[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uQ[q=uP{

I'm guessing that the solution under the current design would be for each not-the-last Nth chunk to have length=61, beginning with '4' and holding the encoding of 60*4/5=48 bytes, so that only the final Nth row would hold info about a short packet, depending on whether the length digit and up to 3 bytes of tail padding are both needed.
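To make the row arithmetic concrete, here is a small Python sketch (not mIRC script). It assumes, based only on the 123-byte output above, that 'v' produces 1 leading digit plus 5 digits per started group of 4 input bytes, and that N chunks are plain 60-digit slices. The function names are made up for illustration.

```python
ROW = 60          # digits per N chunk, as observed in the output above
ROW_BYTES = 48    # 60 * 4 / 5 input bytes per full row

def v_length(n_bytes):
    # Assumed 'v' output length: 1 leading digit + 5 digits per started 4-byte group
    return 1 + 5 * ((n_bytes + 3) // 4)

def current_rows(n_bytes):
    # For each 60-digit slice: (row length, 0-based offset inside a group-of-5
    # of the row's first digit). The first row starts with the leading digit,
    # so its offset is reported as None.
    total = v_length(n_bytes)
    rows = []
    for start in range(0, total, ROW):
        length = min(ROW, total - start)
        offset = None if start == 0 else (start - 1) % 5
        rows.append((length, offset))
    return rows

def proposed_rows(n_bytes):
    # Row lengths if every N chunk carried its own leading digit followed
    # by whole groups only (48 bytes -> 61 digits).
    return [v_length(min(ROW_BYTES, n_bytes - start))
            for start in range(0, n_bytes, ROW_BYTES)]

print(current_rows(123))   # [(60, None), (60, 4), (36, 4)]
print(proposed_rows(123))  # [61, 61, 36]
```

An offset of 4 means the row opens with the last digit of a group, so that row can't be decoded on its own. Under the per-row prefix idea, every row starts on a group boundary at the cost of 1 extra digit per row.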

#2: Excessive padding

I haven't seen this kind of padding used for any base85 code, though I see that it achieves the ability to prevent any Z85 string from being evaluated, in spite of containing = $ % , # etc.

The attraction of base85 is that it encodes 4-as-5 instead of the 3-as-4 of mime. But under this padding scheme, which keeps the full group of 5 at the tail end and then adds a padding digit at the front, the Z85 string needs to be quite long before 'v' is guaranteed to be shorter than the mime string, not counting mime's '=' padding, which isn't needed in order to decode accurately. The command below shows that this Z85 scheme is not shorter than mime until length 16, is still longer at length 45, and only matches the mime length for strings as long as 57.

//var %i 1 | while (%i isnum 1-58) { var %a $len($encode($str(a,%I),v)) , %b $len($remove($encode($str(a,%I),m),=)) | echo $iif(%a < %b,3,4) %i : %a : %b | inc %I }
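The same comparison can be checked with plain arithmetic. This Python sketch assumes the 'v' length is 1 + 5 per started group of 4 bytes (inferred from the 123-byte example, not from any documentation) and that the unpadded mime length is ceil(4n/3). The helper names are hypothetical.

```python
def z85_v_len(n):
    # Assumed length of $encode(text,v): leading digit + 5 digits per started 4-byte group
    return 1 + 5 * ((n + 3) // 4)

def mime_len(n):
    # Base64 length with the trailing '=' padding removed: ceil(4n/3)
    return (4 * n + 2) // 3

def first_shorter(limit=200):
    # Smallest input length where the 'v' string is strictly shorter than mime
    return next(n for n in range(1, limit + 1) if z85_v_len(n) < mime_len(n))

def last_not_shorter(limit=200):
    # Largest input length where the 'v' string is still not shorter than mime
    return max(n for n in range(1, limit + 1) if z85_v_len(n) >= mime_len(n))

print(first_shorter())                 # 16
print(z85_v_len(45), mime_len(45))     # 61 60
print(z85_v_len(57), mime_len(57))     # 76 76
print(last_not_shorter())              # 57
```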

--

However, the leading padding character is actually redundant for determining the correct decoded length, similar to the way the '=' padding is not needed to determine the decoded length for mime. ASCII85 was the original base85 method, using a different encoding alphabet, and it did not use any padding at all. When the remainder of (input length mod 4) is not zero, only (1+remainder) of the final 5 encoding characters are needed for the decoder to decode the string to the correct length. The decoder can tell how many bytes to output from the length of any partial chunk at the tail end.

The way ASCII85 does it without padding and without losing information in the final chunk is to...

encode:
Remember how many extra bytes (1-3) are needed to complete the final 4-byte uint32, and append that many 0x00 bytes. Then encode as usual. Then, however many 0x00 bytes were added, remove that same number of encoding characters from the end of the 5 encoding chars just created.

decode:
If the final group of encoding characters has only 1-4 characters instead of 5, it decodes into 1 fewer byte than it has characters, so 0-3 bytes. If it has just 1 char then you can skip it and you're done, for the same reason the last char of a length 5 mime string can be discarded.

If the final group has 2-4 chars, append 1-3 of the FINAL character in the encoding alphabet to bring it up to the normal length of 5, which for the Z85 alphabet means 1-3 of the '#' character. You then decode it like normal into a group of 4 bytes. Then, however many '#' chars were appended, that's how many bytes are discarded from the end of the decoded 4 bytes.

The online encoder at https://cryptii.com/pipes/z85-encoder appears to be doing it the way I described here, and you can see that it encodes 1 byte as 2 chars, 2 as 3, and 3 as 4.
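The encode and decode steps described above can be sketched in a few lines of Python. This is only an illustration of the truncation idea, not the code mIRC uses. The alphabet is the standard Z85 one, and the function names are made up.

```python
Z85 = ("0123456789abcdefghijklmnopqrstuvwxyz"
       "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#")

def z85_encode(data: bytes) -> str:
    # Append 0-3 zero bytes to complete the final 4-byte group
    pad = -len(data) % 4
    padded = data + b"\x00" * pad
    out = []
    for i in range(0, len(padded), 4):
        value = int.from_bytes(padded[i:i + 4], "big")
        group = ""
        for _ in range(5):
            value, digit = divmod(value, 85)
            group = Z85[digit] + group      # most significant digit ends up first
        out.append(group)
    text = "".join(out)
    # Chop as many chars as zero bytes were appended
    return text[:len(text) - pad]

def z85_decode(text: str) -> bytes:
    # A lone trailing char carries no complete byte, so ignore it
    if len(text) % 5 == 1:
        text = text[:-1]
    # Complete the final group with the last alphabet char ('#')
    pad = -len(text) % 5
    padded = text + Z85[-1] * pad
    out = bytearray()
    for i in range(0, len(padded), 5):
        value = 0
        for ch in padded[i:i + 5]:
            value = value * 85 + Z85.index(ch)   # ValueError on a non-Z85 char
        out += value.to_bytes(4, "big")          # OverflowError if group > 2^32-1
    # Discard as many bytes as '#' chars were appended
    return bytes(out[:len(out) - pad])

print(z85_encode(bytes.fromhex("864FD26FB559F75B")))   # HelloWorld
print([len(z85_encode(b"a" * n)) for n in (1, 2, 3, 4)])  # [2, 3, 4, 5]
```

Padding with the highest digit rather than the lowest matters: the chopped digits can only have made the value smaller, so filling with '#' rounds back up far enough that the kept leading bytes come out unchanged.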

--

I've tweaked the original base85 code I'd posted previously, so that it uses the Z85 alphabet instead of one that had eliminated $ % , etc. It has the design I described above. The command below tests random byte strings of lengths 1-3, and it is always able to decode accurately without a length-byte or any padding at all.

//var %i 9999 | while (%i) { var %rand $rand(0,4294967295) | bset -c &v 1 $gettok($replace($longip(%rand),.,$chr(32)),1- $+ $r(1,3),32) | bcopy -c &v2 1 &v 1 4 | noop $z85encode(&v,1) $z85decode(&v,1) | if ($bvar(&v,1-) != $bvar(&v2,1-)) echo -a fail %rand $v1 vs $v2 | dec %I }

Code

/*
{
  z85 encoder and decoder by maroon

  Syntax: $z85encode(any string,N) $z85decode(z85text,N)
  N: If $2 is 1: $1 is name of &binvar, otherwise $1 is text

  examples:
  //var -s %a $z85encode(maroon) , %b $z85decode(%a)
  //bset -tc &v 1 maroon | noop $z85encode(&v,1) | echo -a $bvar(&v,1-).text

  85 is used as the base because it's the smallest integer where N^5 >= 256^4

  The benefit of base85 over mime is that it can encode 4 bytes into 5 text, for a 125% output length,
  compared to the 133% length from mime encoding 3 bytes as 4 text. When encrypting a channel message
  to an encrypted string of length 280 bytes, the base85 encoding would be 350 bytes - while the mime
  encoding would be 374 bytes plus often being padded with an additional 2 '=' characters.

  Base85 string should not need padding, as the plaintext length can be calculated from the encoded length
*/

alias z85_alphabet return 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#
alias z85decode_lookup {
  ; '0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
  return $&
    99    99 99 99 99 99 99 99 99 99 99 99 99 99 99 $&
    99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 $&
    99 68 99 84 83 82 72 99 75 76 70 65 99 63 62 69 $&
    0   1  2  3  4  5  6  7  8  9 64 99 73 66 74 71 $&
    81 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 $&
    51 52 53 54 55 56 57 58 59 60 61 77 99 78 67 99 $&
    99 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 $&
    25 26 27 28 29 30 31 32 33 34 35 79 99 80 99 99 $&
    99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 $&
    99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 $&
    99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 $&
    99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 $&
    99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 $&
    99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 $&
    99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 $&
    99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 $&
  }

alias z85encode {
  if ($2 == 1) {
    if (!$bvar($1,0)) { var %err invalid binvar $1 | goto z85_encode_error }
    bcopy -c &maroon.z85.in 1 $1 1 -1
  }
  else noop $regsubex(foo,$1,,,&maroon.z85.in)

  var %in.ptr 1 , %chop 0 , %len $bvar(&maroon.z85.in,0) , %z85.output
  if (!%len) { var %err zero length input string | goto z85_encode_error }

  while (%in.ptr <= %len) {
    var %remain_less_1 $calc(%len - %in.ptr)

    ; If final input chunk's length wasn't 4, append missing bytes as 0x00's, but keep track
    ; so the same number of chars can be stripped from text output
    ; this appends 3 0's to ensure .nlong doesn't fail, but sets %chop to the number of
    ; 0's needed to make the final input chunk be length 4
    if (%remain_less_1 isnum 0-2) { var %chop 3 - $v1 | bset &maroon.z85.in $calc(1+%len) 0 0 0 }

    ; z85 translates 4 bytes into 5 text, so it's efficient to handle as if it's translating
    ; the 4-byte input chunk as if it's a big-endian unsigned 32-bit integer
    var %uint32 $bvar(&maroon.z85.in,%in.ptr).nlong , %divisor 85 ^ 4

    var %j 0 | while (%j < 5) {
      inc %j
      ; by repeatedly floor-dividing by a shrinking divisor, has the effect of outputting the
      ; encoding digits in the expected big-endian order

      var %pos $calc(%uint32 // %divisor) , %uint32 $calc(%uint32 - %divisor * %pos) , %divisor %divisor / 85
      var -p %z85.output %z85.output $+ $mid($z85_alphabet,$calc(1+%pos),1)
    }
    :next_group_of_4
    inc %in.ptr 4
  }
  ; if added 1-3 0x00 bytes to complete a final chunk of 4, chop that many chars from text output
  ; because encoding length for 1-4 byte chunks is always 1+partial_input_chunk_length
  if (%chop) var %z85.output $left(%z85.output,- $+ %chop)

  ; If N=1, replace original binvar's contents and change return value to new binvar-length
  if ($2 == 1) { bset -tc $1 1 %z85.output | var %z85.output $bvar($1,0) }
  return %z85.output
  :z85_encode_error
  echo -sc info *z85encode: %err
  if (($2 == 1) && (&* iswm $1)) bunset $1
  halt
}

alias z85decode {
  if ($2 == 1) {
    if (!$bvar($1,0)) { var %err invalid binvar $1 | goto z85_decode_error }
    ; unset binvar and will re-create
    else { var -p %from.string $bvar($1,1-).text | bunset $1 }
  }
  else var -p %from.string $1
  var %len $len(%from.string) | if (!%len) { var %err zero length z85 string | goto z85_decode_error }

  var %last.char $chr(35) , %in.ptr 1 , %out.ptr 1 , %chop 0
  bset -c &maroon.z85.lookup 1 $z85decode_lookup
  bunset &maroon.z85.out
  while (%in.ptr <= %len) {
    var %this_char $mid(%from.string,%in.ptr,1) , %uint32 0 , %j 0

    ; if final chunk is length 1-char, ignore because that always decodes to $null
    if (%len == %in.ptr) { dec %len | break }

    ; append 1-3 of '<last char>' padding if final group of 5 isn't complete
    ; but keep track of how many chars were added, so that same length can be removed from decoded output
    if ($calc( %len - %in.ptr) < 4) {
      inc %chop $calc(4+%in.ptr -%len)
      var %from.string %from.string $+ $str(%last.char,%chop)
    }

    while (%j < 5) {
      var %s $bvar(&maroon.z85.lookup,$asc($mid(%from.string,%in.ptr,1)))

      if (%s !isnum 0-84) { var %err invalid z85 string $mid(%from.string,%in.ptr,1) | goto z85_decode_error }

      var %uint32 $calc(%uint32 * 85 + %s ) | inc %j | inc %in.ptr
    }

    :next_group_of_5
    ; cheating by using $longip instead of looping 4x to parse 32-bit value into 4 big-endian byte values
    bset -c &maroon.z85.out %out.ptr $replace($longip(%uint32),.,$chr(32)) | inc %out.ptr 4
  }

  ; if final block was padded, now chop that many extra decoded bytes
  ; because output decoded length == (input chunk length less 1)
  if (%chop) bcopy -c &maroon.z85.out 1 &maroon.z85.out 1 $calc(%out.ptr -1 -%chop)
  ; I've never seen a base85 string being end-padded like mime sometimes is.
  ; Length of the decoded string is calculated from the length of the encoded string

  if ($2 == 1) {
    if ($bvar(&maroon.z85.out,0)) { bcopy -c $1 1 &maroon.z85.out 1 -1 | return $bvar($1,0) }
    ; next creates zero length &binvar for zero length output
    bset -z $1 | return 0
  }
  if ($bvar(&maroon.z85.out,0)) returnex $bvar(&maroon.z85.out,1-).text
  ; input string was exactly 1 z85 char, so return $null
  return $null
  :z85_decode_error
  echo -sc info *z85decode: %err
  halt
}

alias create_z85lookup {
  var %a $1
  if ($len(%a) != 85) { var %err input $v1 MUST be length 85: $1 | goto create85_err }
  if ($regex(foo,$1,/(.).*\1/)) { var %err input MUST not contain dupes: $regml(foo,1) $1 | goto create85_err }
  ; capture group added so $regml(foo,1) can report the offending character
  if ($regex(foo,$1,/([^!-~])/)) { var %err input MUST not contain outside range !-~: $regml(foo,1) $1 | goto create85_err }
  bset -c &maroon.tmp 1 $str(99 $chr(32),255)
  var %i 85 | while (%i) { var %a $asc($mid($1,%i,1)) | dec %i | bset &maroon.tmp %a %i }
  var %list , %i 33 | while (%i isnum 33-126) { if ($chr(%i) !isincs $1) var %list %list $+ $v1 | inc %i } | echo -a unchosen chars: %list
  echo -a To use this 85char alphabet as the new alphabet, (1) paste the following 2 aliases above the existing 2 same name aliases:
  echo 3 -a alias z85_alphabet return $1
  echo 7 -a alias z85decode_lookup return $bvar(&maroon.tmp,1-)
  echo -a (2) find the % $+ last.char variable being defined and change it to use the last char of the new alphabet - if it's different than the last char of the old one
  return
  :create85_err
  echo -sc info *create_z85lookup: %err
  echo -sc info syntax: input = string of non-duplicated 85 characters in range ! through ~
}

