Skip to content

Latest commit

 

History

History
144 lines (125 loc) · 5.54 KB

README.md

File metadata and controls

144 lines (125 loc) · 5.54 KB

lex85

A base85 encoding with stable lexicographic sorting to the raw bytes

lexicographic base85 alphabet

#$%&()*+-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghijklmnopqrstuvwxyz{|}

We want to put bytes into ascii strings with more density then b64.

4/3 blowup for b64 but only 5/4 blowup for lex85.

Our target environment for lex85 is CSV, TSV, JSON, and programming language literal strings. If we look at how many bytes a char if encoded in json is their are 93 chars that are encoded as a single byte.

In [0]: import json
In [1]: {i:chr(i) for i in range(256) if len(json.dumps(chr(i)))<=3}  # "c"
Out[1]:
 32: ' ',  33: '!',  35: '#',  36: '$',  37: '%',  38: '&',  39: "'",  40: '(',
 41: ')',  42: '*',  43: '+',  44: ',',  45: '-',  46: '.',  47: '/',  48: '0',
 49: '1',  50: '2',  51: '3',  52: '4',  53: '5',  54: '6',  55: '7',  56: '8',
 57: '9',  58: ':',  59: ';',  60: '<',  61: '=',  62: '>',  63: '?',  64: '@',
 65: 'A',  66: 'B',  67: 'C',  68: 'D',  69: 'E',  70: 'F',  71: 'G',  72: 'H',
 73: 'I',  74: 'J',  75: 'K',  76: 'L',  77: 'M',  78: 'N',  79: 'O',  80: 'P',
 81: 'Q',  82: 'R',  83: 'S',  84: 'T',  85: 'U',  86: 'V',  87: 'W',  88: 'X',
 89: 'Y',  90: 'Z',  91: '[',  93: ']',  94: '^',  95: '_',  96: '`',  97: 'a',
 98: 'b',  99: 'c', 100: 'd', 101: 'e', 102: 'f', 103: 'g', 104: 'h', 105: 'i',
106: 'j', 107: 'k', 108: 'l', 109: 'm', 110: 'n', 111: 'o', 112: 'p', 113: 'q',
114: 'r', 115: 's', 116: 't', 117: 'u', 118: 'v', 119: 'w', 120: 'x', 121: 'y',
122: 'z', 123: '{', 124: '|', 125: '}', 126: '~'

we need to pick 85 chars for our alphabet

  • we drop 32: ' ' since it is ambiguous with tab
  • we drop 33: '!' to reserve it as less then all lex85 chars
  • we drop 34: '"' since it is the json string container
  • we drop 39: "'" since it is a string container
  • we drop 44: ',' since it is a delimiter
  • we drop 46: '.' since it is a delimiter
  • we drop 47: '/' since it is a delimiter
  • we drop 92: '\' since it is the json string escaping
  • we drop 96: '`' since it is a string container
  • we drop 126: '~' to reserve it as greater than all lex85 chars

giving us the alphabet "#$%&()*+-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghijklmnopqrstuvwxyz{|}"

Note:

  • Sort Order

    The alphabet is in lexicographic sort order this is important since it allows for the encoded strings and decoded bytes to sort in the same order.

  • Fencepost

    '!' and '~' are not in the lex85 alphabet but are in printable ascii so fencepost can be constructed

    "aa!" will sort before "aa#" and after "aa"

    "aa~" will sort after not only "aa}" but "aa}}}}}}}}}}}}... and before "ab"

    Fenceposts allow programmers get exclusive ranges using only inclusive range queries.

encoding

 -------------------------------- 
| C1    | C2   | C3  | C4  | C5  |
|--------------------------------|
| i unsigned 32 bit integer      |
|--------------------------------|
| byte1 | byte2 | byte3 | byte4  |
 --------------------------------

i = (C1 * 85 ** 4) + (C2 * 85 ** 3) + (C3 * 85 ** 2) + (C4 * 85)+ C5
i = (b1 << 24) + (b2 << 16) + (b3 << 8) + (b4 << 0)

If the number of chars is not modulo 5 then pad with b85(84)

If the number of bytes is not modulo 4 then pad with 0x00

Then drop the length of the pad from the output

base85_alphabet = "#$%&()*+-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghijklmnopqrstuvwxyz{|}" 
integer2character85 = base85_alphabet
character2integer85 = {character: index for index, character in enumerate(base85_alphabet)}

def encode85(buffer):
    pad = (-(len(buffer) % -4))  # bytes to get to multiple of 4
    buffer = buffer + b'\x00'*pad
    encoded = [''] * ((len(buffer)//4)*5)
    for i in range(0, len(buffer)//4):
        integer = ((buffer[4 * i + 0] << 24)|
                   (buffer[4 * i + 1] << 16)|
                   (buffer[4 * i + 2] << 8 )|
                   (buffer[4 * i + 3] << 0 ))
        encoded[5 * i + 0] = integer2character85[integer // (85**4) % 85]
        encoded[5 * i + 1] = integer2character85[integer // (85**3) % 85]
        encoded[5 * i + 2] = integer2character85[integer // (85**2) % 85]
        encoded[5 * i + 3] = integer2character85[integer // (85**1) % 85]
        encoded[5 * i + 4] = integer2character85[integer // (85**0) % 85]
        if integer >> 32:
            return OverflowError(f"{integer} > uint32_max")
    return ''.join(encoded)[:len(encoded)-pad]

def decode85(string):
    pad = (-(len(string) % -5))  # characters to get to multiple of 5
    string = string + '}'*pad
    buffer = bytearray(len(string) // 5 * 4)
    for i in range(0, len(string) // 5):
        integer = (character2integer85[string[i * 5 + 0]] * (85 ** 4) +  
                   character2integer85[string[i * 5 + 1]] * (85 ** 3) +
                   character2integer85[string[i * 5 + 2]] * (85 ** 2) +
                   character2integer85[string[i * 5 + 3]] * (85 ** 1) +
                   character2integer85[string[i * 5 + 4]] * (85 ** 0))
        buffer[i * 4 + 0] = integer >> 24 & 0xff
        buffer[i * 4 + 1] = integer >> 16 & 0xff
        buffer[i * 4 + 2] = integer >>  8 & 0xff
        buffer[i * 4 + 3] = integer >>  0 & 0xff
    return bytes(buffer[:len(buffer)-pad])

Recommended syntax for tagging lex85

prefix l

# int literal stile
0lHU}#zJb = 448378203247
# like
0x68656c6c6f = 448378203247

# tagged string literal stile
l"HU}#zJb" = b'hello'
# like
b"\x68\x65\x6c\x6c\x6f" = b'hello'

# multibase stile
multibase.decode("lHU}#zJb")  = b'hello'
#like
multibase.decode("f68656c6c6f") = b'hello'