Evan Harmon - Memex

Binary-to-text Encoding

A binary-to-text encoding is an encoding of binary data as a sequence of printable characters. These encodings are necessary for transmitting data when the communication channel does not allow binary data or is not 8-bit clean. PGP documentation uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.
wikipedia:: Binary-to-text encoding
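
As a minimal sketch of the idea, the snippet below uses Base64 from Python's standard `base64` module to turn bytes that are not printable into plain ASCII and back. The helper names `to_text` and `from_text` are made up for illustration.

```python
import base64

def to_text(data: bytes) -> str:
    """Encode arbitrary bytes as printable ASCII characters using Base64."""
    return base64.b64encode(data).decode("ascii")

def from_text(text: str) -> bytes:
    """Recover the original bytes from their Base64 text form."""
    return base64.b64decode(text)

# Bytes that are neither printable nor safe on a 7-bit channel.
raw = bytes([0x00, 0xFF, 0x10, 0x80])
armored = to_text(raw)

print(armored)                    # only letters, digits, '+', '/', '='
print(from_text(armored) == raw)  # the round trip is lossless
```

The cost is size: Base64 emits 4 characters for every 3 input bytes.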

Unlike the natural translation between binary and decimal numbers, there is no natural translation between integers and characters. For example, you might create a pairing of 0 to a, 1 to b, and so on. But what integer should be paired with $ or a tab? Since there is no natural way to translate between characters and integers, computer scientists have had to make such translations up. Such translations are called character encodings.

Symbols, tokens, tokenizing, e.g. for:

  • compiling
  • parsing (e.g. to render or display)
  • converting

Glyph

Code point

  • A code point of a coded character set is any allowed value in the character set.

Ascii85

  • Its main modern uses are in Adobe's PostScript and Portable Document Format (PDF) file formats, and in the patch encoding Git uses for binary files.
  • https://en.wikipedia.org/wiki/Ascii85
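
A short sketch using `base64.a85encode` and `base64.a85decode` from Python's standard library. By default these produce the bare encoding without Adobe's `<~ ... ~>` framing. The helper name `a85_roundtrip` is hypothetical.

```python
import base64

def a85_roundtrip(data: bytes) -> bytes:
    """Encode bytes with Ascii85, then decode them again."""
    return base64.a85decode(base64.a85encode(data))

# Ascii85 maps each group of 4 bytes to 5 printable characters.
encoded = base64.a85encode(b"Man ")
print(encoded)  # b'9jqo^'

# Base64 needs 4 characters per 3 bytes, so Ascii85 is more compact.
payload = bytes(range(12))
print(len(base64.a85encode(payload)), len(base64.b64encode(payload)))
```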

Percent-encoding (URL encoding)

Character Repertoire

  • the abstract set of characters
  • essentially every generally known written symbol/glyph/letter/number etc. in any language, mathematics, musical system, etc.

Base36

Base32

Unicode

  • Code points are conventionally written in hexadecimal with a U+ prefix
  • There is no such thing as plain text
    • It makes no sense to have a string of characters without knowing what encoding it uses.
  • E.g.:
    • Latin 'A' = U+0041
    • Ampersand = U+0026
  • Ways of declaring what encoding a string is
    • email
      • a string in the header of the form Content-Type: text/plain; charset="UTF-8"
    • HTML
      • A parser has to start reading the HTML file before it knows what encoding the file is in. Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so it can read far enough into the page to find the encoding declaration, provided the declaration appears before any non-ASCII characters.
  • UTF-8
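    • a variable-length way of storing Unicode that uses 1 to 4 bytes per code point. Code points 0-127 are stored in a single byte identical to ASCII, so plain English text looks the same in UTF-8 as in ASCII; accented letters and other scripts use 2 or more bytes.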
  • UTF-16
  • UTF-32
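  • A letter maps to a certain code point, written as U+ followed by hexadecimal digits
  • Common myth: Unicode is a 16-bit code, so you can only store about 65,000 characters. False: code points run up to U+10FFFF, and how many bytes each one takes depends on the encoding.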
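
The points above can be sketched in Python, whose `str` type is a sequence of Unicode code points. The helper `code_point` is a made-up name; the sample characters beyond 'A' and '&' are chosen only for illustration.

```python
def code_point(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

print(code_point("A"))  # U+0041
print(code_point("&"))  # U+0026

# The same character becomes different bytes under different encodings.
e_acute = "\u00e9"                 # LATIN SMALL LETTER E WITH ACUTE
as_utf8 = e_acute.encode("utf-8")  # two bytes: C3 A9
as_latin1 = e_acute.encode("latin-1")  # one byte: E9

# Decoding with the wrong encoding silently produces the wrong text,
# which is why a string is meaningless without a declared encoding.
misread = as_utf8.decode("latin-1")
print(repr(misread))  # two characters, U+00C3 then U+00A9

# Code points are not limited to 16 bits.
grinning_face = "\U0001F600"
print(code_point(grinning_face), ord(grinning_face) > 0xFFFF)
```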

Character

  • a minimal unit of text that has semantic value.

Code unit

  • a bit sequence used to encode each character of a repertoire within a given encoding form.
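
A rough way to see code units in practice is to count how many each encoding form needs for one character. The sketch below assumes 8-bit units for UTF-8, 16-bit units for UTF-16, and 32-bit units for UTF-32; `count_code_units` is a hypothetical helper.

```python
def count_code_units(text: str) -> dict:
    """Return the number of code units needed in each Unicode encoding form."""
    return {
        # The -le variants are used so no byte order mark is added to the count.
        "utf-8": len(text.encode("utf-8")),            # 1 byte per unit
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per unit
    }

print(count_code_units("A"))           # one unit in every form
print(count_code_units("\u20ac"))      # EURO SIGN: three UTF-8 units
print(count_code_units("\U0001F600"))  # needs a UTF-16 surrogate pair
```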

Coded character set

  • a character set in which each character corresponds to a unique number.

ASCII

  • ASCII provides a standard translation of the most commonly-used characters to one of the integers 0...127, which means each character can be stored in a computer using a single byte.
  • https://www.ascii-code.com/
  • 7 bits
  • Printable characters are 32-126; 0-31 and 127 (DEL) are control characters
    • space = 32
    • A = 65
  • "American Standard Code for Information Interchange"
  • ASCII maps a to 97, b to 98, and so on for lowercase letters, with z mapping to 122. Uppercase letters map to the values 65 through 90. The other integers between 0 and 127 represent digits, punctuation, control codes, and other assorted characters. This scheme is called the ASCII table.
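
The mapping can be checked with Python's built-in `ord` and `chr`. This is only an illustrative sketch; `is_ascii_byte_string` is a made-up helper name.

```python
def is_ascii_byte_string(data: bytes) -> bool:
    """True if every byte fits in 7 bits, i.e. is in the range 0..127."""
    return all(b < 128 for b in data)

print(ord(" "), ord("A"), ord("a"), ord("z"))  # 32 65 97 122
print(chr(90))                                 # Z

# Upper and lower case differ by exactly 32 (a single bit).
print(ord("a") - ord("A"))                     # 32

# Encoding to ASCII yields one byte per character.
encoded = "Hello".encode("ascii")
print(len(encoded), is_ascii_byte_string(encoded))

# Characters outside the table cannot be encoded as ASCII at all.
try:
    "\u00e9".encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
```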

Character set

  • a collection of characters that might be used by multiple languages.
  • e.g. the Latin character set (alphabet) is used by English plus many European languages

MIME

  • https://en.wikipedia.org/wiki/MIME
  • https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types
  • MIME types, used to specify the content types in the Accept field, consist of a type and a subtype separated by a slash (/).
    • A text file containing HTML is text/html; if it contained CSS instead, it would be text/css; a generic text file is text/plain.
    • text/plain is a default value, not a catch-all: if a client is expecting text/css and receives text/plain, it will not be able to recognize the content.
  • Other types and commonly used subtypes:
    • image: image/png, image/jpeg, image/gif
    • audio: audio/wav, audio/mpeg
    • video: video/mp4, video/ogg
    • application: application/json, application/pdf, application/xml, application/octet-stream
  • For example, a client accessing a resource with id 23 in an articles resource on a server might send a GET request like this:
    • GET /articles/23
    • Accept: text/html, application/xhtml+xml
  • The Accept header field in this case says that the client will accept the content as text/html or application/xhtml+xml.
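
A minimal sketch of splitting an Accept header value into type and subtype pairs. `parse_accept` is a hypothetical helper, not a library function; it ignores parameters such as `q=0.8` and does no validation.

```python
def parse_accept(header_value: str) -> list:
    """Split an Accept header value into (type, subtype) tuples."""
    pairs = []
    for item in header_value.split(","):
        # Drop any parameters after ';' (for example a quality value).
        media_type = item.split(";", 1)[0].strip()
        if not media_type:
            continue
        main_type, _, subtype = media_type.partition("/")
        pairs.append((main_type, subtype))
    return pairs

accepted = parse_accept("text/html, application/xhtml+xml")
print(accepted)

# A server could use the result to check whether it can satisfy the client.
print(("text", "html") in accepted)
print(("text", "plain") in accepted)
```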

Morse Code

Hexadecimal (base16)
