aliases:
  - Character encoding
tags:
  - Type/Tech
  - area/tech
  - stem
publish: true
version: 1
dateCreated: 2021-10-25, 22:55
dateModified: 2024-03-14, 05:11
from: 
related: 
contra: 
to:

Binary-to-text Encoding

	A binary-to-text encoding is an encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmitting data when the communication channel does not allow binary data or is not 8-bit clean. PGP documentation uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.
	wikipedia:: Binary-to-text encoding
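
A minimal sketch of the idea, using Base64 from Python's standard library. The sample bytes and the variable names are made up for illustration.

```python
import base64

# Arbitrary binary data, including bytes that are not printable ASCII.
raw = bytes([0x00, 0xFF, 0x10, 0x80, 0x7F])

# Base64 maps every 3 input bytes to 4 printable ASCII characters,
# padding the final group with "=" when the input length is not a multiple of 3.
armored = base64.b64encode(raw).decode("ascii")

# Decoding recovers the original bytes exactly.
restored = base64.b64decode(armored)
```

The encoded text can travel through a channel that only accepts printable characters, at the cost of being about a third larger than the raw bytes.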

Unlike the natural translation between binary and decimal numbers, there is no natural translation between integers and characters. For example, you might create a pairing of 0 to a, 1 to b, and so on. But what integer should be paired with $ or a tab? Since there is no natural way to translate between characters and integers, computer scientists have had to make such translations up. Such translations are called character encodings.

Binary

Symbols, tokens, tokenizing - e.g.
compiling
parsing - e.g. to render or display
converting

Glyph

Code point

A code point of a coded character set is any allowed value in the character set.

Ascii85

Its main modern uses are in Adobe's PostScript and Portable Document Format file formats, as well as in the patch encoding for binary files used by Git.
https://en.wikipedia.org/wiki/Ascii85
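
A short sketch with Python's `base64` module, which ships an Ascii85 codec. The sample bytes are arbitrary; the `adobe=True` flag adds the `<~` and `~>` framing used in PostScript and PDF.

```python
import base64

# Eight arbitrary bytes: Ascii85 turns each 4-byte group into 5 characters.
data = b"\x00\x01\x02\x03\xff\xfe\xfd\xfc"

encoded = base64.a85encode(data)                    # plain Ascii85
adobe_framed = base64.a85encode(data, adobe=True)   # wrapped in <~ ... ~>

# Both forms decode back to the same bytes.
decoded = base64.a85decode(encoded)
decoded_adobe = base64.a85decode(adobe_framed, adobe=True)
```

The 4-to-5 ratio makes the output about 25% larger than the input, compared with about 33% for Base64.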

Percent-encoding (URL encoding)

Character encodings in HTML - Wikipedia
https://en.wikipedia.org/wiki/Percent-encoding
https://www.w3schools.com/tags/ref_urlencode.asp
UTF-8
- Your browser encodes input according to the character set used in your page. The default character set in HTML5 is UTF-8.
e.g. for spaces and special characters
https://developer.mozilla.org/en-US/docs/Glossary/percent-encoding
URL encoding converts characters into a format that can be transmitted over the Internet. URLs can only be sent over the Internet using the ASCII character set. Since URLs often contain characters outside that set, they have to be converted into a valid ASCII format. URL encoding replaces each unsafe or reserved character with a "%" followed by two hexadecimal digits; a non-ASCII character is first encoded as bytes (normally UTF-8), and each byte is then written as %XX. URLs cannot contain spaces: a space is written as %20, or as a plus (+) sign in form-encoded query strings.
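
A small sketch using `urllib.parse` from Python's standard library. The sample string is made up; `quote` and `quote_plus` encode non-ASCII characters as UTF-8 by default.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "café & crème?"

# quote() writes a space as %20 (the style used in URL paths).
path_style = quote(text)

# quote_plus() writes a space as + (the style used in form-encoded query strings).
form_style = quote_plus(text)

# Each style has a matching decoder.
back_from_path = unquote(path_style)
back_from_form = unquote_plus(form_style)
```

Here "é" becomes `%C3%A9` because its UTF-8 encoding is the two bytes C3 A9.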

Character Repertoire

The abstract set of characters that a coded character set covers.
For Unicode, this is essentially every generally known written symbol, glyph, letter, number, etc. in any language, mathematical notation, musical system, and so on.

Base36

https://en.wikipedia.org/wiki/Base36

Base32

https://en.wikipedia.org/wiki/Base32

Unicode

Code points are written in hexadecimal (e.g. U+0041)
There is no such thing as plain text
- It makes no sense to have a string of characters without knowing what encoding it uses.
E.g.:
- Latin 'A' = U+0041
- Ampersand = U+0026
Ways of declaring what encoding a string is
- email
  - a string in the header of the form Content-Type: text/plain; charset="UTF-8"
- HTML
  - How can you read the HTML file before you know what encoding it is in? Luckily, almost every encoding in common use treats the characters between 32 and 127 the same way, so a parser can always read far enough into the page to reach the meta tag that declares the charset, provided that tag comes early and nothing before it uses non-ASCII letters.
UTF-8
- A way of storing Unicode code points in 1 to 4 bytes each. Code points 0-127 are stored in a single byte, identical to ASCII, so plain English text looks the same in UTF-8 as in ASCII; accented letters and other scripts take 2 or more bytes.
UTF-16
UTF-32
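A letter maps to a certain code point, written as U+ followed by hexadecimal digits (e.g. U+0041)
Common myth: Unicode is a 16-bit code, so it can only hold about 65,000 characters. False: code points range from U+0000 to U+10FFFF, which is over a million possible values.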
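
A quick sketch of how one code point turns into a different number of bytes under each encoding form. The helper name `describe` is made up for this note; the `-le` codec variants are used so that no byte-order mark is added to the byte count.

```python
def describe(ch: str) -> tuple[str, int, int, int]:
    """Return (code point label, UTF-8 bytes, UTF-16 bytes, UTF-32 bytes) for one character."""
    label = f"U+{ord(ch):04X}"
    return (
        label,
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


latin_a = describe("A")     # same single byte as ASCII in UTF-8
ampersand = describe("&")
e_acute = describe("é")     # needs 2 bytes in UTF-8
euro = describe("€")        # needs 3 bytes in UTF-8
emoji = describe("😀")      # code point above 65,535
```

The emoji's code point is U+1F600, which is larger than 65,535 and so does not fit in a single 16-bit unit.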

Base64

Character

a minimal unit of text that has semantic value.

Code unit

a bit sequence used to encode each character of a repertoire within a given encoding form.
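
To make the definition concrete, here is a sketch that lists the code units of a string in UTF-8 (8-bit units) and UTF-16 (16-bit units). The function names are made up for illustration.

```python
import struct


def utf8_code_units(text: str) -> list[int]:
    """Return the 8-bit code units (bytes) of text in UTF-8."""
    return list(text.encode("utf-8"))


def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit code units of text in UTF-16 (little-endian, no BOM)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


a_units = utf16_code_units("A")        # one 16-bit unit
emoji_units = utf16_code_units("😀")   # two units: a surrogate pair
e_acute_units = utf8_code_units("é")   # two 8-bit units
```

One character can therefore take one code unit or several, depending on the encoding form.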

Coded character set

a character set in which each character corresponds to a unique number.

ASCII

ASCII provides a standard translation of the most commonly-used characters to one of the integers 0...127, which means each character can be stored in a computer using a single byte.
https://www.ascii-code.com/
7 bits
Printable characters: 32-126 (0-31 and 127 are control characters)
- space = 32
- A = 65
"American Standard Code for Information Interchange"
ASCII maps a to 97, b to 98, and so on for lowercase letters, with z mapping to 122. Uppercase letters map to the values 65 through 90. The other integers between 0 and 127 represent digits, symbols, punctuation, and control characters. This scheme is called the ASCII table.
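
The mappings above can be checked with Python's built-in `ord` and `chr`. This is only a sketch; the variable names are arbitrary.

```python
# ord() gives the integer for a character; chr() goes the other way.
space_code = ord(" ")
upper_a = ord("A")
lower_a = ord("a")
lower_z = ord("z")
from_code = chr(65)

# Every ASCII value fits in 7 bits, so encoding yields one byte per character.
encoded = "Hello".encode("ascii")
values = list(encoded)

# Characters outside 0..127 cannot be encoded as ASCII.
try:
    "é".encode("ascii")
    e_acute_is_ascii = True
except UnicodeEncodeError:
    e_acute_is_ascii = False
```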

Character set

e.g. Latin character set (alphabet) is used by English plus many European languages
a collection of characters that might be used by multiple languages.

MIME

https://en.wikipedia.org/wiki/MIME
https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types
MIME types, used to specify content types in the Accept header field, consist of a type and a subtype separated by a slash (/). For example, a text file containing HTML is specified as text/html; if the file contained CSS instead, it would be text/css. A generic text file is denoted text/plain. This default value, text/plain, is not a catch-all, however: if a client is expecting text/css and receives text/plain, it will not be able to recognize the content.
Other types and commonly used subtypes:
- image: image/png, image/jpeg, image/gif
- audio: audio/wav, audio/mpeg
- video: video/mp4, video/ogg
- application: application/json, application/pdf, application/xml, application/octet-stream
For example, a client accessing a resource with id 23 in an articles resource on a server might send a GET request like this:
GET /articles/23
Accept: text/html, application/xhtml+xml
The Accept header field in this case says that the client will accept the content as text/html or application/xhtml+xml.
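
A rough sketch of the type/subtype matching described above. The function names `parse_media_type` and `is_acceptable` are made up for this note, and the matching is deliberately simplified: it honours `*` wildcards but ignores quality values (`q=`) and other parameters.

```python
def parse_media_type(value: str) -> tuple[str, str]:
    """Split 'type/subtype; params' into (type, subtype), lowercased."""
    essence = value.split(";", 1)[0].strip().lower()
    main_type, _, subtype = essence.partition("/")
    return main_type, subtype


def is_acceptable(accept_header: str, content_type: str) -> bool:
    """Return True if content_type matches any entry in the Accept header value."""
    wanted_type, wanted_subtype = parse_media_type(content_type)
    for entry in accept_header.split(","):
        main_type, subtype = parse_media_type(entry)
        if main_type in ("*", wanted_type) and subtype in ("*", wanted_subtype):
            return True
    return False


accept = "text/html, application/xhtml+xml"
html_ok = is_acceptable(accept, "text/html")
plain_ok = is_acceptable(accept, "text/plain")
```

With this header, text/html is accepted and text/plain is not, which matches the point that text/plain is not a catch-all.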

Morse Code

Hexadecimal (base16)

https://en.wikipedia.org/wiki/Hexadecimal
aka base16 or hex
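
Hex is also the simplest binary-to-text encoding: each byte becomes two characters. A small sketch with Python built-ins; the sample bytes are arbitrary.

```python
data = bytes([0, 15, 16, 255])

# Each byte is written as exactly two hexadecimal digits.
hex_text = data.hex()

# bytes.fromhex() reverses the encoding.
restored = bytes.fromhex(hex_text)

# The same base applies to single integers.
as_int = int("ff", 16)
as_hex = format(255, "x")
```

The output is twice the size of the input, which is why Base64 or Ascii85 is usually preferred for larger payloads.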
