# Binary-to-text Encoding

| | A **Binary-to-text Encoding** is an encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmission of data when the communication channel does not allow binary data or is not 8-bit clean. PGP documentation uses the term "ASCII armor" for binary-to-text encoding when referring to Base64. |
|-|-|
| | wikipedia:: [Binary-to-text encoding](https://en.wikipedia.org/wiki/Binary-to-text_encoding) |

Unlike the natural translation between binary and decimal numbers, there is no natural translation between integers and characters. For example, you might pair 0 with a, 1 with b, and so on. But what integer should be paired with $ or a tab? Since there is no natural way to translate between characters and integers, computer scientists have had to make such translations up. Such translations are called character encodings.

[[Binary numeral system|Binary]]

- Symbols, tokens, tokenizing
  - e.g. compiling
- parsing
  - e.g. to render or display
- converting

## Glyph

## Code point

- A code point of a coded character set is any allowed value in the character set.

## Ascii85

- Its main modern uses are in Adobe's PostScript and Portable Document Format file formats, as well as in the patch encoding for binary files used by Git.
- https://en.wikipedia.org/wiki/Ascii85

## Percent-encoding (URL encoding)

- [Character encodings in HTML - Wikipedia](https://en.wikipedia.org/wiki/Character_encodings_in_HTML)
- https://en.wikipedia.org/wiki/Percent-encoding
- https://www.w3schools.com/tags/ref_urlencode.asp
  - UTF-8
    - Your browser will encode input according to the character set used in your page. The default character set in HTML5 is UTF-8.
  - e.g. for spaces and special characters
- https://developer.mozilla.org/en-US/docs/Glossary/percent-encoding
- URL encoding converts characters into a format that can be transmitted over the Internet.
- URLs can only be sent over the Internet using the ASCII character set. Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format.
- URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits.
- URLs cannot contain spaces. URL encoding normally replaces a space with a plus (+) sign (in form-encoded query strings) or with %20.

## Character Repertoire

- the abstract set of characters
  - essentially every generally known written symbol/glyph/letter/number etc. in any language, mathematics, musical system, etc.

## Base36

- https://en.wikipedia.org/wiki/Base36

## Base32

- https://en.wikipedia.org/wiki/Base32

## Unicode

- Code points are written in hexadecimal
- There is no such thing as plain text
  - It makes no sense to have a string of characters without knowing what encoding it uses.
- E.g.:
  - Latin 'A' = U+0041
  - Ampersand = U+0026
- Ways of declaring what encoding a string is
  - email
    - a string in the header of the form `Content-Type: text/plain; charset="UTF-8"`
  - HTML
    - how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get far enough into the HTML page to reach the `<meta>` charset declaration without starting to use funny letters.
- UTF-8
  - a way of storing Unicode code points in 1 to 4 bytes each. Code points 0 to 127 take a single byte and are identical to ASCII, so plain English text looks the same in both; accented letters and other scripts need two or more bytes.
- UTF-16
- UTF-32
- A letter maps to a certain code point (the U+0041 thing)
- Common myth: Unicode is a 16-bit code, so you can only store about 65,000 characters. False.

## [[Base64]]

## Character

- a minimal unit of text that has semantic value.

## Code unit

- a bit sequence used to encode each character of a repertoire within a given encoding form.

## Coded character set

- a character set in which each character corresponds to a unique number.
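
To make the difference between a code point and a code unit concrete, here is a minimal Python sketch. It prints the code point of a few characters and then the code units each one needs in UTF-8 (8-bit units), UTF-16 (16-bit units), and UTF-32 (32-bit units). The helper name `code_units` is hypothetical, chosen only for this illustration; it relies on Python's built-in `str.encode`.

```python
def code_units(text: str, encoding: str, unit_size: int) -> list[str]:
    """Return the code units of `text` in `encoding`, each shown as hex.

    `unit_size` is the width of one code unit in bytes
    (1 for UTF-8, 2 for UTF-16, 4 for UTF-32).
    """
    raw = text.encode(encoding)
    return [raw[i:i + unit_size].hex() for i in range(0, len(raw), unit_size)]


for ch in ["A", "&", "é", "😀"]:
    print(
        f"U+{ord(ch):04X}",                # the code point
        code_units(ch, "utf-8", 1),        # one to four 8-bit units
        code_units(ch, "utf-16-be", 2),    # one or two 16-bit units
        code_units(ch, "utf-32-be", 4),    # always one 32-bit unit
    )
```

The last character lies above U+FFFF, so it needs two UTF-16 code units (a surrogate pair), which is one way to see why the "about 65,000 characters" limit is a myth.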
## ASCII

- ASCII provides a standard translation of the most commonly-used characters to one of the integers 0...127, which means each character can be stored in a computer using a single byte.
- https://www.ascii-code.com/
- 7 bits
- Printable characters are numbers 32-126 (127 is the DEL control character)
  - space = 32
  - A = 65
- "American Standard Code for Information Interchange"
- ASCII maps a to 97, b to 98, and so on for lowercase letters, with z mapping to 122. Uppercase letters map to the values 65 through 90. The other integers between 0 and 127 represent digits, punctuation, symbols, and control characters. This scheme is called the ASCII table.

## Character set

- e.g. the Latin character set (alphabet) is used by English plus many European languages
- a collection of characters that might be used by multiple languages.

## MIME

- https://en.wikipedia.org/wiki/MIME
- https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types
- MIME Types, used to specify the content types in the Accept field, consist of a type and a subtype. They are separated by a slash (/).
- For example, a text file containing HTML would be specified with the type text/html. If this text file contained CSS instead, it would be specified as text/css. A generic text file would be denoted as text/plain. This default value, text/plain, is not a catch-all, however. If a client is expecting text/css and receives text/plain, it will not be able to recognize the content.
- Other types and commonly used subtypes:
  - image: image/png, image/jpeg, image/gif
  - audio: audio/wav, audio/mpeg
  - video: video/mp4, video/ogg
  - application: application/json, application/pdf, application/xml, application/octet-stream
- For example, a client accessing a resource with id 23 in an articles resource on a server might send a GET request with the request line `GET /articles/23` and the header `Accept: text/html, application/xhtml+xml`.
- The Accept header field in this case is saying that the client will accept the content in text/html or application/xhtml+xml.
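
The type/subtype structure of an Accept header described above can be sketched in a few lines of Python. This is a simplified illustration, not a full HTTP parser: the function name `parse_accept` is hypothetical, and it simply discards parameters such as `q=0.9` rather than using them to rank preferences.

```python
def parse_accept(header: str) -> list[tuple[str, str]]:
    """Split an Accept header value into (type, subtype) pairs."""
    pairs = []
    for item in header.split(","):
        # Drop any parameters after ';' (for example q=0.9) and trim spaces.
        media_type = item.split(";")[0].strip()
        if "/" not in media_type:
            continue  # skip empty or malformed entries
        type_, subtype = media_type.split("/", 1)
        pairs.append((type_, subtype))
    return pairs


accepted = parse_accept("text/html, application/xml;q=0.9")
print(accepted)
# text/plain is not a catch-all: it only matches if the client listed it.
print(("text", "plain") in accepted)
```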
## Morse Code

## Hexadecimal (base16)

- https://en.wikipedia.org/wiki/Hexadecimal
- aka base16 or hex
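
Hexadecimal is probably the simplest binary-to-text encoding: each byte becomes two printable characters. The sketch below spells the mapping out by hand to show the mechanism; the names `hex_encode`, `hex_decode`, and `HEX_DIGITS` are hypothetical, and in practice Python's built-in `bytes.hex()` and `bytes.fromhex()` do the same job.

```python
HEX_DIGITS = "0123456789abcdef"


def hex_encode(data: bytes) -> str:
    """Encode each byte as two hex digits, high nibble first."""
    return "".join(HEX_DIGITS[b >> 4] + HEX_DIGITS[b & 0x0F] for b in data)


def hex_decode(text: str) -> bytes:
    """Invert hex_encode; int(..., 16) accepts upper- or lowercase digits."""
    if len(text) % 2:
        raise ValueError("hex input must have an even number of digits")
    return bytes(int(text[i:i + 2], 16) for i in range(0, len(text), 2))


payload = bytes([0x00, 0xFF, 0x48, 0x69])  # includes non-printable bytes
encoded = hex_encode(payload)
print(encoded)  # 00ff4869

# The hand-written version agrees with the built-ins and round-trips.
assert encoded == payload.hex()
assert hex_decode(encoded) == payload
```

The cost of this simplicity is size: the encoded text is twice as long as the input, which is why denser encodings such as Base64 and Ascii85 exist.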