CONCEPT
unicode

DESCRIPTION
LPC strings come in two flavors: byte sequences and Unicode strings.
Almost the full range of string operations is available for both
types, but the two must not be mixed. For example, you cannot add a
byte sequence to a Unicode string or vice versa.

Byte sequences can store only bytes (values from 0 to 255), whereas
Unicode strings can store the full Unicode character set (code points
from 0 to 1114111, i.e. 0x10FFFF).
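
The separation of the two flavors can be illustrated with a minimal
sketch. It assumes the b"..." literal syntax for byte sequences and
the 'bytes' type name of LDMud 3.6; the function name is hypothetical.

```lpc
// Hypothetical helper: returns the error raised by mixing both flavors.
mixed try_mixing()
{
    string text = "abc";     // Unicode string
    bytes  raw  = b"abc";    // byte sequence (assumed b"..." literal)

    // Untyped copies, so the mismatch is caught at runtime rather
    // than rejected by the compiler's type check.
    mixed a = text;
    mixed b = raw;

    // catch() yields the error message if the addition is refused,
    // and 0 if it unexpectedly succeeds.
    return catch(a + b);
}
```

With typed variables the same addition would be expected to fail
already at compile time.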

Two functions convert between byte sequences and Unicode strings:
to_text(), which returns a Unicode string, and to_bytes(), which
returns a byte sequence. Both accept either a string or an array.
When the conversion crosses between bytes and Unicode, they also
take the name of the encoding used (or to be used) for the byte
sequence.
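
A short sketch of both conversion directions follows. The function
name is hypothetical, and "UTF-8" is only an example encoding; any
encoding name known to the driver should work the same way.

```lpc
// Hypothetical demonstration of to_bytes() and to_text().
void demo_conversion()
{
    string text = "abc";

    // Unicode string -> byte sequence: the encoding names the
    // byte format to produce.
    bytes encoded = to_bytes(text, "UTF-8");

    // Byte sequence -> Unicode string: the encoding names the
    // byte format to decode.
    string decoded = to_text(encoded, "UTF-8");

    // Array of code points -> Unicode string: no bytes involved,
    // so no encoding is needed.
    string from_chars = to_text(({ 72, 105 }));

    // Array of byte values -> byte sequence: likewise no encoding.
    bytes from_values = to_bytes(({ 72, 105 }));

    // Array of byte values -> Unicode string: this crosses from
    // bytes to Unicode, so the encoding is required.
    string from_bytes = to_text(({ 0xc3, 0xa4 }), "UTF-8");
}
```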

-- File handling --

When a file is compiled or accessed with read_file() or write_file(),
the master is asked via the driver hook H_FILE_ENCODING for the
encoding of the file. This does not apply to read_bytes() and
write_bytes(), nor to calls that are given an explicit encoding.
If the hook yields no encoding, 7-bit ASCII is assumed. Whenever
codes are encountered that are not valid in the encoding in use,
a compile-time or runtime error is raised.
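
As a rough sketch, a master object might set the hook to a fixed
encoding name. The include path, the function names, and the file
path are assumptions that depend on the mudlib; the string form of
the hook value and the fourth argument of read_file() are given as
the author of this example recalls them from the 3.6 efun
documentation.

```lpc
#include <driver_hook.h>    // assumed include path for H_FILE_ENCODING

// Hypothetical setup function, called from within the master object.
void setup_file_encoding()
{
    // Treat every file as UTF-8 unless a call states otherwise.
    set_driver_hook(H_FILE_ENCODING, "UTF-8");
}

// Hypothetical reader that bypasses the hook with an explicit encoding.
string read_latin1(string path)
{
    // start 0 and number 0 are assumed to mean "the whole file".
    return read_file(path, 0, 0, "ISO-8859-1");
}
```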

-- File names --

The filesystem encoding can be set with a call to
configure_driver(DC_FILESYSTEM_ENCODING, <encoding>). The default
encoding is derived from the LC_CTYPE environment setting.
If there is no environment setting (or it is set to the default
"C" locale), then UTF-8 is used.
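
A minimal sketch of setting the filesystem encoding explicitly,
for instance during startup. The function name is hypothetical, and
the call is assumed to need the privileges the master grants for
configure_driver().

```lpc
#include <configuration.h>  // defines DC_FILESYSTEM_ENCODING

// Hypothetical startup helper.
void setup_filesystem_encoding()
{
    // Interpret file names on the host as UTF-8, regardless of
    // the LC_CTYPE setting the driver was started with.
    configure_driver(DC_FILESYSTEM_ENCODING, "UTF-8");
}
```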

-- Interactives --

Each interactive has its own encoding. It can be set with
configure_interactive(<interactive>, IC_ENCODING, <encoding>).
The default is "ISO-8859-1//TRANSLIT", which maps each incoming
byte to one of the first 256 Unicode characters and, on output,
uses transliteration for characters that are not in this character
set. If an input or output character cannot be converted to or from
the configured encoding, it is silently discarded.
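
The following sketch switches a single connection to UTF-8. The
function name is hypothetical; the call may require privileges
when it is not made by the interactive object itself.

```lpc
#include <configuration.h>  // defines IC_ENCODING

// Hypothetical helper, e.g. called after the player chose a charset.
void use_utf8(object player)
{
    // Only interactive objects have a connection encoding.
    if (!interactive(player))
        return;

    configure_interactive(player, IC_ENCODING, "UTF-8");
}
```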

-- ERQ / UDP --

Only byte sequences can be sent to the ERQ or via UDP,
and only byte sequences can be received from them.
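
In practice this means text has to be encoded before sending and
decoded after receiving. A minimal sketch for UDP, with hypothetical
function names and "UTF-8" as an assumed wire encoding:

```lpc
// Hypothetical sender: encodes the text, then hands bytes to send_udp().
int send_text(string host, int port, string text)
{
    bytes payload = to_bytes(text, "UTF-8");
    return send_udp(host, port, payload);
}

// Hypothetical receiver helper: decodes a datagram that arrived as bytes.
string decode_datagram(bytes msg)
{
    return to_text(msg, "UTF-8");
}
```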

HISTORY
Introduced in LDMud 3.6.

SEE ALSO
to_text(E), to_bytes(E), configure_driver(E)