encodings: Character encodings

This module contains functions and classes for dealing with text data in different character encodings. Its main use is converting strings between 16-bit Unicode and 8-bit (narrow) representations. Such conversions are often needed because many standard library functions expect 16-bit Unicode strings, whereas many stream classes (including File and Socket) and some functions support only 8-bit strings.

In addition to Unicode encodings, this module also defines several locale-specific encodings that only support a subset of Unicode.

See also: The classes io::TextFile and io::TextStream can be used to simplify accessing encoded text streams.

Functions

Decode(string, encoding[, strictness])
Decode a string. The strictness argument may be Strict (this is the default if the argument is omitted) or Unstrict. Example:
Decode(s, Utf8)         -- Decode UTF-8 string to 16-bit Unicode
Encode(string, encoding[, strictness])
Encode a string. Identical to encoding([strictness]).encode(string). The strictness argument may be Strict (this is the default if the argument is omitted) or Unstrict. Example:
Encode("\u20ac", Utf8)  -- Encode the Euro sign in UTF-8
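The two functions can be combined into a round trip. The following is a minimal sketch that uses only the functions documented above; the variable names are illustrative, and the length comment follows from the UTF-8 format rather than from anything specific to this module.

```alore
import encodings

def Main()
  var text = "\u20ac"                -- Euro sign as a 16-bit Unicode string
  var encoded = Encode(text, Utf8)   -- Narrow string holding the UTF-8 bytes
  Print(encoded.length())            -- UTF-8 represents U+20AC with three bytes
  var decoded = Decode(encoded, Utf8)
  Print(decoded == text)             -- The round trip should preserve the text
end
```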

Constants

Strict
Option for character encoding classes indicating strict encoding and decoding. Invalid input causes an EncodeError or a DecodeError exception to be raised. This is the default behavior.
Unstrict
Option for character encoding classes indicating unstrict encoding and decoding. Invalid characters are replaced with question marks ("?") when encoding, or with replacement characters ("\ufffd") when decoding.
Bom
The byte order mark character ("\ufeff").
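The difference between the two strictness options can be sketched as follows. This example assumes that a character the target encoding cannot represent (here the Euro sign in ASCII) counts as invalid input; the message text is illustrative only.

```alore
import encodings

def Main()
  var text = "\u20ac"   -- The Euro sign has no ASCII representation

  -- Unstrict: the unsupported character is replaced with "?".
  Print(Encode(text, Ascii, Unstrict))

  -- Strict (the default): the same input raises EncodeError.
  try
    Encode(text, Ascii)
  except e is EncodeError
    Print("cannot encode as ASCII")
  end
end
```

Unstrict mode loses information silently, so it is probably best reserved for cases where an approximate result is acceptable, such as diagnostic output.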

Character encodings

The following character encoding classes are defined. They all conform to the Character encoding interface.

Ascii
The 7-bit ASCII encoding.
Utf8
The UTF-8 Unicode encoding.
Utf16
Utf16Le
Utf16Be
The UTF-16 Unicode encoding. The variants use native, little-endian and big-endian byte order, respectively.
Iso8859_1 (Latin1)
Iso8859_2 (Latin2)
Iso8859_3 (Latin3)
Iso8859_4 (Latin4)
Iso8859_5
Iso8859_6
Iso8859_7
Iso8859_8
Iso8859_9 (Latin5)
Iso8859_10 (Latin6)
Iso8859_11
Iso8859_13 (Latin7)
Iso8859_14 (Latin8)
Iso8859_15 (Latin9)
Iso8859_16 (Latin10)
The ISO 8859 8-bit encodings, with alias constants in parentheses.
Windows1250
Windows1251
Windows1252
These Windows encodings are also known as Windows code pages.
Cp437
Cp850
Legacy encodings that match MS-DOS code pages. Note that the encoded values 0 to 31 and 127 are decoded as the corresponding ASCII control characters rather than as the legacy graphical characters.
Koi8R
The KOI8-R encoding for Russian.
Koi8U
The KOI8-U encoding for Ukrainian.
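Any two of the encodings above can be combined to convert narrow strings from one encoding to another, using 16-bit Unicode as the intermediate form. The sketch below assumes a hypothetical helper named Latin1ToUtf8, which is not part of this module.

```alore
import encodings

-- Hypothetical helper: convert Latin 1 bytes to UTF-8 bytes.
def Latin1ToUtf8(s)
  return Encode(Decode(s, Latin1), Utf8)
end

def Main()
  var latin = "caf\u00e9"         -- "café" as a narrow Latin 1 string
  var utf = Latin1ToUtf8(latin)
  Print(utf.length())             -- One byte longer: "é" takes two bytes in UTF-8
end
```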

Character encoding interface

The character encoding classes can be instantiated without arguments or with an optional strictness argument (Strict or Unstrict). Each of their instances keeps track of the state of a single encoded / decoded text stream.

encode(s)
Encode the argument string and return the encoded string. The entire string is always encoded.
decode(s)
Decode as many characters as possible from the argument string and return them. If any partial characters remain at the end of the string, they are remembered and prepended to the argument of the next decode call. Use unprocessed() to inspect them.
unprocessed()
Return the current buffer of partial characters or an empty string if there are none.
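The stateful behavior of decode matters when encoded data arrives in chunks that may split a multi-byte character. The following is a minimal sketch; the chunks array is hypothetical data standing in for successive reads from a stream.

```alore
import encodings

def Main()
  var decoder = Utf8()   -- One instance tracks one encoded stream

  -- The three UTF-8 bytes of the Euro sign, split across two chunks.
  var chunks = ["\u00e2\u0082", "\u00ac"]

  var result = ""
  for chunk in chunks
    result = result + decoder.decode(chunk)
    -- Number of bytes still waiting for the rest of a character
    Print(decoder.unprocessed().length())
  end
  Print(result == "\u20ac")
end
```

Because each instance remembers partial characters, a separate instance should be used for every stream that is decoded.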

Exceptions

class EncodeError
Raised when encoding is not successful due to invalid input. Inherits from std::ValueError.
class DecodeError
Raised when decoding is not successful due to invalid input. Inherits from std::ValueError.

About supported character encodings

This module supports only a small and somewhat arbitrary set of locale-specific encodings, with a bias towards encodings for Western European languages. New encodings are likely to be added to this module (or to separate, additional modules) in future Alore releases.