encodings: Character encodings

This module contains constants and types for dealing with text data in different character encodings. Its main use is converting strings between 16-bit Unicode and 8-bit (narrow) representations. Such conversions are often needed, since many standard library functions expect strings to be encoded in 16-bit Unicode, whereas many stream classes (including File and Socket) and some functions only support 8-bit strings.

In addition to Unicode encodings such as UTF-8 and UTF-16, this module also defines several locale-specific 8-bit encodings that only support a subset of Unicode.

Examples:

"\u00c3\u00a4".decode(Utf8)             -- Decode "ä" in UTF-8 to 16-bit Unicode

"\u20ac".encode(Utf8)                   -- Encode the Euro sign using UTF-8

"sää".encode(Ascii, Unstrict)           -- "s??" (replace characters that cannot be represented
                                        -- in ASCII with question marks)

TextFile("file.txt", Latin1)            -- Open encoded text file for reading

See also: Use methods Str encode and Str decode for encoding and decoding strings.

See also: The classes io::TextFile and io::TextStream provide a simple way of accessing encoded text streams.

Constants

Strict as Constant
Mode option for encoding objects that indicates strict encoding or decoding. Invalid input causes an EncodeError or a DecodeError exception to be raised. This is the default behavior.
Unstrict as Constant
Mode option for encoding objects that indicates unstrict encoding and decoding. Invalid characters are replaced with question marks ("?", when encoding) or with replacement characters ("\ufffd", when decoding).
Bom as Str
The byte order mark character ("\ufeff"). Some platforms (most notably, Windows) often insert this character at the beginning of text files when using a Unicode encoding such as UTF-8.
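
The difference between the two modes can be illustrated with a minimal sketch. It uses only the Str encode method and the constants described in this module; the variable name s and the printed message are illustrative, and the program has not been verified against a particular Alore release.

```alore
import encodings

def Main()
  var s = "s\u00e4\u00e4"             -- "sää" as a 16-bit Unicode string

  -- Unstrict: characters that ASCII cannot represent are replaced
  -- with question marks instead of raising an exception.
  Print(s.encode(Ascii, Unstrict))

  -- Strict (the default when mode is omitted): the same input is
  -- expected to raise EncodeError.
  try
    s.encode(Ascii)
  except e is EncodeError
    Print("Cannot encode as ASCII")
  end
end
```

Strict mode suits data that must round-trip without loss, while Unstrict is a reasonable choice for display or logging, where a placeholder character is acceptable.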

Character encodings

This module defines the following character encoding objects. They all implement the interface Encoding.

Ascii as Encoding
The 7-bit ASCII encoding.
Utf8 as Encoding
The UTF-8 Unicode encoding.
Utf16 as Encoding
Utf16Le as Encoding
Utf16Be as Encoding
The UTF-16 Unicode encoding. The variants use the native, little-endian and big-endian byte orders, respectively.
Iso8859_1 as Encoding (Latin1)
Iso8859_2 as Encoding (Latin2)
Iso8859_3 as Encoding (Latin3)
Iso8859_4 as Encoding (Latin4)
Iso8859_5 as Encoding
Iso8859_6 as Encoding
Iso8859_7 as Encoding
Iso8859_8 as Encoding
Iso8859_9 as Encoding (Latin5)
Iso8859_10 as Encoding (Latin6)
Iso8859_11 as Encoding
Iso8859_13 as Encoding (Latin7)
Iso8859_14 as Encoding (Latin8)
Iso8859_15 as Encoding (Latin9)
Iso8859_16 as Encoding (Latin10)
The ISO 8859 8-bit encodings, with alias constants in parentheses.
Windows1250 as Encoding
Windows1251 as Encoding
Windows1252 as Encoding
These Windows encodings are also known as Windows code pages.
Cp437 as Encoding
Cp850 as Encoding
Legacy encodings that match MS-DOS code pages. Note that encoded characters in the range 0 to 31 and character 127 are decoded as the corresponding ASCII characters instead of the legacy graphical characters.
Koi8R as Encoding
The KOI8-R encoding for Russian.
Koi8U as Encoding
The KOI8-U encoding for Ukrainian.

Interface Encoding

Character encoding objects support creating encoder and decoder objects using the methods encoder and decoder, respectively. These methods can be called without arguments or with an optional mode argument (Strict or Unstrict). If mode is not specified, it defaults to Strict. Each encoder / decoder instance keeps track of the state of a single encoded / decoded text sequence.

Programs typically do not use Encoder and Decoder objects directly. Instead, they use the Str encode and Str decode methods and text streams.

encoder([mode as Constant]) as Encoder
Construct an encoder object for the encoding.
decoder([mode as Constant]) as Decoder
Construct a decoder object for the encoding.
name as Str
The name of the encoding. Example:
Utf8.name   -- "Utf8"

Interface Encoder

encode(str as Str) as Str
Encode the argument string and return the encoded string. The entire string is always encoded.

Interface Decoder

decode(str as Str) as Str
Decode as many characters as possible from the argument string and return them. If a partial character remains at the end of the string, the decoder remembers it and prepends it to the argument of the next decode call. Use unprocessed to inspect the remembered data.
unprocessed() as Str
Return the current buffer of partial characters, or an empty string if there are none.
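
As a hedged sketch of how a decoder carries state between calls, the following program feeds the two-byte UTF-8 sequence for "ä" to a decoder in two separate chunks, as might happen when reading from a socket. The chunk boundaries and variable names are made up for illustration; the comments describe the behaviour expected from the descriptions above rather than verified output.

```alore
import encodings

def Main()
  var dec = Utf8.decoder()

  -- First chunk ends in the middle of a two-byte sequence.
  -- Only "s" can be decoded; the trailing byte is kept by the decoder.
  var first = dec.decode("s\u00c3")

  -- The incomplete byte should now be visible through unprocessed().
  var pending = dec.unprocessed()

  -- Second chunk completes the sequence, so "ä" can be returned.
  var second = dec.decode("\u00a4")

  Print(first + second)
  Print(pending.length())
end
```

Because each decoder tracks a single text sequence, a separate decoder instance should be created for every independent stream.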

Exceptions

class EncodeError
Raised when encoding is not successful due to invalid input. Inherits from std::ValueError.
class DecodeError
Raised when decoding is not successful due to invalid input. Inherits from std::ValueError.

Functions

Decode(string as Str, encoding as Encoding[, mode as Constant]) as Str
Deprecated (this feature will be removed in a future Alore version). Decode a string. The mode argument may be Strict (this is the default if the argument is omitted) or Unstrict. Example:
Decode(s, Utf8)         -- Decode UTF-8 string to 16-bit Unicode

Note: Use the Str decode method instead.

Encode(string as Str, encoding as Encoding[, mode as Constant]) as Str
Deprecated (this feature will be removed in a future Alore version). Encode a string. Identical to encoding.encoder([mode]).encode(string). The mode argument may be Strict (this is the default if the argument is omitted) or Unstrict. Example:
Encode("\u20ac", Utf8)  -- Encode the Euro sign in UTF-8

Note: Use the Str encode method instead.
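
When migrating code away from the deprecated functions, the three forms below are expected to be interchangeable, based on the descriptions in this section. This is a minimal sketch with illustrative variable names, not verified against a particular Alore release.

```alore
import encodings

def Main()
  var euro = "\u20ac"

  -- Deprecated function form.
  var a = Encode(euro, Utf8)

  -- Preferred replacement: the Str encode method.
  var b = euro.encode(Utf8)

  -- Equivalent explicit encoder object form.
  var c = Utf8.encoder().encode(euro)

  -- All three results are expected to be equal.
  Print(a == b and b == c)

  -- Decoding with the Str decode method should restore the original string.
  Print(b.decode(Utf8) == euro)
end
```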

About supported character encodings

This module supports only a small and somewhat arbitrary set of locale-specific encodings, with a bias towards encodings for European languages. New encodings are likely to be added to this module (or to separate, additional modules) in future Alore releases.