@head @module encodings @title encodings: Character encodings
This module contains constants and types for dealing with text data in different character encodings. Its main use is converting strings between 16-bit Unicode and 8-bit (narrow) representations. Such conversions are often needed, since many standard library functions expect strings to be encoded in 16-bit Unicode, whereas many stream classes (including @ref{File} and @ref{Socket}) and some functions only support 8-bit strings.
In addition to Unicode encodings, this module also defines several locale-specific encodings that only support a subset of Unicode.
Examples:
@example
  "\u00c3\u00a4".decode(Utf8)       -- Decode "ä" in UTF-8 to 16-bit Unicode
  "\u20ac".encode(Utf8)             -- Encode the Euro sign using UTF-8
  "sää".encode(Ascii, Unstrict)     -- "s??" (replace characters that cannot be represented
                                    --  in ASCII with question marks)
  TextFile("file.txt", Latin1)      -- Open encoded text file for reading
@end
@see Use methods @ref{Str.encode} and @ref{Str.decode} for encoding and decoding strings.
@end
@see The classes @ref{io::TextFile} and @ref{io::TextStream} provide a simple way of accessing encoded text streams.
@end
This module defines the following character encoding objects. They all implement the interface @href{Encoding}.
@var Ascii as Encoding
@desc The 7-bit ASCII encoding.
@end
@var Utf8 as Encoding
@desc The UTF-8 Unicode encoding.
@end
@var Utf16 as Encoding
@var Utf16Le as Encoding
@var Utf16Be as Encoding
@desc The UTF-16 Unicode encoding. The different variants stand for native, little endian and big endian byte orders.
@end
@var Iso8859_1 as Encoding (Latin1)
@var Iso8859_2 as Encoding (Latin2)
@var Iso8859_3 as Encoding (Latin3)
@var Iso8859_4 as Encoding (Latin4)
@var Iso8859_5 as Encoding
@var Iso8859_6 as Encoding
@var Iso8859_7 as Encoding
@var Iso8859_8 as Encoding
@var Iso8859_9 as Encoding (Latin5)
@var Iso8859_10 as Encoding (Latin6)
@var Iso8859_11 as Encoding
@var Iso8859_13 as Encoding (Latin7)
@var Iso8859_14 as Encoding (Latin8)
@var Iso8859_15 as Encoding (Latin9)
@var Iso8859_16 as Encoding (Latin10)
@desc The ISO 8859 8-bit encodings, with alias constants in parentheses.
@end
@var Windows1250 as Encoding
@var Windows1251 as Encoding
@var Windows1252 as Encoding
@desc These Windows encodings are also known as Windows code pages.
@end
@var Cp437 as Encoding
@var Cp850 as Encoding
@desc Legacy encodings that match MS-DOS code pages. Note that encoded characters in the range 0 to 31 and character 127 are decoded as the corresponding ASCII characters instead of the legacy graphical characters.
@end
@var Koi8R as Encoding
@desc The KOI8-R encoding for Russian.
@end
@var Koi8U as Encoding
@desc The KOI8-U encoding for Ukrainian.
@end
@class-hidden Encoding
@h2 Interface Encoding
Character encoding objects support creating encoder and decoder objects using the methods encoder and decoder, respectively. These methods can be called without arguments or with an optional mode argument (@ref{Strict} or @ref{Unstrict}). If mode is not specified, it defaults to Strict. Each encoder / decoder instance keeps track of the state of a single encoded / decoded text sequence.
Programs typically do not use Encoder and Decoder objects directly. Instead, they use the @ref{Str.encode} and @ref{Str.decode} methods and text streams.
@fun encoder([mode as Constant]) as Encoder
@desc Construct an encoder object for the encoding.
@end
@fun decoder([mode as Constant]) as Decoder
@desc Construct a decoder object for the encoding.
@end
@var name as Str
@desc The name of the encoding. Example:
  @example
    Utf8.name -- "Utf8"
  @end
@end
@class-hidden Encoder
@h2 Interface Encoder
@fun encode(str as Str) as Str
@desc Encode the argument string and return the encoded string. The entire string is always encoded.
@end
@class-hidden Decoder
@h2 Interface Decoder
@fun decode(str as Str) as Str
@desc Decode as many characters as possible from the argument string and return them. If any partial characters remain at the end of the string, remember them and prepend them to the argument of the next call to decode. Use unprocessed to peek at them.
@end
@fun unprocessed() as Str
@desc Return the current buffer of partial characters, or an empty string if there are none.
@end
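The incremental behavior described above can be sketched as follows, assuming a UTF-8 character that arrives split across two chunks. The variable names d and e are illustrative only, and the results shown in the comments follow from the method descriptions in this section rather than from a captured run.
@example
  var d = Utf8.decoder()
  d.decode("\u00c3")      -- "" (only the first byte of "ä"; it is buffered)
  d.unprocessed()         -- "\u00c3"
  d.decode("\u00a4")      -- "ä" (the buffered byte plus the new byte)
  d.unprocessed()         -- ""

  var e = Utf8.encoder()
  e.encode("\u20ac")      -- "\u00e2\u0082\u00ac" (the Euro sign as three UTF-8 bytes)
@end
Since the buffered partial character is held in the decoder instance, the same decoder should be used for every chunk of a single encoded sequence.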
This module supports only a small and somewhat arbitrary set of locale-specific encodings, with a bias towards encodings for European languages. New encodings are likely to be added to this module (or to separate, additional modules) in future Alore releases.