Lexical structure

Tokens

Each source file is divided into tokens, starting from the beginning of the file.

Identifiers (id) are case sensitive. Some of them are reserved; see section Reserved words below.

id :: alpha alphanum*
alpha :: "a".."z" | "A".."Z" | "_"
alphanum :: alpha | digit
digit :: "0".."9"
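
The identifier grammar can be sketched as a regular expression. The following Python fragment is only an illustration, not part of any Alore implementation; the names ID_RE and is_identifier_shape are hypothetical, and it assumes that a single alpha character on its own is a valid identifier. Reserved words are not excluded here.

```python
import re

# alpha followed by zero or more alphanum characters (ASCII only).
ID_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier_shape(text):
    """Return True if text has the lexical shape of an id token.

    This checks the shape only; reserved words would still have to be
    rejected by a separate lookup.
    """
    return ID_RE.fullmatch(text) is not None
```

Because identifiers are case sensitive, "Foo" and "foo" both match but name different things.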

Numeric literals (int and float) are entered in base 10. A floating point literal has a fractional part, separated with a dot, an integer exponent, separated with the letter e or E, or both. If the exponent is present, the numeric value before the exponent is multiplied by 10**e, where e is the numeric value of the exponent.

int :: digit+
float :: digit+ exponent | digit* "." digit+ [ exponent ]
exponent :: ("e" | "E") ["+" | "-"] digit+
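
As a rough illustration of the productions above, the following Python sketch classifies a piece of text as an int or float token. The names INT_RE, FLOAT_RE and classify_number are hypothetical, and the regular expressions are a direct transcription of the grammar rather than code from an Alore implementation.

```python
import re

EXPONENT = r"[eE][+-]?[0-9]+"

# int :: digit+
INT_RE = re.compile(r"[0-9]+")
# float :: digit+ exponent | digit* "." digit+ [ exponent ]
FLOAT_RE = re.compile(rf"[0-9]+{EXPONENT}|[0-9]*\.[0-9]+(?:{EXPONENT})?")


def classify_number(text):
    """Return "int", "float", or None if text is not a numeric literal."""
    if FLOAT_RE.fullmatch(text):
        return "float"
    if INT_RE.fullmatch(text):
        return "int"
    return None
```

Under this reading, "12" is an int, "1e5", ".5" and "1.5E-3" are floats, and "1." is not a numeric literal because at least one digit must follow the dot.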

String literals (str) are entered within single or double quotes. The surrounding quotes are not part of the string value. Literal double quotes in double-quoted strings and literal single quotes in single-quoted strings must be duplicated.

str :: <"> (<any character except ", CR or LF> | <"> <">)* <">
     | <'> (<any character except ', CR or LF> | <'> <'>)* <'>

A sequence of the form \uHHHH within a string literal, where each H is a hexadecimal digit (0..9, a..f or A..F), is mapped to the character code represented by the hexadecimal number. Backslash characters are not special within string literals unless immediately followed by "u" and exactly four hexadecimal digits.
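
The quote-doubling and \uHHHH rules can be sketched in Python as follows. The function decode_string_literal is a hypothetical helper written for this illustration; it assumes its argument is already a well-formed str token and does not validate the body.

```python
import re

# A backslash, the letter u, and exactly four hexadecimal digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_string_literal(literal):
    """Return the string value of a str token given with its quotes."""
    if len(literal) < 2 or literal[0] not in "\"'" or literal[-1] != literal[0]:
        raise ValueError("not a quoted string literal")
    quote = literal[0]
    body = literal[1:-1]
    # A doubled surrounding quote stands for one literal quote character.
    # This runs first so that quotes produced by \u0022 or \u0027 are
    # never collapsed by mistake.
    body = body.replace(quote * 2, quote)
    # \uHHHH maps to the character with that code; any other backslash
    # is kept as an ordinary character.
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)
```

For instance, the literal "say ""hi""" decodes to say "hi", and \n inside a literal stays as a backslash followed by the letter n.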

Various non-alphanumeric operator and punctuator tokens are defined:

opsym :: "+" | "-" | "*" | "/" | "**" | ":" | "==" | "!=" | "<" | ">" | ">=" | "<="
punct :: "(" | ")" | "[" | "]" | "," | "=" | "+=" | "-=" | "*=" | "/=" | "**=" | "::"

Newlines and semicolons can be used as statement separators (br). They are interchangeable. Repeated statement separators behave identically to a single statement separator.

br :: (newline | ";")+
newline :: <CR> <LF> | <LF> | <CR>
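
The br production can be read as "one or more newlines or semicolons in a row form a single separator". A minimal Python sketch, with the hypothetical names BR_RE and count_separators, is shown below. It deliberately ignores string literals, comments and whitespace between separators, so it is not a substitute for a real tokenizer.

```python
import re

# br :: (newline | ";")+  with  newline :: <CR> <LF> | <LF> | <CR>
BR_RE = re.compile(r"(?:\r\n|\n|\r|;)+")


def count_separators(source):
    """Count br tokens in source, treating each run as one separator."""
    return len(BR_RE.findall(source))
```

With this definition, the text `x = 1;;` followed by a CR LF pair, an LF and `y = 2` contains a single br token.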

Whitespace and comments are ignored before and after tokens. Whitespace characters are optional, except between a token ending with an alphanumeric character and another token starting with an alphanumeric character, in which case they are required. Finally, there must be no whitespace characters before the initial-comment and utf8-bom tokens.

whitespace :: " " | <TAB>
comment :: "--" <any character except CR or LF>*

An initial source line starting with #! is interpreted as a comment:

initial-comment :: "#!" <any character except CR or LF>*

The special utf8-bom token may be present at the start of UTF-8 encoded files:

utf8-bom :: <EF> <BB> <BF>
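
A tokenizer might skip these two optional file-prefix tokens before doing anything else. The Python sketch below is an assumption-laden illustration: skip_file_prefix is a hypothetical name, and it assumes that when both tokens are present the utf8-bom comes first and the #! line follows it immediately.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def skip_file_prefix(data):
    """Return the offset just past an optional BOM and initial #! comment.

    The line terminator after the #! line is not consumed, so it is
    still seen as a statement separator by later stages.
    """
    pos = 0
    if data.startswith(UTF8_BOM):
        pos = len(UTF8_BOM)
    if data.startswith(b"#!", pos):
        # Consume everything up to, but not including, CR or LF.
        while pos < len(data) and data[pos] not in (0x0A, 0x0D):
            pos += 1
    return pos
```

Since no whitespace may precede these tokens, a file that starts with a space followed by #! is not treated as having an initial comment, and the function returns 0 for it.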

Joining lines

Newlines after the following tokens are interpreted as whitespace, not as statement separators:

+ - * / ** div mod and or : to is == != < <= >= ( [ = += -= *= /= **= ,

This can be used to divide long lines into multiple shorter lines.

A br token after a > token is ignored in expressions, but not in other contexts. This rule allows a type annotation to end with a '>' token, even when followed by a br token.
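
The joining rules amount to a lookup on the last token before the newline, plus the special case for >. The following Python sketch illustrates this; JOINING_TOKENS and newline_joins are hypothetical names, and the in_expression flag stands in for whatever context the parser actually tracks.

```python
# Tokens after which a newline is treated as whitespace.
JOINING_TOKENS = frozenset(
    "+ - * / ** div mod and or : to is == != < <= >= "
    "( [ = += -= *= /= **= ,".split()
)


def newline_joins(previous_token, in_expression=False):
    """Return True if a newline after previous_token does not end the statement."""
    if previous_token in JOINING_TOKENS:
        return True
    # A separator after ">" is ignored only inside an expression, so a
    # type annotation ending in ">" can still be followed by a separator.
    return previous_token == ">" and in_expression
```

A line ending in + or , therefore continues on the next line, while a line ending in ) or an identifier does not.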

Reserved words

The following words are reserved and cannot be used as identifiers (i.e. as names of global or local definitions, as member names or as module name components):

and
as
bind
break
case
class
const
def
div
dynamic
elif
else
encoding
end
except
finally
for
if
implements
import
in
interface
is
mod
module
nil
not
or
private
raise
repeat
return
self
super
switch
to
try
until
var
while

Restricted names

Module name components and names of global definitions starting with two underscores (__) are reserved for internal use by the implementation. The implementation may freely define such names for any purpose, but user programs should not depend on their presence or absence if they are to remain portable across different Alore implementations.

Additionally, it is recommended that the following names not be used as the first component of a module name, since they are reserved for use in future releases of Alore:

alore
argparse
compiler
crypt
csv
email
fileutil
ftp
getpass
httpserver
json
locale
logging
process
queue
readline
serialize
smtp
sqlite
ssl
stack
subprocess
tempfile
timezone
traceback
udp
unicode
url
xml
xmlrpc
xmltree

It is likely that some of these names will never be used in any future Alore release. Future Alore releases may remove some names from this list; these changes are retroactively applied to all earlier Alore versions as well.

Source file encodings

Alore source files may be encoded in ASCII, UTF-8 or ISO-8859-1 (Latin 1). See section Encoding declaration for information on specifying the encoding of a source file.

All 7-bit character codes except LF and CR (10 and 13, respectively) can be used in comments and string literals, including null characters, independent of the source file encoding. Quotes, however, may have to be doubled within string literals.

In an ISO-8859-1 encoded source file, all character codes in the range from 128 to 255, inclusive, can be used in comments and string literals. Similarly, in a UTF-8 encoded source file, all valid UTF-8 sequences for code points between 128 and 65535, inclusive, can be used in comments and string literals. Any character code between 0 and 65535 can be entered in a string literal using the \uHHHH form, independent of the source file encoding.
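
The code point limit for UTF-8 source files can be checked with a short Python sketch. The function utf8_source_in_range is a hypothetical helper; it examines the whole file rather than only comments and string literals, so it is stricter about where non-ASCII characters appear than the rule above requires it to be lenient, and it says nothing about whether they are in a permitted position.

```python
def utf8_source_in_range(data):
    """Return True if data is valid UTF-8 with no code point above 65535."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not a valid UTF-8 sequence at all.
        return False
    return all(ord(ch) <= 0xFFFF for ch in text)
```

A character outside this range cannot be written directly in the file, and since \uHHHH takes exactly four hexadecimal digits, it cannot be written with a single escape either.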