Monday, October 6, 2014

Lexical Structure

An ELENA module consists of one or more source files. A source file is an ordered sequence of Unicode characters (usually encoded with the UTF-8 encoding).

There are several kinds of input elements: white space, comments and tokens. The tokens are identifiers, keywords, literals, operators and punctuators.

The raw input stream of Unicode characters is reduced by the ELENA DFA (deterministic finite automaton) into a sequence of <input elements>.

	<input> :
			{ <input element> }*
		
	<input element> :
			<white space>
			<comment>
			<token>
			
	<token> :
			<identifier>
			<full identifier>
			<local identifier>
			<keyword>
			<literal>
			<operator-or-punctuator>

Of these basic elements, only tokens are significant in the syntactic grammar of an ELENA program.

White space

White space in ELENA consists of the space, the horizontal tab and the line terminators. White space is used to separate tokens.

	<white space> :
		SP (space)
		HT (horizontal tab)
		CR (return)
		LF (new line)

Comments

ELENA uses C++-style comments:

   /* block comment */

   // end-of-line comment

	<comment> :
		<block comment>
		<end-of-line comment>
		
	<block comment> :
		'/' '*' <block comment tail>
		
	<end-of-line comment> :	
		'/' '/' { <not line terminator> }*
		
	<block comment tail> :
		'*' <block comment star tail>
		<not star> <block comment tail>
		
	<block comment star tail> :
		'/'
		'*' <block comment star tail>
		<neither star nor slash> <block comment tail>
		
	<not star> :
		any Unicode character except '*'
		
	<neither star nor slash> :
		any Unicode character except '*' and '/'

	<not line terminator> :
		any Unicode character except CR and LF

ELENA comments do not nest. Comments do not occur inside string literals.
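
The comment rules above can be sketched in a few lines of Python. This is only an illustration, not the ELENA compiler's scanner: the function name strip_comments is made up, and dropping a comment without leaving a space behind is an assumption.

```python
def strip_comments(source: str) -> str:
    """Remove /* ... */ and // ... comments, leaving string literals intact."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch == '"':
            # Copy a string literal verbatim; "" inside it is an escaped quote.
            j = i + 1
            while j < n:
                if source[j] == '"':
                    if j + 1 < n and source[j + 1] == '"':
                        j += 2
                        continue
                    break
                j += 1
            out.append(source[i:j + 1])
            i = j + 1
        elif source.startswith("/*", i):
            # The first "*/" closes the comment, so comments do not nest.
            end = source.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated block comment")
            i = end + 2
        elif source.startswith("//", i):
            # Skip up to, but not including, the line terminator.
            while i < n and source[i] not in "\r\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Because the first "*/" ends a block comment, the input "/* a /* b */ c" leaves " c" behind, and a "/*" inside a string literal is copied through untouched.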

Identifiers

An identifier is a sequence of letters, underscores and digits that starts with a letter or an underscore. The current compiler design limits an identifier to 255 characters.

	<identifier> :
		<letter> { <letter or digit> }*
		
	<letter> :
		any Unicode character except white space, punctuators and operators
		'_'
		
	<letter or digit> :
		<letter>
		Digit 0-9

ELENA identifiers are case sensitive.
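
As a rough illustration, the identifier rule can be approximated with a regular expression. The sketch below narrows <letter> to Unicode word characters, which is stricter than the grammar above, and the names IDENTIFIER, MAX_LENGTH and is_identifier are hypothetical.

```python
import re

# Assumption: a "letter" is a Unicode word character that is not a digit
# (this includes '_'). The grammar above is broader than this.
IDENTIFIER = re.compile(r"[^\W\d]\w*")
MAX_LENGTH = 255  # limit stated for the current compiler design


def is_identifier(text: str) -> bool:
    """Check the shape and the length limit of a candidate identifier."""
    return len(text) <= MAX_LENGTH and IDENTIFIER.fullmatch(text) is not None
```

With this sketch, "_count1" is accepted while "1abc" and "a'b" are rejected. Since the match is case sensitive, "Name" and "name" remain two distinct identifiers.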

Full identifiers

A full identifier is a sequence of identifiers separated by "'" characters. It consists of a namespace and a proper name. The current compiler design limits a full identifier to 255 characters.

	<full identifier> :
		[ <name space> ]? "'" <identifier>		
		
	<name space> :
		<identifier> { "'" <identifier> }*
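
A short sketch of how a full identifier could be split into its namespace and proper name: the last "'" separates the two parts. The function name and the sample names used with it are made up for illustration.

```python
def split_full_identifier(text: str) -> tuple:
    """Split a full identifier at its last "'" into (namespace, name)."""
    namespace, quote, name = text.rpartition("'")
    if not quote:
        raise ValueError("a full identifier needs at least one \"'\" separator")
    return namespace, name
```

For a hypothetical input "system'io'File" this returns ("system'io", "File"). An input such as "'Name" yields an empty namespace, which matches the optional <name space> in the grammar.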

Local identifiers

A local identifier is a sequence of letters, underscores and digits that starts with the '$' character. The current compiler design limits a local identifier to 255 characters.

	<local identifier> :
		'$' <identifier>

Keywords

A keyword is a sequence of letters that starts with the '#' character. Only the following keywords are currently used, though others are reserved for future use: #class, #symbol, #static, #field, #method, #constructor, #var, #loop, #define, #type, #throw, #break. Keywords can be placed only at the beginning of a statement.

	<keyword> :
		'#' { <letter> }+
	
	<letter> :
		Unicode characters
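
The distinction between used and reserved keywords can be sketched as follows. The keyword list is taken from the text above; the function name is hypothetical, and str.isalpha() is only an approximation of <letter>.

```python
USED_KEYWORDS = {
    "#class", "#symbol", "#static", "#field", "#method", "#constructor",
    "#var", "#loop", "#define", "#type", "#throw", "#break",
}


def classify_keyword(token: str) -> str:
    """Return 'used', 'reserved' or 'not a keyword' for a token."""
    # Assumption: a keyword is '#' followed by one or more alphabetic characters.
    if len(token) < 2 or token[0] != "#" or not token[1:].isalpha():
        return "not a keyword"
    return "used" if token in USED_KEYWORDS else "reserved"
```

Under these assumptions "#class" is classified as used, a made-up "#while" as reserved, and "class" without the '#' prefix as not a keyword.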

Literals

A literal is the source code representation of a value.

	<literal> :
		<integer>
		<float>
		<string>

Integer literals

An integer literal may be expressed in decimal (base 10) or hexadecimal (base 16). A hexadecimal literal starts with a decimal digit and ends with the suffix 'h'.

	<integer> :
		<decimal integer>
		<hexadecimal integer>
		
	<decimal integer> :
		[ <sign> ] { <digit> }+

	<sign> :
		"+"
		"-"
		
	<digit> :
		digit 0-9
		
	<hexadecimal integer> :
		<digit> { <digit or hexdigit> }* 'h'
		
	<digit or hexdigit> :
		<digit>		
		one of the following characters: a b c d e f A B C D E F
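
The two integer forms can be recognised with a pair of regular expressions. This is a minimal sketch written from the grammar above; the names DECIMAL, HEXADECIMAL and parse_integer are hypothetical.

```python
import re

DECIMAL = re.compile(r"[+-]?[0-9]+")
# A hexadecimal literal must start with a decimal digit and end with 'h'.
HEXADECIMAL = re.compile(r"[0-9][0-9a-fA-F]*h")


def parse_integer(text: str) -> int:
    """Convert a decimal or hexadecimal integer literal to an int."""
    if HEXADECIMAL.fullmatch(text):
        return int(text[:-1], 16)  # drop the 'h' suffix
    if DECIMAL.fullmatch(text):
        return int(text, 10)
    raise ValueError(f"not an integer literal: {text!r}")
```

Here "0FFh" and "255" both produce 255, while "FFh" is rejected because it does not start with a digit.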

Floating-point literals

A floating-point literal has the following parts: a whole-number part, a decimal point, a fractional part and an exponent. The exponent, if present, is indicated by the letter 'e' or 'E' followed by an optionally signed integer.

At least one digit, in either the whole-number or the fractional part, and either a decimal point or an exponent are required, as is the trailing 'r' suffix shown in the grammar below. All other parts are optional.

	<float> :
		{ <digit> }* '.' { <digit> }* [ <exponent> ] 'r'
		{ <digit> }+ <exponent> 'r'
		
	<digit> :
		digit 0-9

	<exponent> :
		<exponent sign> <integer>
		
	<exponent sign> :
		either 'E' or 'e'
		
	<integer> :
		[ <sign> ] { <digit> }+
		
	<sign> :
		"+"
		"-"

Real literals are represented in the 64-bit double-precision binary floating-point format.
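
The floating-point rules can be sketched with one regular expression. The pattern follows the grammar plus the "at least one digit" requirement from the prose; the names FLOAT and parse_float are hypothetical, and Python's float is used because it is also a 64-bit double.

```python
import re

FLOAT = re.compile(
    r"""
    (?:
        (?:[0-9]+\.[0-9]*|\.[0-9]+)   # decimal point with at least one digit
        (?:[eE][+-]?[0-9]+)?          # optional exponent
      |
        [0-9]+[eE][+-]?[0-9]+         # no decimal point: exponent required
    )
    r                                 # mandatory suffix
    """,
    re.VERBOSE,
)


def parse_float(text: str) -> float:
    """Convert a floating-point literal with its 'r' suffix to a float."""
    if not FLOAT.fullmatch(text):
        raise ValueError(f"not a floating-point literal: {text!r}")
    return float(text[:-1])  # drop the 'r' suffix
```

With this sketch "1.5r", ".5r" and "1e3r" are accepted, while "1.5" (no suffix), "12r" (neither a decimal point nor an exponent) and ".r" (no digit) are rejected.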

String literals

A string literal consists of zero or more characters enclosed in double quotes. Characters may be represented by escape sequences.

	<string> : 
		'"' <string tail> '"'
		
	<string tail> :
		<string character> { <string tail> }*
		<escape sequence>  { <string tail> }*
		'%' '%' { <string tail> }*
		'"' '"' { <string tail> }*
		
	<string character> :
		any character except CR or LF or '"'

String literal escape sequences

The string literal escape sequences allow for the representation of some non-graphic characters. The double quote and the percent character are written by doubling them, as "" and %% respectively.

	<escape sequence> :
		'%' <decimal escape>
		
	<decimal escape> :
		{ <digit> }+
		<alert>
		<backspace>
		<horizontal tab>
		<carriage return>
		<new line>
		
	<digit> :
		digit 0-9

	<alert> :
		'a'

	<backspace> :
		'b'

	<horizontal tab> :
		't'

	<carriage return> :
		'r'

	<new line> :
		'n'
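
Decoding a string literal according to these rules might look like the sketch below. Two points are assumptions rather than statements from the grammar: a decimal escape is taken to be the character code, and its digits are consumed greedily. The names SIMPLE_ESCAPES and decode_string are hypothetical.

```python
SIMPLE_ESCAPES = {"a": "\a", "b": "\b", "t": "\t", "r": "\r", "n": "\n"}


def decode_string(literal: str) -> str:
    """Decode a quoted string literal into the text it represents."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("a string literal must be enclosed in double quotes")
    body = literal[1:-1]
    out = []
    i, n = 0, len(body)
    while i < n:
        ch = body[i]
        if ch == "%":
            i += 1
            if i >= n:
                raise ValueError("dangling '%' at the end of the string")
            nxt = body[i]
            if nxt == "%":                      # %% is a literal percent
                out.append("%")
                i += 1
            elif nxt in SIMPLE_ESCAPES:         # %a %b %t %r %n
                out.append(SIMPLE_ESCAPES[nxt])
                i += 1
            elif "0" <= nxt <= "9":             # %<digits>, assumed greedy
                j = i
                while j < n and "0" <= body[j] <= "9":
                    j += 1
                out.append(chr(int(body[i:j])))
                i = j
            else:
                raise ValueError(f"unknown escape sequence: %{nxt}")
        elif ch == '"':
            if i + 1 < n and body[i + 1] == '"':  # "" is a literal quote
                out.append('"')
                i += 2
            else:
                raise ValueError("unescaped double quote inside the string")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Under these assumptions the literal "100%%" decodes to 100%, and "%65" decodes to the letter A.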

Operators and punctuators

There are several kinds of operators and punctuators. Operators are shorthand forms of messages taking one operand. Punctuators are used for grouping and separating.

	<operator-or-punctuator> : one of
		'(', ')', '[', ']', '<', '>', '{', '}',
		'.', ',', '|', ':', '::', '=', '=>',
		'+', '-', '*', '/', '+=', '-=', '*=', '/=',
		'||', '&&', '^^', '<<', '>>', ':='
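
Several of these tokens are prefixes of others (':' and ':=', '<' and '<<'), so a scanner has to choose between them. A common choice is the longest match, sketched below. The text above does not say that ELENA uses it, so treat it as an assumption; the names OPERATORS, BY_LENGTH and split_operators are hypothetical.

```python
OPERATORS = [
    "(", ")", "[", "]", "<", ">", "{", "}",
    ".", ",", "|", ":", "::", "=", "=>",
    "+", "-", "*", "/", "+=", "-=", "*=", "/=",
    "||", "&&", "^^", "<<", ">>", ":=",
]
# Try longer spellings first so that ':=' is not split into ':' and '='.
BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)


def split_operators(text: str) -> list:
    """Split a run of operators and punctuators using the longest match."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] in " \t\r\n":   # white space only separates tokens
            i += 1
            continue
        for op in BY_LENGTH:
            if text.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            raise ValueError(f"no operator or punctuator at offset {i}")
    return tokens
```

With the longest match, ":= ::" yields [":=", "::"] and "<<=" yields ["<<", "="].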
