Monday, October 3, 2011

ELENA Engine: Sections, references

When I started developing my language, I wanted to write a compiler from scratch without looking at other implementations, so some of my solutions may look unorthodox. The linker is the oldest part of my code: once I had settled on its design, I stuck with it.

Any compiler may be split into several parts: a parser, a code compiler and a linker. In general, the parser parses the source code, the compiler generates the output code and the linker assembles this code into an executable file. From the linker's point of view, the program code consists of sections connected to each other by references. A section may hold executable code or data. The output sections are usually saved into temporary files, which we could call compiled modules (or simply modules).

In ELENA the situation is a bit more complex. A module may contain native executable code (in which case it is called "primitive" and is generated by external tools, for example asm2binx) or ELENA byte codes (ecodes). Ecodes have to be converted into native code with the help of a just-in-time compiler (JITCompiler and JITLinker in the ELENA Engine). This can be done in the compiler (elc) itself or in the virtual machine (elenavm) right after the module is loaded (hence the name "just-in-time compiler").

Now let's discuss the ELENA compiled module (.NL file). We can think of a module as a list of data items (sections, messages, constants). To help the linker find the required data, every list item has an identifier that is unique within its module: a reference. A reference is a 32-bit integer in which the highest byte contains the reference type (see elenascr2\engine\elenaconst.h :: ReferenceType) and the remaining 24 bits hold the reference number. Each reference number has a corresponding literal identifier in the module reference table: the reference name. The reference name consists of a module name and a proper name. If the module name equals the name of the current module, the reference points into this module; otherwise it is an external one.
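To make the reference layout concrete, here is a minimal Python sketch of packing and unpacking such a 32-bit value. The function names are made up for illustration, and the type code used in any call is a placeholder: the real codes are defined in elenaconst.h and are not reproduced here. Splitting the reference name at the last apostrophe is also an assumption, based on the "sys'entries'simple" example used later in this post.

```python
TYPE_SHIFT = 24            # the type lives in the highest byte
NUMBER_MASK = 0x00FFFFFF   # the lower 24 bits hold the reference number


def make_reference(ref_type: int, number: int) -> int:
    """Combine a type code and a reference number into one 32-bit id."""
    if not 0 <= ref_type <= 0xFF:
        raise ValueError("reference type must fit in one byte")
    if not 0 <= number <= NUMBER_MASK:
        raise ValueError("reference number must fit in 24 bits")
    return (ref_type << TYPE_SHIFT) | number


def reference_type(ref: int) -> int:
    """Extract the type code (highest byte)."""
    return (ref >> TYPE_SHIFT) & 0xFF


def reference_number(ref: int) -> int:
    """Extract the reference number (lower 24 bits)."""
    return ref & NUMBER_MASK


def is_external(reference_name: str, current_module: str) -> bool:
    """Assumed naming rule: everything before the last apostrophe is the module name."""
    module_name, _, _proper_name = reference_name.rpartition("'")
    return module_name != current_module
```

With this in place, the linker-side check "is this reference local?" reduces to comparing the module part of the name with the name of the module being processed.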

The code and data are saved into sections. Depending on the reference type, a section may contain native code, data, VM byte codes or VMTs. In most cases a section refers to other parts of the program or library. To support these references, every section is followed by a relocation table. The relocation table has quite a simple structure: it is a list of pairs, each consisting of a reference id and the position of the reference within the section (relative to the section start).
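The following Python sketch shows how a section and its relocation table could be modelled, and how a linker might patch the section body once the target addresses are known. It is an illustration under assumptions, not the engine's code: the Section class, the apply_relocations function and the 32-bit little-endian slot format are all hypothetical.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class Section:
    body: bytearray
    # Each entry is (reference id, position inside the body).
    relocations: list = field(default_factory=list)


def apply_relocations(section: Section, resolve) -> None:
    """Overwrite every referenced slot with the address returned by resolve().

    Assumption: each slot is a 32-bit little-endian integer.
    """
    for reference_id, position in section.relocations:
        struct.pack_into("<I", section.body, position, resolve(reference_id))
```

Here resolve is any callable that maps a reference id to its final address, for example a dictionary lookup wrapped in a lambda.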

The module may also contain messages (actually message qualifiers, called subjects) and constants. Every time a new subject is declared, its name is saved in the message table and the appropriate reference is used inside the code. With the help of the relocation table these ids are synchronized between different modules. The same happens with constants (numeric and literal). This allows us to have only one instance of every constant used in the code.
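The idea of synchronizing ids between modules can be sketched as a simple interning table: each module keeps its own local ids, and the linker maps them to one shared id per name. The GlobalTable class and map_module_ids function below are hypothetical helpers in Python, shown only to illustrate the principle.

```python
class GlobalTable:
    """Hands out one shared id per distinct name (subject or constant)."""

    def __init__(self):
        self._ids = {}

    def intern(self, name: str) -> int:
        # Reuse the existing id if the name is known, otherwise assign the next one.
        return self._ids.setdefault(name, len(self._ids) + 1)


def map_module_ids(local_table: dict, global_table: GlobalTable) -> dict:
    """Translate a module's local ids (local id -> name) into shared ids."""
    return {local_id: global_table.intern(name)
            for local_id, name in local_table.items()}
```

Two modules that both use the same subject name then end up with the same shared id, regardless of the local ids they assigned at compile time.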

So how does the linker work? Every program has an entry (or several of them in the case of the VM). Suppose it is "sys'entries'simple". The linker loads the module "sys'entries" and finds the appropriate reference number (based on the module reference table). To load the required data we need to know the reference type. Suppose it is a symbol reference (a byte code section). Combining the reference number and the type, we can find the required section. The reference type tells the linker where this section should be copied to, in our case to .TEXT (executable code). If it is byte code (as in our case), JITCompiler converts the ecodes into native commands. Next the linker goes through the relocation table, loads all referenced data (sections, external functions, messages, constants and so on) and updates the section body with the correct addresses.
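The steps above can be sketched as a toy linker in Python. Everything here is a simplification under stated assumptions: the ToyLinker class, the SYMBOL type code, the dictionary-based module layout and the module "sys'io" are invented for the example; section bodies are copied verbatim, so the ecode-to-native conversion is left out; and every referenced slot is assumed to be a 32-bit little-endian address.

```python
import struct

SYMBOL = 0x01  # hypothetical type code, not the real ReferenceType value


class ToyLinker:
    def __init__(self, modules):
        self.modules = modules    # module name -> {"names": ..., "sections": ...}
        self.image = bytearray()  # stands in for the .TEXT segment
        self.addresses = {}       # (reference name, type) -> offset in the image

    def resolve(self, name, ref_type):
        key = (name, ref_type)
        if key in self.addresses:              # already linked
            return self.addresses[key]
        module_name = name.rpartition("'")[0]  # assumed naming rule
        module = self.modules[module_name]
        number = module["names"][name]         # look up the reference number
        body, relocations = module["sections"][(ref_type << 24) | number]
        address = len(self.image)
        self.addresses[key] = address          # record first, so cycles terminate
        self.image += body                     # copy the section into the image
        names_by_number = {n: s for s, n in module["names"].items()}
        for reference_id, position in relocations:
            target_name = names_by_number[reference_id & 0xFFFFFF]
            target = self.resolve(target_name, reference_id >> 24)
            struct.pack_into("<I", self.image, address + position, target)
        return address


modules = {
    "sys'entries": {
        "names": {"sys'entries'simple": 1, "sys'io'print": 2},
        "sections": {
            # One opcode byte followed by a 4-byte slot that refers to reference 2.
            (SYMBOL << 24) | 1: (b"\xAA\x00\x00\x00\x00", [((SYMBOL << 24) | 2, 1)]),
        },
    },
    "sys'io": {
        "names": {"sys'io'print": 7},
        "sections": {(SYMBOL << 24) | 7: (b"\xBB\xBB", [])},
    },
}

linker = ToyLinker(modules)
entry = linker.resolve("sys'entries'simple", SYMBOL)
```

Note that the external name "sys'io'print" has the number 2 in the referring module and 7 in the module that defines it: the reference number is only meaningful within one module, and the name is what connects the two.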

Finally, here is the structure of a .NL file:

ELENA module structure
-------------------------
a) General file structure:
+--------------+
| module stamp |
+--------------+
| module name  |
+--------------+
|  references  |
+--------------+
|   messages   |
+--------------+
|  constants   |
+--------------+
|   sections   |
+--------------+

b) module stamp - fixed-size module version signature (not terminated by zero)

c) module name  - zero-terminated module name

d) references   - reference section 
+--------------+
|    size      |  total section size
+--------------+ 
|   reference  |
|     memory   |
|  hash table  |

e) messages
+--------------+
|    size      |  total section size
+--------------+ 
|   message    |
|   memory     |
|  hash table  |

f) constants
+--------------+
|    size      |  total section size
+--------------+ 
|   constant   |
|    memory    |
|  hash table  |

g) sections     - section list      

                 
{section}

where section = 
<section id>
<section size>
<section body>

<relocation table size>
<reference id>
<reference position>
...
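
As an illustration of the section record shown above, here is a minimal Python reader. The field widths are not given in this post, so the sketch assumes that every field is a 32-bit little-endian integer and that the relocation table size is a byte count with 8 bytes per entry. The names read_u32 and read_section are hypothetical.

```python
import struct
from io import BytesIO


def read_u32(stream) -> int:
    """Read one 32-bit little-endian integer (assumed field format)."""
    data = stream.read(4)
    if len(data) != 4:
        raise EOFError("unexpected end of module data")
    return struct.unpack("<I", data)[0]


def read_section(stream):
    """Read <section id>, <section size>, <section body> and the relocation table."""
    section_id = read_u32(stream)
    size = read_u32(stream)
    body = stream.read(size)
    if len(body) != size:
        raise EOFError("truncated section body")
    table_size = read_u32(stream)           # assumed to be a size in bytes
    relocations = []
    for _ in range(table_size // 8):        # each entry: reference id + position
        reference_id = read_u32(stream)
        position = read_u32(stream)
        relocations.append((reference_id, position))
    return section_id, body, relocations
```

Calling read_section repeatedly on the same stream until it is exhausted would walk the whole section list.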