Toy Compiler Using Ocamllex, Menhir, and LLVM

Implementing our own compiler for a simple C-like language.

Getting Started

A compiler is a pipeline consisting of sub-components: Lexical Analysis, Parsing, and Assembly (and linking). We implement the three components to form a complete compiler using the corresponding tools: ocamllex, menhir, and LLVM. There is a lot of theory going on in the background such as LL(0), LL(1), or BNF grammar, abstract syntax trees(AST), and so on… but we simply use this theory and recommend brushing up if unfamiliar.

The full complete code can be found at the about page.

What Is Our Grammar?

The grammar is the central part of any language. For our toy example we want to keep it simple but complex enough to keep it interesting. We will be using a simple C-like syntax, as shown below:

double circle(double r) {
    3.14 * r * r
}

double cyl(double r, double h) {
    h * circle(r)
}

int main() {
    double vol = cyl(2.0, 5.0)
    printf(vol)
    0
}

The last statement is treated as the return value for the function.

Step 1 - Lexical Analysis

Since OCAML allows use to easily pattern match and traverse ASTs, everything will be written in OCAML. A nice library for lexical analysis is ocamllex. Given our syntax, we need to break it down to a list of known tokens (with the help of regex) such as variables, numbers, operators, etc. Our file is called lexer.mll and simply contains:

{
    open Parser
}

rule token = parse
  | [' ' '\t' '\n']                  { token lexbuf }
  | ['a'-'z' 'A'-'Z' '_']['a'-'z' 'A'-'Z' '0'-'9' '_']* as s       {  TIDENTIFIER (s)}
  | ['0'-'9']+ as lxm                { TINTEGER (int_of_string lxm) }
  | ['0'-'9']+'.'['0'-'9']* as lxm   { TDOUBLE(float_of_string lxm)}
  | "=" as s                         { TEQUAL(String.make 1 s) }
  | "==" as s                        { TCEQ(s) }
  | "!=" as s                        { TCNE(s) }
  | "<"  as s                        { TCLT(String.make 1 s) }
  | "<=" as s                        { TCLE(s) }
  | ">"  as s                        { TCGT(String.make 1 s) }
  | ">=" as s                        { TCGE(s) }
  | "&&" as s                        { TCAND(s) }
  | "||" as s                        { TCOR(s) }
  | "+"  as s                        { TPLUS(String.make 1 s) }
  | "-"  as s                        { TMINUS(String.make 1 s) }
  | "*"  as s                        { TMUL(String.make 1 s) }
  | "/"  as s                        { TDIV(String.make 1 s) }
  | '('                              { TLPAREN }
  | ')'                              { TRPAREN }
  | '{'                              { TLBRACE }
  | '}'                              { TRBRACE }
  | ','                              { TCOMMA }
  | eof                              { EOF }

As you can see we define a token which can be an identifier, an integer, a double, equals, etc. And each option returns a type that may contain internal data such as the name of the identifier, or the value of the integer or double.

Step 2 - Parsing

Now we want to creeate an AST which will represent our language. To do this, we need to define fundamental expression and statement that each node in the AST will define. Hence we have the following:


type expr = 
  | Int of int
  | Double of float
  | Assignment of string * expr
  | MethodCall of string * expr list
  | BinaryOperator of expr * string * expr
  | Identifier of string

type statement = 
    | Expr of expr
    | VariableDeclarationExpr of string * string * expr (* type, id, assignment expr *)
    | VariableDeclaration of string * string
    | FunctionDeclaration of string * string * statement list * statement list

type block = statement list

We have an expression type that can be classified as an int, double, assignment, method call, binary operator, or a variable. A statement type that can be an expression, a variable decleration or assignment, and a function decleration. A block is simple a list of statements. With these types, we can parse the tokens using menhir and generate an AST that represents each expression, statement, and block. In our file parser.mly we have the following:

%{
    open Expr
%}

%token <int> TINTEGER
%token <float> TDOUBLE
%token <string> TIDENTIFIER
%token <string> TPLUS TMINUS TMUL TDIV
%token TLPAREN TRPAREN TLBRACE TRBRACE
%token <string> TEQUAL TCEQ TCNE TCLT TCLE TCGT TCGE TCAND TCOR
%token TCOMMA EOF


%type <string> ident
%type <Expr.expr> numeric expr 
%type <Expr.statement list> func_decl_args
%type <Expr.expr list> call_args
%type <Expr.block> program stmts block
%type <Expr.statement> stmt var_decl func_decl
%type <string> comparison

%start program

%%

program: stmts EOF                   { $1 }
    ;

stmts : stmt { [$1] }
    | stmts stmt { $1@[$2] }
    ;

stmt : var_decl {$1} | func_decl {$1}
    | expr { Expr($1) }
    ;

block : TLBRACE stmts TRBRACE { $2 }
    | TLBRACE TRBRACE { [] }
    ;

var_decl : ident ident { VariableDeclaration($1, $2) }
    | ident ident TEQUAL expr { VariableDeclarationExpr($1, $2, $4) }
    ;

func_decl : ident ident TLPAREN func_decl_args TRPAREN block 
                {FunctionDeclaration($1, $2, $4, $6)}
    ;

func_decl_args : {[]}
    | var_decl {[$1]}
    | func_decl_args TCOMMA var_decl {$1@[$3]}
    ;

ident : TIDENTIFIER { $1 }
    ;

numeric : TINTEGER { Int($1) }
    | TDOUBLE {Double($1)}
    ;

expr : ident TEQUAL expr {Assignment($1, $3)}
    | ident TLPAREN call_args TRPAREN {MethodCall($1, $3)}
    | ident {Identifier($1)}
    | numeric {$1}
    | expr comparison expr {BinaryOperator($1, $2, $3)}
    | TLPAREN expr TRPAREN {$2} 
    ;

call_args : {[]}
    | expr {[$1]}
    | call_args TCOMMA expr {$1@[$3]}
    ;

    comparison : TCEQ {$1} | TCNE {$1} | TCLT {$1} | TCLE {$1} | TCGT {$1} | TCGE {$1} | TPLUS {$1} | TMINUS {$1} | TMUL {$1} | TDIV {$1} | TCAND{$1} | TCOR{$1}
           ;

%%

Step 3 - Assembly using LLVM

With an AST constructed we can now generate LLVM-IR code. I recommend taking a look at the offical LLVM example project, to get a feel for how LLVM works. Essientally we go through our AST, match on the type of node, and preform the corresponding IR operation. For example, in codegen.ml we have the following match statement for Expr:

let rec codegen_expr (e: expr) =
    match e with
    | Int n -> const_int int_type n
    | Double n -> const_float double_type n
    | MethodCall (callname, args) -> 
            (*Printf.eprintf "Calling function %s\n" callname;*)
            let callee =  match lookup_function callname the_module with
                            | Some callee -> callee
                            | None -> raise (Error "unknown function referenced")
            in
            let params = params callee in
            if Array.length params == List.length args then () else
                raise (Error "incorrect # args");
            let args = List.map codegen_expr args in
            if callname = "printf" then
                let printval = List.hd args in
                match classify_type (type_of printval) with
                | TypeKind.Integer ->
                    let const_int n = const_int int_type n in
                    let str s = define_global "buf" (const_stringz context s) the_module in
                    let int_spec = build_gep (str "%d\n") [| const_int 0; const_int 0 |] "int_spec" builder in
                build_call callee [| int_spec;  (List.hd args) |] "" builder
                | _ ->
                    let const_int n = const_int int_type n in
                    let str s = define_global "buf" (const_stringz context s) the_module in
                    let int_spec = build_gep (str "%f\n") [| const_int 0; const_int 0 |] "float_spec" builder in
                build_call callee [| int_spec;  (List.hd args) |] "" builder

            else
            build_call callee (Array.of_list args) "calltmp" builder
    | BinaryOperator (lhs, op, rhs) ->
            (*Printf.eprintf "Operator %s\n" op;*)
            let lhs_val = codegen_expr lhs in
            let rhs_val = codegen_expr rhs in
            (*Printf.eprintf " type LHS %s" (string_of_lltype (type_of lhs_val));
            Printf.eprintf " type RHS %s\n" (string_of_lltype (type_of rhs_val));*)
            let w =
                match classify_type (type_of lhs_val), classify_type (type_of rhs_val) with
                | TypeKind.Integer, TypeKind.Integer -> 0
                | _, TypeKind.Double -> 1
                | TypeKind.Double, _ -> 1
                | _,_ -> 1
            in
            (
                match op with
                    | "+" -> if w = 0 then build_add lhs_val rhs_val "addtmp" builder else build_fadd lhs_val rhs_val "addtmp" builder
                    | "-" -> if w = 0 then build_sub lhs_val rhs_val "subtmp" builder else build_fsub lhs_val rhs_val "subtmp" builder
                    | "*" -> if w = 0 then build_mul lhs_val rhs_val "multmp" builder else build_fmul lhs_val rhs_val "multmp" builder
                    | "/" -> if w = 0 then build_sdiv lhs_val rhs_val "divtmp" builder else build_fdiv lhs_val rhs_val "multmp" builder
                    | "&&" -> build_and lhs_val rhs_val "andtmp" builder
                    | "||" -> build_or lhs_val rhs_val "ortmp" builder
                    | _ -> failwith "Unsupported"
            )
    | Identifier name ->
            (
            (*Printf.eprintf "Variable %s\n" name;*)
            let v = try Hashtbl.find named_values name with
                | Not_found -> raise (Error "unknown variable name")
            in
            match v with
            | t, v -> (*build_load v name builder*) v
)

As you can see, we match an expression with either Int, Double, Method Call, Binary Operator, or Identifier and then generate LLVM code such as const_int int_type for integer number.

Building/Compiling

We create a main file that defines the compiler sub-components. In our file main.ml we have:

open Lexing
open Parser
open Lexer
open Codegen
open Expr

let filename = Sys.argv.(1)
let filename_llvm = Sys.argv.(2)

let () = 
  let block = open_in filename |>
  Lexing.from_channel |>
  Parser.program Lexer.token in
  (*|> print_block in*)
codegen_main block filename_llvm

This uses the pipe operator to pass along the block of code between each of the sub-components, with the final function generating the LLVM.

Now to build ocaml: ocamlbuild -use-ocamlfind -pkgs llvm,llvm.analysis,llvm.bitwriter -use-menhir main.byte. We can now pass into main.byte a file using toy language and output name for LLVM-IR. Afterwards, we compile using llc and gcc or clang to create our executable.

For our example to find the volume of a cylinder:

main.byte cylinder.in llvm-ir.out

more llvm-ir.out
@buf = global [4 x i8] c"%f\0A\00"

declare i32 @printf(i8*, ...)

; Function Attrs: noinline nounwind optnone sspstrong uwtable
define double @circle(double %r) #0 {
entry:
  %multmp = fmul double %r, %r
  %multmp1 = fmul double 3.140000e+00, %multmp
  ret double %multmp1
}

; Function Attrs: noinline nounwind optnone sspstrong uwtable
define double @cyl(double %r, double %h) #0 {
entry:
  %calltmp = call double @circle(double %r)
  %multmp = fmul double %h, %calltmp
  ret double %multmp
}

; Function Attrs: noinline nounwind optnone sspstrong uwtable
define i32 @main() #0 {
entry:
  %calltmp = call double @cyl(double 2.000000e+00, double 5.000000e+00)
  %0 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @buf, i32 0, i32 0), double %calltmp)
  ret i32 0
}

attributes #0 = { noinline nounwind optnone sspstrong uwtable }

llc llvm-ir.out && clang -o cylinder llvm-ir.s

And finally, running our compiled code prints our the volume of the cylinder!

./cylinder
62.800000

This is pretty cool! We went from defining our little language, to parsing it and compiling it, to running it. There are of course plenty of other features that we can add or include such as using LLVM optimization, or even optimizing the AST ourselves. But in the end we created a language that we can call our own.

References

FRC Kickoff 2019

PDF of presentation slides

PDF of FRC Programming Document

Deutsch Algorithm

Implementing 2-qubit Deutsch Algorithm

Initial Problem

Suppose we have a black box that takes in two input bits and produces two output bits. We would like the query the black box as few times as possible to learn what the black box does.

In Deutsch’s problem the black box implements one of the following: \[ f_1(x,y) = (x,y)\] \[ f_2(x,y) = (x,\bar{y})\] \[ f_3(x,y) = (x,x \oplus y)\] \[ f_4(x,y) = (x,x \oplus \bar{y})\]

Make it Quantum

We write the four possible functions as unitary matrices operating on the standard computational basis.

\[ U_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \] \[ U_2 = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix} \] \[ U_3 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix} \] \[ U_4 = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \] These applied to \( \left|00\right\rangle \), \( \left|01\right\rangle \), \( \left|10\right\rangle \), \( \left|11\right\rangle \) yield the expected result.

Now we can solve this problem in one query if by using a general superposition and manipulation of the two qubits.

Algorithm

We draw the quantum circuit using qasm2circ:

Circuit of Deutsch Algorithm for two qubits

The Hadamard gate is: \[ H = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}\]

When we apply the Hadamard gate: \[ \left|\psi\right\rangle = H\left|0\right\rangle \otimes H \left|1\right\rangle\] \[ = \frac{1}{\sqrt{2}}(\left|0\right\rangle + \left|1\right\rangle) \otimes \frac{1}{\sqrt{2}}(\left|0\right\rangle - \left|1\right\rangle) \] \[ = \frac{1}{2} (\left|00\right\rangle - \left|01\right\rangle + \left|10\right\rangle - \left|11\right\rangle) \]

Now, once we apply \( U_i \) we obtain: \[ \left|\phi_1\right\rangle = U_1 \left|\psi\right\rangle = \frac{1}{2} (\left|00\right\rangle - \left|01\right\rangle + \left|10\right\rangle - \left|11\right\rangle) \] \[ \left|\phi_2\right\rangle = U_2 \left|\psi\right\rangle = \frac{1}{2} (-\left|00\right\rangle + \left|01\right\rangle - \left|10\right\rangle + \left|11\right\rangle) \] \[ \left|\phi_3\right\rangle = U_3 \left|\psi\right\rangle = \frac{1}{2} (\left|00\right\rangle - \left|01\right\rangle - \left|10\right\rangle + \left|11\right\rangle) \] \[ \left|\phi_4\right\rangle = U_4 \left|\psi\right\rangle = \frac{1}{2} (-\left|00\right\rangle + \left|01\right\rangle + \left|10\right\rangle - \left|11\right\rangle) \]

We finish off with the final Hadamard Gates: \[ H \otimes H = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} \]

Thus, finally: \[ \left|o_1\right\rangle = (H\otimes H)U_1\left|\psi\right\rangle = \left|01\right\rangle\] \[ \left|o_2\right\rangle = (H\otimes H)U_2\left|\psi\right\rangle = -\left|01\right\rangle\] \[ \left|o_3\right\rangle = (H\otimes H)U_3\left|\psi\right\rangle = \left|11\right\rangle\] \[ \left|o_4\right\rangle = (H\otimes H)U_4\left|\psi\right\rangle = -\left|11\right\rangle\]

This is great, we can figure out the black box by observing the first bit.

Visualization

Let’s draw the states as we evolve in time with the circuit. As an example, lets use \( U_2\).

We start with \[ \left|\psi\right\rangle = (H \otimes H) (\left|0\right\rangle \otimes \left|1\right\rangle) \] \[ = (H \otimes H) \begin{pmatrix} 1 \\ 0 \end{pmatrix} \otimes \begin{pmatrix} 0 \\ 1 \end{pmatrix} \] \[ = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix} \] \[= \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \\ 1 \\ -1 \end{pmatrix}\]

The density matrix is \[ \left|\psi \right\rangle \left\langle\psi\right| = \frac{1}{4}\begin{bmatrix} 1 & -1 & 1 & -1 \\ -1 & 1 & -1 & 1 \\ 1 & -1 & 1 & -1 \\ -1 & 1 & -1 & 1 \end{bmatrix} \]

For practice we take the partial trace and draw the mapped density matrix to the Bloch sphere. We then apply \( U_2\) to \( \left|\psi \right\rangle\). And once again draw to the Bloch Sphere.

Finally, we apply \( H \otimes H\) and draw to the Bloch Sphere.

For \(U_2\) we obtain the following animation: Bloch Sphere Animation for U_2

Following similar procedure for \(U_4\) we obtain the following animation: Bloch Sphere Animation for U_4

Wave packet in 2D potential

Having some fun with Julia and simulating a Wave packet in 2D.

Main Idea

We will evolve the wavepacket in 2D by taking the tensor products between two 1D (x,y) spaces.

Wavepacket

The wavepacket is expressed as a Gaussian:

\[\Psi(x) = \frac{\sqrt{\Delta x}}{\pi^{1/4}\sqrt{\sigma}} e^{i p_0 (x-\frac{x_0}{2}) - \frac{(x-x_0)^2}{2 \sigma^2}}\]

With the basis spanning the 1D space. And total state is represented by \(\Psi = \Psi_x \otimes \Psi_y\)

Hamiltonian

The kinetic energy term is p_x^2/2 and p_y^2/2. The two-dimensional potential is a boundary with two slits.

function potential(x,y)
        if x > 20 && x < 25 && abs(y) > 1 && abs(y) < 5
                return 0
        elseif x > 20 && x < 25
                return 150
        elseif x > 48
                return 150
        else
                return 0
        end
end

I also use a gaussian potential:

potential(x,y) = exp(-(x^2 + y^2)/15.0)

Time Evolution

At first, I had some trouble with installing DifferntialEquations.jl due to compilation errors for Arpack.jl. So I decided to just solve it manually by separating the real and imaginary components for the state, and then preforming the Discrete Laplace Operator for finite differences. Followed by adding the potential part. This is what I got: Double slit with Laplace

However, before I began working on visualization, Arpack.jl mysteriously compiled (probably because I switched to the official binary of Julia, rather than compiling my own). So I started exploring DifferntialEquations.jl. I then discovered QuantumOptics.jl which makes it pretty easy to simulate various quantum systems, so decided I should use it. It comes with a nice timeevolution.schroedinger which will evolve (integrate) the system.

Double Split starts with:

x0 = -5
y0 = 0
p0_x = 3.0
p0_y = 0.0
σ = 2.0

Guassian starts with:

x0 = -5
y0 = 0
p0_x = 1.5
p0_y = -.5
σ = 2.0

Results

Julia has very nice support for plotting and animations. I abuse this here. Moreover, due to the Julia pre-compilation step, as long as you do not abuse the global space too much, the processing is rather quick.

Double Slit potential with magnitude mapped to color

Gif of Double slit

Gaussian Potential

Gif of Gaussian Potential

Represent phase as hue and magnitude as value (HSV)

Decided this is not enough to show the true “wave” nature as the visualization neglects the actual phase component… So time for more color!

Double Slit potential

Gif of Double slit under HSV

Gaussian Potential

Gif of Gaussian potential under HSV

HSV and Showing Potential

Guess this would be a good time to actually visualize the of potential, and get our final nice visualization

Double Slit potential

Gif of Double slit under HSV with potential

Gaussian potential

Gif of Gaussian potential under HSV with potential

About