← All Posts

Compilers are hard

May 17, 2026

So it turns out compilers are incredibly difficult. My respect for the teams making compilers has shot up immensely in the last 2 weeks; a compiler is a much bigger beast of a program than i had initially imagined.

What can I do so far? Realistically very little. I support the keywords ‘int’, ‘void’ and ‘return’. I support most binary math expressions, and bitwise expressions, but I don’t currently support logical operations (<, >, &&, ||, etc.) but those are a work in progress. I also don’t support variable assignment as of yet.

/* comments are supported and stripped out later */
#pragma once //all preprocesser statements are ignored

int main (void) { //the void is ignored but is valid syntax
    5 * ( 2 + 10 % 3 ) + 1; //this statement doesn't do anything but is supported
    return ~ ( -2 );
}

Above is an example of a program that would currently compile with my compiler, and below after talking about each section I have put the relevant debug output (tokens, ast, mir, assembly)

I guess I can explain what a compiler is. In short its a collection of smaller functions with the purpose of changing a human readable languange (c in this case) to a machine readable language, aka, a binary. This is done by taking an input file, lexing it to turn the input into tokens, turning the tokens into an abstract syntax tree, turning the AST into an intermediate representation (IR) with single static assignment, optimizing the IR, turning the IR into assembly, and finally turning the assembly into a binary.

Lexing is a fairly easy step. It is just a loop that looks throug every character of the input with a huge chain of if and if else statements. Is it a letter? Try to make a word, and match that word to a keyword. If it doesn’t match then it’s an identifier. For instance, ‘main’ is an identifier because its the name of a function. If it’s a number, it is assumed to be some kind of math statement and will turn a statement like ‘2 + 3’ into the tokens ‘2’ ‘+’ ‘3’, and so on. This step also checks for some of the language rules but very few.

/* comment string */
int identifier ( void ) { // comment string
number * ( number + number % number ) + number ; // comment string
return ~ ( - number ) ;
} End 

Parsing is the step that changes the tokens into a tree representation of the program. Also parsing is where a lot of the language’s rules are checked for. Are all bracket’s opened and closed properly, do statements end in a semicolon, does a function declaration have a type identifier, ensure the right precedence to math operations, etc. This is mostly a tedious step where not much of interest actually happens (so far).

main:
  ExprStatement:
    Binary(+)
      Binary(*)
        Number(5)
        Binary(+)
          Number(2)
          Binary(%)
            Number(10)
            Number(3)
      Number(1)
  Return:
    Unary(~)
      Unary(-)
        Number(2)

IR is next. Currently I have broken this up into two separate representations. MIR (middle IR) which is a SSA turns the AST into a set of linear instructions. This will continue to grow more and more complex as I build more support in, but this is where the instructions still represent a high level code but approaching the assembly instruction set. By making this SSA which is where every variable and constant in the program is given a unique temporary variable, is starting to represent how the stack and registers are used in the final version.

main:
  t0 = 5
  t1 = 2
  t2 = 10
  t3 = 3
  t4 = t2 % t3
  t5 = t1 + t4
  t6 = t0 * t5
  t7 = 1
  t8 = t6 + t7
  t9 = 2
  t10 = -t9
  t11 = ~t10
  ret t11

I also use LIR which is an object representation of the assembly. I guess technically this could just be called the assembly generation but I like my name. This is where i turn the the MIR into the assembly instructions, and when i want to generate the assembly, I generate it directly from the LIR (it just has to be turned to strings and formatted in the right way) and save that result as its own file.

    .text

    .global  main
    .type    main, @function

main:
    pushq  %rbp
    movq   %rsp,  %rbp
    movl   $5, -4(%rsp)
    movl   $2, -8(%rsp)
    movl   $10, -12(%rsp)
    movl   $3, -16(%rsp)
    movl   -12(%rsp), %eax
    cdq
    idivl  -16(%rsp)
    movl   %edx, -20(%rsp)
    movl   -8(%rsp), %r10d
    addl   -20(%rsp), %r10d
    movl   %r10d, -24(%rsp)
    movl   -4(%rsp), %r10d
    imull  -24(%rsp), %r10d
    movl   %r10d, -28(%rsp)
    movl   $1, -32(%rsp)
    movl   -28(%rsp), %r10d
    addl   -32(%rsp), %r10d
    movl   %r10d, -36(%rsp)
    movl   $2, -40(%rsp)
    movl   -40(%rsp), %r10d
    negl   %r10d
    movl   %r10d, -44(%rsp)
    movl   -44(%rsp), %r10d
    notl   %r10d
    movl   %r10d, -48(%rsp)
    movl   -48(%rsp), %eax
    movq   %rbp, %rsp
    popq   %rbp
    ret


    .section .note.GNU-stack,"",@progbits
    .ident "PyrrhicIR (https://gitea.izirith.com/azaroth08/PyrrhicIR)"%

The final step is calling the linker. Currently this is also a trivial step because I didn’t have to make the linker myself, I just call it from within my compiler. In this case I am using cc.

Clearly, I have a lot of work left to do but I do have a functioning, but very limited, compiler.