This article was originally published on Valerius Petrini's Programming Blog on 06/10/2025
Over the past year I've been working on my own programming language, Jakarta, a compiled language that I needed to build a compiler for. If you're unfamiliar with compilers, think of them as the program that converts your programming language's code (in this case, .jak files) into machine code that your computer can then run.
Building a compiler is a huge undertaking that needs a lot of time and effort, especially if you're starting from scratch rather than using something like LLVM to speed things up. Because of that, it's a good idea to have a step-by-step outline from the beginning so you know exactly what you're doing. For Jakarta, as for many other compilers, the process is split into four basic parts: the lexer, the parser, the intermediate representation, and the assembler.
The Lexer
Think of the lexer as a tokenizer: it splits up your code into tokens to make it easier for the compiler to understand things.
Take this line:
function factorial(int n)
The tokens would fall into these categories:
- Keywords: function
- Identifiers: factorial, int, n
- Symbols: ( and )
Since Jakarta is written in C, our first step is to put everything into an array, splitting the code into one of three token types:
- Identifiers: things like variable, function, and type names
- Symbols: punctuation and operators
- Literals: values like numbers or strings
We can use a C enum to represent the type of each token, such as:
enum TokenType {
    TOKEN_KEYWORD_FUNCTION,
    TOKEN_SYMBOL_IDENTIFIER,
    TOKEN_STRING_LITERAL,
    TOKEN_NUMBER,
    TOKEN_OPEN_PAREN,
    TOKEN_CLOSE_PAREN,
    // etc.
};
To make debugging easier, we also store the line and column numbers of each token. This helps us report errors with helpful messages like “unexpected token on line 4, column 12.”
Here's the line from earlier again, along with the tokens it produces (line and column numbers omitted):
function factorial(int n)
Token: "function", Type: KEYWORD_FUNCTION
Token: "factorial", Type: SYMBOL_IDENTIFIER
Token: "(", Type: SYMBOL_OPEN_PARENTHESIS
Token: "int", Type: SYMBOL_IDENTIFIER
Token: "n", Type: SYMBOL_IDENTIFIER
Token: ")", Type: SYMBOL_CLOSING_PARENTHESIS
We can take all this data and move to the parser, which makes sure that these tokens are in the right order.
The Parser
The parser is the section of the compiler that checks if all of the tokens are in the right order, and stores metadata about functions and variables.
The best part about this token-based approach is that it makes it really easy to check whether our tokens are in the right order. Take our well-used factorial function:
function factorial(int n)
If we find a KEYWORD_FUNCTION token, then we expect a SYMBOL_IDENTIFIER (the name of the function), then an opening parenthesis, then the parameters (each with a type and a name, separated by commas if there are multiple), and finally the closing parenthesis.
To do this, we can have two basic functions, peek and consume:
Token* consume(Tokenizer* tokenizer) {
    Token* content = get_from_array(tokenizer->tokens, 0);
    remove_from_array(tokenizer->tokens, 0);
    return content;
}

bool peek(Tokenizer* tokenizer, TokenType symbol) {
    Token* token = get_from_array(tokenizer->tokens, 0);
    return token->symbol == symbol;
}
peek checks if the next token is the correct type, and consume removes the token and returns it. We also have a peekConsume function that consumes the token if peek returns true.
Here are some other parts of the parser that help make sure everything is in order:
The Abstract Syntax Tree (AST)
The Abstract Syntax Tree is perhaps the largest part of the parser, as it handles converting the tokens from our lexer into a tree format. This tree structure filters out syntactic noise (like punctuation), leaving only the essential structure of the code.
The logic for creating the AST can be split into two parts: keywords and expressions. Keywords are exactly what they sound like: words like if and function that have predefined structures.
Take function as an example; we can have an AST set up like this:
Function Node
├ Name Node: factorial
├ Return Type Node: int
├ Parameters Node
│ └ Parameter Node: int n
└ Body Node: Contains All Child Nodes
All of the nodes within our function go into the Body Node in sequential order, so in the next step we can read them in one by one and convert them into our Intermediate Representation.
Expressions, on the other hand, vary a lot. They can be mathematical equations assigned to variables, boolean conditions used in if statements, or values returned from functions. Here are some examples:
10 + 20 * 4
n % 2 == 0 ? "Even" : "Odd"
"%s: %d".format(name, age)
As you can see, expressions take on many different forms and can vary in how big or small they are. To handle these expressions, we use the shunting yard algorithm, which converts infix expressions (like a + b * c) into postfix (Reverse Polish Notation), making it easier to build a structured tree.
Here's an example of 10 + 20 * 4 in our AST:
Binary Expression Node: +
├ Left: 10
└ Right:
Binary Expression Node: *
├ Left: 20
└ Right: 4
We evaluate from the bottom up, doing 20 * 4 first, which ensures operator precedence is handled correctly.
Error Checking
Most error handling happens in the parser, as we are passing through the tokens to make sure they are in the right order. An example of this would be throwing an error if a variable name is declared multiple times, or if we use a keyword as an identifier. We can use the line and column number from the lexer here to make it easier for the developer to see what is happening.
Symbol Tables
Symbol tables are used to store all variable, function, and type data (Jakarta allows developers to create custom types). These are stored in a hashmap with the key being the name, and the value being a struct containing the metadata of that variable, function, or type.
We only need one hashmap for functions and types, as they can only be declared at the top level, but for variables we need to take scopes into account to implement shadowing and handle duplicates correctly. To do this, we create a stack of hashmaps: when we enter a new scope, we push a new hashmap onto the stack and add that scope's variables to it; when the scope is exited, we pop the most recent hashmap.
For function parameters, we can add those to the same scope as the content of that function.
Type Checking
During parsing, we also check that all of our types are correct. This includes checking that we assign values of the correct types to variables, and that any expression inside an if statement resolves to a boolean.
By building the AST, we make it easier to move into the next portion of the compiler: the intermediate representation.
Intermediate Representation (IR)
The reason we have an Intermediate Representation (IR) is because each type of computer chip runs a different type of assembly language. For example, Intel x86 CPUs use a different instruction set than ARM chips used in smartphones. Instead of writing a completely separate code generator for each chip, we first translate our source code into an IR, and then convert that IR into the appropriate assembly based on the target architecture.
Think of the Intermediate Representation as a "global" assembly language (yes, we are making another language alongside our programming language). The goal is to get as close to assembly as possible without depending on any specific hardware.
To illustrate this, take the example of adding two and two in two different architectures:
; x86 Assembly
mov eax, 2 ; Move 2 into register eax
add eax, 2 ; Add 2 to eax
; ARM Assembly
mov r0, #2 ; Move 2 into register r0
add r0, r0, #2 ; Add 2 to r0, store result in r0
Although they're different syntactically, the logic is similar: move 2 into a register, then add 2 to it. So our IR could look something like this:
MOV REGISTER1 2
ADD REGISTER1 2
Each node in the Abstract Syntax Tree (AST) can be translated into these IR instructions. Once the IR is built and saved (often into a .ir or .jakir file), the next stage, the assembler, can detect the current architecture and generate the appropriate native machine code.
The Assembler
The assembler is one of the easiest parts of the compiler to explain. It takes the Intermediate Representation (IR) and generates assembly code for the target architecture (like x86_64 or ARM). That assembly is then assembled into an .obj or .o (object) file, which contains the machine code instructions.
Depending on the compiler and build system, a separate tool called the linker may then be used to combine multiple object files and libraries into a single executable (like an .exe on Windows). This final file is what you can run on your computer.
Conclusion
Although every compiler is built a little differently, these four stages and their sub-steps cover the basics of compiler design.
Thanks for reading!