Building a Python Interpreter in C: How nuPython Works Under the Hood
What does it actually mean to “run” a Python program? Not the abstract answer — the mechanical one. CPython, the reference implementation, is a C program. So is nuPython, a minimal Python interpreter I built from scratch. The difference is scale and completeness. The fundamentals are identical.
The entire interpreter is orchestrated in four lines:
struct TokenQueue* tokens = parser_parse(input);
struct STMT* program = programgraph_build(tokens);
struct RAM* memory = ram_init();
execute(program, memory);
Scan characters into tokens. Build a graph from those tokens. Allocate memory. Execute the graph against that memory. Then destroy everything in reverse. Every language runtime — CPython, V8, the JVM — follows this same pipeline. What nuPython makes explicit is what each stage actually is in memory: a struct, a linked list, a union, a heap-allocated array.
Stage 1: Characters to Tokens
The scanner reads a raw stream of bytes. Its job is to recognize that while is not five characters — it’s nuPy_KEYW_WHILE. This is called tokenization. A token is the smallest unit of meaning in a language: a keyword, a literal, an operator, a name.
Every token is represented by this struct:
struct Token {
int id; // which token (nuPy_KEYW_WHILE, nuPy_INT_LITERAL, etc.)
int line; // source line (1-based)
int col; // source column (1-based)
};
Three integers. Twelve bytes. The id encodes what the token is — one of 81 distinct types ranging from nuPy_PLUS to nuPy_KEYW_PASS. The line and col fields encode where it appeared. These survive all the way to runtime: when the interpreter prints **SEMANTIC ERROR: name 'x' is not defined (line 7), that 7 is token.line.
The token struct doesn’t store a string value — nuPy_PLUS is always +, there’s nothing to store. Only identifiers and literals carry a string payload, and that lives in the token queue node.
The Token Queue
Tokens are collected into a TokenQueue: a singly-linked list where each node holds the token metadata, an optional char* value (the identifier’s name, the literal’s text), and a next pointer.
Two operations make parsing possible without committing prematurely. peekToken() inspects the front of the queue without consuming it. peek2Token() looks one further ahead. This two-token lookahead is necessary because the grammar is ambiguous at one token: *x at the start of a line could be a pointer dereference assignment or the beginning of an expression, and you can’t tell until the second character. duplicate() copies the entire queue for speculation — the parser can attempt a parse on the copy, abandon it on failure, and retry on the original. Classic backtracking without modifying the primary stream.
Stage 2: Tokens to Program Graph
The parser validates syntax and hands a token queue to programgraph_build(), which produces the program graph — the structure the execution engine will traverse. This is roughly what other systems call an Abstract Syntax Tree (AST), except statements form a linked list rather than a true tree.
The central design pattern is the discriminated union (also called a tagged union). Every statement in nuPython is represented by a single STMT struct:
struct STMT {
int stmt_type; // enum: ASSIGNMENT, FUNCTION_CALL, IF_THEN_ELSE, WHILE_LOOP, PASS
int line;
union {
struct STMT_ASSIGNMENT* assignment;
struct STMT_FUNCTION_CALL* function_call;
struct STMT_IF_THEN_ELSE* if_then_else;
struct STMT_WHILE_LOOP* while_loop;
struct STMT_PASS* pass;
} types;
};
The stmt_type integer is the tag. Without it, a caller holding a struct STMT* has no safe way to know which union member is valid — reading the wrong branch is undefined behavior. The tag is what makes runtime dispatch safe. When the execution engine switches on stmt_type, it knows exactly which pointer to dereference.
If/elif/else chains are the most interesting structure. Consider:
if x < 0:
y = -1
elif x == 0:
y = 0
else:
y = 1
The STMT_IF_THEN_ELSE struct has a true_path (statements to run if the condition is true), a false_path (what to run otherwise), and a next_stmt (where execution resumes after the entire block). The elif isn’t a special case — it’s just another STMT_IF_THEN_ELSE living at the false_path of the one above it. The chain is recursive. The execution engine handles arbitrary elif depth without any special logic: it follows false_path until a branch matches or all paths are exhausted.
The next_stmt pointer deserves attention. It’s the stop-node mechanism. When execute_until(program, memory, stop_node) is called to run an if body, it receives the outermost next_stmt as the stop. The recursive call stops there — it doesn’t bleed into whatever statement follows the entire if block. No explicit return-address stack. The program graph’s pointer structure is the control flow graph.
Stage 3: RAM — A Custom Memory Abstraction
When Python executes x = 5, something has to store the integer 5 and associate it with the name "x". CPython uses a reference-counted object heap. nuPython uses struct RAM — a flat, explicitly-managed variable store.
Every value is typed at runtime:
struct RAM_VALUE {
int value_type; // INT, REAL, STR, PTR, BOOLEAN, NONE
union {
int i; // INT, PTR, BOOLEAN
double d; // REAL
char* s; // STR
} types;
};
The union means each cell costs max(sizeof(int), sizeof(double), sizeof(char*)) bytes of payload — 8 bytes on a 64-bit machine — not the sum of all three. Type is stored as a 4-byte tag. The whole value is 16 bytes with padding.
The RAM struct holds an array of these cells plus a map from variable names to cell indices:
struct RAM {
struct RAM_VALUE* cells; // array of typed values
struct RAM_MAP* map; // varname → cell index
int size;
int capacity;
};
There are two read functions: ram_read_cell_by_name() does an O(n) scan of map to find the index, then O(1) into cells. ram_read_cell_by_addr() skips the name lookup entirely — O(1) direct. This dual interface is what makes pointer semantics work.
The pointer in nuPython is literally an integer array index. p = &x stores the cell index of x as a RAM_TYPE_PTR integer in p. Later, *p = 10 reads that integer from p and calls ram_write_cell_by_addr() with it. There’s no actual C pointer dereference happening — it’s two RAM operations: a name lookup returning an integer, then an index-based write.
x = 5
p = &x
*p = 99
print(x) # prints 99
String ownership is explicit throughout. Every ram_write_cell_by_name() call duplicates the string — the RAM owns its copy. Every ram_read_cell_by_name() call also duplicates the string — the caller owns that copy and must call ram_free_value() when done. Conservative, but correct. No use-after-free from a stale pointer into RAM’s internal storage.
Stage 4: Execution
The execution engine is a dispatch loop:
while (stmt != NULL && stmt != stop_node) {
switch (stmt->stmt_type) {
case STMT_ASSIGNMENT: execute_assignment(...); break;
case STMT_FUNCTION_CALL: execute_function_call(...); break;
case STMT_IF_THEN_ELSE: /* evaluate, recurse */ break;
case STMT_WHILE_LOOP: /* evaluate, recurse */ break;
case STMT_PASS: /* no-op */ break;
}
stmt = get_next_stmt(stmt);
}
Expression evaluation is where types collide. The binary operator handler inspects both operands’ value_type fields and routes to a type-specific function: execute_binary_int, execute_binary_real, or execute_binary_str. Mixed int and real operands are upcasted to double before the real handler runs — the same implicit promotion Python applies, made explicit here because C has no implicit promotion for user-defined types.
Division by zero is caught before the operator executes:
if (right == 0) {
printf("**SEMANTIC ERROR: division by zero (line %d)\n", stmt->line);
*success = false;
return;
}
No signal handler. No setjmp/longjmp. No exception. The error propagates up through every caller via *success = false. This is the Go-style error-first pattern: explicit, predictable, and zero magic.
String concatenation mallocs a new buffer: malloc(strlen(left) + strlen(right) + 1). The +1 is the null terminator — a classic C pitfall when forgotten. The result is caller-owned and eventually lands in RAM, where it gets duplicated again.
The four builtins — print(), input(), int(), float() — are dispatched by function name string comparison. input() always returns RAM_TYPE_STR, exactly like Python’s input(). int() uses atoi() with a validity check: atoi returns 0 both on success with input "0" and on failure with non-numeric input, so the interpreter checks whether the input string is actually "0" to distinguish them.
x = int(input("Enter a number: "))
y = x * x
print(y)
What Was Left Out (and What That Teaches You)
nuPython supports integer, float, string, boolean, and None types. Assignments, while loops, if/elif/else, and four builtins. No def, no for, no lists, no dicts, no exceptions.
Each omission is a lesson. No garbage collector means every malloc has a visible paired free. The ownership contract is explicit — you can trace exactly who allocated and who frees every byte. CPython’s reference counting hides this; nuPython forces you to see it. No exception mechanism means every error is a returned NULL or false. The discipline required to propagate errors correctly through ten call frames is exactly what makes robust C code hard. No operator precedence means a * b + c requires tmp = a * b; result = tmp + c — which surfaces the fact that precedence in a real parser isn’t magic, it’s grammar rule hierarchy. No def means no call stack, no local scopes, no return mechanism — each of which is a significant addition on its own.
nuPython is about 2,000 lines of C. CPython is 400,000. The delta is features. The fundamentals are the same.