Writing an emulator: the first steps


I’ve read an enlightening blog post. I’ve set up a fresh new git repo. I’ve tracked down a dump of the Game Boy’s boot ROM. I’m pumped as fuck!

Now what?

Setting the bar (low?)

The task ahead, exciting as it is, might seem a little daunting at first. Let’s set ourselves a small, reasonably attainable goal. Implementing a CPU from scratch isn’t so scary, there are good ways to get started with that, like the Synacor Challenge or the Advent of Code. It’s basically writing a loop in which we’ll associate the next value in a memory array to some function, call said function, increment our index in the memory array and repeat the whole thing ad infinitum — or until we run out of memory to read from.

So my first milestone will be to write a very simple command-line program that:

  • Defines a CPU structure representing the Game Boy
  • Loads the boot ROM file in memory
  • Executes the first few bytes of that ROM
  • Displays the current state of the CPU when it’s done

All of this could reasonably be done in a day with a single array, a big switch/case block and a for loop. The boot ROM I’m going to have to execute is only 256 bytes long after all1 Only 200 of which are actual machine code as the ROM contains 56 bytes worth of tile graphics.. To help even further, the commented assembly code for the boot ROM is also available online, and I can easily pick out the first couple machine instructions I’ll need to implement.

All right, I can do this, let’s get started!

Let’s get a couple things out of the way right now: I’m writing this emulator in Go because it’s a language I’m having lots of fun with these days and it’s low-level enough for my purpose. I’m also writing most of the code in Visual Studio Code since it’s got good support for Go — especially the debugger — and runs great on Linux. That’s it, your mileage may vary.

Now, what’s a CPU anyway? Structurally, it’s not much more than a bunch of data (8-bit) registers to use as storage when operating on numbers and a couple address (16-bit) registers used to refer to the next instruction to execute in memory, or the bottom of the stack. Yes, this is a terrible oversimplification, but still a good enough approximation of the Game Boy’s CPU2A Sharp LR35902 which is that weird mix between an 8080 and a Z80. to get us started.

The Game Boy’s CPU registers are few, well documented, familiar if you’ve seen Z80 assembly before and easily summarized in a structured type. Go’s fixed-size types already pretty accurately represent the range of values each register supports, which means they’ll wrap around nicely when we operate on them, without needing checks or modulo operations to truncate results. It will make it a little more awkward to use a pair of 8-bit registers as a single 16-bit register (AF, BC, DE or HL), as some CPU instructions do, but I’ll worry about that later. Also thanks to Go, all these fields contain zero by default.

type CPU struct {
        A, F   uint8
        B, C   uint8
        D, E   uint8
        H, L   uint8
        SP     uint16
        PC     uint16

The first register we need to look at to get started is PC (for Program Counter), which tells the CPU where to read the next instruction in memory. It conveniently starts at zero so we already know to look at the very beginning of the boot ROM to see what is the very first CPU instruction we should implement. We could check the disassembled code from the boot ROM, or we could also just open said ROM with an hexadecimal editor, see that the first byte is 0x31 and look up what CPU instruction it is, either in the Pan Docs or in the instruction set tables available online — I mostly used the latter to look up opcodes and the former when I needed more information about an opcode in particular.

Either way, we’ll find out that 0x31 is the opcode for LD SP,d16 — a somewhat barbaric way to say “store (LoaD) into register SP the value contained in the next two bytes in memory”. The disassembled boot code even offers a quick explanation:

LD SP,$fffe             ; $0000  Setup Stack
XOR A                   ; $0003  Zero the memory from $8000-$9FFF (VRAM)
LD HL,$9fff             ; $0004

So the first thing a Game Boy’s CPU does when booting up is to set its SP register to the value 0xfffe. Easy! Obviously, I still need a tiny bit of logic to extract that 0xfffe value from memory after reading the instruction, which implies implementing memory first. I’ll get to that in a moment. For the time being, I’ll assume that, somewhere in that infinite loop I was mentioning back in the introduction, there will be a bit of code reading a little like:

for {
        opcode := memory[cpu.PC]

        // ...
        if opcode == 0x31 {
                cpu.SP = memory[cpu.PC] | (memory[cpu.PC+1] << 8)
                cpu.PC += 2
        // ...

Now, this is assuming that we have a memory array which contains, from index 0 onward, the individual bytes of the boot ROM code. Still, wasn’t that quick? I’ve already virtually executed three bytes from the ROM, only 197 to go! It’s not quite as bad as it might sound since many instructions will repeat throughout the boot ROM, only with different parameters. For instance, LD SP,$fffe is very similar to the LD HL,$9fff instruction two lines below, or, a bit further down, to LD HL,$ff26. I could just implement a generic function that takes a 16-bit register as parameter and stores the next two bytes in memory in it.

In fact, if I remove all numerical parameters from the boot ROM’s assembly code, it turns out only 49 different instructions are used. This means I could technically run the whole boot sequence by only implementing the remaining 48 CPU instructions and never have to look at the rest of the instruction set!

ADD (HL)         INC B               LD A,($FF00+d8)     LD (d16),A     LD L,d8
BIT 7,H          INC C               LD A,B              LD D,A         LD SP,d16
CALL address     INC DE              LD A,d8             LD D,d8        POP BC
CP d8            INC H               LD A,(DE)           LD DE,d16      PUSH BC
CP (HL)          INC HL              LD A,E              LD E,d8        RET
DEC A            JR address          LD A,H              LD H,A         RLA
DEC B            JR NZ,address       LD A,L              LD (HL),A      RL C
DEC C            JR Z,address        LD B,d8             LD (HL+),A     SUB B
DEC D            LD ($FF00+C),A      LD C,A              LD (HL-),A     XOR A
DEC E            LD ($FF00+d8),A     LD C,d8             LD HL,d16

The list above is similar to the way those opcodes are listed in the instruction set tables mentioned earlier. Here d8 represents an 8-bit parameter value read from memory just after the opcode, d16 a 16-bit parameter or address and r8 a signed 8-bit number to add to or subtract from PC (for relative jumps). You will also notice that many of these distinct opcodes actually have a very similar purpose. All these LD register,value instructions could definitely be described by even more generic functions. Let’s group together all instructions working on single registers, those working on double registers, those working with addresses, etc. Now we’re down to 25 generic instructions, that sounds totally manageable!

ADD (address)      DEC r               LD A,(address)      LD rr,d16         RET
BIT n,r            INC r               LD r,r              LD (HL+),A        RLA
CALL address       INC rr              LD r,d8             LD (HL-),A        RL r
CP d8              JR cond,address     LD r,(address)      POP rr            SUB r
CP (HL)            LD (address),A      LD (address),r      PUSH rr           XOR r

Incidentally, that is pretty much how the CPU instruction set is organized in the Pan Docs, and it did influence my implementation. Now, the code excerpt I wrote roughly shows how to implement LD SP,d16 but to truly execute it, we should really read the instruction directly from the ROM, so we need to implement some kind of memory for that…

The simplest memory model

Memory is conceptually simpler than a CPU: we can see it as an arbitrarily long list of numerical values that we can access in any order we want. In the vast majority of languages I can think of, this is easily represented by an array. Of course, we’ll want something better than that: the Game Boy has several distinct memory zones (work RAM, video RAM, boot ROM, cartridge ROM…) some of which have a direct effect on the display, sound and other bits of the hardware. Right now, though, we’ll make do with a mere byte array.

Actually, if you remember the criteria for the basic program I’m starting with, we don’t even need any kind of RAM yet: the three first CPU instructions only have an effect on the CPU itself. We can start with just the boot ROM as a 256-byte long array, point the CPU at element zero of that array and run our loop.

Reading a file to an array3In Go it’s technically a slice but I’ll stick to calling it an array for simplicity’s sake. is a one-liner4Okay really a four-liner because you should always check for errors.. But let’s also properly rewrite that code snippet from earlier into something that’s got a better chance of compiling:

// Imports and CPU structure defined further above.

func main() {
        memory, err := ioutil.ReadFile("./dmg-rom.bin")
        if err != nil {

        cpu := CPU{}
        for {
                opcode := memory[cpu.PC]

                switch opcode {
                // ... other opcodes

                case 0x31: // LD SP,d16

                        // Here, Go requires some type conversion to fit two
                        // uint8 into one uint16 and also to avoid grossly
                        // overflowing an uint8 by shifting it left 8 bits,
                        // which would just make it zero and yield 0x00fe
                        // instead of 0xfffe. That was a fun crash to debug!
                        cpu.SP = uint16(memory[cpu.PC]) | uint16(memory[cpu.PC+1])<<8
                        cpu.PC += 2

                // ... other opcodes

Some implementation details (the package name, imports and the CPU structure) have been left out. The point is that the program reads the first byte in memory, gets ready to read the next one, figures out that the first byte is the opcode for LD SP,d16 and sets the SP register accordingly by building a 16-bit value from the next two bytes in memory. When it’s done, the CPU’s program counter is incremented to point to the next instruction right after the two bytes we just read.

Note that the Game Boy is little-endian: when reading a 16-bit value from memory, the least significant byte (the rightmost 8 bits) will come first (its memory address will be lower), and the most significant (the leftmost 8 bits) will come second (its address will be higher). In our case, we’re loading the value 0xfffe in the SP register, but in that memory array, it’s stored as two consecutive bytes: 0xfe 0xff.

This is why we set SP to the first byte, and then OR it with the second one shifted left by 8 bits (i.e. we add the second value that’s been multiplied by 256). In the end, representing each value with 16 bits instead of 8, the operation is SP = 0x00fe + 0xff00 = 0xfffe.Think of it as reading a number backwards and rebuilding it from its individual digits. 421 would be written 124 but you’d decode it by computing 1+(2×10)+(4×100).

This is already quite a bit of work just to set a variable’s value, so let’s wrap it up. The code above will blissfully run through the whole memory array, updating the value in register SP whenever it reads the byte 0x31 — provided there are at least two more bytes left after it in memory. But we won’t see any of it since the program will crash as soon as it gets to the end of the boot ROM and tries reading outside the array.

What I initially did was to exit the loop as soon as it encountered an opcode it didn’t know, print the CPU’s internal state and exit the program. This made for an encouraging metric in the beginning: each opcode I implemented got me further along the boot ROM code, which showed in the way the PC register contained a higher and higher value at each new successful run.

                // ... rest of the switch block above.

                // Quit the program when reading an unimplemented opcode from memory.
                        fmt.Printf("Unknown opcode: %#2x\n", opcode)
                        fmt.Printf("cpu.PC=0x%04x\n", cpu.PC)
                        fmt.Printf("cpu.SP=0x%04x\n", cpu.SP)

Let’s try and execute that tiny, contrived example now:

$ go run the-first-steps.go
Unknown opcode: 0xaf

It works! The emulator stopped at the second opcode, 0xaf, which is the XOR A on the second line in the boot code. It properly decoded and executed LD SP,$fffe and the program counter is already pointing at the next instruction in the fifth memory byte since XOR A is a single-byte opcode with no extra parameters. Now all that’s left is to implement XOR A5Which, considering it actually just means “set A to zero”, is no big feat., then run the program again and then move on to LD HL,$9fff which, as we noted, should look extremely similar to LD SP,$fffe and keep doing that until we hit an instruction that needs to access the RAM.

When we do, it’ll be time for a slightly more complicated memory model that I’ll cover in the next post. In the mean time, I consider the first milestone reached. This isn’t much at all, but the very core of the emulator is already there. We are — slowly, blindly, lovingly and painstakingly — booting up a Game Boy!


You can download the example program above and run it anywhere from the command line:

$ go run the-first-steps.go

It expects a dmg-rom.bin file to be present in the same folder. If you don’t feel like tracking down a boot ROM, you can create a usable stub with:

$ echo -n -e '\x31\xfe\xff\xaf\x21\xff\x9f\x32' > dmg-rom.bin

Thank you for reading!