content/blog/2017/01/chip8-disassembly.markdown @ 40accabe3a95

Update
author Steve Losh <steve@stevelosh.com>
date Mon, 18 Mar 2024 15:47:07 -0400
parents f5556130bda1
children (none)
(
:title "CHIP-8 in Common Lisp: Disassembly"
:snip "What's in a ROM?"
:date "2017-01-02T17:15:00Z"
:draft nil

)

In the previous posts we looked at how to emulate a [CHIP-8][] CPU with Common
Lisp.  After adding a screen, input, and sound the core of the emulator is
essentially complete.

I've been guiding you through the code step by step and it might look simple,
but that's only because I went down all the dead ends myself first.  In
practice, when you're writing an emulator for a system you'll need a way to
debug the execution of code.  The first step is to be able to *read* the code,
so let's look at how to add a disassembler to our simple CHIP-8 emulator.

The full series of posts so far:

1. [CHIP-8 in Common Lisp: The CPU](http://stevelosh.com/blog/2016/12/chip8-cpu/)
2. [CHIP-8 in Common Lisp: Graphics](http://stevelosh.com/blog/2016/12/chip8-graphics/)
3. [CHIP-8 in Common Lisp: Input](http://stevelosh.com/blog/2016/12/chip8-input/)
4. [CHIP-8 in Common Lisp: Sound](http://stevelosh.com/blog/2016/12/chip8-sound/)
5. [CHIP-8 in Common Lisp: Disassembly](http://stevelosh.com/blog/2017/01/chip8-disassembly/)
6. [CHIP-8 in Common Lisp: Debugging Infrastructure](http://stevelosh.com/blog/2017/01/chip8-debugging-infrastructure/)
7. [CHIP-8 in Common Lisp: Menus](http://stevelosh.com/blog/2017/01/chip8-menus/)

The full emulator source is on [BitBucket][] and [GitHub][].

[CHIP-8]: https://en.wikipedia.org/wiki/CHIP-8
[BitBucket]: https://bitbucket.org/sjl/cl-chip8
[GitHub]: https://github.com/sjl/cl-chip8

<div id="toc"></div>

## Disassembling Single Instructions

The first thing we'll need is a way to take a single instruction like `#x8055`
and turn it into something we can read.  The easiest way to do this seemed to be
to copy the dispatch loop from the CPU emulator and turn it into a disassembly
function:

```lisp
(defun disassemble-instruction (instruction)
  (flet ((v (n) (symb 'v (format nil "~X" n))))
    (let ((_x__ (ldb (byte 4 8) instruction))
          (__x_ (ldb (byte 4 4) instruction))
          (___x (ldb (byte 4 0) instruction))
          (__xx (ldb (byte 8 0) instruction))
          (_xxx (ldb (byte 12 0) instruction)))
      (case (logand #xF000 instruction)
        (#x0000 (case instruction
                  (#x00E0 '(cls))
                  (#x00EE '(ret))))
        (#x1000 `(jp ,_xxx))
        (#x2000 `(call ,_xxx))
        (#x3000 `(se ,(v _x__) ,__xx))
        (#x4000 `(sne ,(v _x__) ,__xx))
        (#x5000 (case (logand #x000F instruction)
                  (#x0 `(se ,(v _x__) ,(v __x_)))))
        (#x6000 `(ld ,(v _x__) ,__xx))
        (#x7000 `(add ,(v _x__) ,__xx))
        (#x8000 (case (logand #x000F instruction)
                  (#x0 `(ld ,(v _x__) ,(v __x_)))
                  (#x1 `(or ,(v _x__) ,(v __x_)))
                  (#x2 `(and ,(v _x__) ,(v __x_)))
                  (#x3 `(xor ,(v _x__) ,(v __x_)))
                  (#x4 `(add ,(v _x__) ,(v __x_)))
                  (#x5 `(sub ,(v _x__) ,(v __x_)))
                  (#x6 `(shr ,(v _x__) ,(v __x_)))
                  (#x7 `(subn ,(v _x__) ,(v __x_)))
                  (#xE `(shl ,(v _x__) ,(v __x_)))))
        (#x9000 (case (logand #x000F instruction)
                  (#x0 `(sne ,(v _x__) ,(v __x_)))))
        (#xA000 `(ld i ,_xxx))
        (#xB000 `(jp ,(v 0) ,_xxx))
        (#xC000 `(rnd ,(v _x__) ,__xx))
        (#xD000 `(drw ,(v _x__) ,(v __x_) ,___x))
        (#xE000 (case (logand #x00FF instruction)
                  (#x9E `(skp ,(v _x__)))
                  (#xA1 `(sknp ,(v _x__)))))
        (#xF000 (case (logand #x00FF instruction)
                  (#x07 `(ld ,(v _x__) dt))
                  (#x0A `(ld ,(v _x__) k))
                  (#x15 `(ld dt ,(v _x__)))
                  (#x18 `(ld st ,(v _x__)))
                  (#x1E `(add i ,(v _x__)))
                  (#x29 `(ld f ,(v _x__)))
                  (#x33 `(ld b ,(v _x__)))
                  (#x55 `(ld (mem i) ,_x__))
                  (#x65 `(ld ,_x__ (mem i)))))))))
```

There are a lot of other ways we could have done this, like making a proper
parser or adding functionality to `define-opcode`, but since there's not that
many instructions I think this is reasonable.  Now we can pass in a raw,
two-byte instruction and get out something readable:

```
[SBCL] CHIP8> (disassemble-instruction #x8055)
(SUB V0 V5)

[SBCL] CHIP8> (disassemble-instruction #x4077)
(SNE V0 119)
```

## Disassembling Entire ROMs

Disassembling a single instruction will be useful, but it would also be nice to
disassemble an entire ROM at once to see what its code looks like.  Let's make
a little helper function to handle that:

```lisp
(defun dump-disassembly (array &optional (start 0) (end (length array)))
  (iterate
    (for i :from start :below end :by 2)
    (print-disassembled-instruction array i)
    (sleep 0.001)))
```

The `sleep` is there because Neovim's terminal seems to shit the bed if you dump
too much text at it at once.  Computers are garbage.

Other that than, `dump-disassembly` is pretty straightforward: just iterate
through the array of instructions two bytes at a time and print the information.
Let's look at the printing function now:

```lisp
(defun print-disassembled-instruction (array index)
  (destructuring-bind (address instruction disassembly)
      (instruction-information array index)
    (let ((*print-base* 16))
      (format t "~3,'0X: ~4,'0X ~24A~%"
              address
              instruction
              (or disassembly "")))))
```

Once again we'll delegate to a helper function.
`print-disassembled-instruction` just handles the string formatting to dump an
instruction to the screen.  Running it for a single instruction would print
something like:

<pre class="lineart">
    200: 8055 (SUB V0 V5)
     ^    ^   ^
     |    |   |
     |    |  Disassembly
     |    |
     |   Raw instruction
     |
    Address
</pre>

The helper function `instruction-information` is simple, but we'll be using it
in the future for something else, so it's nice to have:

```lisp
(defun instruction-information (array index)
  (let ((instruction (retrieve-instruction array index)))
    (list index
          instruction
          (disassemble-instruction instruction))))
```

It just takes an address and memory array and returns a list of:

* The address
* The raw instruction at the address
* The disassembly for that instruction

`retrieve-instruction` is simple (for now):

```lisp
(defun retrieve-instruction (array index)
  (cat-bytes (aref array index)
             (aref array (1+ index))))
```

These functions *could* be combined into a single bigger function, but I'm
a strong believer in having each function do exactly one thing.  And as we'll
see, each of these "simple" tasks is going to get more complicated later.

Now we can dump the disassembly for a ROM to see how it works:

```
(run "roms/ufo.rom") ; stores the current chip struct in *c*

(dump-disassembly (chip-memory *c*) #x200 #x220)
200: A2CD (LD I 2CD)
202: 6938 (LD V9 38)
204: 6A08 (LD VA 8)
206: D9A3 (DRW V9 VA 3)
208: A2D0 (LD I 2D0)
20A: 6B00 (LD VB 0)
20C: 6C03 (LD VC 3)
20E: DBC3 (DRW VB VC 3)
210: A2D6 (LD I 2D6)
212: 641D (LD V4 1D)
214: 651F (LD V5 1F)
216: D451 (DRW V4 V5 1)
218: 6700 (LD V7 0)
21A: 680F (LD V8 F)
21C: 22A2 (CALL 2A2)
21E: 22AC (CALL 2AC)
```

## Sprites

Take a look at `print-disassembled-instruction` again:

```lisp
(defun print-disassembled-instruction (array index)
  (destructuring-bind (address instruction disassembly)
      (instruction-information array index)
    (let ((*print-base* 16))
      (format t "~3,'0X: ~4,'0X ~24A~%"
              address
              instruction
              (or disassembly "")))))
```

Notice how we used `(or disassembly "")` instead of just passing in the
disassembly.  If you look back at `disassemble-instruction` you'll see it uses
normal `case` statements, not `ecase`, so if the instruction doesn't match any
valid opcodes it will return `nil`.

The CHIP-8 (like [most computers][neumann]) uses the same memory to hold both
program code (instructions) and data.  Data includes things like player
health, score, location, and most importantly: the sprites that will be drawn on
the screen.

[neumann]: https://en.wikipedia.org/wiki/Von_Neumann_architecture

Unfortunately there's no way to know for sure whether a given memory location
contains an instruction (and thus needs to be disassembled) or is intended to be
a piece of data.  Indeed, someone like [Mel][] could conceivably figure out
a way to make a particular sequence of instructions perform double duty as
a sprite!  So our disassembler will just always show the disassembly for
anything that *might* be an instruction.

[Mel]: http://www.catb.org/jargon/html/story-of-mel.html

But with that said, we can probably make some educated guesses.  If we had some
way to visualize what a hunk of memory would look like *if* it were rendered as
a sprite, we could probably figure out where most of the program's sprites are
kept.  It's unlikely that any given sequence of instructions would just *happen*
to look like a ghost from Pac Man or something.

We could add a separate function to draw out the sprite data, but the CHIP-8's
sprites are so simple that we can just tack it on to the disassembly output.
[Remember][cowgod-disp] that each byte of memory defines one eight-pixel-wide
row of a sprite, and that `DRW X, Y, Size` will draw `Size` rows of a sprite
using contiguous bytes in memory.  So if memory contains something like this (at
the location specified by the index register):

[cowgod-disp]: http://devernay.free.fr/hacks/chip8/C8TECH10.HTM#2.4

```
Address Data
#x300   #b11110000
#x301   #b00010000
#x302   #b11110000
#x303   #b00010000
#x304   #b11110000
```

A `DRW X, Y, 5` instruction would draw a `3` sprite to the screen:

<pre class="lineart">
    ████

    ████

    ████
</pre>

It would be trivial to simply render the bits of any given instruction as spaces
and some other ASCII character and tack it onto the end of the disassembly, but
there's a snag: instructions are *two* bytes each, but each row in a sprite is
*one* byte long.  Our sprites would get pretty mangled if we printed two of
their rows per line of disassembly — for example, `4` would look like this:

<pre class="lineart">                                       byte 1  byte 2
                                       1111111122222222
    064: 9090 (SNE V0 V9)              █  █    █  █
    066: F010                          ████    █
    068: 10F0 (JP F0)                     █    ████
</pre>

Not ideal.  One option would be to make every instruction of disassembly two
lines long, but that's painful to read when trying to look at the code portions
of the ROM.  We can get around this with a delightful little hack: using
characters from [Unicode Block Elements][block] to cram two rows of sprite data
into a single line of output.  Let's start by defining a `bit-diagram` function
that will take a two-byte-wide integer and return an ASCII diagram of its bits:

[block]: https://en.wikipedia.org/wiki/Block_Elements

```lisp
(defun bit-diagram (integer)
  (iterate (for high-bit :from 15 :downto 8)
           (for low-bit :from 7 :downto 0)
           (for hi = (logbitp high-bit integer))
           (for lo = (logbitp low-bit integer))
           (collect (cond ((and hi lo) #\full_block)
                          (hi          #\upper_half_block)
                          (lo          #\lower_half_block)
                          (t           #\space))
                    :result-type 'string)))
```

```lisp
; Example rows of sprite data:
; 11110000
; 11001100

(bit-diagram #b1111000011001100)
"██▀▀▄▄  "
```

Now that we've got this we can easily add it into our disassembly functions:

```lisp
(defun instruction-information (array index)
  (let ((instruction (retrieve-instruction array index)))
    (list index
          instruction
          (disassemble-instruction instruction)
          (bit-diagram instruction))))           ; NEW

(defun print-disassembled-instruction (array index)
  (destructuring-bind (address instruction disassembly bits)
      (instruction-information array index)
    (let ((*print-base* 16))
                                     ; NEW
      (format t "~3,'0X: ~4,'0X ~24A ~8A~%"
              address
              instruction
              (or disassembly "")
              bits))))              ; NEW
```

Now when we dump the disassembly of a ROM we'll also see what each instruction
would look like if drawn as a sprite.  For program code this will tend to look
like garbage (unless some crazy person has managed to write code that also works
as sprites):

<pre class="lineart" style="line-height: 1.0 !important;">
    200: A2CD (LD I 2CD)               █▄▀ ▄▄▀▄
    202: 6938 (LD V9 38)                ▀█▄█  ▀
    204: 6A08 (LD VA 8)                 ▀▀ █ ▀
    206: D9A3 (DRW V9 VA 3)            █▀▄▀▀ ▄█
    208: A2D0 (LD I 2D0)               █▄▀▄  ▀
    20A: 6B00 (LD VB 0)                 ▀▀ ▀ ▀▀
    20C: 6C03 (LD VC 3)                 ▀▀ ▀▀▄▄
    20E: DBC3 (DRW VB VC 3)            ██ ▀▀ ██
</pre>

But when we look at areas of the ROM that *do* contain sprites, they look pretty
recognizable:

<pre class="lineart" style="line-height: 1.0 !important;">
    050: F090                          █▀▀█
    052: 9090 (SNE V0 V9)              █  █
    054: F020                          ▀▀█▀
    056: 6020 (LD V0 20)                ▀█
    058: 2070 (CALL 70)                 ▄█▄
    05A: F010                          ▀▀▀█
    05C: F080                          █▀▀▀
    05E: F0F0                          ████
    060: 10F0 (JP F0)                  ▄▄▄█
    062: 10F0 (JP F0)                  ▄▄▄█
    064: 9090 (SNE V0 V9)              █  █
    066: F010                          ▀▀▀█
    068: 10F0 (JP F0)                  ▄▄▄█
    06A: 80F0 (LD V0 VF)               █▄▄▄
    06C: 10F0 (JP F0)                  ▄▄▄█
    06E: F080                          █▀▀▀
</pre>

Human eyes are pretty good at picking out patterns, so when you're scrolling
through a disassembled ROM it's pretty easy to tell which sections are sprites
and which are data, even if it's not perfectly rendered.

## Result

We've now got a way to dump the disassembly of a ROM to see what its code and
data look like.

We can also inspect the rest of our emulator's state at runtime with NREPL or
SLIME by running things like `(chip-program-counter *c*)`.

## Future

Manually querying for information and dumping the disassembly isn't very
ergonomic, so in the future we'll look at adding:

* Debugging infrastructure like pausing and breakpoints
* A graphical debugger/disassembly viewer

As well as a few other niceties like menus for loading ROMs, etc.