Introduction

I created a Game Boy emulator in Ruby and released it as a gem called rubyboy! (I’d be happy if you could give it a star!)

rubyboy by sacckey

This Article

While explaining the implementation process of Ruby Boy, I’ll introduce the points where I got stuck and the techniques I devised. I’ll also introduce what I did to optimize Ruby Boy.

Why I Created a Game Boy Emulator

  • I wanted to do some personal development, but since web services incur maintenance costs, I wanted to create something that could be maintained for free
  • As I use Ruby for work, I had been wanting to create a Ruby gem for a while
  • Developing a game emulator has “clear goals & is fun when it works”, so it seemed like it would be easier to maintain motivation
    • In particular, I have a special attachment to the Game Boy

→ Let’s create a Game Boy emulator in Ruby and release it as a gem!

Emulator Overview

The following image is the architecture of the Game Boy:

"Game Boy / Color Architecture - A Practical Analysis" by Rodrigo Copetti, Published: February 21, 2019, Last Modified: January 9, 2024. Available at: https://www.copetti.org/writings/consoles/game-boy/. Licensed under Creative Commons Attribution 4.0 International License.

The goal is to implement a program that emulates this hardware. The class diagram of Ruby Boy and the roles of each class are as follows:

  • Console: Main class
  • Lcd: Handles screen rendering
  • Bus: Controller for implementing memory-mapped I/O. Mediates the reading and writing of configuration values from the CPU to various hardware
  • Cpu: Reads instructions from ROM, interprets and executes them
  • Registers: Performs reading and writing of registers
  • Cartridge: Performs reading and writing of ROM and RAM in the cartridge. Implementation differs for each type of MBC chip (explained later)
  • Apu: Generates audio data
  • Rom: Loads the game program from the cartridge
  • Ram: Performs reading and writing of RAM data in the cartridge and Game Boy
  • Interrupt: Manages interrupts. Interrupts are performed from the following three classes
  • Timer: Counts the number of cycles
  • Ppu: Generates pixel information to be rendered on the display
  • Joypad: Receives button inputs from the Game Boy

In Ruby Boy, synchronization between components is achieved by having the CPU execute instructions and advancing the cycle count of Ppu, Timer, and Apu by the number of cycles taken for execution. Therefore, the contents of the main loop are as follows:

cycles = @cpu.exec
@timer.step(cycles)
@apu.step(cycles)
if @ppu.step(cycles)
  @lcd.draw(@ppu.buffer)
  key_input_check
  throw :exit_loop if @lcd.window_should_close?
end

Implementation Process

I have compiled my implementation notes in a scrap. I will explain by extracting from this.

UI Implementation

The UI part that handles screen rendering, audio playback, and keyboard input was implemented using SDL2 via the Ruby-FFI gem. I created a wrapper class that aggregates the necessary SDL2 methods, and the design is to call SDL2 methods from there.

ROM Loading

First, I made it possible to load and use the game data. For example, the title is stored at 0x0134~0x0143, so it can be retrieved as follows:

data = File.open('tobu.gb', 'r') { _1.read.bytes }
p data[0x134..0x143].pack('C*').strip
=> "TOBU"

Implementation of MBC (Memory Bank Controller)

Many Game Boy games use MBC (Memory Bank Controller), which achieves address space expansion through bank switching. There are different types of MBC chips such as MBC1, MBC3, MBC5, etc., each with different sizes of usable ROM and RAM, so implementation specific to the type of MBC chip is necessary. Ruby Boy supports NoMBC (without MBC) and MBC1 games, and I used the Factory pattern to return the appropriate MBC implementation. By adding chip implementations and case handling, it’s possible to support other types of MBC chips as well.

module Rubyboy
  module Cartridge
    class Factory
      def self.create(rom, ram)
        case rom.cartridge_type
        when 0x00
          Nombc.new(rom)
        when 0x01..0x03
          Mbc1.new(rom, ram)
        when 0x08..0x09
          Nombc.new(rom)
        else
          raise "Unsupported cartridge type: #{rom.cartridge_type}"
        end
      end
    end
  end
end

CPU Implementation

I implemented a program that repeats the following CPU execution cycle:

  • Fetch instruction from ROM
  • Decode the instruction
  • Execute the instruction

To maintain motivation, instead of implementing all CPU and PPU processes at once, I first aimed to run the following minimal test ROM:

gb-hello-world by dusterherz

For debugging, I used a Game Boy emulator called BGB. It’s useful because you can execute step by step while displaying register and memory information, allowing you to compare the behavior with your own CPU.

I checked the required CPU instructions using BGB and implemented them. Then, implementing the PPU’s bg rendering process will make the test pass.

Next, I implemented all instructions and interrupt handling, aiming to pass two CPU test ROMs: cpu_instrs and instr_timing.

Points Where I Got Stuck

  • When setting the value of f in pop af, I hadn’t set the lower 4 bits to 0000
    • The lower 4 bits of the f register are always 0000
  • The c flag calculation was incorrect for two CPU instructions (opcode=0xe8, 0xf8): ADD SP, e8 and LD HL, SP + e8.
    • It passed with cflag = (@sp & 0xff) + (byte & 0xff) > 0xff

By fixing these, it successfully passed.

…The rendering looks a bit off, but the CPU processing is OK.

PPU

I aimed to implement the remaining rendering processes and pass dmg-acid2, which is a test ROM for PPU.

The test will pass by implementing window and sprite rendering, interrupt handling, and DMA transfer. Care must be taken with the priority of sprite display.

Now that rendering is possible, implementing the Joypad will make games playable. The following is a video of running a game called Tobu Tobu Girl:

At this point, the game became operational! However, it’s extremely slow. From here on, I worked on optimization.

Optimization

I’ll introduce what I did to optimize Ruby Boy. These techniques are not limited to emulator implementation and are likely applicable for improving the performance of Ruby programs in general.

Execution environment:

  • PC: MacBook Pro (13-inch, 2018)
  • Processor: 2.3 GHz Quad-Core Intel Core i5
  • Memory: 16 GB 2133 MHz LPDDR3

Benchmarking

I measured the time it took to execute the first 1500 frames of Tobu Tobu Girl without audio and rendering, repeating the measurement three times. Since benchmarking will be done repeatedly, I recommend preparing a dedicated program and setting up a system where you can start benchmarking immediately with a command execution.

Profiler

I used the Stackprof gem.

stackprof by tmm1

It can be used simply by enclosing the area you want to measure in a block, and it’s recommended because it has low overhead.

Optimization Part 1

Enabling YJIT

By enabling YJIT, Ruby’s JIT compiler, the FPS improved. YJIT has become practical from Ruby 3.2, and can be enabled by adding the --yjit option at runtime.

Ruby: 3.2.2
YJIT: false
1: 36.740829 sec
2: 36.468515 sec
3: 36.177083 sec
FPS: 41.1385591742566

Ruby: 3.2.2
YJIT: true
1: 32.305559 sec
2: 32.094778 sec
3: 31.889601 sec
FPS: 46.73385499531633

FPS: 41.1385591742566 → 46.73385499531633

Avoid Creating a Hash for Sprites Every Time

According to Stackprof results, the render_sprites method is becoming a bottleneck.

==================================
  Mode: cpu(1000)
  Samples: 9081 (1.08% miss rate)
  GC: 4 (0.04%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      3727  (41.0%)        1920  (21.1%)     Rubyboy::Ppu#render_sprites
      1800  (19.8%)        1800  (19.8%)     Rubyboy::Operand#initialize
      1448  (15.9%)        1448  (15.9%)     Integer#zero?
      3346  (36.8%)        1296  (14.3%)     Enumerable#each_slice
       919  (10.1%)         919  (10.1%)     Integer#<<
       424   (4.7%)         424   (4.7%)     Integer#<=>
      3552  (39.1%)         294   (3.2%)     Array#each
...

Let’s investigate further to see which part within render_sprites is the bottleneck.

  code:
                                  |   220  |     def render_sprites
    3    (0.0%)                   |   221  |       return if @lcdc[LCDC[:sprite_enable]].zero?
                                  |   222  |
    2    (0.0%)                   |   223  |       sprite_height = @lcdc[LCDC[:sprite_size]].zero? ? 8 : 16
                                  |   224  |       sprites = []
                                  |   225  |       cnt = 0
 3346   (36.8%)                   |   226  |       @oam.each_slice(4).each do |sprite_attr|
                                  |   227  |         sprite = {
                                  |   228  |           y: (sprite_attr[0] - 16) % 256,
                                  |   229  |           x: (sprite_attr[1] - 8) % 256,
                                  |   230  |           tile_index: sprite_attr[2],
                                  |   231  |           flags: sprite_attr[3]
                                  |   232  |         }
                                  |   233  |         next if sprite[:y] > @ly || sprite[:y] + sprite_height <= @ly
                                  |   234  |
                                  |   235  |         sprites << sprite
                                  |   236  |         cnt += 1
   15    (0.2%) /    15   (0.2%)  |   237  |         break if cnt == 10
 1887   (20.8%) /  1887  (20.8%)  |   238  |       end
  386    (4.3%) /    12   (0.1%)  |   239  |       sprites = sprites.sort_by.with_index { |sprite, i| [-sprite[:x], -i] }
                                  |   240  |
...

Line 226’s block occupies a high percentage of execution time. Upon closer inspection, it creates a Hash called sprite, and then adds sprite to an array called sprites if it meets certain conditions. By modifying this to create sprite only when the conditions are met, the speed improved.

FPS: 46.73385499531633 → 49.2233733053377

In this way, I steadily continued to identify and fix bottlenecks.

Calculate tile_map_addr outside the loop

commit

FPS: 49.2233733053377 → 56.6580741129914

Calculate tile_index outside the loop

commit

FPS: 56.6580741129914 → 60.44140113483162

These are based on the basic principle “do outside the loop what can be done outside the loop”, but it’s critically important for emulators. Resolving these issues dramatically improved performance.

Ruby v3.2 -> v3.3

At this point, Ruby Boy achieved about 60 FPS without rendering, but hit a wall. There’s a trade-off between optimization and code readability. For example, abandoning the use of constants and directly writing mysterious integers would make it faster, but that’s not desirable.

While pondering this, Ruby 3.3.0 was released on 2023/12/25. Ruby 3.3’s YJIT was reported to be even faster, so I tried updating…

!?

It got incredibly fast!!!! Ruby 3.3 was faster than I imagined. Thank you so much 🙏 By the way, this comparison post was even reposted by Matz. I’m thrilled.

Reducing GC

Thanks to Ruby 3.3, performance improved significantly, but in exchange? GC occurrences increased dramatically.

rubyboy % stackprof stackprof-cpu-myapp.dump
==================================
  Mode: cpu(1000)
  Samples: 16405 (4.57% miss rate)
  GC: 5593 (34.09%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      3688  (22.5%)        3688  (22.5%)     (sweeping)
      2332  (14.2%)        2109  (12.9%)     Enumerable#flat_map
      2050  (12.5%)        2050  (12.5%)     Integer#<=>
      5593  (34.1%)        1679  (10.2%)     (garbage collection)
      1038   (6.3%)        1038   (6.3%)     Rubyboy::Ppu#to_signed_byte
      1004   (6.1%)        1004   (6.1%)     Rubyboy::SDL.RenderClear
       646   (3.9%)         646   (3.9%)     Rubyboy::Ppu#get_pixel
       437   (2.7%)         437   (2.7%)     Integer#>>
       701   (4.3%)         332   (2.0%)     Rubyboy::Ppu#render_sprites
      1354   (8.3%)         278   (1.7%)     Rubyboy::Lcd#draw
      3825  (23.3%)         257   (1.6%)     Rubyboy::Ppu#step
      1627   (9.9%)         255   (1.6%)     Rubyboy::Ppu#render_bg
       633   (3.9%)         247   (1.5%)     Enumerable#each_slice
       230   (1.4%)         230   (1.4%)     Rubyboy::Registers#read8
       226   (1.4%)         226   (1.4%)     (marking)
...

Also, in Pokemon Red, the performance from the title screen to the professor’s dialogue scene was still heavy, so resolving these issues became the next goal.

GC Profiler

I used HeapProfiler to detect GC occurrence locations.

heap-profiler by Shopify

This can also be easily used by enclosing the area you want to detect in a block, but be careful as it may stop returning detection results when running for a long time.

Execution results (partial)

rubyboy % heap-profiler tmp/report
Total allocated: 563.01 MB (4198804 objects)
Total retained: 10.13 kB (252 objects)

allocated memory by file
-----------------------------------
 454.17 MB  rubyboy/lib/rubyboy/cpu.rb
  93.18 MB  rubyboy/lib/rubyboy/ppu.rb
  10.06 MB  rubyboy/lib/rubyboy/apu.rb

allocated memory by class
-----------------------------------
 462.20 MB  Hash
  49.79 MB  Array
  14.61 MB  Enumerator

allocated objects by file
-----------------------------------
   2839605  rubyboy/lib/rubyboy/cpu.rb
   1105342  rubyboy/lib/rubyboy/ppu.rb
    251462  rubyboy/lib/rubyboy/apu.rb

allocated objects by class
-----------------------------------
   2888757  Hash
    416967  Array
    273888  <memo> (IMEMO)
    273888  <ifunc> (IMEMO)
    251442  Float

retained memory by file
-----------------------------------
   3.92 kB  rubyboy/lib/rubyboy/cpu.rb
   2.20 kB  rubyboy/lib/rubyboy/ppu.rb

retained objects by file
-----------------------------------
        98  rubyboy/lib/rubyboy/cpu.rb
        54  rubyboy/lib/rubyboy/ppu.rb
        24  rubyboy/lib/rubyboy.rb
        18  rubyboy/lib/rubyboy/lcd.rb
        18  rubyboy/lib/rubyboy/apu.rb

Looking at this, it is clear that creating a large number of Hashes within the Cpu class is the cause of GC occurrences.

Change instruction arguments from Hash to Symbol

commit

case opcode
-  when 0x01 then ld16({ type: :register16, value: :bc }, { type: :immediate16 }, cycles: 12)
+  when 0x01 then ld16(:bc, :immediate16, cycles: 12)

Avoid creating Hash when referencing flags

commit

- def flags
-   f_value = @registers.f
-   {
-     z: f_value[7] == 1,
-     n: f_value[6] == 1,
-     h: f_value[5] == 1,
-     c: f_value[4] == 1
-   }
- end

+ def flag_z
+   @registers.f[7] == 1
+ end

+ def flag_n
+   @registers.f[6] == 1
+ end

+ def flag_h
+   @registers.f[5] == 1
+ end

+ def flag_c
+   @registers.f[4] == 1
+ end

With these modifications, GC occurrences were reduced to 2.71%.

Optimization Part 2

Reducing Integer#<=>

At this point, the benchmark and Stackprof results are as follows: It’s important to note that this result was measured with rendering enabled and at the heaviest part of Pokemon Red, so comparison with previous results is not meaningful.

Ruby: 3.3.0
YJIT: true
1: 26.798767 sec
FPS: 55.97272441676141
==================================
  Mode: cpu(1000)
  Samples: 10430 (5.57% miss rate)
  GC: 283 (2.71%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      2275  (21.8%)        2275  (21.8%)     Integer#<=>
      1267  (12.1%)        1267  (12.1%)     Rubyboy::SDL.RenderClear
      1186  (11.4%)        1186  (11.4%)     Rubyboy::Ppu#to_signed_byte
      2366  (22.7%)         864   (8.3%)     Rubyboy::Ppu#render_bg
       784   (7.5%)         784   (7.5%)     Rubyboy::Ppu#get_pixel
      1773  (17.0%)         641   (6.1%)     Rubyboy::Ppu#render_window
       992   (9.5%)         415   (4.0%)     Rubyboy::Ppu#render_sprites
       334   (3.2%)         334   (3.2%)     Integer#>>
       852   (8.2%)         319   (3.1%)     Enumerable#each_slice
      5453  (52.3%)         311   (3.0%)     Rubyboy::Ppu#step
      4199  (40.3%)         213   (2.0%)     Integer#times
       188   (1.8%)         188   (1.8%)     Rubyboy::Timer#step
       187   (1.8%)         187   (1.8%)     (sweeping)
       142   (1.4%)         142   (1.4%)     Rubyboy::SDL.UpdateTexture
       129   (1.2%)         129   (1.2%)     Array#size
       426   (4.1%)         114   (1.1%)     Rubyboy::Ppu#get_color
       851   (8.2%)         109   (1.0%)     Array#each
       981   (9.4%)         105   (1.0%)     Rubyboy::Cpu#get_value
       283   (2.7%)          85   (0.8%)     (garbage collection)
...

While GC has been reduced, performance is still below 60 FPS, and Integer#<=> (number comparison) appears to be the bottleneck. Number comparisons occur frequently in address-based branching like the following:

def read_byte(addr)
  case addr
  when 0x0000..0x7fff
    @mbc.read_byte(addr)
  when 0x8000..0x9fff
    @ppu.read_byte(addr)
...

To eliminate these comparisons, I created an array called @read_methods in preprocessing, which contains the correspondence between addresses and processes. This allows for calling with just array reference during execution.

def set_methods
  0x10000.times do |addr|
    case addr
    when 0x0000..0x7fff
      @read_methods[addr] = -> { @mbc.read_byte(addr) }
    when 0x8000..0x9fff
      @read_methods[addr] = -> { @ppu.read_byte(addr) }
...

This technique was inspired by Optcarrot, a NES emulator written in Ruby.

Let’s run the benchmark and Stackprof again.

rubyboy % RUBYOPT=--yjit bundle exec rubyboy bench
Ruby: 3.3.0
YJIT: true
1: 21.75409 sec
FPS: 68.95255099156066
rubyboy % bundle exec stackprof stackprof-cpu-myapp.dump
==================================
  Mode: cpu(1000)
  Samples: 9505 (6.87% miss rate)
  GC: 325 (3.42%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      1238  (13.0%)        1238  (13.0%)     Rubyboy::Ppu#to_signed_byte
      1208  (12.7%)        1208  (12.7%)     Rubyboy::SDL.RenderClear
      2558  (26.9%)         907   (9.5%)     Rubyboy::Ppu#render_bg
       865   (9.1%)         865   (9.1%)     Rubyboy::Ppu#get_pixel
       849   (8.9%)         849   (8.9%)     Rubyboy::Cartridge::Mbc1#set_methods
      1803  (19.0%)         663   (7.0%)     Rubyboy::Ppu#render_window
      1053  (11.1%)         460   (4.8%)     Rubyboy::Ppu#render_sprites
      5782  (60.8%)         346   (3.6%)     Rubyboy::Ppu#step
       906   (9.5%)         343   (3.6%)     Enumerable#each_slice
       313   (3.3%)         313   (3.3%)     Integer#>>
      4412  (46.4%)         245   (2.6%)     Integer#times
       237   (2.5%)         237   (2.5%)     (sweeping)
       197   (2.1%)         197   (2.1%)     Rubyboy::Timer#step
       193   (2.0%)         193   (2.0%)     Rubyboy::SDL.UpdateTexture
      1141  (12.0%)         162   (1.7%)     Rubyboy::Bus#set_methods
       433   (4.6%)         134   (1.4%)     Rubyboy::Ppu#get_color
       114   (1.2%)         114   (1.2%)     Array#size
       478   (5.0%)         109   (1.1%)     Rubyboy::Cpu#get_value
       918   (9.7%)          99   (1.0%)     Array#each
        75   (0.8%)          75   (0.8%)     Rubyboy::Cpu#increment_pc_by_byte
      9180  (96.6%)          68   (0.7%)     Rubyboy::Console#bench
       325   (3.4%)          65   (0.7%)     (garbage collection)
        49   (0.5%)          49   (0.5%)     Integer#<=>
...

The FPS improved from 55.97272441676141 to 68.95255099156066, and the proportion of Integer#<=> was reduced from 21.8% to 0.5%. There’s still room for further optimization, but having achieved the goal, I’m considering this complete for now.

Optimization Results

Before(rubyboy v1.0.0, Ruby 3.2.2)After(rubyboy v1.3.1, Ruby 3.3.0 + YJIT)

Conclusion

Positive Aspects

Emulator Development is Fun

As initially planned, I was able to implement while having fun. While it’s enjoyable when an emulator runs, I think a major factor is the abundance of documentation and test ROMs. Especially, since the test ROMs provide feedback on incorrect parts, I was able to progress without getting stuck and could refactor easily. Also, I’m happy that I could run the cartridges that were lying dormant at my parents’ home.

Published a Ruby Gem

Having used Ruby for a while, I’m happy to finally publish a working gem. https://rubygems.org/gems/rubyboy

Install it now with gem install rubyboy!

Learned About Low-Level Technology

Through implementing programs that mimic CPU, memory, registers, RAM, etc., I was able to deepen my knowledge about their roles and operations. It was enjoyable to see things I knew theoretically actually come up, leading to “so that’s what it meant” discoveries. How about using this for experimental subjects in universities or technical colleges?

Gained Experience in Program Optimization

I experienced the steady process of “creating a benchmark program, running a profiler, and fixing suspicious areas” for optimization. I believe my resolution for optimization has improved and my sense for detecting bottlenecks has sharpened.

Experienced Designing and Implementing a Larger Program

I hadn’t written many large programs outside of web programming, so it was good to gain that experience. With emulators, I felt it was important to appropriately distribute responsibilities among classes and write generalized programs to increase reusability (especially for the CPU).

Impressions of Ruby

  • Happy with the simple syntax!
  • Delighted by the many thoughtful methods!
  • Pleased with the abundance of useful gems!
  • Frustrated by the slow processing speed!
    • Glad that it’s significantly faster than before thanks to YJIT’s evolution!

Future Plans

I’m planning to work on the following:

  • Fixing rendering bugs
  • Adding more MBC types
  • Supporting Game Boy Color
  • WebAssembly support
  • Improving the benchmark system
    • Want to make it usable as a benchmark program for Ruby

References

Self-Made Blog Posts

These are articles about creating Game Boy emulators. I referred to them for implementation approaches, techniques, and potential pitfalls 🙏

Presentation Slides

Documentation

  • Pan Docs
    • A page that covers Game Boy specifications comprehensively, used like a dictionary.
  • Game Boy CPU (SM83) instruction set
    • CPU specification table. It displays a list of each instruction’s content, opcode, cycle count, and updated flags, and also provides JSON. I implemented CPU instructions while referring to this.
  • Rustで作るGAME BOYエミュレータ:低レイヤ技術部
    • A book about implementing a Game Boy emulator in Rust. It sets goals for each chapter, allowing for step-by-step progress. The explanations for each chapter are quite detailed, and it’s recommended even if you’re implementing in a language other than Rust. It was especially helpful for implementing PPU and APU.