While talking about home brew 8080 and 8086 systems on Discord an ebay search brought me to Elijah’s store page where this small little curiosity was up for sale. It’s literally just a NEC v30 on a Raspberry Pi hat, for a mere $15 USD! Interestingly enough the v30 can operate at 3.3v meaning no special hardware is required to interface to the GPIO bus on a Pi. This reminds me so much of the CP/M cartridge for the Commodore 64, and the price being so right I quickly ordered one and eagerly awaited to 2 weeks shipping to Asia.
While I have Pi 4’s that I run Windows 10 on to drive some displays & power point, I wanted to use the slightly faster Pi400 for this. The Pi400 has a compatible GPIO expansion port so just like a cartridge it’s a simple matter of slotting the card, powering up and building the software. While there is an included binary, it’s a 32bit one, and I’m running Manjaro on the Pi400 for a similar look/feel as the PineBook Pro. Anyways the dependences are SDL2, and an odly named ‘wiringPi’ library that allows C programs to interface to the GPIO.
You can download the emulator over on homebrew8088, specifically the Raspberry Pi Second Project. The last ‘ver 2’ download has the project configured for a v30 which is an 8086 analogue, unlike the v20 which is an 8088. When physically interfacing to the processor things like this really matter!
With the emulator built it was pretty simple to fire it up, and boot into MS-DOS:
I have to admit I was a little startled at first as I really had no idea if this was going to work at all. I’d spoken to an engineer friend and he was saying plugging a CPU directly into the GPIO bus, and toggling connections to actually emulate the board was both crazy and that without any electrical buffers it’d most likely either fry the processor and maybe the Pi as well. I suspect this being low voltage may be sparing both, although I have no EE so I’m not going to pretend to know.
Loading up Norton SI confirms what Elijah had posted on Ebay is that it runs very slowly about 1/3rd the speed of an XT. Now I may not know anything about hardware but this seemed at least something a profiler could at least tell me what is going on, and if someone like me helicoptering in on the shoulder of giants could see something.
gcc -I/usr/include/SDL2 -pg -O2 *.cpp -o pi -lSDL2 -lwiringPi -lpthread -lstdc++
This will build a profiled version of the emulator that’ll let us know which functions are being called both the number of times, and how much time to do so. Not knowing anything but having profiled other emulators, the usual pattern is that you spend most time fetching and possibly translating memory; Both in feeding instructions and pushing/popping data from stack and pointers. Waiting is usually for initialisation and for IO.
Once you’ve run your profiled executable, it’ll dump a binary file gmon.out which you can then use gprof to format to a text file like this:
gprof pi gmon.out > report.txt
And then looking at the report you can see where the top time, along with top calls are. Some things just take a while to complete and other well they get called far too often.
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 39.91 0.71 0.71 286883 0.00 0.00 Print_Char_9x16(SDL_Render er*, int, int, unsigned char) 16.30 1.00 0.29 1 0.29 1.02 Start_System_Bus(int) 12.37 1.22 0.22 1100374 0.00 0.00 Data_Bus_Direction_8086_OUT() 7.87 1.36 0.14 5954106 0.00 0.00 CLK()
As expected Start_System_Bus takes 1 second, followed by 1,100,374 calls to set the Data_Bus_Direction_8086_OUT (no doubt the Pi needs to alternate between reading and writing to the CPU), followed by 5,954,106 ticks of the CLK function. Of course the real culprit is Print_Char_9x16 which was called 286,883 times, and is responsible for nearly 40% of the tuntime!
Obviously for a simple MS-DOS boot the screen should not be calling any print char anywhere near this many times. Clearly something is amiss. Not knowing anything I added a simple counter to block at the top of the Print_Char_9x16 function to let it only execute 1:1000 times, and I got this:
Obviously it’s not right, which means that the culprit really isn’t Print_Char_9x16 but rather what is calling it. It was a simple change to each of the Mode functions to only render a fraction of the time, and I changed it to a define to let me fire it more often. This is a simple diff, assuming WordPress doesn’t screw it up. It’s not pretty but it gets the job done.
$ diff -ruN ver2/vga.cpp ver2-j/vga.cpp
--- ver2/vga.cpp 2020-07-29 10:36:51.000000000 +0800
+++ ver2-j/vga.cpp 2021-06-04 01:51:33.546124473 +0800
@@ -1,5 +1,9 @@
#include "vga.h"
+static int do9x16 = 0;
+#define VIDU 5000
+
+
void Print_Char_18x16(SDL_Renderer *Renderer, int x, int y, unsigned char Ascii_value)
{
for (int i = 0; i < 9; i++)
@@ -23,6 +27,12 @@
void Mode_0_40x25(SDL_Renderer *Renderer, char* Video_Memory, char* Cursor_Position)
{
+do9x16++;
+if(do9x16>VIDU)
+ {do9x16=0;}
+else
+ {return;}
+
int index = 0;
for (int j = 0; j < 25; j++)
{
@@ -36,6 +46,7 @@
Print_Char_18x16(Renderer, (Cursor_Position[0] * 18), (Cursor_Position[1] * 16), 0xDB);
SDL_RenderPresent(Renderer);
}
+
void Print_Char_9x16(SDL_Renderer *Renderer, int x, int y, unsigned char Ascii_value)
{
for (int i = 0; i < 9; i++)
@@ -57,6 +68,12 @@
}
void Mode_2_80x25(SDL_Renderer *Renderer, char* Video_Memory, char* Cursor_Position)
{
+do9x16++;
+if(do9x16>VIDU)
+ {do9x16=0;}
+else
+ {return;}
+
int index = 0;
for (int j = 0; j < 25; j++)
{
@@ -102,6 +119,12 @@
void Graphics_Mode_320_200_Palette_0(SDL_Renderer *Renderer, char* Video_Memory)
{
+do9x16++;
+if(do9x16>VIDU)
+ {do9x16=0;}
+else
+ {return;}
+
SDL_RenderClear(Renderer);
int index = 0;
for (int j = 0; j < 100; j++)
@@ -156,6 +179,12 @@
}
void Graphics_Mode_320_200_Palette_1(SDL_Renderer *Renderer, char* Video_Memory)
{
+do9x16++;
+if(do9x16>VIDU)
+ {do9x16=0;}
+else
+ {return;}
+
SDL_RenderClear(Renderer);
int index = 0;
for (int j = 0; j < 100; j++)
While it feels more responsive on the console, it’s still incredibly slow. SI was returning the same speed which means that although we aren’t hitting the screen anywhere near as often it’s still doing far too much. Is it really a GPIO bus limitation? Again I have no idea. But the next function of course is the clock.
First I tried dividing the usleep in half thinking that maybe it’s not getting called enough. And running SI revealed that I’d gone from a 0.3 to a 0.1! Obviously this is not the desired effect! So instead of a divide I multiplied it by four:
diff -ruN ver2/timer.cpp ver2-j/timer.cpp
--- ver2/timer.cpp 2020-08-12 00:32:13.000000000 +0800
+++ ver2-j/timer.cpp 2021-06-04 02:06:25.505904407 +0800
@@ -7,7 +7,7 @@
{
while(Stop_Flag != true)
{
- usleep(54926);
+ usleep(54926*4);
IRQ0();
}
}
Now re-running SI I get this:
Now it’s scoring a 1.5! Obviously these are all ‘magic numbers’ and tied to the Pi400 and more importantly I haven’t studied the code at all, I’m not trying to disparage or anything, if anything it’s just a quick example why profiling your code can be so important! At the same time trying to run games is so incredibly slow I don’t even know if my changes had any actual impact to speed as emulation of benchmarks can be such a finickie thing.
My goto game, Battletech 3025 Crescent Hawks Inception loads to the first splash but then seems to hang. I could be impatient or there could be further issues but I’m just some impatient tourist with a C compiler…
With my changes and re-running the profiler I now see this:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
95.41 129.23 129.23 22696621 5.69 5.69 Read_Memory_Array(unsigned long long, char*, int)
2.90 133.15 3.92 Start_System_Bus(int)
0.88 134.34 1.19 64369074 0.02 0.02 CLK()
0.30 134.74 0.40 keyboard()
0.16 134.96 0.22 412873 0.53 0.53 Print_Char_9x16(SDL_Render
er*, int, int, unsigned char)
0.08 135.07 0.11 11273939 0.01 0.01 Data_Bus_Direction_8086_OUT()
Which is now what I expect with the bulk of the emulation now calling Read_Memory, with the Clock following that and of course our tamed screen renderer (although its still called far too much!) with the Data_Bus_Direction being further down the list. No doubt some double buffering and checking what changed in between calls would go a LONG way to optimise it, just as would actually studying the source code.
The one cool thing about this is that if I wanted to write a PC emulator this way gives me the confidence that the CPU is not only 100% cycle accurate, but it’s 100% bug for bug accurate since we are using a physical processor.
And again for $15 USD + Shipping I cannot recommend this enough!