Elijah Miller’s NEC v30 on a Pi hat

v30 on a board

While talking about homebrew 8080 and 8086 systems on Discord, an eBay search brought me to Elijah’s store page, where this little curiosity was up for sale. It’s literally just an NEC v30 on a Raspberry Pi hat, for a mere $15 USD! Interestingly enough, the v30 can operate at 3.3v, meaning no special hardware is required to interface it to the GPIO bus on a Pi. This reminds me so much of the CP/M cartridge for the Commodore 64, and with the price being so right I quickly ordered one and eagerly awaited the 2 weeks of shipping to Asia.

While I have Pi 4’s that I run Windows 10 on to drive some displays & PowerPoint, I wanted to use the slightly faster Pi400 for this. The Pi400 has a compatible GPIO expansion port, so just like a cartridge it’s a simple matter of slotting the card, powering up and building the software. While there is an included binary, it’s a 32-bit one, and I’m running Manjaro on the Pi400 for a similar look/feel to the PineBook Pro. Anyway, the dependencies are SDL2 and an oddly named ‘wiringPi’ library that allows C programs to interface with the GPIO.

You can download the emulator over on homebrew8088, specifically the Raspberry Pi Second Project. The last ‘ver 2’ download has the project configured for a v30 which is an 8086 analogue, unlike the v20 which is an 8088. When physically interfacing to the processor things like this really matter!

With the emulator built it was pretty simple to fire it up, and boot into MS-DOS:

first boot!

I have to admit I was a little startled at first, as I really had no idea if this was going to work at all. I’d spoken to an engineer friend, and he said that plugging a CPU directly into the GPIO bus and toggling connections to emulate the rest of the board was crazy, and that without any electrical buffers it’d most likely fry the processor and maybe the Pi as well. I suspect the low voltage may be sparing both, although I have no EE background so I’m not going to pretend to know.

Loading up Norton SI confirms what Elijah had posted on eBay: it runs very slowly, at about 1/3rd the speed of an XT. Now I may not know anything about hardware, but this seemed like something a profiler could at least shed some light on, and a chance to see if someone like me, helicoptering in on the shoulders of giants, could spot something.

gcc -I/usr/include/SDL2 -pg -O2 *.cpp -o pi -lSDL2 -lwiringPi -lpthread -lstdc++

This will build a profiled version of the emulator that records how many times each function is called and how much time is spent in it. Not knowing anything about this code, but having profiled other emulators, the usual pattern is that you spend most of the time fetching and possibly translating memory, both in feeding instructions and in pushing/popping data from the stack and pointers. Waiting is usually for initialisation and for IO.

Once you’ve run your profiled executable, it’ll dump a binary file, gmon.out, which gprof can then format into a text report like this:

gprof pi gmon.out > report.txt

Looking at the report, you can see where the most time goes and which functions are called the most. Some things just take a while to complete, and others, well, they get called far too often.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 39.91      0.71     0.71   286883     0.00     0.00  Print_Char_9x16(SDL_Renderer*, int, int, unsigned char)
 16.30      1.00     0.29        1     0.29     1.02  Start_System_Bus(int)
 12.37      1.22     0.22  1100374     0.00     0.00  Data_Bus_Direction_8086_OUT()
  7.87      1.36     0.14  5954106     0.00     0.00  CLK()

As expected, Start_System_Bus takes about 1 second in total (0.29 seconds of it in the function itself), followed by 1,100,374 calls to Data_Bus_Direction_8086_OUT (no doubt the Pi needs to alternate between reading from and writing to the CPU), and 5,954,106 ticks of the CLK function. Of course the real culprit is Print_Char_9x16, which was called 286,883 times and is responsible for nearly 40% of the runtime!

Obviously, for a simple MS-DOS boot the screen code should not be calling Print_Char_9x16 anywhere near this many times. Clearly something is amiss. Not knowing anything, I added a simple counter at the top of the Print_Char_9x16 function to let it execute only 1 in every 1,000 calls, and I got this:

Obviously it’s not right, which means that the culprit really isn’t Print_Char_9x16 but rather whatever is calling it. It was a simple change to each of the Mode functions to only render a fraction of the time, and I made the interval a define so I could tune how often it fires. This is a simple diff, assuming WordPress doesn’t screw it up. It’s not pretty but it gets the job done.

$ diff -ruN ver2/vga.cpp ver2-j/vga.cpp 
--- ver2/vga.cpp	2020-07-29 10:36:51.000000000 +0800
+++ ver2-j/vga.cpp	2021-06-04 01:51:33.546124473 +0800
@@ -1,5 +1,9 @@
 #include "vga.h"
 
+static int do9x16 = 0;
+#define VIDU 5000
+
+
 void Print_Char_18x16(SDL_Renderer *Renderer, int x, int y, unsigned char Ascii_value)
 {
 	for (int i = 0; i < 9; i++)
@@ -23,6 +27,12 @@
 
 void Mode_0_40x25(SDL_Renderer *Renderer, char* Video_Memory, char* Cursor_Position)
 {
+do9x16++;
+if(do9x16>VIDU)
+        {do9x16=0;}
+else
+        {return;}
+
 	int index = 0; 
 	for (int j = 0; j < 25; j++)
 	{
@@ -36,6 +46,7 @@
 	Print_Char_18x16(Renderer, (Cursor_Position[0] * 18), (Cursor_Position[1] * 16), 0xDB);
 	SDL_RenderPresent(Renderer);	
 }
+
 void Print_Char_9x16(SDL_Renderer *Renderer, int x, int y, unsigned char Ascii_value)
 {
 	for (int i = 0; i < 9; i++)
@@ -57,6 +68,12 @@
 }
 void Mode_2_80x25(SDL_Renderer *Renderer, char* Video_Memory, char* Cursor_Position)
 {
+do9x16++;
+if(do9x16>VIDU)
+        {do9x16=0;}
+else
+        {return;}
+
 	int index = 0; 
 	for (int j = 0; j < 25; j++)
 	{
@@ -102,6 +119,12 @@
 
 void Graphics_Mode_320_200_Palette_0(SDL_Renderer *Renderer, char* Video_Memory)
 {
+do9x16++;
+if(do9x16>VIDU)
+        {do9x16=0;}
+else
+        {return;}
+
 	SDL_RenderClear(Renderer);
 			int index = 0; 				
 			for (int j = 0; j < 100; j++)
@@ -156,6 +179,12 @@
 }
 void Graphics_Mode_320_200_Palette_1(SDL_Renderer *Renderer, char* Video_Memory)
 {
+do9x16++;
+if(do9x16>VIDU)
+        {do9x16=0;}
+else
+        {return;}
+
 	SDL_RenderClear(Renderer);
 			int index = 0; 
 			for (int j = 0; j < 100; j++)
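
The throttling added to each Mode function in the diff can be pulled out on its own. This is only a minimal sketch of the idea, not code from the emulator: `Throttle`, `should_run` and `runs_allowed` are hypothetical names made up for illustration, and the sketch fires on every N-th call rather than reproducing the exact `do9x16 > VIDU` comparison.

```cpp
#include <cassert>

// Hypothetical helper: lets one call in every `period` through.
struct Throttle {
    int period;     // run once per this many calls
    int count = 0;  // calls seen since the last run

    explicit Throttle(int p) : period(p) { assert(p > 0); }

    // Returns true on every `period`-th call, false otherwise.
    bool should_run() {
        if (++count >= period) {
            count = 0;
            return true;
        }
        return false;
    }
};

// Counts how many of `calls` invocations a throttle would let through.
int runs_allowed(int calls, int period) {
    Throttle t(period);
    int runs = 0;
    for (int i = 0; i < calls; ++i) {
        if (t.should_run()) ++runs;
    }
    return runs;
}
```

A render function would start with something like `if (!throttle.should_run()) return;`. Keeping one counter per function, instead of sharing a single global between all the modes, would probably be the tidier choice.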

While it feels more responsive on the console, it’s still incredibly slow. SI was returning the same speed, which means that although we aren’t hitting the screen anywhere near as often, it’s still doing far too much. Is it really a GPIO bus limitation? Again, I have no idea. But the next function on the list is, of course, the clock.

First I tried dividing the usleep in half, thinking that maybe it wasn’t getting called often enough. Running SI revealed that I’d gone from a 0.3 to a 0.1! Obviously this is not the desired effect! So instead of dividing, I multiplied it by four:

diff -ruN ver2/timer.cpp ver2-j/timer.cpp 
--- ver2/timer.cpp	2020-08-12 00:32:13.000000000 +0800
+++ ver2-j/timer.cpp	2021-06-04 02:06:25.505904407 +0800
@@ -7,7 +7,7 @@
 {
    while(Stop_Flag != true)
    {
-      usleep(54926); 
+      usleep(54926*4); 
       IRQ0();
    }
 }
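
For what it’s worth, the original 54926 does not look arbitrary. Here is a rough sketch of the arithmetic, assuming the standard PC timer values (a 1.193182 MHz input clock and a divisor of 65536, neither of which I have checked against the emulator source) and assuming that SI measures elapsed time by counting these ticks. The function names are made up for illustration.

```cpp
#include <cassert>

// Assumed PC constant, not taken from the emulator source:
constexpr double kPitInputHz = 1193182.0;  // timer chip input clock in Hz

// Period of one IRQ0 tick in microseconds for a given timer divisor.
double tick_period_us(double divisor) {
    assert(divisor > 0.0);
    return divisor / kPitInputHz * 1e6;
}

// If a benchmark counts ticks to measure elapsed time, stretching each
// tick by `stretch` makes the same work appear to finish in 1/stretch of
// the time, so the reported speed scales by roughly `stretch`.
double apparent_speed(double true_speed, double stretch) {
    assert(stretch > 0.0);
    return true_speed * stretch;
}
```

With a divisor of 65536 the period works out to roughly 54925.4 microseconds, which matches the 54926 in the original code. Under the tick-counting assumption, a 0.3 score with the sleep multiplied by four would read as about 1.2, and with the sleep halved as about 0.15, which is at least the same direction as what SI reported.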

Now re-running SI I get this:

Norton SI with clock multiplied by four

Now it’s scoring a 1.5! Obviously these are all ‘magic numbers’ tied to the Pi400, and more importantly I haven’t studied the code at all. I’m not trying to disparage anything; if anything, it’s just a quick example of why profiling your code can be so important! At the same time, trying to run games is so incredibly slow that I don’t even know if my changes had any actual impact on speed, as emulation of benchmarks can be such a finicky thing.

My go-to game, Battletech 3025 Crescent Hawks Inception, loads to the first splash but then seems to hang. I could be impatient or there could be further issues, but I’m just some impatient tourist with a C compiler…

With my changes and re-running the profiler I now see this:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 95.41    129.23   129.23 22696621     5.69     5.69  Read_Memory_Array(unsigned long long, char*, int)
  2.90    133.15     3.92                             Start_System_Bus(int)
  0.88    134.34     1.19 64369074     0.02     0.02  CLK()
  0.30    134.74     0.40                             keyboard()
  0.16    134.96     0.22   412873     0.53     0.53  Print_Char_9x16(SDL_Renderer*, int, int, unsigned char)
  0.08    135.07     0.11 11273939     0.01     0.01  Data_Bus_Direction_8086_OUT()

Which is now what I expect, with the bulk of the emulation time spent in Read_Memory_Array, followed by the clock and of course our tamed screen renderer (although it’s still called far too much!), with Data_Bus_Direction_8086_OUT further down the list. No doubt some double buffering and checking what changed in between calls would go a LONG way to optimise it, just as actually studying the source code would.
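
To give an idea of what “checking what changed in between calls” could look like, here is a minimal sketch that keeps a shadow copy of the text screen and reports only the cells that differ. It is not taken from the emulator: `DirtyTracker` and `changed_cells` are hypothetical names, and the 80x25 layout with 2 bytes per cell (character plus attribute) is an assumption about how the video memory is laid out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumed text-mode layout: 80x25 cells, 2 bytes each (char + attribute).
constexpr std::size_t kCols = 80;
constexpr std::size_t kRows = 25;
constexpr std::size_t kBytesPerCell = 2;
constexpr std::size_t kScreenBytes = kCols * kRows * kBytesPerCell;

// Keeps a shadow copy of the last frame and reports which cells changed.
class DirtyTracker {
public:
    DirtyTracker() : shadow_(kScreenBytes, 0) {}

    // Compares `video_memory` (kScreenBytes long) with the shadow copy,
    // returns the indices of the changed cells, and updates the shadow.
    std::vector<std::size_t> changed_cells(const unsigned char* video_memory) {
        assert(video_memory != nullptr);
        std::vector<std::size_t> dirty;
        for (std::size_t cell = 0; cell < kCols * kRows; ++cell) {
            const std::size_t off = cell * kBytesPerCell;
            if (std::memcmp(&shadow_[off], video_memory + off, kBytesPerCell) != 0) {
                dirty.push_back(cell);
                std::memcpy(&shadow_[off], video_memory + off, kBytesPerCell);
            }
        }
        return dirty;
    }

private:
    std::vector<unsigned char> shadow_;
};
```

A renderer built on this would call Print_Char_9x16 only for the returned cells, at column `cell % 80` and row `cell / 80`, and could skip SDL_RenderPresent entirely when the list is empty. The cursor would still need separate handling.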

The one cool thing about this is that if I wanted to write a PC emulator, building it this way gives me the confidence that the CPU is not only 100% cycle accurate but also 100% bug-for-bug accurate, since we are using a physical processor.

And again for $15 USD + Shipping I cannot recommend this enough!

3 thoughts on “Elijah Miller’s NEC v30 on a Pi hat”

  1. The AMD Hypertransport bus had provisions to connect CPUs of different architectures to the same bus, with full interaction between them in a master/slave configuration (boot CPU and its “coprocessor”). It was said to give PCs the same powers that the Amiga had years back with its Zorro bus, compatible with “personality” CPU cards. Unfortunately this never got any mainstream board implementation, besides some rare Xilinx networking accelerators sold in 200x. It would have been nice to team it with virtualization extensions to have “accelerated” VM sessions of any architecture, as long as there was a “personality card” available for it. Imagine having, in the same x86_64 PC, “accelerated” VM sessions booting ARM, RiscV, SPARC, PPC (which was the main objective back then, in order to emulate macOS PPC)…

    • I remember something about some universal hotplug system. In some way this is what PReP/CHRP was: a generic PC with a different processor. Where it fell flat was in being tied to POWER instead of allowing different processors, let alone something like the PowerPC 615.

  2. Obviously reducing the timer tick frequency will make the computer look faster for SI because SI uses this timer as its clock. Given that the timer period on an XT was 1.19318MHz / 65536, I think usleep(33927) should be a more accurate emulation. However this doesn’t account for extra delays introduced by thread scheduling etc.
