Degrading qemu performance in DooM

Quick table… it’s late.

I’m using MS-DOS 5 and this benchmark suite loaded into a VMDK, and I ran a few tests to check performance numbers.

[Benchmark table: version | 3D Bench | Chris’s bench | DooM ticks | Quake demo]

I snagged 3.1.50 from

Better performance than v2, sure, but for interactive stuff… not so much.

So what is really going on here? Why is 0.90 so much faster when it comes to DooM, and how is it possible that it’s the slowest in raw CPU performance yet the fastest at I/O? The crux of the issue appears to be simply how it handles its I/O, heavily favoring device performance over CPU.

I’ll have to follow up with more builds and release notes to see what changed between releases, what exactly broke between GCC 3 and 4, and why the rip had to happen.

I still like 0.90, if anything for its ability to run NeXTSTEP and NetWare.

18 thoughts on “Degrading qemu performance in DooM”

  1. Interesting table – certainly this backs up my feelings that the raw TCG performance has improved substantially over the past few releases. Also iothread improves performance by not blocking the main CPU emulation loop whilst waiting for IO so something is amiss here.

    Can you confirm exactly when your performance regression appeared? From the table above it appears that 0.10.5 should have similar performance to 0.9, but you’re saying that interactive performance of 0.10.5 is much worse?

    In short, if you can supply me with a basic DooM image plus instructions on how to measure the interactive performance then I’ll have a go at bisecting this down between 0.10 and 0.9.

    • I’m on a different machine, so numbers are going to be different but I built a bunch of versions to see the differences.

      All compiled with gcc version 3.4.5 (mingw special)

      ==> qemu-0.8.1-log.txt <==

      ==> qemu-0.8.2-log.txt <==
      timed 2134 gametics in 174 realtics

      ==> qemu-0.9.0-log.txt <==
      timed 2134 gametics in 176 realtics

      ==> qemu-0.9.1-log.txt <==

      ==> qemu-0.10.0-log.txt <==
      timed 2134 gametics in 307 realtics

      ==> qemu-0.10.1-log.txt <==
      timed 2134 gametics in 311 realtics

      ==> qemu-0.10.2-log.txt <==
      timed 2134 gametics in 310 realtics

      ==> qemu-0.10.3-log.txt <==
      timed 2134 gametics in 309 realtics

      ==> qemu-0.10.4-log.txt <==
      timed 2134 gametics in 309 realtics

      ==> qemu-0.10.5-log.txt <==
      timed 2134 gametics in 309 realtics

      ==> qemu-0.10.6-log.txt <==
      timed 2134 gametics in 310 realtics

      ==> qemu-0.11.0-log.txt <==
      timed 2134 gametics in 348 realtics

      ==> qemu-0.11.1-log.txt <==
      timed 2134 gametics in 347 realtics

      ==> qemu-0.12.0-log.txt <==
      timed 2134 gametics in 356 realtics

      ==> qemu-0.12.1-log.txt <==
      timed 2134 gametics in 355 realtics

      ==> qemu-0.12.2-log.txt <==
      timed 2134 gametics in 358 realtics

      ==> qemu-0.12.3-log.txt <==
      timed 2134 gametics in 354 realtics

      ==> qemu-0.12.4-log.txt <==
      timed 2134 gametics in 353 realtics

      ==> qemu-0.12.5-log.txt <==
      timed 2134 gametics in 354 realtics

      ==> qemu-0.14.0-log.txt <==
      timed 2134 gametics in 373 realtics

      0.13 has some issue in the code with missing parts, and 0.15 starts with glib2 which I’m just not in the mood to build.

      The 2019-02-18 build gives me 826 realtics on this machine (Xeon E5-2620 v2 @ 2.10 GHz).

      I ran them all from a simple .cmd script with these bare flags:
      %1\i386-softmmu\qemu -L %1\pc-bios -hda dos.vmdk -serial file:%1-log.txt

      The hard disk image I’m using is this:

      You’ll get a 404 page, and the username/password is in the 404 page. 🙁

  2. I think that Doom has self-modifying code in its inner loop, which is probably why that benchmark behaves so differently to others — a JIT which is very fast to translate but produces mediocre code will do better with guest code like that than one which translates a little bit more slowly but produces faster code, because the time spent in the JIT itself will dominate the time spent actually running the code. But it’ll be worse for the much more common case where code isn’t self-modifying and the same input code is run many times.

    • Right so as per the example you pointed out to me in R_DrawColumn: Presumably changing the references to patch1/patch2 from being pointers directly into the code to a data location in BSS should help considerably here? It’s interesting to see how what was a cool optimisation on real hardware turns out to have completely the opposite effect during emulation.

      Neozeed – thanks for providing the test image. I did some spot benchmarks from 0.9.1 all the way up to git master, and can confirm as Peter states that as the TCG evolves, the overhead of the retranslation increases which is reflected in the DooM benchmark. In most circumstances this extra translation time produces considerably better code (as you can see from the other benchmarks), however because this particular bit of the DooM loop modifies itself on every iteration it constantly invalidates the QEMU code cache causing a retranslation in what is a performance critical part of the rendering code.

      Note that 0.9.1 also has the same issue, but as its code generator does far less optimisation, retranslation takes much less time, which is why the overhead is less apparent in older versions.

      • Thanks for the insight!

        I went ahead and dug up my old crappy DooM port, searched some more, and found a fix for the fixed-point problems I was having. I then re-built a version of DooM without sound, using GCC, with the only optimization being an assembly version of fixed multiply and a guarded fixed divide.

        0.90 256 ticks

        3.1.50 185 ticks

        It’s SO much more playable when compared to the standard self-modifying Watcom code.

        Kind of interesting where a sub-optimal interpreter can actually run faster. This stuff is way above me, but it’s fun to poke around with nonetheless!

  3. In version 0.9.0, QEMU used a different code generator; the switch to TCG happened at 0.10.
    I have an impression that the evolution of the QEMU Object Model and the memory API may make the I/O a bit slower: some things have moved from compile time to run time.

    Also, it would be interesting to run the benchmarks on non-i386 guest CPUs. I guess Doom would have less of the self-modifying code there. I don’t know which 0.9.0 machines are capable of running DOOM at all, though. The sun4u port: definitely not; the sun4m: only under Linux (if at all). I think you’ve run Windows NT on a MIPS target; might be worth a shot. I don’t know about the state of PPC/ARM back then.

    • It’d be cool if there were some MS-DOS-like OS for other processors… I feel that Linux is a bit heavy-handed, but I may have to give that a go.

      I’m starting to wonder if it jit/interprets GCC generated code better.

      • The only other CPU I know with an MS-DOS-like OS for it is the 68000, which has two (GEMDOS on the Atari ST, and whatever the X68K uses).

        But I suppose some parts of the FreeDOS kernel could be ported to another CPU, maybe…

        • I’ve seen TOS, or opentos or whatever it’s called, ported to that DIY kiwi68008 board. I think to the Amiga, too.

          No idea, though, how hard it would be to port to a fictitious 68000 with a linear framebuffer like VGA mode 13h…

          It’d be pointless, but kinda funny.

          • I see it’s also been ported to the Amiga!

            Although I haven’t the slightest idea what is involved in either porting or writing apps for it..

            It looks like AES and all the video stuff is part of it too.

  4. I wanted to give your qemu 0.90 a try for a long time, and only recently found some time to do so and revisit this. After painfully managing to build old gcc 3.4.5 into something runnable on Fedora 33 (using RedHat Linux 9 of all things), the 32bit qemu 0.90 produced by this process gave me a doom score of 407. By contrast, a 32bit build of qemu 6.0.0 on the same system, using the system compiler gcc 10.3.1, produces a hideous score of 2583. The corresponding 64bit build at least reaches 2037. And as a cross-check that I did not somehow mis-configure it completely, the Fedora binary of qemu 5.1.0 gives me 1793.

    But my impression is that this is not specific to Doom. I remember running Windows 2000 on my old MacBook back in 2008 using some qemu 0.9x binary not built by me, and the performance was quite acceptable. The same cannot be said about running Windows 2000 on current versions of qemu.

    • Redoing the benchmarks because I just noticed that you used option a, while I chose option b. So these are the results for a:

      0.90: 110
      5.1.0: 308
      6.0.0: 419/347 (32/64 bit)

      • Yes, the much later versions are faster! Like so many things, they change with time.

        I built a GCC 3 cross-compiler chain to build 0.9, as downloading so much crap and VMs… yuck.

        The one thing for me, at least, is that 0.9 feels more hackable and smallish to work with. I still haven’t managed to get the latest stuff to build for Windows; I guess I should be cross-compiling that as well. 😐

        • Setting up the cross build was a bit painful, but it’s definitely the better route. So I don’t need to employ a 2 decades old distro anymore. Yay! 😉

          Do note, however, that higher numbers correspond to lower performance for the Doom benchmark.
