How to fix rsync slowing down over time (SOLVED)

(This is a guest post by Antoni Sawicki aka Tenox)

I often make copies of large data archives, typically many TB in size. I found that rsync transfer speed slows down over time, typically after a few GB, especially when copying large files. Eventually reaching crawl speeds of just few KB/s. The internet is littered with people asking the same question or why rsync is slow in general. There really isn’t a good answer out there, so I hope this may help.

After doing some quick profiling I found out that the main culprit was rsync's advanced delta transfer algorithm. The algorithm is super awesome for incremental updates as it will only transfer changed parts of a file instead of the whole thing. However when performing initial copy it’s not only unnecessary but gets in the way and the CPU is spinning calculating CRC on chunks that never could have changed. As such…

Initial rsync copies should be performed with -W option, for example:

$ rsync -avPW <src> <dst>

The -W or --whole-file option instructs rsync to perform full file copies and do not use delta transfer algorithm. In result there is no CRC calculation involved and maximum transfer speeds can be easily achieved.

Long term, rsync could be patched to do a full file transfer if the file doesn’t exist in destination.

Also while copying jumbo archives of many TB I don’t want to see every individual file being copied. Instead I want a percentage of the total archive size and current transfer speed in MB/s. After some experiments I arrived at this weird combo:

$ rsync -aW --no-i-r --info=progress2 --info=name0 <src> <dst>

UK is over the edge: archive.org blocked at the telecom level

Well at first that looks weird. It pings and all so I jump to incognito mode, and…

Content Lock on EE helps to keep you and your children safe online by blocking 18-rated content.
We have three settings – Strict, Moderate and Off so you can choose exactly what level of security you’d like.
Please note: All new and existing accounts with Content Lock enabled have the “Moderate” setting applied by default. Content Lock is only activated when you’re using our network – not when you’re using WiFi.

And this is EE censoring archive.org . UNREAL!

Going through the SIM registration, and login….

You need a credit card to get it unlocked. Luckily my Hong Kong business card worked, as always set the zip code to ‘0000’.

Thanks over reaching corporations (at the behest of who?) from blocking me from the past?

Pathetic.

Microsoft Ends the Bethesda Launcher

Why yes, I do live in a cave!

Although I only got this for Fallout 76, back after the discounts started after launch:

So it doesn’t mean a heck of a lot to me. And they did a Fallout 76 migration a while back, and the rest was just freebies given out for whatever reason. Oh well. Steam, love them or hate them kills another single vendor pointless storefront.

Ready to run OpenVMS VM – Student Kit from VSI

(This is a guest post by Antoni Sawicki aka Tenox)

I was recently registering a new OpenVMS Community License. In the process I learned that there is a ready to run, pre-installed and pre-configured VM with OpenVMS 8.4. Completely free for non-commercial purposes. You don’t even need to register or leave your details (WOW). Just download and run! Thank you VSI!

https://training.vmssoftware.com/student-license/

The student kit runs only on Windows as contains FreeAXP emulator. However it’s super easy to download, install and run.

VSI OpenVMS Student Kit

I’m hoping that in near future once x86 OpenVMS port is ready there will be images for x64 hypervisors like VMware, VirtualBox, Hyper-v and QEMU/KVM hopefully.

Re-visiting an install of 386BSD 0.0

    I shall be telling this with a sigh
    Somewhere ages and ages hence:
    Two roads diverged in a wood,
            and I ---
    I took the one less traveled by,
    And that has made all the difference.
       "The Road Not Taken" [1916] -- Robert Frost

I didn’t want to make my last post exclusively focusing on 386BSD 0.0, but I thought the least I could do to honor Bill’s passing was to re-install 0.0 in 2022. As I mentioned his liberating Net/2 and giving it away for free for lowly 386/486 based users ushered in a massive shift in computer software where so called minicomputer software was now available for micro computer users. Granted 32bit micro computers, even in 1992 were very expensive, but they were not out of the reach of mere mortals. No longer did you have to share a VAX, you could run Emacs all by yourself! As with every great leap, the 0.0 is a bit rough around the edges, but with a bit of work it can be brought up to a running state, even in 2022.

But talking with my muse about legacies, and the impact of this release I thought I should at least go thru the motions, and re-do an installation, a documented one at that!

Stealing fire from the gods:

Although I had done this years ago, I was insanely light on details. From what I remember I did this on VMware, and I think it was fusion on OS X, then switching over to Bochs. To be fair it was over 11 years ago.

Anyways I’m going to use the VMware player (because I’m cheap), and just create a simple VM for MS-DOS that has 16MB of RAM, and a 100MB disk. Also because of weird issues I added 2 floppy drives, and a serial & parallel port opened up to named pipe servers so I can move data in & out during the install. This was really needed as the installation guide is ON the floppy, and not provided externally.

VMware disk geometry

One of the things about 386BSD 0.0 is that it’s more VAX than PC OS, so it doesn’t use partition tables. This also means geometry matters. So hitting F2 when the VM tries to boot, I found that VMware has given me the interesting geometry of 207 cylinders, 16 heads, and a density of 63 sectors/track. If you multiply 207*16*63 you get 208656 usable sectors, which will be important. Multiply that by 512 for bytes per sector you get a capacity of 106,831,872. Isn’t formatting disks like it’s the 1970s fun? Obviously if you attempt to follow along, obviously yours could be different.

Booting off install diskette

Throwing the install disk in the VM will boot it up to the prompt very quickly. So that’s nice. The bootloader is either not interactive at all, or modern machines are so fast, any timeout mechanism just doesn’t work.

As we are unceremonially dumped to a root prompt, it’s time to start the install! From the guide we first remount the floppy drive as read-write with the following:

mount -u /dev/fd0a /

Now for the fun part, we need to create an entry in the /etc/disktab to describe our disk, so we can label it. You can either type all this in, use the serial port, or just edit the Conner 3100 disk and turn it into this:

vmware100|VMWare Virtual 100MB IDE:\
:dt=ST506:ty=winchester:se#512:nt#16:ns#63:nc#207:sf: \
:pa#12144:oa#0:ta=4.2BSD:ba#4096:fa#512: \
:pb#12144:ob#12144:tb=swap: \
:pc#208656:oc#0: \
:ph#184368:oh#24288:th=4.2BSD:bh#4096:fh#512:

As you can see the big changes are the ‘dt’ or disk type line nt,ns and nc, which describe heads, density and cylinders. And how 16,63,207 came from the disk geometry from above. The ‘pa’,’pb’… entries describe partitions, and since they are at the start of the disk, nothing changes there since partitions are described in sectors. Partition C refrences the entire disk, so it’s set to the calculated 208656 sectors. Partition A+B is 24288, so 208,656-24,288 is 184,368 which then gives us the size of partition H. I can’t imagine what a stumbling block this would have been in 1992, as you really have to know your disks geometry. And of course you cannot share your disk with anything else, just like the VAX BSD installs.

With the disklabel defined, it’s now time to write it to the disk:

disklabel -r -w wd0 vmware100

And as suggested you should read it back to make sure it’s correct:

disklabel -r wd0
wd0 labeled as a custom VMware 100

Now we can format the partitions, and get ready to transfer the floppy disk to the hard disk. Basically it boils down to this:

newfs wd0a
newfs wd0h
bad144 wd0 -f
mount /dev/wd0a /mnt
mkdir /mnt/usr
mount /dev/wd0h /mnt/usr
(cd /;tar -cf - .)|(cd /mnt;tar -xvf -)
umount /mnt/usr
umount /mnt
fsck -y /dev/rwd0a
fsck -y /dev/rwd0h

Oddly enough the restore set also has files for the root, *however* it’s not complete, so you need to make sure to get files from the floppy, and again from the restore set.

One of the annoying things about this install is that VMware crashes trying to boot from the hard disk, so this is why we added 2 floppy drives to the install so we can transfer the install to the disk. Also it appears that there is some bug, or some other weird thing as the restore program wants to put everything into the ‘bin’ directory just adding all kinds of confusion, along with it not picking up end of volume correctly. So we have to do some creative work arounds.

So we mount the ‘h’ partition next as it’s the largest one and will have enough scratch space for our use:

mkdir /mnt/bin
mount /dev/wd0a /mnt/bin
mount /dev/wd0h /mnt/bin/usr
cd /mnt/bin/usr

Now is when we insert the 1st binary disk into the second floppy drive, and we are going to dump into a file called binset:

cat /dev/fd1 > binset

Once it’s done, you can insert the second disk, and now we are going to append the second disk to binset:

cat /dev/fd1 >> binset

You need to do this with disks 2-6.

I ran the ‘sync’ command a few times to make sure that binset is fully written out to the hard disk. Now we are going to use the temperamental ‘mr’ program to extract the binary install:

cd /mnt
mr 1440 /mnt/bin/usr/binset | tar -zxvf -

This will only take a few seconds, but I’d imagine even on a 486 with an IDE disk back then, this would take forever.

The system is now extracted! I just ran the following ‘house cleaning’ to make sure everything is fine:

cd /
umount /mnt/bin/usr
umount /mnt/bin
fsck -y /dev/rwd0a
fsck -y /dev/rwd0h

And there we go!

Now for actually booting up and using this, as I mentioned above, VMware will crash attempting to boot 386BSD. Maybe it’s the bootloader? Maybe it’s BIOS? I don’t know. However old versions of Qemu (I tested 0.9 & 0.10.5) will work.

With the system booted you should run the following to mount up all the disks:

fsck -p
mount -a
update
/etc/netstart

I just put this in a file called /start so I don’t have to type all that much over and over and over:

Booting from Hard Disk, under Qemu

On first boot there seems to be a lot of missing and broken stuff. The ‘which’ command doesn’t work, and I noticed all the accounting stuff is missing as well:

mkdir /var/run
mkdir /var/log
touch /var/run/utmp
touch /var/log/wtmp

Will at least get that back in action.

The source code is extracted in a similar fashion, it expects everything to be under a ‘src’ directory, so pretty much the same thing as the binary extract, just change ‘bin’ to ‘src’, and it’s pretty much done.

End thoughts

I think this wraps up the goal of getting this installed and booting. I didn’t want to update or change as little as possible to have that authentic 1992 experience, limitations and all. It’s not a perfect BSD distribution, but this had the impact of being not only free, but being available to the common person, no SPARC/MIPS workstations, or other obscure or specialized 68000 based machine, just the massively copied and commodity AT386. For a while when Linux was considered immature, BSD’s led the networking charge, and I don’t doubt that many got to that position because of that initial push made by Bill & Lynne with 386BSD.

Compressed with 7zip, along with my altered boot floppy with my VMware disk entry it’s 8.5MB compressed. Talk about tiny! For anyone interested here is my boot floppy and vmdk, which I run on early Qemu.

And there we go!

Bill Jolitz passed last month..

As mentioned on the TUHS mailing list. Many will remember his work on being a literal Prometheus, thrashing his work in the cathedral and delivering Net/2 to the lower i386 users. He and his wife, Lynne were instrumental in kicking off the surviving legacy of University research Unix.

I don’t need to bemoan on opertunities lost, the pivotal moments of 1991 or the way the Internet arranged itself around the needs of being PC i386, portability, and then a schism later security.

Details are sparse overall, I believe he is survived by his wife and a daughter?

Bill was far too young at 65.

The slap heard around the world

Even when Im trying to live under my rock, I still am somehow flooded with news that there was a slap fight.

Totally not kayfabe. Borrowed from CNN.com

No not this Will Smith Chris Rock thing, I’m talking of course about Clive Sinclair slapping Chris Curry at the Baron of Beef pub in Cambridge.

What’s the beef about?

Where’s the beef?

Clive vs Curry

As the legend goes, Curry worked under Clive, but he ran into Herman Hauser who had encouraged Curry to go his own way and make that computer of his dreams. Incised about this Clive was able to put together and rush out the Z80 before Acorn had anything ready to ship

£79.95

And more importantly it was CHEAP. You’d have thought that the zx80 would have found a larger world wide market but Commodore and Apple reigned supreme in North America.

Later that year Acorn would ship the Acorn Atom priced around £129 in kit, and £179 assembled it was a lot more expensive but granted it did have a lot more ‘computer’ in there.

In the following year Sinclair had released the ZX81, which although a larger price point also included a lot more, larger ram/rom better display and of course this was ready to ignite the coming war.

As the legend goes a TV show of all things, ‘The Might Micro, (2/3/4/5/6)’ had ignited such a storm in parliament that the Department of Industry & the BBC decided that they were going to produce programming to go along with a selected microcomputer. And that machine was the Newbury NewBrain… until it was obvious that this wasn’t going to be the machine of choice, and the selection was pushed back from the fall of 81 to the spring of 82. With the BBC being forced to open up selection to other UK computer manufacturers, both worked hard for a machine, however Curry swooped in with his new ‘BBC Micro’ (that had started working the day of the inspection) and won the contract.

1982 of course would give us the ZX Spectrum as Sinclair’s answer to what the people needed.

Oddly enough things in the long term didn’t work out for ether of them, as they both made so many missteps that they ended up ultimately shelving both of the units, with Acorn barely surviving, although their ARM processor does live on, mostly because it ended up free of any hardware platform to go along with it.

The plus isn’t plussed

There was no ZX 83 model, instead there was of course the QL for 1984. And taking on the design of the QL the Sinclair + was launched. And despite the name, it was just a 48k with a reset button and nicer keyboard. Very NON plussed. The only upgrade to the ZX would have to come from spain in the form of the 128.

The QL was 100% incompatible with the ZX. Apparently doing something like the SEGA Megadrive, by including both a 68000 and z80 was just too out of the question. Instead it was so focused on price it made the machine not serious enough for the serious business market Clive had craved so much. No socket for a 68881, and the drives being so incredibly tiny, IBM had quickly followed up the PC with the XT which allowed for a hard disk, while the QL with a single slot in no way could fit a then 5 1/4″ full heigh disk.

Although many fault the QL for having relied on the 68008 processor remember even IBM was using the 8088, with the same 8bit constraints, it’s not that it was impossible, it’s that the sleek stylized deck of the QL was just far too ahead of itself, it’d be fine for today, just look at the Pi400! I’d prefer to have one with SD cards up front but I guess I need to learn how to 3d print and make my own.

Another fault of the QL was not having the space on the motherboard to go to the full 1MB of addressable RAM like the PC, and loading the OS from disk. Having the OS in ROM was such an 8bit holdover when loading it from tape would have been useless but the PC way of loading the OS from disk was the way to go, also it far easier facilitated updating. I know the ST & Amiga also went with OS in ROM thinking it saved money but in the long term all the wedge’s of the era just limited themselves.

The real slap: in the market

The real SLAP heard around the UK

The real slap that was heard was the stagnation of both machines, and the decline of the UK computer makers. Acorn had apparently manufactured a tonne of Electron’s for Christmas but the order wasn’t actually put through because of some ‘pull back of a video game crash’ in Europe. I guess it’s the continuation of the video game crash in the USA, but as you can see the stockpile of machines to be blown out was just incredible.

And it was in 1984 that apparently Acorn had run an ad showing that Sinclair computers had a high defect rate, something that has always plagued Sinclair’s quest for low cost machines, Something that had been hand waved as a 1 year replacement policy with many teenagers abusing the machines, that led to the confrontation in the Baron of Beef along with the whooping Sinclair had unleashed on Curry. Although much of this has passed into more legend than fact, even Ruth Bramley didn’t recall anything about the event.

It’s an amazing flash in the pan, that has so many games, and so much early computer culture that was partitioned to a tiny island and for the most part in the rest of the world totally unknown. I hope to get a real Spectrum 128 one day, it sounds like a fascinating machine. Although they made a million? of them, they are quite expensive in any market place. I wonder sometimes if there is demand for a super cheap almost ‘disposable’ 8bit computer. Obviously it’d have be under £20.

Since all this UK micro computer stuff never really left the island it’s all new to me. And maybe many people outside of the UK, or surprisingly the iron curtain where zx spectrums were abundant.

footnote: I know people will say that there was some attempt at selling Sinclair Micros out of Texas with one OEM, but honestly I’ve never hear or seen of any such thing, it’s only recently as a curiosity on youtube. And they were incompatible anyways so whatever.

Also holy crap so an actor slapped another actor in a show where they backslap each other. Who cares?! Bring back Beavis and Butthead, and prime time boxing! People obviously have a thirst for this, why did the WWF’s kayfabe fade? the paywalls?

Silent Partner – Space Walk

aka the 20’s version of Opus number 1. Wait, what?

While feeling generally like crap the last few days and half sleeping letting YouTube play random crap it’d eventually come across a ‘live premier’ and they all of course have the same music.

At first in a haze I thought it was looping videos, but no it was 2 completely different people however the intro music was the same. So in the middle of the night the quest had started to track down the tune.

And in no time I managed to find what most everyone else found. It’s from ‘Silent Partner‘ who made a bunch of free to use music. And in the compilations on soundcloud is the Space Walk.

Silent Partner’s Space Walk

And of course you know them from ‘Spring in My Step‘ among many many others.

Although they do have an official channel now, with a short BIO in their about:

In 2013, YouTube reached out to producer Bryce Goggin, asking if he was interested in creating music for their new Audio Library which aimed to give billions of video creators access to free, safe to use music. His answer was simple – “Yes”. Goggin, who’s worked with the likes of Phish, Sean Lennon, Space Hog, and Pavement, recruited a close circle of session musicians to help take on the mighty endeavor. The team worked out of his studio in Brooklyn NY, creating a diverse collection of over 1k songs that would go on to be used in hundreds of millions of videos.

And from the official song:

Since June 2018, every YouTube premiere is preceded by a colorful countdown that features vibrant, abstract animations and a clock ticking its way down to zero. Every countdown also includes the same song playing front and center, a two-minute instrumental track that stirs up anticipation with its nostalgic electronic synths, drum machine percussion, and orchestral string plucks. This song, “Space Walk” by Silent Partner is often referred to as the “unofficial YouTube national anthem”.

Commenters on YouTube re-uploads of the song agree, as they’ve shared a variety of feelings about the track. One person noted, “People in 2030/2040 will be like: This is soooo nostalgic!! Only real ones remember this.” Somebody else wrote, “This is honestly such a fitting song for YouTube Premiere countdowns, it just perfectly goes with your imagination running wild about what you’re about to see.” Another user painted a picture of the end of YouTube with “Space Walk” as the soundtrack: “I feel like this is something that would play in the final minutes of youtube before the site shuts down. Just this music and a few minutes to remember everything that has happened on this site over the decades before it all goes away.”

“Space Walk” has been heard billions of times (literally). Ed Sheeran’s “Shape Of You,” the most popular all-time song on Spotify, has nearly 3 billion spins, and it wouldn’t be surprising to learn that the YouTube premiere song — across every YouTube premiere ever, music video or otherwise — has been heard more times than that.

And there you go. The site Uproxx has a great article on the hunt for the origin of the song. Like so many I’m just here after the fact, wondering what is the deal with the song.

If IVR’s, hold music and answering machines were still a touchstone, I’m sure Space Walk would be up there. But instead now it’s just intro lead music.

Space Walk mp3

And since it is royalty free here is the MP3 for anyone who really wants it that bad.

Setting the time machine for June 20th 2011

John Titor hunting Orange wine, and IBM 5100’s

No, not that time machine, this one is a rehash of the old local Wikipeida mirror.

So sadly I didn’t keep the source files as I thought they were evergreen, and yeah turns out they are NOT. But thankfully there is a 2011 set on archive.org listed as enwiki-20110620-item-1-of-2 and enwiki-20110620-item-2-of-2. Sadly there isn’t any torrents of these files, and it seems as of today the internet archive torrent servers are dead so a direct download is needed.

Getting started

You are going to need a LOT of disk space. It’s about 10GB for the downloaded compressed data, and with the pages blown out to a database it’s ~60GB. Yes it’s massive. Also enough space for a Debian 7 VM, or a lot of your time trying to decode ancient perl. Yes it really is a write only language. I didn’t bother trying to figure out why it doesn’t work instead I used netcat and a Debian 7 VM.

Thanks to trn he suggested aria2c which did a great job of downloading stuff, although one URL at a time, but that’s fine.

aria2c -x 16 -s 16 -j 16 <<URL>>

I downloaded the following files:

  • enwiki-20110620-all-titles-in-ns0.gz
  • enwiki-20110620-category.sql.gz
  • enwiki-20110620-categorylinks.sql.gz
  • enwiki-20110620-externallinks.sql.gz
  • enwiki-20110620-flaggedpages.sql.gz
  • enwiki-20110620-flaggedrevs.sql.gz
  • enwiki-20110620-image.sql.gz
  • enwiki-20110620-imagelinks.sql.gz
  • enwiki-20110620-interwiki.sql.gz
  • enwiki-20110620-iwlinks.sql.gz
  • enwiki-20110620-langlinks.sql.gz
  • enwiki-20110620-oldimage.sql.gz
  • enwiki-20110620-page.sql.gz
  • enwiki-20110620-pagelinks.sql.gz
  • enwiki-20110620-pages-articles.xml.bz2
  • enwiki-20110620-pages-logging.xml.gz
  • enwiki-20110620-page_props.sql.gz
  • enwiki-20110620-page_restrictions.sql.gz
  • enwiki-20110620-protected_titles.sql.gz
  • enwiki-20110620-redirect.sql.gz
  • enwiki-20110620-site_stats.sql.gz
  • enwiki-20110620-templatelinks.sql.gz
  • enwiki-20110620-user_groups.sql.gz

although the bulk of what you want as a single file is enwiki-20110620-pages-articles.xml.bz2, which is 7.5 GB, downloading the rest of the files is another 10GB rouding this out to 17.5GB of files to download. Yikes!

MySQL on WSLv2

I’m using Ubuntu 20.04 LTS on Windows 11, so adding MySQL is done via the MariaDB version with a simple apt-get install:

apt-get install mariadb-server mariadb-common mariadb-client mariadb-common

Installing MySQL is kind of easy although it will need to be setup to assign the pid file to the right place and set so it can write to it:

mkdir -p /var/run/mysqld
chown mysql:mysql /var/run/mysqld

Otherwise you’ll get this:

[ERROR] mysqld: Can't create/write to file '/var/run/mysqld/mysqld.pid' (Errcode: 2 "No such file or directory")

Additionally you’ll need to tell it to bind to 0.0.0.0 instead of 127.0.0.1 as we’ll want this on the network. I’m on an isolated LAN so it’s fine by me, but of course your millage may vary. For me a simple diff of the config directory is this:

diff -ruN etc/mysql/mariadb.conf.d/50-server.cnf /etc/mysql/mariadb.conf.d/50-server.cnf
--- etc/mysql/mariadb.conf.d/50-server.cnf      2021-11-21 08:22:31.000000000 +0800
+++ /etc/mysql/mariadb.conf.d/50-server.cnf     2022-03-11 10:01:45.369272200 +0800
@@ -27,7 +27,7 @@

 # Instead of skip-networking the default is now to listen only on
 # localhost which is more compatible and is not less secure.
-bind-address            = 127.0.0.1
+bind-address            = 0.0.0.0

 #
 # * Fine Tuning
@@ -43,6 +43,11 @@
 #max_connections        = 100
 #table_cache            = 64

+key_buffer_size = 1G
+max_allowed_packet = 1G
+query_cache_limit = 18M
+query_cache_size = 128M
+
 #
 # * Logging and Replication
 #

As far as I know MySQL doesn’t run on WSLv1. So people with that restriction are kind of SOL. At the same time for me, Debian 7 doesn’t run on Hyper-V so I had to run VMware Player. And well if you can’t run Hyper-V/WSLv2 then you can run it all on Debian 7 which is probably eaiser. Although you’ll probably hit some performance issues in the import that either my machine is fast enough I don’t care or the newer stuff is pre-configured for machines larger than an ISA/PCI gen1 Pentium 60.

I run mysqld manually in a window as I am only doing this adhoc not as a service. Although on a Windows 10 machine to reproduce and test this, mysqld wont run interactively, instead I had to do the ‘service mysql start’ to get it running. So I guess you’ll have to find out the hard way.

Next, be sure to create the database and a user to so this will work:

create database wikidb;
create user 'wikiuser'@'%' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON wikidb.* TO 'wikiuser'@'%' WITH GRANT OPTION;
show grants for 'wikiuser'@'%';

Something like this works well. Yes the password is password but it’s all internal so who cares. If you don’t like it, change it as needed.

With the database & user created you’ll want to make sure that you can connect from the Debian 7 machine with something like this:

mysql -h 192.168.6.10 -uwikiuser -ppassword wikidb

As I don’t think PHP 7 or whatever is modern will run the ancient MediaWiki version 1.15.5 (which I’m using).

This is my setup as I’m writing this so bear with me.

Prepping Apache

Since I have that Debian 7 VM, I used that for setting up MediaWiki. Looking at my apt-cache I believe I loaded the following modules:

  • mysql-client
  • mysql-common
  • apache2
  • apache2.2-bin
  • apache2.2-common
  • apache2-mpm-prefork
  • apache2-mpm-worker
  • apache2-utils
  • libapache2-mod-php5
  • php5-cli
  • php5-common
  • php5-mysql
  • lua5.1
  • liblua5.1

On the Apache side I have the following extension enabled:

alias authz_default authz_user deflate mime reqtimeout
auth_basic authz_groupfile autoindex dir negotiation setenvif
authn_file authz_host cgi env php5 status

Which I think is pretty generic.

I used mediawiki-1.15.5 as the basis mostly because I had started with an incomplete 2010 dump, but after finding this 2011 dump I probably should have gone with 1.16.5 or 1.17.5.. Oh well. When connecting from Debian 7 to my ‘modern’ MariaDB there is one table that needs to be updated, otherwise it’ll fail. A simple diff that needs to be applied (that was with the least amount of effort spent by me!) is this:

--- maintenance/tables.sql      2009-03-20 19:20:39.000000000 +0800
+++ /var/www/maintenance/tables.sql     2022-03-07 14:21:25.580318700 +0800
@@ -1099,7 +1099,7 @@

 CREATE TABLE /*_*/trackbacks (
   tb_id int PRIMARY KEY AUTO_INCREMENT,
-  tb_page int REFERENCES /*_*/page(page_id) ON DELETE CASCADE,
+  tb_page int,
   tb_title varchar(255) NOT NULL,
   tb_url blob NOT NULL,
   tb_ex text,

All being well and patched you can do the install! I just do a super basic install, nothing exciting. In my setup the MySQL server is on 192.168.6.10. I don’t think I changed much of anything?

And with that done if all goes well you’ll get the install completed!

If you get anything else, drop the database (the permission grants stay, because MySQL doesn’t actually drop thing associated with databases.. :shrug:.

Next in the extensions folder I grabbed Scribunto-REL1_35-04b897f.tar.gz, which is still on the extensions site. This required Lua 5.1 and the following to be appended to the LocalSetings.php

#
$wgScribuntoEngineConf['luastandalone']['luaPath'] = '/usr/bin/lua5.1';

$wgScribuntoUseGeSHi = true;
$wgScribuntoUseCodeEditor = true;
#

Keep in mind the original extensions I used are not, and appear to not have been archived, so yeah.

Doing the pages.xml import

You can find the version 0.5 media wiki import script on archive.org. Obviously check the first 5-10 lines of the decompressed bz2 file to see what version you have if you are deviating and look around IA to time travel to see if there is a matching one. I have no idea about modern ones as this is hard enough trying to reproduce an old experiment.

First you need to make some files to setup the pre-post conditions of the insert. It’s about 11,124,050 pages, give or take.

pre.sql

SET autocommit=0;
SET unique_checks=0;
SET foreign_key_checks=0;
BEGIN;

post.sql

COMMIT;
SET autocommit=1;
SET unique_checks=1;
SET foreign_key_checks=1;

Running the actual import

I’m assuming that 192.168.6.33 is the Debian 7 machine, 192.168.6.10 is the Windows 11 machine.

On the machine with the data:

netcat 192.168.6.33 9909 < enwiki-latest-pages-articles.xml.bz2

On the machine that can run the mwimport script:

netcat -l -p 9909 | bzip2 -dc | ./mwimport-0.5.pl | netcat 192.168.6.10 9906

And finally on the MySQL machine:

(cat pre.sql; netcat -l -p 9906 ; cat post.sql) | mysql -f --default-character-set=utf8 wikidb

Since I’m using WSLv2 the Windows firewall may screw stuff up so add a rule with netsh (as Administrator CMD prompt)

netsh interface portproxy add v4tov4 listenaddress=192.168.6.10 listenport=3306 connectaddress=172.24.167.66 connectport=3306
netsh interface portproxy add v4tov4 listenaddress=192.168.6.10 listenport=9906 connectaddress=172.24.167.66 connectport=9906

On my setup it takes about 2.5 hours to load the database, which will be about 51GB.

11340000 pages (1231.805/s),  11340000 revisions (1231.805/s) in 9206 seconds

The savvy among you may notice the -f flag to the mysql parser. And yes that is because there *will* be errors during the process.

I’m not sure what how or what to do about it, but without the -f (force) flag the process will stop around the 2 million row mark. Doing it forced allows the process to continue.

With that done I get the following tallies…

MariaDB [(none)]> SELECT table_name, table_rows FROM INFORMATION_SCHEMA.TABLES    WHERE TABLE_SCHEMA = 'wikidb' and
table_rows > 0;
+---------------+------------+
| table_name    | table_rows |
+---------------+------------+
| interwiki     |         85 |
| objectcache   |         10 |
| page          |   10839464 |
| revision      |   11357659 |
| text          |   14491759 |
| user_groups   |          2 |
+---------------+------------+
9 rows in set (0.002 sec)

If all of this worked (amazing!) then search for something like 1001 and be greeted with:

1001: a non odyssey

MySQL disappointments

So with this in place, having some 51GB laying around just seemed lame. Using WSLv2 I setup a compressed folder on NTFS and moved the data directory into there and it gets it down to a somewhat more manageable 20GB. Since the data doesn’t change I had a better idea, SquashFS. Well it compresses down to 12GB, HOWEVER for the life of me I can’t find anything concrete on using a read only backing store to MySQL. Even general mediawiki stuff seems to want to write to all the tables, I guess it’s index searching?! Insane! And it appears MySQL can only use single file storage units per table? Yeah this isn’t MSSQL with stuff like a database from CD-ROM with the log on a floppy. I tried doing a union overlay filesytem but it makes a 100% copy of a file that changes. That’s not good. I guess using qemu-img for a compressed qcow2 with a writable diff file could hide the read only compressed backing store, but I’ve already lost interest.

Maybe it’s just me, but it seems like there should be a way to write logs/updates/scratch to a RW place, and keep the majority of the data read-only (and highly compressed).

Why doesn’t stuff format correctly

There seems to be a lot of formatting nonsense going on, I probably should step up to mediawiki 1.17. And I’ll add in loading the other SQL tables since they are straight up inserts. Also the extensions I know I loaded don’t seem to exist in any form anymore, and the images I snapshotted of the install are all long gone. It’ll require more diving around.