Setting the time machine for June 20th 2011

John Titor hunting Orange wine, and IBM 5100’s

No, not that time machine, this one is a rehash of the old local Wikipeida mirror.

So sadly I didn’t keep the source files as I thought they were evergreen, and yeah turns out they are NOT. But thankfully there is a 2011 set on archive.org listed as enwiki-20110620-item-1-of-2 and enwiki-20110620-item-2-of-2. Sadly there isn’t any torrents of these files, and it seems as of today the internet archive torrent servers are dead so a direct download is needed.

Getting started

You are going to need a LOT of disk space. It’s about 10GB for the downloaded compressed data, and with the pages blown out to a database it’s ~60GB. Yes it’s massive. Also enough space for a Debian 7 VM, or a lot of your time trying to decode ancient perl. Yes it really is a write only language. I didn’t bother trying to figure out why it doesn’t work instead I used netcat and a Debian 7 VM.

Thanks to trn he suggested aria2c which did a great job of downloading stuff, although one URL at a time, but that’s fine.

aria2c -x 16 -s 16 -j 16 <<URL>>

I downloaded the following files:

  • enwiki-20110620-all-titles-in-ns0.gz
  • enwiki-20110620-category.sql.gz
  • enwiki-20110620-categorylinks.sql.gz
  • enwiki-20110620-externallinks.sql.gz
  • enwiki-20110620-flaggedpages.sql.gz
  • enwiki-20110620-flaggedrevs.sql.gz
  • enwiki-20110620-image.sql.gz
  • enwiki-20110620-imagelinks.sql.gz
  • enwiki-20110620-interwiki.sql.gz
  • enwiki-20110620-iwlinks.sql.gz
  • enwiki-20110620-langlinks.sql.gz
  • enwiki-20110620-oldimage.sql.gz
  • enwiki-20110620-page.sql.gz
  • enwiki-20110620-pagelinks.sql.gz
  • enwiki-20110620-pages-articles.xml.bz2
  • enwiki-20110620-pages-logging.xml.gz
  • enwiki-20110620-page_props.sql.gz
  • enwiki-20110620-page_restrictions.sql.gz
  • enwiki-20110620-protected_titles.sql.gz
  • enwiki-20110620-redirect.sql.gz
  • enwiki-20110620-site_stats.sql.gz
  • enwiki-20110620-templatelinks.sql.gz
  • enwiki-20110620-user_groups.sql.gz

although the bulk of what you want as a single file is enwiki-20110620-pages-articles.xml.bz2, which is 7.5 GB, downloading the rest of the files is another 10GB rouding this out to 17.5GB of files to download. Yikes!

MySQL on WSLv2

I’m using Ubuntu 20.04 LTS on Windows 11, so adding MySQL is done via the MariaDB version with a simple apt-get install:

apt-get install mariadb-server mariadb-common mariadb-client mariadb-common

Installing MySQL is kind of easy although it will need to be setup to assign the pid file to the right place and set so it can write to it:

mkdir -p /var/run/mysqld
chown mysql:mysql /var/run/mysqld

Otherwise you’ll get this:

[ERROR] mysqld: Can't create/write to file '/var/run/mysqld/mysqld.pid' (Errcode: 2 "No such file or directory")

Additionally you’ll need to tell it to bind to 0.0.0.0 instead of 127.0.0.1 as we’ll want this on the network. I’m on an isolated LAN so it’s fine by me, but of course your millage may vary. For me a simple diff of the config directory is this:

diff -ruN etc/mysql/mariadb.conf.d/50-server.cnf /etc/mysql/mariadb.conf.d/50-server.cnf
--- etc/mysql/mariadb.conf.d/50-server.cnf      2021-11-21 08:22:31.000000000 +0800
+++ /etc/mysql/mariadb.conf.d/50-server.cnf     2022-03-11 10:01:45.369272200 +0800
@@ -27,7 +27,7 @@

 # Instead of skip-networking the default is now to listen only on
 # localhost which is more compatible and is not less secure.
-bind-address            = 127.0.0.1
+bind-address            = 0.0.0.0

 #
 # * Fine Tuning
@@ -43,6 +43,11 @@
 #max_connections        = 100
 #table_cache            = 64

+key_buffer_size = 1G
+max_allowed_packet = 1G
+query_cache_limit = 18M
+query_cache_size = 128M
+
 #
 # * Logging and Replication
 #

As far as I know MySQL doesn’t run on WSLv1. So people with that restriction are kind of SOL. At the same time for me, Debian 7 doesn’t run on Hyper-V so I had to run VMware Player. And well if you can’t run Hyper-V/WSLv2 then you can run it all on Debian 7 which is probably eaiser. Although you’ll probably hit some performance issues in the import that either my machine is fast enough I don’t care or the newer stuff is pre-configured for machines larger than an ISA/PCI gen1 Pentium 60.

I run mysqld manually in a window as I am only doing this adhoc not as a service. Although on a Windows 10 machine to reproduce and test this, mysqld wont run interactively, instead I had to do the ‘service mysql start’ to get it running. So I guess you’ll have to find out the hard way.

Next, be sure to create the database and a user to so this will work:

create database wikidb;
create user 'wikiuser'@'%' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON wikidb.* TO 'wikiuser'@'%' WITH GRANT OPTION;
show grants for 'wikiuser'@'%';

Something like this works well. Yes the password is password but it’s all internal so who cares. If you don’t like it, change it as needed.

With the database & user created you’ll want to make sure that you can connect from the Debian 7 machine with something like this:

mysql -h 192.168.6.10 -uwikiuser -ppassword wikidb

As I don’t think PHP 7 or whatever is modern will run the ancient MediaWiki version 1.15.5 (which I’m using).

This is my setup as I’m writing this so bear with me.

Prepping Apache

Since I have that Debian 7 VM, I used that for setting up MediaWiki. Looking at my apt-cache I believe I loaded the following modules:

  • mysql-client
  • mysql-common
  • apache2
  • apache2.2-bin
  • apache2.2-common
  • apache2-mpm-prefork
  • apache2-mpm-worker
  • apache2-utils
  • libapache2-mod-php5
  • php5-cli
  • php5-common
  • php5-mysql
  • lua5.1
  • liblua5.1

On the Apache side I have the following extension enabled:

alias authz_default authz_user deflate mime reqtimeout
auth_basic authz_groupfile autoindex dir negotiation setenvif
authn_file authz_host cgi env php5 status

Which I think is pretty generic.

I used mediawiki-1.15.5 as the basis mostly because I had started with an incomplete 2010 dump, but after finding this 2011 dump I probably should have gone with 1.16.5 or 1.17.5.. Oh well. When connecting from Debian 7 to my ‘modern’ MariaDB there is one table that needs to be updated, otherwise it’ll fail. A simple diff that needs to be applied (that was with the least amount of effort spent by me!) is this:

--- maintenance/tables.sql      2009-03-20 19:20:39.000000000 +0800
+++ /var/www/maintenance/tables.sql     2022-03-07 14:21:25.580318700 +0800
@@ -1099,7 +1099,7 @@

 CREATE TABLE /*_*/trackbacks (
   tb_id int PRIMARY KEY AUTO_INCREMENT,
-  tb_page int REFERENCES /*_*/page(page_id) ON DELETE CASCADE,
+  tb_page int,
   tb_title varchar(255) NOT NULL,
   tb_url blob NOT NULL,
   tb_ex text,

All being well and patched you can do the install! I just do a super basic install, nothing exciting. In my setup the MySQL server is on 192.168.6.10. I don’t think I changed much of anything?

And with that done if all goes well you’ll get the install completed!

If you get anything else, drop the database (the permission grants stay, because MySQL doesn’t actually drop thing associated with databases.. :shrug:.

Next in the extensions folder I grabbed Scribunto-REL1_35-04b897f.tar.gz, which is still on the extensions site. This required Lua 5.1 and the following to be appended to the LocalSetings.php

#
$wgScribuntoEngineConf['luastandalone']['luaPath'] = '/usr/bin/lua5.1';

$wgScribuntoUseGeSHi = true;
$wgScribuntoUseCodeEditor = true;
#

Keep in mind the original extensions I used are not, and appear to not have been archived, so yeah.

Doing the pages.xml import

You can find the version 0.5 media wiki import script on archive.org. Obviously check the first 5-10 lines of the decompressed bz2 file to see what version you have if you are deviating and look around IA to time travel to see if there is a matching one. I have no idea about modern ones as this is hard enough trying to reproduce an old experiment.

First you need to make some files to setup the pre-post conditions of the insert. It’s about 11,124,050 pages, give or take.

pre.sql

SET autocommit=0;
SET unique_checks=0;
SET foreign_key_checks=0;
BEGIN;

post.sql

COMMIT;
SET autocommit=1;
SET unique_checks=1;
SET foreign_key_checks=1;

Running the actual import

I’m assuming that 192.168.6.33 is the Debian 7 machine, 192.168.6.10 is the Windows 11 machine.

On the machine with the data:

netcat 192.168.6.33 9909 < enwiki-latest-pages-articles.xml.bz2

On the machine that can run the mwimport script:

netcat -l -p 9909 | bzip2 -dc | ./mwimport-0.5.pl | netcat 192.168.6.10 9906

And finally on the MySQL machine:

(cat pre.sql; netcat -l -p 9906 ; cat post.sql) | mysql -f --default-character-set=utf8 wikidb

Since I’m using WSLv2 the Windows firewall may screw stuff up so add a rule with netsh (as Administrator CMD prompt)

netsh interface portproxy add v4tov4 listenaddress=192.168.6.10 listenport=3306 connectaddress=172.24.167.66 connectport=3306
netsh interface portproxy add v4tov4 listenaddress=192.168.6.10 listenport=9906 connectaddress=172.24.167.66 connectport=9906

On my setup it takes about 2.5 hours to load the database, which will be about 51GB.

11340000 pages (1231.805/s),  11340000 revisions (1231.805/s) in 9206 seconds

The savvy among you may notice the -f flag to the mysql parser. And yes that is because there *will* be errors during the process.

I’m not sure what how or what to do about it, but without the -f (force) flag the process will stop around the 2 million row mark. Doing it forced allows the process to continue.

With that done I get the following tallies…

MariaDB [(none)]> SELECT table_name, table_rows FROM INFORMATION_SCHEMA.TABLES    WHERE TABLE_SCHEMA = 'wikidb' and
table_rows > 0;
+---------------+------------+
| table_name    | table_rows |
+---------------+------------+
| interwiki     |         85 |
| objectcache   |         10 |
| page          |   10839464 |
| revision      |   11357659 |
| text          |   14491759 |
| user_groups   |          2 |
+---------------+------------+
9 rows in set (0.002 sec)

If all of this worked (amazing!) then search for something like 1001 and be greeted with:

1001: a non odyssey

MySQL disappointments

So with this in place, having some 51GB laying around just seemed lame. Using WSLv2 I setup a compressed folder on NTFS and moved the data directory into there and it gets it down to a somewhat more manageable 20GB. Since the data doesn’t change I had a better idea, SquashFS. Well it compresses down to 12GB, HOWEVER for the life of me I can’t find anything concrete on using a read only backing store to MySQL. Even general mediawiki stuff seems to want to write to all the tables, I guess it’s index searching?! Insane! And it appears MySQL can only use single file storage units per table? Yeah this isn’t MSSQL with stuff like a database from CD-ROM with the log on a floppy. I tried doing a union overlay filesytem but it makes a 100% copy of a file that changes. That’s not good. I guess using qemu-img for a compressed qcow2 with a writable diff file could hide the read only compressed backing store, but I’ve already lost interest.

Maybe it’s just me, but it seems like there should be a way to write logs/updates/scratch to a RW place, and keep the majority of the data read-only (and highly compressed).

Why doesn’t stuff format correctly

There seems to be a lot of formatting nonsense going on, I probably should step up to mediawiki 1.17. And I’ll add in loading the other SQL tables since they are straight up inserts. Also the extensions I know I loaded don’t seem to exist in any form anymore, and the images I snapshotted of the install are all long gone. It’ll require more diving around.

Mirroring Wikipedia

So I had an internet outage, and was thinking if I was trapped on my proverbial desert island what would I want with me?

Well wikipedia would be nice!

So I started with this extreme tech article by Sebastian Anthony, although it has since drifted out of date on a few things.

But it is enough to get you started.

I downloaded my XML dump from Brazil like he mentions.  The files I got were:

  • enwiki-20140304-pages-articles.xml.bz2 10G
  • enwiki-20140304-all-titles-in-ns0.gz 58MB
  • enwiki-20140304-interwiki.sql.gz 728Kb
  • enwiki-20140304-redirect.sql.gz 91MB
  • enwiki-20140304-protected_titles.sql.gz 887Kb

The pages-articles.xml is required.  I added in the others in the hopes of fixing some formatting issues.  I re-compressed it from 10GB using Bzip2 to 8.4GB with 7zip.  It’s still massive, but when you are on a ‘slow’ connection every saved GB matters.

Since I already have apache/php/mysql running on my Debian box, I can’t help you with a virgin install.  I would say it’s pretty much like every other LAMP install.

Although I did *NOT* install phpmyadmin.  I’ve seen too many holes in it, and I prefer the command line anyways.

First I connect to my database instance:

mysql -uroot -pMYBADPASSWORD

And then execute the following:

create database wikimirror;
create user ‘wikimirror’@’localhost’ IDENTIFIED BY ‘MYOTHERPASSWORD’;
GRANT ALL PRIVILEGES ON wikimirror.* TO ‘wikimirror’@’localhost’ WITH GRANT OPTION;
show grants for ‘wikimirror’@’localhost’;

This creates the database, adds the user and grants them permission.

Downloading and setting up mediawiki 1.22.5 is pretty straight forward.  There is one big caveat I found though.  InnoDB is incredibly slow for loading the database. I spent a good 30 minutes trying to find a good solution before going back to MyISAM with utf8 support.

With the empty site created, I do a quick backup incase I want to purge what I have.

/usr/bin/mysqldump -uwikimirror -pw1k1p3d1a wikimirror > /usr/local/wikipedia/wikimedia-1.22.5-empty.sql

This way I can quickly revert as constantly re-installing mediawiki is… a pain.  And it gets repetitive which is good for introducing errors, so it’s far easier to dump the database/user and re-create them, and reload the empty database.

When I was using InnoDB, I was getting a mere 163 inserts a second. That means it would take about 24 hours to import the entire database!!  Which simply is not good enough for someone as impatient as me.  As of this latest dump there are 14,313,024 records that need to be inserted, which would take the better part of forever to do.

So let’s make some changes to the MySQL server config.  Naturally backup your existing /etc/mysql/my.cnf to something else, then I added the following bits:

 key_buffer = 1024M
max_allowed_packet = 384M
query_cache_limit = 18M
query_cache_size = 128M

I should add that I have a lot of system RAM available.  And that my box is running Debian 7.1 x64_86.

Next you’ll want a slightly modified import program,  I used the one from Michael Tsikerdekis’s site, but I did modify it to run the ‘precommit’ portion on it’s own.  I did this because I didn’t want to decompress the massive XML file on the filesystem.  I may have the space but it just seems silly.

With the script ready we can import!  Remember to restart the mysql server, and make sure it’s running correctly.  Then you can run:

bzcat enwiki-20140304-pages-articles.xml.bz2 | perl ./mwimport2 | mysql -f -u wikimirror -pMYOTHERBADPASSWORD  –default-character-set=utf8 wikimirror

And then you’ll see the progress flying by.  While it is loading you should be able to hit a random page, and get back some wikipedia looking data.  If you get an error well obviously something is wrong…

With my slight moddifications I was getting about 1000 inserts a second, which gave me…

 14313024 pages (1041.174/s),  14313024 revisions (1041.174/s) in 13747 seconds

Which ran in just under four hours.  Not too bad!

With the load all done, I shut down mysql, and then copy back the first config.  For the fun of it I did add in the following for day to day usage:

 key_buffer = 512M
max_allowed_packet = 128M
query_cache_limit = 18M
query_cache_size = 128M

I should add that the ‘default’ small config was enough for me to withstand over 16,000 hits a day when I got listed on reddit.  So it’s not bad for small-ish databases (my wordpress is about 250MB) that see a lot of action, but wikipedia is about 41GB.

Now for the weird stuff.  There is numerous weird errors that’ll appear on the pages.  I’ve tracked the majority down to lua scripting now being enabled on the template pages of wikipedia.  So you need to enable lua on your server, and setup the lua extensions.

The two that just had to be enabled to get things looking half right are:

With this done right, you’ll see Lua as part of installed software on the version page:

mediawiki installed softwareAnd under installed extensions:

wikimedia installed extensions

I did need to put the following in the LocalSettings.php file, but it’s in the installation bits for the extensions:

$wgLuaExternalInterpreter = “/usr/bin/lua5.1″;
require_once(“$IP/extensions/Lua/Lua.php”);
$wgScribuntoEngineConf[‘luastandalone’][‘luaPath’] = ‘/usr/bin/lua5.1′;
require_once( “$IP/extensions/Scribunto/Scribunto.php” );

Now when I load a page it still has some missing bits, but it’s looking much better.

The Amiga page...

The Amiga page…

Now I know the XOWA people have a torrent setup for about 75GB worth of images.  I just have to figure out how to get those and parse them into my wikipedia mirror.

I hope this will prove useful for someone in the future.  But if it looks too daunting, just use the XOWA.  Another solution is WP-MIRROR, although it can apparently take several days to load.

Interesting wiki site…

While looking something up, I came across this site:

http://gunkies.org/wiki/Main_Page

It’s a wiki dedicated to retro-computing, but more so in first person.. Not so much in the factual history of wikipedia. Anyways I thought it was worth mentioning, and I’ve started to redo some of my work here over there. Hopefully this will get more people interested in the whole thing. I’ve also been thinking about doing a windows installer for the z80 CP/M module from SIMH.. Hopefully the games & stuff work fine, so it’ll be FUN!

Proxmox VE

Well frankly I’ve been majorly disappointed with Microsoft’s latest offerings in the world of virtualization. Frankly it’s been one BIG step backwards in terms of management.

I mean check this well meaning blog on how “easy” it is to setup remote management. And of course for the most part it NEVER works.

I know this must be a major news flash to Microsoft but you see virtual servers are like mainframes. The zone 0 OS must be able to stand on it’s own, and have just enough to bootstrap the hypervisor and allow itself to be managed in a stand alone fashion. After all if it were in a domain, where do you think those domain controllers are? Yep they are Virtual machines! And how do you ‘manage’ a domain resource with no DC’s? The whole 2008 Hyper-V is a BIG miscalculation on Microsoft’s part. I hope they wake up and notice how they had a good thing and have destroyed it.

All this nonsense sent me searching for an alternative which I’m pretty sure I found a great blend of system emulation, and something like SUN containers for Linux. There is even a Debian etch based quick install version called Proxmox which incorporates KVM (The new Linux hypervisor) and OpenVZ. And of course it’s FREE!

The cool thing is that the main management works on a web page, the consoles can be controlled via a VNC viewer that uses JAVA, and it’s VERY quick to setup.

The system emulation KVM uses the core devices from Qemu so a lot of Qemu virtual machines will “just work” if you copy them over. If you are installing an OS onto the virtual machine the ‘easy’ way is with the physical CD, you can use ISO images, however they are awkard to use. You have to flag the VM to pause on startup switch over to the monitor page and issue the following command:

change ide1-cd0 /directory/isoimage.iso

then tell the emulator to start up with the ‘c’ command which will continue from the pause…. Yeah I know it’s not terribly eligant.

On the OpenVZ front, it’s FAST as there is no real emulated IO it’s native. So I decided to use the wiki template and setup a wikipedia mirror at home. If anyone feels as brave you too can find instructions here:

These are some of the table times to load:

601M pages.sql Query OK, 7,473,186 rows affected, 8 warnings (5 min 10.52 sec)
837M revision.sql Query OK, 7,473,200 rows affected, 65535 warnings (2 min 11.84 sec)
18G text.sql Query OK, 7,473,202 rows affected, 1 warning (12 min 12.07 sec)
20M category.txt Query OK, 471,207 rows affected (13.14 sec)
1.8G categorylinks Query OK, 24,501,837 rows affected, 30177 warnings (28 min 28.31 sec)
5.6G externallinks Query OK, 36,492,925 rows affected (3 min 50.34 sec)
362M latestimage Query OK, 807,906 rows affected, 2 warnings (34.35 sec)
555M imagelinks Query OK, 18,615,721 rows affected (10 min 49.60 sec)
32k interwiki Query OK, 651 rows affected (0.08 sec)
186M langlinks Query OK, 5,780,509 rows affected (2 min 17.75 sec)
2G logging Query OK, 16,398,421 rows affected (2 min 51.75 sec)
45M oldimage Query OK, 118,449 rows affected (1.97 sec)
7.6G pagelinks Query OK, 270,641,297 rows affected (6 hours 12 min 4.83 sec)
104M redirect Query OK, 3,234,481 rows affected (23.71 sec)
1.2G template-link Query OK, 48,885,222 rows affected (50 min 7.08 sec)
68k user_groups Query OK, 3,947 rows affected (0.11 sec)

Even the ‘longest’ part here with the 270 million records took six hours… Not too bad! That’s still 12,122.88 TPS!

Also as a tip for anyone else crazy enough to do a sizable mediawiki (like wikipedia) or any single server wiki look to this page.

The upshot is that by loading this APC
extension into PHP and mediawiki load times for my cached site went from 2-5 minutes to 1-10 seconds.

The OpenVZ portion has various application templates that can be loaded into the zones from CentOS, Debian, Ubuntu, to pre configured applications like the media wiki and a few others.

If anything I’d say that proxmox is what I was hoping Microsoft’s Hyper-V could have been. A container version of windows with easy remote admin along with some system emulation could have made things MASSIVLY easier to deal with. It’s a shame they decided to go with this bizarre WMI based thing.