Setting the time machine for June 20th 2011

John Titor hunting Orange wine, and IBM 5100’s

No, not that time machine, this one is a rehash of the old local Wikipedia mirror.

Sadly I didn't keep the source files, as I thought they were evergreen, and it turns out they are NOT. But thankfully there is a 2011 set on archive.org, listed as enwiki-20110620-item-1-of-2 and enwiki-20110620-item-2-of-2. Sadly there aren't any torrents of these files, and as of today the Internet Archive torrent servers seem to be dead, so a direct download is needed.

Getting started

You are going to need a LOT of disk space. It's about 10GB for the downloaded compressed data, and with the pages blown out into a database it's ~60GB. Yes, it's massive. You'll also need enough space for a Debian 7 VM, unless you want to spend a lot of your time trying to decode ancient Perl (yes, it really is a write-only language). I didn't bother trying to figure out why the import script doesn't work on anything modern; instead I used netcat and a Debian 7 VM.

Thanks to trn, who suggested aria2c, which did a great job of downloading everything, albeit one URL at a time, but that's fine.

aria2c -x 16 -s 16 -j 16 <<URL>>

I downloaded the following files:

  • enwiki-20110620-all-titles-in-ns0.gz
  • enwiki-20110620-category.sql.gz
  • enwiki-20110620-categorylinks.sql.gz
  • enwiki-20110620-externallinks.sql.gz
  • enwiki-20110620-flaggedpages.sql.gz
  • enwiki-20110620-flaggedrevs.sql.gz
  • enwiki-20110620-image.sql.gz
  • enwiki-20110620-imagelinks.sql.gz
  • enwiki-20110620-interwiki.sql.gz
  • enwiki-20110620-iwlinks.sql.gz
  • enwiki-20110620-langlinks.sql.gz
  • enwiki-20110620-oldimage.sql.gz
  • enwiki-20110620-page.sql.gz
  • enwiki-20110620-pagelinks.sql.gz
  • enwiki-20110620-pages-articles.xml.bz2
  • enwiki-20110620-pages-logging.xml.gz
  • enwiki-20110620-page_props.sql.gz
  • enwiki-20110620-page_restrictions.sql.gz
  • enwiki-20110620-protected_titles.sql.gz
  • enwiki-20110620-redirect.sql.gz
  • enwiki-20110620-site_stats.sql.gz
  • enwiki-20110620-templatelinks.sql.gz
  • enwiki-20110620-user_groups.sql.gz

Although the bulk of what you want lives in a single file, enwiki-20110620-pages-articles.xml.bz2, which is 7.5 GB, the rest of the files add another 10GB, rounding this out to 17.5GB to download. Yikes!
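
If you don't feel like pasting each URL in by hand, a shell loop over the filenames does the job. A minimal sketch, assuming the usual archive.org download URL layout; which files live under item-1-of-2 versus item-2-of-2 is something to check on the IA page itself:

base=https://archive.org/download/enwiki-20110620-item-1-of-2
for f in enwiki-20110620-pages-articles.xml.bz2 \
         enwiki-20110620-page.sql.gz \
         enwiki-20110620-redirect.sql.gz   # ...plus the rest of the list above
do
    aria2c -x 16 -s 16 -j 16 "$base/$f"
done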

MySQL on WSLv2

I’m using Ubuntu 20.04 LTS on Windows 11, so adding MySQL is done via the MariaDB version with a simple apt-get install:

apt-get install mariadb-server mariadb-common mariadb-client

Installing MySQL is kind of easy, although it will need the PID file directory created in the right place, with permissions set so it can write to it:

mkdir -p /var/run/mysqld
chown mysql:mysql /var/run/mysqld

Otherwise you’ll get this:

[ERROR] mysqld: Can't create/write to file '/var/run/mysqld/mysqld.pid' (Errcode: 2 "No such file or directory")

Additionally you'll need to tell it to bind to 0.0.0.0 instead of 127.0.0.1, as we'll want this on the network. I'm on an isolated LAN so it's fine by me, but of course your mileage may vary. For me, a simple diff of the config directory is this:

diff -ruN etc/mysql/mariadb.conf.d/50-server.cnf /etc/mysql/mariadb.conf.d/50-server.cnf
--- etc/mysql/mariadb.conf.d/50-server.cnf      2021-11-21 08:22:31.000000000 +0800
+++ /etc/mysql/mariadb.conf.d/50-server.cnf     2022-03-11 10:01:45.369272200 +0800
@@ -27,7 +27,7 @@

 # Instead of skip-networking the default is now to listen only on
 # localhost which is more compatible and is not less secure.
-bind-address            = 127.0.0.1
+bind-address            = 0.0.0.0

 #
 # * Fine Tuning
@@ -43,6 +43,11 @@
 #max_connections        = 100
 #table_cache            = 64

+key_buffer_size = 1G
+max_allowed_packet = 1G
+query_cache_limit = 18M
+query_cache_size = 128M
+
 #
 # * Logging and Replication
 #

As far as I know MySQL doesn't run on WSLv1, so people with that restriction are kind of SOL. At the same time, Debian 7 doesn't run on Hyper-V for me, so I had to run VMware Player. And if you can't run Hyper-V/WSLv2 at all, you can run everything on Debian 7, which is probably easier. You'll probably hit some performance issues in the import there; either my machine is fast enough that I don't care, or the newer stuff is pre-configured for machines a bit larger than an ISA/PCI gen1 Pentium 60.

I run mysqld manually in a window, as I'm only doing this ad-hoc, not as a service. Although on the Windows 10 machine I used to reproduce and test this, mysqld won't run interactively; instead I had to do a 'service mysql start' to get it running. So I guess you'll have to find out the hard way.

Next, be sure to create the database and a user so this will work:

create database wikidb;
create user 'wikiuser'@'%' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON wikidb.* TO 'wikiuser'@'%' WITH GRANT OPTION;
show grants for 'wikiuser'@'%';

Something like this works well. Yes the password is password but it’s all internal so who cares. If you don’t like it, change it as needed.

With the database & user created you’ll want to make sure that you can connect from the Debian 7 machine with something like this:

mysql -h 192.168.6.10 -uwikiuser -ppassword wikidb

The Debian 7 machine matters because I don't think PHP 7, or whatever is modern, will run the ancient MediaWiki version 1.15.5 that I'm using.

This is my setup as I’m writing this so bear with me.

Prepping Apache

Since I have that Debian 7 VM, I used that for setting up MediaWiki. Looking at my apt-cache I believe I loaded the following packages:

  • mysql-client
  • mysql-common
  • apache2
  • apache2.2-bin
  • apache2.2-common
  • apache2-mpm-prefork
  • apache2-mpm-worker
  • apache2-utils
  • libapache2-mod-php5
  • php5-cli
  • php5-common
  • php5-mysql
  • lua5.1
  • liblua5.1

On the Apache side I have the following modules enabled:

alias authz_default authz_user deflate mime reqtimeout
auth_basic authz_groupfile autoindex dir negotiation setenvif
authn_file authz_host cgi env php5 status

Which I think is pretty generic.
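
Most of those are on by default in a stock Debian 7 / Apache 2.2 install; anything that isn't can be switched on with a2enmod, something along these lines (pick whichever of the names above aren't already enabled):

a2enmod deflate reqtimeout status cgi
service apache2 restart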

I used mediawiki-1.15.5 as the basis, mostly because I had started with an incomplete 2010 dump; after finding this 2011 dump I probably should have gone with 1.16.5 or 1.17.5. Oh well. When connecting from Debian 7 to my 'modern' MariaDB there is one table definition that needs to be updated, otherwise the install will fail. The diff that needs to be applied (with the least amount of effort spent by me!) is this:

--- maintenance/tables.sql      2009-03-20 19:20:39.000000000 +0800
+++ /var/www/maintenance/tables.sql     2022-03-07 14:21:25.580318700 +0800
@@ -1099,7 +1099,7 @@

 CREATE TABLE /*_*/trackbacks (
   tb_id int PRIMARY KEY AUTO_INCREMENT,
-  tb_page int REFERENCES /*_*/page(page_id) ON DELETE CASCADE,
+  tb_page int,
   tb_title varchar(255) NOT NULL,
   tb_url blob NOT NULL,
   tb_ex text,
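
To apply it, save that hunk to a file and point patch at the copy of tables.sql under your web root; a minimal sketch, assuming MediaWiki is unpacked in /var/www and the diff was saved as trackbacks.diff:

patch /var/www/maintenance/tables.sql < trackbacks.diff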

All being well and patched you can do the install! I just do a super basic install, nothing exciting. In my setup the MySQL server is on 192.168.6.10. I don’t think I changed much of anything?

And with that done if all goes well you’ll get the install completed!

If you get anything else, drop the database and try again (the permission grants stay, because MySQL doesn't actually drop things associated with databases... :shrug:).

Next, in the extensions folder I grabbed Scribunto-REL1_35-04b897f.tar.gz, which is still on the extensions site. This required Lua 5.1 and the following to be appended to LocalSettings.php:

#
$wgScribuntoEngineConf['luastandalone']['luaPath'] = '/usr/bin/lua5.1';

$wgScribuntoUseGeSHi = true;
$wgScribuntoUseCodeEditor = true;
#

Keep in mind the original extensions I used are no longer available, and appear to never have been archived, so yeah.
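
For what it's worth, getting the extension in place is just an untar into the extensions directory. A minimal sketch, assuming the tarball was downloaded to /root and unpacks into a Scribunto folder; the loader line itself is whatever the bundled install notes say, so I won't guess at it here:

cd /var/www/extensions
tar -xzf /root/Scribunto-REL1_35-04b897f.tar.gz   # then append the LocalSettings.php lines above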

Doing the pages.xml import

You can find the version 0.5 MediaWiki import script (mwimport) on archive.org. Obviously, if you are deviating from this dump, check the first 5-10 lines of the decompressed bz2 file to see what export version you have, and look around IA to time-travel for a matching script. I have no idea about modern ones, as it's hard enough trying to reproduce an old experiment.
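
A quick way to peek at that without decompressing the whole 7.5 GB is to pipe the start of the file through bzcat; the export schema version shows up in the opening XML:

bzcat enwiki-20110620-pages-articles.xml.bz2 | head -5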

First you need to make some files to set up the pre/post conditions of the insert. It's about 11,124,050 pages, give or take.

pre.sql

SET autocommit=0;
SET unique_checks=0;
SET foreign_key_checks=0;
BEGIN;

post.sql

COMMIT;
SET autocommit=1;
SET unique_checks=1;
SET foreign_key_checks=1;

Running the actual import

I’m assuming that 192.168.6.33 is the Debian 7 machine, 192.168.6.10 is the Windows 11 machine.

On the machine with the data:

netcat 192.168.6.33 9909 < enwiki-20110620-pages-articles.xml.bz2

On the machine that can run the mwimport script:

netcat -l -p 9909 | bzip2 -dc | ./mwimport-0.5.pl | netcat 192.168.6.10 9906

And finally on the MySQL machine:

(cat pre.sql; netcat -l -p 9906 ; cat post.sql) | mysql -f --default-character-set=utf8 wikidb

Since I'm using WSLv2, the Linux side sits on its own NAT'd address, so the ports have to be forwarded from the Windows side with netsh (from an Administrator CMD prompt); the Windows firewall may also screw stuff up here:

netsh interface portproxy add v4tov4 listenaddress=192.168.6.10 listenport=3306 connectaddress=172.24.167.66 connectport=3306
netsh interface portproxy add v4tov4 listenaddress=192.168.6.10 listenport=9906 connectaddress=172.24.167.66 connectport=9906
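
If it turns out to be the Windows firewall itself eating the connections rather than the port forwarding, the allow rules are also a netsh one-liner each; the rule names here are just my own labels:

netsh advfirewall firewall add rule name="wiki mysql" dir=in action=allow protocol=TCP localport=3306
netsh advfirewall firewall add rule name="wiki import" dir=in action=allow protocol=TCP localport=9906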

On my setup it takes about 2.5 hours to load the database, which will be about 51GB.

11340000 pages (1231.805/s),  11340000 revisions (1231.805/s) in 9206 seconds

The savvy among you may notice the -f flag to the mysql parser. And yes that is because there *will* be errors during the process.

I'm not sure why it happens or what to do about it, but without the -f (force) flag the process will stop around the 2 million row mark. Forcing it allows the process to continue.

With that done I get the following tallies…

MariaDB [(none)]> SELECT table_name, table_rows FROM INFORMATION_SCHEMA.TABLES    WHERE TABLE_SCHEMA = 'wikidb' and
table_rows > 0;
+---------------+------------+
| table_name    | table_rows |
+---------------+------------+
| interwiki     |         85 |
| objectcache   |         10 |
| page          |   10839464 |
| revision      |   11357659 |
| text          |   14491759 |
| user_groups   |          2 |
+---------------+------------+
9 rows in set (0.002 sec)

If all of this worked (amazing!) then search for something like 1001 and be greeted with:

1001: a non odyssey

MySQL disappointments

So with this in place, having some 51GB laying around just seemed lame. Using WSLv2 I set up a compressed folder on NTFS and moved the data directory into it, which gets it down to a somewhat more manageable 20GB. Since the data doesn't change I had a better idea: SquashFS. Well, it compresses down to 12GB, HOWEVER for the life of me I can't find anything concrete on using a read-only backing store with MySQL. Even general MediaWiki usage seems to want to write to all the tables, I guess it's index searching?! Insane! And it appears MySQL can only use a single storage file per table? Yeah, this isn't MSSQL, with stuff like a database on CD-ROM and the log on a floppy. I tried doing a union overlay filesystem, but it makes a 100% copy of any file that changes. That's not good. I guess using qemu-img for a compressed qcow2 with a writable diff file could hide the read-only compressed backing store, but I've already lost interest.

Maybe it’s just me, but it seems like there should be a way to write logs/updates/scratch to a RW place, and keep the majority of the data read-only (and highly compressed).
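
For anyone who wants to chase that qcow2 idea further, the rough shape of it would be something like the following. This is a sketch, not something I actually ran, and it assumes the MySQL data directory has been packed into a bare ext4 image called wikidb.raw:

# one-off: build a compressed, read-only base image from the datadir image
qemu-img convert -c -O qcow2 wikidb.raw wikidb-base.qcow2
# small writable overlay that hides the read-only base (drop -F on older qemu-img)
qemu-img create -f qcow2 -b wikidb-base.qcow2 -F qcow2 wikidb-overlay.qcow2
# expose the overlay as a block device and mount it where MySQL expects its datadir
modprobe nbd
qemu-nbd --connect=/dev/nbd0 wikidb-overlay.qcow2
mount /dev/nbd0 /var/lib/mysql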

Why doesn’t stuff format correctly

There seems to be a lot of formatting nonsense going on; I probably should step up to MediaWiki 1.17. I'll also add in loading the other SQL tables, since they are straight-up inserts. Also, the extensions I know I loaded don't seem to exist in any form anymore, and the images I snapshotted of the install are all long gone. It'll require more diving around.
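
Loading those extra tables is just a matter of feeding each dump straight into mysql, with the same -f and charset flags as the big import; something along these lines:

for f in enwiki-20110620-*.sql.gz
do
    [ "$f" = "enwiki-20110620-page.sql.gz" ] && continue   # the page table already came in via the XML import
    zcat "$f" | mysql -f --default-character-set=utf8 wikidb
done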

Installing mediawiki on WAMP

Building on our WAMP installation, we are now going to install mediawiki.

The first thing I'd recommend is to move the contents of c:\wamp\www into another directory… I just shoved the terminal thing into c:\wamp\terminal.

Now mediawiki is the software that powers wikipedia. It’s a great collaboration platform, it has built in revision control, and best of all it’s free.

It's also VERY simple to set up, well, compared to other web content platforms.

The current version is 1.16, which can be downloaded here. As things change, you may be best served by just visiting the main download site.

Since most 'AMP' servers are Linux based, mediawiki ships as a .tar.gz, so we'll have to get gzip & tar to extract it on Windows. It's very easy though.

Simply type this in to extract mediawiki

C:\temp>dir
Volume in drive C has no label.
Volume Serial Number is FC55-C2F4

Directory of C:\temp

12/28/2010 08:15 PM DIR .
12/28/2010 08:15 PM DIR ..
12/28/2010 08:13 PM 49,152 gzip.exe
12/28/2010 08:15 PM 12,647,934 mediawiki-1.16.0.tar.gz
12/28/2010 08:13 PM 114,688 tar.exe
3 File(s) 12,811,774 bytes
2 Dir(s) 7,073,234,944 bytes free

C:\temp>gzip -dc mediawiki-1.16.0.tar.gz | tar -xf -

C:\temp>

Ok, now with mediawiki extracted we just move the contents of c:\temp\mediawiki-1.16.0 into c:\wamp\www

Now before we go on, we are going to set a password for the MySQL process. On the off chance someone is following this on a server to deploy on the internet, it'd be crazy to leave it with no password.

So left click on the WAMP system tray icon, go to MySQL, and bring up the MySQL Console.

(screenshot: the WAMP tray menu with the MySQL console option)

Just hit enter for the password as there isn’t one.

Next, run this SQL statement to set the password for the root user to 'password'. Or pick your own, better password.

mysql> use mysql;
Database changed
mysql> update user set password=PASSWORD("password") where User='root';
Query OK, 3 rows affected (0.05 sec)
Rows matched: 3 Changed: 3 Warnings: 0

Now restart the MySQL service by clicking on the system tray icon, then MySQL, then Service, then 'Restart Service'. If you don't do this the password change will not take effect!
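
(For what it's worth, since we updated the user table directly, asking MySQL to re-read the grant tables from the same console should also do the trick, if you'd rather not bounce the service:)

mysql> FLUSH PRIVILEGES;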

With that out of the way, it’s time to configure mediawiki. Simply open up a web browser to the following location:

http://localhost

And you should see something like this:

(screenshot: the WAMP landing page with the MediaWiki setup link)

Click the setup link, and let’s walk through the options…

First is the wikiname. I'm just going to call mine 'test wiki'. Put in your own contact email, so that mediawiki will email YOU if anything is going on… I left the language in English, and left the license alone. The next important thing to do is to select an Admin username and password. This is all up to you. Just remember that the Username is CaSe SeNsItIvE!!!

Leave the caching off.

The next section is for the email notifications, I just left those as default.

The final thing to configure is the database.

Since we are going to keep this simple, just set the DB username to root, and put in the password you configured earlier in the MySQL Console. Next check the ‘superuser account’ box, and specify root and the password again.

You can now click the Install MediaWiki button!

You’ll see some information printed on the page, and if everything goes according to plan, you’ll get the message:

Installation successful! Move the config/LocalSettings.php file to the parent directory, then follow this link to your wiki.

You should change file permissions for LocalSettings.php as required to prevent other users on the server reading passwords and altering configuration data

So simply copy the file c:\wamp\www\config\LocalSettings.php to c:\wamp\www\
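
From a command prompt that's a one-liner, something like:

C:\>copy c:\wamp\www\config\LocalSettings.php c:\wamp\www\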

Then simply click the following link to be taken to your personal wiki:

http://localhost/index.php

(screenshot: the freshly installed wiki's main page)

And that should take care of it!

Proxmox VE

Well, frankly, I've been majorly disappointed with Microsoft's latest offerings in the world of virtualization. It's been one BIG step backwards in terms of management.

I mean, check this well-meaning blog on how "easy" it is to set up remote management. And of course, for the most part it NEVER works.

I know this must be a major news flash to Microsoft, but you see, virtual servers are like mainframes. The zone 0 OS must be able to stand on its own, and have just enough to bootstrap the hypervisor and allow itself to be managed in a standalone fashion. After all, if it were in a domain, where do you think those domain controllers are? Yep, they are virtual machines! And how do you 'manage' a domain resource with no DCs? The whole 2008 Hyper-V is a BIG miscalculation on Microsoft's part. I hope they wake up and notice how they had a good thing and have destroyed it.

All this nonsense sent me searching for an alternative, and I'm pretty sure I found one: a great blend of system emulation and something like Sun containers for Linux. There is even a Debian Etch based quick-install version called Proxmox, which incorporates KVM (the new Linux hypervisor) and OpenVZ. And of course it's FREE!

The cool thing is that the main management works on a web page, the consoles can be controlled via a VNC viewer that uses Java, and it's VERY quick to set up.

The KVM system emulation uses the core devices from QEMU, so a lot of QEMU virtual machines will "just work" if you copy them over. If you are installing an OS onto a virtual machine, the 'easy' way is with a physical CD; you can use ISO images, however they are awkward to use. You have to flag the VM to pause on startup, switch over to the monitor page, and issue the following command:

change ide1-cd0 /directory/isoimage.iso

Then tell the emulator to start up with the 'c' command, which will continue from the pause… Yeah, I know, it's not terribly elegant.
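
So the whole dance at the monitor ends up looking something like this (the (qemu) prompt here is the stock QEMU monitor; Proxmox's monitor tab should behave the same way):

(qemu) change ide1-cd0 /directory/isoimage.iso
(qemu) c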

On the OpenVZ front, it's FAST, as there is no emulated I/O; it's native. So I decided to use the wiki template and set up a Wikipedia mirror at home. If anyone feels as brave, you too can find instructions here:

These are some of the table times to load:

601M pages.sql Query OK, 7,473,186 rows affected, 8 warnings (5 min 10.52 sec)
837M revision.sql Query OK, 7,473,200 rows affected, 65535 warnings (2 min 11.84 sec)
18G text.sql Query OK, 7,473,202 rows affected, 1 warning (12 min 12.07 sec)
20M category.txt Query OK, 471,207 rows affected (13.14 sec)
1.8G categorylinks Query OK, 24,501,837 rows affected, 30177 warnings (28 min 28.31 sec)
5.6G externallinks Query OK, 36,492,925 rows affected (3 min 50.34 sec)
362M latestimage Query OK, 807,906 rows affected, 2 warnings (34.35 sec)
555M imagelinks Query OK, 18,615,721 rows affected (10 min 49.60 sec)
32k interwiki Query OK, 651 rows affected (0.08 sec)
186M langlinks Query OK, 5,780,509 rows affected (2 min 17.75 sec)
2G logging Query OK, 16,398,421 rows affected (2 min 51.75 sec)
45M oldimage Query OK, 118,449 rows affected (1.97 sec)
7.6G pagelinks Query OK, 270,641,297 rows affected (6 hours 12 min 4.83 sec)
104M redirect Query OK, 3,234,481 rows affected (23.71 sec)
1.2G template-link Query OK, 48,885,222 rows affected (50 min 7.08 sec)
68k user_groups Query OK, 3,947 rows affected (0.11 sec)

Even the ‘longest’ part here with the 270 million records took six hours… Not too bad! That’s still 12,122.88 TPS!

Also, as a tip for anyone else crazy enough to do a sizable MediaWiki (like Wikipedia) or any single-server wiki, look to this page.

The upshot is that by loading the APC extension into PHP, MediaWiki load times for my cached site went from 2-5 minutes to 1-10 seconds.
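
I don't have the exact settings from back then, but the general shape of it is: load APC in php.ini, then point MediaWiki's object cache at the accelerator in LocalSettings.php. A sketch; the shm_size value is a guess (and older APC builds want it as a bare number of megabytes):

; php.ini (or a conf.d snippet)
extension=apc.so
apc.enabled=1
apc.shm_size=128M

// LocalSettings.php
$wgMainCacheType = CACHE_ACCEL;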

The OpenVZ portion has various templates that can be loaded into the zones, from base CentOS, Debian, and Ubuntu to pre-configured applications like MediaWiki and a few others.

If anything, I'd say that Proxmox is what I was hoping Microsoft's Hyper-V could have been. A container version of Windows with easy remote admin, along with some system emulation, could have made things MASSIVELY easier to deal with. It's a shame they decided to go with this bizarre WMI-based thing.