Mirroring Wikipedia

So I had an internet outage, and it got me thinking: if I were trapped on my proverbial desert island, what would I want with me?

Well, Wikipedia would be nice!

So I started with this ExtremeTech article by Sebastian Anthony, although it has since drifted out of date in a few places.

But it is enough to get you started.

I downloaded my XML dump from the Brazilian mirror he mentions.  The files I got were:

  • enwiki-20140304-pages-articles.xml.bz2 10GB
  • enwiki-20140304-all-titles-in-ns0.gz 58MB
  • enwiki-20140304-interwiki.sql.gz 728KB
  • enwiki-20140304-redirect.sql.gz 91MB
  • enwiki-20140304-protected_titles.sql.gz 887KB

The pages-articles.xml dump is the only one that is required.  I added the others in the hopes of fixing some formatting issues.  I also re-compressed the 10GB bzip2 file down to 8.4GB with 7-Zip.  It’s still massive, but when you are on a ‘slow’ connection every saved GB matters.
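
For reference, the fetch and re-compress can both be done from the command line.  This is just a sketch: the dumps.wikimedia.org URL is the canonical dump site rather than the Brazilian mirror from the article, and it assumes the p7zip-full package, which provides 7za.

wget https://dumps.wikimedia.org/enwiki/20140304/enwiki-20140304-pages-articles.xml.bz2
# stream-recompress bzip2 -> 7z without leaving a decompressed copy on disk
bzcat enwiki-20140304-pages-articles.xml.bz2 | 7za a -si enwiki-20140304-pages-articles.xml.7z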

Since I already have Apache/PHP/MySQL running on my Debian box, I can’t help you with a virgin install.  I would say it’s pretty much like every other LAMP install.
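
If you are starting from scratch, something along these lines should get a stock LAMP stack onto Debian 7 (package names from memory, so treat it as a rough sketch rather than gospel):

# Apache, PHP and MySQL on Debian 7
apt-get install apache2 php5 php5-mysql mysql-server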

Although I did *NOT* install phpMyAdmin.  I’ve seen too many holes in it, and I prefer the command line anyway.

First I connect to my database instance:

mysql -uroot -pMYBADPASSWORD

And then execute the following:

create database wikimirror;
create user 'wikimirror'@'localhost' IDENTIFIED BY 'MYOTHERPASSWORD';
GRANT ALL PRIVILEGES ON wikimirror.* TO 'wikimirror'@'localhost' WITH GRANT OPTION;
show grants for 'wikimirror'@'localhost';

This creates the database, adds the user and grants them permission.

Downloading and setting up MediaWiki 1.22.5 is pretty straightforward.  There is one big caveat I found, though: InnoDB is incredibly slow for loading the database.  I spent a good 30 minutes trying to find a good solution before going back to MyISAM with UTF-8 support.
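
If you already ran the installer with InnoDB, you can check what you ended up with and convert the big tables instead of reinstalling.  This is only a sketch, and it assumes the installer was pointed at the wikimirror database with no table prefix:

-- see which engine each MediaWiki table is using
SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'wikimirror';
-- convert the three tables the import hammers hardest
ALTER TABLE page ENGINE=MyISAM;
ALTER TABLE revision ENGINE=MyISAM;
ALTER TABLE text ENGINE=MyISAM;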

With the empty site created, I do a quick backup in case I want to purge what I have.

/usr/bin/mysqldump -uwikimirror -pMYOTHERPASSWORD wikimirror > /usr/local/wikipedia/wikimedia-1.22.5-empty.sql

This way I can quickly revert, as constantly re-installing MediaWiki is… a pain.  It also gets repetitive, which is a great way to introduce errors, so it’s far easier to drop the database and user, re-create them, and reload the empty database.
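
A quick sketch of that reset, assuming the same database, user, and dump file as above:

mysql -uroot -pMYBADPASSWORD -e "DROP DATABASE wikimirror; CREATE DATABASE wikimirror;"
mysql -uwikimirror -pMYOTHERPASSWORD wikimirror < /usr/local/wikipedia/wikimedia-1.22.5-empty.sql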

When I was using InnoDB, I was getting a mere 163 inserts a second.  As of this latest dump there are 14,313,024 records that need to be inserted, and at 163 per second that works out to roughly 88,000 seconds, or about 24 hours to import the entire database!  Which simply is not good enough for someone as impatient as me.

So let’s make some changes to the MySQL server config.  Naturally, back up your existing /etc/mysql/my.cnf to something else first; then I added the following bits:

key_buffer = 1024M
max_allowed_packet = 384M
query_cache_limit = 18M
query_cache_size = 128M

I should add that I have a lot of system RAM available, and that my box is running Debian 7.1 x86_64.
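
The swap-in and restart looks roughly like this on Debian 7 (sysvinit), keeping the original config around for later:

cp /etc/mysql/my.cnf /etc/mysql/my.cnf.orig
# edit /etc/mysql/my.cnf and add the key_buffer / packet / query cache settings above
service mysql restart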

Next you’ll want a slightly modified import program.  I used the one from Michael Tsikerdekis’s site, but I modified it to run the ‘precommit’ portion on its own.  I did this because I didn’t want to decompress the massive XML file onto the filesystem.  I may have the space, but it just seems silly.
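
I won’t reproduce his script here, but to give a feel for the kind of pre-import setup involved, the usual MyISAM bulk-load trick (not necessarily what his precommit actually does) is to disable the indexes before the load and rebuild them afterwards:

mysql -uwikimirror -pMYOTHERPASSWORD wikimirror -e "ALTER TABLE page DISABLE KEYS; ALTER TABLE revision DISABLE KEYS; ALTER TABLE text DISABLE KEYS;"
# ...run the import...
mysql -uwikimirror -pMYOTHERPASSWORD wikimirror -e "ALTER TABLE page ENABLE KEYS; ALTER TABLE revision ENABLE KEYS; ALTER TABLE text ENABLE KEYS;"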

With the script ready we can import!  Remember to restart the mysql server, and make sure it’s running correctly.  Then you can run:

bzcat enwiki-20140304-pages-articles.xml.bz2 | perl ./mwimport2 | mysql -f -u wikimirror -pMYOTHERPASSWORD --default-character-set=utf8 wikimirror

And then you’ll see the progress flying by.  While it is loading you should be able to hit a random page and get back some Wikipedia-looking data.  If you get an error, well, obviously something is wrong…
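
If you’d rather watch numbers than refresh random pages, a rough way to follow progress is to count rows as they land (again assuming the default table names with no prefix):

watch -n 60 'mysql -uwikimirror -pMYOTHERPASSWORD wikimirror -e "SELECT COUNT(*) FROM page"'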

With my slight modifications I was getting about 1,000 inserts a second, which gave me…

 14313024 pages (1041.174/s),  14313024 revisions (1041.174/s) in 13747 seconds

Which ran in just under four hours.  Not too bad!

With the load all done, I shut down MySQL and copied back the first config.  For the fun of it I did add in the following for day-to-day usage:

key_buffer = 512M
max_allowed_packet = 128M
query_cache_limit = 18M
query_cache_size = 128M

I should add that the ‘default’ small config was enough for me to withstand over 16,000 hits a day when I got listed on Reddit.  So it’s not bad for small-ish databases that see a lot of action (my WordPress one is about 250MB), but the Wikipedia database is about 41GB.

Now for the weird stuff.  There are numerous weird errors that’ll appear on the pages.  I’ve tracked the majority down to Lua scripting now being used on Wikipedia’s template pages.  So you need to enable Lua on your server and set up the Lua extensions.

The two that just had to be enabled to get things looking half right are the Lua and Scribunto extensions.
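
Roughly, the server-side part looks like this on Debian 7; the extensions directory is just wherever your MediaWiki happens to live, so adjust to taste:

# the standalone interpreter Scribunto will call
apt-get install lua5.1
# then unpack the Lua and Scribunto extension tarballs (grab the REL1_22 builds from mediawiki.org)
# into your wiki's extensions/ directory so you end up with extensions/Lua and extensions/Scribunto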

With this done right, you’ll see Lua as part of the installed software on the version page, and the Lua and Scribunto extensions listed under installed extensions.

I did need to put the following in the LocalSettings.php file, but it’s in the installation bits for the extensions:

$wgLuaExternalInterpreter = "/usr/bin/lua5.1";
require_once("$IP/extensions/Lua/Lua.php");
$wgScribuntoEngineConf['luastandalone']['luaPath'] = '/usr/bin/lua5.1';
require_once("$IP/extensions/Scribunto/Scribunto.php");
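
Depending on the Scribunto version you grab, you may also have to pick the engine explicitly.  If templates still complain about Scribunto, I believe the setting to try is:

$wgScribuntoDefaultEngine = 'luastandalone';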

Now when I load a page it still has some missing bits, but it’s looking much better.

The Amiga page…

Now, I know the XOWA people have a torrent set up with about 75GB worth of images.  I just have to figure out how to get those and parse them into my Wikipedia mirror.

I hope this will prove useful for someone in the future.  But if it looks too daunting, just use XOWA.  Another solution is WP-MIRROR, although it can apparently take several days to load.
