Mirroring Wikipedia

So I had an internet outage, and it got me thinking: if I were trapped on my proverbial desert island, what would I want with me?

Well, Wikipedia would be nice!

So I started with this ExtremeTech article by Sebastian Anthony, although it has since drifted out of date on a few things.

But it is enough to get you started.

I downloaded my XML dump from the Brazilian mirror he mentions. The files I got were:

  • enwiki-20140304-pages-articles.xml.bz2 (10 GB)
  • enwiki-20140304-all-titles-in-ns0.gz (58 MB)
  • enwiki-20140304-interwiki.sql.gz (728 KB)
  • enwiki-20140304-redirect.sql.gz (91 MB)
  • enwiki-20140304-protected_titles.sql.gz (887 KB)

The pages-articles.xml file is required; I added in the others in the hope of fixing some formatting issues. I also re-compressed it, taking it from the 10 GB bzip2 original down to 8.4 GB with 7-Zip. It’s still massive, but when you are on a ‘slow’ connection every saved GB matters.
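For reference, that recompression can be done in one pipe, so the decompressed XML (tens of gigabytes) never has to touch the disk. A sketch, assuming the p7zip package (which provides 7za) is installed:

bzcat enwiki-20140304-pages-articles.xml.bz2 | 7za a -si enwiki-20140304-pages-articles.xml.7z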

Since I already have Apache/PHP/MySQL running on my Debian box, I can’t help you with a virgin install. I would say it’s pretty much like every other LAMP install.

Although I did *NOT* install phpMyAdmin. I’ve seen too many holes in it, and I prefer the command line anyway.

First I connect to my database instance:

mysql -uroot -pMYBADPASSWORD

And then execute the following:

create database wikimirror;
create user 'wikimirror'@'localhost' IDENTIFIED BY 'MYOTHERPASSWORD';
GRANT ALL PRIVILEGES ON wikimirror.* TO 'wikimirror'@'localhost' WITH GRANT OPTION;
show grants for 'wikimirror'@'localhost';

This creates the database, adds the user and grants them permission.

Downloading and setting up MediaWiki 1.22.5 is pretty straightforward. There is one big caveat I found, though: InnoDB is incredibly slow for loading the database. I spent a good 30 minutes trying to find a good solution before going back to MyISAM with utf8 support.
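If the installer has already created the tables as InnoDB, you can convert them after the fact rather than re-installing. A rough sketch (the password is the placeholder from above; page, revision and text are the MediaWiki tables that take the bulk of the import, but you may want to sweep the rest too):

# convert the big MediaWiki tables from InnoDB to MyISAM
for t in page revision text; do
    mysql -uwikimirror -pMYOTHERPASSWORD wikimirror -e "ALTER TABLE $t ENGINE=MyISAM;"
done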

With the empty site created, I do a quick backup in case I want to purge what I have.

/usr/bin/mysqldump -uwikimirror -pw1k1p3d1a wikimirror > /usr/local/wikipedia/wikimedia-1.22.5-empty.sql

This way I can quickly revert, as constantly re-installing MediaWiki is… a pain. It also gets repetitive, which is good for introducing errors, so it’s far easier to dump the database and user, re-create them, and reload the empty database.
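The revert cycle then boils down to a couple of one-liners, something like this (passwords are the placeholders from above; the grants survive a drop and re-create of the database):

mysql -uroot -pMYBADPASSWORD -e "DROP DATABASE wikimirror; CREATE DATABASE wikimirror;"
mysql -uwikimirror -pMYOTHERPASSWORD wikimirror < /usr/local/wikipedia/wikimedia-1.22.5-empty.sql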

When I was using InnoDB, I was getting a mere 163 inserts a second. As of this latest dump there are 14,313,024 records that need to be inserted, so at that rate the import would take about 24 hours (14,313,024 ÷ 163 is roughly 88,000 seconds), which simply is not good enough for someone as impatient as me.

So let’s make some changes to the MySQL server config. Naturally, back up your existing /etc/mysql/my.cnf to something else first; then I added the following bits:

key_buffer = 1024M
max_allowed_packet = 384M
query_cache_limit = 18M
query_cache_size = 128M

I should add that I have a lot of system RAM available, and that my box is running Debian 7.1 x86_64.
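After swapping in the new config, bounce MySQL and double-check that the new values actually took effect; on Debian 7 that’s something along these lines:

/etc/init.d/mysql restart
mysqladmin -uroot -pMYBADPASSWORD variables | grep -E 'key_buffer_size|max_allowed_packet'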

Next you’ll want a slightly modified import program. I used the one from Michael Tsikerdekis’s site, but I modified it to run the ‘precommit’ portion on its own. I did this because I didn’t want to decompress the massive XML file onto the filesystem; I may have the space, but it just seems silly.

With the script ready, we can import! If you haven’t already restarted the MySQL server with the new config, do so now and make sure it’s running correctly. Then you can run:

bzcat enwiki-20140304-pages-articles.xml.bz2 | perl ./mwimport2 | mysql -f -u wikimirror -pMYOTHERBADPASSWORD --default-character-set=utf8 wikimirror

And then you’ll see the progress flying by. While it is loading you should be able to hit a random page and get back some Wikipedia-looking data. If you get an error, well, obviously something is wrong…

With my slight modifications I was getting about 1,000 inserts a second, which gave me:

 14313024 pages (1041.174/s),  14313024 revisions (1041.174/s) in 13747 seconds

Which ran in just under four hours.  Not too bad!
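To double-check that everything landed, the row count of the page table should line up with what the import script reported:

mysql -uwikimirror -pMYOTHERPASSWORD wikimirror -e "SELECT COUNT(*) FROM page;"
# expect 14313024; this is instant on MyISAM, which keeps an exact row count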

With the load all done, I shut down MySQL and copied back the original config. For the fun of it I did add in the following for day-to-day usage:

key_buffer = 512M
max_allowed_packet = 128M
query_cache_limit = 18M
query_cache_size = 128M

I should add that the ‘default’ small config was enough for me to withstand over 16,000 hits a day when I got listed on Reddit. So it’s not bad for small-ish databases that see a lot of action (my WordPress database is about 250 MB), but Wikipedia is about 41 GB.

Now for the weird stuff. There are numerous weird errors that’ll appear on the pages. I’ve tracked the majority down to Lua scripting now being enabled on the template pages of Wikipedia. So you need to enable Lua on your server and set up the Lua extensions.

The two that just had to be enabled to get things looking half right are the Lua and Scribunto extensions.

With this done right, you’ll see Lua as part of the installed software on the Special:Version page, and the Lua and Scribunto extensions listed under installed extensions.

I did need to put the following in the LocalSettings.php file, but it’s covered in the installation instructions for the extensions:

$wgLuaExternalInterpreter = "/usr/bin/lua5.1";
require_once("$IP/extensions/Lua/Lua.php");
$wgScribuntoEngineConf['luastandalone']['luaPath'] = '/usr/bin/lua5.1';
require_once("$IP/extensions/Scribunto/Scribunto.php");
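Those lines assume the Lua 5.1 interpreter actually exists at that path; on Debian 7 it’s one package away if it doesn’t:

apt-get install lua5.1
ls -l /usr/bin/lua5.1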

Now when I load a page it still has some missing bits, but it’s looking much better.

The Amiga page…

Now I know the XOWA people have a torrent set up with about 75 GB worth of images. I just have to figure out how to get those and parse them into my Wikipedia mirror.

I hope this will prove useful for someone in the future. But if it looks too daunting, just use XOWA. Another solution is WP-MIRROR, although it can apparently take several days to load.

Atari System V UNIX Saga Part II: TenoxVGA

(this is a guest post from Tenox)

In the previous article I explained how I got hooked on Atari UNIX, and my brief efforts in the hardware department. TL;DR: I got stuck on the lack of the specialized Atari high-resolution 1280×960 CRT monitor, which operates using ECL rather than VGA signaling.

So what is an ECL signal? Without going too much into the details, ECL is a differential signal, much like LVD SCSI. Differential signaling is more resistant to interference and therefore allows higher bandwidths.


ECL signaling was used for the high-resolution monochrome monitors of Sun, SGI, DEC, HP, NeXT, etc., before VGA got its high-resolution modes via VESA extensions.

[Photo: Sun 3/60]

During the research phase I came across a device made by Extron which basically converts an ECL signal to VGA. These were made to allow Sun/SGI/HP/NeXT workstations to be hooked up to a VGA projector.

[Photo: Extron RGB 202xi]

Despite hours spent by myself and other Atari users alike trying to get these working, it just didn’t work.

At this point the only viable approach was to make such an adapter from scratch, specifically for the Atari TT. I have rather limited experience in electronic circuit design, and I was not the first to come up with the idea: over the last 20 years or so, adapters like this had been made and tried by much more experienced people, but still with rather crappy results. It was obvious I needed professional help. Fortunately, for the last several years I have been living in Silicon Valley, which is abundant with the required skills.

I found a helpful company with extensive expertise in video signal processing and described the problem; they understood, gave me a reasonable quote, I delivered my Atari computer, and I waited patiently. After a few weeks I got an email saying I could come in and see the preliminary results. They had made this device:

[Photo: the prototype converter]

And the Atari TT was operating in the monochrome 1280×960 mode on an LCD panel! Well, nearly there…


The signal was shifted a little bit, and the board was operating from a lab bench power supply. Not a big deal. After a few days they had added a small breadboard with a circuit to shift the horizontal sync, and a voltage converter to obtain the -5V needed by the ECL components. Everything was just perfect.


I ordered a quantity of 25 units and, knowing the final price, went looking for more serious commitments from the Atari users who had previously expressed interest. Having been out of touch with Atari for quite a long time, I realized that most of the interested users are musicians who still use the TT for Cubase and would like to have a larger, more modern LCD screen. They were delighted to see this:

[Photo: TT High on an LCD panel]

Also, in the meantime I started testing more and more LCD panels and realized a rather big issue. TT High mode is 1280×960, whereas most LCD panels operate in 1280×1024 mode. This is a difference of 64 pixels vertically, and also a change from 4:3 to 5:4 aspect ratio. I had assumed this would work just fine if the monitor was set to the so-called 1:1 mode, i.e. with expansion/scaling off. Unfortunately, to my utter surprise, almost zero monitors out there had this mode! Only the newer, larger Dell monitors had it, but who would want to waste a 24″ monitor to run such a small resolution with a large black border? I went through endless monitors, new and used, user manuals, specs, computer junkyards like HSC and WeirdStuff, etc. Nothing seemed to work; everything was getting anti-aliased and blurred. And then someone recommended the NEC 1990 LCD panel to me:

[Photo: NEC 1990 LCD panel]

It did have a 1:1 mode, as well as custom aspect ratios and the most advanced display settings known to an LCD panel. It worked like a dream. The screen was 100% perfectly sharp, as if you were looking at an Atari ST emulator on a PC or Mac. The black bars on the top and bottom are the missing 64 pixels, but they are almost unnoticeable. Fortunately, the NEC monitor goes for around $50 USD on eBay and matches the Atari TT in case color; it’s a perfect TTM 19x replacement. Most users who bought the adapter would also invest the $50 and get the same monitor to avoid anti-aliasing and have a perfect image display. I also became an instant fan of the NEC 1990 and now see it as THE LCD panel for all my retro computing activities.

Back to the Atari: one problem solved, I was hit with another disaster. The finally assembled converter units were producing a rather crappy picture. The picture was oscillating, and there were a lot of nasty artifacts and imperfections on the screen; the NEC LCD panel couldn’t even auto-adjust to the signal. Long story short, once the components from the breadboard were incorporated onto a much smaller PCB, they started interfering with the now un-differentiated and unshielded video signal. Some of the ICs were also overheating and changing properties after a while. The investigation and problem solving went over budget rather quickly. The company stopped charging me at some point and worked in their spare time to solve the problem. Unfortunately it took several months to finish, mainly because each attempt required a new PCB print, made at low cost and low priority, which of course meant made in China. So, about one month per attempt just to see if it was any better. From this perspective I can now understand why all previous attempts to construct such a device failed miserably: it required a LOT of patience and professional expertise to finish.

This is what the final version of TenoxVGA looks like:

[Photo: the final TenoxVGA unit]

In the meantime I had some time to come up with a case for it. I experimented with a water jet cutter:

[Photos: water-jet cut case prototypes]

but finally settled on a 3D-printed version. It’s much cheaper and easier to produce than cutting and bending sheet metal:

[Photo: the 3D-printed case]

Once I received the shipment of 25 units and got appropriate power supplies and cables, I winged a website with PayPal payments and started selling to the happy Cubase users.


Having fixed the 20-year-old video problem, I was now ready to install Atari System V UNIX…

…continued in this post!

QEMU 1.7.1 Stable released

Hi everyone,

I am pleased to announce that the QEMU v1.7.1 stable release is now available at:

http://wiki.qemu.org/download/qemu-1.7.1.tar.bz2

v1.7.1 is now tagged in the official qemu.git repository, and the stable-1.7 branch has been updated accordingly:

http://git.qemu.org/?p=qemu.git;a=shortlog;h=refs/heads/stable-1.7

This release contains 62 build/bug fixes, including important fixes for crashes at certain guest system memory sizes, and unplugging of virtio devices.

Thank you to everyone involved!
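If you feel like building it from source, it’s the usual QEMU routine; a minimal sketch, with the target list trimmed to 32-bit x86 (pick whatever targets you need):

wget http://wiki.qemu.org/download/qemu-1.7.1.tar.bz2
tar xjf qemu-1.7.1.tar.bz2
cd qemu-1.7.1
./configure --target-list=i386-softmmu
make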

In a surprise move, Microsoft opens up the source to MS-DOS & Word for Windows.

I couldn’t believe it!  You can find the official announcement here.

So this is the MS-DOS 1.1 & 2.0 source code. Pity it’s not 3.x, but heck, it’s a start!

Also, the Word in question is 1.1a; however, it does seem to include the OS/2 bits, which was a big surprise.

I haven’t tried to build any of it, as I just got up, but I know what I’ll be doing today!

8086tiny de-obfuscated!

I came across this site, which is from the author of 8086tiny, the winning IOCCC entry!

It’s ballooned from just under 4 KB to 28 KB, but it’s still incredibly tiny!

For anyone interested, its features are:

  • Highly accurate, complete 8086 CPU emulation (including undocumented features and behavior)
  • Support for all standard PC peripherals: keyboard, 3.5″ floppy drive, hard drive, video (Hercules/CGA graphics and text mode, including direct video memory access), real-time clock, timers, PC speaker
  • Disk images are compatible with standard Windows/Mac/UNIX mount tools for simple interoperability
  • Complete source code provided (including custom BIOS)
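If you want to take it for a spin, the build is a plain make, and if I’m reading the docs right, the emulator just takes the BIOS image and a floppy image as arguments (the file names below are examples, use whatever ships in the package):

make
./8086tiny bios fd.img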

It’s worth checking out for some old time PC/XT nostalgia!

Visual Studio 2005 Express editions

I found myself in need of J#, of all things, for something at work. J# is the MS answer to migrating Java code to .NET.

Anyway, it turns out I was able to find the web installer, but the link for generating a license code no longer exists. However, the ISOs never needed the code. Except they aren’t available for download.

Or so I thought.

Turns out they are still there, but MS pulled the pages.

I figure it’ll help someone out there.

Blinking lights…

I almost cannot believe I’m going to post this, but so many of my machines don’t have LEDs to blink for hard disk activity that it is driving me nuts.

Well, thankfully, for Windows there is a solution:

diskled

What it does is poll \PhysicalDisk(_Total)\% Disk Time every 30 ms, and if the disk is doing something, it’ll blink the icon’s colour.
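If you’re curious what that counter actually reports, Windows ships typeperf, which will stream it to the console; one-second sampling here, since typeperf won’t go down to 30 ms:

typeperf "\PhysicalDisk(_Total)\% Disk Time" -si 1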

Why is this cool?

It’ll even work over RDP, so your server can be on the other side of the world and you’ll still know what’s going on.

It’s cooler than it looks

Web Rendering Proxy

(note this is a guest post from Tenox)

WRP is an HTTP proxy service that renders web pages into GIF images, paired with a clickable imagemap of the original web links. It basically allows you to use historical and obsolete web browsers on the modern web.
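Setup on the browser side is just pointing its HTTP proxy setting at the machine running WRP. You can also sanity-check it from a modern box with curl; the host and port here are made up, so check the script for what it actually listens on:

curl -x http://10.0.0.5:8080 http://www.cnn.com/ -o cnn.html
# the response should be a small HTML page with an imagemap pointing at the rendered GIF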

See a gallery of today’s news sites. All links are clickable!

CNN via Internet Explorer 1.5

Reuters via IBM Web Explorer

BBC News via Mac Mosaic

Reddit via NextStep OmniWeb

Netscape 3.x visiting DNA Lounge

For more background information and screenshots you can see my previous post on the matter.

There are two versions: Cocoa-WebKit for Mac OS X, and Qt-WebKit for Linux/BSD/etc. The script can be downloaded here.