Extracting WARC files

Well, since the catastrophic outage, I’ve been looking through my backups trying to see how much of my ‘vpsland’ archive I have. And it’s not so hot. The good news is that the physical machine with the last known good copy is fine. It’s just in a place I can’t get to, on the other side of the world. And I’m still in exile, so shipping it really isn’t an option at the moment.

On the plus side, I found a WARC archive holding some 22GB of the 400GB worth of files. So it’s a start.

So what are WARC files? Why do people gzip them to get maybe 1% compression? How do magnets work anyways?

Web archives are single snapshots in time of a site. Sounds like an MHT file, but something more ‘portable’ and open standard-ish. Which means there are a million tools, none of which seem to do exactly what you want.
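
To make the format a bit less abstract, here is a minimal sketch of walking the records in a WARC using only the Python standard library. It assumes the usual layout (a `WARC/1.x` version line, some `Name: value` headers, a blank line, then `Content-Length` bytes of payload) and it is not how warcat or any other tool actually does it. The names `iter_warc_records`, `open_warc` and `SAMPLE` are made up for illustration.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:40])
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the record headers
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def open_warc(path):
    """Open a .warc or .warc.gz file as a binary stream (hypothetical helper)."""
    # gzip.open reads through concatenated gzip members, so a .warc.gz
    # made of many small gzip chunks is read as one continuous stream.
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


# A tiny hand-made record, just to exercise the parser in memory.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/a.txt\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

for headers, payload in iter_warc_records(io.BytesIO(SAMPLE)):
    print(headers["warc-type"], headers["warc-target-uri"], len(payload))
```

On a real archive you would pass `open_warc("something.warc.gz")` instead of the in-memory sample. For `response` records the payload is the raw HTTP response, headers and all, so there is still some unwrapping to do before you get a file back.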

All I want to do is extract all my files from the WARC, but that doesn’t seem to be what most tools are geared toward. Mostly they display the WARC like a web page, which would mean clicking through hundreds of thousands of files. Yikes.

Thankfully warcat seems to fit the bill:

python3 -m warcat extract ../[email protected]015-10-04-fc233ad0-00000.warc.gz

I didn’t see a package for it on Ubuntu, so I installed it with pip:

pip3 install warcat

And that seems to have done the trick.
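
For the curious, the core of "extract everything" boils down to two small steps: turn each record's target URI into a path on disk, and peel the HTTP headers off the response so only the file body is written. The sketch below shows one way that could look. It is my own guess at the general idea, not warcat's code, and `uri_to_relpath`, `split_http_response` and `save_response` are hypothetical names. It also assumes the body is stored as-is, so it ignores chunked transfer encoding and compressed responses.

```python
import os
from urllib.parse import unquote, urlsplit


def uri_to_relpath(uri):
    """Map a URI to a relative path like host/dir/file."""
    parts = urlsplit(uri)
    path = unquote(parts.path)
    if not path or path.endswith("/"):
        path += "index.html"  # assumption: directory URLs are saved as index.html
    # Drop empty, "." and ".." segments so nothing escapes the output directory.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return os.path.join(parts.hostname or "unknown-host", *segments)


def split_http_response(block):
    """Split a raw HTTP response into (header_bytes, body_bytes)."""
    head, sep, body = block.partition(b"\r\n\r\n")
    if not sep:
        return block, b""  # no blank line found: treat it all as headers
    return head, body


def save_response(out_dir, uri, block):
    """Write the body of one HTTP response under out_dir and return the path."""
    _, body = split_http_response(block)
    target = os.path.join(out_dir, uri_to_relpath(uri))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as fh:
        fh.write(body)
    return target
```

A real tool has to deal with more edge cases than this, such as query strings, a URL that is both a file and a directory, and file names the local filesystem won't accept, which is a good reason to let warcat do the work.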

Now to figure out how to set up some cheap storage on Azure and either copy this stuff up or extract it over there.

spot pricing

I’m using the new ‘spot’ pricing model to try to keep costs down. Obviously it’s not as good as dedicated slices, but it won’t leave me broke either. And I have a lot more messing around with containers to do, trying to string together nonsense.
