I wanted to get some research into some early space flight, and look into that magical transition when everything went from Cowboys & Indians to moonmen & the race to space. Granted Toy Story covers cultural touchstone pretty well, reading period pieces is fun too. I had a few CD-ROM’s containing the 1980’s National Geographic CD-ROM’s when they were sold up in decade sets, but what always escaped me was the fancy collectors box with the whole thing.
I always figured this was going to be one of those weird collectors’ items that probably was under produced, over sold, and lost to the winds of time. Looking on eBay for a 1950’s and 1960’s set .. and thinking about the 1970’s as well, and it was going to get close to £40. Ouch. So for the heck of it, I look for the fancy box set.
I was surprised for as much as I was going to end up paying for one or two sets, I could get the entire thing. In the legendary fancy wooden box. I got mine, shipped for under £20. Much wow!
Okay what is the catch?
Wait a minute, these National Geographic CDs are obsolete! – Christopher Elliott
I always was running mine on MacOS using Cockatrice III with the monitor resolution set to the absolutely absurd resolution of 1152×870 provided by the 1991 21″ monitor. I couldn’t imagine why these CD’s wouldn’t work. And of course the first step was to rip the CD-ROM’s. There was 31 National Geographic, and an additional clip art disc in my box. I fired up my XP machine to have it’s hard disk give up and die on me. Luckily since I had removed the mechanical disk from the iMac G5, I had a spare SATA disk handy. I didn’t feel like fighting the XP installer, and I’m impatient, so I made a netboot.xyz bootable flash drive, and installed Debian 10 over the internet. Very nice. Now I could get down to ripping CD’s. Luckily the drive from the DVD-RAM drive disaster reads at 48x, so it took me under 4 hours to rip them all. In case you are wondering:
32 File(s) 17,894,987,776 bytes
The flash drive I used is 32Gb. I got it in a 3 pack from Tesco. I think I paid £10 for it. That means UTZOO + NatGEO all fit one of the drives. I wonder if they’ll ever offer high resolution scans on a USB drive in a fancy wooden box?
For the hell of it, I used 7zip to decompress all the ISO’s and that’s when I noticed that although the files were spread over discs there was a clean decade break between volumes from the 1900’s, and that they had a logic to them.
On the NGS_1956_1959 CD-ROM there is a hierarchy something like this:
E:\IMAGES\256I
The 2 is for the 20th century, and 56 is the year. I is the month; in this case I is September. Since we live in the future, and rendering jpeg’s is quicker than real-time, 256IC01A.JPG can be shown to be the cover. The next pattern is for the adverts, 256IA02A.JPG – 256IA30A.JPG are the into adverts. Now, we get to the interior content 256I0287.JPG – 256I0426.JPG. Next is the closing/trailer advertisements 256IZ01Z.JPG – 256IZ13Z.JPG. And Finally the back of the issue, 256IB14Z.JPG is the rear of the magazine.
So we now have some understanding of the format. Putting this into order could be done with something simple like this:
find ./ -name '2[0-9][0-9]IC[0-9]*.JPG' > interior
find ./ -name '2[0-9][0-9]IA[0-9]*A.JPG' >> interior
find ./ -name '2[0-9][0-9]I[0-9]*.JPG' >> interior
find ./ -name '2[0-9][0-9]IZ[0-9]*Z.JPG' >> interior
find ./ -name '2[0-9][0-9]IB[0-9]*Z.JPG' >> interior
Doing so makes a nice list file of what images should go in which order. I could probably use ffmpeg and painstakingly check the images for ‘pullouts/double wide’ ones, and have it stitch the rest together as a two pager ( ffmpeg -i left.jpg -i right.jpg -filter_complex hstack combined.jpg ), but that still sounds like a lot of work. Also the images are very low quality, It’s a shame they didn’t use black & white on the text, and scan the images separately, but that’d require something like PDF, and no doubt a LOT of time. Although Kodak did sponsor the set, the developer, Mindscape didn’t go with fancy PDF technology of the late 1990s, instead it’s just the blurry jpeg scans we have today.
That’s when I found out about Tesseract. Running it against this one paragraph reveals:
Voyager is watching two small moons
National Geographic, July 1981
that-seem to be playing tag as they race
around Saturn in almost the same orbit. The
trailing moon is traveling faster than the
leader, and should catch up with the leader
in January 1982 (pages 20-21). The two pre-
sumably have been playing this game for
billions of years, Through what sleight of
physics do they avoid colliding?
I have to admit, that’s pretty good! And how amazing that I have a LOT of files to scan.
188553 File(s) 11,061,111,690 bytes
That is a LOT of files. Okay that’s nice, but can Tesseract read the list that I generated per issue? YES. The only thing that I’d love to see Tesseract do is create PDF’s with the scanned text embedded. OH WAIT, IT ALREADY DOES THAT!
How on earth did I not know this?
I put together a few scripts, and I was able to separate out all the images into years & months, then I created the needed list files with all the images in the correct order. It’s not a fast process I think it may take me a week or so to do this.
So far, I’m up to 1903, so I’ll update with some rough idea of when this finished.
So, the applications needed are old and obsolete Win16 or Classic MacOS needed machines, with an optical drive. Yes they still work (with emulation) on modern machines, although you still need to read the physical discs. Thankfully the images are easily mapped into the right order, and you can map them as your own. Neat!
The DVD Release (eg https://archive.org/details/thecompletenationalgeographic2010) apparently used an Adobe Air viewer instead.
These have “cng” files, but “The cng files are all jpegs, XOR’d bitwise with 239.”
The same snipped looks like https://imgur.com/5LC0KrS (full page https://imgur.com/nkNl907).
I see you have to set the date back to 2010 to get it to install because of some really poor certificate choices.
I ran the OCR against it, and it’s significantly more scannable than the CD-ROM jpeg’s.
Oh well I have a fancy box now.
A very interesting and timely blog post. My wife brought a set in last night and presented them to me. Trying to figure out if it’s worth the hassle and if so what’s the best approach (create an actual old Mac that might run) vs virtualisation vs what you have actually done.
If you succeed with generating an OCRed, possibly .pdf that would be very interesting. Not sure why they didnt just use .html if they wanted an easy way to publish the information.
You can find them per decade on archive.org. The text is so low quality the OCR struggled. There is a DVD version with much better images, which I may tackle at a later time. I just wanted to do these for the sake of it.
I ran it under Macintosh emulation originally, but once I saw the pattern on how they were named, it was a mission to OCR & PDF them all.
Did you get the decorative wooden box? It’s surprisingly nice. It’s funny it’s 18GB of poorly scanned material, but in 1997 it’d have been such a massive deal.
Thanks for the tip… my wife was really after seeing the photos but after doing some reading has prepared herself to be a little disappointed.
I do have a few options as to real hardware vs emulation but to be honest have more than enough projects and was struggling to find motivation – that said looking at what you were doing made a lot more sense long term.
We have the wooden box and manuals and goodies. Such a difference to read the manual and see that if you were somehow able to download the “massive 100mb of auxiliary files” it would have better performance. Having the entire collection on a USB thumb would have been mind blowing then.
I still think that it would have been a better (and easier) call to have used some form of html – you get the indexing, searching and it would still work today.
Do post on whatever you end up with please.
You can use imagemagick to increase the contrast and then OCR them images with tesseract.
Xargs is perfect for a parallel task like this.