As you may remember from my prior attempt at using AltaVista Search, I ran out of space, and found out it only serves pages on 127.0.0.1:6688 and is pretty much hardcoded to do so. It’s a “fine” hybrid Java 1.01 application, with the bulk of it being Java. I finally got around to setting up a VM, unpacking all of the utzoo archives, and indexing them. I should have done something about the IO because this took too long (KVM).
So, to cheat the system, I installed stunnel as a simple HTTPS-to-HTTP proxy, which let me access my search VM from anywhere. However, it still embedded 127.0.0.1 in all the pages.
Enter an Apache reverse proxy to talk to stunnel to talk to AltaVista search!
First to enable a few modules:
a2enmod substitute
a2enmod proxy
a2enmod ssl
a2enmod proxy_http
a2enmod rewrite
And adding this into the config:
SSLProxyEngine On
ProxyPass "/altavista/" "https://10.12.0.16/"
ProxyPassReverse "/altavista/" "https://10.12.0.16/"
ProxyRequests Off
RewriteEngine On
SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
AddOutputFilterByType SUBSTITUTE text/html
Substitute "s/1997/2016/ni"
Substitute "s/97/16/ni"
Substitute "s|127.0.0.1:6688|debian7/altavista|n"
Substitute "s|file:///C:\Program Files\DIGITAL\AltaVista Search\My Computer\images\|http://debian7/images/|n"
Substitute "s|launch=app||n"
Substitute "s|<a href=http://debian7/altavista/?pg=q&what=0&fmt=d|<!--|n"
Substitute "s|><strong>|-->|n"
Substitute "s|</strong></a>||n"
Substitute "s|>u:\|->u:\|n"
This let me redirect all of those requests into a VM called debian7 on the /altavista path. I also copied the images to the Apache server, and now I get something that looks correct!
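To see what the substitution chain does to a page, here is a rough Python sketch that applies a few of the same literal replacements in the same order. The sample HTML line is made up for illustration, and plain str.replace is case-sensitive, so this only approximates the "n" (fixed string) and "i" (ignore case) flags.

```python
# Literal replacements in the same order as the Substitute directives.
# Only a subset of the rules is shown; the sample input is invented.
RULES = [
    ("1997", "2016"),
    ("97", "16"),
    ("127.0.0.1:6688", "debian7/altavista"),
    ("launch=app", ""),
]


def rewrite(html: str) -> str:
    """Apply each literal replacement in turn, like chained filters."""
    for old, new in RULES:
        html = html.replace(old, new)
    return html


sample = '<a href="http://127.0.0.1:6688/?pg=q&launch=app">Copyright 1997</a>'
print(rewrite(sample))
```

Note that the bare 97 rule is blunt: it rewrites every "97" on the page, not just two-digit years.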
I cut the results short… But here is a search of something simple:
I also killed all the ‘working’ URLs that simply open a desktop application on the index ‘server’. That made sense for a personal service, but it is no good for a server. As such you can’t click on any search results now. I need something else to figure out how to take the result blocks like “u:\b128\comp\databases\2852” and turn them into URLs.
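One possible starting point for that, as a minimal Python sketch: drop the drive letter, flip the backslashes, and hang the rest off a web root. The base URL http://debian7/utzoo/ is purely hypothetical (nothing is served there yet), and the sketch assumes the u: drive maps straight onto the top of the unpacked archive.

```python
from urllib.parse import quote

# Hypothetical location where the unpacked utzoo tree would be served.
BASE_URL = "http://debian7/utzoo/"


def result_to_url(result_path: str, base_url: str = BASE_URL) -> str:
    """Map a result like u:\\b128\\comp\\databases\\2852 to a URL."""
    drive, sep, rest = result_path.partition(":")
    if not sep or len(drive) != 1:
        raise ValueError(f"not a drive-letter path: {result_path!r}")
    # Split on backslashes, drop empty pieces, and URL-escape each one.
    parts = [p for p in rest.split("\\") if p]
    return base_url + "/".join(quote(p) for p in parts)


print(result_to_url(r"u:\b128\comp\databases\2852"))
```

Something still has to run this against the result pages, whether a small CGI in front of the proxy or a regex Substitute rule doing the same mapping.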
Also, as much as I want to re-index, it would be best to cut off the headers first, or most of them, so the preview lines make sense. Xref, Path, even From & Newsgroups don’t interest me.
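A pre-indexing pass for that could look roughly like the Python sketch below. It assumes each article is a single text file with "Name: value" header lines, a blank line, and then the body, using plain LF line endings. Older articles in the archive may not follow that layout, so treat it as a starting point only.

```python
# Headers named above as uninteresting; adjust the set to taste.
DROP = {"xref", "path", "from", "newsgroups"}


def strip_headers(article: str, drop=DROP) -> str:
    """Remove unwanted header lines; keep other headers and the body."""
    head, sep, body = article.partition("\n\n")
    kept = []
    skipping = False
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            # Continuation line: shares the fate of the header above it.
            if not skipping:
                kept.append(line)
            continue
        name = line.split(":", 1)[0].strip().lower()
        skipping = name in drop
        if not skipping:
            kept.append(line)
    return "\n".join(kept) + sep + body


article = (
    "Path: utzoo!example\n"
    "From: someone@example\n"
    "Newsgroups: comp.databases\n"
    "Subject: Test\n"
    "\n"
    "Hello world\n"
)
print(strip_headers(article))
```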
I hate to leave it as ‘good enough’ but if anyone has a solution…. I’ll be glad to make this wonderful resource available!
Have you considered using the SWISH-E search engine? It’s open source, and in its day it seemed to be a pretty decent piece of software. It seems to be dead now, so I guess that should mean it meets your criteria for what software you use? 🙂
lol maybe. I’ve nearly gotten a “solution” for AltaVista, and I kind of like how this required middleware to work against its built-in limitations.
I also found a linkbait site with all the utzoo stuff extracted. If I was smart I’d make a better one. It wouldn’t take tooooo much effort. Just time I suppose.
If you used SWISH-E, there would still be a lot of opportunities for writing a bunch of code 🙂 You should be able to write a “filter” or something so the engine gets given some metadata/properties like which newsgroup a post is from, and then provide a field for specifying that in the search form. It depends on what you find fun obviously 🙂