Fun with Apache, (mod_proxy, mod_rewrite), stunnel, And AltaVista Personal search

As you may remember from my prior attempt at using Altavista Search I ran out of space, and found out it only serves pages on 127.0.0.1:6688 and is pretty much hardcoded to do so.  It’s a “fine” hyrid java 1.01 application, with the bulk of it being java.  I finally got around to setting up a VM, and unpacking all of the utzoo archives, and indexing them.  I should have done something about the IO because this took too long (KVM).

SIXTEEN HOURS!!!

SIXTEEN HOURS!!!

So to cheat the system, I installed stunnel as a simple https to http proxy, which let me access my search VM anywhere.  However it still embedded 127.0.0.1 in all the pages.

via stunnel

via stunnel

Enter an Apache reverse proxy to talk to stunnel to talk to AltaVista search!

First to enable a few modules:

a2enmod substitute
a2enmod proxy
a2enmod ssl
a2enmod proxy_http
a2enmod rewrite

And adding this into the config:

SSLProxyEngine On
ProxyPass “/altavista/” “https://10.12.0.16”
ProxyPassReverse “/altavista/” “https://10.12.0.16/”
ProxyRequests Off
RewriteEngine On
SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
AddOutputFilterByType SUBSTITUTE text/html
Substitute “s/1997/2016/ni”
Substitute “s/97/16/ni”
Substitute “s|127.0.0.1:6688|debian7/altavista|n”
Substitute “s|file:///C:\Program Files\DIGITAL\AltaVista Search\My Computer\images\|http://debian7/images/|n”
Substitute “s|launch=app||n”
Substitute “s|<a href=http://debian7/altavista/?pg=q&what=0&fmt=d|<!—|n”
Substitute “s|><strong>|—>|n”
Substitute “s|</strong></a>||n”
Substitute “s|>u:\|->u:\|n”

This let me redirect all of those requests into a VM called debian7 on the /altavista path.  I also copied the images to the apache server, and now I get something that looks correct!

Apache in the mix!

Apache in the mix!

I cut the results short… But here is a search of something simple:

About 16598 documents match your query.

About 16598 documents match your query.

I also killed all the ‘working URL’s that simply open a desktop application on the index ‘server’.  Naturally it was a personal service, but as a server this isn’t any good.  As such you can’t click on any search results now.  I need something else to figure out how to take the result blocks like “u:\b128\comp\databases\2852” and turn them into URL’s.

Also, as much as I want to re-index I would be best to cut off the headers, or most of them so the preview lines make sense.  Xref, Path, even From & Newsgroups don’t interest me.

I hate to leave it as ‘good enough’ but if anyone has a solution…. I’ll be glad to make this wonderful resource available!

3 thoughts on “Fun with Apache, (mod_proxy, mod_rewrite), stunnel, And AltaVista Personal search

  1. Have you considered using the SWISH-E search engine? It’s open source, and in its day it seemed to be a pretty decent piece of software. It seems to be dead now, so I guess that should mean it meets your criteria for what software you use? 🙂

    • lol maybe. I’ve nearly gotten a “solution” for altavista, and I kind of like how this required middleware to work against it’s built in limitations.

      I also found a linkbait site with all the utzoo stuff extracted. If I was smart I’d make a better one. It wouldn’t take tooooo much effort. Just time I suppose.

      • If you used SWISH-E, there would still be a lot of opportunities for writing a bunch of code 🙂 You should be able to write a “filter” or something so the engine gets given some metadata/properties like which newsgroup a post is from, and then provide a field for specifying that in the search form. It depends on what you find fun obviously 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.