How not to write a web scraping engine

Seriously, the sad thing is that they wasted all their time, and bandwidth. The good thing is that PHP7 is far better at this, than PHP5. I know it’s not hipster neaveaux but PHP7 is really awesome stuff.

I didn’t wake up to a million messages about being down, rather a look at what was going on showed me this:

Obviously something is up.

I checked Cloudflare:

All those hits, and no change in user traffic. So I’m not popular.

And it’s all from the United States. Going back to WordPress I can see it’s mostly from one user!

A look through the 30MB log, and it’s a broken site scrape

174.127.114.00 - - [20/Aug/2020:20:43:39 -0400] "GET /wordpress/category/windows-nt-4-0/%5C'https://virtuallyfun.com/wordpress/2020/03/17/philip-pembertons-3b1-emulator-moved/%5C' HTTP/1.1" 301 - "https://virtuallyfun.com/wordpress/category/windows-nt-4-0/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
174.127.114.00 - - [20/Aug/2020:20:43:40 -0400] "GET /wordpress/category/windows-nt-4-0/%5Chttps:/virtuallyfun.com/wordpress/2020/03/17/philip-pembertons-3b1-emulator-moved/%5C HTTP/1.1" 404 32750 "https://virtuallyfun.com/wordpress/category/windows-nt-4-0/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
174.127.114.00 - - [20/Aug/2020:20:43:40 -0400] "GET /wordpress/category/windows-nt-4-0/%5C'https://virtuallyfun.com/wordpress/2020/03/12/the-price-of-commodore-64s-is-getting-out-of-control/%5C' HTTP/1.1" 301 - "https://virtuallyfun.com/wordpress/category/windows-nt-4-0/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
174.127.114.00 - - [20/Aug/2020:20:43:41 -0400] "GET /wordpress/category/windows-nt-4-0/%5Chttps:/virtuallyfun.com/wordpress/2020/03/12/the-price-of-commodore-64s-is-getting-out-of-control/%5C HTTP/1.1" 404 32752 "https://virtuallyfun.com/wordpress/category/windows-nt-4-0/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
174.127.114.00 - - [20/Aug/2020:20:43:41 -0400] "GET /wordpress/category/windows-nt-4-0/%5C'https://virtuallyfun.com/wordpress/2020/03/08/thanks-to-lgr-i-just-found-out-about-simcity-for-palm-pilots-was-a-thing/%5C' HTTP/1.1" 301 - "https://virtuallyfun.com/wordpress/category/windows-nt-4-0/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
174.127.114.00 - - [20/Aug/2020:20:43:42 -0400] "GET /wordpress/category/windows-nt-4-0/%5Chttps:/virtuallyfun.com/wordpress/2020/03/08/thanks-to-lgr-i-just-found-out-about-simcity-for-palm-pilots-was-a-thing/%5C HTTP/1.1" 404 32751 "https://virtuallyfun.com/wordpress/category/windows-nt-4-0/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"

This is just terrible you are quoting the link to the link, having slashes going every which way. What the heck!? Don’t be a jerk, pace the thing, why are you trying to DOS it? While my hosting is clearly not robust, why can’t you wait a few seconds between each query? Also look at the output and URL’s it’s sending it’ll never work!

Oh and speaking of WTF:

[Sat Aug 15 02:54:16.376342 2020] [:error] [pid 2176618:tid 46981722752768] [client 5.188.84.35:0] [client 5.188.84.35] ModSecurity: Access denied with code 403 (phase 2). Match of "rbl nxdomain.v2.rbl.imunify.com." against "TX:rbl_ip" required. [file "/etc/apache2/conf.d/modsec_vendor_configs/imunify360-full-apache/001_i360_1_generic.conf"] [line "18"] [id "77140164"] [msg "Infectors: PHP Injection Low value||MVN:TX:rbl_ip||T:APACHE||MV:02-54.5.188.84.35||PC:74973||SC:/home/virtuall/public_html/wordpress/wp-comments-post.php"] [severity "WARNING"] [maturity "1"] [accuracy "7"] [tag "service_o"] [tag "service_i360"] [hostname "virtuallyfun.com"] [uri "/wordpress/wp-comments-post.php"] [unique_id "XzeGmAmHABK5lhefvTgiWQAAAQI"], referer: https://virtuallyfun.com/wordpress/2019/07/25/c-beams-glitter-in-the-dark-near-the-tannhauser-gate/#comment-217724/<br>
[Sat Aug 15 00:59:09.378579 2020] [:error] [pid 2121712:tid 46981729056512] [client 5.188.84.25:0] [client 5.188.84.25] ModSecurity: Access denied with code 403 (phase 2). Match of "rbl nxdomain.v2.rbl.imunify.com." against "TX:rbl_ip" required. [file "/etc/apache2/conf.d/modsec_vendor_configs/imunify360-full-apache/001_i360_1_generic.conf"] [line "18"] [id "77140164"] [msg "Infectors: PHP Injection Low value||MVN:TX:rbl_ip||T:APACHE||MV:00-59.5.188.84.25||PC:71624||SC:/home/virtuall/public_html/wordpress/wp-comments-post.php"] [severity "WARNING"] [maturity "1"] [accuracy "7"] [tag "service_o"] [tag "service_i360"] [hostname "virtuallyfun.com"] [uri "/wordpress/wp-comments-post.php"] [unique_id "XzdrnVSmjfAxcRCyCxt4rAAAAIU"], referer: https://virtuallyfun.com/wordpress/2019/07/25/c-beams-glitter-in-the-dark-near-the-tannhauser-gate/#comment-217724/

There is no comment on that page. There never was. Look I know it sucks he’s dead, but there is nothing you or me can do about it.

Digital.com purges DIGITAL

while finally getting around to renaming aux to aux_ for my AltaVista based search engine, I noticed that the product link, http://www.altavista.software.digital.com/search/index.htm, is suddenly not found.

Well isn’t that a shame.

Ironically in a twist of fate, I found this article, “AltaVista Search Engine History Lesson For Internet Nerds“, with a nice overview of the amazing rise, and tragic neglectful decline of AltaVista. Then what struck me was this line:

Digital was the original owner of the domain that you’re reading now; www.digital.com

Wait!? What?!

Did digital.com just purge DIGITAL’s history?

Now I feel like an idiot for not having archived the archive. Always in motion is the past, it’s a shame that DEC’s pages had to be destroyed. History in digital form, especially Digital’s is always in motion and subject to $CURRENT_YEAR.

Sad.

Web Rendering Proxy – Full Page Scrolling

(This is a guest post by Antoni Sawicki aka Tenox)

Due to a popular demand I have added an option of generating full page height screenshot and allowing client browser to do the scrolling.

This makes the browsing experience much smoother, you have resources for it. Beware, a full page screenshot can be several MB in size encoded as gif/png and much more as a decoded raw bitmap on the client. I managed to crash Mosaic and OmniWeb a few times. Fortunately typical Wikipedia page is under 1 MB so for most part is should be fine. To activate just put 0 in page Height.

I have drafted a pre-release on github for testing. Please let me know any feedback. I’m also thinking whether enable this by default, or not.

December 14th, the end of Yahoo! groups

Oddly enough these things have been going on since 2001, and have been curated gardens instead of the mess that is usenet (which is still operational!), or private mailing lists.

The 2 that Im on are the Hercules 390 group which moved to group.io. While other groups like H390-music (MUSIC/SP the Canadian mainframe OS with internet hooks) that sadly died along with it’s author, is probably going to be purged from the internet.

Along with other things like pdos, or even the board game Supremacy.

I don’t know what the answer is, other than to always have downloadable mailing list archives, and never trust a single place. So much stuff is deleted to save trivial amounts of space, and neither corporations nor government institutions can be trusted to maintain anything.

Thankfully there is archive.org, but who backs them up?

Confessions of a paranoid DEC Engineer: Robert Supnik talks about the great Dungeon heist!

What an incredible adventure!

Apparently this was all recorded in 2017, and just now released.

It’s very long, but I would still highly recommend watching the full thing.

Bob goes into detail about the rise of the integrated circuit versions of the PDP-11 & VAX processors, the challenges of how Digital was spiraling out of control, and how he was the one that not only championed the Alpha, but had to make the difficult decisions that if the Alpha succeeded that many people were now out of a job, and many directions had to be closed off.

He goes into great detail how the Alpha was basically out maneuvered politically and how the PC business had not only dragged them down by management not embracing the Alpha but how trying to pull a quick one on Intel led to their demise.

Also of interest was his time in research witnessing the untapped possibilities of AltaVista, and how Compaq had bogged it down, and ceded the market to the upstart Google, the inability to launch a portable MP3 player (Although to be fair the iPod wasn’t first to market by a long shot, it was the best user experience by far).

What was also interesting was his last job, working at Unisys and getting them out of the legacy mainframe hardware business and into emulation on x86, along with the lesson that if you can run your engine in primary CPU cache it’s insanely fast (in GCC land -Os is better than -O9).

The most significant part towards the end of course is where he ‘rewinds’ his story to go into his interest in simulations, and of course how he started SIMH when he had some idle time in the early 90’s. SIMH of course has done an incredible amount of work to preserve computing history of many early computers. He also touches on working with the Warren’s TUHS to get Unix v0 up and running on a simulated PDP-7 and what would have been a challenge in the day using an obscure Burroughs disk & controller modified from the PDP-9.

Yes it’s 6 hours long! But really it’s great!

WRP 4.0 Preview

(This is a guest post from Antoni Sawicki aka Tenox)

Welcome a completely new and absolutely insane mode of Web Rendering Proxy. ISMAP on steroids!

While v3.0 was largely just a port from Python/Webkit to GoLang/Chromedp, the new version is a whole new game. Previously WRP worked by walking the DOM and making a clickable imagemap out of <A HREF> nodes. Version 4.0 works by using x,y coordinates obtained from ISMAP to perform a simulated mouse click in Chrome browser. This way you can click on any element of the page. From annoying cookie warnings, to various drop down menus and even play some online games. Also pagination has been replaced with a clickable scroll bar.

Enough talking, you can watch this video:

Or download the new version and try it yourself!

Please report bugs on github.com. Thank you!

WRP 3.0 Beta ready for testing

(This is a guest post from Antoni Sawicki aka Tenox)

I have released WRP 3.0 for testing. It’s currently a browser-in-browser server rather than a true proxy, but that’s in the works. Please try it out and let me know. Usage instructions are on the main github project page.

Today using trickery I was able to login to my reddit account from Mosaic:

Update: just added the missing image quantizer so that the color number input box actually does something useful. Now you can browse porn even with 16 colors:

WRP Runs on Windows

(This is a guest post by Antoni Sawicki aka Tenox)

Thats right, the new beta version of Web Rendering Proxy runs natively on Windows. Single EXE, no libraries or dependencies required. Only Chrome Browser.

I took a Internet Explorer 1.5 for a spin today while WRP was running on my Windows 10 PC. Worked just fine.

I have added Prev/Next buttons so that you can easily “scroll” through long pages.

ISMAP support has been added, proof:

You can download a preview build on github.

Web Rendering Proxy – Overdue Status Update

(This is a guest post from Antoni Sawicki aka Tenox)

There hasn’t been a major update to WRP (Web Rendering Proxy) in 5 years or so. Some new features have been added thanks to efforts of Claunia but the whole project was mostly impeded with mass migration of the whole Internet to SSL/TLS/https. It does semi work somehow thanks to sslstrip but the whole stack is an unmaintainable pile of crap which I’m not going to update any more.

A new rewrite from scratch is well under way. This time written in GoLang and using Chrome DevTools Protocol. Things should be much more stable and future proof.

Far from complete but I have a fully functional prototype now working in just under 100 lines of code:

UPDATE 1: You can play with it if you want. Please do not submit any bug reports just yet, as this is just a development version. Note that WRP is currently not a true HTTP proxy but rather browser-in-browser. Proxy may be supported later.

UPDATE 2: As of today online setting of size, scaling and scrolling is supported. I’m specifically happy about the scrolling feature albeit it probably needs a better user input, like prev/next page.

Windows version still doesn’t work due to an upstream bug, which is probably easy to fix.

ISMAP is currently in development.