A weekend project – fromthecache.com

I was playing around over the weekend, screen-scraping various sites and analyzing their word frequencies (don’t ask). Responses were slow, and I accidentally got my IP blocked by one site after hitting it a few too many times.

Eventually I hit upon the idea of requesting each URL from Google Cache instead (the pages I was scraping had sequential ?id=xxx URLs, so it was easy to automate), with the aim of speeding things up a bit and taking some load off the target sites.
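
As a rough sketch of that automation, the snippet below builds a cache lookup URL for each sequential id. The cache endpoint (webcache.googleusercontent.com/search?q=cache:), the example.com target and the helper names are all assumptions for illustration; the post does not say which URL form was actually used.

```ruby
require "cgi"

# Assumed cache endpoint; not taken from the original post.
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

# Build the cache lookup URL for a single page (hypothetical helper).
def cache_url(page_url)
  CACHE_PREFIX + CGI.escape(page_url)
end

# Build cache URLs for a range of sequential ?id=N pages (hypothetical helper).
def sequential_cache_urls(base_url, ids)
  ids.map { |id| cache_url("#{base_url}?id=#{id}") }
end
```

Calling sequential_cache_urls("http://example.com/page", 1..3) would return three cache URLs, one per id, ready to be fetched in a loop.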

With this in mind, I spent a few hours on Saturday and Sunday developing fromthecache.com. It’s built on Rails and designed to provide transparent access to the Google cache, fetching the original page as a fallback if necessary.
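
The cache-first, origin-as-fallback idea might look something like the following. This is a minimal sketch and not the app's actual code: the cache endpoint, the method names and the injectable fetcher are assumptions made so the logic is easy to follow and to test.

```ruby
require "cgi"
require "net/http"
require "uri"

# Assumed cache endpoint; not taken from the original post.
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

# Fetch a URL and return its body, or nil on any non-2xx response or error.
def http_fetch(url)
  response = Net::HTTP.get_response(URI(url))
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
rescue StandardError
  nil
end

# Try the cached copy first; if that yields nothing, fetch the original page.
# The fetcher is injectable (anything responding to call) so it can be stubbed.
def fetch_with_fallback(page_url, fetcher: method(:http_fetch))
  fetcher.call(CACHE_PREFIX + CGI.escape(page_url)) || fetcher.call(page_url)
end
```

Returning nil from the fetcher on failure keeps the fallback to a single || expression; a real app would probably also want timeouts and redirect handling.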

It occurred to me halfway through that it’s also useful for providing mirror links if a site gets slashdotted: just put fromthecache.com/ in front of the URL and you have an instant cache link.
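
One way the prefix trick could work on the server side is to treat everything after the leading slash as the target URL. The helper below is a hypothetical sketch under that assumption; the name target_from_path and the handling of scheme-less or slash-collapsed paths are illustrative guesses, not the site's documented behaviour.

```ruby
# Recover the target URL from a request path such as
# "/http://example.com/page?id=1" (hypothetical helper).
def target_from_path(fullpath)
  target = fullpath.sub(%r{\A/}, "")
  return nil if target.empty?

  # Some clients and proxies collapse "//" to "/"; restore it after the scheme.
  target = target.sub(%r{\A(https?):/+}, '\1://')

  # Assume plain http when no scheme was given.
  target = "http://#{target}" unless target.match?(%r{\Ahttps?://})
  target
end
```

The recovered URL can then be handed to whatever performs the cache lookup.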

There’s a fairly good chance that the server’s IP will get blocked by Google for looking like a bot, but I’m hoping requests out of Heroku might come from a few different IPs and mix things up a bit.

You can view the demo at fromthecache.com, browse the source, or download it from the project page and try it out for yourself.

