Archive Wayback Machine: Recover websites


Premise

The Wayback Machine is the web interface used by the Internet Archive to retrieve from its archives the data it holds about websites. The archived sites are a kind of “still image” taken at the moment the pages were captured by the Internet Archive’s indexing software. The name “Wayback Machine” comes from the “WABAC Machine” that appears in one of the stories of the animated series Rocky and Bullwinkle. Thanks to Alexa’s spider, the service records the changes and evolution of websites over time. Smaller sites are not cached frequently, and their pages may be stored only rarely.

It proves to be a useful service in the following cases:

  • studying the evolution of websites;
  • recovering lost pages and sites;
  • looking for content that was once published and later deleted.

The service gives its users access to archived versions of web pages from the past, what the Internet Archive calls a “three-dimensional archive”. Millions of websites, with their data (images, text, linked documents, etc.), are stored in a giant database. Not all websites are available, because many website owners choose to exclude their sites. As with all services based on data from web crawlers, the Internet Archive misses large areas of the web for a number of technical reasons. Over the years various legal issues have also arisen concerning whether certain sites should be stored and covered, although these are not the result of deliberate actions.

Internet Archive Wayback Machine

The use of the term “Wayback Machine” in the context of the Internet Archive has become so common that “Wayback Machine” and “Internet Archive” are almost synonymous in popular culture. For example, in the television series Law & Order: Criminal Intent (the episode “Legacy”, first aired on August 3, 2008, and broadcast in Italian under a title along the lines of “Virtual Love”), one of the characters uses the “Wayback Machine” to find an archived copy of a website.

The “snapshots” of the sites stored during the various passes of the crawler usually become publicly accessible after 6–18 months.

Some considerations before you begin

First of all, the actual possibility of recovering a site from the Wayback Machine depends on whether this “time machine” has stored the site we are interested in, on the date we are interested in, and as a complete copy (that is, with all the linked pages, style sheets, images, JavaScript, applets and anything else needed). Secondly, if the site in question was a “static” site, made up of HTML files written by hand or with an editor, the recovery can actually return the site “as it was”; but if the site was “dynamic”, for example built with a CMS backed by a database, all we will get is a “static” copy of that site, i.e. a set of HTML files, with no trace of the PHP or ASP pages that generated them.

That said, having already recovered a few sites that matter to me from the Wayback Machine, I can assure you that saving a site taken from the Wayback Machine onto your computer and putting it back online is neither a simple nor a quick job: it can take a few days of work, and there is no guarantee that the end result will be the one you hoped for.
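On the first point, a quick way to see which snapshots actually exist is the Internet Archive’s CDX query interface; a minimal check from the command line could look like the sketch below (www.example.com and the date range are placeholders to replace with the site and the period you care about):

curl "https://web.archive.org/cdx/search/cdx?url=www.example.com&from=2008&to=2009&output=json&limit=20"

Each row of the result includes, among other things, the capture timestamp and the original URL, the two pieces that make up the web.archive.org/web/… address passed to wget below.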

How to save a site from the Wayback Machine locally

There is no universal method to retrieve sites from the Wayback Machine.

Generally, in my experience, the steps to follow are:

  1. Initial download with wget of all the files of the site we are interested in, using options tailored to the Wayback Machine.
  2. Manual check of all the HTML pages saved by wget.
  3. Removal, using regexes (regular expressions), of the Wayback Machine toolbar from all pages.
  4. Analysis of the HTML code of all the saved pages, to decide which changes need to be made (with particular attention to all the links).
  5. Creation and application of all the regexes needed to do what was decided in the previous step.
  6. Checking of all internal and external links with LinkChecker.
  7. Repetition of steps 4, 5 and 6 until the site is properly restored.
  8. Any other changes at your discretion, depending on what you intend to do with the saved site.

A concrete example?

The wget command I ran was:

wget -e robots=off -r -nH --cut-dirs=3 --page-requisites --convert-links --adjust-extension --user-agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0" --timestamping --accept-regex=/name-site/ https://web.archive.org/web/20080603013541/name-site.asp
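For reference, this is roughly what the various options do:

# -e robots=off        ignore the robots.txt rules during the recursive download
# -r                   download recursively
# -nH                  do not create a local web.archive.org/ host directory
# --cut-dirs=3         drop the leading web/TIMESTAMP/... components from the local paths
# --page-requisites    also fetch images, style sheets and the other files needed to display each page
# --convert-links      rewrite the links in the downloaded pages so that they work locally
# --adjust-extension   save the pages with an .html extension (which is why the files end up as *.asp.html)
# --user-agent         present wget as an ordinary browser
# --timestamping       skip files that are already up to date locally
# --accept-regex       follow only the URLs matching the given regular expression (here, the site we want)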

For each file downloaded, I removed the following strings using sed:

<script type="text/javascript" src="https://web.archive.org/static/js/analytics.js"></script> 
<link type="text/css" rel="stylesheet" href="https://web.archive.org/static/css/banner-styles.css" />

I removed everything that followed the </html> tag, i.e. information such as:

<!-- 
     FILE ARCHIVED ON date AND RETRIEVED FROM THE 
     INTERNET ARCHIVE ON date. 
     JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE. 

     ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT. 
-->

I removed the code of the Wayback toolbar, that is, everything between the strings:

<!-- BEGIN WAYBACK TOOLBAR INSERT --> 
... 
<!-- END WAYBACK TOOLBAR INSERT -->

I removed various ancillary scripts, for example:

<script language="JavaScript" type="text/javascript"> ... </script> 
<script language="JavaScript" src="https://web.archive.org/web/20080212044754js_/name-site/php-stats.js.php" type="text/javascript"></script>

I then turned all the absolute links into relative ones using a Perl substitution (with a lazy match instead of a greedy one):

perl -pe 's/http.*?Massamarittima\///g' temp.html > temp.html.new

I made sure that all the internal links pointed to .html files instead of .asp:

perl -pe 's/\.asp/.html/g' temp.html > temp.html.new

and finally I renamed all the files from .asp to .html.

Bottom line…

I renamed all the files with a Bash loop like:

for file in *.asp.html; do 
    mv "$file" "$(basename "$file" .asp.html).html" 
done

I then created (and made executable) a file called pulisci.sh, like this:

#!/bin/bash 
sed -i '/<script type="text\/javascript" src="https:\/\/web.archive.org\/static\/js\/analytics.js"><\/script>/d' "$1" 
sed -i '/<link type="text\/css" rel="stylesheet" href="https:\/\/web.archive.org\/static\/css\/banner-styles.css" \/>/d' "$1" 
sed -i '/<\/html>/q' "$1" 
sed -i '/<!-- BEGIN WAYBACK TOOLBAR INSERT -->/,/<!-- END WAYBACK TOOLBAR INSERT -->/d' "$1" 
sed -i '/<script.*counter_name-site\.asp.*\/script>/d' "$1" 
sed -i '/<script.*counter_complessive\.asp.*\/script>/d' "$1" 
sed -i '/<script.*php-stats\.js\.php.*\/script>/d' "$1" 
perl -pe 's/http.*?name-site\///g' "$1" > "$1.new" 
mv "$1.new" "$1" 
perl -pe 's/\.asp/.html/g' "$1" > "$1.new" 
mv "$1.new" "$1"

and then I ran it with:

for file in *.html; do 
    ./pulisci.sh "$file"
    echo "$file done"
done
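A word of caution: since pulisci.sh edits the files in place with sed -i, it is prudent to work on a copy of the folder produced by wget, so that a wrong regex cannot ruin the only local copy; something along the lines of (the folder name here is just an example):

cp -r name-site name-site-backup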

So far I have assumed that all the HTML files were in a single folder, but that was not the case: the procedure has to be repeated for each folder, or you have to use a recursive procedure based on the find command.
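For instance, assuming pulisci.sh sits in the root folder of the saved site and is run from there, a recursive pass over all the subfolders could look like this:

find . -type f -name '*.html' -exec ./pulisci.sh {} \;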

End of the script: I then had to check all the files manually.

Further corrections made later:
* manual correction of several links;
* change of the encoding, with:

for file in *.html; do 
    perl -pe 's/charset=iso-8859-1/charset=utf-8/' "$file" > "$file.new" 
    mv "$file.new" "$file" 
    echo "$file done" 
done
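One caveat on this step: the loop above only changes the charset declared in the HTML, not the actual encoding of the bytes. If the saved files really are encoded in ISO-8859-1, they also need to be converted, for example with iconv (a sketch to adapt to your own layout):

for file in *.html; do 
    iconv -f ISO-8859-1 -t UTF-8 "$file" > "$file.new" 
    mv "$file.new" "$file" 
done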

* checking of all the links with LinkChecker: github.io/linkchecker (see the example below).
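A minimal invocation, assuming LinkChecker is installed and launched from the root folder of the recovered site, could be:

linkchecker index.html

By default it follows the local links recursively and reports the broken ones; external links can also be tested by enabling the relevant option (--check-extern in the versions I have used).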

Happy recovering of your sites… and above all, good luck with your studying and with making it all work ;)