Around the Web: Internet Archive to Stop Referring to robots.txt Files

MARA Blog

Published: August 2, 2017 by Anna Maloney

The Internet Archive plans to stop honoring robots.txt files, which can limit an archived website's availability to future viewers.

One of the biggest challenges facing archivists and records managers in a Web 3.0 world is understanding the best way to preserve websites and webpages. For the past 21 years, the Internet Archive, a non-profit based in San Francisco, California, has been working on this front, preserving websites and providing their surrogates to users free of charge. The Wayback Machine is one of its most popular products and is a great way to spend an afternoon.

In an April 2017 blog post, Mark Graham, the Director of the Wayback Machine, explains the organization's decision to start "ignoring" the robots.txt file, which lets site owners ask crawlers, search engines and archival bots alike, not to access certain websites and webpages: "Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve archival purposes…We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user's perspective."
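For readers unfamiliar with the mechanism, here is a minimal sketch of what "honoring" robots.txt looks like in practice, using Python's standard urllib.robotparser. The domain, path, and user-agent name are hypothetical, and this is only an illustration of the general protocol, not the Internet Archive's actual crawler code:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (hypothetical example URL).
rp = RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()

# A compliant crawler checks permission before requesting each page.
# "Ignoring" robots.txt, as the Wayback Machine now does, simply
# means skipping this check and archiving the page regardless.
if rp.can_fetch("MyArchiveBot", "https://example.org/private/page.html"):
    print("robots.txt permits crawling this page")
else:
    print("robots.txt asks crawlers to skip this page")
```

The key point of the Archive's policy change is that this check was designed for search indexing, where site owners' wishes about discoverability make sense, whereas an archive aims to record what the web actually looked like.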

Check out other interesting posts on the Internet Archive's blog.
