Category Archives: Wayback Machine

Major expansion of Wayback Machine’s archive of the historical internet

The Next Web reports that the Internet Archive has vastly increased its historical database of the web:

The Internet Archive has updated its Wayback Machine with a significant bump in coverage: the service has gone from 150,000,000,000 URLs to having 240,000,000,000 URLs, a total of about 5 petabytes of data. More specifically, the Wayback Machine now covers the Web from late 1996 to December 9, 2012.

Cross-posted to Infoglut Tumblr.

Social networking word-of-the-day: “thinvisibility”

A new word for Facebookers and social networkers who cavalierly post embarrassing information about themselves to the web: thinvisibility.  Here’s a starting definition:

Thinvisibility: n.

  1. Being neither completely visible nor completely invisible.
  2. Being a tiny, shiny needle in a haystack of information overload.
  3. Being invisible to everyone except data aggregators and digital preservationists such as Google, the Wayback Machine, the NSA, and others.
  4. Being invisible to employers, colleges, police, neighbors, friends, exes, stalkers, acquaintances, and others, who are not interested in you, until they are.
  5. Being visible.

NARA hosting “lite” Bush website archive

There are plenty of good changes in the new whitehouse.gov site, such as a better copyright policy that enables clearer copying and remix, and a much shorter robots.txt file, which makes it easier for search engines and archivists to index and archive the site.  (Compare the current 4-line Obama robots file to a 2300+ line version from apparently late in the Bush era.)
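To make the robots.txt point concrete, here is a minimal sketch of how a well-behaved crawler decides whether it may fetch a page. The two robots files and the test URL below are invented for illustration and are not the actual whitehouse.gov files. The "ia_archiver" user agent is my assumption about the name an archive crawler might announce. The parsing itself uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical short file: one rule, almost everything is crawlable.
SHORT_ROBOTS = """\
User-agent: *
Disallow: /includes/
"""

# Hypothetical long file: many Disallow lines, each closing off a directory.
LONG_ROBOTS = """\
User-agent: *
Disallow: /includes/
Disallow: /news/text/
Disallow: /president/text/
"""

# Hypothetical page that an archivist might want to capture.
URL = "http://www.whitehouse.gov/news/text/index.html"

short_ok = allowed(SHORT_ROBOTS, "ia_archiver", URL)
long_ok = allowed(LONG_ROBOTS, "ia_archiver", URL)
```

Under these made-up files, the same page is open to the crawler in the short case and closed in the long one. Every extra Disallow line is one more place where an archive that honors robots.txt will quietly skip content.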

But what about Bush’s old website?  Shouldn’t that be preserved?  (Well, yeah!)  But when President Obama took the oath of office, things switched over and the Bush site was gone from public view.  Did anybody keep a copy?  Well, yes, kind of.  The Internet Archive archives the whitehouse.gov site, but I have deep concerns about the completeness of its archive.  See below for a screen cap of the Internet Archive’s database of http://www.whitehouse.gov.

Internet Archive captures of whitehouse.gov

I think it can be taken as an axiom that in a free society, it’s vital that governmental sites are archived frequently, deeply, accurately, and made available for scrutiny quickly.  But the depth of the Internet Archive’s archive of whitehouse.gov is unclear.  First, to the extent that the Bush administration’s robots.txt file told search engines and archives to stay away, did the Internet Archive fail to archive governmental content?  (Maybe not, but how can we be sure?)  Second, the Internet Archive is not up-to-date: as of this writing, the most recent public archive of whitehouse.gov is dated Mar. 25, 2008.  Finally and even more disturbingly, the Internet Archive’s frequency is poor.  It contains only 53 captures of the main whitehouse.gov page for 2007, and only 15 have yet been posted from calendar year 2008.  We can do better.
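Capture frequency of this kind can be checked by script rather than by eyeballing a calendar. The sketch below rests on assumptions: it supposes the Wayback Machine exposes a capture index (a "CDX" endpoint) at the URL shown, that it accepts the url, from, to, output, and fl parameters, and that its JSON output is a list of rows with a header row first. Please verify those details against the Internet Archive's own documentation. The sample response is fabricated for illustration, and no network request is made.

```python
import json
from collections import Counter
from urllib.parse import urlencode

# Assumed endpoint for the Wayback Machine capture index.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(page: str, year: int) -> str:
    """Build a query URL asking for capture timestamps of `page` in `year`."""
    params = {
        "url": page,
        "from": str(year),
        "to": str(year),
        "output": "json",
        "fl": "timestamp",  # request only the timestamp field
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def captures_per_month(cdx_json: str) -> Counter:
    """Count captures per YYYYMM, assuming the first JSON row is a header."""
    rows = json.loads(cdx_json)
    timestamps = [row[0] for row in rows[1:]]
    return Counter(ts[:6] for ts in timestamps)


# Fabricated sample response: three captures, two in January, one in February.
SAMPLE = '[["timestamp"],["20070103120000"],["20070119080000"],["20070204000000"]]'

query = cdx_query_url("whitehouse.gov", 2007)
by_month = captures_per_month(SAMPLE)
total = sum(by_month.values())
```

A per-month tally like this would show not just how many captures exist for a year, but whether they cluster in a few weeks and leave long gaps elsewhere.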

Interestingly, it appears that government archivists are now dipping their feet in the water.  At least part of the legacy Bush 2009 website is now being hosted by the National Archives and Records Administration (“NARA”), which administers the George W. Bush Presidential Library.  According to the site:

To preserve the historical record of the George W. Bush administration’s presence on the web, the White House took a “snapshot” of the Whitehouse.gov web site. This is historical material, “frozen in time.” The web site is no longer updated and links to external web sites and some internal pages will not work.

Having NARA archivists maintain an archive is a good start.  (Though there should always be archives maintained by disinterested third parties as well.)  But it’s not enough to have a “snapshot” of a presidential website.  Not only does the archive lack temporal depth (it’s only from materials existing in January 2009), but it appears to be incomplete as well, as even some internal links are admitted not to function.  Plus, as the site indicates, the “White House” took the snapshot.  I take this to mean that it was taken by interested White House insiders rather than by (hopefully) disinterested professional archivists at NARA.

H/T on Bush Archive to BushLegacy via Twitter.

BoingBoing “unpublishing” blog posts

When is it ok to delete a blog post?  Dan Solove wrote about this a few years back at Concurring Opinions, where he points to additional posts at Prawfsblawg (here, here, and here). More recently, BoingBoing faced public scrutiny when one of its authors removed posts related to blogger and sex columnist Violet Blue, although nobody noticed the removals for about a year.  A message board dedicated to the issue has generated over 1600 messages since July 1, some very heated.  The moderator for the board writes:

It’s our blog and so we made an editorial decision, like we do every single day. We didn’t attempt to silence Violet. We unpublished our own work. There’s a big difference between that and censorship.

We hope you’ll respect our choice to keep the reasons behind this private. We do understand the confusion this caused for some, especially since we fight hard for openness and transparency. We were trying to do the right thing quietly and respectfully, without embarrassing the parties involved.

Clearly, that didn’t work out. In attempting to defuse drama, we inadvertently ignited more. Mind you, we weren’t the ones splashing gasoline around; but we did make the fire possible. We’re sorry about that. In the meantime, Boing Boing’s past content is indexed on the Wayback Machine, a basic Internet resource; so the material should still be available for those who would like to read it.

Oddly, BoingBoing speaks in terms of “unpublishing” rather than deletion.   (Their policy page states “We reserve the right to unpublish or refuse to unpublish anything for any or no reason.”)  Sure, “unpublishing” sounds less big-brothery than deletion, but I don’t really see the difference.

Moreover, “unpublishing” isn’t quite accurate: BoingBoing doesn’t mean “unpublished” in the sense of a book (or blog posting) that has yet to be published.  They mean disabling public access to something that has already been posted, like in the DMCA 512(c) sense where material is removed or access to it is disabled.  (WordPress does have an “unpublishing” function, but that’s still a misnomer.)  A more accurate term might be deposting, depublishing, or good ol’ deletion.

Nevertheless, it’s useful to explore a potential distinction between deletion and depublishing, and other questions raised when a blogger wants to remove posted materials:

  • As a starting point, what is the meaning of “publication” in an age where materials can be changed or removed?
  • Under what circumstances is depublication justified?
  • What practices are needed to distinguish “depublication” from “deletion?”  Is a reservation of rights declaring a right of depublication sufficient?  Should a notice be posted where the materials used to be (as Dan Markel suggests)?
  • BoingBoing notes that the removed materials remain on the Wayback Machine web archive.  Do web archives help to justify depublication?
  • Does depublication serve an important social function by severing the association between author and depublished content?

Hat tip to Noam Cohen.  And a disclaimer: I did make some edits to this post after posting.

Inheritability of blogs: You take Aunt Esther’s silverware, I’ll take her blog…

Over at the user forums on WordPress.com, there’s an interesting thread on “web logs and wills.” Forum user timethief writes:

What happens to . . . web logs if a person dies and their executor notifies [the weblog's host] of their demise. Can one leave their account, username, password and API key number to another person in their will?

What a great question! It reminds me of the case last year of Lance Corporal Justin Ellsworth, who died in Iraq. After his death, his family asked Yahoo for access to his emails. Yahoo refused. After a court ordered Yahoo to hand over the contents of the account, Yahoo complied. But the parallel to Ellsworth has its limits. With emails, there are significant concerns over privacy: it just cannot be assumed that every deceased person wants his or her executors and heirs poring through their private and potentially embarrassing emails.

In contrast, blogs are intended for some level of public consumption and the privacy issues generally don’t run as high. (Though even with blogs, privacy concerns can exist, such as with David Lat, the formerly anonymous “Article III Groupie” who writes Underneath Their Robes.) Indeed, although many blogs are quickly abandoned, others are intended to serve as lasting statements of authorship, whether professional or personal (or both). As timethief noted in a later post, “Blogging is now and will remain part of what defined me as a unique individual.” But blogs aren’t books or magazines. After we’re gone, existing copies of books we wrote can continue to exist without additional effort on the part of our estates or heirs. And our estates and heirs can’t force consumers to return legally acquired copies of books.

But the book analogy is hard to apply to blogs. Blogs aren’t material objects and they’ll disappear without maintenance or preservation. But long-term maintenance isn’t really practical, at least not yet, for blogs whose owners have passed away. If hosting accounts aren’t kept active, or applicable payments stop, or hosting providers go out of business, or computers fail, or blogging code & databases become incompatible with future technologies, our blogs, like other web-only publications, may disappear or break. Plus, a blog might be shut down by an author’s estate or heirs, unless perhaps some sort of enforceable provisions can be made by the author that the blog be maintained posthumously.

Communal blogs like The Volokh Conspiracy stand a better chance of lengthy lives, since maintenance tasks can be undertaken as new members arrive. But most other sites, even highly successful ones like Howard Bashman’s How Appealing, are run by only one person. For an estate or heir, long-term maintenance after an author’s demise is not necessarily simple or — excuse the pun — appealing. In a rare case, successful blogs like Bashman’s could be valuable estate assets that would encourage continued maintenance and even eventual profitable transfer, but most blogs will utterly lack any such kind of maintenance incentive. (Of course, this is all illustrative, and Eugene and Howard should be blogging for many decades to come!)

This raises the question of digital preservation. Because long-term maintenance may not always be feasible, digital preservation of old sites becomes really important, and the utility of the Internet Archive’s Wayback Machine can’t be overstated. But I think that the Wayback Machine is just the beginning of a dialogue over how, and when, to preserve web-only materials. Putting copyright issues to the side for the moment, the Internet Archive doesn’t archive all sites, and when it does, it archives some sites more often than others. Plus, it’s not entirely clear whether the Wayback Machine is currently capable of properly archiving all types of blogs: the Internet Archive states that sites that are database-driven or that generate dynamic web pages can’t be archived. I’d think this limitation could apply to at least some blogs (such as this WordPress blog, which is driven by PHP and a MySQL database).

But a quick review of the Wayback Machine suggests that, despite the disclaimer, the Internet Archive may be improving its ability to archive blogs: here are links to a WordPress-run site that was archived incorrectly in March 2004, but appears to be much better represented in an archive from November 2004. Hopefully, the Internet Archive is continuing to improve its capability to archive different kinds of webpages. Needless to say, as web publishing technologies evolve, it will remain a struggle to find ways to accurately and authoritatively preserve such materials. My quick review of a number of blawgs suggests that some appear to have been pretty nicely archived, whereas others have not. I’ll address this more in a future post.

Thus, I think that timethief’s question — a really good one — leads to additional questions about whether web-only materials should be kept online, and if so, to even more questions about how, where, and by whom they should be maintained or preserved. I don’t think the answers to these questions are easy or obvious.