Obama’s Change.gov promise to protect whistleblowers? Scrubbed from the Web

Well, this pissed me off. Long-time readers of this site may recall my interest in the Internet Archive’s Wayback Machine, which aims to preserve the historical web. I’ve previously written to criticize the Bush administration for its lengthy robots.txt exclusion file (thousands of lines long), which could be viewed as an attempt to prevent the Wayback Machine and others from archiving portions of its White House website. I also wrote to compliment the new Obama White House website for its much shorter, and much more archive-friendly, robots file.

But now the Obama administration is scrubbing the web, too. John Wonderlich at the Sunlight Foundation reports that materials from Obama’s old transition website at Change.gov have recently been deleted. Although the main page has referred users for a while to the Whitehouse.gov site, internal pages regarding his agenda were still online, and “until recently, you could still continue on to see the materials and agenda laid out by the administration.”

So why the change? Wonderlich speculates (and I think 100% correctly) that the internal Change.gov pages were removed because of broken and now inconvenient promises made in the transition team’s “Obama-Biden Plan” to protect whistleblowers. Considering its record of aggressively prosecuting whistleblowers such as Edward Snowden, the administration likely decided to scrub the promises it made during the transition period.

But in an era of permanent digital records (hello, NSA and its yottabytes of storage in Utah!), how can the Obama administration be so naïve as to think that somebody wouldn’t: 1) notice the missing pages; 2) find the old site; and 3) point it out? As a prosecutor might say, destroying evidence may be proof of a guilty conscience. The administration’s naïveté is positively striking, considering that Obama’s people are widely touted as being extremely tech-savvy.

See for yourself. In an Internet Archive capture of the Change.gov site from June 7, 2013 (barely a month ago), a page on ethics (!) in the Obama-Biden Plan promised to protect whistleblowers:

Protect Whistleblowers: Often the best source of information about waste, fraud, and abuse in government is an existing government employee committed to public integrity and willing to speak out. Such acts of courage and patriotism, which can sometimes save lives and often save taxpayer dollars, should be encouraged rather than stifled. We need to empower federal employees as watchdogs of wrongdoing and partners in performance. Barack Obama will strengthen whistleblower laws to protect federal workers who expose waste, fraud, and abuse of authority in government. Obama will ensure that federal agencies expedite the process for reviewing whistleblower claims and whistleblowers have full access to courts and due process.

Here’s a screen cap. According to the Wayback Machine, this was still online as recently as June 7:


Post-Snowden, this is what you see today:


The difference? No doubt it’s the Snowden affair, which broke in early June. A Google search of Change.gov for “whistleblowers” conducted today (screen cap here) shows no hits, so the page apparently has not been moved to another URL on the site. It simply seems to be gone.

Even more disturbingly, this may reflect a broader trend of digital scrubbing. Wonderlich notes that this is not the first time that Obama administration documents have disappeared from the internet. An earlier posting of his includes a letter the Sunlight Foundation and others sent to the Department of Labor criticizing the administration for removing materials. As the letter states, “No major administration decision should be accompanied by related materials disappearance from public view.”

HT Animal. Cross-posted to Infoglut Tumblr.

Major expansion of Wayback Machine’s archive of the historical internet

The Next Web reports that the Internet Archive has vastly increased its historical database of the web:

The Internet Archive has updated its Wayback Machine with a significant bump in coverage: the service has gone from 150,000,000,000 URLs to having 240,000,000,000 URLs, a total of about 5 petabytes of data. More specifically, the Wayback Machine now covers the Web from late 1996 to December 9, 2012.

Cross-posted to Infoglut Tumblr.

NARA hosting “lite” Bush website archive

There are plenty of good changes in the new whitehouse.gov site, such as a better copyright policy that more clearly permits copying and remixing, and a much shorter robots.txt file, which makes it easier for search engines and archivists to index and archive the site.  (Compare the current 4-line Obama robots file to a 2300+ line version from apparently late in the Bush era.)
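For readers who want to see how a robots.txt file actually gates a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`.  The two rule sets and the test URL are made up for illustration (they are not the actual Bush or Obama files), and `ia_archiver` is assumed here as the crawler name that an archive might announce.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse in-memory lines; no network access
    return parser.can_fetch(user_agent, url)


# Hypothetical rule sets, NOT the real White House files.
SHORT_RULES = [
    "User-agent: *",
    "Disallow: /includes/",
]
LONG_RULES = [
    "User-agent: *",
    "Disallow: /includes/",
    "Disallow: /news/text/",
    "Disallow: /president/text/",
]

# Hypothetical page path used only for this example.
URL = "http://www.whitehouse.gov/news/text/index.html"

print(allowed(SHORT_RULES, "ia_archiver", URL))  # True
print(allowed(LONG_RULES, "ia_archiver", URL))   # False
```

The point of the sketch is simply that every extra `Disallow` line is another directory that a well-behaved crawler will skip, so a file with thousands of lines can carve a lot out of the archive.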

But what about Bush’s old website?  Shouldn’t that be preserved?  (Well, yeah!)  But when President Obama took the oath of office, things switched over and the Bush site was gone from public view.  Did anybody keep a copy?  Well, yes, kind of.  The Internet Archive archives the whitehouse.gov site, but I have deep concerns about the completeness of its archive.  See below for a screen cap of the Internet Archive’s database of http://www.whitehouse.gov.

Internet Archive captures of whitehouse.gov

I think it can be taken as an axiom that in a free society, it’s vital that governmental sites are archived frequently, deeply, and accurately, and made available for scrutiny quickly.  But the depth of the Internet Archive’s archive of whitehouse.gov is unclear.  First, to the extent that the Bush administration’s robots.txt file told search engines and archives to stay away, did the Internet Archive fail to archive governmental content?  (Maybe not, but how can we be sure?)  Second, the Internet Archive is not up-to-date: as of this writing, the most recent public archive of whitehouse.gov is dated Mar. 25, 2008.  Finally and even more disturbingly, the Internet Archive’s capture frequency is poor.  It contains only 53 captures of the main whitehouse.gov page for 2007, and only 15 have yet been posted from calendar year 2008.  We can do better.
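To put numbers on capture frequency, here is a rough sketch that tallies captures per year and the average gap between them.  It assumes capture times are given as 14-digit `YYYYMMDDhhmmss` strings, the form that appears in Wayback Machine URLs; the sample timestamps are invented for illustration and are not the real whitehouse.gov capture list.

```python
from collections import Counter
from datetime import datetime


def parse_ts(ts):
    """Parse a 14-digit YYYYMMDDhhmmss timestamp string."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def captures_per_year(timestamps):
    """Count how many captures fall in each calendar year."""
    return Counter(parse_ts(ts).year for ts in timestamps)


def mean_gap_days(timestamps):
    """Average number of days between consecutive captures (None if < 2)."""
    times = sorted(parse_ts(ts) for ts in timestamps)
    if len(times) < 2:
        return None
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical capture timestamps, for illustration only.
SAMPLE = ["20070101000000", "20070108000000", "20070122000000"]

print(captures_per_year(SAMPLE)[2007])  # 3
print(mean_gap_days(SAMPLE))            # 10.5
```

By the same arithmetic, 53 captures spread over a full year works out to roughly one capture a week, which is a lot of time for a page to change unrecorded.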

Interestingly, it appears that government archivists are now dipping their toes in the water.  At least part of the legacy Bush website, as of January 2009, is now being hosted by the National Archives and Records Administration (“NARA”), which administers the George W. Bush Presidential Library.  According to the site:

To preserve the historical record of the George W. Bush administration’s presence on the web, the White House took a “snapshot” of the Whitehouse.gov web site. This is historical material, “frozen in time.” The web site is no longer updated and links to external web sites and some internal pages will not work.

Having NARA archivists maintain an archive is a good start.  (Though there should always be archives maintained by disinterested third parties as well.)  But it’s not enough to have a “snapshot” of a presidential website.  Not only does the archive lack temporal depth (it’s only from materials existing in January 2009), but it appears to be incomplete as well, as even some internal links are admitted not to function.  Plus, as the site indicates, the “White House” took the snapshot.  I take this to mean that it was taken by interested White House insiders rather than by (hopefully) disinterested professional archivists at NARA.

H/T on Bush Archive to BushLegacy via Twitter.

Is Zoetrope the next-gen Internet Archive?

Although the Internet Archive’s Wayback Machine is a great research tool, its utility is hampered by a lack of basic search mechanisms.  One can search by URL and archived links, but basic Google-style boolean searching isn’t available.  The Archive once offered a beta boolean search tool, but it never worked and it was later withdrawn.

However, a new application may significantly expand our ability to data-mine archived webdata. Reports give a sneak peek at Zoetrope, an application being developed by researchers at Adobe and the University of Washington.  As put by the researchers:

The Web is ephemeral. Pages change frequently, and it is nearly impossible to find data or follow a link after the underlying page evolves. We present Zoetrope, a system that enables interaction with the historical Web (pages, links, and embedded data) that would otherwise be lost to time. Using a number of novel interactions, the temporal Web can be manipulated, queried, and analyzed from the context of familar [sic] pages. Zoetrope is based on a set of operators for manipulating content streams. We describe these primitives and the associated indexing strategies for handling temporal Web data. They form the basis of Zoetrope and enable our construction of new temporal interactions and visualizations.

The demo video shows how historical webdata could be manipulated and compared, as the authors note, in a variety of “novel” ways.  Even more significantly, researcher Eytan Adar “hopes to eventually incorporate information from the Internet Archive’s nearly 14 years of records.” Such a combination would massively increase the utility of web archives, but would also — as discussed in a paper I’m writing — exacerbate concerns over informational autonomy.


The research paper can be found here.

BoingBoing “unpublishing” blog posts

When is it ok to delete a blog post?  Dan Solove wrote about this a few years back at Concurring Opinions, where he points to additional posts at Prawfsblawg (here, here, and here). More recently, BoingBoing faced public scrutiny when one of its authors removed posts related to blogger and sex columnist Violet Blue, although nobody noticed the removals for about a year.  A message board dedicated to the issue has generated over 1600 messages since July 1, some very heated.  The moderator for the board writes:

It’s our blog and so we made an editorial decision, like we do every single day. We didn’t attempt to silence Violet. We unpublished our own work. There’s a big difference between that and censorship.

We hope you’ll respect our choice to keep the reasons behind this private. We do understand the confusion this caused for some, especially since we fight hard for openness and transparency. We were trying to do the right thing quietly and respectfully, without embarrassing the parties involved.

Clearly, that didn’t work out. In attempting to defuse drama, we inadvertently ignited more. Mind you, we weren’t the ones splashing gasoline around; but we did make the fire possible. We’re sorry about that. In the meantime, Boing Boing’s past content is indexed on the Wayback Machine, a basic Internet resource; so the material should still be available for those who would like to read it.

Oddly, BoingBoing speaks in terms of “unpublishing” rather than deletion.   (Their policy page states “We reserve the right to unpublish or refuse to unpublish anything for any or no reason.”)  Sure, “unpublishing” sounds less big-brothery than deletion, but I don’t really see the difference.

Moreover, “unpublishing” isn’t quite accurate: BoingBoing doesn’t mean “unpublished” in the sense of a book (or blog posting) that has yet to be published.  They mean disabling public access to something that has already been posted, as in the DMCA 512(c) sense where material is removed or access to it is disabled.  (WordPress does have an “unpublishing” function, but that’s still a misnomer.)  A more accurate term might be deposting, depublishing, or good ol’ deletion.

Nevertheless, it’s useful to explore a potential distinction between deletion and depublishing, and other questions raised when a blogger wants to remove posted materials:

  • As a starting point, what is the meaning of “publication” in an age where materials can be changed or removed?
  • Under what circumstances is depublication justified?
  • What practices are needed to distinguish “depublication” from “deletion”?  Is a reservation of rights declaring a right of depublication sufficient?  Should a notice be posted where the materials used to be (as Dan Markel suggests)?
  • BoingBoing notes that the removed materials remain on the Wayback Machine web archive.  Do web archives help to justify depublication?
  • Does depublication serve an important social function by severing the association between author and depublished content?

Hat tip to Noam Cohen.  And a disclaimer: I did make some edits to this post after posting.