19 August 2015

Archiving BMCR

Just a few days after my last post on archiving to fight link rot, Ryan Baumann (@ryanfb) wrote up his impressive efforts to make sure that all the links in the recently announced AWOL Index were archived. Since I had been thinking about this sort of thing for the open-access Bryn Mawr Classical Review, on whose editorial board I sit, I figured I'd just use his scripts to make sure all the BMCR reviews were on the Internet Archive. (Thanks, Ryan!)

Getting all the URLs was fairly simple, though there was a little bit of brute-force work involved for the earlier years, before BMCR settled on a standard URL format. Actually there are still a few PDFs from scans of the old print versions which I completely missed on the first pass, but once I found out they were out there, it was easy enough to go get them. (I was looking for "html" as a way of pulling all the reviews, so the ".pdf" files got skipped.)
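To illustrate how filtering on "html" skips the PDFs, here is a minimal sketch in Python. It is not the script I actually ran: the function name, the extension list, and the example.org URLs are all made up for illustration.

```python
from urllib.parse import urlparse

# Extensions treated as reviews; ".pdf" covers scans of the old print issues.
# This list is an assumption for the sketch, not BMCR's actual URL scheme.
REVIEW_EXTENSIONS = (".html", ".pdf")


def select_review_urls(urls, extensions=REVIEW_EXTENSIONS):
    """Return the URLs whose path ends in one of the given extensions."""
    selected = []
    for url in urls:
        # Compare against the path only, so query strings do not interfere.
        path = urlparse(url).path.lower()
        if path.endswith(extensions):
            selected.append(url)
    return selected


# Hypothetical URLs, purely for demonstration.
candidates = [
    "http://example.org/1990/01.01.01.html",
    "http://example.org/1990/scan.pdf",
    "http://example.org/about/",
]

# Filtering on ".html" alone silently drops the scanned PDF.
html_only = select_review_urls(candidates, extensions=(".html",))

# Including ".pdf" picks up both kinds of review.
both = select_review_urls(candidates)
```

Matching on a short list of extensions, rather than on a single string, makes it harder to lose a whole class of files this way.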

In the end, fewer than 10% of the 10,000+ reviews were missing from the Archive. They are there now, assuming I got them all uploaded. Let me know if you find one I missed.
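For anyone who wants to try something similar, here is a rough sketch of the check-then-archive step in Python. It is not Ryan's code. It assumes the Wayback Machine's availability endpoint and its "save" URL prefix behave as I understand them, and the function names are my own. The service may rate-limit or change, so treat this as a starting point only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoints: the Wayback availability API and the "save page" prefix.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
SAVE_PREFIX = "https://web.archive.org/save/"


def availability_query(url):
    """Build the availability-API query for a given page URL."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": url})


def closest_snapshot(response):
    """Return the snapshot URL from a decoded API response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def save_request_url(url):
    """Build the URL that asks the Wayback Machine to capture a page."""
    return SAVE_PREFIX + url


def archive_if_missing(url, timeout=30):
    """Return an existing snapshot URL, or request a new capture.

    This function performs network requests; the helpers above do not.
    """
    with urlopen(availability_query(url), timeout=timeout) as reply:
        snapshot = closest_snapshot(json.load(reply))
    if snapshot is not None:
        return snapshot
    with urlopen(save_request_url(url), timeout=timeout) as reply:
        return reply.geturl()
```

With 10,000+ URLs it would be polite to pause between requests and to log any failures for a second pass.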

I'm still looking at WebCite too.

14 August 2015

Fight link rot!

Argh, it just happened to me again. I clicked on a link on a webpage only to find that the page on the other end of the link was gone. 404. A victim of link rot.

This kind of thing is more than a hassle. It's a threat to the way we work as scholars, where citing one's sources and evidence is at the heart of what we do. Ideally links would never go away, but the reality is that they do. How often? A study cited in a 2013 NYTimes article found that 49% of links in Supreme Court decisions were gone. The problem has gotten big enough that The New Yorker had a piece on it back in January.

What I wanted to do here is point out that there are ways for us scholars to fight link rot, mostly thanks to the good work of others (and isn't that the whole point of the Internet?). Back in that ideal world, publishers would take care that their links never died, even if they went out of business, but we users can help them out by making sure that their work gets archived when we use it. Instead of simply linking to a page, either link to an archived version of it or archive it when you cite it, so that others can go find it later.

I've used two archiving services, Archive.org and WebCite. Both services respect the policies of sites with regard to saving copies (i.e., via their robots.txt files), but Archive.org will actually keep checking those policies, so it's possible that a page you archived will later disappear if the site changes its robots.txt. That won't happen on WebCite. WebCite will also archive a few links deep from the page you ask it to archive, while Archive.org captures just that one page.
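If you're curious whether a site's robots.txt would block an archiving crawler, Python's standard library can evaluate the rules for you. This is only a sketch: the sample policy is invented, and the user-agent name "ia_archiver" is my assumption about which name to test for Archive.org. Neither service's actual checking logic is shown here.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Report whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# An invented policy that blocks one crawler from one directory.
policy = """\
User-agent: ia_archiver
Disallow: /private/
"""

# Hypothetical URLs on a placeholder domain.
blocked = allowed_by_robots(
    policy, "ia_archiver", "http://example.org/private/page.html"
)
permitted = allowed_by_robots(
    policy, "ia_archiver", "http://example.org/reviews/page.html"
)
```

Since a site can change its robots.txt at any time, a check like this only tells you about the policy on the day you run it.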

WebCite is certainly more targeted to the scholarly community, and their links are designed to be used in place of the originals in your work. But both of them are way better than nothing, and you'll find lots of sites using them. For convenience there are bookmarklets for each that you can put in your browser bar for quick archiving (WebCite, Archive.org).

So next time you cite a page, make sure you archive it. Maybe even use WebCite links in your stuff (like I did in this post on the non-Wikipedia links).

(FYI, another service is provided by Perma.cc, which is designed for the law, covered in this NPR story.)

Added 18 August 2015: Tom Elliott (@paregorios) notes this article on using WebCite in the field of public health (which I link to via its DOI).