
01 August 2016

Spreadsheet to GeoJSON: Part 2, in which I do not re-invent the wheel

My last post was about doing a simple, but automated conversion of our growing Google sheet of Roman temples into GeoJSON. At the end, I noted that there were some existing methods for doing that, and in the interim I’ve been exploring those a bit.

Most of them involve some kind of javascript and are designed for use on websites (or with node). That’s pretty nice in some circumstances, and in fact I’ve been playing around with them, trying to improve my leaflet.js skills in the process.

However I’m still interested in being able to download our data and convert all of it to GeoJSON that gets posted to GitHub, and javascript isn’t ideal for that (though I think I could make it work if I knew js better). But there is some software out there that means I don’t have to worry about the actual conversion from sheet to json, though I do have to do some clean up. Ideally the script will not require any interaction, and it will convert all the data we have into GeoJSON, not just the bits that I tell it about. This will let us add and delete columns in the sheet without worrying about whether that will screw up the conversion.

The Details

Temples and mithraea mapped in the area of Rome.
I settled on the nifty ogr2ogr utility from GDAL, the Geospatial Data Abstraction Library, to do the conversion (or conversions, as it turned out), and still did some cleanup with jq and the usual text-manipulation tools. The script was written in bash, so it can be run from the command line (or automatically via cron, or LaunchDaemon these days, I guess). Here’s how the script goes:
  1. Download the sheet from Google as xml: https://spreadsheets.google.com/feeds/list/key/1/public/values, where key is the key for the sheet. I’m pretty sure I could get csv if I authenticated to Google, but I couldn’t figure out how to do that in a timely way, so I stuck with the public feed. When I tried to download the authenticated versions via the browser, it did work, likely because I’m logged into Google services all the time, though possibly because Google doesn’t like people using curl to get data but will allow a real browser to. Either way, I didn’t want to have to rely on the browser for this, so I stuck with what I could get via the command line.
  2. Reformat and clean up the xml with tidy -xml -iq and then sed 's/gsx://g'. (Google really likes those gsx prefixes on the field names.)
  3. Use ogr2ogr to convert the xml to csv. ogr2ogr seems to need files to work on, so at this point I also stored the data in temporary files. (Happy to be wrong about the necessity for files, so let me know!)
  4. Now use ogr2ogr again, this time to convert the csv to GeoJSON. You’d think I could convert the xml directly to GeoJSON, but, again, I couldn’t figure that out (.vrt files) and so resorted to the double conversion. Help welcome!
  5. At this point, I do a little more clean up, getting rid of some fields from the xml that I don’t want (via grep, jq, and perl) and removing the false 0,0 coordinates that get generated instead of null when the sheet cells for longitude and latitude are empty.
  6. Finally I save the resultant file to my GitHub sync folder. I also generate a version that has only entries with coordinates, which can be mapped without complaints. That last is handled with jq and tr.
In the end then I’ve got a nice little script that does the job. Some possible improvements include getting ogr2ogr to do the xml-to-GeoJSON conversion directly and using Google auth codes to avoid having to have a public sheet. One thing perhaps best left for another script is the creation of leaflet-related properties connected to how the icon gets displayed (icon color, image, etc.).
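
For the curious, here is a rough sketch of how those steps fit together. The sheet key and file names are placeholders, the jq filters are simplified, and the exact ogr2ogr options will depend on your GDAL setup; the real, annotated script is on GitHub:

    #!/bin/bash
    # Sketch only: KEY and the file names are placeholders.
    KEY="your-sheet-key"

    # 1. Download the public feed as xml.
    curl -s "https://spreadsheets.google.com/feeds/list/$KEY/1/public/values" > raw.xml

    # 2. Clean up the xml and drop Google's gsx: prefixes.
    tidy -xml -iq raw.xml | sed 's/gsx://g' > temples.xml

    # 3 & 4. Convert xml -> csv -> GeoJSON with ogr2ogr (two passes).
    # X/Y_POSSIBLE_NAMES assumes the sheet has "longitude" and "latitude" columns.
    ogr2ogr -f CSV temples.csv temples.xml
    ogr2ogr -f GeoJSON temples-all.json temples.csv \
        -oo X_POSSIBLE_NAMES=longitude -oo Y_POSSIBLE_NAMES=latitude

    # 5. Null out the bogus 0,0 points that empty lat/long cells produce.
    # (The real script also strips some unwanted fields here with grep, jq, and perl.)
    jq '.features |= map(if .geometry.coordinates == [0,0] then .geometry = null else . end)' \
        temples-all.json > temples.geojson

    # 6. Keep a second copy with only the mappable (non-null) features.
    jq '.features |= map(select(.geometry != null))' temples.geojson > temples-mappable.geojson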

As usual the script is up on GitHub, where it’s annotated with more detail, and you can see the resultant map up there too (the area around Rome is shown in the image up above).

08 July 2016

Google Sheets to GeoJSON

Part of my summer is being spent working with a Drew University College of Liberal Arts student, Alexis Ruark, on growing a database of Roman temples. This is part of Drew's new Digital Humanities Institute, partly funded by the fine people at the Mellon Foundation (on which more soon, I hope).

There are of course plenty of data out there on ancient places, including Pleiades, the Digital Atlas of the Roman Empire, and Vici.org, but what we're trying to do is create a database with more detail than those more generic projects (and I don't mean "generic" in a bad way here). In turn we hope to be able to contribute back to them (especially as we've relied on some of their data to kickstart our own work).

Alexis is working in a Google spreadsheet for a number of reasons, including easy sharing between us and the advantages that spreadsheets offer in general (e.g., sorting rows and moving columns around). But it isn't so easy to share data in that format, and there is already an existing format for sharing geographical data, namely, GeoJSON, so I'd like to be able to convert from the sheet to that format. (I'm also thinking ahead a little bit to when the project grows up a little, and having the data in a different format will be more useful, if not necessary.)

First step, of course, was to do an internet search for converting Google sheets to JSON. Turns out the Google APIs themselves support conversion to one kind of JSON, so I figured this might be a nice little project for me to work on my coding skills while I learned more about JSON and the software that's out there already.

What I found

One page with some hints on converting Google sheets to JSON can be found here. In brief Google provides a feed of your spreadsheet in JSON format as long as you publish the spreadsheet to the web. Here's what that URI looks like:

https://spreadsheets.google.com/feeds/list/<sheet_ID>/1/public/values?alt=json

where the "<sheet_ID>" is that long code that shows up in the URI to your spreadsheet. One change that I had to make to the instructions on the site was to the part of the path that shows up right after that ID (a "1" here). It seems from the Google documentation to indicate the key of the sheet in your file that should be exported. (Happy to be corrected on that. See my comment on that article for some more links.)
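
So, for example, pulling down the JSON from the command line looks something like this (with the <sheet_ID> placeholder as above, and piping through jq just to pretty-print):

    # Fetch the published sheet's list feed as JSON (the sheet must be published to the web).
    curl -s "https://spreadsheets.google.com/feeds/list/<sheet_ID>/1/public/values?alt=json" | jq '.'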

The Process

Here's what I came up with:
  1. Get the JSON via the Google API and curl.
  2. That JSON isn't GeoJSON, so it needs to be processed. This was a chance for me to do some more work with the very powerful command-line app, jq, which I learned about from a great post by Matthew Lincoln on the Programming Historian. That took a few steps:
    1. Remove the odd prefixes Google sticks on the column headers: "gsx$". It's not strictly necessary, but it does make the JSON—and the rest of this script—a bit more readable. For this I just used sed 's/gsx\$//g'.
    2. Pull out just the JSON for the rows, leaving out the info about the spreadsheet that is prepended to it. Here's the first use of jq: jq -c '.feed.entry[]'.
    3. Create a proper GeoJSON file with those rows, using only the necessary data (just longitude, latitude, and name for now): jq -c '{type: "Feature", geometry: {type: "Point", coordinates: [(.longitude."$t"|tonumber), (.latitude."$t"|tonumber)]}, "properties": {name: .temple."$t"}}' | tr '\n' ',' | sed 's/,$//g'. There are a couple of things going on there:
      • First, the coordinates had to be interpreted as numbers, but the API quotes them as if they were text. jq's tonumber function takes care of that, used inside parentheses with | (a new one for me).
      • jq also spits out each row as a separate JSON object, but they need to form part of a bigger object. This requires commas between them in place of the new lines that jq leaves when it's doing compact output, indicated by the -c option. tr took care of that, and sed removed the comma that got inserted at the end of the file.
      • The rest just uses jq to take the appropriate fields from Google's JSON and put them where GeoJSON requires.
  3. Finally, I fill a file with this data, flanked by some needed opening and closing code:
    • Prefix for the GeoJSON file: {"type": "FeatureCollection","features": [
    • All that nice JSON from the previous step.
    • Closing brackets: ]}
    • Then, for esthetics and readability, I use jq to reformat the JSON: jq '.'
    That file gets saved to my local copy of the GitHub repository for this project, so that when it gets synced, the work is backed up with a version history, and we get the added bonus that GitHub shows GeoJSON files as maps by default.
I saved the whole thing as a bash script with a little more error checking than I discussed here. You can check it out on GitHub.
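
For reference, here is a minimal sketch of how the steps above fit together as one pipeline. The <sheet_ID> and the output path are placeholders, and, as I said, the real script has more error checking:

    #!/bin/bash
    # Sketch of the steps described above; <sheet_ID> and OUT are placeholders.
    URL="https://spreadsheets.google.com/feeds/list/<sheet_ID>/1/public/values?alt=json"
    OUT="$HOME/GitHub/temples/temples.geojson"

    # 1-2. Get the JSON, drop the gsx$ prefixes, pull out the rows,
    # and turn each row into a GeoJSON Feature.
    features=$(curl -s "$URL" \
      | sed 's/gsx\$//g' \
      | jq -c '.feed.entry[]' \
      | jq -c '{type: "Feature",
                geometry: {type: "Point",
                           coordinates: [(.longitude."$t"|tonumber), (.latitude."$t"|tonumber)]},
                properties: {name: .temple."$t"}}' \
      | tr '\n' ',' | sed 's/,$//g')

    # 3. Wrap the features in the opening and closing code and pretty-print.
    echo "{\"type\": \"FeatureCollection\",\"features\": [${features}]}" | jq '.' > "$OUT"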

Other methods

Turns out I should have searched for "google sheet to GeoJSON" instead of just "JSON" when I started this, as there are several existing ways to do this. My own offers some advantages for me (like saving to my GitHub repository), and I'm glad I took the time to work through the coding myself, but I'm looking more closely at these others to see if I can't use them or contribute to them to come up with a better solution.

One nice approach, called Geo, uses a script that you add to your spreadsheet. It will then let you export a GeoJSON file. Like my script (so far), it's limited to exporting just the geographical coordinates and an ID for the point. It will also look up addresses for you and fill in coordinates for them, which is not something that our project needs, but is very nice regardless.

A second method, csv2geojson, uses javascript to convert csv files to GeoJSON. In addition to making a collection of individual points, it can convert a list of points into another type of geographical entity, a line string.

A third looks very nice, but isn't working for me, gs2geojson. It adds a color option for the markers, which is appealing and suggests that it might not be too difficult to handle other columns as well. My javascript skills are poor, so I'm hoping it hasn't been abandoned...or maybe it's time to take on another student researcher who knows more than I do!

The last project I'll mention looks the most appealing to me right now: sheetsee.js, maintained by Jessica Lord, a software engineer at GitHub. It can read your sheet and grab all of the columns. The demo shows them being used in a pop-up upon hovering over the point. It also relies on tabletop, which is what actually reads the sheet and returns it as a simple array of JSON objects, so add that to the list.

The Future

Ultimately I may need to do some significant manipulation of some of the data in the sheet, so I think I'm going to talk to a few people who know more than I do about this to find out what they do, and I'll also delve a little more deeply into some of these other methods. At the very least, I'll learn more about what's out there and improve my coding skills.

19 August 2015

Archiving BMCR

Just a few days after my last post on archiving to fight link rot, Ryan Baumann (@ryanfb) wrote up his impressive efforts to make sure that all the links in the recently announced AWOL Index were archived. Since I was thinking about this sort of thing for the open-access Bryn Mawr Classical Review for which I'm on the editorial board, I figured I'd just use his scripts to make sure all the BMCR reviews were on the Internet Archive. (Thanks, Ryan!)

Getting all the URLs was fairly simple, though there was a little bit of brute-force work involved for the earlier years, before BMCR settled on a standard URL format. Actually there are still a few PDFs from scans of the old print versions which I completely missed on the first pass, but once I found out they were out there, it was easy enough to go get them. (I was looking for "html" as a way of pulling all the reviews, so the ".pdf" files got skipped.)

In the end, fewer than 10% of the 10,000+ reviews weren't already on the Archive; they all are now, assuming I got them up there successfully. Let me know if you find one I missed.
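
In case you want to do something similar, the core of the job is a simple loop. Here's a sketch of the idea only (Ryan's actual scripts, linked from his post, are more robust); it assumes a file urls.txt with one review URL per line and uses the Wayback Machine's availability API and save endpoint:

    #!/bin/bash
    # Check each URL against the Wayback Machine and request a snapshot if there isn't one.
    while read -r url; do
      archived=$(curl -s "https://archive.org/wayback/available?url=${url}" \
                 | jq -r '.archived_snapshots.closest.url // empty')
      if [ -z "$archived" ]; then
        echo "Archiving $url"
        curl -s -o /dev/null "https://web.archive.org/save/${url}"
        sleep 5   # be polite to the Archive
      fi
    done < urls.txt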

I'm still looking at WebCite too.

14 August 2015

Fight link rot!

Argh, it just happened to me again. I clicked on a link on a webpage only to find that the page on the other end of the link was gone. 404. A victim of link rot.

This kind of thing is more than a hassle. It's a threat to the way we work as scholars, where citing one's sources and evidence is at the heart of what we do. Ideally links would never go away, but the reality is that they do. How often? A study cited in a 2013 NYTimes article found that 49% of links in Supreme Court decisions were gone. The problem has gotten big enough that The New Yorker had a piece on it back in January.

What I wanted to do here is point out that there are ways for us scholars to fight link rot, mostly thanks to the good work of others (and isn't that the whole point of the Internet?). Back in that ideal world, publishers would take care that their links never died, even if they went out of business, but we users can help them out by making sure that their work gets archived when we use it. Instead of simply linking to a page, either link to an archived version of it or archive it when you cite it, so that others can go find it later.

I've used two archiving services, Archive.org and WebCite. Both services respect the policies of sites with regard to saving copies (i.e., via their robots.txt files), but Archive.org will keep re-checking those policies, so it's possible that a page you archived will later disappear. That won't happen on WebCite. WebCite will also archive a few links deep from the page you ask it to archive, while Archive.org just does that one page.

WebCite is certainly more targeted to the scholarly community, and their links are designed to be used in place of the originals in your work. But both of them are way better than nothing, and you'll find lots of sites using them. For convenience there are bookmarklets for each that you can put in your browser bar for quick archiving (WebCite, Archive.org).

So next time you cite a page, make sure you archive it. Maybe even use WebCite links in your stuff (like I did in this post on the non-Wikipedia links).

(FYI, another service is provided by Perma.cc, which is designed for the law, covered in this NPR story.)

Added 18 August 2015: Tom Elliott (@paregorios) notes this article on using WebCite in the field of public health (which I link to via its DOI).

20 July 2015

Frequent Latin Vocabulary - sharing the data(base)

When I first started teaching after grad school, I did a lot of elementary-Latin instruction. I felt well prepared for this because I did my graduate work at the University of Michigan, where the Classical Studies Department has a decades-long tradition of paedagogical training. It includes people like Waldo "Wally" Sweet, Gerda Seligson, Glenn Knudsvig, and Deborah Ross. One consequence of this teaching and preparation was that I became very interested myself in Latin paedagogy and my first research output was in this direction.

In particular I started looking at the Latin vocabulary that students were learning and how that related to the vocabulary that they were reading in the texts they encountered in intermediate and upper-level classes. As I investigated this, I learned that there had been a lot of work on exactly this area not only among people studying second-language acquisition, but also in Classics circles back in the 1930s, 40s and 50s. One of the more interesting people in this area, though not someone that many classicists will know, was Paul B. Diederich. Diederich had quite an interesting career, working even at that early date in what is now the trendiest of educational concerns, assessment, mainly in writing and language instruction, and eventually making his way to the Educational Testing Service, ETS, which gave us the SAT.

Diederich's University of Chicago thesis was entitled "The frequency of Latin words and their endings." As the title suggests, it involves determining the frequency of both particular Latin words and endings for both nouns/adjectives/pronouns and verbs. In other words, it was a bit of what would now qualify as Digital Humanities. Diederich of course lacked a corpus of computerized texts, so he had to do this counting by hand: he made copies of the pages of major collections of Latin works, using different colors for different genres (big genres, like poetry and prose), and then cut these sheets of paper up so that each piece contained one word. Then he counted up the words (over 200,000!) and calculated the frequencies. The biggest challenge he faced was the way his method completely destroyed the context of the individual words; once the individual words were isolated, it was impossible to know where they came from. One result of this was acknowledged by Diederich in the thesis: not all Latin words are unique. For example the word cum is both a preposition meaning "with" and a subordinating conjunction "when/after/because." This meant that Diederich needed either to combine counts for these words (which he did for cum), or label such ambiguities before cutting up the paper. As he himself admits, he did a fairly good job of the latter, but didn't quite get them all. Another decision that had to be made was what to do with periphrases, that is, constructions that consist of more than one word. Think of the many English verb forms that fall into this category: did go, will go, have gone, had gone, am going, etc. Do you want to count "did go" as one word or two?

Interesting to me was that Diederich was careful to separate words that the Romans normally wrote together. These usually short words, called enclitics, were appended in Latin to the preceding words, a bit like the "not" in "cannot" (which I realize not everyone writes as one word these days). This was a good choice on Diederich's part, as one of these words, -que meaning "and," was the most frequent word in his sample. (As a side note, some modern word-counting tools, like the very handy vocab tool in Perseus, do not count enclitics at all. Such modern tools also can't disambiguate like Diederich could, so you'll see high counts for words like edo, meaning "eat," since it shares forms with the very common word esse, "to be." Basically we're trading disambiguation for automation and its incredible gains in speed.)

The article I eventually published (“Frequent Vocabulary in Latin Instruction,” The Classical World 97, no. 4 (2004): 409-433) involved me using the computer to create a database of Latin vocabulary and then counting frequencies for a number of textbooks, comparing them to another set of frequent-vocabulary lists. I put some of the results of this work up on the internet (here, for example), but didn't do a lot of sharing of the database itself. This wasn't so easy way back in the early 'aughts, but it is now. Hence this post (which is a great example of burying the lede, I suppose).

I created the database in FileMaker Pro version 3. Then migrated to version 6, then 8, and now 12. (Haven't made the jump to 13 yet.) Doing this work in a tool like FMP has its pros and cons—and was the subject of some debate at our LAWDI meetings a few years ago. Big on the pro side is the ease of use of FMP and the overall power of the relational-database model. On the con side is the difficulty in getting the data back out so that it can be worked on with other tools that can't really do the relational thing. For me FMP also allowed the creation of some very nice handouts for my classes, and powerful searches once I got the data into it. In the end though, if I'm going to share some of this work, it should be in a more durable and easily usable form, and put someplace where people can easily get to it and I won't have to worry too much about it. I decided on a series of flat text files for the format, and GitHub for the location. I'm going to quote the README file from the repository for a little taste of what the conversion was like:

Getting the data out of FMP and into this flat format required a few steps. First was updating the files. I already had FMP3 versions along with the FMP6 versions that I had done most of the work in. (That's .fp3 and .fp5 for the file extensions.) Sadly FMP12, which is what I'm now using, doesn't directly read the .fp5 format at all, and FMP6 is a Classic app, which OS X 10.9 (Mavericks) can't run directly. So here's what I did:
  • Create a virtual OS X 10.6 (Snow Leopard) box on my Mavericks system. Snow Leopard was the last OS X version able to run Rosetta, Apple's compatibility layer for older PowerPC software. That took a little doing, since I updated the various system pieces as needed. Not that this version of OS X can be super secure, but I just wanted it as up to date as possible.
  • Convert the old .fp5 files to .fp7 with FMP 8 (I keep a few versions of FMP around).
  • Archive the old .fp5 files as a zip file.
  • Switch back to Mavericks.
  • Archive the old .fp7 files. I realized that the conversion process forced me to rename the originals, and zipping left them there, so I could skip the step of restoring the old filenames.
  • Convert the .fp7 to .fmp12.
  • Export the FMP files as text. I'm using UTF-16 for this, because the database uses diaereses for long vowels (äëïöü). Since this is going from relational to flat files, I had to decide which data to include in the exports.
  • Convert the diaereses to macrons (āēīōū). I did this using BBEdit. (There's a command-line sketch of this substitution just after this list.)
  • Import the new stems with macrons back into FMP. I did it this way because search and replace on BBEdit is faster than in FMP.
  • Put the text files [on Github].
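
Since I mentioned the macron substitution: here is roughly what a command-line equivalent of that BBEdit step would look like, assuming you first convert the UTF-16 export to UTF-8 (the file names are made up, and this handles only the lowercase vowels):

    # Convert the UTF-16 export to UTF-8, then swap diaereses for macrons.
    iconv -f UTF-16 -t UTF-8 stems-export.tab \
      | sed 's/ä/ā/g; s/ë/ē/g; s/ï/ī/g; s/ö/ō/g; s/ü/ū/g' > stems.tab
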
FMP makes the export process very easy. The harder part was deciding which information to include in which export. An advantage of the relational database is that you can keep a minimal amount of information in each file and combine it via relations with the information in other files. In this case, for example, the lists of vocabulary didn't have to contain all the vocabulary items within their files, but simply a link to those items. For exports though you'd want those items to be there for each list. Otherwise you end up doing a lot of cross-referencing. It's this kind of extra work, which admittedly can be difficult, especially when you have a complicated database that you designed a while back, that makes some avoid FMP (and other relational databases) from the start.

In the end though, I think I was successful. I created three new text files, which reflect the three files of the relational database:
  1. Vocabulary is in vocab.tab. These are like dictionary entries.
  2. Stems, smaller portions of vocab items, are in stems.tab. The vocab items list applicable stems, an example of something that was handled relationally in the original.
  3. The various sources for the vocabulary items are in readings.tab. It lists, for example, Diederich's list of 300 high-frequency items.
I also included the unique IDs that each item had in each database, so it would be possible to put them back together again, if you wanted (though you could just ask me for the files too). See the README and the files themselves for more detail. I feel pretty good though about my decision to use FMP. It was—and is, even if I'm not teaching a lot of Latin these days—a great tool to do this project in, and getting the data back out was fairly straightforward. 

You can check out the entire set of files at my GitHub repository. And here's a little article by Diederich on his career. He was really an interesting guy, a classicist working on what became some very important things in American higher education.

15 January 2015

The Humanities Open Book Program (@NEH_ODH)

The NEH and Mellon Foundation just announced a new project today, under the broader auspices of the NEH's The Common Good: The Humanities in the Public Square project. It's called the Humanities Open Book Program and will provide funds so that organizations can "digitize [out-of-print scholarly] books and make them available as Creative Commons-licensed 'ebooks' that can be read by the public at no charge on computers, mobile devices, and ebook readers."

This is great. There are lots of books that fall into this category and would see a lot more use if they were available digitally. I'd love them for myself, but I can also imagine assigning them (or parts of them) more frequently to my students. Also great is that the program insists that the books be released in the EPUB format, which is open, looks good on lots of readers, and makes it fairly easy to get the text out.

Regarding that last point, there's a potential limitation that I hope we don't actually see too much of: the program doesn't specifically require that the work be re-usable. Instead what's required is a CC license. Any CC license. That means that in reality there's no guarantee that it will be possible to reuse the work (apart from the usual fair-use ways). I tweeted this question and @NEH_ODH replied quickly (love those guys):
So let's hope that lots of publishers do make the choice to allow such re-use. I'm worried about it in part because we know what publishers can be like. On the other hand, the program explicitly solicits applications from more than just presses: "scholarly societies, museums, and other institutions that publish books in the humanities," and these groups might be a little more inclined to use a more permissive license. A little outside pressure might not hurt either.

(If @NEH_ODH would like to comment, I'd be curious to know why they didn't impose a more open licensing requirement. Worried that publishers might not respond so openly?)

03 November 2013

Teaching with ORBIS #lawdi

On the last day of our Linked Ancient-World Data Institute this summer, sponsored by the NEH Office of Digital Humanities, I argued that it was important to show how all this exciting work could have practical implications for what professional classicists (and other ancient-world types) spend a lot of time on, teaching. To that end, I promised to do a post describing how I had assigned my Classical-archaeology students some short homework using Stanford's great new tool, ORBIS. What's ORBIS? In the words of the site, ORBIS "reconstructs the time cost and financial expense associated with a wide range of different types of travel in antiquity." More simply it allows you to map routes between two places in the Roman world given certain constraints for cost, time, and type of route.

Naturally the fine people at ORBIS did a nice upgrade to the service after I made that promise, but before I got it completed, so I had some more work to do before this post. (Fair enough, I dragged my feet for too long anyway. Nemesis strikes!) But finally here it is, suitable for framing (or at least bookmarking).

A Very Short Guide to Using ORBIS in your Classical-Archaeology Course

1. RTFM

Make sure that you understand as much as possible about ORBIS, what it is, how it works, and so on. You don't need to be an expert, but you should at least be able to do more than your students will by the time they finish the assignment. It won't take more than an hour to read the "Introduction to ORBIS", "Understanding ORBIS" and "Using ORBIS" tabs on the website. Don't miss the nifty how-to videos. Although there's more there to read, these three sections will get you far enough for step 2.

2. Make sure you know how to use it

Play around a bit yourself on the "Mapping ORBIS" section. Try to get from one place to another. Change the various parameters. Use all the controls, so you know how to change the views of the route and so on. Click the buttons and links and sliders. Go nuts. Depending on your technological prowess, this will take you a few hours at most.

3. Demo it in class

Once you're confident that you can show your students the basics of the site, have your students read those same three sections of the ORBIS website that you read in #1, in preparation for a short demo of ORBIS that you'll do in class for them. Nothing fancy, just enough to show them the basics. I like to point out to them how long travel takes when you don't have motorized vehicles and how much faster travel over sea is than over land, but be sure to walk them through creating a route and choosing the various options, no matter what extra details you cover.

4. Assignment 1 of 2

Have your students use ORBIS to find a simple route between two places that you specify. Have them do it under multiple conditions. (I used three different sets.) Then have them either print out or take a screen shot of the result, with all routes shown. Here's one with routes between three sets of cities, one taking the fastest route, another the cheapest, and the third the shortest.
Since you've set the parameters, you'll know what the correct routes should be, and thanks to the different colors, it's easy to tell at a glance whether the student got it right. Successful completion of this will indicate that your students can handle using the basics of ORBIS. Make sure they all successfully complete this first assignment. Then they're ready for part 2.

5. Assignment 2 of 2

This part is up to you. Depending on which section of your course you're using ORBIS in, you'll want to find some question you can answer, or some issue you can illuminate via ORBIS. I actually did something that was completely out of ORBIS' chronological span.

To help my students understand the rationale behind some of the placement of early Greek colonies in the west, I had them examine routes between Delphi and Naples. The latter was used as a proxy for the earliest colony of Pithecoussae. (This was actually a variation on an assignment I had made up years ago using a QuickTime movie with links to the Perseus website.) The biggest travel difference between the later Roman empire, the time in which ORBIS is "located", and the geometric period was the roads in use. Obviously none of the vast Roman road system was in place, and so I made sure the students used only routes that avoided long portions over land. (I want to use this assignment again, so I'm not giving away all the details!)

6. Put the students to work

If you subsequently have your students come up with their own ORBIS projects, odds are they'll find something useful and interesting to do, perhaps something you hadn't quite thought of. A set of mine, for example, used ORBIS to explore the different travel experiences of three characters from the ancient world with differing socio-economic backgrounds, complete with clever backstories!


And there it is. Hope this encourages you to use ORBIS and other terrific ancient-world-related DH tools in the classroom! I'd love to hear in the notes about your experiences with ORBIS or with any other tool.

12 February 2012

Under God

Back in January, Kevin Kruse penned an op-ed for the Gray Lady on the 20th-century rise in American usage of "under God" with various prefixed words. It's an idea I'm generally sympathetic to (i.e., that many things that pass under the mantle of tradition aren't really traditional and in particular this emphasis on America as a Godly nation), but is the language part true? Kruse has written a book on the general subject, so I'm not going to try to tackle the whole thing, but what about some of these factual claims on usage? I think there might be some data the uninitiated among us could look at.

What exactly does Kruse claim in the NYT piece? First:
After Lincoln, however, the phrase ["this nation, under God"] disappeared from political discourse for decades. But it re-emerged in the mid-20th century...
and this:
Throughout the 1930s and ’40s, Mr. Fifield and his allies advanced a new blend of conservative religion, economics and politics that one observer aptly anointed “Christian libertarianism.” Mr. Fifield distilled his ideology into a simple but powerful phrase — “freedom under God.”
then this:
Indeed, in 1953, President Dwight D. Eisenhower presided over the first presidential prayer breakfast on a “government under God” theme
and finally:
In 1954, as this “under-God consciousness” swept the nation
So where could we go to check up on all this? Google n-grams! This is a nice little service from the king of data mining himself (corporately speaking) that lets you look for phrases in book corpora. This won't get us everything, but it will get us a lot. So what is there?
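
(If you want to poke at these searches yourself, the viewer puts the phrase and date range right in the URL, something like the line below, plus a corpus parameter to restrict the search to American English; the accepted corpus values have changed over the years, so check what the viewer itself generates.)

    https://books.google.com/ngrams/graph?content=government+under+God&year_start=1800&year_end=2000&smoothing=3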

First let's look at Ike's phrase of "government under God." Not one I was familiar with, but I wasn't around in 1953 either, so perhaps no surprise. I'm going to look in the corpus of American (not British) books for the entire period Google makes available. Here's the result at right.

Hmmm. Those early years give large results, but since there's nothing in between that and the later occurrences, I'm going to cut those years out of the next one. Now what do we have? Well, clearly a rise around 1953, as Kruse suggests, but the rise actually begins a bit earlier (which you can see by zooming in on the n-gram), and not just in spurious hits that this service sometimes finds, like "government under God's..." (because it pays attention to capitalization, but not punctuation, for some reason): there's even a book called, yep, Government Under God. So, sure, that's a phrase we might reasonably associate with Ike's usage, but it sure wasn't original to him.

How about "freedom under God," another phrase tied with a particular person? The first search shows nothing before the late 19th century, so let's zoom in on the period after 1850. Now there's a clear rise in 1941, following some decades of fairly minimal usage. Here too there's a book entitled with the phrase, this time by Bishop Sheen. Since books might take a little time to write and publish, this suggests increasing popularity, starting in the 1930s, consistent with what Kruse writes.

And that most famous of phrases, Lincoln's "this nation, under God"? I've taken out the smoothing in this plot to show the spikiness of the trend. The first peak back in the early 19th century is actually not a good one; it's got an apostrophe following "God," so all we have in the 19th century is the Gettysburg Address. Clicking through the later references, they too are mainly quotations of Lincoln. And once again there's a book by that title, this time from 1923. Overall though there isn't really an obvious increase in references to the speech in the early to middle 20th century. In fact if anything references die off a bit in the 1930s and 1940s. The n-grams therefore don't support Kruse's argument at all.

Finally what about the simplest, unprefixed form of the phrase: "under God"? The first run shows that it's a fairly popular phrase, on its way out in the 1900s after a peak in the middle of the previous century. At the same time there is definitely a little hump there, so let's zoom in (right). In the plot we can clearly see a resurgence in the phrase around 1940, though the increase seems smaller than the underlying trend. That's the sort of thing Kruse talks about, but it doesn't seem very impressive. Naturally there are a bunch of books with this phrase in their title, especially after the change to the Pledge of Allegiance, which seems more the consequence of what Kruse discusses.

In conclusion? The n-grams don't seem to offer much support for Kruse's contention. In particular the simple phrase "under God"—which dwarfs the more complex phrases in frequency—shows only a minor increase in usage after 1940, and a lot of that has to be attributed to the Pledge in 1954. But Google's just looking at books. How about the NYTimes itself?

Well, "government under God" gets us one hit, in 1957, which suggests this wasn't a big success. The Pledge's "one nation under God" first appears in an article about the Pledge itself, also no help for the theory. "Freedom under God" does seem to appear numerous times in the 1940s, including not a few by Catholic leaders. In fact Sheen's book, already mentioned above, seems to be the first reference in the newspaper. This seems consistent with the n-gram results in which the phrase appears in several Catholic sources from the 1920s.

As for the simple "under God," that appears a whole bunch of times even in the earliest years available in the search engine (1851-). For example, here's President Harding in his proclamation of Thanksgiving Day, 1921, which was subsequently quoted by the American ambassador to England, George Harvey:
Under God, our responsibility is great; to our own first; to all men afterward, to all mankind in God's justice.
So where are we then? Well, this quick look at easily available data hasn't really allowed us to look closely at the context for a lot of the usage, though it does seem clear that

  • "under God" was a phrase Americans would have been familiar with and 
  • it enjoyed a bit of a resurgence in the 1940s and after, though did not see even a doubling in usage
  • Lincoln's phrase was pretty much his invention and was (and is?) quoted, but hardly re-used
  • "Government under God" went nowhere
  • "Freedom under God" seems to have been popular in Catholic circles from the 1920s on

In conclusion, it's not unfair to say that the 20th century was a popular time for connecting God with the nation in a linguistic way, but it's not clear from what we've looked at here that the resurgence didn't start a bit earlier than Kruse suggests, or that it didn't ride on top of a strong existing substratum of "under God"-liness. It also might have its origins not in the Protestant Fifield, but among Catholics (who may also have been allies of Fifield), something that might not be too surprising in today's political climate.

01 February 2012

AIA Comes out in Favor of the Research Works Act

In the middle of the holiday break, our own AIA, the Archaeological Institute of America, submitted to the  Office of Science and Technology Policy of the US government their statement on the recently proposed Research Works Act, which is in essence an attack on the growing Open Access movement. (Follow the link to Thomas.gov and check out the Wikipedia entry too.)

Leaving aside the apparent absence of this document from the AIA's own website (site search engines can be remarkably crappy when it comes to this kind of thing), why didn't they think this would be worth letting me, a member in good standing, know about? Especially now that the AAA's response—to which the AIA explicitly refers in their document—has raised a ruckus in that group!

But more importantly, where's the membership on this? Are we in favor of this stance? I'm certainly not. Anyone else?

Many of us in the profession are advocates for Open Access (a term which the AIA doesn't even seem to understand, to judge from their response), and I suspect would have a thing or two to say about the stance of our professional organization. Others have made the case already, so I won't re-argue it here, but I encourage you to read some of them (by, e.g., Kristina Killgrove or Derek Lowe).

What's most galling, though, is that this statement was made by the AIA literally days before our annual meeting, when it would have been a trivial matter to bring up the subject in official venues and get some important feedback. I wasn't there (off in Rome with students), but I haven't had any official word of anything. And given the decades-long prominence of some of our members in what's now known as the Digital Humanities, this is profoundly disappointing.

I certainly hope that others in the AIA feel the same way about this, and I'm fixing to find out who they are!

Correction: After some discussion with a few others, including Sebastian Heath, I have to correct myself. The AIA's letter was not in response to the RWA per se, but, as I wrote, to the RFI from the OSTP. The issues are the same, in that the RWA addresses the question of mandating Open Access to publications dealing with federally funded research, which is what the AIA statement dealt with (along with some other things I disagree with). I'll deal with this more in another post, but I wanted to get a correction in right away and apologize for the error.

11 December 2011

Dude, where's my diss? ''

Part III, in which I produce the document

In the second installment of this multi-post topic, I ended wondering where I might store copies of my dissertation for public download. For some reason it hadn't occurred to me to use the Box account that I have. (Box is like DropBox.) Since I have now figured this out, I present below two pdf versions. The first links to a copy of the UMI version, which is, as I wrote before, essentially a photocopy of the original paper version I submitted to them back in 1998. The second pdf is a searchable version I recently created. That wasn't as easy as it sounds.

It's true that it's a trivial matter to create a pdf these days. On my Mac, I can just print directly to pdf, and this was my first approach. The problem is that the latest version of Microsoft Word renders the text slightly differently from the way version 5.1a (of blessed memory) did it. As a result the page numbering got way off. (I will refrain from the obvious rant about the problems this version issue causes.) The first remedy I tried was tweaking the margins a bit, thinking that the fonts (mainly Times) were being rendered at a consistently different width. No dice. In some cases lines were longer, in others shorter. I haven't a clue why. OK, I think, so I'll just fire up version 5.1a. Well, that requires at least Classic, which doesn't run on Intel Macs anymore. No problem, SheepShaver emulates such a machine, even on my nifty new MacBook Pro. First new problem: OS 9 doesn't allow such easy printing to pdf. Solved with PrintToPDF, which creates a virtual Chooser (remember that?) printer that really sends output to a pdf file. Great. The second problem wasn't new nor was it so easily solved.

Word 5.1a does a better job than the 2011 version at reproducing the layout of my original document, but not a perfect one. For some reason it was just not matching up and once again it wasn't a simple matter of adjusting margins. So here's what I did. I figured that the smallest unit of text I had to worry about was the page, and many of them were the same, that is, they started and ended on the same word as my original dissertation printout. Some of the intervening lines look different, but since no one was going to be citing my dissertation that way, it would be OK. Where the pages didn't line up, I went in and inserted extra spaces to force line breaks, with the occasional tweak to margins, mainly in indented quotations. That got the pages right, and let my virtual 1998 Mac create a searchable pdf.

The only remaining problem with the pdf is that the text in ancient Greek is not real text. Back in the 90s we still weren't using Unicode everywhere, so the ancient Greek is really just regular Latin character codes shown in a font that uses Greek glyphs instead of Latin ones. (In reality lots of the accented Greek characters are punctuation of some kind.) The pdf displays the font fine, but it really isn't Greek text that you can copy or search for.

Here are the links. Again, they lead to my Box account, which I haven't upgraded to allow direct downloads, so you'll have to do something else to get the pdf itself:


What I'd really like to do is make the dissertation available as an e-book of some kind. The problem remains the Greek and the page breaks. The Greek isn't a big deal, even if I had to type it all out again (which I don't); there's not a lot of it. Also it's not difficult to turn a pdf into one of the popular e-book formats, but my footnotes mean I can't do that without some work. Ideally I'd start from the Word doc, so a little research is needed to see what the options are.

Meanwhile...where's your diss?

16 October 2011

Dude, where's my diss? '

Part II, in which I wonder about my copyright

As noted last time, I clearly assert my ownership of copyright on the title page of my dissertation. However UMI also asserts a copyright:
UMI Microform 9840610 Copyright 1998, by UMI Company, All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
Honestly I don't know what this means. Sounds like they might claim a copyright on the particular microform instantiation of my dissertation, that is, the microfilm, though given the appearance of this text on a pdf, I suspect they may also be claiming a copyright on that particular pdf as well.
Let's see what their website reveals. Off to the support pages and I find this as the second item on a search for "copyright":
No, you do not have to copyright your work unless your school requires you to do so.
Well, mine did, so that seems to rule me out. Since I don't see any other obvious choices, I guess I'll e-mail support and see what they say. Here's my question:
My dissertation says that I have the copyright, per my university's instructions. The UMI version says that UMI claims a copyright as well, though you also recognize mine. What exactly are the rights that I retain and what are the ones that you hold?
Thanks. 
Two days later (which is actually the first business day after), their reply in its entirety:
You are the copyright owner of your dissertation not us
So that seems good, but I remain a bit suspicious, given their fairly clear claim, so I ask back:
Does that mean I can freely distribute the pdf you made of my dissertation?
This time the reply is:
You would need to call the copyright office to ask them
So I do. The answer from them is that I am free to do what I want with the pdf of the dissertation; that they merely store my work. Awesome.

Next up, where to put the diss for long-term availability.

28 September 2011

NEH ODH Project Directors Meeting

Spent the day yesterday with Tom Elliott and a slew of other digital humanists(?) at the NEH Office of Digital Humanities' day-long meeting for project directors. We were talking about our Advanced Institute on linked open data that's a joint project between Drew University and NYU's ISAW. Sebastian Heath is the third co-PI, notable by his absence.

Loads of fun, and nice to meet a bunch of people IRL whom I know from their work on-line, or Twitter, or some other on-line place.

Twitter hashtag #SUG2011.
Inside Higher Ed's story is here.

24 September 2011

Dude, where's my diss?

Part I, in which I search for my diss on-line


I've been reading my nice free e-copy of Hacking the Academy on my train ride to work lately. Among other things, it's gotten me to thinking about my own scholarship and the way in which it has been shared and sequestered. In this context the most prominent thing in my mind is my dissertation, a longish bit of writing that I spent several years in Ann Arbor working on. I have some vague memories of a title page with copyright language on it, but Hacking has prompted me to think harder about that...which led me to think a bit about the modern academic book.
Let me be clear up front: I don't have a book. I got tenure at an institution that—at the time—didn't require one, and my several articles and teaching and service were enough to get me the coveted title of "Associate Professor." Since my own tastes run more Callimachean than Homeric, that worked out well for me at the time.
But I do have some problems with the book. In a nutshell I think it's part of a mostly bankrupt system that has young scholars taking perfectly good pieces of academic writing, on which they spent years of hard work, and essentially saying these things were of such little value that they need to be worked over and turned into something else, something an academic press can sell (imagine!)...not that the young authors will see any direct gain from these sales.
There are too many examples of the warmed-over dissertation-cum-book for me to need to cite them. Any academic can surely name more than a few without much effort. Indeed there are entire series that are composed of such works (I'm looking at you, Oxford). Add to these the lightly revised articles bound together into a "new" book, the Festschrifts, the conference proceedings, and you've got a whole industry that revolves around books that aren't needed, at least not in that format. In my own research too, I've found precious few books that were influential on me. Instead I can easily point to well crafted articles that made their forceful points in fewer than 100 pages. I could add a bunch more reasons that depend on changes in technology, the history of books, and an improved functioning of academic publishing, but better to read that book I mentioned at the top.
So how does this fit with the topic at hand, my dissertation? Well, I've long been bothered by the way in which the many disses that are produced each year are more or less ignored, only to have the books that are based on them get all the attention, limited as even that may be in the end.
Were the disses that poor? I hope not, because that suggests that maybe we shouldn't have been given those Ph.D.s. Are the books that much better than the disses? Mostly, no, I'd say. So what's the deal? I think it's that we just like books. And by "we" I mean the whole industry of academe. (Again, Read the book!)
One roadblock to the further reading and use of dissertations, at least in the Humanities, is the difficulty in finding them. They don't get promoted by universities, unlike books by presses, and their authors are often looking to turn them into books, so they'd rather not see their own disses widely read. In fact it's an interesting profession that encourages its practitioners to ignore their first major piece of output.
But what about me? Since I didn't make mine into a book and I've just about given up on that movie deal, I'm happy to have more people read my own dissertation, so where is it? I figured I'd take a little time to try to find out. (And before I start, let me come more clean and say that I have written a few articles and given a few conference papers that are based on the work in my diss.)
I start by pulling out the diss (OK, the Word files) and find that title page. After convincing Word 2011 that it was OK to open such an old file, I see this little notice: "© John D. Muccigrosso All Rights Reserved 1998." Here's what it looks like on the page:
Very pretty, I think, and IIRC, all according to the then-prevailing stylebook. As for the sharing of said dissertation, that also seems good and right and fair: I own the copyright.
Next...what about that whole UMI thing? Academics will know UMI from experience: some vague entity that gets a copy of every dissertation made in the US. I'll try the obvious URL and http://UMI.com/ sends me right to the ProQuest Microfilm vault, where surely my acid-free papers from back in the last century sit protected for the upcoming millennia. But what if someone wanted to read those pages now? I click to find out more info on the UMI project and get taken to the somewhat pristine ProQuest Support Center, where there's a link to Dissertation Products. Not quite the phrase I'd use, but there you are. Clicking that opens up several new options, which are not really helping me out. Let me try the FAQ.
Ah, there it is: How do I order a dissertation online? Seems they want me to log in, but I'll just try going through my library's proxy...bingo, I'm in! Now a search for my whole name (you might think my last name is unique enough, but it turns out there's an unrelated, fairly prolific US historian with the same one, also in the wider NYC area). Yep, that'll do...and the diss is fairly high up in the list. (Unfortunately they're pumping the pdf through Flash, but most users probably have that beast installed, and there is an option to get the pdf directly.) The pdf itself is just a scan of the physical pages of my diss. No surprise, since what I sent them back in the last century was a copy printed on nice archival paper. Here's the only text that seems to be embedded in the pdf:
Factional competition and monumental construction in mid-Republican Rome
Muccigrosso, John D
ProQuest Dissertations and Theses; 1998; ProQuest Dissertations & Theses: The Humanities and Social Sciences Collection
pg. n/a
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
That's the basic metadata (title, author, date, etc.) along with ProQuest's info, and a copyright notice, all of which is presumably fairly nice for search engines, if they can ever get a look at the file behind the login.
So from start to finish that was a quick 10 minutes or so to come up with my dissertation. Not bad for the academic user who knows about me, UMI and has an institutional (or private) subscription to ProQuest's dissertation service, all of which leaves me with a few questions...which I'll start on in my next post.