20 July 2015

Frequent Latin Vocabulary - sharing the data(base)

When I first started teaching after grad school, I did a lot of elementary-Latin instruction. I felt well prepared for this because I did my graduate work at the University of Michigan, where the Classical Studies Department has a decades-long tradition of paedagogical training, one that includes people like Waldo "Wally" Sweet, Gerda Seligson, Glenn Knudsvig, and Deborah Ross. One consequence of this teaching and preparation was that I became very interested in Latin paedagogy myself, and my first research output was in this direction.

In particular I started looking at the Latin vocabulary that students were learning and how it related to the vocabulary they were reading in the texts they encountered in intermediate and upper-level classes. As I investigated, I learned that there had been a lot of work on exactly this area, not only among people studying second-language acquisition, but also in Classics circles back in the 1930s, 40s and 50s. One of the more interesting people in this area is someone few classicists will know: Paul B. Diederich. Diederich had quite an interesting career, working even at that early date in what is now the trendiest of educational concerns, assessment, mainly in writing and language instruction, and eventually making his way to the Educational Testing Service (ETS), which gave us the SAT.

Diederich's University of Chicago thesis was entitled "The frequency of Latin words and their endings." As the title suggests, it involves determining the frequency both of particular Latin words and of the endings of nouns/adjectives/pronouns and verbs—in other words, a bit of what would now qualify as Digital Humanities. Diederich of course lacked a corpus of computerized texts, so he had to do this counting by hand. He made copies of the pages of major collections of Latin works, using different colors for different genres (big genres, like poetry and prose), and then cut these sheets of paper up so that each piece contained one word. Then he counted up the words (over 200,000!) and calculated the frequencies. The biggest challenge he faced was that his method completely destroyed the context of the individual words; once the words were isolated, it was impossible to know where they came from. One consequence, acknowledged by Diederich in the thesis, is that not all Latin words are unique. For example, the word cum is both a preposition meaning "with" and a subordinating conjunction meaning "when/after/because." This meant that Diederich needed either to combine counts for such words (which he did for cum) or to label the ambiguities before cutting up the paper. As he himself admits, he did a fairly good job of the latter, but didn't quite catch them all. Another decision that had to be made was what to do with periphrases, that is, constructions that consist of more than one word. Think of the many English verb forms that fall into this category: did go, will go, have gone, had gone, am going, etc. Do you want to count "did go" as one word or two?
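Diederich's hand count is, in essence, what a few lines of Python now do in milliseconds. A minimal sketch, using a toy line of Vergil as a stand-in corpus (this reproduces neither his actual method nor his data):

```python
from collections import Counter

# A toy "corpus" standing in for Diederich's 200,000-word sample.
text = "arma virumque cano troiae qui primus ab oris cano cano"

# Lowercase and split on whitespace -- a crude tokenizer. This is
# exactly the trade-off discussed above: no human eye is resolving
# ambiguities like cum-the-preposition vs. cum-the-conjunction.
tokens = text.lower().split()
freqs = Counter(tokens)

for word, count in freqs.most_common(3):
    print(word, count)
```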

Interesting to me was that Diederich was careful to separate words that the Romans normally wrote together. These usually short words, called enclitics, were appended in Latin to the preceding words, a bit like the "not" in "cannot" (which I realize not everyone writes as one word these days). This was a good choice on Diederich's part, as one of these words, -que, meaning "and," was the most frequent word in his sample. (As a side note, some modern word-counting tools, like the very handy vocab tool in Perseus, do not count enclitics at all. Such modern tools also can't disambiguate like Diederich could, so you'll see high counts for words like edo, meaning "eat," since it shares forms with the very common word esse, "to be." Basically we're trading disambiguation for automation and its incredible gains in speed.)
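Separating enclitics before counting can be sketched like this. The suffix check here is deliberately naive and purely illustrative: a real tool would need a lexicon to avoid mangling words like quisque that merely happen to end in these letters.

```python
ENCLITICS = ("que", "ve", "ne")

def split_enclitics(token):
    """Naively peel a trailing enclitic off a token.

    Illustrative only: without a lexicon this over-splits words
    (e.g. 'quisque') that simply end in an enclitic's letters.
    """
    for enc in ENCLITICS:
        if token.endswith(enc) and len(token) > len(enc) + 1:
            return [token[: -len(enc)], "-" + enc]
    return [token]

tokens = []
for t in "arma virumque cano".split():
    tokens.extend(split_enclitics(t))

print(tokens)  # ['arma', 'virum', '-que', 'cano']
```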

The article I eventually published (“Frequent Vocabulary in Latin Instruction,” The Classical World 97, no. 4 (2004): 409-433) involved me using the computer to create a database of Latin vocabulary and then counting frequencies for a number of textbooks, comparing them to another set of frequent-vocabulary lists. I put some of the results of this work up on the internet (here, for example), but didn't do a lot of sharing of the database itself. This wasn't so easy way back in the early 'aughts, but it is now. Hence this post (which is a great example of burying the lede, I suppose).

I created the database in FileMaker Pro version 3, then migrated to version 6, then 8, and now 12. (I haven't made the jump to 13 yet.) Doing this work in a tool like FMP has its pros and cons—and was the subject of some debate at our LAWDI meetings a few years ago. Big on the pro side is the ease of use of FMP and the overall power of the relational-database model. On the con side is the difficulty in getting the data back out so that it can be worked on with other tools that can't really do the relational thing. For me FMP also allowed the creation of some very nice handouts for my classes, and powerful searches once I got the data into it. In the end though, if I'm going to share some of this work, it should be in a more durable and easily usable form, and put someplace where people can easily get to it and I won't have to worry too much about it. I decided on a series of flat text files for the format, and GitHub for the location. I'm going to quote the README file from the repository for a little taste of what the conversion was like:

Getting the data out of FMP and into this flat format required a few steps. First was updating the files. I already had FMP3 versions along with the FMP6 versions that I had done most of the work in. (That's .fp3 and .fp5 for the file extensions.) Sadly FMP12, which is what I'm now using, doesn't read the .fp5 format at all, and FMP6 is a PowerPC app, which OS X 10.9 (Mavericks) can't run directly. So here's what I did:
  • Create a virtual OS X 10.6 (Snow Leopard) machine on my Mavericks system. Snow Leopard was the last OS X version able to run Rosetta, Apple's translation layer for PowerPC apps. That took a little doing, since I updated the various system pieces as needed. Not that this version of OS X can be made super secure, but I just wanted it as up to date as possible.
  • Convert the old .fp5 files to .fp7 with FMP 8 (I keep a few versions of FMP around).
  • Archive the old .fp5 files as a zip file.
  • Switch back to Mavericks.
  • Archive the old .fp7 files. I realized that the conversion process forced me to rename the originals, and zipping left them there, so I could skip the step of restoring the old filenames.
  • Convert the .fp7 to .fmp12.
  • Export the FMP files as text. I'm using UTF-16 for this, because the database uses diaereses for long vowels (äëïöü). Since this is going from relational to flat files, I had to decide which data to include in the exports.
  • Convert the diaereses to macrons (āēīōū). I did this using BBEdit.
  • Import the new stems with macrons back into FMP. I did it this way because search and replace in BBEdit is faster than in FMP.
  • Put the text files [on Github].
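The macron step in the list above is just a character-for-character substitution. For anyone who would rather script it than use BBEdit, a sketch of the same conversion in Python:

```python
# Map diaereses (äëïöü) to macrons (āēīōū), upper- and lowercase.
DIAERESIS_TO_MACRON = str.maketrans("äëïöüÄËÏÖÜ", "āēīōūĀĒĪŌŪ")

def to_macrons(text):
    """Replace every diaeresis vowel with its macron counterpart."""
    return text.translate(DIAERESIS_TO_MACRON)

print(to_macrons("amö, amäre"))  # amō, amāre
```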
FMP makes the export process very easy. The harder part was deciding which information to include in which export. An advantage of the relational database is that you can keep a minimal amount of information in each file and combine it via relations with the information in other files. In this case, for example, the lists of vocabulary didn't have to contain all the vocabulary items within their files, but simply a link to those items. For exports though you'd want those items to be there for each list. Otherwise you end up doing a lot of cross-referencing. It's this kind of extra work, which admittedly can be difficult, especially when you have a complicated database that you designed a while back, that makes some avoid FMP (and other relational databases) from the start.
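To make that flattening decision concrete, here is a toy sketch of denormalizing a list that relationally stores only vocabulary IDs. The record layouts and field names are invented for illustration, not the database's actual schema:

```python
# Hypothetical, simplified stand-ins for the FMP files: vocabulary
# records keyed by unique ID, and a list that stores only those IDs.
vocab = {
    101: "aqua, -ae f. water",
    102: "bellum, -ī n. war",
}
diederich_300 = {"name": "Diederich 300", "item_ids": [101, 102]}

# Relational view: the list holds just links (IDs) to vocab items.
# Flat export: pull the full entries in alongside each list row,
# so a reader of the text file never has to cross-reference.
flat_rows = [
    (diederich_300["name"], item_id, vocab[item_id])
    for item_id in diederich_300["item_ids"]
]

for row in flat_rows:
    print("\t".join(str(field) for field in row))
```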

In the end though, I think I was successful. I created three new text files, which reflect the three files of the relational database:
  1. Vocabulary is in vocab.tab. These are like dictionary entries.
  2. Stems, smaller portions of vocab items, are in stems.tab. The vocab items list applicable stems, an example of something that was handled relationally in the original.
  3. The various sources for the vocabulary items are in readings.tab. It lists, for example, Diederich's list of 300 high-frequency items.
I also included the unique IDs that each item had in each database, so it would be possible to put them back together again, if you wanted (though you could just ask me for the files too). See the README and the files themselves for more detail. I feel pretty good though about my decision to use FMP. It was—and is, even if I'm not teaching a lot of Latin these days—a great tool to do this project in, and getting the data back out was fairly straightforward. 
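Putting the flat files back together via those IDs might look something like this. The column names are hypothetical stand-ins (see the repository's README for the real layouts):

```python
import csv
import io

# Invented two-column samples standing in for vocab.tab and stems.tab.
vocab_tab = "id\tentry\n101\taqua, -ae f. water\n102\tbellum, -ī n. war\n"
stems_tab = "id\tvocab_id\tstem\n1\t101\taqu\n2\t102\tbell\n"

# Index vocabulary rows by their exported unique ID.
vocab = {
    row["id"]: row
    for row in csv.DictReader(io.StringIO(vocab_tab), delimiter="\t")
}
stems = list(csv.DictReader(io.StringIO(stems_tab), delimiter="\t"))

# Reattach each stem to its vocabulary entry via the IDs.
for stem in stems:
    entry = vocab[stem["vocab_id"]]["entry"]
    print(stem["stem"], "->", entry)
```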

You can check out the entire set of files at my GitHub repository. And here's a little article by Diederich on his career. He was really an interesting guy, a classicist working on what became some very important things in American higher education.

07 July 2015

List of undergrad Classics programs - more pandoc & github

I've kept ("maintained" would be too strong a word) a list of undergraduate classics programs for a while now. I first started it when I was the webmaster for the American Classical League back in grad school and early in my career, but then I just kept it around because it seemed like a shame not to.

The other day I got a now rare update from someone on the list, and I thought this would be a good time to change the way I handle the whole thing.

First I figured I'd switch the basic form of this very simple page from html to easier-to-manage markdown, and then change my workflow to use pandoc to generate the html from that markdown whenever necessary. This part was fairly simple, though there were a few complications. For example, pandoc doesn't like html name tags, so in order to keep a bunch of internal links I had, I needed to convert those anchors to spans with IDs that matched the names I was using. BBEdit and its nice search-and-replace functionality to the rescue. I then did some minimal tweaks to the new pandoc-generated markdown file, so that pandoc could generate some decent html from it.
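That anchor-to-span conversion is a single regex substitution. A sketch in Python of what the BBEdit search and replace was doing (the sample markup here is invented):

```python
import re

html = 'Jump to <a name="mich"></a>University of Michigan'

# Turn empty <a name="..."></a> anchors into <span id="..."></span>,
# preserving each name as the span's id so internal links still work.
converted = re.sub(
    r'<a name="([^"]+)"></a>',
    r'<span id="\1"></span>',
    html,
)

print(converted)  # Jump to <span id="mich"></span>University of Michigan
```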

Step 2 was to put the new markdown and html up on GitHub, instead of in my institutional filespace. Not only does this give me some version control over the files, but I can let other people edit the markdown and send me pull requests, instead of emailing me updates that I then put into the file. I don't think I'll get a lot of these, but you never know.

So I can now (1) make changes to the markdown myself, or accept pull requests containing changes, (2) sync my local GitHub repo with the on-line version, (3) run pandoc on my local copy, and (4) sync it back to GitHub, where I release a new tagged version. I then link to the files directly via rawgit. (The tag in GitHub is needed to make rawgit grab the correct version instead of the one it caches.)

The relevant links are here:

Check out the list and let me know about any updates via a pull request!