03 October 2015

A Practical application of markdown & pandoc

As I recounted in another post to this blog, I started using the text-only markdown format for my writing a few months ago. In part I did this to avoid future problems with file formats, and in part to be able to have one central file that I could convert to various formats as needed, ideally by using pandoc.
Most people probably will understand the problem with obsolete file formats. I’ve got some not so old WordPerfect files on my computer right now, for example (and you can find a bunch of them on my employer’s servers as well), but no WordPerfect app (or “program” as we used to call it). I can still read the ones I’ve tried to open, but I need to be a bit clever about it and I doubt many of my colleagues would succeed on their own. I also had a heck of a time converting my 1998 dissertation to a newer version of Word, losing a little formatting along the way.
But the major impetus for the effort in recent months has been the second goal of having one master file in markdown (mentioned very briefly in that other post). In particular I was thinking about streamlining the workflow for updating my CV, which I do as often as every few months. I keep a version of the CV on line as a web page, and another version in PDF for me to share more formally and for the curious to download from the web page. I wasn’t too thrilled with how the html printed, so I didn’t spend much time cleaning it up for that purpose and instead generated the PDF from an OpenDoc file which is easy to format (and which I edit with LibreOffice, BTW). My workflow then was something like:
  1. Edit the html with BBEdit.
  2. Edit the odt file with exactly the same content changes.
  3. Print the odt file to PDF.
  4. Post the html on line.
  5. Post the PDF on line.
  6. Archive the successive .odt and PDF versions by saving the new files with an appropriate name (I stuck the date in it).
In short, not a super long workflow, but still plenty of opportunity for those two versions to go out of sync (which they did). More annoyingly though, it was just too many steps.
When I learned about pandoc, I figured this was going to do it for me. Instead of worrying about odt and html, I’d just have one markdown file and spin that off into html and a PDF, avoiding odt altogether. That turned out to be harder than I’d expected, mainly because I knew virtually nothing about TeX/LaTeX, the tool pandoc uses to generate PDF, and the default format didn’t look much like what I was used to generating. (My current default document format is based on the ideas in Matthew Butterick’s Practical Typography, which I highly recommend.) I also had the problem that the html wasn’t really printable in the format I was using: the links were underlined so visitors would know where to click, and it contained some info at the bottom including links that led to my department’s website and a few other places, all of which did not need to be included in the PDF.
At the heart of the problem is that markdown is a nice little tool, but it isn’t designed for formatting the text. That’s a pretty standard text-markup thing: put the document’s logical structure in it, but do the formatting elsewhere. Content not layout. So while you can say “this bit of text is a heading”, you can’t say what a heading should look like from within markdown. For that you need to use whatever formatting system your destination file requires (css for html, stylesheets for odt, and so on) and that’s where my ignorance of LaTeX came into play. After a little bit of trying to get up to speed on LaTeX, I decided to drop the idea of going directly to PDF and continue as I had been: print a file to PDF after it had been generated by pandoc. This was still an improvement over what I had been doing…as long as I could get the format right so I could use the html file for printing.
I put the project aside for a bit, but when I had to update the CV a few weeks ago, a few things occurred to me. First, I wasn’t using css to full advantage. I’d forgotten that it allows you to change styles depending on how the page is viewed. For example, the css can indicate one font for a big computer monitor and another for a smart phone. So I used that feature (specifically @media print coupled with the css pseudo-class last-of-type) to make that pesky final material disappear upon printing. I also made the links look like the rest of the text when printed. The other thing I realized was that my two-column format could easily be handled with standard html tables which I had abandoned years ago in an earlier version of the document and was now fudging in the markdown by using standard html…ugly. The last thing to do was use another css pseudo-class (first-of-type) to get my leading table to format differently from the other tables throughout the document. Here's what that last looks like. The contact details at the start are one table (the first in the document) and the educational institutions with dates are a second:

The last table is similar and disappears upon printing:
Where am I now with that workflow? Here’s the new one:
  1. Edit the markdown.
  2. Convert the markdown to html.
  3. Print the html file to PDF.
  4. Post the html on line.
  5. Post the PDF on line.
  6. Archive the markdown and html on GitHub.
Strictly speaking the PDF in step 3 is only for me now, since viewers of the web page version could just print it themselves and get the right formatting. Nevertheless I post the PDF for downloading to make it easy. (Now that I think of it, I should probably keep track of downloads to see if that’s worthwhile.) Step 6 was my solution to all the archiving work I was doing. It still requires some action on my part, but with the OS X GitHub Desktop app it’s much easier and quicker than the old method. In fact, although overall the number of steps is still the same, they entail a lot less work. Importantly, the repeated and error-prone editing step is gone.
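For anyone who wants to script step 2 of that workflow, here is a minimal sketch in Python. It only builds the pandoc command rather than running it. The file names cv.md and cv.css are made up for illustration, and the sketch assumes pandoc is installed wherever the command eventually gets run.

```python
from pathlib import Path


def pandoc_html_command(markdown_file, css_url=None):
    """Return the argument list for turning a markdown file into standalone html.

    The output file gets the same name as the input, with an .html extension.
    """
    source = Path(markdown_file)
    target = source.with_suffix(".html")
    command = ["pandoc", str(source), "--standalone", "--output", str(target)]
    if css_url is not None:
        # --css adds a link to the stylesheet in the html header.
        command += ["--css", css_url]
    return command


if __name__ == "__main__":
    # Hypothetical file names; hand the list to subprocess.run(...) to execute it.
    print(" ".join(pandoc_html_command("cv.md", css_url="cv.css")))
```

Printing to PDF, posting, and archiving stay manual here; only the conversion step is sketched.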

Lessons learned?

  • Pandoc is a great tool, but getting it to generate (nearly) identical-looking documents in different formats can be fairly hard. To be fair, it’s not really for that anyway.
  • If you don’t want to settle for default styles, you’re going to have to do a little work in your output format to get your document to look like you want. You may also have to compromise to get it to work in markdown.
  • It’s not always easy to step back and look at a problem with fresh eyes. I was so intent on rendering the original html in markdown that I didn’t realize that an easy(-ish) css solution was staring me in the face in the form of new html. Setting the project down for a while helped.
A final note: I continue to work (slowly) on my LaTeX knowledge, and have now got a generic document style going that looks fairly similar to my odt and html.
PS I don’t know why I’m capitalizing PDF, but not the other extensions.

19 August 2015

Archiving BMCR

Just a few days after my last post on archiving to fight link rot, Ryan Baumann (@ryanfb) wrote up his impressive efforts to make sure that all the links in the recently announced AWOL Index were archived. Since I was thinking about this sort of thing for the open-access Bryn Mawr Classical Review for which I'm on the editorial board, I figured I'd just use his scripts to make sure all the BMCR reviews were on the Internet Archive. (Thanks, Ryan!)

Getting all the URLs was fairly simple, though there was a little bit of brute-force work involved for the earlier years, before BMCR settled on a standard URL format. Actually there are still a few PDFs from scans of the old print versions which I completely missed on the first pass, but once I found out they were out there, it was easy enough to go get them. (I was looking for "html" as a way of pulling all the reviews, so the ".pdf" files got skipped.)
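The slip with the PDFs is easy to reproduce. Below is a small Python sketch of a URL filter that accepts more than one extension. It is not one of Ryan's scripts, and the example.org addresses are placeholders rather than real BMCR URLs.

```python
def review_urls(urls, extensions=(".html", ".pdf")):
    """Keep only the URLs that end in one of the given extensions.

    `extensions` must be a tuple, because str.endswith accepts a tuple of suffixes.
    """
    return [url for url in urls if url.lower().endswith(extensions)]


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    found = [
        "http://example.org/1990/01.01.01.html",
        "http://example.org/1990/scan-01.pdf",
        "http://example.org/about/",
    ]
    print(review_urls(found, extensions=(".html",)))  # the first pass: PDFs skipped
    print(review_urls(found))                         # html and pdf together
```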

In the end fewer than 10% of the 10,000+ reviews were missing from the Archive, and those are there now, assuming I got them all up there. Let me know if you find one I missed.

I'm still looking at WebCite too.

14 August 2015

Fight link rot!

Argh, it just happened to me again. I clicked on a link on a webpage only to find that the page on the other end of the link was gone. 404. A victim of link rot.

This kind of thing is more than a hassle. It's a threat to the way we work as scholars, where citing one's sources and evidence is at the heart of what we do. Ideally links would never go away, but the reality is that they do. How often? A study cited in a 2013 NYTimes article found that 49% of links in Supreme Court decisions were gone. The problem has gotten big enough that The New Yorker had a piece on it back in January.

What I wanted to do here is point out that there are ways for us scholars to fight link rot, mostly thanks to the good work of others (and isn't that the whole point of the Internet?). Back in that ideal world, publishers would take care that their links never died, even if they went out of business, but we users can help them out by making sure that their work gets archived when we use it. Instead of simply linking to a page, either link to an archived version of it or archive it when you cite it, so that others can go find it later.

I've used two archiving services, Archive.org and WebCite. Both services respect the policies of sites with regard to saving copies (i.e., via their robots.txt files), but Archive.org will actually keep checking policies, so it's possible that a page you archived will later disappear. That won't happen on WebCite. WebCite will also archive down a few links deep on the page you ask it to archive, while Archive.org just does that one page.

WebCite is certainly more targeted to the scholarly community, and their links are designed to be used in place of the originals in your work. But both of them are way better than nothing, and you'll find lots of sites using them. For convenience there are bookmarklets for each that you can put in your browser bar for quick archiving (WebCite, Archive.org).

So next time you cite a page, make sure you archive it. Maybe even use WebCite links in your stuff (like I did in this post on the non-Wikipedia links).

(FYI, another service is provided by Perma.cc, which is designed for the law, covered in this NPR story.)

Added 18 August 2015: Tom Elliott (@paregorios) notes this article on using WebCite in the field of public health (which I link to via its DOI).

20 July 2015

Frequent Latin Vocabulary - sharing the data(base)

When I first started teaching after grad school, I did a lot of elementary-Latin instruction. I felt well prepared for this because I did my graduate work at the University of Michigan, where the Classical Studies Department has a decades-long tradition of paedagogical training. It includes people like Waldo "Wally" Sweet, Gerda Seligson, Glenn Knudsvig, and Deborah Ross. One consequence of this teaching and preparation was that I became very interested myself in Latin paedagogy and my first research output was in this direction.

In particular I started looking at the Latin vocabulary that students were learning and how that related to the vocabulary that they were reading in the texts they encountered in intermediate and upper-level classes. As I investigated this, I learned that there had been a lot of work on exactly this area not only among people studying second-language acquisition, but also in Classics circles back in the 1930s, 40s and 50s. One of the more interesting people in this area was someone that many classicists will not know, Paul B. Diederich. Diederich had quite an interesting career, working even at that early date in what is now the trendiest of educational concerns, assessment, mainly in writing and language instruction, and eventually making his way to the Educational Testing Service, ETS, which gave us the SAT.

Diederich's University of Chicago thesis was entitled "The frequency of Latin words and their endings." As the title suggests, it involves determining the frequency of both particular Latin words and endings for both nouns/adjectives/pronouns and verbs. In other words, it was a bit of what would now qualify as Digital Humanities. Diederich of course lacked a corpus of computerized texts, so he had to do this counting by hand. He made copies of the pages of major collections of Latin works, using different colors for different genres (big genres, like poetry and prose), and then cut these sheets of paper up so that each piece contained one word. Then he counted up the words (over 200,000!) and calculated the frequencies. The biggest challenge he faced was the way his method completely destroyed the context of the individual words; once the individual words were isolated, it was impossible to know where they came from. One result of this was acknowledged by Diederich in the thesis: not all Latin words are unique in form. For example the word cum is both a preposition meaning "with" and a subordinating conjunction "when/after/because." This meant that Diederich needed either to combine counts for these words (which he did for cum), or label such ambiguities before cutting up the paper. As he himself admits, he did a fairly good job of the latter, but didn't quite get them all. Another decision that had to be made was what to do with periphrases, that is, constructions that consist of more than one word. Think of the many English verb forms that fall into this category: did go, will go, have gone, had gone, am going, etc. Do you want to count "did go" as one word or two?

Interesting to me was that Diederich was careful to separate words that the Romans normally wrote together. These usually short words, called enclitics, were appended in Latin to the preceding words, a bit like the "not" in "cannot" (which I realize not everyone writes as one word these days). This was a good choice on Diederich's part, as one of these words, -que meaning "and," was the most frequent word in his sample. (As a side note, some modern word-counting tools, like the very handy vocab tool in Perseus, do not count enclitics at all. Such modern tools also can't disambiguate like Diederich could, so you'll see high counts for words like edo, meaning "eat," since it shares forms with the very common word esse, "to be." Basically we're giving up some disambiguation in exchange for automation and its incredible speed increases.)
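To make the enclitic point concrete, here is a rough Python sketch of counting word forms while splitting off -que. It is not Diederich's method (he worked by hand) and not how Perseus works. The exception list is a tiny hand-picked assumption, and the sketch counts surface forms only, so it has exactly the ambiguity problem described above.

```python
from collections import Counter

# Words that merely end in "que"; a real list would need to be much longer.
NOT_ENCLITIC = {"atque", "neque", "quoque", "usque"}


def split_enclitic(word):
    """Split a trailing -que off a word unless the word is a known exception."""
    if word.endswith("que") and len(word) > 3 and word not in NOT_ENCLITIC:
        return [word[:-3], "-que"]
    return [word]


def count_words(text):
    """Count lower-cased word forms, treating -que as a word of its own."""
    counts = Counter()
    for raw in text.lower().split():
        word = raw.strip(".,;:!?")
        if word:
            counts.update(split_enclitic(word))
    return counts


if __name__ == "__main__":
    print(count_words("Arma virumque cano, atque arma.").most_common())
```

A form like quisque would be split wrongly by this sketch, which is the sort of case a human counter handles without thinking.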

The article I eventually published (“Frequent Vocabulary in Latin Instruction,” The Classical World 97, no. 4 (2004): 409-433) involved me using the computer to create a database of Latin vocabulary and then counting frequencies for a number of textbooks, comparing them to another set of frequent-vocabulary lists. I put some of the results of this work up on the internet (here, for example), but didn't do a lot of sharing of the database itself. This wasn't so easy way back in the early 'aughts, but it is now. Hence this post (which is a great example of burying the lede, I suppose).

I created the database in FileMaker Pro version 3. Then migrated to version 6, then 8, and now 12. (Haven't made the jump to 13 yet.) Doing this work in a tool like FMP has its pros and cons, and was the subject of some debate at our LAWDI meetings a few years ago. Big on the pro side is the ease of use of FMP and the overall power of the relational-database model. On the con side is the difficulty in getting the data back out so that it can be worked on with other tools that can't really do the relational thing. For me FMP also allowed the creation of some very nice handouts for my classes, and powerful searches once I got the data into it. In the end though, if I'm going to share some of this work, it should be in a more durable and easily usable form, and put someplace where people can easily get to it and I won't have to worry too much about it. I decided on a series of flat text files for the format, and GitHub for the location. I'm going to quote the README file from the repository for a little taste of what the conversion was like:

Getting the data out of FMP and into this flat format required a few steps. First was updating the files. I already had FMP3 versions along with the FMP6 versions that I had done most of the work in. (That's .fp3 and .fp5 for the file extensions.) Sadly FMP12, which is what I'm now using, doesn't directly read the .fp5 format at all, and FMP6 is an old PowerPC app, which OS X 10.9 (Mavericks) can't run directly. So here's what I did:
  • Create a virtual OS X 10.6 (Snow Leopard) box on my Mavericks system. Snow Leopard was the last OS X version to be able to run Rosetta, Apple's emulator for PowerPC apps. That took a little doing, since I updated the various system pieces as needed. Not that this version of OS X can be super secure, but I just wanted it as close as possible.
  • Convert the old .fp5 files to .fp7 with FMP 8 (I keep a few versions of FMP around).
  • Archive the old .fp5 files as a zip file.
  • Switch back to Mavericks.
  • Archive the old .fp7 files. I realized that the conversion process forced me to rename the originals, and zipping left them there, so I could skip the step of restoring the old filenames.
  • Convert the .fp7 to .fmp12.
  • Export the FMP files as text. I'm using UTF-16 for this, because the database uses diaereses for long vowels (äëïöü). Since this is going from relational to flat files, I had to decide which data to include in the exports.
  • Convert the diaereses to macrons (āēīōū). I did this using BBEdit.
  • Import the new stems with macrons back into FMP. I did it this way because search and replace on BBEdit is faster than in FMP.
  • Put the text files [on GitHub].
FMP makes the export process very easy. The harder part was deciding which information to include in which export. An advantage of the relational database is that you can keep a minimal amount of information in each file and combine it via relations with the information in other files. In this case, for example, the lists of vocabulary didn't have to contain all the vocabulary items within their files, but simply a link to those items. For exports though you'd want those items to be there for each list. Otherwise you end up doing a lot of cross-referencing. It's this kind of extra work, which admittedly can be difficult, especially when you have a complicated database that you designed a while back, that makes some avoid FMP (and other relational databases) from the start.
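The diaeresis-to-macron step in the list above was done with search and replace in BBEdit. For anyone who would rather script it, the same substitution can be sketched in a few lines of Python. The capital letters in the table are my addition (the README mentions only the lowercase vowels), and the sketch assumes every diaeresis in the data marks a long vowel.

```python
import unicodedata

# Map each vowel with a diaeresis to the same vowel with a macron.
DIAERESIS_TO_MACRON = str.maketrans("äëïöüÄËÏÖÜ", "āēīōūĀĒĪŌŪ")


def to_macrons(text):
    """Replace diaereses with macrons, after composing any decomposed characters."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(DIAERESIS_TO_MACRON)


if __name__ == "__main__":
    print(to_macrons("amäre, monëre, audïre"))
```

An exported file would be read with open(path, encoding="utf-16") to match the export setting, and it could be written back out in whatever encoding the destination needs.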

In the end though, I think I was successful. I created three new text files, which reflect the three files of the relational database:
  1. Vocabulary is in vocab.tab. These are like dictionary entries.
  2. Stems, smaller portions of vocab items, are in stems.tab. The vocab items list applicable stems, an example of something that was handled relationally in the original.
  3. The various sources for the vocabulary items are in readings.tab. It lists, for example, Diederich's list of 300 high-frequency items.
I also included the unique IDs that each item had in each database, so it would be possible to put them back together again, if you wanted (though you could just ask me for the files too). See the README and the files themselves for more detail. I feel pretty good though about my decision to use FMP. It was—and is, even if I'm not teaching a lot of Latin these days—a great tool to do this project in, and getting the data back out was fairly straightforward. 
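As a hedged illustration of what "putting them back together" could look like, here is a Python sketch that joins stem rows to vocabulary rows on a shared ID. The column names (vocab_id, entry, stem_id, stem) are invented for the example and are not necessarily the headers in the actual .tab files; check the README for the real ones.

```python
import csv
import io


def read_tab(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attach_stems(vocab_rows, stem_rows, key="vocab_id"):
    """Return copies of the vocab rows, each with a list of its stems attached."""
    stems_by_vocab = {}
    for stem in stem_rows:
        stems_by_vocab.setdefault(stem[key], []).append(stem["stem"])
    return [dict(row, stems=stems_by_vocab.get(row[key], [])) for row in vocab_rows]


if __name__ == "__main__":
    # Made-up sample data using the hypothetical column names.
    vocab = read_tab("vocab_id\tentry\n1\tamo, amare\n2\trex, regis\n")
    stems = read_tab("stem_id\tvocab_id\tstem\n10\t1\tam\n11\t1\tamāv\n12\t2\trēg\n")
    for row in attach_stems(vocab, stems):
        print(row["entry"], row["stems"])
```

This is the cross-referencing that the relational version did automatically. With flat files it becomes a small lookup table built in memory.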

You can check out the entire set of files at my GitHub repository. And here's a little article by Diederich on his career. He was really an interesting guy, a classicist working on what became some very important things in American higher education.

07 July 2015

List of undergrad Classics programs - more pandoc & github

I've kept ("maintained" would be too strong a word) a list of undergraduate classics programs for a while now. I first started it when I was the webmaster for the American Classical League back in grad school and early in my career, but then I just kept it around because it seemed like a shame not to.

The other day I got a now rare update from someone on the list, and I thought this would be a good time to change the way I handle the whole thing.

First I figured I'd switch over the basic form of this very simple page from html to the easier to manage markdown, and then change my workflow to use pandoc to generate the html from this markdown whenever necessary. This part was fairly simple, though there were a few complications. For example, pandoc doesn't like html name tags, so in order to keep a bunch of internal links I had, I needed to convert those anchors to spans with IDs that matched the names I was using. BBEdit and its nice search-and-replace functionality to the rescue. I then did some minimal tweaks to the new pandoc-generated markdown file, so that pandoc could now generate some decent html from it.
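I did the anchor conversion with search and replace in BBEdit, but the same idea can be sketched as a regular expression in Python. This version assumes the anchors carry nothing but a double-quoted name attribute; anything fancier (extra attributes, single quotes) is left untouched and would need to be handled by hand.

```python
import re

# Matches <a name="...">...</a> where name is the only attribute.
NAME_ANCHOR = re.compile(r'<a\s+name="([^"]+)"\s*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def anchors_to_spans(html):
    """Rewrite named anchors as spans whose id matches the old name."""
    return NAME_ANCHOR.sub(r'<span id="\1">\2</span>', html)


if __name__ == "__main__":
    # Sample markup invented for illustration.
    sample = '<a name="michigan">Michigan</a> and <a href="#michigan">back</a>'
    print(anchors_to_spans(sample))
```

Ordinary links with href are not matched, so the internal links that point at the old names keep working against the new IDs.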

Step 2 was to put the new markdown and html up on GitHub, instead of in my institutional filespace. Not only does this give me some version control over the files, but I can let other people edit the markdown and send me pull requests, instead of emailing me updates that I then put into the file. I don't think I'll get a lot of these, but you never know.

So I can now (1) make changes to the markdown myself, or accept pull requests with them in it, (2) sync my local github repo with the on-line version, (3) run pandoc on my local copy, and (4) sync it back to GitHub where I release a new tagged version. I then link to the files directly via rawgit. (The tag in GitHub is needed to make rawgit grab the correct version instead of the one it caches.)

The relevant links are here:

Check out the list and let me know about any updates via a pull request!

10 June 2015

Self-publishing with pandoc, etc

Depending on your thinking, I'm either just done with or approaching the end of a sabbatical. ("Just done with" if you think that once commencement occurs, it's just a regular summer.) Among the things I produced in the past few months is a very short "note" on a topic that doesn't fall within my usual area of research. I sent it to a couple of OA journals, but neither wanted to publish it as is. I'm not interested in doing more with it at this time, but it seems silly to have it just sit on my hard drive doing nothing. It's the kind of thing I'd do as a conference paper, if I went to a conference at which I think it'd be welcome. But since I don't go to such conferences, I figure I'll just put it out there for people to check out anyway. (The advantages of tenure and the internet!)

I could do it as a blog post, though it's already written in a more "academic" style than I write this blog in. Instead I'm going to post it as html on github and as a pdf on my account at figshare, where it's easily accessible, archived, and even gets a DOI. I'll also link to it from my academia.edu page (as well as here, obviously).

The Workflow

I've started using markdown with pandoc to generate documents. I was inspired by Dennis Tenen and Grant Wythoff's post last year, "Sustainable Authorship in Plain Text using Pandoc and Markdown," but I've long been a fan of avoiding proprietary formats that are likely to become obsolete (no doubt in part because I work with very old texts and materials professionally). It's easy enough to do simple stuff this way, but getting to more complex documents requires some work. Here's a list of stuff I do/use:

  • For editing my markdown documents, I use the free MacDown, which gives a nice split screen, showing the raw markdown on the left and the interpreted version on the right. There are a number of pandoc "enhancements" to markdown that MacDown can't handle, but it gets the vast majority of the formatting right and it prevents me from making stupid mistakes in that majority.
  • I keep all my bibliography in Zotero. I export it all as a bibtex file using Better BibTeX, which provides some nice customization of the export entries. Once this bibtex file is created, I can easily cite the works in it within markdown and then let pandoc-citeproc expand them as appropriate.
  • I've given up on using pandoc to produce final versions of the same file in different formats. I'm mostly interested in html, OpenDoc, pdf and, sadly, Word. There are just too many complications in academic documents (footnotes, etc) and my skills and time are limited. LaTeX PDFs have a certain look to them, but anything I can print, OS X can turn into a PDF, so that's not a big deal for me. In most other cases, I don't need both html and odt/docx versions, so I can skip it there too. (I was really hoping that I could generate my CV in html and PDF directly, since I've been maintaining html and odt versions, the latter of which I turned into PDF via OS X, but I've yet to get enough into LaTeX to be able to reproduce my mildly complicated CV format.)
  • For html, I use a few different versions of my standard css file. It's all up on github, so you can see what I've done. I just discovered rawgit.com, so now I'm converting my html files to refer to the css files there, instead of on the institutional server that I've been using (and which has recently become a bit more difficult to keep updated from home). FYI, I usually edit css files manually with BBEdit, but I'm trying out CSSEdit, which seems to have been EOL'ed.
  • On the pandoc side, I've tweaked the default templates for html, odt and docx, so that they can handle multiple authors, as well as a license field in the header. Again, all on github.
  • To tie it together a bit, I wrote an applescript that takes the frontmost document in MacDown and runs it through pandoc, outputting whatever type of file you want (based on extension) and also allowing user-inputted pandoc switches.
The overall process then runs like this: write in MacDown, incorporating the citation keys from Zotero; process that with pandoc to generate the desired final file type; publish/share/whatever.
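My applescript isn't reproduced here, but the core of what it hands to pandoc can be sketched in Python. This builds the command without running it. The file names are hypothetical, and the citation switches assume a pandoc setup that has the pandoc-citeproc filter (newer pandoc releases build citation processing in and use a different switch).

```python
def pandoc_command(source, output, bibliography=None, extra_switches=()):
    """Build a pandoc command; pandoc picks the output format from the extension."""
    command = ["pandoc", source, "--standalone", "--output", output]
    if bibliography:
        # Expand citation keys in the markdown using the exported bibtex file.
        command += ["--filter", "pandoc-citeproc", "--bibliography", bibliography]
    # Pass through any switches supplied by the user.
    command += list(extra_switches)
    return command


if __name__ == "__main__":
    # Hypothetical names; give the list to subprocess.run(...) to execute it.
    print(" ".join(pandoc_command("note.md", "note.odt", bibliography="library.bib")))
    print(" ".join(pandoc_command("note.md", "note.html", extra_switches=["--css", "note.css"])))
```

Keeping the command as a list rather than a single string avoids quoting trouble with file names that contain spaces.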

It seems easy when I write it like that.

Numbering for citation

My one concern with html output of the note that I just wrote was that html has no default pagination, and pages are usually the way one cites an article. So instead of numbered pages, I decided to go with numbered paragraphs. (Read about Sebastian Heath's approach to articles he's editing for ISAW.) But how to number them, so that the numbers were visible (for easy citation) and so that I didn't have to manually put them in? With a little help from the pandoc Google group, I combined some features of pandoc with css. Since pandoc automatically gives IDs to headers and css allows for formatting those headers and even for auto-numbering them, I put a nearly empty level-6 header at the start of each paragraph in my markdown document and I used css to number them and put that number off in the margin. (They're nearly empty because markdown won't create empty headers.) Although the numbers are visible to a human reader, the IDs aren't ideal: section, section-1, section-2, and so on, but they are sequential and linkable. The headers are also a bit ugly in markdown, but they work and they also make it possible for me to indicate logical paragraphs instead of the actual ones. This is useful, for example, when there's a block quote, which technically creates a new paragraph, right in the middle of a logical paragraph.

One more thing, since I'm using css to number the paragraphs in the html version, those numbers are technically part of the display of the article, not part of its content. So you can see them, but you can't find them if you search in your browser. That's not the case in the PDF; there the numbers are "real" and you can find them in a search.
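Since the IDs follow the pattern described above (section, section-1, section-2, and so on), a link to a given paragraph number can be worked out mechanically. Here is a small Python sketch of that mapping. It assumes the numbering headers are the only ones in the document that end up with these IDs, and the page address in the example is a placeholder.

```python
def paragraph_id(number):
    """Return the header ID for the paragraph with this displayed number (1-based)."""
    if number < 1:
        raise ValueError("paragraph numbers start at 1")
    return "section" if number == 1 else f"section-{number - 1}"


def paragraph_link(page_url, number):
    """Return a link straight to the numbered paragraph on the given page."""
    return f"{page_url}#{paragraph_id(number)}"


if __name__ == "__main__":
    # Placeholder address for illustration.
    print(paragraph_link("http://example.org/note.html", 3))
```

The off-by-one between the visible number and the ID is the price of the first header getting plain "section" with no suffix.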

The Article

So about the article itself...it has to do with the original nature of the Golden Calf in Exodus 32. I speculate that it was in origin a "corn calf," to be associated with a lost harvest ritual. Go have a read:

15 January 2015

The Humanities Open Book Program (@NEH_ODH)

The NEH and Mellon Foundation just announced a new project today, under the broader auspices of the NEH's The Common Good: The Humanities in the Public Square project. It's called the Humanities Open Book Program and will provide funds so that organizations can "digitize [out-of-print scholarly] books and make them available as Creative Commons-licensed 'ebooks' that can be read by the public at no charge on computers, mobile devices, and ebook readers."

This is great. There are lots of books that fall into this category and would see a lot more use if they were available digitally. I'd love them for myself, but I can also imagine assigning them (or parts of them) more frequently to my students. Also great is that the program insists that the books be released in the EPUB format, which is open, looks good on lots of readers, and makes it fairly easy to get the text out.

Regarding that last, a potential limitation that I hope we don't actually see too much of results from the program's lack of a specific requirement that the work be re-usable. Instead what's required is a CC license. Any CC license. That means that in reality there's no guarantee that it will be possible to reuse the work (apart from the usual fair-use ways). I tweeted this question and @NEH_ODH replied quickly (love those guys):
So let's hope that lots of publishers do make the choice to allow such re-use. I'm worried about it in part because we know what publishers can be like. On the other hand, the program explicitly solicits applications from more than just presses: "scholarly societies, museums, and other institutions that publish books in the humanities," and these groups might be a little more inclined to use a more permissive license. A little outside pressure might not hurt either.

(If @NEH_ODH would like to comment, I'd be curious to know why they didn't impose a more open licensing requirement. Worried that publishers might not respond so openly?)