03 October 2015

A practical application of markdown & pandoc

As I recounted in another post to this blog, I started using the text-only markdown format for my writing a few months ago. In part I did this to avoid future problems with file formats, and in part to be able to have one central file that I could convert to various formats as needed, ideally by using pandoc.
Most people will probably understand the problem with obsolete file formats. I’ve got some not-so-old WordPerfect files on my computer right now, for example (and you can find a bunch of them on my employer’s servers as well), but no WordPerfect app (or “program” as we used to call it). I can still read the ones I’ve tried to open, but I need to be a bit clever about it and I doubt many of my colleagues would succeed on their own. I also had a heck of a time converting my 1998 dissertation to a newer version of Word, losing a little formatting along the way.
But the major impetus for the effort in recent months has been the second goal of having one master file in markdown (mentioned very briefly in that other post). In particular I was thinking about streamlining the workflow for updating my CV, which I do as often as every few months. I keep a version of the CV on line as a web page, and another version in PDF for me to share more formally and for the curious to download from the web page. I wasn’t too thrilled with how the html printed, so I didn’t spend much time cleaning it up for that purpose and instead generated the PDF from an OpenDoc file which is easy to format (and which I edit with LibreOffice, BTW). My workflow then was something like:
  1. Edit the html with BBEdit.
  2. Edit the odt file with exactly the same content changes.
  3. Print the odt file to PDF.
  4. Post the html on line.
  5. Post the PDF on line.
  6. Archive the successive .odt and PDF versions by saving the new files with an appropriate name (I stuck the date in it).
In short, not a super long workflow, but still plenty of opportunity for those two versions to go out of sync (which they did). More annoyingly though, it was just too many steps.
When I learned about pandoc, I figured this was going to do it for me. Instead of worrying about odt and html, I’d just have one markdown file and spin that off into html and a PDF, avoiding odt altogether. That turned out to be harder than I’d expected, mainly because I knew virtually nothing about TeX/LaTeX, the tool pandoc uses to generate PDF, and its default format didn’t look much like what I was used to generating. (My current default document format is based on the ideas in Matthew Butterick’s Practical Typography, which I highly recommend.) I also had the problem that the html wasn’t really printable in the format I was using: the links were underlined so visitors would know where to click, and the page contained some info at the bottom, including links that led to my department’s website and a few other places, none of which needed to be included in the PDF.
At the heart of the problem is that markdown is a nice little tool, but it isn’t designed for formatting the text. That’s a pretty standard text-markup thing: put the document’s logical structure in it, but do the formatting elsewhere. Content, not layout. So while you can say “this bit of text is a heading”, you can’t say what a heading should look like from within markdown. For that you need to use whatever formatting system your destination file requires (css for html, stylesheets for odt, and so on) and that’s where my ignorance of LaTeX came into play. After a little bit of trying to get up to speed on LaTeX, I decided to drop the idea of going directly to PDF and continue as I had been: print a file to PDF after it had been generated by pandoc. This was still an improvement over what I had been doing…as long as I could get the format right so I could use the html file for printing.
I put the project aside for a bit, but when I had to update the CV a few weeks ago, a few things occurred to me. First, I wasn’t using css to full advantage. I’d forgotten that it allows you to change styles depending on how the page is viewed. For example, the css can indicate one font for a big computer monitor and another for a smartphone. So I used that feature (specifically @media print coupled with the css pseudo-class last-of-type) to make that pesky final material disappear upon printing. I also made the links look like the rest of the text when printed. The other thing I realized was that my two-column format could easily be handled with standard html tables, which I had abandoned years ago in an earlier version of the document and was now fudging in the markdown by using standard html…ugly. The last thing to do was use another css pseudo-class (first-of-type) to get my leading table to format differently from the other tables throughout the document. Here’s what that last one looks like. The contact details at the start are one table (the first in the document) and the educational institutions with dates are a second:

The last table is similar and disappears upon printing:
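As a rough sketch of the kind of css rules involved (not the actual stylesheet: the selectors, the margin value, and the file name cv.css are all assumptions for illustration), here is a small Python snippet that holds such a stylesheet and writes it to disk. It assumes the tables all sit side by side under the same parent element, which is what first-of-type and last-of-type need in order to pick out the first and last table on the page.

```python
from pathlib import Path

# Hypothetical stylesheet: selectors and values are illustrative assumptions,
# not the rules used on the real CV page.
PRINT_CSS = """\
/* The first table among its siblings (the contact details) gets its own look. */
table:first-of-type {
  margin-bottom: 2em;
}

@media print {
  /* Hide the closing table of site links when the page is printed. */
  table:last-of-type {
    display: none;
  }

  /* Make links look like the surrounding text on paper. */
  a {
    color: inherit;
    text-decoration: none;
  }
}
"""


def write_stylesheet(path="cv.css"):
    """Write the sketch stylesheet to disk and return its path."""
    target = Path(path)
    target.write_text(PRINT_CSS, encoding="utf-8")
    return target
```

Everything outside the @media print block applies on screen and on paper alike, while the rules inside it only take effect when the page is printed, so the web version keeps its underlined links and footer material.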
Where am I now with that workflow? Here’s the new one:
  1. Edit the markdown.
  2. Convert the markdown to html with pandoc.
  3. Print the html file to PDF.
  4. Post the html on line.
  5. Post the PDF on line.
  6. Archive the markdown and html on GitHub.
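Step 2 of the list above could be scripted along these lines. This is only a sketch under stated assumptions: the file names cv.md, cv.html, and cv.css are hypothetical, and the exact options used for the real CV aren't given in this post. The pandoc flags themselves (--from, --to, --standalone, --css, --output) are standard ones.

```python
import shutil
import subprocess


def pandoc_command(source="cv.md", output="cv.html", css="cv.css"):
    """Build the argument list for a markdown-to-html pandoc run.

    The default file names are hypothetical placeholders.
    """
    return [
        "pandoc",
        source,
        "--from", "markdown",
        "--to", "html",
        "--standalone",   # emit a complete html page rather than a fragment
        "--css", css,     # link the stylesheet from the page head
        "--output", output,
    ]


def build(source="cv.md", output="cv.html", css="cv.css"):
    """Run pandoc if it is installed; return True when the page was built."""
    if shutil.which("pandoc") is None:
        return False
    subprocess.run(pandoc_command(source, output, css), check=True)
    return True
```

Calling build() regenerates the html from the markdown in one go; printing to PDF and posting the files remain manual steps, as in the list.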
Strictly speaking, the PDF in step 3 is only for me now, since viewers of the web page version could just print it themselves and get the right formatting. Nevertheless I post the PDF for downloading to make it easy. (Now that I think of it, I should probably keep track of downloads to see if that’s worthwhile.) Step 6 was my solution to all the archiving work I was doing. It still requires some action on my part, but with the OS X GitHub Desktop app it’s much easier and quicker than the old method. In fact, although the number of steps is the same overall, they entail a lot less work. Importantly, the repeated and error-prone editing step is gone.

Lessons learned?

  • Pandoc is a great tool, but getting it to generate (nearly) identical-looking documents in different formats can be fairly hard. To be fair, it’s not really for that anyway.
  • If you don’t want to settle for default styles, you’re going to have to do a little work in your output format to get your document to look like you want. You may also have to compromise to get it to work in markdown.
  • It’s not always easy to step back and look at a problem with fresh eyes. I was so intent on rendering the original html in markdown that I didn’t realize that an easy(-ish) css solution was staring me in the face in the form of new html. Setting the project down for a while helped.
A final note: I continue to work (slowly) on my LaTeX knowledge, and have now got a generic document style going that looks fairly similar to my odt and html.
PS I don’t know why I’m capitalizing PDF, but not the other extensions.