21 January 2017

A Millennium is 1,000 years long

There's been a lot already written about Trump's inaugural address. (Here's a good piece by James Fallows.) I'm a little surprised not to have found any comments so far on a line that struck me:
We stand at the birth of a new millennium...
"Millennium" isn't one of the "first time in an inaugural" words that Trump used yesterday.

It occurs in two earlier addresses, Nixon's first, where he refers to the upcoming bicentennial and observes that some alive at that time would live to see "the beginning of the third millennium"; and Clinton's second, where he notes that they were "at the edge of a new century, in a new millennium." (Interestingly George W. Bush, sworn in at the start of the new millennium, didn't mention it at all, preferring to refer to the "new century.")

The idea of "ages" of civilization is a pretty old one. The Greek poet Hesiod described the now conventional "metal" ages (γένεα) of gold, silver, bronze, and iron (with some demigods thrown in for good measure) with their continual decline. The Romans too, perhaps taking their lead from the Etruscans, had their ages, saecula, which were more like centuries (the Latin word saeculum means a literal "century" as well as a less precise "age" and gives us the English "secular"). The idea was picked up by some at the beginning of the United States and, in a lightly adapted line from Vergil's fourth Eclogue, appears on the reverse of the Great Seal of the United States, and thence on our money: Novus ordo seclorum ("a new order of the ages").

This phrase also plays a role in some of the "new world order" conspiracies, but we'll leave that alone here and get back to Trump's phrase.

Both earlier inaugural uses of "millennium" make some sense in that they were referring to an actual start of a millennium. But that's a bit of a stretch for us here in 2017. While in relative terms it's now pretty early in the current millennium (and maybe it seems more so to those of us who've spent most of our lives in the 20th century), the millennium has been "born" for a while now (in its "infancy"?). But it strikes me that there is a precedent for declaring the start of a new thousand-year period, and that unfortunately is to be found in Hitler's notion of a "tausendjähriges Reich," referred to, for example, in Riefenstahl's film Triumph of the Will. This is sadly not the only bit of Nazi-related rhetoric in the speech. As has been noted repeatedly (see this March 2016 letter to the NYT), the Trump campaign's "America First" slogan, repeated in the address yesterday, has its origins in the isolationist America First movement associated with anti-Semitism.

In something apparently mostly written by Steve Bannon, former executive chair of Breitbart news, which he called "the platform for the alt-right," this is not surprising and continues the dangerous pattern of Trump's dog whistles—and worse—to the intolerant right.

01 August 2016

Spreadsheet to GeoJSON: Part 2, in which I do not re-invent the wheel

My last post was about doing a simple, but automated conversion of our growing Google sheet of Roman temples into GeoJSON. At the end, I noted that there were some existing methods for doing that, and in the interim I’ve been exploring those a bit.

Most of them involve some kind of javascript and are designed for use on websites (or with node). That’s pretty nice in some circumstances, and in fact I’ve been playing around with them, trying to improve my leaflet.js skills in the process.

However I’m still interested in being able to download our data and convert all of it to GeoJSON which gets posted to GitHub. Javascript isn’t ideal for that (though I think I could make it work, if I knew js better), but there is some software out there that means I don’t have to worry about the actual conversion from sheet to json, though I do have to do some clean up. Ideally the script will not require any interaction, and it will convert all the data we have into GeoJSON, not just the bits that I tell it about. This will let us add and delete columns in the sheet without worrying about whether it will screw up the conversion.

The Details

Temples and mithraea mapped in the area of Rome.
I settled on the nifty ogr2ogr utility from GDAL, the Geospatial Data Abstraction Library, to do the conversion (or conversions, as it turned out), and still did some cleanup with jq and the usual text-manipulation tools. The script was written in bash, so it can be run from the command line (or automatically via cron, or LaunchDaemon these days, I guess). Here’s how the script goes:
  1. Download the sheet from Google as xml: https://spreadsheets.google.com/feeds/list/key/1/public/values, where key is the key for the sheet. Pretty sure that I can get csv if I authenticate to Google, but I couldn’t figure out how to do that in a timely way, so I stuck with the public feed. When I tried to download those versions via the browser, it did work, likely because I’m logged into Google services all the time, but possibly because Google doesn’t like people using curl to get data, but will allow a real browser to. In either case, I didn’t want to have to rely on the browser for this, so I stuck with what I could get via the command line.
  2. Reformat and clean up the xml with tidy -xml -iq and then sed 's/gsx://g'. (Google really likes those gsx prefixes on the field names.)
  3. Use ogr2ogr to convert the xml to csv. ogr2ogr seems to need files to work on, so at this point I also stored the data in temporary files. (Happy to be wrong about the necessity for files, so let me know!)
  4. Now use ogr2ogr again, this time to convert the csv to GeoJSON. You’d think I could convert the xml directly to GeoJSON, but, again, I couldn’t figure that out (.vrt files) and so resorted to the double conversion. Help welcome!
  5. At this point, I do a little more clean up, getting rid of some fields from the xml that I don’t want (via grep, jq, and perl) and removing the false 0,0 coordinates that get generated instead of null when the sheet cells for longitude and latitude are empty.
  6. Finally I save the resultant file to my GitHub sync folder. I also generate a version that has only entries with coordinates, which can be mapped without complaints. That last is handled with jq and tr.
In the end then I’ve got a nice little script that does the job. Some possible improvements include getting ogr2ogr to do the xml-to-GeoJSON directly and using Google auth codes to avoid having to have a public sheet. One thing perhaps best left for another script is the creation of leaflet-related properties connected to how the icon gets displayed (icon color, image, etc.).
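For anyone curious about step 5, here is a minimal sketch of how the false 0,0 coordinates might be handled with jq. It is not the script from GitHub: the function names null_empty_coords and only_mapped are made up for illustration, and it assumes the GeoJSON coming out of ogr2ogr has Point geometries whose empty cells came through as exactly [0, 0].

```shell
# Sketch only: hypothetical helpers, not the author's actual script.
# Both read GeoJSON on stdin and write GeoJSON on stdout; both need jq.

# Replace any geometry sitting at exactly [0, 0] with null
# (GeoJSON allows a Feature to have a null geometry).
null_empty_coords() {
  jq '.features |= map(
        if .geometry.coordinates == [0, 0] then .geometry = null else . end
      )'
}

# Keep only the features that still have a geometry, for mapping.
only_mapped() {
  jq '.features |= map(select(.geometry != null))'
}
```

Chaining the two (for example, null_empty_coords < all.geojson | only_mapped > mapped.geojson, with both file names being placeholders) would give the "only entries with coordinates" version described in step 6.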

As usual the script is up on GitHub, where it’s annotated with more detail, and you can see the resultant map up there too (the area around Rome is shown in the image up above).

08 July 2016

Google Sheets to GeoJSON

Part of my summer is being spent working with a Drew University College of Liberal Arts student, Alexis Ruark, on growing a database of Roman temples. This is part of Drew's new Digital Humanities Institute, partly funded by the fine people at the Mellon Foundation (on which more soon, I hope).

There are of course plenty of data out there on ancient places, including Pleiades, the Digital Atlas of the Roman Empire, and Vici.org, but what we're trying to do is create a database with more detail than those more generic projects (and I don't mean "generic" in a bad way here). In turn we hope to be able to contribute back to them (especially as we've relied on some of their data to kickstart our own work).

Alexis is working in a Google spreadsheet for a number of reasons, including easy sharing between us and the advantages that spreadsheets offer in general (e.g., sorting rows and moving columns around). But it isn't so easy to share data in that format, and there is already an existing format for sharing geographical data, namely, GeoJSON, so I'd like to be able to convert from the sheet to that format. (I'm also thinking ahead a little bit to when the project grows up a little, and having the data in a different format will be more useful, if not necessary.)

First step, of course, was to do an internet search for converting Google sheets to JSON. Turns out the Google APIs themselves support conversion to one kind of JSON, so I figured this might be a nice little project for me to work on my coding skills while I learned more about JSON and the software that's out there already.

What I found

One page with some hints on converting Google sheets to JSON can be found here. In brief Google provides a feed of your spreadsheet in JSON format as long as you publish the spreadsheet to the web. Here's what that URI looks like:

https://spreadsheets.google.com/feeds/list/<sheet_ID>/1/public/values?alt=json

where the "<sheet_ID>" is that long code that shows up in the URI to your spreadsheet. One change that I had to make to the instructions on the site was to the part of the path that shows up right after that ID (a "1" here). It seems from the Google documentation to indicate the key of the sheet in your file that should be exported. (Happy to be corrected on that. See my comment on that article for some more links.)

The Process

Here's what I came up with:
  1. Get the JSON via the Google API and curl.
  2. That JSON isn't GeoJSON, so it needs to be processed. This was a chance for me to do some more work with the very powerful command-line app, jq, which I learned about from a great post by Matthew Lincoln on the Programming Historian. That took a few steps:
    1. Remove the odd prefixes Google sticks on the column headers: "gsx$". It's not strictly necessary, but it does make the JSON—and the rest of this script—a bit more readable. For this I just used sed 's/gsx\$//g'.
    2. Pull out just the JSON for the rows, leaving out the info about the spreadsheet that is prepended to it. Here's the first use of jq: jq -c '.feed.entry[]'.
    3. Create a proper GeoJSON file with those rows, using only the necessary data (just longitude, latitude, and name for now): jq -c '{type: "Feature", geometry: {type: "Point", coordinates: [(.longitude."$t"|tonumber), (.latitude."$t"|tonumber)]}, "properties": {name: .temple."$t"}}' | tr '\n' ',' | sed 's/,$//g'. There are a couple of things going on there:
      • First, the coordinates had to be interpreted as numbers, but the API quotes them as if they were text. jq's tonumber function takes care of that, used inside parentheses with | (a new one for me).
      • jq also spits out each row as a separate JSON object, but they need to form part of a bigger object. This requires commas between them in place of the new lines that jq leaves when it's doing compact output, indicated by the -c option. tr took care of that, and sed removed the comma that got inserted at the end of the file.
      • The rest just uses jq to take the appropriate fields from Google's JSON and puts them where GeoJSON requires.
  3. Finally, I fill a file with this data, flanked by some needed opening and closing code:
    • Prefix for the GeoJSON file: {\"type\": \"FeatureCollection\",\"features\": [
    • All that nice JSON from the previous step.
    • Closing brackets: ]}
    • Then, for esthetics and readability, I use jq to reformat the JSON: jq '.'
    That file gets saved to my local copy of the GitHub repository for this project, so that when it gets synced, the work is backed up with a version history, and we get the added bonus that GitHub shows GeoJSON files as maps by default.
I saved the whole thing as a bash script with a little more error checking than I discussed here. You can check it out on GitHub.
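The wrapping in step 3 can be sketched on its own, without touching Google or jq. This is only an illustration under some assumptions: the function name wrap_features is invented here, the input is expected to be one compact Feature object per line (what jq -c produces) with no blank lines, and the sample temple in the usage note is made up.

```shell
# Sketch only: wrap_features is a hypothetical name, not from the real script.
# Reads one compact GeoJSON Feature per line on stdin and prints a single
# FeatureCollection on stdout.
wrap_features() {
  # Opening of the collection.
  printf '{"type": "FeatureCollection","features": ['
  # Turn the newlines between features into commas, then drop the
  # trailing comma left after the last feature.
  tr '\n' ',' | sed 's/,$//'
  # Closing brackets.
  printf ']}\n'
}
```

Feeding it a line such as {"type":"Feature","geometry":{"type":"Point","coordinates":[12.48,41.89]},"properties":{"name":"Sample temple"}} yields a one-feature collection, which can then be piped through jq '.' for the final pretty-printing.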

Other methods

Turns out I should have searched for "google sheet to GeoJSON" instead of just "JSON" when I started this, as there are several existing ways to do this. My own offers some advantages for me (like saving to my GitHub repository), and I'm glad I took the time to work through the coding myself, but I'm looking more closely at these others to see if I can't use them or contribute to them to come up with a better solution.

One nice approach, called Geo, uses a script that you add to your spreadsheet. It will then let you export a GeoJSON file. Like my script (so far), it's limited to exporting just the geographical coordinates and an ID for the point. It will also look up addresses for you and fill in coordinates for them, which is not something that our project needs, but is very nice regardless.

A second method, csv2geojson, uses javascript to convert csv files to GeoJSON. In addition to making a collection of individual points, it can convert a list of points into another type of geographical entity, a line string.

A third looks very nice, but isn't working for me, gs2geojson. It adds a color option for the markers, which is appealing and suggests that it might not be too difficult to handle other columns as well. My javascript skills are poor, so I'm hoping it hasn't been abandoned...or maybe it's time to take on another student researcher who knows more than I do!

The last project I'll mention looks the most appealing to me right now: sheetsee.js, maintained by Jessica Lord, a software engineer at GitHub. It can read your sheet and grab all of the columns. The demo shows them being used in a pop-up upon hovering over the point. It also relies on tabletop, which is what actually reads the sheet and returns it as a simple array of JSON objects, so add that to the list.

The Future

Ultimately I may need to do some significant manipulation of some of the data in the sheet, so I think I'm going to talk to a few people who know more than I do about this to find out what they do, and I'll also delve a little more deeply into some of these other methods. At the very least, I'll learn more about what's out there and improve my coding skills.

29 June 2016

Exposure Fusion for better on-site photos

Introduction

1. Awkward group of students casting shade.
(Credit John Muccigrosso)
If you've spent a few seasons on an archaeological site, odds are you've seen something like the scene in the first photo here: a bunch of people standing awkwardly close together, possibly holding some largish item, and bending their limbs at odd angles. They're trying to provide some shade so that the (possibly ad hoc) dig photographer has a subject with roughly consistent lighting. There are more elegant ways of achieving the same thing, of course, as in photos 2 and 3, but the idea is the same.


 2. Large plastic sheet(?) used as shade. (Credit Esri)

3. Reflector used for casting shade. (Credit Kate Devlin)
Sometimes though it's just not possible to completely eliminate the problem. Maybe you're trying to take a photo of a really large area, or maybe you're the only one around. Fortunately with the advent of digital photography, there is another approach: exposure fusion.

The Problem

Let's look at what the actual problem is.

Detection

Any light receptor has a range of brightness that it can detect. Any values beyond a certain brightness register the same maximum value ("white") and any below a certain point register the same minimum ("black"). This is called the dynamic range of the detector, or, more typically in photography, its exposure range. In the photography world, this range is measured in stops, which work in a base-2 log system, so that for every one-stop change, the value either doubles or halves. If you're handy with a camera, you'll be used to this because exposure times, f-stops, and ISO are all traditionally adjusted in increments of one stop...or so: note how shutter speeds are (were?) given as fractions of a second: 1/125, 1/60 (not twice 1/125), 1/30, 1/15, 1/8 (not twice 1/15), 1/4, 1/2, 1, and so on. (Camera buffs should note that 1 EV is equal to one stop.)

What's the range of common detectors? Well, the human eye is pretty good at this. It can detect a total range of something like 46.5 stops, which equates to the brightest thing being one hundred trillion times (100,000,000,000,000:1) brighter than the darkest. That range goes from the brightest daylight all the way to dimly lit nightscapes, and the eye accomplishes this remarkable feat by constantly adjusting to changing light levels. In daylight the range might be more like ~20 stops (or 1,000,000:1), but some sources say it's more like 10-14 stops (1,000:1 - 16,000:1). In any case, it's pretty big, but remember that this large range can't all be successfully detected at the same time.

Camera print film (remember that?) has an exposure range of approximately 10-14 stops, which is comparable to the human eye. Normal DSLRs can do 7-10 stops (128:1 to 1000:1), which is less than the eye. Slide film is lower still at 6-7 stops, and digital video is even lower at ~5.5 stops (45:1).
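Since a stop is just a factor of 2, the stop counts and contrast ratios quoted above can be checked with a couple of one-liners. This is a small sketch using awk; the function names stops_to_ratio and ratio_to_stops are invented for the illustration.

```shell
# Sketch only: hypothetical helper names.

# Contrast ratio for a given number of stops: ratio = 2^stops.
stops_to_ratio() {
  awk -v s="$1" 'BEGIN { printf "%.0f\n", 2 ^ s }'
}

# Number of stops for a given contrast ratio: stops = log2(ratio).
ratio_to_stops() {
  awk -v r="$1" 'BEGIN { printf "%.1f\n", log(r) / log(2) }'
}
```

For example, stops_to_ratio 10 prints 1024 (the "1000:1" above, rounded), and ratio_to_stops 100 prints 6.6.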

Recreating

How about output devices? If your camera can record 10 stops, how many stops can whatever is displaying the photo show? Remarkably (to me, anyway), paper can only do about 6-7 stops, which is about 100:1. The computer screen you're likely reading this on does a lot better at something up to 10 stops (1000:1). High dynamic range (HDR) monitors do a lot better at ~14 stops (30,000:1).

The situation gets a little more complicated with digital files, because of computational techniques that are applied to the data. The detector in the camera is like a grid (I'm oversimplifying here), and each cell in the grid is a pixel, so the camera hardware has to measure the light hitting the pixel and write that number to a file. For simplicity, think about a gray-scale image: the camera writes the light intensity as a number from 0 to whatever it goes up to. One wrinkle is that some image-file formats can't handle the same dynamic range as the camera's detector. If you're old enough to remember working on monitors with limited output (like the original Macs that had a dynamic range of 2, so pixels were either black or white), you'll know what this does to images. Since image files are binary formats, they tend to work in powers of 2, like the stops, so it's easy to go from the number of bits used to record light intensity to the number of stops. An 8-bit image has 256 levels of brightness, like 8 stops. The common format of JPEG is an 8-bit format (and "lossy" due to compression of the image), but it can achieve dynamic ranges up to about 11 stops due to non-linear transformation of the data coming off the camera detector (another wrinkle). If your camera can save raw files, those will typically have the maximum dynamic range that your camera can measure.

(I'm ignoring a lot of nuance to everything I just wrote about, but it's easy to immerse yourself in the details of this stuff, if you like.)

What's the problem again?

Back to the problem...
Like your eye, almost all cameras will make adjustments to maximize picture quality. Cheap ones do it automatically and don't let you make any changes, and more expensive ones will let you control just about everything. Insofar as they adjust then, cameras are like your eye. The problem is, they lack your brain which does a lot of heavy lifting in the background as your eye moves around the scene and adjusts, and an image taken by a camera has to be taken at only one setting. If the dynamic range of the image exceeds that of the camera, then part of the image will either be too dark or too light, and thus all the various shades of "too dark" will be black and the "too light" all white, even though the scene might have looked fine to your eye.

This happens a lot in indoor photos, where there's a window in the background (like this quick shot of my office I just took). The bookshelves look nice, but large portions of the window are over-exposed and therefore white, even though when I look out the window I can see the details of the tree out there.

My office with an over-exposed window.
The problem then is how to capture the entire dynamic range of a scene when it exceeds the camera's dynamic range. One way to do this is to take advantage of the camera's ability to capture the full range with a change of settings, that is, in multiple images.

Exposure fusion

The basic idea then is to combine multiple images so that the resulting image has the best exposed parts from each. Fortunately the software is out there to do the combining (or "fusing") automatically, so all you need is a set of appropriate pictures. In this case "appropriate" means images that range from very bright—where the darkest parts of the image don't register as completely black and have some detail in them—all the way to very dark—where the lightest parts of the image don't register as completely white and have some detail in them. You'll want at least one photo in the middle too. For example, the photo of my office is pretty good on the too-bright side. There's not a lot of contrast in the black of my chair and the books, but I don't think they're registering completely black either. I can easily check that with the histogram function in my image-editing software (Graphic Converter) which provides a plot of how many pixels in the image are at what level of brightness. (Histograms are a pretty standard feature of most image editors, including Mac OS X's standard Preview.)
Histogram of the over-exposed office photo.

The horizontal axis represents brightness level and goes from black on the left to white on the right. As you can see, this image has no pixels in the darker region on the left, and a lot in the very brightest on the right, confirming my guess that this would be a good "too bright" image.

So outside we go to take some pictures. Here are a few tips on doing that well:
  1. Use a tripod. You want to minimize the difference between the views in each image, so the software can more successfully combine them. You could try standing really still, and you might get decent results that way, but a tripod will keep the camera steady for sure.
  2. Take at least three photos. In most cases three will be enough (and you can get back to work on the site!). Your camera's histogram can help determine whether the photos are going to work well for this.
  3. Use exposure time to vary the amount of light, instead of varying the aperture size, which can affect the depth of field and so make the photos less similar.
  4. Take photos at least 2 stops to either side of the camera's automatic settings. This will be enough for most scenes, but you may find (as we will below) that scenes with a larger range of brightness may need a bigger exposure range on the camera. (Here's where the histogram can show whether the photos have a wide enough range.) Many more expensive cameras as well as some cheaper ones can take a series of photos with the desired change in exposure automatically via their "exposure bracketing" function. Turn this on and when you press the button the camera will take one photo at its automatic setting, one a fixed number of stops darker than that, and finally a third that same number of stops brighter. If you set the bracketing to go 2 stops in each direction, you'll get the desired range of 4. (The following photos were taken with a cheap Canon camera, which doesn't do exposure bracketing out of the box, but can with the free "Canon hack development kit" (CHDK) which adds scripting and a whole bunch of functionality to Canons.)
  5. If you're using an automatic process on the camera, set a short timer (2 seconds is a common one), so that any residual motion from your pushing the button will have stopped. Alternatively, if you have a remote control, use that instead.
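To make tips 3 and 4 concrete, here is a rough sketch of the arithmetic behind bracketing by exposure time: each stop darker halves the time and each stop brighter doubles it. The function name bracket_times is made up, it takes the denominator of the camera's chosen shutter speed (200 for 1/200 s) and the number of stops to go in each direction, and it ignores the rounding to nominal shutter speeds that real cameras do.

```shell
# Sketch only: bracket_times is a hypothetical helper.
# Usage: bracket_times DENOMINATOR STOPS
# Prints the darker, automatic and brighter exposure times as fractions
# of a second.
bracket_times() {
  awk -v d="$1" -v n="$2" 'BEGIN {
    f = 2 ^ n                      # n stops is a factor of 2^n
    printf "1/%g 1/%g 1/%g\n", d * f, d, d / f
  }'
}
```

With an automatic exposure of 1/200 s, bracket_times 200 2 prints 1/800 1/200 1/50, a total range of 4 stops.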

An Example

My backyard provided the scene for this test case. It has an unusually wide dynamic range with a white garage in full sun and some heavily shaded grass.

Getting the images

Here are the three photos I started with, along with their histograms. They have a range of 4 stops:
Three photos with a range of 4 stops.
The camera's default (exposure time of 1/200s) is in the middle, and has some good detail in the shadows, but the sunlit part of the wall is completely washed out. You can see that in the histogram too: the darkest parts are close to black (on the left of the histogram), but a decent fraction of the brightest portions are at the maximum value. The darker image on the left has removed almost all the maxed out bright parts, and the darkest parts are still not at 0. The brighter image at right has a larger portion of the brighter sections at the max. Given that the darker image still hasn't removed the completely washed out portions, this turns out to be a scene that needs more than a 2-stop range. Fortunately I had the camera programmed to take a series of photos over an 8-stop range, so I had the other images already. Here's the new set, using 3 stops to either side of the default:
Three photos with a range of 6 stops.

You can see that the darkest image now has no pixels anywhere near the maximum brightness (and still very few at the maximum darkness). The bright image of course has even more washed-out areas, but it also has a lot more detail in the darkest parts too.

Fusing the images

Now that I have the images, they need to be combined. There are some commercial and free software applications out there that can do this for you. I use a free one, Hugin. (I've written about using Hugin before.) Hugin is designed to be a panorama maker, but making those requires a number of steps, including ones that do what we want here. While you can actually load up your photos into Hugin and use its graphical interface to do this (here's a tutorial), I'm going to break out the command line and directly access the underlying software goodness.

There are two applications to use from the command line. The first aligns the images and the second does the exposure fusion. Since I used a tripod, that first app really shouldn't have much to do, but I want to make sure that there are no problems with alignment, so I'm going to run it anyway. The app is rather straightforwardly called align_image_stack, and I use it with the -a option that tells it to align the images and save them to a series of similarly named tiff files. Once that's done, I use enfuse to do the fusing. Normally both processes take only a few minutes on my aging laptop. (As usual both applications can accept other parameters that modify how they work; I'm just keeping it simple here. RTFM for those details: enfuse and align_image_stack.) The exact commands (on my Mac) are (and ignore the leading numbers):

1> cd <directory of image files>

2> /Applications/Hugin/Hugin.app/Contents/MacOS/align_image_stack -a <newFileName>  <list of existing image files>

3> /Applications/Hugin/Hugin.app/Contents/MacOS/enfuse  -o <final filename> <list of files from previous step>

4> rm <intermediate files>

Here's what each command does:
  1. Change the working directory to the one that holds the images I want to fuse. Normally I keep each set of images in its own folder. That's not strictly necessary, but it makes some things easier and it's certainly more organized.
  2. Run align_image_stack on the images I want to work with. Since I'm in a directory without a lot of other stuff, I can often use a wildcard here: align_image_stack -a newFileName image*.tif. The newFileName is what the output files will be called. align_image_stack will automatically give them an extension of "tif", so you don't need to include one, and it will also suffix a number to this name, yielding, for example, "newFileName_0000.tif" as the first file. This process runs fairly quickly since the images were really already aligned, thanks to the tripod. When it's done, you'll see the new files in the directory.
  3. Run enfuse on the intermediate files. Again wildcards come in handy, so I typically give the intermediate files from step 2 a name different from the originals, like "aligned_image".
  4. Finally delete the intermediate aligned files from step 2.
This is pretty boring stuff, so I've created an AppleScript droplet that does it all for me. It's on GitHub.
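For those not on a Mac or not keen on AppleScript, the same steps could be wrapped in a small shell function along these lines. This is a sketch, not the droplet from GitHub: the names fuse_stack, run, HUGIN_BIN and DRY_RUN are invented here, the default path is simply the one from the commands above, and the "aligned_" prefix is an arbitrary choice for the intermediate files. Note that the last step deletes every aligned_*.tif in the current directory.

```shell
# Sketch only: hypothetical wrapper around the steps listed above.
# HUGIN_BIN is an assumption; point it at wherever your Hugin tools live.
HUGIN_BIN="${HUGIN_BIN:-/Applications/Hugin/Hugin.app/Contents/MacOS}"

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: fuse_stack OUTPUT_FILE IMAGE...   (run from the images' directory)
fuse_stack() {
  local out="$1"
  shift
  local prefix="aligned_"
  # Align the originals; the outputs are named from the prefix.
  run "$HUGIN_BIN/align_image_stack" -a "$prefix" "$@" || return 1
  # Fuse the aligned intermediates into the final image.
  run "$HUGIN_BIN/enfuse" -o "$out" "$prefix"*.tif || return 1
  # Remove the intermediate aligned files.
  run rm -f "$prefix"*.tif
}
```

Running DRY_RUN=1 fuse_stack fused.tif image*.tif first shows the commands without executing them, which is a cheap way to check the file list before committing a few minutes of laptop time.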

The Result

Here's what the final image looks like (I turned the output tif into a jpeg to keep the download size reasonable):
Final enfused image (as a jpeg)


And here's the histogram for that:
Histogram of final output image

The histogram shows that both brightness extremes have been eliminated in the final image, and in the image itself both the bright sections (like the garage wall) and the dark ones (the shadows on the left) show a lot of detail. It's still possible to see where the shadow fell, but its intensity has been severely reduced.

Isn't this just "HDR"?

You'd think that, strictly speaking, high-dynamic-range images (HDR) would be images that have a greater than usual dynamic range, just like HDR monitors can produce a greater dynamic range than "normal" monitors. That might mean an image with more than the typical 10 stops your camera can capture. HDR can mean that, but what's usually meant is the result of taking such an image (often generated using techniques similar to exposure fusing) and manipulating it so that it can be viewed in a non-HDR medium (like a piece of photographic paper or your normal computer screen). This process is called tone mapping, and involves, as the Wikipedia page on HDR says, "reduc[ing] the dynamic range, or contrast ratio, of an entire image while retaining localized contrast." Depending on the algorithms used, this process can create vivid alterations in color and saturation, and give the image an artificial look. Such extreme applications are fairly popular these days, and probably what most people think of when you say "HDR".

But exposure fusion doesn't involve the creation of an HDR image with the subsequent tone-mapping, so it avoids the sometimes interesting effects you can find in some HDR photos. In fact the original paper on exposure fusion was called "Exposure Fusion: A Simple and Practical Alternative to High Dynamic Range Photography" [@Mertens2009]. That said, it is possible to use tone-mapping techniques that create images that look very similar to ones obtained from exposure fusion. The popular HDR mode on the iPhone seems to do this. (I haven't found a definitive statement on exactly what the iPhone does, but it's certainly taking multiple photos and combining them.) Here's a site comparing the two techniques. When you look at the histogram for our final image above, you can see that this is really a low-dynamic-range process.

Conclusion

So there you go. Next time you've got a scene with a very high dynamic range (which usually means some shadow on a sunny day) that you can't fix by other means, break out the tripod and your camera's bracketing mode and do some exposure fusion.

There's also a lot of information out there about this and other post-processing techniques (like HDR) that can improve your photos. You'll find that similar fusion techniques can be used for combining series of images with other variants, like creating large depths of field.

03 October 2015

A Practical application of markdown & pandoc

As I recounted in another post to this blog, I started using the text-only markdown format for my writing a few months ago. In part I did this to avoid future problems with file formats, and in part to be able to have one central file that I could convert to various formats as needed, ideally by using pandoc.
Most people probably will understand the problem with obsolete file formats. I’ve got some not so old WordPerfect files on my computer right now, for example (and you can find a bunch of them on my employer’s servers as well), but no WordPerfect app (or “program” as we used to call it). I can still read the ones I’ve tried to open, but I need to be a bit clever about it and I doubt many of my colleagues would succeed on their own. I also had a heck of a time converting my 1998 dissertation to a newer version of Word, losing a little formatting along the way.
But the major impetus for the effort in recent months has been the second goal of having one master file in markdown (mentioned very briefly in that other post). In particular I was thinking about streamlining the workflow for updating my CV, which I do as often as every few months. I keep a version of the CV on line as a web page, and another version in PDF for me to share more formally and for the curious to download from the web page. I wasn’t too thrilled with how the html printed, so I didn’t spend much time cleaning it up for that purpose and instead generated the PDF from an OpenDoc file which is easy to format (and which I edit with LibreOffice, BTW). My workflow then was something like:
  1. Edit the html with BBEdit.
  2. Edit the odt file with exactly the same content changes.
  3. Print the odt file to PDF.
  4. Post the html on line.
  5. Post the PDF on line.
  6. Archive the successive .odt and PDF versions by saving the new files with an appropriate name (I stuck the date in it).
In short, not a super long workflow, but still plenty of opportunity for those two versions to go out of sync (which they did). More annoyingly though, it was just too many steps.
When I learned about pandoc, I figured this was going to do it for me. Instead of worrying about odt and html, I’d just have one markdown file and spin that off into html and a PDF, avoiding odt altogether. That turned out to be harder than I’d expected, mainly because I knew virtually nothing about TeX/LaTeX, the tool pandoc uses to generate PDF, and the default format didn’t look much like what I was used to generating. (My current default document format is based on the ideas in Matthew Butterick’s Practical Typography, which I highly recommend.) I also had the problem that the html wasn’t really printable in the format I was using: the links were underlined so visitors would know where to click, and the page contained some info at the bottom, including links that led to my department’s website and a few other places, none of which needed to be included in the PDF.
At the heart of the problem is that markdown is a nice little tool, but it isn’t designed for formatting the text. That’s a pretty standard text-markup thing: put the document’s logical structure in it, but do the formatting elsewhere. Content not layout. So while you can say “this bit of text is a heading”, you can’t say what a heading should look like from within markdown. For that you need to use whatever formatting system your destination file requires (css for html, stylesheets for odt, and so on) and that’s where my ignorance of LaTeX came into play. After a little bit of trying to get up to speed on LaTeX, I decided to drop the idea of going directly to PDF and continue as I had been: print a file to PDF after it had been generated by pandoc. This was still an improvement over what I had been doing…as long as I could get the format right so I could use the html file for printing.
I put the project aside for a bit, but when I had to update the CV a few weeks ago, a few things occurred to me. First, I wasn’t using css to full advantage. I’d forgotten that it allows you to change styles depending on how the page is viewed. For example, the css can indicate one font for a big computer monitor and another for a smart phone. So I used that feature (specifically @media print coupled with the css pseudo-class last-of-type) to make that pesky final material disappear upon printing. I also made the links look like the rest of the text when printed. The other thing I realized was that my two-column format could easily be handled with standard html tables which I had abandoned years ago in an earlier version of the document and was now fudging in the markdown by using standard html…ugly. The last thing to do was use another css pseudo-class (first-of-type) to get my leading table to format differently from the other tables throughout the document. Here's what that last looks like. The contact details at the start are one table (the first in the document) and the educational institutions with dates are a second:

The last table is similar and disappears upon printing:
Where am I now with that workflow? Here’s the new one:
  1. Edit the markdown.
  2. Convert the markdown to html.
  3. Print the html file to PDF.
  4. Post the html on line.
  5. Post the PDF on line.
  6. Archive the markdown and html on GitHub.
Strictly speaking the PDF in step 3 is only for me now, since viewers of the web page version could just print it themselves and get the right formatting. Nevertheless I post the PDF for downloading to make it easy. (Now that I think of it, I should probably keep track of downloads to see if that’s worthwhile.) Step 6 was my solution to all the archiving work I was doing. It still requires some action on my part, but with the OS X GitHub Desktop app it’s much easier and quicker than the old method. In fact, although the overall number of steps is still the same, they entail a lot less work. Importantly, the repeated and error-prone editing step is gone.
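
For the curious, step 2 of the new workflow can be sketched in a few lines of Python. This is only an illustration, not my actual setup: the file names cv.md and cv.css and the helper pandoc_html_command are made up for the example, and the sketch assumes pandoc is installed and that the print rules live in a separate stylesheet. The options themselves (--from, --to, --standalone, --css, --output) are ordinary pandoc options.

```python
from __future__ import annotations

from pathlib import Path


def pandoc_html_command(source: str, stylesheet: str = "cv.css") -> list[str]:
    """Build the pandoc call that turns a markdown file into a standalone html page.

    The file names are hypothetical; only the pandoc options are real.
    """
    output = str(Path(source).with_suffix(".html"))
    return [
        "pandoc", source,
        "--from", "markdown",
        "--to", "html",
        "--standalone",       # emit a complete page rather than a fragment
        "--css", stylesheet,  # link the stylesheet that holds the screen and print rules
        "--output", output,
    ]


if __name__ == "__main__":
    # Only print the command here; hand the list to subprocess.run(..., check=True)
    # to actually run pandoc.
    print(" ".join(pandoc_html_command("cv.md")))
```

Building the command as a list rather than a single string keeps file names with spaces intact if the list is later passed to subprocess.run.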

Lessons learned?

  • Pandoc is a great tool, but getting it to generate (nearly) identical-looking documents in different formats can be fairly hard. To be fair, it’s not really for that anyway.
  • If you don’t want to settle for default styles, you’re going to have to do a little work in your output format to get your document to look like you want. You may also have to compromise to get it to work in markdown.
  • It’s not always easy to step back and look at a problem with fresh eyes. I was so intent on rendering the original html in markdown that I didn’t realize that an easy(-ish) css solution was staring me in the face in the form of new html. Setting the project down for a while helped.
A final note: I continue to work (slowly) on my LaTeX knowledge, and now have a generic document style going that looks fairly similar to my odt and html.
PS I don’t know why I’m capitalizing PDF, but not the other extensions.

19 August 2015

Archiving BMCR

Just a few days after my last post on archiving to fight link rot, Ryan Baumann (@ryanfb) wrote up his impressive efforts to make sure that all the links in the recently announced AWOL Index were archived. Since I was thinking about this sort of thing for the open-access Bryn Mawr Classical Review for which I'm on the editorial board, I figured I'd just use his scripts to make sure all the BMCR reviews were on the Internet Archive. (Thanks, Ryan!)

Getting all the URLs was fairly simple, though there was a little bit of brute-force work involved for the earlier years, before BMCR settled on a standard URL format. Actually there are still a few PDFs from scans of the old print versions which I completely missed on the first pass, but once I found out they were out there, it was easy enough to go get them. (I was looking for "html" as a way of pulling all the reviews, so the ".pdf" files got skipped.)
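
The slip with the PDFs is easy to reproduce and easy to avoid. The sketch below is not the script I used (those were Ryan's), and the URLs are invented placeholders on example.org. It just shows the difference between looking for "html" only and accepting both extensions when picking review links out of a list.

```python
from urllib.parse import urlparse

# Extensions that count as a review; ".pdf" covers the scans of the old print issues.
REVIEW_EXTENSIONS = (".html", ".pdf")


def is_review_url(url: str) -> bool:
    """Return True if the URL path ends in one of the review extensions."""
    path = urlparse(url).path.lower()
    return path.endswith(REVIEW_EXTENSIONS)


def select_reviews(urls):
    """Keep only the URLs that look like reviews, preserving their order."""
    return [url for url in urls if is_review_url(url)]


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    candidates = [
        "http://example.org/reviews/2015/2015-08-01.html",
        "http://example.org/reviews/1990/scan-01.pdf",
        "http://example.org/about/",
    ]
    for url in select_reviews(candidates):
        print(url)
```

Checking the parsed path rather than the whole string means a query string or fragment after the file name doesn't hide the extension.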

In the end fewer than 10% of the 10,000+ reviews weren't already on the Archive, and those are there now, assuming I got them all up there. Let me know if you find one I missed.

I'm still looking at WebCite too.

14 August 2015

Fight link rot!

Argh, it just happened to me again. I clicked on a link on a webpage only to find that the page on the other end of the link was gone. 404. A victim of link rot.

This kind of thing is more than a hassle. It's a threat to the way we work as scholars, where citing one's sources and evidence is at the heart of what we do. Ideally links would never go away, but the reality is that they do. How often? A study cited in a 2013 NYTimes article found that 49% of links in Supreme Court decisions were gone. The problem has gotten big enough that The New Yorker had a piece on it back in January.

What I wanted to do here is point out that there are ways for us scholars to fight link rot, mostly thanks to the good work of others (and isn't that the whole point of the Internet?). Back in that ideal world, publishers would take care that their links never died, even if they went out of business, but we users can help them out by making sure that their work gets archived when we use it. Instead of simply linking to a page, either link to an archived version of it or archive it when you cite it, so that others can go find it later.

I've used two archiving services, Archive.org and WebCite. Both services respect the policies of sites with regard to saving copies (i.e., via their robots.txt files), but Archive.org will actually keep checking policies, so it's possible that a page you archived will later disappear. That won't happen on WebCite. WebCite will also archive down a few links deep on the page you ask it to archive, while Archive.org just does that one page.

WebCite is certainly more targeted to the scholarly community, and their links are designed to be used in place of the originals in your work. But both of them are way better than nothing, and you'll find lots of sites using them. For convenience there are bookmarklets for each that you can put in your browser bar for quick archiving (WebCite, Archive.org).
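
If you'd rather script this than click a bookmarklet, here is a minimal sketch for the Archive.org side only. It assumes the Wayback Machine's public availability endpoint (archive.org/wayback/available) and its save address (web.archive.org/save/) behave as documented. The function names and the sample response are mine, made up for illustration, and the sketch only builds the addresses and reads a response, leaving the actual network request to you.

```python
from urllib.parse import urlencode


def availability_url(page_url: str) -> str:
    """Address that asks the Wayback Machine whether it holds a copy of page_url."""
    return "https://archive.org/wayback/available?" + urlencode({"url": page_url})


def save_url(page_url: str) -> str:
    """Address that asks the Wayback Machine to capture page_url now."""
    return "https://web.archive.org/save/" + page_url


def closest_snapshot(response: dict):
    """Pull the archived copy's address out of a decoded availability response.

    Returns None when the response reports no available snapshot.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


if __name__ == "__main__":
    page = "http://example.org/page"
    print(availability_url(page))
    print(save_url(page))
    # An empty "archived_snapshots" object means nothing is archived yet.
    print(closest_snapshot({"archived_snapshots": {}}))
```

A page with no snapshot is the one worth sending to the save address before you cite it.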

So next time you cite a page, make sure you archive it. Maybe even use WebCite links in your stuff (like I did in this post on the non-Wikipedia links).

(FYI, another service is provided by Perma.cc, which is designed for the law, covered in this NPR story.)

Added 18 August 2015: Tom Elliott (@paregorios) notes this article on using WebCite in the field of public health (which I link to via its DOI).