08 July 2016

Google Sheets to GeoJSON

Part of my summer is being spent working with a Drew University College of Liberal Arts student, Alexis Ruark, on growing a database of Roman temples. This is part of Drew's new Digital Humanities Institute, partly funded by the fine people at the Mellon Foundation (on which more soon, I hope).

There are of course plenty of data out there on ancient places, including Pleiades, the Digital Atlas of the Roman Empire, and Vici.org, but what we're trying to do is create a database with more detail than those more generic projects (and I don't mean "generic" in a bad way here). In turn we hope to be able to contribute back to them (especially as we've relied on some of their data to kickstart our own work).

Alexis is working in a Google spreadsheet for a number of reasons, including easy sharing between us and the advantages that spreadsheets offer in general (e.g., sorting rows and moving columns around). But it isn't so easy to share data in that format, and there is already an existing format for sharing geographical data, namely, GeoJSON, so I'd like to be able to convert from the sheet to that format. (I'm also thinking ahead a little bit to when the projects grows up a little, and having the data in a different format will be more useful, if not necessary.)

First step, of course, was to do an internet search for converting Google sheets to JSON. Turns out the Google APIs themselves support conversion to one kind of JSON, so I figured this might be a nice little project for me to work on my coding skills while I learned more about JSON and the software that's out there already.

What I found

One page with some hints on converting Google sheets to JSON can be found here. In brief Google provides a feed of your spreadsheet in JSON format as long as you publish the spreadsheet to the web. Here's what that URI looks like:

https://spreadsheets.google.com/feeds/list/<sheet_ID>/1/public/values?alt=json

where the "<sheet_ID>" is that long code that shows up in the URI to your spreadsheet. One change that I had to make to the instructions on the site was to the part of the path that shows up right after that ID (a "1" here). It seems from the Google documentation to indicate the key of the sheet in your file that should be exported. (Happy to be corrected on that. See my comment on that article for some more links.)

The Process

Here's what I came up with:
  1. Get the JSON via the Google API and curl.
  2. That JSON isn't GeoJSON, so it needs to be processed. This was a chance for me to do some more work with the very powerful command-line app, jq, which I learned about from a great post by Matthew Lincoln on the Programming Historian. That took a few step:
    1. Remove the odd prefixes Google sticks on the column headers: "gsx$". It's not strictly necessary, but it does make the JSON—and the rest of this script—a bit more readable. For this I just used sed 's/gsx\$//g'.
    2. Pull out just the JSON for the rows, leaving out the info about the spreadsheet that is prepended to it. Here's the first use of jq: jq -c '.feed.entry[]'.
    3. Create a proper GeoJSON file with those rows, using only the necessary data (just longitude, latitude, and name for now): jq -c '{type: "Feature", geometry: {type: "Point", coordinates: [(.longitude."$t"|tonumber), (.latitude."$t"|tonumber)]}, "properties": {name: .temple."$t"}}' | tr '\n' ',' | sed 's/,$//g'. There are a couple of things going on there:
      • First, the coordinates had to be interpreted as numbers, but the API quotes them as if they were text. jq's tonumber function takes care of that, used inside parentheses with | (a new one for me).
      • jq also spits out each row as a separate JSON object, but they need to form part of a bigger object. This requires commas between them in place of the new lines that jq leaves when it's doing compact output, indicated by the -c option. tr took care of that, and sed removed the comma that got inserted at the end of the file.
      • The rest just uses jq to take the appropriate fields from Google's JSON and puts them where GeoJSON requires.
  3. Finally, I fill a file with this data, flanked by some needed opening and closing code:
    • Prefix for the GeoJSON file: {\"type\": \"FeatureCollection\",\"features\": [
    • All that nice JSON from the previous step.
    • Closing brackets: ]}
    • Then, for esthetics and readability, I use jq to reformat the JSON: jq '.'
    That file gets saved to my local copy of the GitHub repository for this project, so that when it gets synced, the work is backed up with a version history, and we get the added bonus that GitHub shows GeoJSON files as maps by default.
I saved the whole thing as a bash script with a little more error checking than I discussed here. You can check it out on GitHub.

Other methods

Turns out I should have searched for "google sheet to GeoJSON" instead of just "JSON" when I started this, as there are several existing ways to do this. My own offers some advantages for me (like saving to my GitHub repository), and I'm glad I took the time to work through the coding myself, but I'm looking more closely at these others to see if I can't use them or contribute to them to come up with a better solution.

One nice approach, called Geo, uses a script that you add to your spreadsheet. It will then let you export a GeoJSON file. Like my script (so far), it's limited to exporting just the geographical coordinates and an ID for the point. It will also look up addresses for you and fill in coordinates for them, which is not something that our project needs, but is very nice regardless.

A second method, csv2geojson, uses javascript to convert csv files to GeoJSON. In addition to making a collection of individual points, it can convert a list of points into another type of geographical entity, a line string.

A third looks very nice, but isn't working for me, gs2geojson. It adds a color option for the markers, which is appealing and suggests that it might not be too difficult to handle other columns as well. My javascript skills are poor, so I'm hoping it hasn't been abandoned...or maybe it's time to take on another student researcher who knows more than I do!

The last project I'll mention looks the most appealing to me right now: sheetsee.js, maintained by Jessica Lord, a software engineer at GitHub. It can read your sheet and grab all of the columns. The demo shows them being used in a pop-up upon hovering over the point. It also relies on tabletop, which is what actually reads the sheet and returns it as a simple array of JSON objects, so add that to the list.

The Future

Ultimately I may need to do some significant manipulation of some of the data in the sheet, so I think I'm going to talk to a few people who know more than I do about this to find out what they do, and I'll also delve a little more deeply into some of these other methods. At the very least, I'll learn more about what's out there and improve my coding skills.