Using BigQuery Public Datasets for Small Projects

Where my students are (or said they're) from this week

I wrote a small app to demonstrate integration between Google Sheets and AppEngine, essentially using a sheet as a database for the app. It isn't the fastest or best approach, but it demonstrates the split between user and server space quite nicely. I recently decided to make it a little more interactive, so I attached a form that lets people tell me where they're from. The problem is that I had assumed people would enter a country when asked for their country. They didn't. Many people are very proud of their cities, so I needed to add 'New York' to the list. A simple lookup should solve that. Also, America, US and USA are all equivalent to 'United States of America'. Another entry. Job done. I'll just keep adding the cities that don't match as people put them in, right?

I decided to poke around in the BigQuery public datasets for all the cities in the world. Cities have a well-understood definition and need to be above a certain size.

The first thing I needed was to get all the cities’ point information. So I went and found the public OpenStreetMap dataset.

Great, I'm fairly sure the data is sitting in `planet_features`. The data is tagged and uses osm_id as the unique identifier. What I wanted was the city point (latitude and longitude) and the name, in English if possible. The tags are repeated key/value pairs, which contain the name, the name in English (name:en), the population and various other metrics. The city geometry can be a point or a polygon depending on the size; as the data presentation doesn't need to be too accurate, I decided to just use the centroid. My query for the city data then became:
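Something along these lines, assuming the public `bigquery-public-data.geo_openstreetmap.planet_features` table with its `osm_id`, `all_tags` (repeated key/value structs) and `geometry` columns, and filtering on the OSM `place=city` tag:

```sql
-- Sketch: pull every feature tagged place=city, with all its tags
-- and geometry. Table and column names assume the standard
-- geo_openstreetmap public dataset schema.
SELECT
  osm_id,
  all_tags,
  geometry
FROM
  `bigquery-public-data.geo_openstreetmap.planet_features`
WHERE
  EXISTS (
    SELECT 1 FROM UNNEST(all_tags) AS tag
    WHERE tag.key = 'place' AND tag.value = 'city'
  )
```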

This gave me every city, with all the tags and the geometry. Next I needed to filter the data so I could use name:en if available, or name if not. I also wanted to pivot the data a bit (no hardcore code here, I'm afraid, since I knew exactly which columns I wanted): I grouped the data by osm_id, took `ANY_VALUE` for the geometry, and took the non-null values for name and name:en.
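That pivot step might look something like this (again assuming the `all_tags` key/value schema; the `MAX(IF(...))` pattern is one common way to pivot repeated tags into columns):

```sql
-- Sketch: one row per city, with name and name:en pivoted into
-- columns. ANY_VALUE picks the single geometry per osm_id; MAX over
-- IF(...) keeps the non-null tag value for each wanted key.
SELECT
  osm_id,
  ANY_VALUE(geometry) AS geometry,
  MAX(IF(tag.key = 'name', tag.value, NULL)) AS name,
  MAX(IF(tag.key = 'name:en', tag.value, NULL)) AS name_en
FROM
  `bigquery-public-data.geo_openstreetmap.planet_features`,
  UNNEST(all_tags) AS tag
WHERE
  EXISTS (
    SELECT 1 FROM UNNEST(all_tags) AS t
    WHERE t.key = 'place' AND t.value = 'city'
  )
GROUP BY osm_id
```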

I also wanted to convert the geometry into latitude and longitude floating-point values that Google Sheets could use for the lookup, and to use name_en if it existed or fall back to name. I added a small piece that gives me a Google Maps link so I could check that the data was being materialised correctly before I imported it into Sheets. That gave me the full query:
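Putting it all together, a sketch of the full query (column aliases and the Maps URL format are my choices, not necessarily the originals):

```sql
-- Sketch: centroid per city, English name with fallback, lat/lng as
-- floats for Sheets, plus a Google Maps link for spot-checking.
WITH cities AS (
  SELECT
    osm_id,
    ST_CENTROID(ANY_VALUE(geometry)) AS point,
    MAX(IF(tag.key = 'name', tag.value, NULL)) AS name,
    MAX(IF(tag.key = 'name:en', tag.value, NULL)) AS name_en
  FROM
    `bigquery-public-data.geo_openstreetmap.planet_features`,
    UNNEST(all_tags) AS tag
  WHERE
    EXISTS (
      SELECT 1 FROM UNNEST(all_tags) AS t
      WHERE t.key = 'place' AND t.value = 'city'
    )
  GROUP BY osm_id
)
SELECT
  osm_id,
  COALESCE(name_en, name) AS city,   -- English name if present, else local name
  ST_Y(point) AS latitude,
  ST_X(point) AS longitude,
  CONCAT('https://www.google.com/maps/search/?api=1&query=',
         CAST(ST_Y(point) AS STRING), ',',
         CAST(ST_X(point) AS STRING)) AS maps_link
FROM cities
```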

The query costs around $1 and returns just over 18,100 cities that I could then use in my visualisation. A quick sanity check on coverage using BigQuery GeoViz shows pretty good coverage. I decided against applying a population limit, as the city definition should already give me a good lower bound on the data.

City Coverage

Now I could carry on with the D3 visualisation of the Sheets data in AppEngine, because I had the data to support it.

Senior Google Cloud Trainer