Using BigQuery Public Datasets for Small Projects

Paul Leroy
4 min read · Aug 18, 2020
Where my students are (or said) they’re from this week

I have written a small app to demonstrate integration between Google Sheets and App Engine, essentially using a sheet as a database for the app. It is not the fastest or best approach, but it demonstrates the split between user and server space quite nicely. Recently I decided to make it a little more interactive: I attached a form and allowed people to tell me where they're from. The problem is that I had assumed people would enter a country when asked for their country. They didn't. Many people are very proud of their cities, so I needed to add 'New York' to the list. A simple lookup should solve that. Also, 'America', 'US' and 'USA' are all equivalent to 'United States of America'. Another entry. Job done. I'll just add the cities that don't match as people put them in, right?
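That lookup logic amounts to a small alias table. A minimal sketch in Python (the function name and the alias entries here are illustrative, not the app's actual code):

```python
# Illustrative alias lookup: map free-form answers to a canonical
# country name, with a city-to-country table for entries like 'New York'.
COUNTRY_ALIASES = {
    "america": "United States of America",
    "us": "United States of America",
    "usa": "United States of America",
}

CITY_TO_COUNTRY = {
    "new york": "United States of America",
}

def canonical_country(answer):
    """Return the canonical country for a form answer, or None."""
    key = answer.strip().lower()
    return COUNTRY_ALIASES.get(key) or CITY_TO_COUNTRY.get(key)
```

The trouble, of course, is that tables like these only grow as fast as you can add entries by hand, which is what sent me looking for a proper list of cities.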

I decided to poke around in the BigQuery public datasets for all the cities in the world. Cities have a well-understood definition and need to be above a certain size.

The first thing I needed was to get all the cities’ point information. So I went and found the public OpenStreetMap dataset.

$ bq --project_id=bigquery-public-data ls geo_openstreetmap
       tableId        Type    Labels   Time Partitioning     Clustered Fields
 ------------------- ------- -------- ------------------- ----------------------
  history_changesets   TABLE
  history_nodes        TABLE
  history_relations    TABLE
  history_ways         TABLE
  planet_changesets    TABLE
  planet_features      TABLE
  planet_layers        TABLE                                layer_code, geometry
  planet_nodes         TABLE
  planet_relations     TABLE
  planet_ways          TABLE

Great. I was fairly sure the data was sitting in `planet_features`. The data is tagged, and uses osm_id as the unique identifier. What I wanted was each city's point (latitude and longitude) and its name, in English if possible. The tags are repeated key/value pairs, which contain the name, the name in English (name:en), the population and various other attributes. The city geometry can be a point or a polygon, depending on the size. As the data presentation doesn't need to be too accurate, I decided to just use the centroid. My query for the city data then became:

with prepared_data as (
  SELECT
    osm_id,
    all_tags,
    ST_CENTROID(geometry) AS geom
  FROM
    `bigquery-public-data.geo_openstreetmap.planet_features`
  WHERE
    EXISTS (
      SELECT
        key,
        value
      FROM
        UNNEST(all_tags)
      WHERE
        key = 'place'
        AND value = 'city'))

This gave me every city, with all its tags and its geometry. Next I needed to filter the data so I could use name:en if available, or name if not. I also wanted to pivot the data a bit (no hardcore dynamic pivoting here, I'm afraid; I knew exactly which columns I wanted). I grouped the data by osm_id, took `ANY_VALUE` for the geometry, and took the non-null values for name and name:en:

SELECT
  osm_id,
  MAX(IF(key = 'name', value, NULL)) AS name,
  MAX(IF(key = 'name:en', value, NULL)) AS name_en,
  ANY_VALUE(geom) AS geom
FROM
  prepared_data
CROSS JOIN
  UNNEST(all_tags) AS tags
WHERE
  key IN ('name',
    'name:en',
    'population')
GROUP BY
  osm_id
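The MAX(IF(...)) pattern is a common way to pivot repeated key/value pairs into columns. In Python terms, it does roughly this for each osm_id (a sketch for illustration, not part of the pipeline):

```python
# Rough Python equivalent of the MAX(IF(...)) pivot: for one osm_id,
# turn repeated (key, value) tags into name / name_en columns.
def pivot_tags(tags):
    names = [v for k, v in tags if k == "name"]
    names_en = [v for k, v in tags if k == "name:en"]
    # MAX keeps the (lexicographically) largest non-null value; with at
    # most one tag per key, that is simply the value itself.
    return {
        "name": max(names) if names else None,
        "name_en": max(names_en) if names_en else None,
    }
```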

I also wanted to convert the geometry into latitude and longitude floating-point values that Google Sheets could use for the lookup, and to use name_en if it existed, falling back to name otherwise. Finally, I added a small piece that would give me a Google Maps link, so I could check that the data was being materialized correctly before importing it into Sheets. That gave me the full query:

WITH
  prepared_data AS (
    SELECT
      osm_id,
      all_tags,
      ST_CENTROID(geometry) AS geom
    FROM
      `bigquery-public-data.geo_openstreetmap.planet_features`
    WHERE
      EXISTS (
        SELECT
          key,
          value
        FROM
          UNNEST(all_tags)
        WHERE
          key = 'place'
          AND value = 'city'))
SELECT
  IF(name_en IS NOT NULL, name_en, name) AS name,
  ST_X(geom) AS longitude,
  ST_Y(geom) AS latitude,
  CONCAT('https://www.google.com/maps/@',
    CAST(ST_Y(geom) AS STRING), ',',
    CAST(ST_X(geom) AS STRING), ',16z') AS maps_link
FROM (
  SELECT
    osm_id,
    MAX(IF(key = 'name', value, NULL)) AS name,
    MAX(IF(key = 'name:en', value, NULL)) AS name_en,
    ANY_VALUE(geom) AS geom
  FROM
    prepared_data
  CROSS JOIN
    UNNEST(all_tags) AS tags
  WHERE
    key IN ('name',
      'name:en',
      'population')
  GROUP BY
    osm_id)
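The outer SELECT's IF(...) is just a null-coalescing fallback, and the CONCAT builds a Google Maps URL from the centroid. The same two steps in Python, for illustration (the function names are mine, not from the app):

```python
def display_name(name, name_en):
    # IF(name_en IS NOT NULL, name_en, name): prefer the English name.
    return name_en if name_en is not None else name

def maps_link(lat, lon):
    # Mirrors the CONCAT in the query: a Maps URL centered on the point
    # at zoom level 16, latitude first.
    return "https://www.google.com/maps/@{},{},16z".format(lat, lon)
```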

The query costs around $1 and returns just over 18,100 cities that I could then use in my visualisation. A quick sanity check on coverage using BigQuery GeoViz shows pretty good coverage. I decided against adding a population limit, as the OSM 'city' definition should already give me a good lower bound on the data.

City Coverage

Now I could carry on with the D3 visualization of the Sheets data in App Engine, because I had the data to support it.
